Hey.

Got a question:
There is a file 1 ~ 2GB
and there is a file 2 ~ 1MB

The question is: what is the fastest way to count the number of occurrences of the lines from file2 in file1?

For example, file1:

Тест корова большая Флагман большая корова Трубадур золотистый Флагманский телефон N90 Логическое оформление Тест корова Бла бла стар 

File2:

 Трубадур Тест корова телефон 

It should be:

 Трубадур 1
 Тест корова 2
 телефон 1

File 1 is much larger than file2.

  • @Titano, are you interested in exact matches (including leading and trailing spaces, etc.), or should the lines of the files be split into words? - avp

3 answers

An effective method of searching for multiple substrings simultaneously in a large text is the Aho-Corasick algorithm. The original fgrep (grep -F) uses this algorithm. GNU grep in this case uses the Commentz-Walter algorithm (a combination of Aho-Corasick and the Boyer-Moore string search algorithm). ripgrep (rg) is sometimes faster than GNU grep thanks to a SIMD algorithm called Teddy - see ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}.
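For illustration only, here is a minimal sketch of the Aho-Corasick idea in Python: a trie of the patterns with failure links, so the text is scanned once regardless of how many patterns there are. This is not what grep or fgrep actually executes internally, the function names are made up for this sketch, and unlike grep -o it counts every occurrence, including overlapping ones:

 from collections import Counter, deque

 def build_automaton(patterns):
     # one entry per trie state: transitions, failure link, patterns ending here
     nodes = [{'next': {}, 'fail': 0, 'out': []}]
     for pat in patterns:
         state = 0
         for ch in pat:
             if ch not in nodes[state]['next']:
                 nodes.append({'next': {}, 'fail': 0, 'out': []})
                 nodes[state]['next'][ch] = len(nodes) - 1
             state = nodes[state]['next'][ch]
         nodes[state]['out'].append(pat)
     # breadth-first pass to fill in failure links
     queue = deque(nodes[0]['next'].values())
     while queue:
         state = queue.popleft()
         for ch, child in nodes[state]['next'].items():
             queue.append(child)
             fail = nodes[state]['fail']
             while fail and ch not in nodes[fail]['next']:
                 fail = nodes[fail]['fail']
             nodes[child]['fail'] = nodes[fail]['next'].get(ch, 0)
             nodes[child]['out'] += nodes[nodes[child]['fail']]['out']
     return nodes

 def count_occurrences(text, patterns):
     nodes = build_automaton(patterns)
     counts, state = Counter(), 0
     for ch in text:
         # follow failure links until a transition for ch exists (or we are at the root)
         while state and ch not in nodes[state]['next']:
             state = nodes[state]['fail']
         state = nodes[state]['next'].get(ch, 0)
         for pat in nodes[state]['out']:
             counts[pat] += 1
     return counts

 text = 'Тест корова большая Трубадур Тест корова телефон'
 print(count_occurrences(text, ['Трубадур', 'Тест корова', 'телефон']))

On the example above this prints a Counter with Тест корова: 2, Трубадур: 1, телефон: 1; the point is that the cost per character does not grow with the number of patterns.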

To count the found lines, you can use the sort | uniq -c command suggested by @BOPOH in a comment to the question. You can also use a dictionary (hash) instead of sorting:

 #!/bin/sh
 grep -Fof file2 file1 | perl -lne '$seen{$_}++ }{ while (my ($k,$v)=each %seen){print "$k $v"}'
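For comparison, the sort | uniq -c variant mentioned above would look roughly like this (note that uniq -c prints the count before the substring, so the output format differs slightly):

 grep -Fof file2 file1 | sort | uniq -c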

Only measurement can show which command (sort+uniq or perl) is faster in this particular case.

For comparison, you can see how long a Python script takes without using additional packages (for example, esm) that implement Aho-Corasick-like algorithms:

 #!/usr/bin/env python
 import re
 import sys
 from collections import Counter

 def main():
     # load substrings from file2
     with open('file2', 'rb') as file2:
         substrings = {line.strip() for line in file2}  # unique lines
     substrings = filter(None, substrings)  # filter blank lines
     substrings = sorted(substrings, key=len, reverse=True)  # longest first
     pattern = b"|".join(map(re.escape, substrings))
     find_substrings = re.compile(pattern).findall

     # count substrings in file1
     counter = Counter()
     counter_update = counter.update  # cache the lookup (for CPython)
     with open('file1', 'rb') as file1:
         for line in file1:
             counter_update(find_substrings(line))

     # write substrings frequencies
     write = getattr(sys.stdout, 'buffer', sys.stdout).write
     for substring, count in counter.most_common():  # most common first
         write(substring)
         write(b' ')
         write(str(count).encode())
         write(b'\n')

 main()

Result

 Тест корова 2
 телефон 1
 Трубадур 1

Here is another option for comparison, where the file is searched using a regular expression and mmap:

 #!/usr/bin/env python3
 import re
 import sys
 from collections import Counter
 from mmap import ACCESS_READ, mmap
 from operator import methodcaller

 def generate_tokens(filename, pattern):
     with open(filename) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
         yield from re.finditer(pattern, mm)

 def main():
     # load substrings from file2
     with open('file2', 'rb') as file2:
         substrings = {line.strip() for line in file2}  # unique lines
     substrings = filter(None, substrings)  # filter blank lines
     substrings = sorted(substrings, key=len, reverse=True)  # longest first
     pattern = b"|".join(map(re.escape, substrings))

     # count substrings in file1
     file1_substrings = generate_tokens('file1', pattern)
     counter = Counter(map(methodcaller('group'), file1_substrings))

     # write substrings frequencies
     write = getattr(sys.stdout, 'buffer', sys.stdout).write
     for substring, count in counter.most_common():  # most common first
         write(substring)
         write(b' ')
         write(str(count).encode())
         write(b'\n')

 main()

Works on Windows and *nix. The code is for Python 3, but it can easily be adapted for Python 2 if necessary.

  • Wow, thanks! The brute-force option would take ~20 hours by a rough estimate. The option without extra packages takes about half that. The grep -F option on a ~5 GB file with ~4 MB of keywords took: real 2m54.014s user 2m53.995s sys 0m2.652s - loga
  • Yes, grep is good (~30 MB/sec is impressive). I wonder how much memory the Aho-Corasick algorithm consumed there? @Titano, how correct is matching, say, the pattern телефон against text containing микротелефонов? - avp
  • I can't say exactly about memory, but I didn't notice any pressure. The partial-match comparison is not an issue either, because there is a strict correspondence, i.e. a combination like телефон inside микротелефон cannot occur (that belongs to a different stage of collecting the data). And I forgot to mention: this is the speed on a server with an SSD. If there is interest, I will test on an HDD. - loga
  • @Titano: I added a solution using mmap. It can be faster if lines from file2 are found very rarely in file1. - jfs

The head-on (brute-force) solution (I doubt it will be fast):

 grep -o -f file2 file1 | sort | uniq -c 

5 lines in file2 and 2.7G in file1 processed in about 4 minutes.

If you don't need a count for each individual line (i.e., the total count is enough), it can be even simpler (and slightly faster):

 grep -co -f file2 file1 
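Note that -c counts the number of matching lines of file1 rather than the number of individual matches, so lines containing several matches are counted once. If the total number of occurrences is what is needed, counting the lines produced by grep -o would be closer (a small sketch):

 grep -o -f file2 file1 | wc -l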
  • file1 contains the strings, file2 contains the substrings, and you need to count how many times each substring occurs in the lines of file1. - loga

Well, as an option, you can try a head-on approach like this:

 somelist1 = []
 somelist2 = []
 f1 = open('file1.txt')  # 2 MB file
 f2 = open('file2.txt')  # 2 GB file
 for line in f1:
     a1 = line.replace('\n', '')
     somevar = a1.split()
     for i in somevar:
         somelist1.append(i)
 for line in f2:
     a2 = line.replace('\n', '')
     somevar = a2.split()
     for i in somevar:
         somelist2.append(i)
 for i in somelist1:
     index = 0
     for j in somelist2:
         if i == j:
             index += 1
     print(i + " - " + str(index))

Output in the form of:

 Трубадур - 481779
 Тест - 963549
 корова - 1445337
 телефон - 481779

I tried it on a 150 MB file; it ran in roughly 5-7 seconds, I didn't measure the time precisely. Maybe someone will offer a more optimized solution.

UPDATE

I tried a bigger file; a 2 GB file most likely won't load fully into memory, so this option is only applicable to small files.
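For comparison only, here is a minimal sketch of the same word-counting idea that avoids both the nested loop and reading the large file into memory (it follows the question's naming, so file1.txt is assumed to be the large text and file2.txt the small word list; the file names are placeholders):

 from collections import Counter

 # words to look for, taken from the small file (assumed to fit in memory)
 with open('file2.txt', encoding='utf-8') as f2:
     wanted = {word for line in f2 for word in line.split()}

 counts = Counter()
 # stream the large file line by line instead of loading it whole
 with open('file1.txt', encoding='utf-8') as f1:
     for line in f1:
         counts.update(word for word in line.split() if word in wanted)

 for word in wanted:
     print(word, '-', counts[word])

With a set lookup per word, the work is proportional to the size of the large file rather than to the product of the two word lists.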

  • IMHO this is not exactly what was asked in the question. You count how many times each word from f1 occurs in f2 (by the way, what is this mysterious str(index)?), while the question asks to look for occurrences of the strings of f1 in f2. - As for whether a 2 GB file will fit into memory in Python, I'll say nothing at all. - avp
  • Maybe. If you need to look for exact occurrences of a string, that changes just one line. "(by the way, what is this mysterious str(index)?)" - and how else would you concatenate a string with a number (without placeholders or format)? You're always welcome to offer your own solution; unfortunately, I don't see one yet :( - Piroru
  • It converts it to a string. But I think that doesn't matter. Perl, Python, PHP handle even 4 GB fine. I think it needs to be parallelized, since the double loop can't be avoided (?) - loga