An effective method of searching for multiple substrings simultaneously in a large text is the Aho-Corasick algorithm. The original fgrep (grep -F) uses this algorithm. GNU grep uses the Commentz-Walter algorithm for this case (a combination of Aho-Corasick and the Boyer-Moore string-search algorithm). ripgrep (rg) is sometimes faster than GNU grep thanks to a SIMD algorithm called Teddy - see ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}.
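For reference, ripgrep can be driven with the same flags as grep -F; a minimal sketch, assuming the substrings sit one per line in file2 and the text is in file1 (the same layout the scripts below use):

```sh
# -F: fixed strings, -o: print each match on its own line, -f: read patterns from a file
rg -Fof file2 file1
```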
To count the matches found, you can pipe them through the sort | uniq -c command suggested by @BOPOH in a comment on the question (sketched below), or use a dictionary instead of sorting.
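A rough sketch of the sort | uniq -c variant (file1 and file2 as above):

```sh
# file2 holds the substrings (one per line), file1 is the text being searched
grep -Fof file2 file1 | sort | uniq -c
```

The dictionary-based alternative looks like this: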
```sh
#!/bin/sh
grep -Fof file2 file1 | perl -lne '$seen{$_}++ }{ while (my ($k,$v)=each %seen){print "$k $v"}'
```
Measurements can show which command (sort + uniq or perl) is faster in a given case.
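As a rough sketch, assuming the two variants above have been saved as count_sort.sh and count_perl.sh (hypothetical names), each can be timed like this; run every command a couple of times so the file cache is warm and a cold first run does not distort the comparison:

```sh
# Hypothetical script names for the two variants above; discard the output so
# printing to the terminal does not skew the measurement.
time ./count_sort.sh > /dev/null
time ./count_perl.sh > /dev/null
```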
For comparison, you can see how long a Python script takes without using additional packages (for example, esm) that implement Aho-Corasick-like algorithms:
```python
#!/usr/bin/env python
import re
import sys
from collections import Counter


def main():
    # load substrings from file2
    with open('file2', 'rb') as file2:
        substrings = {line.strip() for line in file2}  # unique lines
    substrings = filter(None, substrings)  # filter blank lines
    substrings = sorted(substrings, key=len, reverse=True)  # longest first
    pattern = b"|".join(map(re.escape, substrings))
    find_substrings = re.compile(pattern).findall

    # count substrings in file1
    counter = Counter()
    counter_update = counter.update  # cache the lookup (for CPython)
    with open('file1', 'rb') as file1:
        for line in file1:
            counter_update(find_substrings(line))

    # write substrings frequencies
    write = getattr(sys.stdout, 'buffer', sys.stdout).write
    for substring, count in counter.most_common():  # most common first
        write(substring)
        write(b' ')
        write(str(count).encode())
        write(b'\n')


main()
```
Result:

```
Тест корова 2
телефон 1
Трубадур 1
```
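To put this script on the same footing as the shell pipelines above, it can be timed the same way (count_substrings.py is a hypothetical name for the script):

```sh
time python count_substrings.py > /dev/null
```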
Here is another option for comparison, where the substrings are searched for in the file using a regular expression and mmap:
```python
#!/usr/bin/env python3
import re
import sys
from collections import Counter
from mmap import ACCESS_READ, mmap
from operator import methodcaller


def generate_tokens(filename, pattern):
    with open(filename) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as mm:
        yield from re.finditer(pattern, mm)


def main():
    # load substrings from file2
    with open('file2', 'rb') as file2:
        substrings = {line.strip() for line in file2}  # unique lines
    substrings = filter(None, substrings)  # filter blank lines
    substrings = sorted(substrings, key=len, reverse=True)  # longest first
    pattern = b"|".join(map(re.escape, substrings))

    # count substrings in file1
    file1_substrings = generate_tokens('file1', pattern)
    counter = Counter(map(methodcaller('group'), file1_substrings))

    # write substrings frequencies
    write = getattr(sys.stdout, 'buffer', sys.stdout).write
    for substring, count in counter.most_common():  # most common first
        write(substring)
        write(b' ')
        write(str(count).encode())
        write(b'\n')


main()
```
The code works on Windows and *nix. It is written for Python 3, but can easily be adapted for Python 2 if necessary.