To find phrases such as "for example" in a file regardless of the kind and amount of whitespace between the words, you can normalize the whitespace in the file's text and then check which phrases occur in it:

    def find_phrases(filename, phrases):
        with open(filename) as f:
            text = ' '.join(f.read().split())  # normalize whitespace
        return [p for p in phrases if p in text]  # return the phrases themselves

If the whole file does not fit in memory, and you want to avoid scanning the text again for each phrase, you can run a regular expression over an mmap:

    import mmap
    import re

    def find_phrases(filename, phrases):
        # match the longest phrases first, literally, ignoring whitespace differences
        pattern = b'|'.join([rb'\s+'.join(map(re.escape, p.encode().split()))
                             for p in sorted(phrases, key=len, reverse=True)])
        with open(filename, 'rb') as f, \
             mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as s:
            return re.findall(pattern, s)  # matched byte strings from the file

Example:

    print(find_phrases('input.txt', ['simple', 'for example']))
    # -> [b'simple', b'for\nexample']

mmap lets you treat a file as a byte string, and it keeps working even for files that are larger than the available memory. A regular expression lets you search for all the input phrases in a single pass (a pattern of the form a|b|c).
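To illustrate why the phrases are sorted by length before being joined, here is a minimal sketch that runs on an in-memory byte string instead of an mmap. The helper name build_pattern and the sample data are made up for this illustration.

```python
import re

def build_pattern(phrases):
    """Join phrases into one bytes alternation, longest phrase first."""
    ordered = sorted(phrases, key=len, reverse=True)
    return b'|'.join(
        rb'\s+'.join(re.escape(word) for word in p.encode().split())
        for p in ordered
    )

# Hypothetical sample data standing in for the mapped file.
data = b"for\n  example, this is for   example only; for now"

# Longest first: 'for example' wins wherever both alternatives could start.
longest_first = re.findall(build_pattern(['for', 'for example']), data)

# Same alternatives in the opposite order: the short 'for' always matches first.
shortest_first = re.findall(rb'for|for\s+example', data)

print(longest_first)
print(shortest_first)
```

The regex engine tries alternatives from left to right, so without the sort a short phrase that is a prefix of a longer one would hide the longer match.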
Depending on what exactly you need to find (fixed strings, whitespace-sensitive or not, whole words or substrings, case-sensitive or not) and on the file size and the number and length of the individual strings, there may be more efficient string algorithms, for example the Aho-Corasick algorithm or suffix arrays.
Using such algorithms can be the difference between a whole day of computation and just a few minutes.
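As a rough idea of what the Aho-Corasick approach looks like, here is a simplified pure-Python sketch. It assumes the text is already in memory and whitespace-normalized. The function names build_automaton and find_all are made up for this example, and the code is not tuned for speed; for real workloads a dedicated library would be the more likely choice.

```python
from collections import deque

def build_automaton(phrases):
    """Build trie transitions, failure links and outputs for the phrases."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> phrases that end at this state
    for phrase in phrases:
        if not phrase:
            continue  # ignore empty phrases
        state = 0
        for ch in phrase:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(phrase)
    # Breadth-first pass: a failure link only depends on shallower states.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, phrases):
    """Return (start_index, phrase) for every occurrence, in one pass."""
    goto, fail, out = build_automaton(phrases)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for phrase in out[state]:
            hits.append((i - len(phrase) + 1, phrase))
    return hits

text = ' '.join("a simple case,\nfor   example".split())
print(find_all(text, ['simple', 'for example']))
```

Unlike the regex alternation, this reports overlapping occurrences too, and the scan time does not grow with the number of phrases beyond the cost of building the automaton once.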