I have the following function, which searches a file for substrings:

    def findPattern(filename, patternList):
        resList = []
        with open(filename, 'r') as file:
            for line in file:
                for pattern in patternList:
                    if pattern in line:
                        resList.append(pattern)
        return set(resList)

But it works line by line. For example, in the file

    simple string for
    example

it will not find the substring "for example". How can I search for substrings that span several lines of the file?


    To find phrases such as "for example" in a file regardless of the kind and amount of whitespace between the words, you can normalize the whitespace in the file and then check which phrases occur in the resulting text:

        def find_phrases(filename, phrases):
            with open(filename) as file:
                text = ' '.join(file.read().split())  # normalize whitespace
            return filter(text.__contains__, phrases)  # return phrases themselves
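To make the normalization step concrete, here is a small sketch in Python 3. The helper names `normalize_whitespace` and `find_phrases_in_text` and the sample text are made up for this illustration, and it assumes the phrases themselves use single spaces between words.

```python
def normalize_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines), so joining with ' ' collapses each run
    # into a single space.
    return ' '.join(text.split())


def find_phrases_in_text(text, phrases):
    normalized = normalize_whitespace(text)
    # Return a list (not a lazy filter object) in the order of the input phrases.
    return [p for p in phrases if p in normalized]


# Hypothetical sample: "for" and "example" are separated by a newline.
sample = "simple string for\nexample\t with   odd spacing"
print(normalize_whitespace(sample))
# -> simple string for example with odd spacing
print(find_phrases_in_text(sample, ['simple', 'for example', 'missing']))
# -> ['simple', 'for example']
```

The list comprehension is used here only so that the result prints the same way regardless of Python version.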

    If the file does not fit in memory, and to avoid scanning the whole file again for each phrase, you can run a regular expression over an mmap:

        import mmap
        import re
        from contextlib import closing

        def find_phrases(filename, phrases):
            # match the longest phrases literally ignoring whitespace
            pattern = '|'.join([r'\s+'.join(map(re.escape, p.split()))
                                for p in sorted(phrases, key=len, reverse=True)])
            with open(filename, 'r+b', 0) as f, \
                 closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as s:
                return re.findall(pattern, s)  # return matched strings from the file

    Example:

        print find_phrases('input.txt', ['simple', 'for example'])
        # -> ['simple', 'for\nexample']

    mmap lets you treat a file as a byte string, and it keeps working even for files larger than the available memory. Regular expressions let you search for all input phrases in a single pass (a regex of the form a|b|c).
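The code in this answer is written for Python 2. As a hedged sketch of the same idea in Python 3, where a regex applied to an mmap must be a bytes pattern, something like the following should work. The name `find_phrases_mmap`, the `encoding` parameter, and the temporary sample file are assumptions made for the illustration. It also assumes a non-empty file (mmap refuses to map an empty one) and non-empty phrases.

```python
import mmap
import os
import re
import tempfile


def find_phrases_mmap(filename, phrases, encoding='utf-8'):
    # One alternative per phrase: the words are escaped literally and joined
    # by \s+ so that any run of whitespace (including newlines) matches.
    # Longest phrases go first so a longer phrase wins over a shorter one
    # starting at the same position.
    alternatives = [
        rb'\s+'.join(re.escape(word.encode(encoding)) for word in phrase.split())
        for phrase in sorted(phrases, key=len, reverse=True)
    ]
    pattern = re.compile(b'|'.join(alternatives))
    with open(filename, 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # findall copies every match out of the map as bytes, so the
        # result stays valid after the map is closed.
        return [m.decode(encoding) for m in pattern.findall(mm)]


# Hypothetical sample file with the phrase split across two lines.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b"simple string for\nexample\n")
try:
    found = find_phrases_mmap(tmp.name, ['simple', 'for example'])
finally:
    os.remove(tmp.name)
print(found)  # -> ['simple', 'for\nexample']
```

As in the original, the function returns the matched text from the file, whitespace included, rather than the input phrases.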

    Depending on what exactly you want to find (fixed strings, whitespace-sensitive or not, whole words or substrings, case-sensitive or not) and on the file size and the number and length of the individual strings, there may be more efficient string algorithms, for example the Aho-Corasick algorithm or suffix arrays.

    Using such algorithms can be the difference between a whole day of computation and just a few minutes.
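For reference, a minimal pure-Python sketch of the Aho-Corasick idea might look like this. It is an illustration under simplifying assumptions, not a tuned implementation: the function names `build_automaton` and `find_phrases_ac` are made up here, the phrases are assumed to be non-empty, and the text is whitespace-normalized in memory first. The automaton scans the text once, no matter how many phrases there are.

```python
from collections import deque


def build_automaton(phrases):
    # goto[state] maps a character to the next state; state 0 is the root.
    goto = [{}]
    fail = [0]      # failure link of each state
    output = [[]]   # phrases that end at each state
    for phrase in phrases:
        state = 0
        for ch in phrase:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(phrase)

    # Breadth-first pass: failure links of shallower states are ready
    # before they are needed by deeper ones.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit phrases that end at the failure target (proper suffixes).
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_phrases_ac(text, phrases):
    goto, fail, output = build_automaton(phrases)
    found = set()
    state = 0
    for ch in text:
        # Follow failure links until some state accepts ch (or we hit the root).
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found.update(output[state])
    return found


text = ' '.join("simple string for\nexample".split())  # normalize whitespace
result = find_phrases_ac(text, ['simple', 'for example', 'ample', 'missing'])
print(sorted(result))  # -> ['ample', 'for example', 'simple']
```

Note that "ample" is reported because it occurs inside "example"; restricting matches to whole words would need an extra boundary check.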

      We read the lines of the file, strip the newline at the end of each line, and join them into a single string separated by spaces. Then we can search this string as required:

          def findPattern(filename, patternList):
              resList = []
              with open(filename, 'r') as file:
                  text = " ".join([x.rstrip("\n") for x in file.readlines()])
              for pattern in patternList:
                  if pattern in text:
                      resList.append(pattern)
              return set(resList)
      • If the whole file is being read into memory anyway, you can normalize the whitespace with: text = " ".join(file.read().split()). By the way, you can simply write for line in file, without readlines(), to read one line at a time. - jfs