It is necessary to check dozens of sites daily for the presence of specific texts within them (the text can be located on an arbitrary web page within the site). I would like to make such a search engine in Python. Help, please, where to start implementation?

Closed due to the fact that the issue is too general for participants Suvitruf , Sergey Gornostaev , 0xdb , insolor , Pavel Durmanov Aug 18 '18 at 18:54 .

Please correct the question so that it describes the specific problem with sufficient detail to determine the appropriate answer. Do not ask a few questions at once. See “How to ask a good question?” For clarification. If the question can be reformulated according to the rules set out in the certificate , edit it .

  • Sites known? - danilshik
  • Sites are new every day, the input parameter to start the script should be exactly the url. Site structures and everything related to their content are undefined. - Michael
  • Then you need to write a spider - danilshik

1 answer 1

It is worth studying: 1. Requests- for working with web-resources. 2. Beautiful Soup - for parsing sites (if you know the structure) 3. Re - regular expressions for text search. You can use other search methods, such as algorithms a la Boera-Moore-Horpsula

The easiest way is to get the text of the pages you need and immediately, without parsing, search for the text you want.