I need to check dozens of sites every day for specific text (the text can be on any page within a site). I would like to write such a search tool in Python. Where should I start with the implementation?
Closed as too broad by Suvitruf ♦, Sergey Gornostaev, 0xdb, insolor, Pavel Durmanov on Aug 18 '18 at 18:54.
- Are the sites known in advance? - danilshik
- The sites are new every day, so the only input to the script should be the URL. The structure and content of the sites are not known in advance. - Michael
- Then you need to write a spider. - danilshik
1 answer
It is worth studying:
1. Requests, for working with web resources.
2. Beautiful Soup, for parsing pages (if you know their structure).
3. re, regular expressions for text search.
You can also use other search methods, such as the Boyer-Moore-Horspool algorithm.
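For reference, here is a minimal sketch of Boyer-Moore-Horspool substring search. The function name `horspool_find` is made up for this illustration and is not part of any library. In practice Python's built-in `in` operator and `str.find` are usually fast enough, so treat this as a way to see how the algorithm works rather than something you must write yourself.

```python
def horspool_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0

    # Shift table: for every character of the pattern except the last one,
    # the distance from its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Skip ahead based on the text character aligned with the pattern's
        # last position; characters not in the table allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

Usage is the same as `str.find`: `horspool_find(page_text, "some phrase")` gives the position of the match, or -1 if the phrase is absent.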
The easiest way is to download the text of each page you need and search it for the target text directly, without parsing.
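A rough sketch of that approach, combined with the spider suggested in the comments, might look like the following. Everything here is an assumption made for illustration: the names `find_text_on_site`, `fetch_with_requests` and `LinkCollector`, the 200-page limit, and the 10-second timeout. Links are extracted with the standard-library `html.parser` to keep the example short; Beautiful Soup could be used instead. The search runs over the raw HTML, so it can also match text inside tags and attributes.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_with_requests(url):
    """Return the page body as text, or None if the request fails."""
    # Third-party package; imported here so the crawl logic below
    # can be exercised with a different fetch function.
    import requests

    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return response.text


def find_text_on_site(start_url, pattern, fetch=fetch_with_requests, max_pages=200):
    """Crawl one site breadth-first; return URLs whose raw HTML matches pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    hits = []
    fetched = 0

    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        html = fetch(url)
        if html is None:
            continue
        if regex.search(html):
            hits.append(url)

        # Queue every not-yet-seen link that stays on the same host.
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return hits
```

Calling `find_text_on_site("https://example.com/", r"some phrase")` would return the list of pages on that host where the phrase was found. A real spider would probably also want to respect robots.txt, pause between requests, and skip non-HTML responses.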