I am writing a script that receives a web page URL as a parameter. The script should download the page, find all the email addresses on it (I think searching for @ is enough) and save them to a file. Then it should find all the links (as I understand it, this can be done by searching for the word href), download all of those pages, and do the same on them. Yes, the script is likely to run forever, but that is fine. I do not know how to implement this, and the situation is a bit awkward for me, since in all my previous questions I asked for help with my own code. I would be happy with either a ready-made solution or any advice/tip.

  • Honestly, this is very hard to implement properly in the shell. You need to parse the HTML/XML and search inside the nodes (and you need to distinguish between tags and text). Among other things, the text on the page may be in escaped/entity-encoded form, so you also have to read the encoding from the HTML/XML header and decode accordingly. The answer below is not a real solution, just a sketch. - 0andriy
  • @AS3Master has already solved this kind of problem in Python with the BeautifulSoup library: crummy.com/software/BeautifulSoup/bs4/doc - Hellseher

1 answer

approximate "backbone" of the program:

    main() {                          # the first and only parameter is the url
        file=get_a_unique_name_for_a_temporary_file
        wget -qO "$file" "$1"
        grep "regular expression to extract emails" "$file" >> file-with-emails
        grep "regular expression to extract links" "$file" | while read url
        do
            main "$url"
        done
        rm "$file"
    }

About implementing the regular expressions and getting a unique name for a temporary file, ask separate questions using the "ask a question" button in the upper right corner of the page.
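For illustration only, here is one possible way to fill in those placeholders. This is a minimal sketch, assuming GNU wget, a grep with -E/-o support, and mktemp are available; the particular regular expressions, the emails.txt output file name, and the error handling are my assumptions, not part of the answer above, and the link extraction only works for absolute URLs:

    #!/bin/bash
    # Minimal sketch only; the regexes are deliberately crude ("@" for emails,
    # href="..." for links) and will produce false positives and miss relative links.

    main() {                          # the first and only parameter is the url
        local file url
        file=$(mktemp) || return      # unique name for a temporary file

        wget -qO "$file" "$1" || { rm -f "$file"; return; }

        # anything that looks like user@host goes into emails.txt (assumed name)
        grep -Eo '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' "$file" >> emails.txt

        # anything inside href="..." is treated as the next url to crawl
        grep -Eo 'href="[^"]+"' "$file" | sed 's/^href="//; s/"$//' |
        while read -r url; do
            main "$url"               # recurse into every link found
        done

        rm -f "$file"
    }

    main "$1"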

  • I didn't quite understand: is "get_a_unique_name_for_a_temporary_file" just generating a random name? For example, the filename plus i at the end, where i is incremented by a counter? - AS3Master
  • Yes, you can do it that way; the only tricky part is handling the counter. - aleksandr barakin
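If you prefer the counter approach from the comment above instead of mktemp, one possible sketch (the function name and directory are assumptions) is:

    i=0
    next_tmp_name() {
        i=$((i + 1))
        echo "/tmp/crawler.$$.$i"     # process id plus an incrementing counter
    }

mktemp is usually simpler, since it also guards against name collisions with other processes.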