Hello, everyone.

I have a large list of sites. How can I save all of them without visiting each one manually?

PS: I need to save them so that they can be viewed offline.

  • Do I understand correctly that this is about finding an existing program that can do this? Or about writing one? - Qwertiy
  • At one time I used Teleport Pro. - Qwertiy
  • One way or another, you will have to "visit" them: to save them locally they have to be downloaded, which in most cases amounts to the same thing as viewing them. - alexoander

2 answers

There are programs for this, called web crawlers. The best-known free ones are HTTrack and Heritrix; Offline Explorer is simpler in terms of usability, but paid. They all share one problem: if part of the content is loaded via links generated by JavaScript, it will not be downloaded.
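For a whole list of sites, a minimal HTTrack invocation could be wrapped in a loop like the sketch below. The sites.txt file name and the ./mirrors output directory are assumptions made here for illustration; -O is HTTrack's output-path option.

    #!/bin/sh
    # Mirror every site listed in sites.txt (one URL per line) into its own folder under ./mirrors
    while read -r url; do
        dir=$(echo "$url" | sed 's|https\?://||; s|/.*||')   # derive a folder name from the host part
        httrack "$url" -O "./mirrors/$dir" -v
    done < sites.txt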

    You can write a small BASH script. The algorithm is as follows:
    1. Download sitemap.xml.
    2. Extract all the links from it.
    3. Crawl the site in a multithreaded manner and save everything found to disk.

    #!/bin/sh
    # save.sh: collect URLs (from a sitemap or from a single page) and download them in parallel
    URL_PARSE=$1
    MODE=$2
    echo "parse: $URL_PARSE"
    if test "$MODE" = "sitemap"
    then
        echo "parse: sitemap"
        # Collect and save the URLs listed in the sitemap
        curl "http://$URL_PARSE" | grep "<[/]*loc>" | sed 's/[<][/]*loc[>]//g;s/^[ \t]*//' > urls
    fi
    if test "$MODE" = "page"
    then
        # Collect and save the URLs found on an ordinary page
        lynx -listonly -dump "$URL_PARSE" | grep -oP 'https?://\S+' > urls
    fi
    echo "parse: $URL_PARSE"
    # Download every URL in 20 parallel wget processes, including page requisites (-p)
    xargs -P 20 -n 1 wget -nv -p < urls

    Use it like this: sh save.sh <target-site>/sitemap.xml sitemap

    One problem: weak sites can go down under the load :( But if it is your own projects you are downloading, everything is fine. A gentler variant is sketched below.
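    To reduce the load on the target server, the parallelism can be lowered and a delay added between retrievals. This is only a sketch reusing the urls file produced by the script above; the exact numbers are illustrative, not tuned.

        # Fewer parallel workers (-P 4) and a randomized pause between retrievals within each wget run
        xargs -P 4 -n 1 wget -nv -p --wait=1 --random-wait < urls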

    • Why, when there is wget with recursive crawling? - free_ze
    • I agree that when execution speed is not a priority, you can sequentially collect everything that is found with wget -nv -r <target-site> (a fuller invocation is sketched after these comments). In my practice, though, the problem was precisely the execution time: if the resource has millions of pages you can wait forever, whereas here each URL becomes a separate download task and is fetched independently, which gives a significant gain in execution time. - Redr01d
    • @Redr01d Ah, yes, I did not notice that the script is multi-threaded :) - free_ze
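    For reference, the sequential alternative discussed in these comments looks roughly like this; the -k (convert links for local browsing) and -E (adjust file extensions) flags are additions here to make the saved copy browsable offline, not part of the original comment.

        # Single wget process crawling the whole site recursively; slower, but no script needed
        wget -nv -r -p -k -E <target-site>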