I decided to prepare all the page links in advance and then save the pages in .mht format. For example, a .txt file contains all the links (one per line). The program should read the links from that file and save each page to the hard disk in .mht format.

Is it possible to do this programmatically, and which technology is best suited for it?

Why do I need it?

There is an online store from which I need to download all the product images. I tried using jQuery through a Google Chrome extension, and I tried writing an application in C#. But for some reason not all of the catalogue images are loaded.

Take this page, for example: www.onlinetrade.ru/catalogue/smartfoni-c13. It shows 50 products. I look for all the img elements and download them, but only the first 7 product images end up being downloaded. That is why I decided to download the pages in the manner described above.

  • Is it really the first 7, and not 10? The page has a 10, 20, 50 filter. Maybe your parser simply cannot click 50 in the filter? And if it is 7, maybe it takes the first picture and moves on to the next page, takes the first one there too, and so on. Maybe something like that was overlooked? Does your jQuery parser really save to the hard disk? - Alexey Shimansky
  • It does not go to other pages. Yes, it saves through the xdFileStorage.js library, but it saves the files into the browser cache. - Andrey Dudukin

2 answers

Is it really the first 7 photos that get saved, and not 10? The page has a 10, 20, 50 filter. Perhaps your parser simply does not know how to click 50 in the filter.
And if it is 7, maybe it takes the first picture and moves on to the next page, takes the first one there too, and so on. In other words, maybe you just overlooked something.

Why do I think so? Because the URL parses just fine, and the pictures (and not only them) can be fetched without problems.

Here is an example using PHP and Simple HTML DOM Parser (to use Simple HTML DOM Parser you of course need to download it first: resource)

 // Π”ΠΎΠ±Π°Π²Π»ΡΡ‚ΡŒ сообщСния ΠΎΠ±ΠΎ всСх ΠΎΡˆΠΈΠ±ΠΊΠ°Ρ…, ΠΊΡ€ΠΎΠΌΠ΅ E_WARNING error_reporting(E_ALL & ~E_WARNING); include './domParser/simple_html_dom.php'; class DomParser { public $url = ''; public $imgHost = ''; public $returnVal = 0; public function __construct($urlParse, $imgHostUrl) { $this->url = $urlParse; $this->imgHost = $imgHostUrl; } public function getImages($_data) { $i = 1; $data = $_data ? $_data : file_get_html($this->url); if ($data->innertext != '') { foreach ($data->find('div.catalog__displayedItem') as $a) { foreach ($a->find('.catalog__displayedItem__columnFoto img') as $img) { echo '<img src="' . $this->imgHost . $img->src .'" />'; $imgExt = explode('.', $img->src); // Π­Ρ‚ΠΎ для добавлСния ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠΈ сСбС Π² ΠΏΠ°ΠΏΠΊΡƒ // Π—Π°ΠΊΠΎΠΌΠ΅Π½Ρ‚ΠΈΡ€ΠΎΠ²Π°Π» Π² Ρ„ΠΈΠ΄Π΄Π»Π΅ if ($image = file_get_contents($this->imgHost . $img->src)) { //file_put_contents('./images/' . $i . '.' . end($imgExt), $image); } } $i++; } echo '<br /><br />'; $this->getNextPage($data, 'getImages'); $data->clear(); unset($data); } } public function getNextPage($data, $repeatFunctionName) { // сдСлал ΠΏΠΎΠΊΠ° Ρ‡Ρ‚ΠΎΠ±Ρ‹ ΠΌΠ½ΠΎΠ³ΠΎ Ρ†ΠΈΠΊΠ»ΠΎΠ² Π½Π΅ Π΄Π΅Π»Π°Π» Π½Π΅ Π½Π°Π³Ρ€ΡƒΠΆΠ°Π» if ($this->returnVal >= 2) return; if ($data->innertext != '') { $this->returnVal++; foreach ($data->find('.catalogItemList__paginator a') as $a) { $str = iconv("windows-1251", "UTF-8", $a->title); if (mb_strpos(strtolower($str), 'Π»Π΅Π΄ΡƒΡŽΡ‰ΠΈΠ΅', 0, 'UTF-8') !== false) { $page = explode('?', $a->href); $data_inner_link = file_get_html($this->url . '?' . 
end($page)); $this->$repeatFunctionName($data_inner_link); break; } } } } } $url = 'http://www.onlinetrade.ru/catalogue/smartfoni-c13/'; $imgHost = 'http://www.onlinetrade.ru'; $parser = new DomParser($url, $imgHost); $parser->getImages(null); /* $url = 'http://www.onlinetrade.ru/catalogue/smartfoni-c13/'; $imgHost = 'http://www.onlinetrade.ru'; $data = file_get_html($url); $i = 1; function getImages($data) { global $imgHost; global $i; if ($data->innertext!='') { foreach($data->find('div.catalog__displayedItem') as $a) { foreach ($a->find('.catalog__displayedItem__columnFoto img') as $img) { echo '<img src="' . $imgHost . $img->src .'" />'; $imgExt = explode('.', $img->src); // Π­Ρ‚ΠΎ для добавлСния ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠΈ сСбС Π² ΠΏΠ°ΠΏΠΊΡƒ // Π—Π°ΠΊΠΎΠΌΠ΅Π½Ρ‚ΠΈΡ€ΠΎΠ²Π°Π» Π² Ρ„ΠΈΠ΄Π΄Π»Π΅ if ($image = file_get_contents($imgHost . $img->src)) { file_put_contents('./images/' . $i . '.' . end($imgExt), $image); } } $i++; } echo '<br /><br />'; getNextPage($data); $data->clear(); unset($data); } } $return = 0; function getNextPage($data) { global $url; global $return; // сдСлал ΠΏΠΎΠΊΠ° Ρ‡Ρ‚ΠΎΠ±Ρ‹ ΠΌΠ½ΠΎΠ³ΠΎ Ρ†ΠΈΠΊΠ»ΠΎΠ² Π½Π΅ Π΄Π΅Π»Π°Π» Π½Π΅ Π½Π°Π³Ρ€ΡƒΠΆΠ°Π» if ($return >= 2) return; if($data->innertext != ''){ $return++; foreach($data->find('.catalogItemList__paginator a') as $a){ $str = iconv("windows-1251", "UTF-8", $a->title); if (mb_strpos(strtolower($str), 'Π»Π΅Π΄ΡƒΡŽΡ‰ΠΈΠ΅', 0, 'UTF-8') !== false) { $page = explode('?', $a->href); $data_inner_link = file_get_html($url . '?' . end($page)); getImages($data_inner_link); break; } } } } //getImages($data); */ 

The example uses the class-based version; the plain-function version is included below it, commented out.

You can try it here.

For now the parser is deliberately limited to the first 3 pages (10 products per page) so as not to overload the fiddle, and the file_put_contents line is commented out because the fiddle does not allow file_put_contents, which is logical.

Here is proof that saving works: [image] And there are many more further down ...

Alternatively, and probably better (as far as PHP goes), you can use cURL.

cURL is a free command-line utility that allows you to interact with many different servers across many different protocols with the URL syntax.
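As a rough illustration of the cURL suggestion, here is a minimal sketch that uses PHP's cURL extension to fetch one image and write it to disk. The helper names localImageName and downloadWithCurl are made up for this example, and the timeout values are arbitrary assumptions; treat it as a starting point rather than a finished downloader.

```php
<?php
// Hypothetical helper: derive a local file name such as "3.jpg" from an image URL.
function localImageName(string $imageUrl, int $index): string
{
    // Use only the path part so that a query string does not end up in the extension.
    $path = (string) parse_url($imageUrl, PHP_URL_PATH);
    $ext  = pathinfo($path, PATHINFO_EXTENSION);

    // Fall back to "bin" when the URL has no recognisable extension.
    return $index . '.' . ($ext !== '' ? strtolower($ext) : 'bin');
}

// Hypothetical helper: download one URL with cURL and write the body to $destination.
function downloadWithCurl(string $url, string $destination): bool
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_FAILONERROR    => true, // treat HTTP status codes >= 400 as failures
        CURLOPT_CONNECTTIMEOUT => 10,   // seconds, arbitrary value
        CURLOPT_TIMEOUT        => 30,   // seconds, arbitrary value
    ]);
    $body = curl_exec($ch);

    if ($body === false) {
        return false;
    }

    return file_put_contents($destination, $body) !== false;
}
```

In the parser above, downloadWithCurl($this->imgHost . $img->src, './images/' . localImageName($img->src, $i)) could stand in for the file_get_contents and file_put_contents pair, assuming the ./images/ folder already exists.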


This code is mainly meant to show that everything works and that the images can be downloaded, so most likely you have an error somewhere in your own code.
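One way to narrow down such an error is to count the img elements in the raw HTML and compare that number with what your own code finds. The sketch below does this with PHP's built-in DOMDocument instead of Simple HTML DOM Parser. The function name extractImageSources and the sample markup are made up for illustration. The data-src attribute in the sample stands for one possible cause (images whose URL is not in src), which is only an assumption and has not been checked against this site.

```php
<?php
// Hypothetical helper: return the src attribute of every <img> in an HTML string.
function extractImageSources(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $sources[] = $src; // images without a src attribute are skipped
        }
    }

    return $sources;
}

// Made-up sample markup; a real check would pass the fetched catalogue page instead.
$sample = '<div><img src="/a.jpg"><img data-src="/b.jpg"><img src="/c.png"></div>';
print_r(extractImageSources($sample));
```

If the count from the raw HTML is lower than the number of products shown in the browser, the missing images are probably being added by scripts after the page loads, and a plain HTML parser will not see them.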

Most likely my answer is not really an answer to the question, but perhaps this code will be of some use and you will want to adapt it to your needs rather than saving .mht files. It will use a lot of memory, though.
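To connect this with the list of links from the question, here is a minimal sketch of reading the .txt file, one link per line. The function name readLinks and the file name links.txt are assumptions made for this example, and each line is assumed to be a full URL including the scheme (http:// or https://). Lines that do not validate as URLs are skipped.

```php
<?php
// Hypothetical helper: read one URL per line from a text file,
// skipping blank lines, invalid lines and exact duplicates.
function readLinks(string $listFile): array
{
    $lines = file($listFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return []; // file missing or unreadable
    }

    $links = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line !== '' && filter_var($line, FILTER_VALIDATE_URL) !== false) {
            $links[$line] = true; // array keys de-duplicate the list
        }
    }

    return array_keys($links);
}

// Possible usage with the DomParser class from this answer
// (links.txt is a hypothetical file name):
//
// foreach (readLinks('links.txt') as $url) {
//     $parser = new DomParser($url, 'http://www.onlinetrade.ru');
//     $parser->getImages(null);
// }
```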

  • Thank you so much! Very helpful! I will sort this out! - Andrey Dudukin

To create mht programmatically:

From JavaScript, try calling print(document.body.innerHTML);
or document.execCommand('SaveAs','true','http://...')

  • Thanks, I also study this documentation. - Andrey Dudukin