How to save a web page in .mht format programmatically?

Question

I decided to prepare the links to all the pages in advance and then save each page in .mht format. For example, a .txt file contains all the links (one per line). The program should read the links from that file and save each page to the hard disk in .mht format.

Is it possible to implement this programmatically, and which technology is best suited for it?

Why do I need it?

There is an online store from which I need to download all the product images. I tried using jQuery through a Google Chrome extension, and I tried writing an application in C#. But for some reason not all of the catalog images are downloaded.
Take this page, for example: www.onlinetrade.ru/catalogue/smartfoni-c13. It shows 50 products. I look for every "img" element and download it. As a result, only the first 7 product images are downloaded. That is why I decided to download the pages in the way described above.

Maybe your parser simply cannot click the 50 option in the filter?
And if it is 7, maybe it just takes the first picture on a page and then goes to the next page.
Yes, it saves through the xdFileStorage.js library, but it saves the data as browser cache.

Accepted Answer · 2015-12-22T09:38:31

Does it really save exactly the first 7 photos, not 10? The page has a 10 / 20 / 50 filter. Perhaps your parser simply does not know how to click 50 in that filter.
And if it is 7, maybe it takes only the first picture on a page and then goes to the next page, where it again takes only the first one, and so on. In other words, you may simply have overlooked something.

Why do I think so? Because the URL parses without any trouble, and the pictures (and everything else) can be fetched without problems.

Here is an example using PHP and Simple HTML DOM Parser (to use Simple HTML DOM Parser you of course need to download it first ... resource)

    <?php
    // Report all errors except E_WARNING
    error_reporting(E_ALL & ~E_WARNING);

    include './domParser/simple_html_dom.php';

    class DomParser
    {
        public $url = '';
        public $imgHost = '';
        public $returnVal = 0;
        // File-name counter kept on the object so that it keeps growing
        // across pages instead of restarting at 1 on every page
        public $imgIndex = 1;

        public function __construct($urlParse, $imgHostUrl)
        {
            $this->url = $urlParse;
            $this->imgHost = $imgHostUrl;
        }

        public function getImages($_data)
        {
            $data = $_data ? $_data : file_get_html($this->url);
            // file_get_html() returns false on failure, so check that first
            if ($data && $data->innertext != '') {
                foreach ($data->find('div.catalog__displayedItem') as $a) {
                    foreach ($a->find('.catalog__displayedItem__columnFoto img') as $img) {
                        echo '<img src="' . $this->imgHost . $img->src . '" />';
                        $imgExt = explode('.', $img->src);
                        // This saves the image into a local folder
                        // (commented out in the fiddle)
                        if ($image = file_get_contents($this->imgHost . $img->src)) {
                            //file_put_contents('./images/' . $this->imgIndex . '.' . end($imgExt), $image);
                        }
                    }
                    $this->imgIndex++;
                }
                echo '<br /><br />';
                $this->getNextPage($data, 'getImages');
                $data->clear();
                unset($data);
            }
        }

        public function getNextPage($data, $repeatFunctionName)
        {
            // For now, cap the number of pages so as not to run many
            // iterations and put load on the server
            if ($this->returnVal >= 2) return;
            if ($data->innertext != '') {
                $this->returnVal++;
                foreach ($data->find('.catalogItemList__paginator a') as $a) {
                    // The site serves windows-1251; convert the title to UTF-8
                    $str = iconv("windows-1251", "UTF-8", $a->title);
                    // 'ледующие' matches the "Следующие" ("Next") link title
                    if (mb_strpos(mb_strtolower($str, 'UTF-8'), 'ледующие', 0, 'UTF-8') !== false) {
                        $page = explode('?', $a->href);
                        $data_inner_link = file_get_html($this->url . '?' . end($page));
                        $this->$repeatFunctionName($data_inner_link);
                        break;
                    }
                }
            }
        }
    }

    $url = 'http://www.onlinetrade.ru/catalogue/smartfoni-c13/';
    $imgHost = 'http://www.onlinetrade.ru';

    $parser = new DomParser($url, $imgHost);
    $parser->getImages(null);

    /*
    $url = 'http://www.onlinetrade.ru/catalogue/smartfoni-c13/';
    $imgHost = 'http://www.onlinetrade.ru';
    $data = file_get_html($url);
    $i = 1;

    function getImages($data)
    {
        global $imgHost;
        global $i;
        if ($data->innertext != '') {
            foreach ($data->find('div.catalog__displayedItem') as $a) {
                foreach ($a->find('.catalog__displayedItem__columnFoto img') as $img) {
                    echo '<img src="' . $imgHost . $img->src . '" />';
                    $imgExt = explode('.', $img->src);
                    // This saves the image into a local folder
                    // (commented out in the fiddle)
                    if ($image = file_get_contents($imgHost . $img->src)) {
                        file_put_contents('./images/' . $i . '.' . end($imgExt), $image);
                    }
                }
                $i++;
            }
            echo '<br /><br />';
            getNextPage($data);
            $data->clear();
            unset($data);
        }
    }

    $return = 0;

    function getNextPage($data)
    {
        global $url;
        global $return;
        // For now, cap the number of pages so as not to put load on the server
        if ($return >= 2) return;
        if ($data->innertext != '') {
            $return++;
            foreach ($data->find('.catalogItemList__paginator a') as $a) {
                $str = iconv("windows-1251", "UTF-8", $a->title);
                if (mb_strpos(mb_strtolower($str, 'UTF-8'), 'ледующие', 0, 'UTF-8') !== false) {
                    $page = explode('?', $a->href);
                    $data_inner_link = file_get_html($url . '?' . end($page));
                    getImages($data_inner_link);
                    break;
                }
            }
        }
    }

    //getImages($data);
    */

The code shows the class-based usage and, commented out underneath it, the usual variant with plain functions.

You can try it out here.

At the moment the code is deliberately limited to parsing only the first 3 pages (10 products each), so as not to overload the fiddle. The file_put_contents line is commented out because the fiddle does not allow file_put_contents, which is logical.
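The same idea (collect the images on a page, follow the "next" link, stop after a fixed number of pages) can be sketched in Python with only the standard library. This is a minimal sketch, not a port of the PHP above: the names `CatalogPageParser` and `collect_image_urls` are made up for illustration, the `fetch` callable is an assumption standing in for whatever HTTP client you use, and unlike the PHP version it collects every `img` on the page rather than only those inside the product blocks.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CatalogPageParser(HTMLParser):
    """Collects every <img src> and the href of the first 'next page' link."""

    def __init__(self, next_marker):
        super().__init__()
        self.next_marker = next_marker.lower()
        self.image_srcs = []
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.image_srcs.append(attrs["src"])
        elif tag == "a" and self.next_href is None:
            # The 'next' link is recognised by a marker in its title attribute
            title = (attrs.get("title") or "").lower()
            if self.next_marker in title and attrs.get("href"):
                self.next_href = attrs["href"]


def collect_image_urls(start_url, fetch, next_marker, max_pages=3):
    """Follow 'next' links from start_url, visiting at most max_pages pages.

    fetch is any callable that takes a URL and returns the page HTML as str.
    """
    urls, seen, page_url = [], set(), start_url
    while page_url and page_url not in seen and len(seen) < max_pages:
        seen.add(page_url)
        parser = CatalogPageParser(next_marker)
        parser.feed(fetch(page_url))
        # Resolve relative image paths against the page they came from
        urls.extend(urljoin(page_url, src) for src in parser.image_srcs)
        page_url = urljoin(page_url, parser.next_href) if parser.next_href else None
    return urls
```

If a crawl like this returns fewer images than expected, comparing the number of pages visited with the number of images found per page shows quickly whether the paginator or the image selector is at fault.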

Here is proof that the images get saved. And there are many more further down ...

Alternatively, and probably better as far as PHP is concerned, you can use cURL.

cURL is a free command-line utility that lets you interact with many different servers over many different protocols using URL syntax.
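cURL itself is not shown here. As a rough analogue of the control it gives (explicit headers and a timeout), here is a minimal Python sketch using `urllib` from the standard library, not cURL. The function names, the User-Agent string and the `.bin` fallback extension are assumptions made for illustration; the `<index>.<extension>` file naming follows the `file_put_contents` line in the PHP code above.

```python
import os
import urllib.request
from urllib.parse import urlsplit


def build_request(url, user_agent="Mozilla/5.0 (image-downloader sketch)"):
    """Create a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def target_path(image_url, index, dest_dir="images"):
    """Name the file '<index><ext>', taking the extension from the URL path."""
    ext = os.path.splitext(urlsplit(image_url).path)[1] or ".bin"
    return os.path.join(dest_dir, f"{index}{ext}")


def download_image(image_url, index, dest_dir="images", timeout=10,
                   opener=urllib.request.urlopen):
    """Fetch one image and write it into dest_dir; returns the path written.

    opener is injectable so the function can be exercised without a network.
    """
    os.makedirs(dest_dir, exist_ok=True)
    path = target_path(image_url, index, dest_dir)
    with opener(build_request(image_url), timeout=timeout) as response:
        data = response.read()
    with open(path, "wb") as out:
        out.write(data)
    return path
```

Calling `download_image(url, n)` for each image URL with a running counter `n` gives every file a distinct name.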

This code is mainly meant to show that everything works and that the images can be downloaded, so you most likely have an error somewhere in your own code.

Most likely my answer is not really an answer to the question, but perhaps this code will be useful and you will want to adapt it to your needs rather than saving .mht files. It will eat a lot of memory, though.

Thank you so much! Very helpful! I will sort this out! - Andrey Dudukin

Stack · Answer 2 · 2015-12-22T09:16:25

To create an .mht file programmatically:

C# source code for generating MHT files from a URL
CDO.Message COM object that implements the IMessage interface
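Neither of those two resources is reproduced here. An .mht (MHTML) file is a MIME multipart/related message in which the first part is the HTML page and the remaining parts are its resources, each labelled with a Content-Location header. A minimal Python sketch of that structure, using the standard `email` package, might look like the following. The names `read_links` and `build_mhtml` are hypothetical, and the sketch assumes the caller has already downloaded the HTML and the resources; it does no fetching itself.

```python
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def read_links(path):
    """Return the non-empty lines of a text file, one URL per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def build_mhtml(page_url, html, resources):
    """Pack a page and its resources into MHTML bytes.

    resources maps the absolute URL of each resource to a
    (mime_type, payload_bytes) tuple.
    """
    archive = MIMEMultipart("related", type="text/html")
    archive["Subject"] = page_url

    # Root part: the HTML document itself
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)

    # One base64-encoded part per image, stylesheet, and so on
    for url, (mime_type, payload) in resources.items():
        maintype, subtype = mime_type.split("/", 1)
        part = MIMEBase(maintype, subtype)
        part.set_payload(payload)
        encoders.encode_base64(part)
        part["Content-Location"] = url
        archive.attach(part)

    # MIME messages use CRLF line endings
    return archive.as_bytes(policy=archive.policy.clone(linesep="\r\n"))
```

A driver would then loop over `read_links("links.txt")`, download each page and its images, and write the bytes returned by `build_mhtml` to a file with the .mht extension. The file name `links.txt` is only an example.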

From JavaScript, try calling print(document.body.innerHTML);
or document.execCommand('SaveAs', true, 'http://...') (note that the SaveAs command was only supported by Internet Explorer)
