Configure PHP DiDOM parser

Question

I'm learning to write a content parser and have settled on the DiDOM library.

I can already parse the information I need from a single page:

    require_once('vendor/autoload.php');

    use DiDom\Document;

    $document = new Document('https://site.ru/catalog/tovar/', true);

    // Find the heading
    $main_heading = $document->find('.product-title h1')[0];
    echo $main_heading->html();

    // Find the price
    $price = $document->find('.item_current_price')[0];
    echo $price->text();

    // Find the photo
    $foto = $document->find('.bx_bigimages_imgcontainer img')[0];

With one page, everything is clear. But I can't figure out the logic of crawling a large number of pages and collecting content from them. How is this done in principle?

Should the parser discover child pages by following links in the product catalog (for example), take the links from the site's XML sitemaps, or load a list of links in some other way and then visit each one to extract the required information?

Please suggest an approach.

Dmitry Movsesyan · Answer 1 · 2019-01-07T12:07:29

I once needed to parse IP addresses for a game server monitor. The links there looked like this:

    site.ru/page?page=1 - 50 servers here
    site.ru/page?page=2 - 50 servers here
    site.ru/page?page=3 - 50 servers here

I wrote a simple for loop:

    for ($i = 1; $i <= 10; $i++) {
        $url = "site.ru/page?page=$i";
        // parsing code goes here
    }

Where 10 is the number of pages.
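
If the number of pages is not known in advance, one option is to keep requesting pages until one comes back empty. The sketch below assumes that an empty page marks the end of the listing. `crawlPages` and the `$parsePage` callback are hypothetical names, and the callback is stubbed with fixed data so the example runs without network access.

```php
<?php
/**
 * Walks numbered pages until one comes back empty or $maxPages is reached.
 * $parsePage is a hypothetical callback: it receives a page number and
 * returns the array of items found on that page.
 */
function crawlPages(callable $parsePage, int $maxPages = 100): array
{
    $all = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $items = $parsePage($page);
        if (count($items) === 0) {
            break; // an empty page is treated as the end of the listing
        }
        $all = array_merge($all, $items);
    }
    return $all;
}

// Stand-in for real parsing so the sketch runs offline.
// With DiDOM, the callback would load "http://site.ru/page?page=$page"
// into a Document and return the data taken from the matched elements.
$fakePages = [1 => ['server-a', 'server-b'], 2 => ['server-c']];

$servers = crawlPages(function (int $page) use ($fakePages): array {
    return $fakePages[$page] ?? [];
});

print_r($servers);
```

The `$maxPages` limit is a safety net in case the site never returns an empty page.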

@Amsterdam Frankly, no. I only dealt with parsing once and used the option I have already described.
Try watching video tutorials on parsing websites with different libraries.
I'm searching and reading, but so far I haven't found any alternatives.
Thanks again for the loop idea. Based on your example, I managed to write my own!
I also wanted to ask: how do I configure a PHP parser to work through a proxy?
@Amsterdam Unfortunately, I don't know how to work with proxies. I'm still learning myself, sorry)
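
On the proxy question, one possible approach is to download the HTML yourself through a proxy and then pass the resulting string to DiDOM instead of a URL. The sketch below uses PHP's built-in HTTP stream context options. The function names and the proxy address `tcp://127.0.0.1:3128` are placeholders, and the DiDOM part is left as a comment because it needs the library and network access.

```php
<?php
/**
 * Builds stream context options that route HTTP requests through a proxy.
 * The proxy address is expected in the form "tcp://host:port".
 */
function proxyContextOptions(string $proxy, int $timeout = 15): array
{
    return [
        'http' => [
            'proxy'           => $proxy,
            'request_fulluri' => true, // send the full URL in the request line
            'timeout'         => $timeout,
        ],
    ];
}

/** Downloads a page through the proxy; returns its HTML, or null on failure. */
function fetchViaProxy(string $url, string $proxy): ?string
{
    $context = stream_context_create(proxyContextOptions($proxy));
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}

// Placeholder proxy address; replace it with a real one.
$options = proxyContextOptions('tcp://127.0.0.1:3128');
print_r($options);

// Usage with DiDOM (not executed here):
// $html = fetchViaProxy('http://site.ru/catalog/', 'tcp://127.0.0.1:3128');
// if ($html !== null) {
//     // Second argument omitted: $html is a string, not a file or URL
//     $document = new DiDom\Document($html);
// }
```

A proxy that requires authentication or a site that needs special headers would need extra options that are not shown here.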

Amsterdam · Accepted Answer · 2019-01-17T11:57:35

This is what I ended up with (minimal basic code that can and should be improved):

    <?php
    // If this file is external, include your system's API here
    require_once('vendor/autoload.php');

    use DiDom\Document;

    // Load the product catalog
    $document = new Document('http://site.ru/catalog/', true);

    // Find the links that lead to the products
    $links = $document->find('a.startshop-name');

    // Loop over the links
    foreach ($links as $key => $value) {
        // Get each product URL (the part that comes after site.ru/catalog)
        $dodo = $value->getAttribute('href');

        // Build the full URL of the product page to visit
        $massa = "http://site.ru/$dodo";
        $document = new Document($massa, true);

        // Find the H1 on the page
        $main_heading = $document->find('h1')[0];

        // Find the photo
        $foto = $document->find('#slider_images a::attr(href)')[0];

        // Here, create a new resource through your system's API, filling in the data obtained
    }
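
One thing that could be improved in the code above is how product URLs are built: gluing `http://site.ru/` to the href produces a double slash when the href starts with `/`, and breaks when the href is already absolute. Below is a minimal sketch of a helper for the common cases. `absoluteUrl` is a hypothetical name, and the sample hrefs are made up for illustration.

```php
<?php
/**
 * Turns an href taken from a catalog page into an absolute URL.
 * Covers only the common cases: absolute URLs, protocol-relative URLs,
 * and paths, which are resolved against the site root.
 */
function absoluteUrl(string $href, string $siteRoot): string
{
    if (preg_match('#^https?://#i', $href)) {
        return $href; // already absolute
    }
    if (strpos($href, '//') === 0) {
        // Protocol-relative: reuse the scheme of the site root
        $scheme = parse_url($siteRoot, PHP_URL_SCHEME) ?: 'http';
        return $scheme . ':' . $href;
    }
    return rtrim($siteRoot, '/') . '/' . ltrim($href, '/');
}

// Made-up hrefs, as they might come from getAttribute('href')
$hrefs = [
    '/catalog/tovar/',
    'catalog/tovar/',
    'http://site.ru/catalog/tovar-2/',
];

// Normalize every href, then drop duplicates so no page is visited twice
$urls = array_values(array_unique(array_map(function (string $href): string {
    return absoluteUrl($href, 'http://site.ru');
}, $hrefs)));

print_r($urls);
```

Removing duplicates before the loop also saves requests when the catalog links to the same product more than once.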
