I am learning how to write a content parser and have settled on the DiDOM library.

I can already parse the information I need from a single page:

    require_once('vendor/autoload.php');

    use DiDom\Document;

    $document = new Document('https://site.ru/catalog/tovar/', true);

    // Find the heading
    $main_heading = $document->find('.product-title h1')[0];
    echo $main_heading->html();

    // Find the price
    $price = $document->find('.item_current_price')[0];
    echo $price->text();

    // Find the photo
    $foto = $document->find('.bx_bigimages_imgcontainer img')[0];

With one page, everything is clear. What I cannot work out is the logic of crawling a large number of pages and extracting content from each of them. How is this done in principle?

Should the parser find the child pages by following links in the product catalog (for example), take the links from the site's XML sitemaps, or load a list of links in some other way and then visit each one to extract the required information?

Any ideas would be appreciated.

    2 answers

    I once needed to parse IP addresses for monitoring game servers. The links there looked like this:

        site.ru/page?page=1 - 50 servers here
        site.ru/page?page=2 - 50 servers here
        site.ru/page?page=3 - 50 servers here

    I did a simple for loop:

        // Pages are numbered from 1, as in the links above
        for ($i = 1; $i <= 10; $i++) {
            $url = "site.ru/page?page=$i";
            // The parsing code goes here
        }

    Where 10 is the number of pages.
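When the number of pages is not known in advance, the same loop can stop as soon as a page comes back empty. The following is only a sketch under assumptions: `buildPageUrl()` and `crawlPages()` are hypothetical helper names, the `page` query parameter is taken from the example links above, and `$parsePage` stands for whatever parsing code you already have.

```php
<?php
declare(strict_types=1);

/**
 * Build the URL of one listing page.
 * Assumption: pages are addressed through a "page" query parameter.
 */
function buildPageUrl(string $baseUrl, int $page): string
{
    return $baseUrl . '?' . http_build_query(['page' => $page]);
}

/**
 * Visit pages 1, 2, 3, ... until a page yields no items
 * or the safety limit $maxPages is reached.
 *
 * $parsePage is a hypothetical callback: it receives a URL and
 * returns an array of the items found on that page.
 */
function crawlPages(string $baseUrl, callable $parsePage, int $maxPages = 100): array
{
    $items = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $found = $parsePage(buildPageUrl($baseUrl, $page));
        if ($found === []) {
            break; // an empty page is treated as "past the last page"
        }
        $items = array_merge($items, $found);
    }
    return $items;
}
```

The callback could, for instance, create a DiDOM `Document` for the URL and return the texts of the elements it finds. The `$maxPages` limit is there so that a site which never returns an empty page cannot keep the loop running forever.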

    • Thank you, noted! Can you suggest any other options? I wonder how else the problem could be approached. - Amsterdam
    • @Amsterdam Frankly, no: I have only dealt with parsing once, and I used the option I already described. Try watching video tutorials on parsing websites with different libraries; you may well find something there. - Dmitriy Movsesyan
    • I am searching and reading, but so far I have not found any alternatives. Thanks for the reply! - Amsterdam
    • Thanks again for the loop idea; based on your example, I managed to write my own! I also wanted to ask how to configure a PHP parser to work through a proxy. I still cannot grasp the basics of that. - Amsterdam
    • @Amsterdam Unfortunately, I don't know about proxies; I'm still learning myself, sorry :) - Dmitry Movsesyan
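On the proxy question raised in the comments, one possible approach is to download the page yourself through PHP's HTTP stream context and then hand the HTML to the parser. This is a hedged sketch, not something taken from the thread: `makeProxyContext()` and `fetchThroughProxy()` are hypothetical helper names, and the proxy address is a placeholder.

```php
<?php
declare(strict_types=1);

/**
 * Create a stream context that sends HTTP requests through a proxy.
 * Assumption: $proxy looks like 'tcp://127.0.0.1:8080' (placeholder address).
 */
function makeProxyContext(string $proxy, int $timeoutSeconds = 15)
{
    return stream_context_create([
        'http' => [
            'proxy'           => $proxy,
            'request_fulluri' => true, // many proxies expect the full URL in the request line
            'timeout'         => $timeoutSeconds,
        ],
    ]);
}

/**
 * Download a page through the proxy.
 * Returns the HTML, or null if the request failed.
 */
function fetchThroughProxy(string $url, string $proxy): ?string
{
    $html = @file_get_contents($url, false, makeProxyContext($proxy));
    return $html === false ? null : $html;
}
```

The returned string can then be passed to the parser as HTML rather than as a URL, for example `new Document($html)` in DiDOM. Whether a particular proxy needs authentication or other options depends on the proxy itself.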

    In the end I arrived at this version (minimal basic code that can and should be improved):

        <?php
        // If this file is external, include your system's API here
        require_once('vendor/autoload.php');

        use DiDom\Document;

        // Load the product catalog
        $document = new Document('http://site.ru/catalog/', true);

        // Find the links that lead to the product pages
        $links = $document->find('a.startshop-name');

        // Loop over every product link
        foreach ($links as $key => $value) {
            // Get the product URL in turn (the part that comes after site.ru/catalog)
            $dodo = $value->getAttribute('href');

            // Build the full URL of the product page
            $massa = "http://site.ru/$dodo";
            $document = new Document($massa, true);

            // Find the H1 on the page
            $main_heading = $document->find('h1')[0];

            // Find the photo
            $foto = $document->find('#slider_images a::attr(href)')[0];

            // Here, create a new resource through your system's API, filling in the data obtained
        }
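The code above builds each product URL by gluing the `href` onto `http://site.ru/`, which only works if every link has exactly the shape the code expects. A slightly more defensive way to do it is sketched below. `absoluteUrl()` is a hypothetical helper, not part of DiDOM, and it deliberately handles only the common cases (it does not resolve `../` segments).

```php
<?php
declare(strict_types=1);

/**
 * Turn an href found on $base into an absolute URL.
 * Handles absolute, protocol-relative, root-relative and
 * simple directory-relative links.
 */
function absoluteUrl(string $base, string $href): string
{
    // Already absolute: return unchanged
    if (preg_match('#^https?://#i', $href) === 1) {
        return $href;
    }

    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative link such as //cdn.site.ru/a.jpg
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;
    }

    // Root-relative link such as /catalog/tovar/
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }

    // Relative to the directory of the base page
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}
```

Inside the `foreach`, the line that builds `$massa` could then call `absoluteUrl('http://site.ru/catalog/', $dodo)` instead of concatenating strings.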