There is an index.html file that links to three other files: one.html, two.html and three.html. The structure is extremely simple. one.html and three.html each contain a ul tag with the class list; two.html has no tag with this class.
There is also a get.php file that parses these pages, checks each one for the .list class, and reports whether the class exists. Here is the code:
    <?php
    header('Content-type: text/html; charset=UTF-8');
    $start = microtime(true);
    set_include_path(get_include_path().PATH_SEPARATOR.'library/');
    set_include_path(get_include_path().PATH_SEPARATOR.'phpQuery/');
    require('config.php');
    function __autoload( $className ) {
        require_once( "$className.php" );
    }
    echo "<br>".date('H:i:s')." Starting parsing ";
    echo '<pre>';
    $page = file_get_contents('index.html');
    $document = phpQuery::newDocument($page);
    $links = [];
    foreach ($document->find('ul li a') as $link) {
        $links[] = pq($link)->attr('href');
    }
    print_r($links);
    foreach ($links as $sublink) {
        $pageText = new Curl();
        $pagenew = $pageText->get_page($sublink);
        $cat_page = phpQuery::newDocument($pagenew);
        $catlist = [];
        foreach ($cat_page as $cat_page) {
            if ($item = pq($cat_page)->find('ul.list a')) {
                echo "class is</br>";
            } else {
                echo "class not is</br>";
            }
        }
    }

The problem is that the script reports the class on all three pages, although it is not present on the second page. Please help me understand what is wrong here. This is what is displayed:

    11:07:22 Starting parsing
    Array
    (
        [0] => one.html
        [1] => two.html
        [2] => three.html
    )
    class is
    class is
    class is

Here are the contents of one.html and three.html:
    <!DOCTYPE html>
    <html lang="ru">
    <head>
        <meta charset="utf-8">
    </head>
    <body>
        <ul class="list">
            <li><a href="link.html">link</a></li>
        </ul>
    </body>
    </html>

And the contents of two.html:

    <!DOCTYPE html>
    <html lang="ru">
    <head>
        <meta charset="utf-8">
    </head>
    <body>
        <p>no class</p>
    </body>
    </html>
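
For comparison, here is a minimal sketch of the same presence check written with PHP's built-in DOM extension instead of phpQuery and the Curl class. The helper name hasListClass and the inline test strings are assumptions made for illustration; they are not part of get.php. The point it shows is that a query result object is truthy in PHP even when it matched nothing, so the number of matched nodes has to be tested explicitly. phpQuery's find() likewise returns an object, so the same idea may apply to the condition in get.php, though that is not verified here.

```php
<?php
// Hypothetical helper (not part of the original get.php): returns true only
// when the markup contains a <ul> whose class attribute includes "list".
function hasListClass(string $html): bool
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Match "list" as a whole word inside the class attribute.
    $nodes = $xpath->query(
        '//ul[contains(concat(" ", normalize-space(@class), " "), " list ")]'
    );

    // A DOMNodeList is an object, so it is truthy even when empty;
    // the number of matches has to be compared explicitly.
    return $nodes !== false && $nodes->length > 0;
}

// Inline stand-ins for one.html and two.html (assumed, shortened markup).
$withList = '<!DOCTYPE html><html><body><ul class="list"><li><a href="link.html">link</a></li></ul></body></html>';
$withoutList = '<!DOCTYPE html><html><body><p>no class</p></body></html>';

echo hasListClass($withList) ? "class is\n" : "class not is\n";
echo hasListClass($withoutList) ? "class is\n" : "class not is\n";
```

With these two inputs the first call prints "class is" and the second prints "class not is".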