I have a very simple parser, but it only picks up content from the top of the page. From what I found on Google, I need to use $start and $finish, but when I set them, nothing gets parsed at all. The donor site has the following structure:

    <div class="firm-list-item firm-place-1">~content to be fetched~</div>
    <div class="firm-list-item firm-place-2">~content to be fetched~</div>
    <div class="firm-list-item firm-place-2">~content to be fetched~</div>

The content between the first and second divs is picked up, but nothing after that. Here is the parser code itself:

    $title=file_get_contents($url);
    $start = '<div class="firm-list-item firm-place-2">';
    $finish = '<div class="firm-list-item firm-place-3">';
    $pos=strpos($title,'<a class="firm-item-title" href=');
    $title=substr($title,$pos);
    $pos1=strpos($title,'</a>');
    $title=substr($title,0,$pos1);
    $title=preg_replace('<a class="firm-item-title" href="/firm/id/[0-9]+/">','',$title);
    echo $title;
    echo '<br>';

Could you tell me what the problem is here, and where $start and $finish should go?

    1 answer

    It will be easier to use the phpQuery library (https://github.com/punkave/phpQuery).

        require('phpQuery/phpQuery.php');

        // Fetch the page over cURL and return its body (false on failure).
        function get_content_by_url($url_target)
        {
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $url_target);
            curl_setopt($ch, CURLOPT_HEADER, false);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
            curl_setopt($ch, CURLOPT_USERAGENT, 'Google Bot');
            $data = curl_exec($ch);
            curl_close($ch);
            return $data;
        }

        $url_target = 'http://example.site.com/';
        $html_content = get_content_by_url($url_target);
        $document = phpQuery::newDocument($html_content);

        // [class*=...] matches anywhere in the class attribute; ^= would not match
        // here because the attribute starts with "firm-list-item".
        $found_items = $document->find('div.firm-list-item[class*="firm-place-"]');

        $print = '';
        foreach ($found_items as $key => $item) {
            $pq = pq($item);
            $content_text = $pq->text(); // text only
            $content_html = $pq->html(); // full HTML (content)
            $print .= '<li class="my_item">' . $content_html . '</li>';
        }

        $final_content = '<ul class="my_list">' . $print . '</ul>';
        echo $final_content;

    In theory, this should work.
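As for where $start and $finish would go in the original string-based approach, here is a minimal sketch. The helper name extract_between and the sample markup are assumptions made for illustration, not part of the question's code. The idea is to find $start, then look for $finish only after it, and take the substring in between.

```php
<?php
// Hypothetical helper: returns the text between $start and $finish,
// or null when either marker is missing.
function extract_between(string $html, string $start, string $finish): ?string
{
    $from = strpos($html, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);            // skip past the start marker itself
    $to = strpos($html, $finish, $from); // search for the end marker after it
    if ($to === false) {
        return null;
    }
    return substr($html, $from, $to - $from);
}

// Sample markup standing in for the donor page (an assumption).
$html = '<div class="firm-list-item firm-place-1"><a class="firm-item-title" href="/firm/id/1/">First</a></div>'
      . '<div class="firm-list-item firm-place-2"><a class="firm-item-title" href="/firm/id/2/">Second</a></div>'
      . '<div class="firm-list-item firm-place-3"><a class="firm-item-title" href="/firm/id/3/">Third</a></div>';

$start  = '<div class="firm-list-item firm-place-2">';
$finish = '<div class="firm-list-item firm-place-3">';

$block = extract_between($html, $start, $finish);
// strip_tags() drops the markup and leaves only the link text.
$title = $block === null ? '' : trim(strip_tags($block));

echo $title;
echo '<br>';
```

In the question's code, $start and $finish are assigned but never passed to strpos(), so the search always begins at the top of the page. Marker-based slicing like this is fragile if the markup changes, which is one reason a selector-based library may be the easier route.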

    • The answer is OK, but please briefly describe how it works. - Naumov
    • Also, for future reference, get_content_by_url can be replaced with file_get_contents($url). - Naumov
    • It's better to go through curl than file_get_contents. - Dr. Mc My
    • Better in what way? - Naumov
    • And how is it worse? - Dr. Mc My
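On the curl versus file_get_contents question, here is a rough sketch of a file_get_contents() call that sets a timeout and a User-Agent through a stream context. The helper names build_http_context and get_content_by_stream are hypothetical. Note that the `timeout` option of the http wrapper is a read timeout, so it is not an exact equivalent of CURLOPT_CONNECTTIMEOUT, and fetching URLs this way requires allow_url_fopen to be enabled.

```php
<?php
// Hypothetical helper: builds a stream context that roughly mirrors
// the cURL options used in the answer.
function build_http_context(int $timeout = 30, string $userAgent = 'Google Bot')
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeout,   // read timeout, in seconds
            'user_agent' => $userAgent,
        ],
    ]);
}

// Hypothetical helper: returns the page body, or null on failure.
function get_content_by_stream(string $url, $context): ?string
{
    $data = @file_get_contents($url, false, $context);
    return $data === false ? null : $data;
}

$context = build_http_context();

// Usage (performs a network request, so it is left commented out):
// $html_content = get_content_by_stream('http://example.site.com/', $context);
```

Either approach hands the same HTML string to the parser. The practical difference is mostly in how much control over the request each one gives.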