Hello, dear community. Please help me with parsing content. Some background first. There is a well-known news site, Lenta.Ru. Every section of the site has a regularly updated block of links called "Outside" ("Аутсайд"). I want to parse this block from two sections: run a script from cron and pull the content out of the block. Since each update replaces only part of the block, not all of it, duplicates have to be removed. Someone wrote a script for me, but it only grabs the links from the block, whereas I need both the link and its accompanying text saved to a file.

Here is the code:

<?php
require_once 'phpquery.php';

// Parse links from the "Outside" block (lenta.ru)
ini_set('max_execution_time', 0);

$count_add_links = 0;
$fname = 'link_lenta.html';
$urls = array(
    'http://lenta.ru/internet/',
    'http://lenta.ru/digital/'
);
$hentry = array();
$links_arr = array();

foreach ($urls as $url) {
    $page_content = file_get_contents($url);
    $document = phpQuery::newDocument($page_content);
    $hentry[] = $document->find('div#outside table td a');
}

// If the file with links does not exist yet, create it
if (!file_exists($fname)) {
    file_put_contents($fname, '');
}

// Read all links already stored in the file
$links = phpQuery::newDocument(file_get_contents($fname))->find('a');
foreach ($links as $link) {
    $links_arr[] = pq($link)->attr('href');
}

// Go through all newly found links and write them to the file without duplicates
foreach ($hentry as $url => $he) {
    foreach ($he as $el) {
        $pq = pq($el);
        if (!in_array($pq->attr('href'), $links_arr)) {
            // Append the new link
            file_put_contents($fname, '<a class="links-lenta" href="' . $pq->attr('href') . '">' . $pq->html() . '</a>' . PHP_EOL, FILE_APPEND);
            $count_add_links++;
        }
    }
}

echo 'Done! Add ' . $count_add_links . ' link(s).';
die();

Markup example:

  <div id=outside><h3>Аутсайд</h3>
  <table>
  <tr><td class=mz1><span>TechCrunch:</span> <a href= http://techcrunch.com/2012/11/18/facebook-https/ target=_blank>Facebook Could Slow Down A Tiny Bit As It Starts Switching All Users To Secure HTTPS Connections</a><br>Facebook начал переводить всех на защищенное соединение.</td></tr>
  <tr><td class=mz2><span>Quartz:</span> <a href= http://qz.com/28895/why-opera-thrives-in-europes-last-dictatorship/ target=_blank>Why Opera thrives in Europe's last dictatorship</a><br>Журналисты задаются вопросом, почему Opera гиперпопулярна в Белоруссии.</td></tr>
  ....
  </table>
  </div>

Right now only the links are saved, for example <a href= http://techcrunch.com/2012/11/18/facebook-https/ target=_blank>Facebook Could Slow Down A Tiny Bit As It Starts Switching All Users To Secure HTTPS Connections</a>, but I need the text around each link to be written to the file as well.
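One possible way to get there is to walk over the `td` cells of the block instead of the `a` elements, so that the source name, the link and the description after `<br>` can be read together. The sketch below is only an illustration under several assumptions: it uses PHP's built-in DOMDocument and DOMXPath rather than phpQuery so that it runs on its own, the function names `extractOutsideItems` and `appendNewItems` are hypothetical, and the `<p class="links-lenta">` line format written to the file is my own choice, not something the site or the original script prescribes. Duplicates are detected by `href`, both against the file and within a single run.

```php
<?php
// Hypothetical helper: returns one record per <td> of the "Outside" block.
function extractOutsideItems(string $html): array
{
    $dom = new DOMDocument();
    // The block's markup is loose HTML, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    // The XML-encoding prefix is a common workaround to make loadHTML treat the input as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $items = [];
    foreach ($xpath->query('//div[@id="outside"]//td') as $td) {
        $a = $xpath->query('.//a', $td)->item(0);
        if ($a === null) {
            continue; // a cell without a link is skipped
        }
        $items[] = [
            'href'   => trim($a->getAttribute('href')),
            'source' => trim($xpath->evaluate('string(span)', $td)),
            'title'  => trim($a->textContent),
            // First text node after <br>, i.e. the description under the link.
            'text'   => trim($xpath->evaluate('string(br/following-sibling::text()[1])', $td)),
        ];
    }
    return $items;
}

// Hypothetical helper: appends only items whose href is not in the file yet.
// Returns the number of items added.
function appendNewItems(string $fname, array $items): int
{
    $existing = is_file($fname) ? file_get_contents($fname) : '';
    preg_match_all('/href="([^"]*)"/', $existing, $m);
    // Keys are the hrefs already saved (decoded back from their escaped form).
    $seen = array_flip(array_map('htmlspecialchars_decode', $m[1]));

    $added = 0;
    foreach ($items as $item) {
        if (isset($seen[$item['href']])) {
            continue;
        }
        $seen[$item['href']] = true; // also guards against repeats within this run
        $line = sprintf(
            '<p class="links-lenta">%s <a href="%s">%s</a><br>%s</p>',
            htmlspecialchars($item['source']),
            htmlspecialchars($item['href']),
            htmlspecialchars($item['title']),
            htmlspecialchars($item['text'])
        );
        file_put_contents($fname, $line . PHP_EOL, FILE_APPEND);
        $added++;
    }
    return $added;
}
```

In the cron script this could be called once per section, for example `appendNewItems($fname, extractOutsideItems(file_get_contents($url)))` inside the existing `foreach ($urls as $url)` loop. If you prefer to stay with phpQuery, the same idea should carry over by selecting `div#outside table td` and reading `pq($td)->find('a')->attr('href')` and `pq($td)->html()` for each cell.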

Thank you in advance.
