I am parsing data from a site that runs on a rather opaque engine. Every news item is rendered inside the same tag, so I extract the items with this code:

    <?php
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://24gadget.ru/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);

    // Extract the news blocks
    preg_match_all("~<article class=\"news-announce\">(.*?)</article>~is", $result, $post);

    // Collected news items
    $end = array();

    // Walk the array, one news item at a time
    foreach ($post[0] as $news) {
        // Remove the unneeded parts of the block
        $news = preg_replace("'<div class=\"announce-links clearfix\">.*?</div>'si", "", $news);

        // Get the image URL
        preg_match("~<img[^>]*?>~", $news, $img);
        $img = strstr($img[0], '/uploads/');
        $img = strstr($img, '" alt', true);
        $img = "http://24gadget.ru" . $img;

        // Remove the unneeded parts of the block
        $news = preg_replace("'<div class=\"announce-text clearfix\">.*?</div>'si", "", $news);
        $news = preg_replace("'\(.*?\)'si", "", $news);
        $news = strip_tags($news, '<span><a><div><article>');

        // Extract the link
        $link = strstr($news, 'http');
        $link = strstr($link, '" class', true);

        // Extract the title and the date
        preg_match("~<a[^>]*?>(.*)</a>~", $news, $article);
        preg_match("~<span[^>]*?>(.*)</span>~", $news, $date);

        // Check whether the news item is from today
        $pos = strpos($date[1], 'Вчера');

        // If it is, follow the link
        if ($pos !== false) {
            // Turn the time into an integer so items can be compared
            $date = intval(str_replace(':', '', strstr($date[1], ' ')));

            // Request the article itself
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $link);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            $result = curl_exec($ch);
            curl_close($ch);

            // Extract the article text
            preg_match_all("~<article class=\"news-announce\">(.*?)</article>~is", $result, $text);

            // Strip all the junk from the article, leaving only the text
            $text = $text[0][0];
            $text = preg_replace("'<div align=\"center\">.*?</div>'si", "", $text);
            $text = preg_replace("'<script[^>]*?>.*?</script>'si", "", $text);
            $text = preg_replace("'<h1[^>]*?>.*?</h1>'si", "", $text);
            $text = preg_replace("'<div class=\"announce-meta clearfix\">.*?</div>'si", "", $text);
            $text = preg_replace("'<div class=\"big-share\">.*?</div>'si", "", $text);
            $text = preg_replace("'<div class=\"announce-tags clearfix\">.*?</div>'si", "", $text);
            $text = preg_replace("'<div style=\".*?\">.*?</div>'si", "", $text);
            $text = preg_replace("'<!--.*?-->'si", "", $text);
            $text = preg_replace("'<img[^>]*?>'si", "", $text);
            $text = preg_replace('/(<br[^>]*>\s*)+/i', '', $text);
            $text = strstr($text, 'Источник', true);
            $text = strip_tags($text, '');
            $text = preg_replace('/\\r\\n?|\\n/', '', $text);

            // Remove leading whitespace
            $text = ltrim($text);

            $arr = array($link, $date, $article[1], $text, $img);
            $end[] = $arr;
            $arr = array();
        }
    }

    // Sort by time
    usort($end, function ($a, $b) {
        return ($a[1] - $b[1]);
    });

    // Print the array for inspection
    echo "<pre>";
    print_r($end);
    echo "</pre>";

The problem is that although the markup is the same for every item, some news items are not parsed: for some I get the article text, while others come back empty even though they do contain text. How can I solve this?

P.S. If you are reading this after 00:00 and decide to test the code, replace Сегодня ("Today") with Вчера ("Yesterday") in the strpos() check on the date (line 45 of my script).

  • Have you tried working with the page as a DOM document instead of using regular expressions? - teran
  • @teran, I'm weak in this area, can you give an example? file_get_contents cannot fetch this site - NTP
  • Look at Simple HTML DOM, phpQuery, the standard DOM extension, or other equivalents that are easy to find with a search. - teran
  • @teran, they all use file_get_contents, and this site cannot be fetched with that function. Only cURL worked. - NTP
  • First, you can feed any of those a string of data instead of having it open a URL. Second, file_get_contents takes a whole set of options through its context parameter, and that context can, incidentally, be passed to these libraries as well. - teran
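
As a rough illustration of the suggestion in the comments above, the string returned by curl_exec() can be handed straight to PHP's bundled DOM extension, so nothing in the parser has to open the URL. This is only a sketch: the markup below is a made-up sample that imitates the class name from the question, and the example.com links and the $items array are assumptions, not the real site's structure.

```php
<?php
// Stand-in for the string returned by curl_exec(); hypothetical sample markup.
$html = <<<HTML
<html><body>
<article class="news-announce">
  <a href="http://example.com/news/1" class="title">First headline</a>
  <span class="date">Сегодня, 10:15</span>
</article>
<article class="news-announce">
  <a href="http://example.com/news/2" class="title">Second headline</a>
  <span class="date">Вчера, 23:40</span>
</article>
</body></html>
HTML;

// Collect parser warnings internally; tags such as <article> can trigger them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// The XML-style prefix is a common workaround that asks libxml to read the input as UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$items = [];

// One iteration per news block, selected by its class attribute.
foreach ($xpath->query('//article[@class="news-announce"]') as $article) {
    $a    = $xpath->query('.//a', $article)->item(0);
    $span = $xpath->query('.//span', $article)->item(0);

    $items[] = [
        'link'  => $a->getAttribute('href'),
        'title' => trim($a->textContent),
        'date'  => trim($span->textContent),
    ];
}

print_r($items);
```

With this approach the link, title and date come from element nodes and attributes, so a missing marker word in the text cannot silently blank out the whole item.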

2 answers

After going through the libraries and regular expressions and experimenting step by step, I found the problem. Not every article names its source. I remove the source by cutting the text at the word Источник ("Source"), but when that word is absent, strstr() returns false and the whole text is lost:

 $text = strstr($text, 'Источник', true); 

If you add a check for its presence, everything works:

    $pos = strpos($text, 'Источник');
    // strpos() returns an int offset or false, never true
    if ($pos !== false) {
        $text = strstr($text, 'Источник', true);
    }
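
The same guard can be wrapped in a small function so that the "cut only if the marker is present" rule lives in one place. The sketch below assumes a hypothetical helper name, cutBeforeMarker(), and made-up sample strings; it relies only on the documented behaviour of strpos() and strstr(), both of which return false when the needle is not found.

```php
<?php
/**
 * Hypothetical helper: return the part of $text that precedes $marker,
 * or $text unchanged when $marker does not occur in it.
 */
function cutBeforeMarker(string $text, string $marker): string
{
    // strpos() yields an integer offset on success and false on failure,
    // so "not found" has to be tested with a strict comparison.
    if (strpos($text, $marker) === false) {
        return $text;
    }

    return strstr($text, $marker, true);
}

// Made-up sample strings, one with a source line and one without.
$withSource    = 'Текст статьи. Источник: example.com';
$withoutSource = 'Текст статьи без ссылки.';

// The unguarded call from the question returns false when the marker is missing.
var_dump(strstr($withoutSource, 'Источник', true)); // bool(false)

echo cutBeforeMarker($withSource, 'Источник'), "\n";    // text up to the marker
echo cutBeforeMarker($withoutSource, 'Источник'), "\n"; // text left unchanged
```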

I'll leave this here; maybe it will come in handy for someone.

P.S. On the other hand, while tracking this down I came up with a shorter and more convenient way of parsing, so sometimes errors are useful.

    You can simply write:

     if($pos) $text = strstr($text, 'Источник', true); 

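    This is equivalent to if($pos == true): it does not check the type, but it is shorter to write. Keep in mind that a match at offset 0 is then treated as "not found", because 0 is falsy; if the marker can be the very first thing in the text, compare with $pos !== false instead.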
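
To see where the loose test and a strict comparison part ways, here is a minimal sketch with a made-up string in which the marker is at the very beginning. The variable names $loose and $strict are illustrative only.

```php
<?php
// Made-up sample: the marker starts at offset 0.
$text = 'Источник: example.com';

$pos = strpos($text, 'Источник');

$loose  = (bool) $pos;        // what if ($pos) evaluates
$strict = ($pos !== false);   // explicit "was the marker found?" test

var_dump($pos);    // int(0)
var_dump($loose);  // bool(false): the loose test skips the cut
var_dump($strict); // bool(true)
```

For any offset other than 0 the two tests agree, which is why the short form tends to work in practice for article text that begins with something other than the marker.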