I am parsing data from a particular site that runs on an extremely obscure engine. Each news item is output in its own tag (always the same one). I do it with this code:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://24gadget.ru/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);

// Extract the news blocks
preg_match_all("~<article class=\"news-announce\">(.*?)</article>~is", $result, $post);

// Iterate over the array, one news item at a time
foreach ($post[0] as $news) {
    // Remove unneeded parts of the block
    $news = preg_replace("'<div class=\"announce-links clearfix\">.*?</div>'si", "", $news);

    // Get the image URL
    preg_match("~<img[^>]*?>~", $news, $img);
    $img = strstr($img[0], '/uploads/');
    $img = strstr($img, '" alt', true);
    $img = "http://24gadget.ru" . $img;

    // Remove unneeded parts of the block
    $news = preg_replace("'<div class=\"announce-text clearfix\">.*?</div>'si", "", $news);
    $news = preg_replace("'\(.*?\)'si", "", $news);
    $news = strip_tags($news, '<span><a><div><article>');

    // Extract the link
    $link = strstr($news, 'http');
    $link = strstr($link, '" class', true);

    // Extract the date and the title
    preg_match("~<a[^>]*?>(.*)</a>~", $news, $article);
    preg_match("~<span[^>]*?>(.*)</span>~", $news, $date);

    // Check whether the news item is from today
    $pos = strpos($date[1], 'Вчера');

    // If so, process the link
    if ($pos !== false) {
        // Get the date for time comparisons
        $date = intval(str_replace(':', '', strstr($date[1], ' ')));

        // Request the article itself
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $link);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $result = curl_exec($ch);
        curl_close($ch);

        // Extract the article text
        preg_match_all("~<article class=\"news-announce\">(.*?)</article>~is", $result, $text);

        // Strip all the junk from the article, leaving only the text
        $text = $text[0][0];
        $text = preg_replace("'<div align=\"center\">.*?</div>'si", "", $text);
        $text = preg_replace("'<script[^>]*?>.*?</script>'si", "", $text);
        $text = preg_replace("'<h1[^>]*?>.*?</h1>'si", "", $text);
        $text = preg_replace("'<div class=\"announce-meta clearfix\">.*?</div>'si", "", $text);
        $text = preg_replace("'<div class=\"big-share\">.*?</div>'si", "", $text);
        $text = preg_replace("'<div class=\"announce-tags clearfix\">.*?</div>'si", "", $text);
        $text = preg_replace("'<div style=\".*?\">.*?</div>'si", "", $text);
        $text = preg_replace("'<!--.*?-->'si", "", $text);
        $text = preg_replace("'<img[^>]*?>'si", "", $text);
        $text = preg_replace('/(<br[^>]*>\s*)+/i', '', $text);
        $text = strstr($text, 'Источник', true);
        $text = strip_tags($text, '');
        $text = preg_replace('/\\r\\n?|\\n/', '', $text);

        // Remove all leading whitespace
        $text = ltrim($text);

        $arr = array($link, $date, $article[1], $text, $img);
        $end[] = $arr;
        $arr = array();
    }
}

usort($end, function ($a, $b) {
    return ($a['1'] - $b['1']);
});

// Print the array for inspection
echo "<pre>";
print_r($end);
echo "</pre>";

The problem is that, even though the markup is the same for every article, some of the news does not parse: for some items I get the article text, while others come back empty even though they do contain text. How can this problem be solved?
P.S. If you are reading this after 00:00 and decide to test the code, replace Сегодня ("today") with Вчера ("yesterday") on line 45 (the strpos() check on the date).
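
One thing that may be worth checking (this is an assumption, the post does not confirm it is the cause): strstr() returns false when the needle is not found, so an article whose text does not contain the word Источник would lose its whole body at that step. A minimal sketch of a guard, using a hypothetical helper named textBeforeMarker() that is not part of the original script:

```php
<?php
// Hypothetical helper (not in the original script): return the part of $text
// that precedes $marker, or $text unchanged when the marker is absent.
function textBeforeMarker(string $text, string $marker): string
{
    $head = strstr($text, $marker, true);
    // strstr() returns false when $marker does not occur in $text,
    // so fall back to the full text instead of discarding it.
    return $head === false ? $text : $head;
}

// Made-up sample strings, for illustration only.
$withMarker    = 'Article body. Источник: example';
$withoutMarker = 'Article body without a source line.';

var_dump(textBeforeMarker($withMarker, 'Источник'));
var_dump(textBeforeMarker($withoutMarker, 'Источник'));

// The unguarded call from the script, on the same input:
var_dump(strstr($withoutMarker, 'Источник', true)); // bool(false)
```

If the empty articles turn out to contain the marker after all, the guard changes nothing and the cause lies elsewhere, for example in one of the non-greedy div patterns.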
file_get_contents, and this site does not parse with that function. Only cURL worked. - NTP
file_get_contents has a bunch of settings that are passed through a parameter, the context, which, by the way, can also be passed into these libraries. - teran
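
To illustrate the context parameter mentioned in the last comment, here is a rough sketch. The helper name buildHttpContext(), the user-agent string and the timeout are assumptions chosen for the example; whether the site accepts requests made this way is not something the thread establishes.

```php
<?php
// Hypothetical helper: build a stream context with HTTP options that
// file_get_contents() can use. The option values are illustrative only.
function buildHttpContext(string $userAgent, int $timeoutSeconds)
{
    return stream_context_create([
        'http' => [
            'method'          => 'GET',
            'user_agent'      => $userAgent,      // sent as the User-Agent header
            'timeout'         => $timeoutSeconds, // read timeout in seconds
            'follow_location' => 1,               // follow redirects
            'ignore_errors'   => true,            // return the body even on 4xx/5xx
        ],
    ]);
}

$context = buildHttpContext('Mozilla/5.0 (compatible; ExampleBot/1.0)', 10);

// Usage (makes a network request, so it is left commented out here):
// $html = file_get_contents('http://24gadget.ru/', false, $context);
// if ($html === false) { /* the request failed */ }

// Show the options stored in the context.
print_r(stream_context_get_options($context));
```

The context is the third argument of file_get_contents(); the second argument (use_include_path) is left as false.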