I have an HTML page to parse. At first I tried Simple HTML DOM, but it refused: the file is almost twice its MAX_FILE_SIZE limit of 600,000 bytes. Then I tried preg_match_all, which works fine the first time, but the second call throws Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 35 bytes) in Z:\home\localhost\www\parser\1.php on line 50 . The page weighs 1218 KB.

 $html = file_get_contents('source.html');
 //$url = "http://46.4.130.245:8080/scripts/grabber/serp-test.php?site=semyadro.pro";
 preg_match_all("#span class='big'>(.*?)<#", $html, $loss);
 $res['poterianie']  = $loss[1][0];
 $res['viroschie']   = $loss[1][1];
 $res['prosevschie'] = $loss[1][2];
 $res['novie']       = $loss[1][3];
 preg_match_all('#<div class=\'color_main || medium\'>(.*?)</div>#', $html, $sites); // fatal error here(

How can I fix this? What is the problem?

  • I don't believe it. The unescaped metacharacters || would be a syntax error, and preg_match_all would not run at all, it would simply report the error, so it could not consume that much RAM. Post the real regular expression. You can also optimize this pattern's runtime and memory use: it is enough to replace (.*?)< with ((?:[^<]++|<)*?)< (see the sketch after these comments). - ReinRaus
  • You're right, the mistake was exactly in the || - Dima Morgunov
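A minimal sketch of what that replacement could look like, applied to the patterns from the question (the class names, variable names and 'source.html' come from the question; reading the intended alternation as color_main or medium is an assumption):

 $html = file_get_contents('source.html');
 // the possessive [^<]++ consumes the text between tags without the heavy
 // backtracking that (.*?) can trigger on a ~1.2 MB document
 preg_match_all("#span class='big'>((?:[^<]++|<)*?)<#", $html, $loss);
 // assuming the '||' was meant as the alternation color_main|medium,
 // not an empty branch
 preg_match_all("#<div class='(?:color_main|medium)'>((?:[^<]++|<)*?)</div>#", $html, $sites);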

1 answer

In your case, the problem can be solved by allocating more memory to the process:

 ini_set('memory_limit', '512M'); 

But the more correct way is to parse large XML files by reading them in chunks. Have a read here: http://php.net/manual/ru/book/xml.php

I have been working with this extension for a year now. It handles files of 100 MB and more perfectly. A rough sketch of the chunked approach is below.
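A minimal sketch of chunked parsing with the XML Parser extension, assuming the input is well-formed XML/XHTML (plain, messy HTML will usually make the parser stop with an error); the 8 KB chunk size and the empty handlers are placeholders:

 $parser = xml_parser_create();
 xml_set_element_handler(
     $parser,
     function ($p, $name, $attrs) { /* opening tag: inspect $name / $attrs here */ },
     function ($p, $name) { /* closing tag */ }
 );
 $fp = fopen('source.html', 'r');
 while (!feof($fp)) {
     $chunk = fread($fp, 8192);              // read the file 8 KB at a time
     xml_parse($parser, $chunk, feof($fp));  // feed each chunk to the parser
 }
 fclose($fp);
 xml_parser_free($parser);

This way only one chunk is held in memory at a time instead of the whole 1.2 MB string.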

  • But I don't have XML, I have HTML. How can I read it in parts? - Dima Morgunov
  • Sorry, I don't know - I have not had to deal with large HTML. Try reading here: php.net/manual/ru/book.dom.php Even if it cannot read in parts, it is still better than parsing with regular expressions... (a rough sketch is below) - Yuri Maksimenko
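A rough sketch of that DOM-based variant for this task; the class name 'big' and 'source.html' are taken from the question, everything else is an assumption. Note that DOMDocument still loads the whole document into memory, it just does so far more predictably than a backtracking regular expression:

 libxml_use_internal_errors(true);   // don't flood the output with warnings on messy HTML
 $doc = new DOMDocument();
 $doc->loadHTMLFile('source.html');
 $xpath = new DOMXPath($doc);
 foreach ($xpath->query("//span[@class='big']") as $span) {
     echo trim($span->textContent), "\n";
 }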