preg_match problem with a large number of records

Question

Good day.

Please help me sort out a regular expression. I fetched a page with curl into the $page variable. The variable contains approximately the following:

<div id="middleContent">
  <div class="right-content">
    <div>
      <div id="labelRussianUL">
        <h3 style="cursor: pointer; text-decoration: underline;">...</h3>
      </div>
      <div id="textRussianUL" style="display: none;">
        <p>1. Record 1</p>
        <p>2. Record 2</p>
        <p>3. Record 3</p>
        ... about 100 ...
        <p>...</p>
      </div>
      <div id="labelRussianFL">
        <h3 style="cursor: pointer; text-decoration: underline;">...</h3>
      </div>
      <div id="textRussianFL" style="display: none;">
        <p>1. Record 1</p>
        <p>2. Record 2</p>
        <p>3. Record 3</p>
        ... a great many records, more than 5000 ...
        <p>...</p>
      </div>
    </div>
  </div>
</div>

I cannot understand why

preg_match('/<div[^>]*id="textRussianFL"[^>]*>(.*?)<\/div>/', $page, $match);

does not work, while

preg_match('/<div[^>]*id="textRussianUL"[^>]*>(.*?)<\/div>/', $page, $match);

works.

I checked the regular expression on regex101.com. With a small number of records it works. Thank you in advance for your help.

Why parse HTML with regular expressions? PHP has proper tools for working with the DOM: php.net/manual/ru/class.domdocument.php - Mike
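A minimal sketch of the DOM approach suggested in this comment, assuming the dom extension is available. The helper name extract_records and the sample markup are illustrative and not from the thread; a page with non-ASCII text may also need an encoding hint, which this sketch leaves out.

```php
<?php
// Hypothetical helper: return the text of every <p> directly inside <div id="$id">.
function extract_records(string $page, string $id): array
{
    // Real pages are rarely valid HTML, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($page);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $records = [];
    // The id is interpolated directly; acceptable here because it is a fixed literal.
    foreach ($xpath->query('//div[@id="' . $id . '"]/p') as $p) {
        $records[] = trim($p->textContent);
    }
    return $records;
}

// Sample markup modelled on the question; the record texts are placeholders.
$page = '<div id="middleContent">'
      . '<div id="textRussianUL" style="display: none;"><p>1. Record A</p></div>'
      . '<div id="textRussianFL" style="display: none;"><p>1. Record 1</p><p>2. Record 2</p></div>'
      . '</div>';

print_r(extract_records($page, 'textRussianFL'));
```

Because the parser walks the document once, the number of records inside the div does not trigger any regex backtracking.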

Visman · Answer 1 · 2016-08-22T16:08:38

Most likely the overflow is caused by this group:

 (.*?)

Immediately after this group comes the character < , and the div you need contains more than 5 thousand lines of the form

 <p>1. Record 1</p>

The engine keeps advancing one character at a time and then backtracks every time it meets a < whose following characters do not match </div>.
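When the engine gives up in this way, preg_match returns false rather than 0, and preg_last_error() reports the reason. The sketch below shows how to tell the cases apart; the helper name match_or_explain is hypothetical, and which error code a given page produces depends on the PCRE settings (for example pcre.backtrack_limit and pcre.jit), so treat it as a diagnostic aid rather than a fix.

```php
<?php
// Hypothetical helper: run preg_match and separate "no match" (0) from "engine error" (false).
function match_or_explain(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject);
    if ($result === false) {
        $code = preg_last_error();
        if ($code === PREG_BACKTRACK_LIMIT_ERROR) {
            return 'error: backtrack limit exhausted';
        }
        if ($code === PREG_JIT_STACKLIMIT_ERROR) {
            return 'error: JIT stack limit exhausted';
        }
        return 'error: code ' . $code;
    }
    return $result === 1 ? 'match' : 'no match';
}

// Small inputs match without trouble; a very large div may report an error instead.
echo match_or_explain('/<div[^>]*>(.*?)<\/div>/s', '<div id="x"><p>1</p></div>'), PHP_EOL;
echo match_or_explain('/<div[^>]*>(.*?)<\/div>/s', '<span>no div here</span>'), PHP_EOL;
```

Checking for false explicitly matters because an empty $match array looks the same for "no match" and for "the engine aborted".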

Here is the regular expression https://regex101.com/r/wK1aW4/1

 /<div[^>]*?id="textRussianFL"[^>]*?>(.*?)<\/div>/s

which causes an error in my test. The debugger shows this result:

Match 1 - finished in 211352 steps

And here is a more complex regular expression https://regex101.com/r/wK1aW4/2

 /<div[^>]*?id="textRussianFL"[^>]*?>([^<]*(?:<(?!\/div>)[^<]*)*)<\/div>/s

which no longer causes an error for me. The debugger shows this result:

Match 1 - finished in 42355 steps
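To see what the second pattern captures, here is a small sketch that runs it in PHP on a three-line sample. The sample markup is a placeholder modelled on the question, and the sketch only illustrates the capture; it says nothing about how the pattern behaves on the full page.

```php
<?php
// Placeholder markup: the target div followed by another div.
$page = '<div id="textRussianFL" style="display: none;"> <p>1. Record 1</p> <p>2. Record 2</p> </div>'
      . ' <div id="other"></div>';

// [^<]* consumes plain text in one go; the group consumes one "<" at a time,
// but only when that "<" does not start the closing "</div>".
$pattern = '/<div[^>]*?id="textRussianFL"[^>]*?>([^<]*(?:<(?!\/div>)[^<]*)*)<\/div>/s';

if (preg_match($pattern, $page, $match) === 1) {
    echo trim($match[1]), PHP_EOL;
}
```

The capture stops at the first </div> after the opening tag, so the pattern assumes the target div contains no nested div elements.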

I tried your regular expression '/<div[^>]*?id="textRussianFL"[^>]*?>([^<]*(?:<(?!\/div>)[^<]*)*)<\/div>/s'; with the data that curl returns, it does not finish before the timeout.
I copied the page source to regex101.com, deleted half of the data that curl returned, and then it did somehow return a result.
But it turns out that this option is not suitable for my case.
Use the regular expression to find only the opening tag <div[^>]*?id="textRussianFL"[^>]*?> , and then use the strpos function to find </div> , starting from the character that follows the matched opening tag.
So I find <div id="textRussianFL" style="display: none;"> with the regular expression and the closing </div> with strpos?
@baggi, the substr function is still available, along with simple arithmetic to compute the length of the substring between the two substrings you found.
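The approach described in these comments could look roughly like the sketch below. The helper name inner_html_of_div is hypothetical, and the sketch assumes the target div contains no nested div, since it stops at the first </div> after the opening tag.

```php
<?php
// Hypothetical helper: regex for the opening tag only, then strpos/substr for the body.
function inner_html_of_div(string $page, string $id): ?string
{
    $open = '/<div[^>]*?id="' . preg_quote($id, '/') . '"[^>]*?>/';
    if (preg_match($open, $page, $m, PREG_OFFSET_CAPTURE) !== 1) {
        return null; // opening tag not found (or the regex engine failed)
    }
    // Byte offset of the first character after the opening tag.
    $start = $m[0][1] + strlen($m[0][0]);
    // First closing tag at or after that position.
    $end = strpos($page, '</div>', $start);
    if ($end === false) {
        return null;
    }
    return substr($page, $start, $end - $start);
}

// Synthetic page with 5000 placeholder records in the target div.
$body = str_repeat('<p>Record</p>', 5000);
$page = '<div id="textRussianUL"><p>1. Record A</p></div>'
      . '<div id="textRussianFL" style="display: none;">' . $body . '</div>';

$inner = inner_html_of_div($page, 'textRussianFL');
echo strlen($inner), PHP_EOL;
```

The regular expression only ever examines the opening tag, so its cost does not grow with the number of records; strpos and substr work on byte offsets, which matches what PREG_OFFSET_CAPTURE returns.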

baggi · Answer 2 · 2016-08-22T19:25:04

Thank you all for your help. I solved the problem with

preg_match('/<div id="textRussianFL"[^>]*>(.*?)<\/div>/us', $page, $match);
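One effect of the added s modifier can be shown on a small sample: without it, the dot does not match a line break, so the capture cannot cross lines. The sketch assumes the fetched page has line breaks inside the div, which the thread does not confirm, and the sample markup is a placeholder.

```php
<?php
// Placeholder markup with line breaks between the records.
$page = "<div id=\"textRussianFL\" style=\"display: none;\">\n"
      . "<p>1. Record 1</p>\n"
      . "<p>2. Record 2</p>\n"
      . "</div>";

// Without "s" the dot stops at the first line break, so no match is possible here.
$withoutS = preg_match('/<div id="textRussianFL"[^>]*>(.*?)<\/div>/u', $page);

// With "s" the dot also matches line breaks and the whole body is captured.
$withS = preg_match('/<div id="textRussianFL"[^>]*>(.*?)<\/div>/us', $page, $match);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
```

The u modifier additionally makes PCRE treat the pattern and subject as UTF-8, which is relevant for a page with Cyrillic text.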
