Good day.

Please help me with a regular expression. I fetched a page with curl into the $page variable. The variable contains approximately the following:

 <div id="middleContent">
   <div class="right-content">
     <div>
       <div id="labelRussianUL">
         <h3 style="cursor: pointer; text-decoration: underline;">...</h3>
       </div>
       <div id="textRussianUL" style="display: none;">
         <p>1. Запись 1</p>
         <p>2. Запись 2</p>
         <p>3. Запись 3</p>
         ... about 100 records ...
         <p>...</p>
       </div>
       <div id="labelRussianFL">
         <h3 style="cursor: pointer; text-decoration: underline;">...</h3>
       </div>
       <div id="textRussianFL" style="display: none;">
         <p>1. Запись 1</p>
         <p>2. Запись 2</p>
         <p>3. Запись 3</p>
         ... very many records, more than 5000 ...
         <p>...</p>
       </div>
     </div>
   </div>
 </div>

I cannot understand why

 preg_match('/<div[^>]*id="textRussianFL"[^>]*>(.*?)<\/div>/', $page, $match);

does not work, while

 preg_match('/<div[^>]*id="textRussianUL"[^>]*>(.*?)<\/div>/', $page, $match);

works.

I checked the regex on regex101.com. With a small number of records it works. Thank you in advance for your help.

2 answers

Most likely the overflow is caused by this group:

 (.*?) 

Immediately after this group comes the character < , and the div you need contains more than 5 thousand lines of the form

 <p>1. Запись 1</p> 

The engine keeps consuming one character at a time and then backtracking, because each < it meets and the characters after it do not match the closing </div> .

Here is the regex: https://regex101.com/r/wK1aW4/1

 /<div[^>]*?id="textRussianFL"[^>]*?>(.*?)<\/div>/s 

which causes an error in my test. The debugger shows this result:

Match 1 - finished in 211352 steps

And here is a more complex regex: https://regex101.com/r/wK1aW4/2

 /<div[^>]*?id="textRussianFL"[^>]*?>([^<]*(?:<(?!\/div>)[^<]*)*)<\/div>/s 

which no longer causes an error for me. The debugger shows this result:

Match 1 - finished in 42355 steps
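To check how the two patterns behave on your own server, you can compare the return value of preg_match with preg_last_error(). Below is a minimal sketch, not a definitive test: tryPattern and buildPage are hypothetical helpers, and the page is a synthetic stand-in for what curl returns. Whether either pattern actually fails depends on the PHP version and the PCRE limits configured there, so no particular output is promised.

```php
<?php
// Hypothetical helper: run a pattern and report whether PCRE matched,
// found nothing, or gave up with an internal error.
function tryPattern(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject, $match);
    if ($result === false) {
        // preg_match() returns false on failure; preg_last_error() gives the
        // reason, e.g. PREG_BACKTRACK_LIMIT_ERROR or PREG_JIT_STACKLIMIT_ERROR.
        return 'error code ' . preg_last_error();
    }
    return $result === 1
        ? 'matched, ' . strlen($match[1]) . ' bytes captured'
        : 'no match';
}

// Hypothetical stand-in for $page: one div holding many <p> records.
function buildPage(int $records): string
{
    $html = '<div id="textRussianFL" style="display: none;">';
    for ($i = 1; $i <= $records; $i++) {
        $html .= "<p>$i. Запись $i</p>\n";
    }
    return $html . '</div>';
}

// The two patterns from this answer.
$lazy     = '/<div[^>]*?id="textRussianFL"[^>]*?>(.*?)<\/div>/s';
$unrolled = '/<div[^>]*?id="textRussianFL"[^>]*?>([^<]*(?:<(?!\/div>)[^<]*)*)<\/div>/s';

$page = buildPage(5000);
echo 'lazy:     ', tryPattern($lazy, $page), PHP_EOL;
echo 'unrolled: ', tryPattern($unrolled, $page), PHP_EOL;
```

If a pattern reports an error code rather than "no match", the regex itself is fine and the engine ran into one of its limits on the large input.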

  • I tried your regex '/<div[^>]*?id="textRussianFL"[^>]*?>([^<]*(?:<(?!\/div>)[^<]*)*)<\/div>/s' on the data that curl returns, and it does not finish before the timeout. I copied the page source to regex101.com, deleted half of the data that curl returned, and then it somehow returned a result. So it turns out this option does not suit my case. But thanks for the help anyway - baggi
  • @baggi, do it more simply, even without using the DOM. Use the regex only to find the opening tag <div[^>]*?id="textRussianFL"[^>]*?> , and then use the strpos function to look for </div> starting from the character that follows the opening tag the regex found. Do the same for every occurrence of that div on the page. - Visman
  • I did not quite understand what you mean. With the regex I can find the opening <div id="textRussianFL" style="display: none;"> and the closing </div> , but I need to get the content. - baggi
  • @baggi, nobody has abolished the substr function and the arithmetic needed to compute the length of the substring between two other found substrings. - Visman
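The approach suggested in these comments can be sketched roughly as follows. This is only an illustration under stated assumptions: extractDivContent is a hypothetical helper name, it handles only the first occurrence of the div, and it assumes the target div contains no nested div elements (as in the sample page, where it holds only <p> records).

```php
<?php
// Hypothetical helper illustrating the strpos/substr suggestion.
// Returns the inner HTML of the first <div id="..."> or null if not found.
function extractDivContent(string $page, string $id): ?string
{
    // The regex matches only the opening tag, so it never has to scan
    // the thousands of records inside the div.
    $open = '/<div[^>]*?id="' . preg_quote($id, '/') . '"[^>]*?>/';
    if (preg_match($open, $page, $m, PREG_OFFSET_CAPTURE) !== 1) {
        return null;
    }

    // $m[0][0] is the matched tag, $m[0][1] its byte offset in $page.
    $start = $m[0][1] + strlen($m[0][0]);

    // Plain substring search for the closing tag, starting after the opening tag.
    $end = strpos($page, '</div>', $start);
    if ($end === false) {
        return null;
    }

    // Length of the content = distance between the two found positions.
    return substr($page, $start, $end - $start);
}

// Small sample in the shape of the page from the question.
$sample = '<div id="labelRussianFL"><h3>...</h3></div>'
        . '<div id="textRussianFL" style="display: none;">'
        . '<p>1. Запись 1</p><p>2. Запись 2</p></div>';

echo extractDivContent($sample, 'textRussianFL'), PHP_EOL;
```

Because strpos and substr work on byte offsets, just like the offsets from PREG_OFFSET_CAPTURE, the arithmetic stays consistent even with Cyrillic text in the records.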

Thank you all for your help. I solved the problem with

preg_match('/<div id="textRussianFL"[^>]*>(.*?)<\/div>/us', $page, $match);