There is a view template:

<section name="myname1"> ... <div name="myname2"> ... <p name="myname3"> ... </p> ... </div> ... </section> <div name="myname4"> ... <div name="myname5"> ... </div> ... </div> 

The task is to find every top-level DOM element that has a “name” attribute, together with all the elements nested inside it. The attribute value may contain Cyrillic text, and so may the nested content.

I cannot use libraries (the customer requires that there be no dependencies).

My first attempt was this:

/<\s*([a-z0-9]+)\b[^>]*\bname\s*=\s*\"([^\"]*)[^>]*>(?>(?:[^<]|<(?!\s*/?\1\s*\b))|(<\s*\1[^>]*>(?>(?:[^<]|<(?!\s/?\s*\1\s*\b))|(?3))+?<\s*/\s*\1\s*>))*</\1>/is

This works as long as the enclosing tag

 <section name="myname1"> ... </section> 

does not grow beyond about 700 lines. Past that point the regular expression simply finds nothing.
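The post does not say why the match silently fails on long input. One possibility, and it is only an assumption here, is that PCRE hits one of its limits (`pcre.backtrack_limit`, `pcre.recursion_limit`, or the JIT stack), in which case `preg_match` returns `false` rather than `0`. A minimal sketch for telling these cases apart; `pregStatus` is a hypothetical helper name, not something from the original code:

```php
<?php
// Hypothetical helper: distinguishes "no match" from "PCRE gave up".
function pregStatus(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject);
    if ($result === false) {
        // preg_match() returns false on an engine error, not on a failed match.
        switch (preg_last_error()) {
            case PREG_BACKTRACK_LIMIT_ERROR:
                return 'backtrack limit exceeded';
            case PREG_RECURSION_LIMIT_ERROR:
                return 'recursion limit exceeded';
            case PREG_JIT_STACKLIMIT_ERROR:
                return 'JIT stack limit exceeded';
            case PREG_BAD_UTF8_ERROR:
                return 'subject is not valid UTF-8';
            default:
                return 'other PCRE error';
        }
    }
    return $result === 1 ? 'match' : 'no match';
}

echo pregStatus('/b+/', 'abbb'), "\n";
```

If the status turns out to be a limit error, raising the limit with `ini_set('pcre.backtrack_limit', ...)` or simplifying the pattern would be the next thing to try.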

One oddity: if I replace the “section” tag with “div”, everything works.

Other things I tried:

I tried an implementation based on PHPDocument, but ran into encoding problems (the catch is that I do not know which encoding the finished script will be used with).

I tried first finding:

 <section name="myname1"> ... </section> 

and then using “preg_match_all” with the “PREG_OFFSET_CAPTURE” flag to find all opening and closing tags of the same name and their positions in the string, and from those working out which closing tag belongs to the element I need. But here too I tripped over the notorious Cyrillic text.
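The counting idea above can be sketched as follows. The sketch assumes the Cyrillic trouble came from mixing the byte offsets that `PREG_OFFSET_CAPTURE` returns with character-based functions such as `mb_substr`; the post does not confirm this. `extractElement` is a hypothetical helper name, and void or self-closing tags are not handled.

```php
<?php
// Hypothetical helper: returns the outer HTML of the element whose opening
// tag starts at byte offset $start, or null if the tags never balance.
function extractElement(string $html, string $tag, int $start): ?string
{
    // Matches both <tag ...> and </tag>; group 1 is "/" for a closing tag.
    $pattern = '~<\s*(/?)\s*' . preg_quote($tag, '~') . '\b[^>]*>~i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE, $start);

    $depth = 0;
    foreach ($matches as $m) {
        [$text, $offset] = $m[0];          // offsets are in bytes, from the start of $html
        $depth += ($m[1][0] === '/') ? -1 : 1;
        if ($depth === 0) {
            // substr() and strlen() also count bytes, so multibyte text stays intact.
            return substr($html, $start, $offset + strlen($text) - $start);
        }
    }
    return null;
}

$html = '<p>до</p><section name="имя1"><section>внутри</section> текст</section><section>после</section>';
preg_match('~<section\b[^>]*\bname\s*=~i', $html, $m, PREG_OFFSET_CAPTURE);
echo extractElement($html, 'section', $m[0][1]), "\n";
```

The whole calculation stays in bytes, so it does not need to know the encoding as long as the encoding is ASCII-compatible for `<`, `>` and `/`.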

I tried XPath, but could not get it to digest markup that is not fully valid. It complains especially loudly about inline SVG, and in the end throws a fatal error: Uncaught exception 'Exception' with message 'String could not be parsed as XML'
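The quoted exception text is what PHP's `SimpleXMLElement` constructor throws when given input that is not well-formed XML. For comparison, here is a hedged sketch using the more lenient HTML parser behind `DOMDocument::loadHTML`, with libxml warnings collected instead of emitted. It assumes the bundled DOM extension is acceptable under the no-dependencies rule and that the input is UTF-8, neither of which the post confirms. `topLevelNamed` is a hypothetical name, and the encoding prefix is a widely used workaround whose behaviour may vary between libxml versions.

```php
<?php
// Hypothetical helper: outer HTML of every element that has a "name"
// attribute and no ancestor with a "name" attribute, keyed by that name.
function topLevelNamed(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $doc = new DOMDocument();
    // Assumption: input is UTF-8; the prefix hints the encoding to the parser.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $found = [];
    foreach ($xpath->query('//*[@name][not(ancestor::*[@name])]') as $node) {
        $found[$node->getAttribute('name')] = $doc->saveHTML($node);
    }
    return $found;
}

$html = '<section name="myname1"><div name="myname2"><p name="myname3">text</p></div></section>'
      . '<div name="myname4"><div name="myname5"></div></div>';
print_r(array_keys(topLevelNamed($html)));
```

Whether this copes with the real templates, inline SVG included, would have to be checked against them.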

Libraries such as “Simple HTML DOM Parser” build their logic on PHPDocument and XPath, so they have the same trouble processing invalid HTML.

Does anyone have an idea how to solve this? Even a pointer to a suitable algorithm would help.


Here is a link to a similar example, but problems remain. The regular expression from that example does not match the following markup:

 <table> ... <table></table> ... </table> 

But it does match this:

 <table> ... <table>There must be text or a space here</table> ... </table> 

Such a bug is easy to fix, but that example also fails to find a deeply nested construction when the nested tags differ from the tag of the desired DOM element, which is the subject of my post.

Can anyone correct and optimize that example?

  • Parsing with regular expressions is always possible (I saw an article somewhere with a formal proof of this) and always very painful. At most the x modifier helps a little. - user207618

2 answers

There is a very good library, Simple HTML DOM Parser: https://habrahabr.ru/post/176635/ I suggest looking at how it implements tag parsing and content extraction, and applying the same approach.

Regular expressions are not the best tool for parsing html.

  • And that library, as it happens, uses regular expressions itself :) But in general, please add some example to make the answer complete. - Naumov
  • I agree that regular expressions are not the best tool for parsing HTML. But I have to look for a solution in this direction, because the standard PHP modules (PHPDocument, XPath) need a valid DOM structure as input to work correctly. In my case that is an unaffordable luxury (the Simple HTML DOM Parser library has the same weakness, since it is based on those modules). Still, there is a ray of hope, especially after reading the article habrahabr.ru/post/171667 . - Ivan Miroshin

Problem solved:

 /<table\b[^\>]*\bname=(\"|')?table01\1[^\>]*>(?>([^\<]+|<(?!\/?table\b))|(<table[^\>]*>(?:(?2)|(?3)|)+?<\/table>))*<\/table>/ix 

https://regex101.com/r/EzXAfL/1

The tag name “table” can be replaced with ([a-z0-9]+), in which case all tags are searched. The important part is then to substitute a reference to this group at the corresponding positions in the regular expression.

The attribute name and value ("name", "table01") can be substituted dynamically (in my case they are set by PHP variables).
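A minimal sketch of that substitution, assuming the pattern above with the back-reference to the quote made optional (`\1?`). `buildPattern` is a hypothetical helper name. The `x` flag is dropped here on purpose, since it would make the pattern ignore literal spaces inside a substituted value.

```php
<?php
// Hypothetical helper: builds the recursive pattern for one tag and one
// attribute/value pair, escaping the substituted parts with preg_quote().
function buildPattern(string $tag, string $attr, string $value): string
{
    $t = preg_quote($tag, '/');
    $a = preg_quote($attr, '/');
    $v = preg_quote($value, '/');

    return '/<' . $t . '\b[^>]*\b' . $a . '=("|\')?' . $v . '\1?[^>]*>'
         // group 2: plain text, or a "<" that does not start the same tag
         . '(?>([^<]+|<(?!\/?' . $t . '\b))'
         // group 3: a nested element of the same tag, matched recursively
         . '|(<' . $t . '[^>]*>(?:(?2)|(?3)|)+?<\/' . $t . '>))*'
         . '<\/' . $t . '>/i';
}

$html = '<div><table name="table01"><tr><td><table></table></td></tr></table>'
      . '<table name="other"></table></div>';
if (preg_match(buildPattern('table', 'name', 'table01'), $html, $m)) {
    echo $m[0], "\n";
}
```

Escaping with `preg_quote` matters once the value comes from a variable, because characters such as `.` or `/` would otherwise change the meaning of the pattern.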

All the problems I described with parsing deeply nested markup are resolved.

I hope someone finds this useful :)

  • And what if there are no quotes? - Wiktor Stribiżew
  • You forgot to make the closing quote optional: <table\b[^\>]*\bname=(\"|')?table01\1?[^\>]*>(?>([^\<]+|<(?!\/?table\b))|(<table[^\>]*>(?:(?2)|(?3)|)+?<\/table>))*<\/table> - Ivan Miroshin
  • The regular expression can be adjusted depending on how much invalid markup you expect to receive. It naturally cannot account for the most egregious cases. But variants such as < div ... > ... < / div> can be handled by adding the optional whitespace token "\s*". - Ivan Miroshin
  • Yeah, it turned out that it was not you but I who forgot :) - Ivan Miroshin