There is a view template:
<section name="myname1"> ... <div name="myname2"> ... <p name="myname3"> ... </p> ... </div> ... </section> <div name="myname4"> ... <div name="myname5"> ... </div> ... </div>
The task is to find every top-level DOM element that has a “name” attribute, together with all the elements nested inside it (here, the elements named myname1 and myname4). The attribute value may contain Cyrillic text, and so may the nested content.
I cannot use libraries (the customer requires that there be no external dependencies).
My first attempt was this:
/<\s*([a-z0-9]+)\b[^>]*\bname\s*=\s*\"([^\"]*)[^>]*>(?>(?:[^<]|<(?!\s*\/?\1\s*\b))|(<\s*\1[^>]*>(?>(?:[^<]|<(?!\s*\/?\s*\1\s*\b))|(?3))+?<\s*\/\s*\1\s*>))*<\/\1>/is
This works as long as the enclosing element
<section name="myname1"> ... </section>
is no longer than about 700 lines. Beyond that, the regular expression simply finds nothing.
One curious detail: if the “section” tag is replaced with a “div” tag, everything works.
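One possible explanation, which is an assumption and not something confirmed above, is that the match is aborted by a PCRE limit (backtracking, recursion, or the JIT stack). In that case preg_match returns false, which is easy to mistake for "no match". A minimal sketch that makes the failure visible, where matchOrFail is a hypothetical helper name:

```php
<?php
/**
 * Hypothetical wrapper: runs preg_match and throws when PCRE aborts,
 * instead of letting the failure look like "nothing found".
 * Returns the matches array (empty when the pattern does not match).
 */
function matchOrFail(string $pattern, string $subject): array
{
    $result = preg_match($pattern, $subject, $matches);

    if ($result === false) {
        // Readable labels for the error codes reported by preg_last_error().
        $labels = [
            PREG_INTERNAL_ERROR        => 'internal PCRE error',
            PREG_BACKTRACK_LIMIT_ERROR => 'pcre.backtrack_limit exceeded',
            PREG_RECURSION_LIMIT_ERROR => 'pcre.recursion_limit exceeded',
            PREG_BAD_UTF8_ERROR        => 'malformed UTF-8 in subject',
            PREG_BAD_UTF8_OFFSET_ERROR => 'offset is not on a UTF-8 boundary',
            PREG_JIT_STACKLIMIT_ERROR  => 'JIT stack limit exceeded',
        ];
        $code = preg_last_error();
        throw new RuntimeException($labels[$code] ?? "PCRE error $code");
    }

    return $matches;
}
```

If the reported error turns out to be a limit, raising pcre.backtrack_limit or switching off pcre.jit with ini_set may be worth trying, although a pattern that consumes one character per iteration is likely to hit the next limit as the template grows.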
Other things I tried:
I tried an implementation based on PHPDocument, but ran into encoding problems (the catch is that I do not know in advance which encoding the finished script will have to handle).
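When the encoding is unknown, one dependency-free check that may help is testing whether the input is well-formed UTF-8 using only the PCRE functions. This is a small sketch, and isValidUtf8 is a hypothetical helper name:

```php
<?php
/**
 * Hypothetical helper: true when $text is well-formed UTF-8.
 */
function isValidUtf8(string $text): bool
{
    // With the /u modifier PCRE validates the subject before matching;
    // malformed input makes preg_match return false instead of 1.
    return preg_match('//u', $text) === 1;
}
```

Input that fails the check is probably in a single-byte encoding such as Windows-1251. Byte-oriented functions (strlen, substr, patterns without /u) still handle such text safely, because the markup characters themselves are plain ASCII.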
I tried first finding
<section name="myname1"> ... </section>
and then using “preg_match_all” with the “PREG_OFFSET_CAPTURE” flag to locate every opening and closing tag of the same name together with its position in the string, and from that working out which closing tag ends the desired element. But here too I stumbled over the notorious Cyrillic alphabet.
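The tag-counting idea can be sketched as a single pass over all tags with a depth counter. PREG_OFFSET_CAPTURE reports byte offsets, so they have to be combined with byte functions (substr, strlen) and not with mb_* functions. Mixing the two is a common way for Cyrillic text to shift the positions, though whether that was the problem here is an assumption. The name findTopLevelNamed and the returned array layout are hypothetical. The sketch also assumes that attribute values contain no literal ">" and that no tags are hidden inside comments or scripts:

```php
<?php
/**
 * Hypothetical helper: returns each outermost element that has a "name"
 * attribute as ['name' => ..., 'html' => ...], in document order.
 */
function findTopLevelNamed(string $html): array
{
    // Elements that never have a closing tag.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];

    // No /u modifier: the pattern relies on ASCII only and offsets are bytes,
    // so UTF-8 and Windows-1251 text both pass through untouched.
    preg_match_all('~<(/?)([a-z][a-z0-9]*)\b([^>]*)>~i', $html, $tags,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $found = [];
    $open  = null; // tag name of the element being captured, if any
    $start = 0;    // byte offset where that element begins
    $depth = 0;    // nesting level of same-name tags
    $label = '';   // value of its "name" attribute

    foreach ($tags as $t) {
        [$whole, $offset] = $t[0];
        $isClose = $t[1][0] === '/';
        $tag     = strtolower($t[2][0]);
        $attrs   = $t[3][0];
        $single  = in_array($tag, $void, true) || substr($attrs, -1) === '/';

        if ($open === null) {
            // Outside any captured element: look for an opening tag with "name".
            $named = preg_match('~(?<![\w-])name\s*=\s*(["\'])(.*?)\1~is', $attrs, $m);
            if ($isClose || $named !== 1) {
                continue;
            }
            if ($single) {
                $found[] = ['name' => $m[2], 'html' => $whole];
                continue;
            }
            [$open, $start, $depth, $label] = [$tag, $offset, 1, $m[2]];
            continue;
        }

        // Inside a captured element only tags of the same name change the depth.
        if ($tag !== $open || $single) {
            continue;
        }
        $depth += $isClose ? -1 : 1;
        if ($depth === 0) {
            $length  = $offset + strlen($whole) - $start;
            $found[] = ['name' => $label, 'html' => substr($html, $start, $length)];
            $open    = null;
        }
    }

    return $found;
}
```

Because the input is scanned once and no recursive pattern is involved, the size of the enclosing element should not matter. An element whose closing tag is missing is silently skipped in this sketch.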
I tried XPath, but could not get it to digest markup that is not fully valid. It complains especially loudly about inline SVG, and in the end throws a fatal error: Uncaught exception 'Exception' with message 'String could not be parsed as XML'
The “Simple HTML DOM Parser” library and others like it build their logic on PHPDocument and XPath, so they have the same trouble with invalid HTML.
Does anyone have an idea how to solve this? Even an outline of an algorithm would help.
Here is another link to a similar example, but problems remain. The regular expression from that example does not work for the following pattern:
<table> ... <table></table> ... </table>
but it does work for this one:
<table> ... <table>text or a space is required</table> ... </table>
Such a bug is easy to fix, but that example also fails to find a deeply nested construction when the nested tags differ from the tag of the desired DOM element, which is the subject of my post.
Can anyone correct and optimize the given example?