While developing my own "lightweight" template engine in the style of the MODx CMF, I ran into the problem of nested constructs.

The input looks like this (the construct is simplified for readability):

 /* html code */ [[SNIPPET_1 :if &is=`var` &then=` /* html code */ [[$CHUNK_1 :if &is=`var` &then=`[[SNIPPET_2 :filter &name=`var` ]]` ]]` &else=`[[$CHUNK_2:upper]]` :filter &name=`var` ]] /* html code */ [[~LINK_1:abs]] 

:if, :filter, :upper, ... are filters (modifiers), and &is, &name, ... are filter variables.

Filter variables (&is=`var`), as you might guess, may contain anything: from a simple string to the HTML of a template seasoned with template variables (snippets, chunks, etc.).

The problem is how to find where [[SNIPPET_1]] closes when it contains other template variables. Note also that [[SNIPPET_1]] has two filters applied to it, :if and :filter, which has to be taken into account as well.
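
To make the difficulty concrete, here is a minimal sketch (the template string is a made-up, shortened version of the example above). A lazy, non-recursive pattern closes the outer tag at the first `]]` it meets, which belongs to the nested chunk:

```php
<?php
// Shortened, hypothetical template: one snippet with a chunk nested in &then.
$template = 'x [[SNIPPET_1 :if &then=`[[$CHUNK_1:upper]]` ]] y';

// Lazy match: stops at the first "]]", i.e. inside the nested chunk.
preg_match('/\[\[.*?\]\]/s', $template, $m);

echo $m[0], "\n"; // [[SNIPPET_1 :if &then=`[[$CHUNK_1:upper]]
```

A greedy `.*` has the opposite problem: with several tags in one template it runs to the last `]]` in the text.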

Ideally, this construct would be parsed as it is, that is, with line breaks allowed inside it for readability.

Here is the regexp pattern currently used in the project:

 preg_replace_callback( '/\[{2}([\$\*\@\%\~]?|\+{1,2})([\w-\.]+)\s*((?:\:[\w]+\s*(?:\s*\&[\w]*\=`(?:.[^\n]*)`)*\s*)*)\s*\]{2}/iu', function ($call) { }, $subject ) 
  1. It captures separately the name of the template variable (SNIPPET_1, CHUNK_1, SNIPPET_2, ...), its type ("" for a snippet, "$" for a chunk, "~" for a link, ...) and the filters with their contents (:if&is=`var`&then=`[[$CHUNK_1]]` :filter&name=`var`).

    Here [^\n] is a stopgap: the content of a filter variable has to be written on a single line, with no line breaks, so that the end of the filter variable can be detected. For example:

     &then=`[[$CHUNK_1:if&is=`var`&then=`[[SNIPPET_2:filter&name=`var`]]`]]` 

    You must agree that this is not very readable.

  2. Next, the filter string is parsed into an array: the name of each filter (if, filter, ...) and its variables are extracted. The regexp pattern:

     preg_match_all('/\:([\w]*)((?:\s*\&[\w]*\=`(?:.[^\n]*|)`\s*)*)/iu', $call[3], $found); 
  3. Finally, a loop runs over the filters and calls, for each one, the function that corresponds to the filter's name. For example, here is the pattern used by the :if filter function:

     preg_match_all('/\&([\w]*)\=\`((?:.[^\&]*)?(?(?=:).*?\`\]{2}(?:.[^&]*)?|(?:.[^\&\:])?))\`/iu', $subject, $found); 

Shortcomings of the current template engine:

  1. Again, the content of a filter variable has to be written on a single line, with no line breaks;

  2. It works without errors only up to two levels of nesting. The workaround is to create an additional (new) chunk and move the required construct into it.


To sum up: dear regular expression gurus, please share your experience on how to find the end of a construct when similar constructs are nested inside it.

UPDATE:

@ReinRaus, thank you for the answer. Although it was @VladD who pointed me in the right direction ( http://php.net/manual/ru/regexp.reference.recursive.php ), you described the possible pitfalls of this approach.

  1. You are right, there is a problem, because a ` can occur inside attribute values.

    However, if the backticks in such a template are replaced with something that looks more like a paired delimiter, for example &is={{…}}, then everything works. Here is an example:

     '/\[{2}([\$\*\@\%\~]?|\+{1,2})([\w-\.]+)((?:\s*\:[\w]+\s*(?:\s*\&[\w]*\=\s*\{{2}\s*(?:[^\{\}]++|(?R))*\}{2})*)*)(?:[^\[\]]++|(?R))*\]{2}/iu' 

    It captures the name of the template variable ( [[name]] ), its type ( [[$...]] for a chunk, ...), and the list of filters with their contents ( :if… :filter… ), and so on for each template variable.

    I could not come up with a regexp pattern that replaces the backticks `…` with {{…}} while taking \s into account, so the templates will have to be edited by hand. Of course, the ` character would be much preferable. If you have a solution, I would be glad to read it.

  2. The second problem is the second pattern (inside the callback function), which parses the filters themselves (each template variable, snippet or chunk, can have several of them).

     :if &is={{var}} &then={{ /* html code */ [[$CHUNK_1 :if &is={{var}} &then={{[[SNIPPET_2 :filter &name={{var}} ]]}} ]]}} &else={{[[$CHUNK_2:upper]]}} :filter &name={{var}} 

    The problem is isolating a single filter regardless of any similar constructs nested inside it.

    With the pattern above, the filters end up in $call[3]. One trick is to replace every {{…}} construct, together with its contents, with something else:

     '/\{{2}(?:[^\{\}]++|(?R))*\}{2}/iu' 

    After that, the filters can be parsed safely with the negated class [^\:], because the filter string takes a simpler form:

     :if &is={{var_1}} &then={{var_2}} :filter &name={{var_3}} 

    Is it possible to do this without the replacement?
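
The replacement trick from item 2 could be sketched roughly as follows. This is only an illustration under assumptions: the sample filter string is shortened, the `{{N}}` placeholder format and the names `$stash`, `$masked` and `$parsed` are invented for the example, and real values are assumed never to look like `{{0}}` themselves.

```php
<?php
// Filter string as it might arrive in $call[3] (shortened sample).
$filters = ':if &is={{var}} &then={{ <b>[[$CHUNK_1 :if &is={{var}} &then={{[[SNIPPET_2]]}} ]]</b> }}'
         . ' &else={{[[$CHUNK_2:upper]]}} :filter &name={{var}}';

// Step 1: mask every top-level {{...}} (with anything nested in it) by a numbered placeholder.
$stash = [];
$masked = preg_replace_callback(
    '/\{\{(?:[^{}]++|(?R))*\}\}/',
    function (array $m) use (&$stash): string {
        $stash[] = substr($m[0], 2, -2);            // value without the outer braces
        return '{{' . (count($stash) - 1) . '}}';
    },
    $filters
);

// Step 2: the masked string is flat, so non-recursive patterns are enough to split it.
$parsed = [];
preg_match_all('/:(\w+)((?:\s*&\w+=\{\{\d+\}\})*)/', $masked, $found, PREG_SET_ORDER);
foreach ($found as $filter) {
    $args = [];
    preg_match_all('/&(\w+)=\{\{(\d+)\}\}/', $filter[2], $attrs, PREG_SET_ORDER);
    foreach ($attrs as $attr) {
        $args[$attr[1]] = $stash[(int) $attr[2]];   // restore the original value
    }
    $parsed[] = ['name' => $filter[1], 'args' => $args];
}

print_r($parsed);
```

Each restored value (for example the content of &then) still contains its own nested tags, so it can be passed through the same template pass again.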

  • @ReinRaus (ran out of comments): is there an online tool for checking this? regexpal.com apparently does not understand ++ . - VladD
  • regextester.com, option preg. If the result is matched [digit], then the string matches your recursive pattern. If not, the result is just the source text. Screenshot of the result: s55.radikal.ru/i149/1211/ec/68e06bb7e2d4.png - ReinRaus
  • @ReinRaus: hmm, you have puzzled me :) I will ask the community on SO for help. An interesting topic. - VladD
  • @VladD, it happens :) By the way, that expression was written in a hurry, on reflex :) Here is a cleaner version, better show this one on SO: ^(a(?:(?1))*b)$ - ReinRaus
  • @ReinRaus: I have figured it out; modern regular expressions have become more powerful. Still, writing a correct parser with regular expressions is a very hard task. Think about what happens if the nested HTML contains :if inside a comment. Your code will have to take this into account. - VladD

3 answers

UPDATE
I deleted the old post because repeated updates had made it inconsistent.
This version removes all the previously imposed restrictions and eliminates the need for replacements.
The expression $RE0 first matches all the snippets.
Then you determine whether a snippet has filters and, if so, parse them with the expression $RE1 .
Attribute values are obtained with $RE2 .

    $RE0 = <<<REGEX_SNIPPET
    (?P<RegExpSnippet>\\[\\[ # opening brackets and a named group for recursion
    (?:                      # group for the alternation
    (?:                      # what counts as the inside of a snippet
    \\\\. |                  # any escaped character
    [^\\[\\]] |              # not a bracket, or
    \\[(?!\\[) |             # a bracket that is not followed by another bracket
    \\](?!\\])
    )++
    |                        # or this whole expression again
    (?P>RegExpSnippet)
    )*+                      # end of the alternation
    \\]\\])                  # closing brackets of the snippet, end of the named group
    # ATTENTION, for anyone who decides to study this expression and perhaps write their own in the same style:
    # in free-spacing mode, always leave one extra line break at the end of the expression;
    # do not repeat my mistake and the half hour I wasted hunting it down

    REGEX_SNIPPET;

    $RE2 = <<<REGEX_ATTR
    (?:                      # the inside of an attribute; this piece can be used to capture the attribute value
    \\\\. |                  # an escaped character
    [^`\\[] |                # not a backtick and not a bracket (so that recursion is not triggered constantly)
    $RE0 |                   # or a nested snippet
    \\[                      # a bracket
    )++
    REGEX_ATTR;

    $RE1 = <<<REGEX_FILTER
    \\s*                     # whitespace
    :\\w+                    # a colon and a Latin word
    \\s*
    (?:                      # repeated for several attributes
    &\\w+\\s*=\\s*`          # ampersand, word, equals sign, backtick
    $RE2
    `\\s*                    # backtick that ends the attribute
    )+
    REGEX_FILTER;

    preg_match_all("/$RE1/xs", $text, $arr);
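
As a quick check of how these expressions behave, here is a sketch that uses condensed copies of the three patterns (comments and free-spacing removed, same alternatives in the same order) on a made-up filter string. The sample input and the variable names `$filters` and `$matches` are assumptions for illustration only.

```php
<?php
// Condensed copies of $RE0, $RE2 and $RE1 from the answer above.
$RE0 = '(?P<RegExpSnippet>\[\[(?:(?:\\\\.|[^\[\]]|\[(?!\[)|\](?!\]))++|(?P>RegExpSnippet))*+\]\])';
$RE2 = '(?:\\\\.|[^`\[]|' . $RE0 . '|\[)++';
$RE1 = '\s*:\w+\s*(?:&\w+\s*=\s*`' . $RE2 . '`\s*)+';

// Hypothetical filter string: the &then value contains a nested snippet with its own backticks.
$filters = ':if &is=`var` &then=`[[$CHUNK_1 :upper &n=`x` ]]` :filter &name=`v`';

preg_match_all("/$RE1/s", $filters, $m);
$matches = array_map('trim', $m[0]);

print_r($matches);
// Two filters are found; the inner :upper filter stays inside the &then value of :if.
```

Note that $RE1 requires at least one attribute, so a bare filter such as :upper with no &name=`…` part is not matched by it on its own.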

I want to thank the OP: while working on his question I brushed up my knowledge of regular expressions, and now I understand much better how the regex engine behaves in certain situations :)

  • @ReinRaus, unfortunately there is a limit on the number of characters in comments, so I will post an UPDATE to my question. Things are odd with the local notification system (by e-mail or otherwise). I hope you will respond :) - romeo
  • Updated the answer - ReinRaus
  • @ReinRaus: I will definitely check your new solution to the problem :) For now I have settled on the replacement, because there were minor conflicts, namely: :if &is={{{0}}} &then={{{}}} &else={{{2}}} :filter &is={{{0}}} Everything is easy to parse and then reassemble. By the way, I found a way to replace `...` with {{...}}: $patterns = array('/=\\s*`/', '/`/'); $replacements = array('={{', '}}'); preg_replace($patterns, $replacements, $str); - romeo
  • @VladD, I can't comment there anymore. Better run an experiment and find a text on which these expressions give the wrong result. As for your question: it makes no difference to these expressions. They do not modify the text, which means that a character escaped in the source text remains escaped in the result. - ReinRaus
  • @ReinRaus: the comments ran out there. Regarding the translation of the question: pastebin.com/THdsrPBZ Now, regarding "it makes no difference": whoever the escaping was made for should also remove it. Just as, for the string "\"", the string parser removes the `\`. Otherwise, when the substitution happens, the substituted pattern will be incorrect. See? - VladD

Summary: recursive grammars cannot be parsed reliably with regular expressions.

Regular expressions, unfortunately, cannot parse recursive code (that is, code with arbitrarily deeply nested constructs). Their expressive power is not enough to express a recursive dependency. For your grammar you will have to either write a recursive descent parser by hand or (much better!) learn lex / yacc and write a real "grown-up" parser.
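
For a sense of scale, a hand-written recursive descent parser for just the nesting of [[...]] can be quite small. The sketch below is an illustration, not a full parser for the OP's language: the function name `parseNodes` and the node shape (`['tag' => children]`) are invented here, and it deliberately ignores backticks, escaping, tag types and filters.

```php
<?php
/**
 * Parses $src from $pos and returns a list of nodes.
 * A node is either a string (plain text) or ['tag' => list of child nodes].
 */
function parseNodes(string $src, int &$pos, bool $insideTag = false): array
{
    $nodes = [];
    $text = '';
    $len = strlen($src);
    while ($pos < $len) {
        $two = substr($src, $pos, 2);
        if ($two === '[[') {                       // a nested tag opens: recurse
            if ($text !== '') { $nodes[] = $text; $text = ''; }
            $pos += 2;
            $nodes[] = ['tag' => parseNodes($src, $pos, true)];
        } elseif ($two === ']]' && $insideTag) {   // the current tag closes: return to the caller
            if ($text !== '') { $nodes[] = $text; }
            $pos += 2;
            return $nodes;
        } else {                                   // ordinary character
            $text .= $src[$pos++];
        }
    }
    if ($insideTag) {
        throw new RuntimeException('Unclosed [[ tag');
    }
    if ($text !== '') { $nodes[] = $text; }
    return $nodes;
}

$pos = 0;
$tree = parseNodes('a [[X :if &then=`[[$Y]]` ]] b', $pos);
print_r($tree);
```

Because every `[[` triggers a recursive call and every `]]` returns from one, the nesting depth is tracked by the call stack rather than by the pattern, and line breaks inside a tag need no special handling.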

(Boring explanation)
The point is that the set of languages that can be parsed with regular expressions is exactly the set of regular languages. Your language is described by at least a context-free grammar, which is not regular. Accordingly, it cannot be processed with regular expressions.


An aside on a related topic: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags (read the top answer, it is simply a work of art!)


Update

Modern languages include an extended version of regular expressions that can deal with recursive structures. However, such parsing with regular expressions is known for its disproportionate complexity.

In addition, I do not quite understand the formal syntax of the language: can the HTML inside &then contain, for example, <!-- `]] --> or simply [[ ? (I hope not.)
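
As a small illustration of such an extension: PCRE, the engine behind PHP's preg_* functions, lets a group call itself. The sketch below checks strings of the form a...ab...b with equal numbers of a and b, which no classical regular expression can do; the test words are arbitrary.

```php
<?php
// Group 1 calls itself through (?1), so every "a" must be closed by its own "b".
$pattern = '/^(a(?1)?b)$/';

foreach (['ab', 'aaabbb', 'aabbb', 'aab'] as $word) {
    printf("%-7s %s\n", $word, preg_match($pattern, $word) ? 'matches' : 'does not match');
}
```

Only the balanced words ab and aaabbb match.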

  • Correction: recursive grammars can be parsed reliably with regular expressions :) - ReinRaus
  • @ReinRaus: Are you saying that the OP's language is regular? - VladD
  • It is not entirely regular, but it becomes so if you impose a minor restriction: inside attribute values, the ` character may appear only inside a nested snippet. - ReinRaus
  • @ReinRaus: here is a simple recursive grammar: W -> empty | a W b It clearly generates all strings of the form aaa...abbb...b with the same number of a and b . Can you write a regular expression that determines whether a word belongs to this grammar? - VladD
  • Check: <? $text = "aaaabbbb"; echo preg_match("/^(a((?1)|)++b)$/", $text); ?> - ReinRaus

What you need is a parser:

  1. https://stackoverflow.com/questions/2093228/lex-and-yacc-in-php#2093228
  2. https://github.com/jakubkulhan/pacc