How can I, without a parser and using only regular expressions, extract from a set of identical tags only the tags with specific content?

XML:

<tag> ... </tag> <tag> ... content ... </tag> <tag> ... content ... </tag> 

Result:

 <tag> ... content ... </tag> <tag> ... content ... </tag> 

The naive solution doesn't work:

 .*?<tag>.*?content.*?<\/tag> 

An attempt with a negative lookahead didn't work either:

 .*?<tag>.*?(?!<\/tag>).*?content.*?<\/tag> 
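To see why both attempts fail, it may help to run them on a small input. The sketch below uses Python's `re` module; the sample string and the variable names are assumptions made for illustration. In both patterns the lazy `.*?` is free to run past a `</tag>`, and the lookahead `(?!<\/tag>)` is zero-width, so it only checks the single position where it stands.

```python
import re

# Hypothetical sample input: the first tag lacks the word "content".
xml = "<tag> a </tag> <tag> b content c </tag> <tag> d content e </tag>"

naive = r".*?<tag>.*?content.*?<\/tag>"
with_lookahead = r".*?<tag>.*?(?!<\/tag>).*?content.*?<\/tag>"

for pattern in (naive, with_lookahead):
    first = re.search(pattern, xml).group(0)
    # The match starts at the unwanted tag and spills into the next one.
    print(repr(first))
```

On this input both patterns return `<tag> a </tag> <tag> b content c </tag>` as their first match, that is, the unwanted tag glued to the first wanted one.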

What I want to know: is it possible to do this with a regex? If not, why not?

Debugger example: https://regex101.com/r/ULZVO5/6

A similar task, with single parentheses in place of tags:

 (...)(..)(...ABC...)(..)(.,.ABC,.) 

Solution:

 \([^)]*ABC[^)]*\) 
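A quick check of this pattern, sketched in Python (the variable names are illustrative). It works because the closing delimiter is a single character, so the negated class `[^)]` can never step outside the current group.

```python
import re

text = "(...)(..)(...ABC...)(..)(.,.ABC,.)"

# [^)]* cannot cross a closing parenthesis, so each match
# stays inside one parenthesized group.
groups = re.findall(r"\([^)]*ABC[^)]*\)", text)
print(groups)
```

This yields the two groups that contain `ABC`: `(...ABC...)` and `(.,.ABC,.)`.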

Link to the debugger: https://regex101.com/r/MyWevz/1/

  • Regular expressions are designed to parse regular grammars. The XML grammar is irregular, which is why it is hard to pick out only the tags you need. However, some modern regex engines have long supported so-called balancing groups, which make it possible to parse irregular text. I think they would solve your problem, but you need to check whether the engine you use supports them. Most importantly, learning all of this takes more time than writing the code with a proper XML parser. - Alexander Petrov
  • @alexander-petrov Perhaps you meant "context-free grammar" rather than "irregular" (per the Chomsky hierarchy)? ... And yes, you are right: in my example all the groups are neatly balanced (every parenthesis is properly closed) ... So are recursion and subroutines in regular expressions taboo, or just bad form? To be honest, I don't understand very well how lookahead works, but I suspect a problem like this can be solved with such tools. That is why I asked the question here. - Serafim

1 answer

The solution in this case is:

 <tag>(?:[^<]|<(?!\/tag>))*content.*?<\/tag> 
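The group `(?:[^<]|<(?!\/tag>))*` consumes one character at a time and accepts a `<` only when it does not begin `</tag>`, so the match cannot leave the current tag before it reaches `content`. A minimal sketch in Python follows; the sample string and variable names are assumptions, and the pattern is taken verbatim from above.

```python
import re

# Hypothetical sample input: the first tag lacks the word "content".
xml = "<tag> a </tag> <tag> b content c </tag> <tag> d content e </tag>"

pattern = re.compile(r"<tag>(?:[^<]|<(?!\/tag>))*content.*?<\/tag>")

# The pattern has no capturing groups, so findall returns whole matches.
matches = pattern.findall(xml)
for m in matches:
    print(m)
```

On this input only the second and third tags are returned. Note that `.` does not match a newline by default in Python, so if text after `content` can span lines, compiling with `re.DOTALL` would probably be needed.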
