Using the vertical tab (\v), greedy and possessive quantifiers

Question

What is the vertical tab, \v ?
How does a possessive quantifier differ from a greedy one?
And likewise, how does it differ from a non-greedy (lazy) one?

Please do not send me to Google (or lmgtfy). If you must send me somewhere, make it "natribu" or a specific site that explains the problem for three-year-old coders.
I have read the most popular resources and understood nothing, especially about the tab.

Indeed, it would be interesting to learn about \v. - Valeriy Karchov

Community spirit ♦ one · Accepted Answer · 2011-09-29T15:11:03

Sample input:

 <tag attr='val'>123 123</tag>

Goal: find a match between the angle brackets (it does not matter what exactly; the point is to see the difference in behavior).

Using the non-greedy (lazy) quantifier: <tag.*?> Here the regex engine literally tries to "get away with as little as possible." The search goes like this:

<tag is found. Move on and try to match .*
next character: a space. It matches .* - fine, that's enough for me, move on and try to match >
the character a - damn, have to backtrack; match .* again, this time against a
a matches .* - that's enough, try to match >
the character t - damn ...

and so on, backtracking after every character. Result: <tag attr='val'>

Using the greedy quantifier: <tag.*> The engine tries to grab as many characters as it can for each quantifier and backtracks only when no match is found:

<tag is found. Move on and try to match .*
next character: a space. It matches .* - great, but keep going and try to capture the following characters with .* as well: attr='val'>123 123</tag> - that's everything, nothing more to add, move on
try to match > - but the text has already run out. Have to backtrack: give back one character and try to match >
next character: > - it matches > . The text is over, the pattern is over

Result: <tag attr='val'>123 123</tag>

Using the possessive (a.k.a. "jealous") quantifier: <tag.*+> No backtracking is done at all:

<tag is found. Move on and try to match .*
next character: a space. It matches .* - great, keep going and try to capture the following characters with .* as well: attr='val'>123 123</tag> - that's everything, nothing more to add, move on
try to match > - but the text has already run out. Too bad, there is no going back

Result: no match.
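The three walkthroughs above can be checked with a few lines of Java (the language used for the timing later in this answer). This is only a minimal sketch: the class name QuantifierDemo and the helper firstMatch are made up for illustration, and the input is the sample string from above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<tag attr='val'>123 123</tag>";

        // Lazy: stops at the first > it can reach.
        System.out.println(firstMatch("<tag.*?>", input)); // <tag attr='val'>

        // Greedy: runs to the end of the text, then gives characters back until > fits.
        System.out.println(firstMatch("<tag.*>", input));  // <tag attr='val'>123 123</tag>

        // Possessive: runs to the end of the text and never gives anything back.
        System.out.println(firstMatch("<tag.*+>", input)); // null
    }
}
```

The only thing that changes between the three calls is the quantifier suffix (?, nothing, +), which makes the difference in behavior easy to compare side by side.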

As for the vertical tab: it is just a control character, the same kind of thing as \n or \t . It has no direct relation to regular expressions. It used to be used by printers - I don't know the exact details, but old-timers will remember. So \v simply checks whether this character is present in the string. (Note that some flavors, such as PCRE and Java 8 and later, treat \v as a shorthand for any vertical whitespace character instead.)
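A small Java sketch of the point about \v being an ordinary character. The class and method names are invented for illustration. Java string literals have no \v escape, so the character is built from its code 0x0B, and the regex uses \x0B to look for exactly that character. The second helper uses the regex escape \v itself, which on Java 8 and later is documented as a class of vertical whitespace characters rather than the single vertical tab.

```java
import java.util.regex.Pattern;

class VerticalTabDemo {
    // The vertical tab is the control character with code 0x0B (decimal 11).
    static final String VT = String.valueOf((char) 0x0B);

    // True if the string contains a vertical tab character.
    static boolean containsVerticalTab(String s) {
        return Pattern.compile("\\x0B").matcher(s).find();
    }

    // True if the regex escape \v matches somewhere in the string.
    static boolean matchesBackslashV(String s) {
        return Pattern.compile("\\v").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsVerticalTab("a" + VT + "b")); // true
        System.out.println(containsVerticalTab("a\tb"));         // false: a horizontal tab is a different character
        System.out.println(matchesBackslashV("a" + VT + "b"));   // true
    }
}
```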

UPD. (In reply to the comments from @knes and @Valeriy Karchov)

As far as I understand, it is about performance. Both greedy and lazy quantifiers save backtracking positions so that the engine can go back. If the pattern is complex, with nesting, there can be a great many such positions. If a match is found (or can be found), everything is fine; but if there is no match, a long search through the various alternatives may begin, and in that case possessive quantifiers establish quickly that there is no match.

Here is an artificial but illustrative example: apply the pattern (x+x+)+y to a string of the form xxxxxxxxxxy . If y is at the end, all is well: only one backtrack occurs (to find a match for the second x+ ) and the job is done. But if there is no y at the end, the engine will grind through every possible way of splitting up the x's. On my machine this search (Java) on a string of 19 x's took 2 seconds. On the other hand, it is obvious that if some stretch has been matched by (x+x+)+ , then there is definitely no y inside it. This means we can use a possessive quantifier: (x+x+)++y - since we know for sure that backtracking will not lead to finding y .
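The experiment can be reproduced roughly as follows. This is a sketch rather than the original benchmark: the class and helper names are invented, the string lengths are kept a little below 19 so the run stays short, and the actual timings depend on the machine and the Java version, so no expected numbers are given.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class BacktrackingDemo {
    // Builds a string of n 'x' characters.
    static String repeatX(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'x');
        return new String(chars);
    }

    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Wall-clock time of a single find() call, in milliseconds.
    static long millisToSearch(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        long start = System.nanoTime();
        p.matcher(input).find();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        String greedy = "(x+x+)+y";
        String possessive = "(x+x+)++y";

        // With y at the end both patterns find the match.
        System.out.println(found(greedy, repeatX(10) + "y"));     // true
        System.out.println(found(possessive, repeatX(10) + "y")); // true

        // Without y neither matches, but the greedy version tries every split of the x's first.
        for (int n = 12; n <= 18; n += 2) {
            String input = repeatX(n);
            System.out.println(n + " x: greedy " + millisToSearch(greedy, input)
                    + " ms, possessive " + millisToSearch(possessive, input) + " ms");
        }
    }
}
```

The greedy time should grow steeply as x's are added, while the possessive time should stay small, because the possessive group is never re-entered once it has matched.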

Thus, possessive quantifiers can be used where the expression under the quantifier cannot swallow the characters that are meant to be matched by the expression following the quantifier. On input that does not match, this lets the engine determine faster that there is no match. Some regex engines even detect situations like [^x]+x and substitute a possessive quantifier there themselves.
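To illustrate this rule, here is a sketch (with made-up class and helper names) comparing greedy and possessive forms where the quantified expression cannot consume what follows it. For [^x]+x and for <tag[^>]*> the possessive version gives the same result as the greedy one, unlike <tag.*+> , where .* is able to swallow the closing > .

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveSafetyDemo {
    // Hypothetical helper: returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // [^x] can never consume the x that follows, so nothing is lost by not backtracking.
        System.out.println(firstMatch("[^x]+x", "abcx"));  // abcx
        System.out.println(firstMatch("[^x]++x", "abcx")); // abcx

        String input = "<tag attr='val'>123 123</tag>";
        // [^>] can never consume the closing >, so the possessive form is safe here.
        System.out.println(firstMatch("<tag[^>]*>", input));  // <tag attr='val'>
        System.out.println(firstMatch("<tag[^>]*+>", input)); // <tag attr='val'>
    }
}
```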

=( What rollbacks? It just keeps going for as long as .* matches everything else - that much I understood from Wikipedia. But the exact matching mechanism of the regex, which is described here - is by no
And in which cases is it recommended to use a possessive quantifier?
It turns out that in any case, unless the pattern ends with .* or .+ or is an exact copy of the string, it will find nothing.
