There is a set of templates for texts, the form " Your order is available at:% w {1,3}. Take away soon! ", Where% w {1,3} is "a set of words separated by one or more spaces". By "word" we mean a set of letters, numbers, and a number of special characters. {1,3} is a quantifier that works similarly to quantifiers in regulars. The task is to learn to define texts for minimal matching of similar patterns. It was decided to convert such patterns into regular expressions. Thus, from the above example, get a regular:

/^Ваш заказ доступен по адресу:\s(?:\s*[\w\d\,\.\!]+\s*){1,3}\.\sЗабирайте скорее!$/u 

Naturally, such a regular schedule can cause problems in the form of catastrophic backtracking, but I don’t see a way to optimize it, given the fact that there can be any number of spaces between words. I would be grateful for any indication of my blindness :)

  • If you are given an exhaustive answer, mark it as correct (a daw opposite the selected answer). - Nicolas Chabanovsky

1 answer 1

I see one way to optimize: use a super-greedy seizure (1) of the line, including mandatory completion, followed by determining the presence of this very end by a retrospective (4) check. (O (66) vs. O (2441) on the test line that is not suitable and of the same complexity for the suitable ones) In the example, the points discussed in the text are marked:

 /Ваш заказ доступен по адресу:\s(?>\s*[\w\d\,\.\!]+\s*\.?){3,5}(?<=\.\sЗабирайте скорее!)$/ ^1 ^2 ^3 ^4 

But there are difficulties with the construction of such an expression from the primary line. First of all, we need the whole "correct" string to be able to be captured to the very end by the set of characters indicated in square brackets. Secondly, we need to count from how many "words" a part of the string consists of, after the mask used. And increase the number of words quantifier (3). In addition, unlike your version, greedy capture considers a separate point (in front of which is a space) as a single word, because of this, at the end of the captured expression, after the space, you had to add the possibility of such a point (2).

Actually the point is the main stumbling block. The admissibility of the mark is a point in the middle of words (in your abc.def - this is one word). If she could only be at the end of words, there could be other, less rigid options.

And of course, your regular expressions (you didn’t specify which dialect you have) should support the super-greedy capture

UPD On this proposal ReinRaus :

 /Ваш заказ доступен по адресу:(?:\s++(?:[\w\d,!]++|\.)*?){1,3}\s*+\.\sЗабирайте скорее!$/ 

So much easier to generate ...

  • Yes exactly. The problem is at the point that can be part of the “word” and enters a mandatory text at the end - ReinRaus
  • @ReinRaus Yeah, I tried to get around it for a long time ... - Mike
  • You walked around it in an interesting way, I would do it like this: regex101.com/r/iW2iA7/4 , but this gives a little more steps. - ReinRaus
  • @ReinRaus Steps This is not particularly scary when there are not thousands of them. Yes, you have much easier to generate - Mike
  • Many thanks to @Mike for the detailed answer and @RainRaus for the expression that is convenient for generating. However, the problem is that each point will be considered as a separate “word”, and sooner or later someone will come to think of it. eg. so: "Your order is available at the address: Lenin St., 6 ... Take it away soon!" Because the rate of generation of regulars is uncritical, most likely we will do this: Rule of the form Your order is available at:% w {1,3}. Take away soon! Will be converted to reg. expression: /^Ваш заказ доступен по адресу:\s(?:[\w\d\,\.\!]+)(?:\s++[\w\d\,\.\!]+){1,3}\.\sЗабирайте скорее!$/u - Alexander Sadovnikov