I am parsing the texts of court decisions. From a fragment like the one below, I need to extract information about the sentence (the term of imprisonment):
в виде лишения свободы на срок 3 года 10 месяцев со штрафом в размере 150 000 рублей с ограничением свободы на срок на 8 месяцев.

The regex I wrote:
лишени[а-я]+\s*?свободы\s*?на\s*?(?:срок)?\s*?(?:(?P<years>\d+).*?(?:года?|лет)?)?\s*?и?\s*?(?:(?P<months>\d+)\s*?(?:месяц[а-я]{0,3}))?
matches only лишения свободы на, but if I remove the final question mark (which cannot be removed in the general case), I get the desired result:
лишения свободы на срок 3 года 10 месяцев
The documentation says:

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible.

Question: why, in my case, does adding ? have the opposite effect from the one I expected?

  • And why is there .*? after years? Why not \s*? In general, this is all I have managed so far: regex101.com/r/4ls28x/2 - Wiktor Stribiżew
  • @WiktorStribiżew .*? is there because of cases like this: ...лишения свободы на 5 (пять) лет... - Roman Yakubovich
  • Then try this: regex101.com/r/4ls28x/3 - Wiktor Stribiżew
  • @WiktorStribiżew Thank you! I tried the first option with .*? substituted in. Why does it match zero characters in that case, rather than everything up to the first continuation of the pattern (isn't that what “non-greedy” means)? Is it because of the ? after лет that it doesn't even check whether a possible continuation exists in the line? Because this works: regex101.com/r/4ls28x/4 - Roman Yakubovich
  • In general, the problem is that .*? must be followed by at least one required pattern. In the original expression they are all optional, since each of them is followed by a ? quantifier. That is, год or лет must definitely be present, right? - Wiktor Stribiżew

1 answer

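The problem is that .*? must be followed by at least one required pattern. A lazy quantifier first tries to match zero characters and expands only when the rest of the pattern fails. In the original expression, every subpattern after .*? is optional, because each is followed by the ? quantifier (zero or one match), so the rest of the pattern never fails and .*? never has a reason to expand.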
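A minimal sketch of this effect, using the example phrase from the comments. The variable names are illustrative only; the two patterns differ only in whether the part after .*? is optional.

```python
import re

text = "5 (пять) лет"

# Lazy .*? followed only by an optional group: the zero-length attempt
# already lets the whole pattern succeed, so .*? is never expanded.
optional_tail = re.search(r"\d+.*?(?:лет)?", text).group()

# Same pattern with a required tail: .*? has to grow until "лет" matches.
required_tail = re.search(r"\d+.*?(?:лет)", text).group()

print(repr(optional_tail))  # '5'
print(repr(required_tail))  # '5 (пять) лет'
```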

Since год or лет must definitely appear in the text being searched, you can use

 лишени[а-я]+\s*свободы\s*на(?:\s*срок)?(?:\s*(?:(?P<years>\d+).*?(?:года?|лет)))?(?:\s*и)?(?:\s*(?P<months>\d+)(?:\s*месяц[а-я]{0,3}))? 

See the regular expression demo

Now (?:года?|лет) is a required pattern, and .*? has to “reach” one of these alternatives. Otherwise (as in the original expression), .*? matches nothing, the remaining subpatterns match the empty string (since they are optional), and the substring after the years is not captured. By the way, all the other quantifiers should be greedy.
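As a quick check, the two expressions can be compared in Python on the sample text from the question. This is only a sketch: the names ORIGINAL, FIXED and extract_term are hypothetical helpers introduced for the comparison, and the patterns are split across lines purely for readability.

```python
import re

# Pattern from the question (all lazy quantifiers, everything optional).
ORIGINAL = re.compile(
    r"лишени[а-я]+\s*?свободы\s*?на\s*?(?:срок)?\s*?"
    r"(?:(?P<years>\d+).*?(?:года?|лет)?)?\s*?и?\s*?"
    r"(?:(?P<months>\d+)\s*?(?:месяц[а-я]{0,3}))?"
)

# Pattern from the answer (greedy quantifiers, required года?|лет).
FIXED = re.compile(
    r"лишени[а-я]+\s*свободы\s*на(?:\s*срок)?"
    r"(?:\s*(?:(?P<years>\d+).*?(?:года?|лет)))?"
    r"(?:\s*и)?"
    r"(?:\s*(?P<months>\d+)(?:\s*месяц[а-я]{0,3}))?"
)


def extract_term(pattern, text):
    """Return (matched text, years, months); all three are None if nothing matches."""
    m = pattern.search(text)
    if m is None:
        return None, None, None
    return m.group(), m.group("years"), m.group("months")


sample = (
    "в виде лишения свободы на срок 3 года 10 месяцев со штрафом "
    "в размере 150 000 рублей с ограничением свободы на срок на 8 месяцев."
)

print(extract_term(ORIGINAL, sample))
# ('лишения свободы на', None, None)
print(extract_term(FIXED, sample))
# ('лишения свободы на срок 3 года 10 месяцев', '3', '10')
print(extract_term(FIXED, "лишения свободы на 5 (пять) лет"))
# ('лишения свободы на 5 (пять) лет', '5', None)
```

The captured groups are strings (or None when a group did not participate), so they would still need converting with int() before any arithmetic.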