I have very long regular expressions with many groups inside. For example, something like this:

((ab+)|(qwe[rty]?)|(hjk.*)|(mmm)|(ppp)|(sss))? ((ooo[0-9]?)|(ddd)|(ggg)|(jjj))kk? zzz 

only 1-2 thousand characters long and with 100-200 groups and subgroups. Are there any restrictions on the length of a regular expression in Python? On the number of groups in it? Any other nuances? Or is there no limit at all, and everything depends only on processor power?

So far it seems that I cannot use regular expressions longer than 1000 characters and/or with more than 100 groups.
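For what it's worth, pattern length by itself does not appear to be the bottleneck. A small sketch (the token names here are invented purely for illustration) that compiles a pattern well over 1000 characters:

```python
import re

# Build a ~1500-character alternation from 300 made-up, equal-length
# tokens, wrapped in a single non-capturing group so the group count
# stays far below 100.
words = ["w%03d" % i for i in range(300)]
pattern = "(?:" + "|".join(words) + ")"

print(len(pattern) > 1000)        # True: the pattern is well over 1000 chars
rx = re.compile(pattern)          # compiles without complaint
print(rx.match("w042").group())   # w042
```

So the practical limit the question runs into is the number of capturing groups, not the raw character count of the pattern.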

  • Very interesting question! Here they discuss the situation in PHP; I would like to know how things stand in Python. - mymedia
  • May I ask why? Maybe for your task it would be better to write a proper parser without regexes? - andreymal
  • No, it would not be better. I need to match text tokens against a large number of different patterns, and those are regular expressions. The regexes end up this long for a reason: I need to reduce the number of patterns by gluing similar ones into one, since the system's speed is inversely proportional to the number of patterns. - Shelari
  • What is your version of Python? - MaxU
  • Python version 2.7.9. - Shelari

2 Answers

Judging by this ticket, Python before version 3.5 has a limit of 100 capturing groups.

https://bugs.python.org/file36654/re_maxgroups.patch :

    -``(?P=name)``
    -   A backreference to a named group; it matches whatever text was matched by the
    -   earlier group named *name*.
    +``(?P=name)``, ``(?P=number)``
    +   A backreference to a group; it matches whatever text was matched by the
    +   earlier group named *name* or numbered *number*.
    +
    +   .. versionchanged:: 3.5
    +      Added support of group numbers.

    ``(?#...)``
       A comment; the contents of the parentheses are simply ignored.

    diff -r 8a2755f6ae96 Lib/sre_compile.py
    --- a/Lib/sre_compile.py  Thu Sep 18 19:45:04 2014 +0300
    +++ b/Lib/sre_compile.py  Thu Sep 18 23:27:28 2014 +0300
    @@ -470,12 +470,6 @@
     def compile(p, flags=0):
         # print code
    -    # XXX: <fl> get rid of this limitation!
    -    if p.pattern.groups > 100:
    -        raise AssertionError(
    -            "sorry, but this version only supports 100 named groups"
    -            )
    -
         # map in either direction
         groupindex = p.pattern.groupdict
         indexgroup = [None] * p.pattern.groups
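On Python 3.5 and later, where this limitation was removed, a pattern with far more than 100 capturing groups compiles without complaint. A quick sketch:

```python
import re

# A pattern with 200 capturing groups: (a)(a)...(a)
pattern = "(a)" * 200

try:
    rx = re.compile(pattern)
    print("compiled, groups =", rx.groups)   # compiled, groups = 200
except AssertionError as exc:
    # Python < 3.5 rejects this with:
    #   "sorry, but this version only supports 100 named groups"
    print("rejected:", exc)
```

On Python 2.7 (the asker's version) the `except` branch fires instead.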

I would still advise you to open a new question and describe your task with examples of the input and expected output text/data - perhaps there is a more elegant solution...

  • Thank you very much! - Shelari
  • @Shelari, always welcome. I added a bit of advice to my answer - maybe it’s worth looking for another solution ... - MaxU
  • Thank you, if I have the opportunity, I'll take care of it - Shelari

I will assume that you do not actually need that many CAPTURING groups. Most likely, they can all be replaced with non-capturing groups, (?:...) .
The regex "glue" you are using does not glue correctly: the groups should be non-capturing, since capturing serves no purpose here; all it buys you is two fewer characters per group (the omitted ?: ).
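One mechanical way to do this conversion is to rewrite every plain opening parenthesis as (?: . Below is a rough sketch of that idea; note it is naive and would also rewrite a literal ( inside a character class like [()], which the patterns in the question happen not to contain:

```python
import re

capturing = r"((ab+)|(qwe[rty]?)|(mmm))"

# Turn every unescaped '(' that does not already start a '(?...)'
# construct into a non-capturing '(?:'.
non_capturing = re.sub(r"(?<!\\)\((?!\?)", "(?:", capturing)

print(non_capturing)                      # (?:(?:ab+)|(?:qwe[rty]?)|(?:mmm))
print(re.compile(non_capturing).groups)   # 0 capturing groups remain
```

With zero capturing groups, the 100-group limit of Python < 3.5 no longer applies.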

  • Yes, I know about this method, but at the moment it is, alas, too time-consuming to implement. That is exactly what I plan to do when the opportunity arises. And can you tell me, are there any restrictions on the number of non-capturing groups? - Shelari
  • Although it seems that Elasticsearch, which is part of the system, does not understand non-capturing groups, so perhaps this solution will not work. - Shelari
  • There are no restrictions on such groups. - ReinRaus