A question about Python regular expressions, specifically about the findall and search functions

I have a pattern: pattern = re.compile(r'([aeiouy])\1{2}')

And a test string: str_1 = 'hoooowe'

If I use re.search(pattern, str_1).group(), I get the text of the Match object, i.e. 'ooo'

It seems logical that findall should find the same thing for the same pattern, only as a list. However, the result is not list_1 = ['ooo'] but list_1 = ['o'].

What am I doing wrong?

  • Do not name variables str, list, etc.: these are the names of Python built-ins - docs.python.org/3.6/library/functions.html :) - gil9red
  • That is just for clarity; I do not actually name them that way - 0racul
  • In fact, the regular expression methods always look for non-overlapping matches by default (re.findall, re.search, re.finditer all work this way). There are no overlapping matches in the question. See the correct explanation and example solutions for this problem in my answer. - Wiktor Stribiżew

2 answers

The description of re.findall contains this line:

Return all non-overlapping matches of pattern in string, as a list of strings.

I think that is why findall did not give you what search did: the condition implies overlapping matches (your pattern uses a group).

Try the finditer method instead of re.findall:

    import re

    pattern = re.compile(r'([aeiouy])\1{2}')
    text = 'hooooweee'

    for m in pattern.finditer(text):
        print(m)

Result:

    <_sre.SRE_Match object; span=(1, 4), match='ooo'>
    <_sre.SRE_Match object; span=(6, 9), match='eee'>

P.S.

As for findall: I simplified the regular expression a bit so that it only looks for sequences of three vowels, and it works as expected:

    import re

    items = re.findall('[aeiouy]{3}', 'hooooweee')
    print(items)  # ['ooo', 'eee']
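Note that the simplified pattern is not equivalent to the one in the question: it accepts any three consecutive vowels, not one vowel repeated three times. A minimal sketch of the difference, assuming the test string 'hooooweie' (the variable names any_three and same_three are only illustrative):

```python
import re

text = 'hooooweie'

# Any three vowels in a row: no groups, so findall returns whole matches.
any_three = re.findall(r'[aeiouy]{3}', text)
print(any_three)   # ['ooo', 'eie']

# The same vowel three times: this needs a backreference, hence a group,
# so the whole match is read from finditer instead of findall.
same_three = [m.group() for m in re.finditer(r'([aeiouy])\1{2}', text)]
print(same_three)  # ['ooo']
```

The two patterns agree on 'hooooweee' only because that string happens to contain no run of mixed vowels.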

P.P.S.

Another example, in which three groups capture each vowel in a run of three consecutive vowels.

As we can see, findall puts into its result what the groups captured, unlike search and finditer:

    import re

    pattern = re.compile(r'([aeiouy])([aeiouy])([aeiouy])')
    text = 'hooooweie'

    m = pattern.search(text)
    print(m)  # <_sre.SRE_Match object; span=(1, 4), match='ooo'>
    print(m.group(), m.group(1), m.group(2), m.group(3))  # ooo o o o

    items = pattern.findall(text)
    print(items)  # [('o', 'o', 'o'), ('e', 'i', 'e')]

    for m in pattern.finditer(text):
        print(m.group(), m.group(1), m.group(2), m.group(3))
    # ooo o o o
    # eie e i e

Likewise, for the pattern '([aeiouy][aeiouy])([aeiouy])' findall returns [('oo', 'o'), ('ei', 'e')]

    Solution for this case:

        import re

        str_1 = 'hoooowe'
        pattern = re.compile(r'([aeiouy])\1{2}')
        print([x.group() for x in pattern.finditer(str_1)])
        # => ['ooo']

    The point is that re.findall returns only the substrings captured by capturing groups, if any are defined in the pattern (if there is more than one group, a list of tuples of substrings is returned). The re.search method (like re.match) returns a Match object, which has its own set of methods, and you can get the whole match with match.group().

    See the re.findall documentation:

    If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

    In other words, if the pattern contains one or more capturing groups, a list of groups is returned; if it contains more than one group, the result is a list of tuples. Empty matches also appear in the resulting list unless they touch the beginning of another match.
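The rule quoted above can be checked on the string from the question. The sketch below is only an illustration: the patterns with zero and two groups are variations made up for the comparison, not part of the original question.

```python
import re

text = 'hoooowe'

# No groups: findall returns a list of whole matches.
no_groups = re.findall(r'o{3}', text)

# One group: findall returns what that group captured (the question's case).
one_group = re.findall(r'([aeiouy])\1{2}', text)

# Two groups: findall returns one tuple per match.
two_groups = re.findall(r'([aeiouy])(\1{2})', text)

print(no_groups)   # ['ooo']
print(one_group)   # ['o']
print(two_groups)  # [('o', 'oo')]
```

In all three cases the regex engine matches the same three characters; only the shape of what findall reports changes.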

    To make re.findall return only a list of whole matches, the following approaches are usually used:

    • Removing unnecessary capturing groups (for example, (a(b)c) -> abc)
    • Replacing capturing groups with non-capturing ones (i.e. () -> (?:)), except where the pattern contains backreferences, without which the regular expression does not work
    • Using re.finditer instead of re.findall (see the solution at the beginning of the answer).
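A rough sketch of the three approaches, one after another. The example patterns for the first two items are invented purely for illustration; only the last one is the pattern from the question.

```python
import re

text = 'hooooweee'

# 1. Drop groups that are not needed: (o(o)o) -> ooo
with_groups = re.findall(r'(o(o)o)', text)
without_groups = re.findall(r'ooo', text)
print(with_groups)     # [('ooo', 'o')]
print(without_groups)  # ['ooo']

# 2. Make the group non-capturing: (...) -> (?:...)
capturing = re.findall(r'(ee|oo)[aeiouy]', text)
non_capturing = re.findall(r'(?:ee|oo)[aeiouy]', text)
print(capturing)       # ['oo', 'ee']
print(non_capturing)   # ['ooo', 'eee']

# 3. The backreference \1 needs its capturing group, so keep the group
#    and read the whole match from each Match object instead.
pattern = re.compile(r'([aeiouy])\1{2}')
whole_matches = [m.group() for m in pattern.finditer(text)]
print(whole_matches)   # ['ooo', 'eee']
```

The first two options change the pattern itself, so they only apply when nothing depends on the group; the third leaves the pattern untouched.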