Python 2.7.6: problem applying re to a string in Russian

  1. the task is to find three alphanumeric characters followed by a dot;

  2. code:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import re

    new = re.findall("\w{3}\.", "gth. Ср. дек. 7 21:22:29 EET 2016")
    print new

  3. result: ['gth.']

  4. question: why is 'дек.' ignored?

  • Do the three "alphanumeric characters" include _? \w matches underscores too. And Alex’s answer is correct: in 2.7 you must use re.U together with a u"" string. - Wiktor Stribiżew
  • Thanks, \w really does include _. It is not critical in my case, but I had forgotten about it. - young_podaffan
  • The underscore can be excluded like this: [^\W_] - Wiktor Stribiżew
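
To illustrate the point made in the comments, here is a minimal sketch comparing \w with [^\W_]. The sample string is made up for the demonstration and is not from the question. The sketch is written so that it should run on both Python 2.7 and Python 3, which is why it uses unicode_literals instead of the ur"" prefix.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

# Hypothetical sample text: two of the tokens contain an underscore.
text = "a_b. gth. Ср. дек. x_1."

# \w matches letters, digits and the underscore.
with_underscore = re.findall(r"\w{3}\.", text, re.UNICODE)

# [^\W_] reads as "a word character that is not an underscore".
without_underscore = re.findall(r"[^\W_]{3}\.", text, re.UNICODE)

for item in with_underscore:
    print("with underscore:   ", item)
for item in without_underscore:
    print("without underscore:", item)
```

With this input the first pattern also picks up a_b. and x_1., while the second keeps only gth. and дек.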

1 answer

Use Unicode strings and the re.UNICODE flag:

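    #!/usr/bin/env python2
    # -*- coding: utf-8 -*-
    import re

    # re.UNICODE makes \w follow the Unicode character database, not just ASCII
    pattern = re.compile(ur"\w{3}\.", re.UNICODE)
    # the subject must be a unicode string (u"..."), not a byte string
    match = pattern.findall(u"gth. Ср. дек. 7 21:22:29 EET 2016")
    print(match)
    for i in match:
        print(i)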

Result:

    [u'gth.', u'\u0434\u0435\u043a.']
    gth.
    дек.
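
As for why the original code skipped 'дек.': in Python 2 a plain "..." literal in a UTF-8 source file is a byte string, and without re.UNICODE \w matches only ASCII letters, digits and the underscore. The sketch below reproduces that situation by encoding the text to UTF-8 explicitly. That explicit encoding step is my assumption, added so the same code should behave identically on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

text = u"gth. Ср. дек. 7 21:22:29 EET 2016"

# Roughly what a plain "..." literal holds in Python 2: UTF-8 encoded bytes.
raw = text.encode("utf-8")

# On bytes, \w covers only ASCII [a-zA-Z0-9_]. Each Cyrillic letter is two
# non-ASCII bytes, so the bytes of "дек" are never word characters.
byte_matches = re.findall(br"\w{3}\.", raw)

# On Unicode text with re.UNICODE, \w also accepts Cyrillic letters.
text_matches = re.findall(u"\\w{3}\\.", text, re.UNICODE)

print(len(byte_matches), "match on bytes")
print(len(text_matches), "matches on Unicode text")
for item in text_matches:
    print(item)
```

Here byte_matches contains only gth., while text_matches contains both gth. and дек.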