Let me clarify right away: I know that Python 2 requires strings to be explicitly declared as Unicode, and I understand that this code should not work correctly. What interests me is the anatomy of the breakdown: what exactly inside re.compile() and regex.search() produces such a result?


Judging by the code below, the range 'а-яё' does not include the range 'р-ю', and yet the range 'р-ю' includes 'ё'.

mcve.py:

    # coding=utf-8
    import re

    # This is a pangram: it contains every letter of the alphabet
    test = 'широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства'
    regex1 = re.compile('[а-яА-ЯёЁ\s]+')
    regex2 = re.compile('[а-яА-ЯёЁшьрэтфцюыхущчъ\s]+')
    regex3 = re.compile('[а-яёшьрэтфцюыхущчъс\s]+')
    regex4 = re.compile('[а-яр-ю\s]+')
    print regex1.search(test).group()
    print regex2.search(test).group()
    print regex3.search(test).group()
    print regex4.search(test).group()

Result:


    широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства
    широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства
    широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства

I made sure that all the letters of the alphabet from 'А' to 'Я' and from 'а' to 'я' occupy contiguous code points in Unicode, except for 'Ё'/'ё', which is why they are added to the regular expression explicitly.
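Under Python 3, where str is Unicode, this contiguity is easy to verify; a minimal sketch, not part of the original question:

```python
# Verify that А..Я and а..я are contiguous blocks of code points
# and that Ё/ё fall outside both ranges.
upper = [chr(c) for c in range(ord('А'), ord('Я') + 1)]
lower = [chr(c) for c in range(ord('а'), ord('я') + 1)]
print(len(upper), len(lower))        # 32 32
print('Ё' in upper, 'ё' in lower)    # False False
print(hex(ord('Ё')), hex(ord('ё')))  # 0x401 0x451, outside U+0410..U+044F
```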

Gradually adding the letters at which the first expression's search broke off, I arrived at the set а-яА-ЯёЁшьрэтфцюыхущчъ. If you sort the added letters, you get an almost continuous interval: "ртуфхцчшщъыьэю".

If you remove the capital letters, i.e. [А-ЯЁ], then the search unexpectedly breaks at 'с'. The interval becomes solid: from 'р' to 'ю'. This is regex3.

And finally, it turns out that the interval can now be collapsed, and even 'ё' can be removed (regex4).
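For reference, the UTF-8 byte pairs of the letters involved (a Python 3 sketch, not in the original question) hint at where the puzzle will lead:

```python
# UTF-8 byte pairs for the letters involved.
# In Python 2, the regex parser saw these raw bytes, not letters.
for ch in 'аяёрю':
    print(ch, list(ch.encode('utf-8')))
# а [208, 176]
# я [209, 143]
# ё [209, 145]
# р [209, 128]
# ю [209, 142]
```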

What is going on?

    $ python --version
    Python 2.7.6

If I explicitly make both the string and the regular expression unicode, everything works as it should. But it also somehow works without that. Explain how?

    test2 = u'широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства'
    regex5 = re.compile(u'[а-яА-ЯёЁ\s]+')
  • And why aren't the regexes unicode? - Visman
  • @Visman probably because it worked. :)) I checked with unicode, and it works as it should. But it also somehow works without it. I want to understand this phenomenon: what is happening and why. - Nick Volynkin
  • Generally regexes are written not as unicode strings but as raw strings, i.e. it should probably be re.compile(r'[а-яА-ЯёЁ\s]+') - FeroxTL

2 answers

The easiest way to see what is wrong with the regex is to pass the re.DEBUG flag. In this case there is no need to dig into the guts at all: they will show the same thing. You can verify this by opening %python_folder%/Lib/sre_parse.py yourself and adding a print statement around line 438, like this:

    elif this == "[":
        # character set
        set = []
        setappend = set.append
    ##  if sourcematch(":"):
    ##      pass # handle character classes
        if sourcematch("^"):
            setappend((NEGATE, None))
        # check remaining characters
        start = set[:]
        while 1:
            this = sourceget()
            if len(this) == 1:
                # Right here the parser iterates over the contents of []
                print(source.tell(), "ORD: ", ord(this))
            if this == "]" and set != start:
                break

This print will show exactly the same thing as re.DEBUG: the "letters" are not actually letters.

    regex1 = re.compile('[а-яА-ЯёЁ\s]+', re.DEBUG)

    max_repeat 1 2147483647
      in
        literal 208
        range (176, 209)
        literal 143
        literal 208
        range (144, 208)
        literal 175
        literal 209
        literal 145
        literal 208
        literal 129
        category category_space

It is immediately obvious that something is off: the first item should be range(ord('а'), ord('я')), and instead there is some nonsense in its place. All because the strings are in UTF-8 (as explicitly declared at the top of the file) and their type is bytes (str). They display fine in the terminal if the encodings match; I use PyCharm, and its console is UTF-8. But if I ran the same thing in the standard Windows terminal, I would naturally get mojibake (something like ╨╤А╨╕╨▓╨╡╤В), since it uses the CP866 encoding.

For example,

    print("Привет")

    # But!
    for char in "Привет":
        print(ord(char), repr(char))
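Assuming the source file is saved as UTF-8, the byte sequence the Python 2 parser actually iterated over can be reproduced in Python 3 by encoding the pattern; the numbers line up with the DEBUG dump (a sketch, not from the original answer):

```python
# The byte sequence hidden inside the "string" pattern:
# 91 is '[', then each Cyrillic letter becomes a two-byte pair.
pattern = '[а-яА-ЯёЁ\\s]+'
print(list(pattern.encode('utf-8')))
# [91, 208, 176, 45, 209, 143, 208, 144, 45, 208, 175,
#  209, 145, 208, 129, 92, 115, 93, 43]
```

Reading it out: 208, 176 is 'а', 45 is '-', 209, 143 is 'я', which is exactly why DEBUG shows literal 208 followed by range (176, 209).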

In the debug output, 208 is the first byte of the UTF-8 character 'а' and 176 is its second byte; then comes the hyphen, and then again the first byte of 'я'. You can verify this by opening the source file in any hex editor. The result is the wrong interval. So when the parser iterates over the contents of the square brackets, it steps not over letters but over bytes, or more precisely over halves of UTF-8-encoded letters. You can simulate the regex's behavior with this code:

    reg_min = 144
    reg_max = 209
    result = []
    for char in test:
        code = ord(char)
        if reg_min <= code <= reg_max:
            result.append(char)
        else:
            break

    print(ord(regex1.search(test).group()))
    print(list(map(ord, result)))

    >>> 209
    >>> [209]
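The Python 2 behavior can also be reproduced on Python 3 by compiling the pattern as bytes (the assumption of this sketch being that a bytes pattern goes through the same byte-by-byte parsing as a Python 2 str pattern), which yields the same lone 0xD1 match:

```python
import re

test = ('широкая электрификация южных губерний даст мощный '
        'толчок подъёму сельского хозяйства').encode('utf-8')
# A bytes pattern makes Python 3 parse the class byte by byte,
# just as Python 2 parsed a non-unicode str pattern.
regex1 = re.compile('[а-яА-ЯёЁ\\s]+'.encode('utf-8'))
m = regex1.search(test)
print(m.group())  # b'\xd1': the leading byte of 'ш' matches, its second byte does not
```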
  • Good analysis. Maybe it is worth spelling out in the answer that literal 208 is the first byte of the first character in [а-яА-ЯёЁ], and range (176, 209) is the range from the second byte of 'а' to the first byte of 'я' (otherwise not everyone will figure it out right away). - avp
  • @avp, I think I understood right away, but you can chew) - Nick Volynkin
  • @NickVolynkin, interesting: can you call setlocale() (from libc) yourself in Python? If so, you could set UTF-8 yourself and see what happens. - avp
  • @avp, of course you can: import locale; locale.setlocale(locale.LC_ALL, ('RU', 'UTF8')). But the locale changes the character set, date format, time zone and so on; single-byte strings will not start being interpreted as multi-byte ones. - m9_psy
  • Not really. It depends on the library (I don't know what Python uses). In Ubuntu 16.04 LTS, regcomp()/regexec() for the pattern [а-я] behave differently under different locales: C (the default if setlocale() is not called) and en_US.UTF-8 (my LANG environment, i.e. calling setlocale(LC_ALL, "")). Well, you can experiment yourself. - avp

Native Unicode support appeared in Python only in version 3. Use it. Pay attention to the first two lines of the corrected file.

    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    import re

    # This is a pangram: it contains every letter of the alphabet
    test = 'широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства'
    regex1 = re.compile('[а-яА-ЯёЁ\s]+')
    regex2 = re.compile('[а-яА-ЯёЁшьрэтфцюыхущчъ\s]+')
    regex3 = re.compile('[а-яёшьрэтфцюыхущчъс\s]+')
    regex4 = re.compile('[а-яр-ю\s]+')
    print(regex1.search(test).group())
    print(regex2.search(test).group())
    print(regex3.search(test).group())
    print(regex4.search(test).group())

Output:

    gaal@linux-t420:~/WORK/test> ./1.py
    широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства
    широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства
    широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства
    широкая электрификация южных губерний даст мощный толчок подъ
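A quick check (not part of the original answer) of why regex4's output is truncated under real Unicode semantics: 'ё' (U+0451) lies outside а-я (U+0430..U+044F), so the match breaks right before it:

```python
import re

test = ('широкая электрификация южных губерний даст мощный '
        'толчок подъёму сельского хозяйства')
# With a unicode pattern, [а-яр-ю\s] is a class of letters, not bytes;
# 'ё' is not in it, so the match stops inside 'подъёму'.
m = re.search('[а-яр-ю\\s]+', test)
print(m.group()[-4:])  # 'подъ'
print(hex(ord('ё')))   # 0x451, outside U+0430..U+044F
```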
  • Yes, I know about that, and if you explicitly write u'строка', then everything is fine. But what happens in this example? How are regular expressions compiled from such strings? - Nick Volynkin
  • Well, apparently because Python 2 works in the current code page. What is the target platform for your script: which version of Windows or Linux? - gecube
  • Linux Ubuntu 14.04. By the way, is # -*- coding: utf-8 -*- any different in effect from just # coding: utf-8? - Nick Volynkin
  • Your answer is correct in the sense of the "XY problem": indeed, one should not expect correct results from incorrect usage. But I would like an excursion into Python's internals with an explanation of the reasons for such a breakdown. Call it a vivisector's interest. :) And as for the shebang: I know, but the habit isn't there yet. By the way, a follow-up question on topic: ru.stackoverflow.com/q/541589/181472 - Nick Volynkin
  • Yes, yes, plus to that question, it is really relevant. As for the vivisection, here is what can be done: first, print the character tables; second, look into the guts of the regex module in Python 2 and draw the appropriate conclusions. - gecube