Let me say up front: I know that Python 2 requires strings to be explicitly declared as Unicode, and I understand that this code should not work correctly. What interests me is the anatomy of the failure. What exactly inside re.compile() and regex.search() produces this result?
Judging by the code below, the 'а-яё' range does not include the 'р-ю' range, yet the 'р-ю' range does include 'ё'.
mcve.py:
```python
# coding=utf-8
import re

# This is a pangram: it contains every letter of the alphabet
test = 'широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства'

regex1 = re.compile('[а-яА-ЯёЁ\s]+')
regex2 = re.compile('[а-яА-ЯёЁшьрэтфцюыхущчъ\s]+')
regex3 = re.compile('[а-яёшьрэтфцюыхущчъс\s]+')
regex4 = re.compile('[а-яр-ю\s]+')

print regex1.search(test).group()
print regex2.search(test).group()
print regex3.search(test).group()
print regex4.search(test).group()
```

Result:
```
широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства
широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства
широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства
```
I made sure that all the letters of the alphabet from "А" to "Я" and from "а" to "я" are consecutive in Unicode, except for "Ё" and "ё", which are added to the regular expression explicitly.
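That check can be sketched roughly as follows. This is only an illustration, written to run on Python 2.7 and 3 alike; the helper name `is_contiguous` and the variables `lower` and `upper` are made up for the example.

```python
# coding=utf-8
lower = u'абвгдежзийклмнопрстуфхцчшщъыьэюя'  # 32 letters, ё left out
upper = u'АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ'  # 32 letters, Ё left out


def is_contiguous(letters):
    """True if the code points of `letters` increase by exactly 1 each step."""
    codes = [ord(ch) for ch in letters]
    return codes == list(range(codes[0], codes[0] + len(codes)))


print(is_contiguous(lower))  # True: а..я is U+0430..U+044F
print(is_contiguous(upper))  # True: А..Я is U+0410..U+042F
# ё and Ё sit outside both runs
print('%04X %04X' % (ord(u'ё'), ord(u'Ё')))  # 0451 0401
```

Note that this only says something about code points, i.e. about the `u'...'` case; it says nothing about how the same letters look inside a byte string.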
By gradually adding the letters at which the first expression's match stops, I arrived at the character class а-яА-ЯёЁшьрэтфцюыхущчъ. If you sort the added letters, you get an almost continuous interval: "ртуфхцчшщъыьэю".
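The sorting step, and the one letter that keeps the interval from being continuous, can be reproduced with a short sketch like this one (variable names are illustrative; it runs on Python 2.7 and 3):

```python
# coding=utf-8
# Letters that had to be appended to regex1 to obtain regex2
added = u'шьрэтфцюыхущчъ'

# Sorting unicode characters orders them by code point
in_order = u''.join(sorted(added))
print(in_order == u'ртуфхцчшщъыьэю')  # True

# Every letter of the code-point interval р..ю, for comparison
full_run = u'рстуфхцчшщъыьэю'
missing = [ch for ch in full_run if ch not in added]
print(len(missing))  # 1: only "с" did not have to be added
```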
If you remove the capital letters, i.e. "[А-ЯЁ]", the match unexpectedly starts stopping at "с". With "с" added as well, the interval becomes continuous: from "р" to "ю". This is regex3.
And finally it turns out that the added letters can now be collapsed into a range, and even the "ё" can be removed (regex4).
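One way to look at what these character classes contain when they are not Unicode is to dump their UTF-8 bytes. The sketch below assumes the source file really is saved as UTF-8, as the coding line declares, so that a Python 2 `'...'` literal holds exactly these bytes; `utf8_hex` is a made-up helper name.

```python
# coding=utf-8
def utf8_hex(text):
    """Return the UTF-8 encoding of `text` as space-separated hex bytes."""
    return ' '.join('%02x' % b for b in bytearray(text.encode('utf-8')))


# Character-class bodies of regex1 and regex4, without the trailing \s
print(utf8_hex(u'а-яА-ЯёЁ'))  # d0 b0 2d d1 8f d0 90 2d d0 af d1 91 d0 81
print(utf8_hex(u'а-яр-ю'))    # d0 b0 2d d1 8f d1 80 2d d1 8e

# Two letters from the test string, for comparison
print(utf8_hex(u'ш'))  # d1 88
print(utf8_hex(u'с'))  # d1 81
```

Each Cyrillic letter occupies two bytes, so in the byte-string version the `-` (0x2d) sits between the last byte of one letter and the first byte of the next.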
What is going on?
```
python --version
Python 2.7.6
```

If you explicitly make both the string and the regular expression Unicode, everything works as it should. But somehow it also works, after a fashion, without that. How exactly?
```python
test2 = u'широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства'
regex5 = re.compile(u'[а-яА-ЯёЁ\s]+')
```
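For comparison, here is a minimal sketch that runs the first pattern both ways in one script. It is written to work on Python 2.7 and 3 alike, so the byte-string case is produced with `.encode('utf-8')` instead of a bare `'...'` literal; that again assumes the original file is saved as UTF-8. The names `text`, `pattern`, `m_text` and `m_bytes` are illustrative.

```python
# coding=utf-8
import re

text = u'широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства'
pattern = u'[а-яА-ЯёЁ\\s]+'

# Unicode pattern on a Unicode string: one element per letter
m_text = re.compile(pattern).search(text)

# Same pattern and string as UTF-8 bytes: one element per byte
m_bytes = re.compile(pattern.encode('utf-8')).search(text.encode('utf-8'))

print(m_text.group() == text)  # True: the whole sentence matches
print(len(m_bytes.group()))    # 1: the match ends after a single byte
```

The one-byte match is the first byte of "ш" (0xd1), which is not a complete character on its own.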