Let me say right away: I know that Python 2 requires strings to be explicitly declared as Unicode, and I understand that this code should not work correctly. I'm interested in the anatomy of the breakdown. What exactly inside re.compile() and regex.search() produces such a result?
Judging by the code below, the 'а-яё' range does not include the 'р-ю' range, but the 'р-ю' range does include 'ё'.
mcve.py:
    # coding=utf-8
    import re

    # This is a pangram: it contains every letter of the alphabet
    test = 'широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства'

    regex1 = re.compile('[а-яА-ЯёЁ\s]+')
    regex2 = re.compile('[а-яА-ЯёЁшьрэтфцюыхущчъ\s]+')
    regex3 = re.compile('[а-яёшьрэтфцюыхущчъс\s]+')
    regex4 = re.compile('[а-яр-ю\s]+')

    print regex1.search(test).group()
    print regex2.search(test).group()
    print regex3.search(test).group()
    print regex4.search(test).group()
Result:
    широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства
    широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства
    широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства
I checked that all the letters from "А" to "Я" and from "а" to "я" are consecutive in Unicode, except for "Ё" and "ё", which are added to the regular expression explicitly.
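The code point check described above can be sketched like this. It uses explicit Unicode literals so that it behaves the same on Python 2 and Python 3; the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# The 32 lowercase letters from "а" to "я" (without "ё"), as a Unicode string.
letters = u'абвгдежзийклмнопрстуфхцчшщъыьэюя'
code_points = [ord(c) for c in letters]

# True if the code points form one unbroken run from "а" to "я".
is_contiguous = code_points == list(range(ord(u'а'), ord(u'я') + 1))

# True if "ё" lies outside that run and therefore has to be listed separately.
yo_outside = not (ord(u'а') <= ord(u'ё') <= ord(u'я'))

print('contiguous: %s, yo outside: %s' % (is_contiguous, yo_outside))
```

Note that this only says something about Unicode code points, not about the bytes of a non-Unicode string literal.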
By gradually adding the letters at which the search with the first expression stopped, I arrived at the class а-яА-ЯёЁшьрэтфцюыхущчъ. Sorting the added letters gives an almost continuous interval: "ртуфхцчшщъыьэю".
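A small sketch of that sorting step, again with Unicode literals so it runs on both Python 2 and Python 3. The name full_interval is just an illustrative helper holding the letters from "р" to "ю".

```python
# -*- coding: utf-8 -*-
# Letters that had to be added to the first expression, in the order used in regex2.
added = u'шьрэтфцюыхущчъ'
sorted_added = u''.join(sorted(added))

# The complete run of letters from "р" to "ю", for comparison.
full_interval = u'рстуфхцчшщъыьэю'

# Letters of the interval that did not have to be added by hand.
missing = sorted(set(full_interval) - set(added))

print(sorted_added)
print(u''.join(missing))
```

The only letter of the interval that is missing from the added set is "с".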
If the capital letters, i.e. "[А-ЯЁ]", are removed, the search unexpectedly stops at "с". Once "с" is added as well, the interval becomes continuous: from "р" to "ю". This is regex3.
Finally, it turns out that the added letters can now be collapsed into the range р-ю, and even "ё" can be dropped (regex4).
What is going on?
    python --version
    Python 2.7.6
If both the string and the regular expression are explicitly made Unicode, everything works as it should. But somehow it partly works without that too. How?
    test2 = u'широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства'
    regex5 = re.compile(u'[а-яА-ЯёЁ\s]+')
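One possible way to compare the two cases side by side is sketched below. It assumes the source file is saved as UTF-8, in which case a plain '...' literal in Python 2 holds the UTF-8 bytes of the text; encoding explicitly makes those bytes visible and lets the same sketch run on Python 3. The shortened sample text and the variable names are illustrative, not part of the original script.

```python
# -*- coding: utf-8 -*-
import re

text = u'широкая электрификация'
pattern = u'[а-яА-ЯёЁ\\s]+'

# Unicode case: the class is built from code points.
unicode_match = re.compile(pattern).search(text).group()

# Byte case: the same pattern and text as UTF-8 bytes, which is what
# non-Unicode literals contain in a UTF-8 encoded Python 2 source file.
byte_pattern = pattern.encode('utf-8')
byte_text = text.encode('utf-8')
byte_match = re.compile(byte_pattern).search(byte_text).group()

# repr() shows that every Cyrillic letter in the class became two bytes,
# so each "-" now sits between single bytes rather than between letters.
print(repr(byte_pattern))
print('unicode match: %d of %d characters' % (len(unicode_match), len(text)))
print('byte match: %d of %d bytes' % (len(byte_match), len(byte_text)))
```

In the byte case the match stops after the first byte of "ш", because the second byte of that letter is not among the single bytes the class ends up containing.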