Russian characters in Python regular expressions

Question

    if "!anon" in message:
        ss = re.compile(ur"^!anon")
    else:
        ss = re.compile(ur"^!анон")
    print(message)
    message = ss.sub(u"", message).encode("utf-8").strip()

Why is the prefix removed when the message starts with !anon, but not when it starts with !анон?

Related question: How to remove the left part of a string? - jfs

Answer 1 · 2017-07-31T03:26:41

Use Unicode in Python to work with text.

Your Python 2 code works if isinstance(message, unicode) is true. Example:

    import re

    print re.sub(u'б', '', 'абба')   # XXX DOESN'T WORK
    # -> абба
    print re.sub(u'б', '', u'абба')
    # -> аа

In Python 2, the 'б' inside the 'абба' literal is by default a sequence of bytes (an object of type bytes, which is an alias for str in Python 2), the analogue of b'\xd0\xb1' in Python 3. This assumes the Python 2 source file is declared with utf-8 encoding; with a different encoding the same letter would be represented by a different byte sequence. Note that in Python 3 you cannot even write b'б': it is a SyntaxError.
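A small Python 3 sketch of the byte-level point above. The cp1251 comparison is an illustration added here, not something from the original code; it only shows that the bytes depend on the encoding.

```python
# Python 3: the same letter yields different bytes under different encodings.
letter = 'б'
utf8_bytes = letter.encode('utf-8')      # two bytes in UTF-8
cp1251_bytes = letter.encode('cp1251')   # one byte in Windows-1251

print(utf8_bytes)    # b'\xd0\xb1'
print(cp1251_bytes)  # b'\xe1'

# A bytes literal may contain only ASCII characters, so b'б' does not compile.
try:
    compile("b'б'", '<example>', 'eval')
    bytes_literal_ok = True
except SyntaxError:
    bytes_literal_ok = False

print(bytes_literal_ok)  # False
```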

Do not mix bytes and Unicode. Python 3 would raise a TypeError if you tried to use a Unicode regular expression when isinstance(message, bytes) is true. Python 2 instead does something behind the scenes that is analogous to u'б'.encode('latin-1'), which leads to the bug: no exception is raised, and the pattern simply never matches.
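To see the Python 3 behaviour described here, the following sketch applies a str pattern to bytes and then to decoded text. The variable names are illustrative, and the input is assumed to be UTF-8.

```python
import re

pattern = re.compile('б')        # str (Unicode) pattern
raw = 'абба'.encode('utf-8')     # bytes, e.g. as received from a socket

try:
    pattern.sub('', raw)         # str pattern applied to a bytes subject
    mixed_error = None
except TypeError as exc:
    mixed_error = exc            # Python 3 refuses to mix the two types

# Decode first, so the pattern and the subject are both text.
text = raw.decode('utf-8')
result = pattern.sub('', text)   # 'аа'

assert isinstance(mixed_error, TypeError)
assert result == 'аа'
```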

In general, you may also need the re.UNICODE flag so that, for example, the regular expression \w+ recognizes Unicode letters and digits.
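A rough illustration of what the flag changes, written for Python 3, where str patterns are already Unicode-aware and re.ASCII switches that off. The re.ASCII result approximates what Python 2 gives without re.UNICODE; the sample text is made up.

```python
import re

text = 'привет world 42'

# In Python 3, str patterns match Unicode word characters by default.
unicode_words = re.findall(r'\w+', text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the Cyrillic word is skipped.
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

assert unicode_words == ['привет', 'world', '42']
assert ascii_words == ['world', '42']
```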

Remove .encode('utf-8'): that path leads to mojibake (garbled text). Use bytes only at the boundary with interfaces that explicitly require them. Inside the program, pass text around as Unicode.
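One possible shape for this advice in Python 3: decode at the input boundary, work on text, and encode only at the output boundary. The helper name strip_anon_prefix and the combined pattern are hypothetical, and UTF-8 is assumed for both input and output.

```python
import re

# Hypothetical pattern covering both spellings of the command from the question.
PREFIX = re.compile(r'^!(?:anon|анон)')

def strip_anon_prefix(raw: bytes) -> bytes:
    """Remove a leading !anon / !анон command from a UTF-8 encoded message."""
    text = raw.decode('utf-8')            # bytes -> text at the input boundary
    text = PREFIX.sub('', text).strip()   # all processing happens on text
    return text.encode('utf-8')           # text -> bytes at the output boundary

latin = strip_anon_prefix('!anon hello'.encode('utf-8'))
cyrillic = strip_anon_prefix('!анон привет'.encode('utf-8'))

assert latin == b'hello'
assert cyrillic == 'привет'.encode('utf-8')
```

If the caller accepts text directly, the final encode step can be dropped and the function can return str.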
