    if "!anon" in message:
        ss = re.compile(ur"^!anon")
    else:
        ss = re.compile(ur"^!анон")
    print(message)
    message = ss.sub(u"", message).encode("utf-8").strip()

Why does this remove the prefix when the message starts with !anon, but not when it starts with !анон?
Use Unicode in Python to work with text.
Your Python 2 code works if isinstance(message, unicode) is true, that is, if message is a Unicode string. Example:
    # -*- coding: utf-8 -*-
    import re
    print re.sub(u'б', '', 'абба')   # XXX DOESN'T WORK -> абба
    print re.sub(u'б', '', u'абба')  # -> аа

In Python 2, a plain string constant such as 'абба' creates a sequence of bytes by default (an object of type bytes), so the 'б' inside it is the analogue of b'\xd0\xb1' in Python 3. This assumes the Python 2 source file is declared with utf-8 encoding; with another encoding the same letter would be represented by a different byte sequence. Note that you can't even write b'б' in Python 3 (it is a SyntaxError).
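To make the byte representation concrete, here is a small sketch. It assumes Python 3 (the question itself uses Python 2), and the variable names are purely illustrative. It shows that the same letter maps to different bytes depending on the encoding.

```python
# Python 3: str is Unicode text, bytes is raw bytes.
letter = 'б'  # U+0431 CYRILLIC SMALL LETTER BE

utf8_bytes = letter.encode('utf-8')      # two bytes in UTF-8
cp1251_bytes = letter.encode('cp1251')   # one, different byte in Windows-1251

print(utf8_bytes)    # b'\xd0\xb1'
print(cp1251_bytes)  # b'\xe1'

# Four Cyrillic letters occupy eight bytes in UTF-8.
print(len('абба'), len('абба'.encode('utf-8')))  # 4 8
```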
Do not mix bytes and Unicode. Python 3 would raise a TypeError if you tried to use a Unicode regular expression when isinstance(message, bytes) is true. Python 2 instead performs something analogous to u'б'.encode('latin-1') behind the scenes, which leads to the bug: the pattern silently fails to match.
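The Python 3 behaviour can be checked with a short sketch. It assumes Python 3 and UTF-8 encoded input; `data` stands in for bytes received from some external interface and is a hypothetical name.

```python
import re

pattern = re.compile('б')        # str (Unicode) pattern
data = 'абба'.encode('utf-8')    # bytes, e.g. as read from a socket

try:
    pattern.sub('', data)        # str pattern applied to bytes
    mixed_ok = True
except TypeError as exc:
    mixed_ok = False
    print('TypeError:', exc)

# Decode first, then apply the text pattern to text.
cleaned = pattern.sub('', data.decode('utf-8'))
print(cleaned)  # аа
```

Python 3 refuses the mix outright, whereas Python 2 lets it through and quietly returns the input unchanged.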
In general, you may also need the re.UNICODE flag so that, for example, the \w+ regular expression recognizes Unicode letters and digits.
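The effect of the flag can be sketched in Python 3, where str patterns are Unicode-aware by default and re.ASCII switches that off. The re.ASCII result roughly mirrors what Python 2 does when re.UNICODE is not passed. The sample text is made up for illustration.

```python
import re

text = 'привет 123 hello'

# Python 3 default for str patterns: \w covers Unicode letters and digits.
unicode_words = re.findall(r'\w+', text)

# re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

print(unicode_words)  # ['привет', '123', 'hello']
print(ascii_words)    # ['123', 'hello']
```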
Remove .encode('utf-8'): that path leads to mojibake (garbled text). Use bytes only at the boundary with interfaces that explicitly require them. Inside the program, pass text around as Unicode.
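One possible way to apply this advice to the code from the question is sketched below. It assumes Python 3 and UTF-8 input; `ANON_PREFIX` and `strip_anon_prefix` are hypothetical names, and merging both spellings into one pattern is a simplification, not the asker's original logic.

```python
import re

# Hypothetical pattern: either spelling of the prefix at the start of the text.
ANON_PREFIX = re.compile(r'^!(?:anon|анон)')

def strip_anon_prefix(raw):
    """Decode bytes at the boundary, then work on Unicode text only."""
    text = raw.decode('utf-8') if isinstance(raw, bytes) else raw
    return ANON_PREFIX.sub('', text).strip()

print(strip_anon_prefix('!анон привет'.encode('utf-8')))  # привет
print(strip_anon_prefix('!anon hello'))                   # hello

# Encode only when an interface explicitly requires bytes.
outgoing = strip_anon_prefix('!анон привет').encode('utf-8')
```

The function returns text, so the rest of the program never needs to know which encoding the message arrived in.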
Source: https://ru.stackoverflow.com/questions/699840/