For example, I have this string:

desc=u"привет 123123123 🙆🏼🙆🏼🙆🏼 тут какой то текст 12349! abcde 123" 

I found a partial solution:

 re.sub(r'[^\x00-\x7F]+',' ', desc) 

or

 "".join(filter(lambda x: ord(x)<128,desc.decode('utf-8'))) 

but the problem is that all Cyrillic characters are deleted as well, and the result is:

  123123123 12349! abcde 123 

The string may also contain m², which is a special character too. I would like to keep it.

  • 3
    Keep this character, drop that one... What exactly should remain in the result? Formalize the requirement and the question will answer itself. - m9_psy
  • 2
    Why not add the Cyrillic characters to the regular expression, or remove only the unwanted characters (instead of everything that is not below 128)? - Arnial

2 answers

The simplest option is the brute-force one: create a list of "allowed" characters and delete everything else!
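
A minimal sketch of that whitelist idea, assuming Python 3 and assuming that "allowed" means ASCII, the basic Cyrillic block (U+0400 to U+04FF) and the superscript two from m². The ranges and the names DISALLOWED and keep_allowed are illustrative choices, not something the question specifies:

```python
import re

# Assumed whitelist: ASCII, the Cyrillic block U+0400-U+04FF, and U+00B2 (superscript two).
# Adjust the ranges to whatever you actually want to keep.
DISALLOWED = re.compile(r'[^\x00-\x7F\u0400-\u04FF\u00B2]+')

def keep_allowed(text):
    """Replace each run of characters outside the whitelist with a single space."""
    return DISALLOWED.sub(' ', text)

desc = "привет 123123123 🙆🏼🙆🏼🙆🏼 тут какой то текст 12349! abcde 123"
cleaned = keep_allowed(desc)
print(cleaned)
```

Each run of unwanted characters becomes one space, the same way the re.sub call in the question behaves, so the Cyrillic words, the digits and m² survive while the emoji do not.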

  • This is the hackiest option :/ - John Doe
  • @JohnDoe, it is the only one so far. The creators of Python did not have this problem. - Mihail Ris
  • I found a solution. It is simple, and there is no need to create crazy lists of allowed characters. - John Doe
  • @JohnDoe: there is no miracle. What counts as an "unwanted character" is not written in any standard, so either way you will have to specify the ranges of characters you want to keep (a whitelist) and the characters you want to exclude (a blacklist). For example, you may want to exclude emoji characters such as 🙆 (U+1F646 FACE WITH OK GESTURE). - jfs
  • It does happen, in this particular case for sure. Those weird characters are four-byte Unicode, and MySQL cannot handle them out of the box. This is solved by a substitution with the regular expression re.compile(u'[\U00010000-\U0010ffff]') - John Doe
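
The regular expression from the last comment can be tried out roughly like this. This is only a sketch, assuming Python 3 (on a narrow Python 2 build the character range may not compile); the names ASTRAL and strip_astral are made up for the example:

```python
import re

# Matches every code point above U+FFFF, i.e. the characters
# that take four bytes when encoded as UTF-8.
ASTRAL = re.compile('[\U00010000-\U0010FFFF]')

def strip_astral(text):
    """Delete all characters outside the Basic Multilingual Plane."""
    return ASTRAL.sub('', text)

print(strip_astral("abc 🙆🏼 привет m²"))
```

Note that this removes only the four-byte characters. Any unwanted character below U+10000 stays in the string, so for those a whitelist or blacklist of ranges is still needed.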

To disable special characters, put r before the quotes. Or create a list of "unwanted" characters.