I need to split a text into words so that everything except Russian words is excluded, including all punctuation. wordpunct_tokenize does not allow this. It seems this can be done with RegexpTokenizer by giving it a regular expression. Tell me which regular expression can be used for this, or suggest other tokenizers that give me what I need.
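The RegexpTokenizer route mentioned above can be sketched like this (the character class, including ё/Ё explicitly, is my assumption of what "Russian letters" should mean; NLTK's RegexpTokenizer needs no downloaded data for this):

```python
from nltk.tokenize import RegexpTokenizer

# Keep only runs of Cyrillic letters; ё/Ё fall outside the А-я range,
# so they are listed explicitly (assumed pattern, not from the thread).
tokenizer = RegexpTokenizer(r'[А-Яа-яЁё]+')

words = tokenizer.tokenize('one, два, три, ёлка!')
print(words)  # ['два', 'три', 'ёлка']
```

Punctuation and the Latin-script words are dropped because they never match the pattern; only contiguous Cyrillic runs are returned as tokens.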
1 answer
import re

for word in re.findall(r'[А-Яа-я]+', 'one, два, три, four', re.U):
    print(word)
- The letter ё/Ё is forgotten. - Wiktor Stribiżew
- 1. This is not true: not all Russian letters are covered. 2. If you give example code for Python 3, it is worth saying so explicitly when the question has no corresponding tag. Unicode and normalization problems are also worth mentioning. 3. In general, words are not required to contain only letters. - jfs
- In any case, you will need a dictionary of exceptions for hyphenated words and for abbreviations containing spaces and periods: "someone", "BC", "etc.". - bl79
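A sketch that addresses the two points raised in the comments above, namely the missing ё/Ё and hyphenated words (the pattern is my assumption, not part of the original answer; abbreviations with internal spaces and periods would still need the exceptions dictionary mentioned above):

```python
import re

# A Cyrillic word, optionally with hyphen-joined parts
# (e.g. "кто-нибудь"); ё/Ё are included explicitly because
# they lie outside the А-я code-point range.
WORD = re.compile(r'[А-Яа-яЁё]+(?:-[А-Яа-яЁё]+)*')

text = 'кто-нибудь взял ёжика, then left'
print(WORD.findall(text))  # ['кто-нибудь', 'взял', 'ёжика']
```

The non-capturing group `(?:-...)` keeps `findall` returning whole matches, so a hyphenated word comes back as a single token instead of being split at the hyphen.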
- … (nltk.download()). - jfs