I need to split a text into words and, at the same time, drop everything except Russian words — that is, exclude all non-Russian words and all punctuation. wordpunct_tokenize does not allow this. It seems this can be done with RegexpTokenizer by giving it a regular expression. Tell me which regular expression would do this, or suggest other tokenizers that can give me what I need.

  • Is “Saltykov-Shchedrin” one word or two? Do you need an example using nltk? (Download the Russian data via nltk.download().) - jfs
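A minimal sketch of the approach the question asks about. The pattern below is an assumption: it covers the Russian alphabet including ё, and treats hyphenated words such as "Салтыков-Щедрин" (jfs's question above) as one token. The sample sentence is illustrative, not from the original question; the same pattern can be passed to nltk's RegexpTokenizer, whose tokenize() is equivalent to re.findall here.

```python
import re

# Russian letters including ё; an optional hyphenated tail keeps
# words like "Салтыков-Щедрин" as a single token.
PATTERN = r'[а-яё]+(?:-[а-яё]+)*'

text = 'Салтыков-Щедрин wrote, ещё in the 19th century, сказки.'
words = re.findall(PATTERN, text, re.IGNORECASE)
print(words)  # ['Салтыков-Щедрин', 'ещё', 'сказки']
```

With nltk installed, `RegexpTokenizer(PATTERN, flags=re.IGNORECASE).tokenize(text)` gives the same list.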

1 answer

    import re
    for word in re.findall(r'[А-Яа-я]+', 'one, два, три, four', re.U):
        print(word)
  • The letter ё/Ё is forgotten. - Wiktor Stribiżew
  • 1 - This is not true: not all Russian letters are covered. 2 - If the code example is for Python 3, it is worth saying so explicitly, since the question does not have the corresponding tag. It is also worth mentioning Unicode and normalization problems. 3 - In general, words are not required to contain only letters. - jfs
  • In any case, you will need an exception dictionary for words containing hyphens and for abbreviations with spaces and periods: "someone", "BC", "etc.". - bl79
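The normalization problem jfs mentions can be shown concretely. A letter like й may arrive either precomposed (U+0439) or decomposed (и followed by a combining breve, U+0306); a character-class regex only matches the precomposed form, so a decomposed word gets split. Normalizing the input to NFC first avoids this. The example below is a sketch under that assumption:

```python
import re
import unicodedata

# "йод" with a decomposed й: и (U+0438) + combining breve (U+0306)
decomposed = 'и\u0306од'

# The combining mark is not in [а-яё], so the word is split in two.
print(re.findall(r'[а-яё]+', decomposed, re.I))  # ['и', 'од']

# After NFC normalization the pair collapses to й (U+0439) and matches.
normalized = unicodedata.normalize('NFC', decomposed)
print(re.findall(r'[а-яё]+', normalized, re.I))  # ['йод']
```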