I need to split a text into words and, at the same time, drop everything except Russian words — that is, exclude all non-Russian words and all punctuation. wordpunct_tokenize does not allow this. It seems this can be done with RegexpTokenizer by giving it a regular expression. Tell me which regular expression would do this, or suggest other tokenizers that can give me what I need.

  • Is “Saltykov-Shchedrin” one word or two? Do you need an example using nltk? (Download the Russian data via nltk.download().) - jfs
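A minimal sketch of the approach the question asks about. The pattern below is an assumption: it covers the Russian alphabet including ё, and treats hyphenated words such as "Салтыков-Щедрин" (jfs's question above) as one token. The sample sentence is illustrative, not from the original question; the same pattern can be passed to nltk's RegexpTokenizer, whose tokenize() is equivalent to re.findall here.

```python
import re

# Russian letters including ё; an optional hyphenated tail keeps
# words like "Салтыков-Щедрин" as a single token.
PATTERN = r'[а-яё]+(?:-[а-яё]+)*'

text = 'Салтыков-Щедрин wrote, ещё in the 19th century, сказки.'
words = re.findall(PATTERN, text, re.IGNORECASE)
print(words)  # ['Салтыков-Щедрин', 'ещё', 'сказки']
```

With nltk installed, `RegexpTokenizer(PATTERN, flags=re.IGNORECASE).tokenize(text)` gives the same list.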

1 answer

    import re
    for word in re.findall(r'[А-Яа-я]+', 'one, два, три, four', re.U):
        print(word)
  • The letter ё/Ё is forgotten. - Wiktor Stribiżew
  • 1 - This is not true: not all Russian letters are covered. 2 - If the code example is for Python 3, it is worth saying so explicitly, since the question does not have the corresponding tag. It is also worth mentioning Unicode and normalization problems. 3 - In general, words are not required to contain only letters. - jfs
  • In any case, you will need an exception dictionary for words containing hyphens and for abbreviations with spaces and periods: "someone", "BC", "etc.". - bl79
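The normalization problem jfs mentions can be shown concretely. A letter like й may arrive either precomposed (U+0439) or decomposed (и followed by a combining breve, U+0306); a character-class regex only matches the precomposed form, so a decomposed word gets split. Normalizing the input to NFC first avoids this. The example below is a sketch under that assumption:

```python
import re
import unicodedata

# "йод" with a decomposed й: и (U+0438) + combining breve (U+0306)
decomposed = 'и\u0306од'

# The combining mark is not in [а-яё], so the word is split in two.
print(re.findall(r'[а-яё]+', decomposed, re.I))  # ['и', 'од']

# After NFC normalization the pair collapses to й (U+0439) and matches.
normalized = unicodedata.normalize('NFC', decomposed)
print(re.findall(r'[а-яё]+', normalized, re.I))  # ['йод']
```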