I urgently need to split Russian text from a file into sentences. A simple split on ., ! or ? will not work. Abbreviations have to be taken into account: ones such as "т. о.", "и др.", "и т. д."; an abbreviation before a proper name ("г. Москва", i.e. "city of Moscow"); initials as in "Иванов И. И."; and others. My regular expression code currently looks like this:

 string[] splitSentences = Regex.Split(sTemp, @"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)"); 

Clearly this is not enough. Please help.

  • One option: ignore a period that follows a word of one or two letters. - Primus Singularis
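The heuristic from the comment above could be sketched roughly as follows. This is only an illustration, not a complete solution: the class name SentenceSplitter, the method names, and the short abbreviation list are all hypothetical, and a real list would need to be much longer. A period is not treated as a sentence end when the word before it has one or two letters or appears in the list.

```csharp
using System;
using System.Collections.Generic;

public static class SentenceSplitter
{
    // Hypothetical, deliberately short list of longer abbreviations that do not end a sentence.
    private static readonly HashSet<string> KnownAbbreviations =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "стр", "руб", "напр" };

    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        int start = 0;
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c != '.' && c != '!' && c != '?') continue;

            // Consume a run of terminators such as "..." or "?!".
            int end = i;
            while (end + 1 < text.Length && ".!?".IndexOf(text[end + 1]) >= 0) end++;

            // A boundary is either the end of the text, or whitespace followed by an uppercase letter.
            int next = end + 1;
            while (next < text.Length && char.IsWhiteSpace(text[next])) next++;
            bool atEnd = next >= text.Length;
            bool boundaryShape = atEnd || (next > end + 1 && char.IsUpper(text[next]));

            // A single period after a short word or a listed abbreviation is not a boundary.
            bool abbreviation = c == '.' && end == i && !atEnd && IsShortWordOrAbbreviation(text, i);

            if (boundaryShape && !abbreviation)
            {
                sentences.Add(text.Substring(start, end + 1 - start).Trim());
                start = next;
            }
            i = end; // skip the rest of the terminator run
        }

        // Keep any trailing text that has no closing punctuation.
        if (start < text.Length)
        {
            string tail = text.Substring(start).Trim();
            if (tail.Length > 0) sentences.Add(tail);
        }
        return sentences;
    }

    // Looks at the run of letters immediately before the period.
    private static bool IsShortWordOrAbbreviation(string text, int dotIndex)
    {
        int wordStart = dotIndex;
        while (wordStart > 0 && char.IsLetter(text[wordStart - 1])) wordStart--;
        string word = text.Substring(wordStart, dotIndex - wordStart);
        return word.Length > 0 && (word.Length <= 2 || KnownAbbreviations.Contains(word));
    }
}
```

Calling SentenceSplitter.Split(sTemp) returns the sentences as a list. The trade-off of this heuristic is that a sentence that really ends in a one- or two-letter word, or in initials followed by a new sentence, is glued to the next one.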

1 answer

I certainly can't vouch that the resulting sentences will be 100% correct.

 [А-ЯЁ][\S\s]+?(?:[\S][^А-ЯЁ\.]){1,}(?:\.+|[?!])(?!(\s*[а-яё)\-"«0-9\.])) 

But this variant worked for the particular cases of "г. Москва" and "Иванов И.И."
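Since this pattern matches whole sentences rather than the gaps between them, in C# it would presumably be used with Regex.Matches instead of Regex.Split. A minimal sketch, assuming the pattern is copied unchanged; the class name AnswerPatternDemo and the method FindSentences are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnswerPatternDemo
{
    // The pattern from the answer, unchanged.
    // Inside a C# verbatim string the double quote is written as "".
    private const string SentencePattern =
        @"[А-ЯЁ][\S\s]+?(?:[\S][^А-ЯЁ\.]){1,}(?:\.+|[?!])(?!(\s*[а-яё)\-""«0-9\.]))";

    // Returns every substring the pattern recognizes as a sentence.
    public static List<string> FindSentences(string text)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(text, SentencePattern))
        {
            result.Add(m.Value.Trim());
        }
        return result;
    }
}
```

For example, foreach (var s in AnswerPatternDemo.FindSentences(sTemp)) Console.WriteLine(s); prints one match per line. As noted above, the boundaries are not guaranteed, so the output is worth checking on real text.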

Comments are welcome if any part of it is an eyesore!

I tested it here.