Splitting text into sentences in C#

Question

I urgently need to split Russian text from a file into sentences. A simple split on ".", "!" or "?" will not work. The code has to account for abbreviations such as "т. о." (thus), "и др." (and others) and "и т. д." (and so on); for abbreviations before a proper name ("г. Москва", the city of Moscow); for names with initials such as "Иванов И. И."; and so on. At the moment the regular expression code looks like this:

 string[] splitSentences = Regex.Split(sTemp, @"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)");

Clearly this is not enough. Please help.

One option: ignore a period that follows a word of one or two letters. - Primus Singularis
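The idea in this comment can be sketched roughly as follows. This is only an illustration, not code from the asker or the commenter: the class name `ShortWordSplitter` and the sample text are made up, and the pattern assumes that a period after a word of one or two letters never ends a sentence. It relies on .NET's support for variable-length lookbehind.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class ShortWordSplitter
{
    // Split on whitespace that follows '.', '!' or '?', unless that period
    // closes a word of one or two letters (for example "г.", "И.", "др.").
    // Assumption: such short words are always abbreviations or initials.
    private static readonly Regex Boundary =
        new Regex(@"(?<=[.!?])(?<!\b\p{L}{1,2}\.)\s+");

    public static string[] Split(string text) => Boundary.Split(text.Trim());
}

public static class ShortWordSplitterDemo
{
    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        string text = "Иванов И. И. живёт в г. Москва. Он работает программистом! А вы?";

        foreach (string sentence in ShortWordSplitter.Split(text))
            Console.WriteLine(sentence);
        // Prints three lines:
        // Иванов И. И. живёт в г. Москва.
        // Он работает программистом!
        // А вы?
    }
}
```

The heuristic has obvious gaps: a real sentence that ends in a short word ("Это он.") is not split off, and a sentence that ends in an abbreviation such as "и т. д." is merged with the next one.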

Answer 1 · 2018-05-19T16:26:14

I certainly can't vouch that the resulting sentences will be 100% correct.

 [А-ЯЁ][\S\s]+?(?:[\S][^А-ЯЁ\.]){1,}(?:\.+|[?!])(?!(\s*[а-яё)\-"«0-9\.]))

But this variant did work for the particular cases of "г. Москва" and "Иванов И.И."
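A minimal sketch of how the pattern above might be used from C#. Because the pattern describes a whole sentence rather than a delimiter, it is applied here with `Regex.Matches` instead of `Regex.Split`. The class name `SentenceExtractor` and the sample text are assumptions made for illustration, and the double quote inside the character class is doubled because the pattern is written as a verbatim string.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class SentenceExtractor
{
    // The pattern from the answer, unchanged apart from the doubled quote
    // required inside a C# verbatim string literal.
    private static readonly Regex Sentence = new Regex(
        @"[А-ЯЁ][\S\s]+?(?:[\S][^А-ЯЁ\.]){1,}(?:\.+|[?!])(?!(\s*[а-яё)\-""«0-9\.]))");

    // Collect every match of the pattern as one sentence.
    public static List<string> Extract(string text)
    {
        var sentences = new List<string>();
        foreach (Match m in Sentence.Matches(text))
            sentences.Add(m.Value);
        return sentences;
    }
}

public static class SentenceExtractorDemo
{
    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        string text = "Иванов И.И. живёт в г. Москва. Он работает программистом! А вы?";

        foreach (string sentence in SentenceExtractor.Extract(text))
            Console.WriteLine(sentence);
        // Prints three lines:
        // Иванов И.И. живёт в г. Москва.
        // Он работает программистом!
        // А вы?
    }
}
```

Note that a match can only begin at an uppercase Cyrillic letter, so any text that does not start that way (for example a sentence opening with a digit or a Latin word) is skipped rather than returned.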

I welcome comments if any part of it is an eyesore!

I tested it here.
