How to break the text into separate sentences? [duplicate]

Question

This question has already been answered:

Count the number of sentences in the text (8 answers)

How do I split a text into separate sentences? splitlines() is not suitable, since the whole text may be written on a single line.

Answer 1 · 2013-02-27T17:46:46

The expression does not split on tokens such as 1980, 100rub., 100r., 100kop, 100k and "etc.", as well as combined punctuation marks.
Code here
http://ideone.com/pNpffv

ReinRaus


@ReinRaus: It seems to me it would be better to implement these semantics: the punctuation mark(s) must be followed by a space, and the next word must start with a capital letter. A year or a price can perfectly well end a sentence... - VladD
Uh, doesn't every new sentence begin with a capital letter? Why not look for the places where a period is followed by a capital letter? Of course there are exceptions, such as names, cities, etc., but that already grows into a whole project. Ugh, a duplicate. @VladD Well, never mind. - lampa 9:46 pm
UPD: Fixed yesterday's errors. It turns out the cause was not that I was feeling unwell, but the problem itself. - ReinRaus
Interesting: can there be nested sentences? To clarify: there is a sentence, a quotation begins inside it (in quotation marks. Or in brackets, like this), and then the sentence continues. How do you parse that kind of mess? In theory this is a question for a Russian-language forum, but I don't feel like registering there (the problem matters too little to me). - avp
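The heuristic VladD suggests in the comments (punctuation, then whitespace, then a capital letter) can be sketched roughly as follows. This is not ReinRaus's expression from the ideone link; the helper name `split_on_capital` and the character class for uppercase letters (Latin plus Cyrillic) are assumptions made for illustration.

```python
import re

# Assumed boundary rule: one of . ! ? immediately before the gap,
# whitespace in the gap, and an uppercase Latin or Cyrillic letter after it.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZА-ЯЁ])")


def split_on_capital(text):
    """Split text where terminal punctuation is followed by a capitalised word."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]


sample = "It cost 100 rub. in 1980. Then prices rose... Nobody was surprised!"
for sentence in split_on_capital(sample):
    print(sentence)
```

Under this rule "100 rub. in" is not split because the next word is lowercase, while an ellipsis followed by a capitalised word is. As lampa notes, an abbreviation followed by a proper name would still be split incorrectly.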


Answer 2 · 2013-02-27T09:47:15

parts = all_text.split('.')

qnub


Then at least re.compile("[.!?\n]").split(all_text), or re.split("[.!?\n]", all_text) - alexlz
How do you deal with a triple dot ("...")? - moden
If you give a hungry man a fish, he will eat only once, but if you teach him to fish, he will always be full... - qnub
So it would be better with composite delimiters: re.split("\\b[.!?\\n]+(?=\\s)", all_text) - ReinRaus
@moden First, "..." is called an ellipsis (at least it was when I was in school). Second, there is also a sign like ?!? . You can approach the task "creatively" and drop the empty strings: filter(lambda x: not re.match("^\s*$", x), re.split("[!.?\n]", all_text)). However, I suspect that the punctuation marks should be kept in the resulting list. In that case, just search by the pattern [^!.?\n]+[!.?\n]+ . That still does not give a 100% correct result, for example on "In 1998 there was a default." - alexlz
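alexlz's last suggestion, searching by pattern so that the punctuation stays attached to each sentence, might look like the sketch below. The function name `sentences_with_punctuation` and the sample strings are assumptions; only the pattern itself comes from the comment.

```python
import re


def sentences_with_punctuation(text):
    """Return chunks of text together with the run of . ! ? or newline that ends them."""
    chunks = re.findall(r"[^!.?\n]+[!.?\n]+", text)
    # Strip surrounding whitespace (including a trailing newline terminator)
    # and drop chunks that end up empty.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


print(sentences_with_punctuation("Really?! Yes... In 1998 there was a default."))
print(sentences_with_punctuation("It cost 100 rub. in 1998."))
```

Runs such as "?!" and "..." stay in one piece, but a period inside an abbreviation is still treated as the end of a sentence, and any trailing text without a terminator is dropped because the pattern requires one.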


Answer 3 · 2013-02-27T10:05:28

    s = "Properties are a little different. They need a special declaration since they're handled in a very different way. (Hmmmm... I may have figured out an obvious way around that, but I want to get this out the door first.) Here's how you'd mock out calls to a property. Note that unlike other calls, all the calls to an overridden property must be played back in order."

    def srtip_sent(str_):
        # Cut the string right after every '.', '?' or '!' character.
        separators = ['.', '?', '!']
        start = 0
        s_split = []
        for i in range(len(str_)):
            if str_[i] in separators:
                s_split.append(str_[start:i+1])
                start = i + 1
        # Trim the whitespace left at the edges of each piece.
        return [part.strip() for part in s_split]

    srtip_sent(s)

    ['Properties are a little different.',
     "They need a special declaration since they're handled in a very different way.",
     '(Hmmmm.',
     '.',
     '.',
     'I may have figured out an obvious way around that, but I want to get this out the door first.',
     ") Here's how you'd mock out calls to a property.",
     'Note that unlike other calls, all the calls to an overridden property must be played back in order.']

It does not work correctly with compound punctuation, for example with a triple dot ("...").

Nor with punctuation marks inside a sentence, as in:
"What the heck is this?" asked Alice.
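One possible way to address the first limitation while keeping the same character-by-character loop is to treat a run of consecutive separators as a single terminator. The sketch below is an assumption about how that could be done, not part of the original answer; the name `strip_sent_runs` is hypothetical.

```python
def strip_sent_runs(text, separators=".?!"):
    """Split after each run of separator characters, so '...' or '?!' ends one sentence."""
    sentences = []
    start = 0
    i = 0
    while i < len(text):
        if text[i] in separators:
            # Advance to the last character of the current run of separators.
            while i + 1 < len(text) and text[i + 1] in separators:
                i += 1
            sentences.append(text[start:i + 1].strip())
            start = i + 1
        i += 1
    return sentences


print(strip_sent_runs("(Hmmmm... I may have it.) Here's how."))
```

The ellipsis now stays attached to "(Hmmmm...", but the closing bracket still ends up at the start of the next piece, and quoted questions inside a sentence are still split, so the second limitation remains.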


Reported as a duplicate by Oceinic , Qwertiy ♦ , Streletz , Alex , Suvitruf ♦ Nov 21 '15 at 21:59 .
