Split text into sentences while preserving the separator

Question

It is enough to assume that a sentence ends with this sequence:

"lowercase letter" - ". or ! or ?" - "space" - "capital letter"

For example:

"Hello! I am a simple text. Can you split me?"

["Hello!", "I am a simple text.", "Can you split me?"]

I made an attempt, but it did not work:

re.split(r'\w[.!?]+\s+[А-Я]', "Hello! I'm John. Are you OK? fine... and so")

Accepted Answer · 2016-08-10T03:43:42

We split on a space, but use a positive lookbehind to make sure that the space is preceded by a word character followed by ".", "!" or "?":

    import re

    result = re.split(r'(?<=\w[.!?]) ', "Hello! I'm John. Are you OK? fine... and so")
    print(result)

    result = re.split(r'(?<=\w[.!?]) ', "Привет! Я простой текст. Ты сможешь разделить меня?")
    print(result)

Result:

    ['Hello!', "I'm John.", 'Are you OK?', 'fine... and so']
    ['Привет!', 'Я простой текст.', 'Ты сможешь разделить меня?']
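The pattern above splits after any sentence-ending mark, even when a lowercase word follows. If the "capital letter" condition from the question should also hold, a lookahead can check it without consuming the letter. This is only a sketch: the name `split_before_capital` is made up for illustration, and the class `[A-ZА-ЯЁ]` assumes that only Latin and Russian capitals matter.

```python
import re

# Lookbehind: word character + sentence-ending mark.
# Lookahead: the next sentence must start with a capital letter
# (assumption: only Latin and Russian capitals are relevant).
BOUNDARY = re.compile(r'(?<=\w[.!?])\s+(?=[A-ZА-ЯЁ])')

def split_before_capital(text):
    """Split on whitespace that sits between a sentence end and a capital letter."""
    return BOUNDARY.split(text)

english = split_before_capital("Hello! I'm John. Are you OK? fine... and so")
russian = split_before_capital("Привет! Я простой текст. Ты сможешь разделить меня?")
print(english)
print(russian)
```

With this variant "Are you OK? fine... and so" stays in one piece, because "fine" starts with a lowercase letter.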

P.S. I did not check Unicode handling. Tested on https://repl.it/languages/python3

UPD: It may be worth replacing \w with an explicit list of allowed characters, since \w matches letters, digits, and the underscore.
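One possible way to narrow \w down to letters only is the class `[^\W\d_]`, which reads as "a word character that is neither a digit nor an underscore". The sketch below assumes that is the restriction wanted; the function name `split_after_letters` is hypothetical.

```python
import re

# [^\W\d_] = word characters minus digits and the underscore, i.e. letters only.
# For str patterns in Python 3 this also covers non-Latin letters such as Cyrillic.
LETTER_BOUNDARY = re.compile(r'(?<=[^\W\d_][.!?]) ')

def split_after_letters(text):
    """Split on a space only when the sentence-ending mark follows a letter."""
    return LETTER_BOUNDARY.split(text)

print(split_after_letters("See item 2. Then stop. Done"))
print(split_after_letters("Привет! Я простой текст."))
```

Here "2." no longer counts as a sentence end, while the original \w version would split after it.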

It works, but out of curiosity: is it possible to split without losing a character (the space disappears)?
@YevgenyKuzmin, Python refuses to split on a pattern that matches the empty string (one that does not consume at least one character); it raises ValueError: split() requires a non-empty pattern match.
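Two ways to keep the space are sketched below. The first wraps the space in a capturing group, so re.split returns it as a separate list item. The second relies on the fact that, starting with Python 3.7, re.split accepts patterns that match the empty string, so the ValueError mentioned above applies only to older versions. The variable names are illustrative.

```python
import re

text = "Hello! I'm John. Are you OK? fine... and so"

# Option 1: capture the separator; re.split then returns it as its own item.
with_separators = re.split(r'(?<=\w[.!?])( )', text)

# Option 2 (Python 3.7+ only): split on a zero-width position, so the space
# stays attached to the beginning of the following sentence.
attached = re.split(r'(?<=\w[.!?])(?= )', text)

print(with_separators)
print(attached)
```

In both cases joining the pieces with an empty string restores the original text.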

0xdef · Answer 2 · 2016-08-10T02:20:40

 (.+?[.!?]) - splits on . ! ?


re.split(r'(.+?[.!?])', 'dfg! Dgfg? ddf. Dfdg. fdgdfg') returns a list with empty elements: ['', 'dfg!', '', ' Dgfg?', '', ' ddf.', '', ' Dfdg.', ' fdgdfg'] - Yevgeny Kuzmin
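If this capturing-group pattern is used anyway, the empty elements and the leading spaces can be cleaned up afterwards. A minimal sketch, with a hypothetical helper name `split_keep_terminators`:

```python
import re

def split_keep_terminators(text):
    """Split with the capturing pattern, then drop empty items and trim spaces."""
    parts = re.split(r'(.+?[.!?])', text)
    return [part.strip() for part in parts if part.strip()]

cleaned = split_keep_terminators('dfg! Dgfg? ddf. Dfdg. fdgdfg')
print(cleaned)
```

Note that this pattern treats every ".", "!" or "?" as a sentence end, so an ellipsis such as "fine..." is cut into "fine." and "..".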
