Hello! I need to divide the text into sentences.

I try to do this:

    from nltk.tokenize import sent_tokenize

    text = 'A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey? says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'
    print(sent_tokenize(text))

I get the output:

    ['A clam for supper?', 'a cold clam; is THAT what you mean, Mrs. Hussey?', 'says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs.', 'Hussey?"']

The last sentence was split incorrectly. Could you suggest approaches or libraries that would help solve this problem?

This is just an example; I need to split a large text into sentences. I don't know what other edge cases may come up, so please also say which sentence tokenizer you consider the best one (for English).

Thanks.

1 answer

Having poked at your example, it turns out that the most popular tokenizers handle a quotation mark that follows a sentence-final punctuation mark rather badly: Mrs. Hussey?" breaks. However, Mrs. Hussey? " (with a space before the closing quote) parses fine, and so does Mrs._Hussey?" (no space around the quote, but with Mrs. modified). There is no pretty way out of this situation, so how exactly you mangle the text is up to you. A good starting point is the list of English honorifics. You could replace the dot-plus-space after such words with a dot-plus-underscore before tokenizing, and then change it back afterwards. You can write a regular expression for that, or do something like this:

    import nltk

    blah_blah_with_dots = {'Dr', 'Ms', 'Mr', 'Mrs', 'Prof', 'Inc', 'Fr'}
    SENTENCE_TOKENIZER = nltk.data.load('tokenizers/punkt/english.pickle')

    text = 'A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey? says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'

    # Protect the honorifics: "Mrs. " -> "Mrs._" so the tokenizer will not split there
    for blah in blah_blah_with_dots:
        text = text.replace(blah + ". ", blah + "._")

    for index, sentence in enumerate(SENTENCE_TOKENIZER.tokenize(text)):
        # Undo the protection inside every produced sentence
        for blah in blah_blah_with_dots:
            sentence = sentence.replace(blah + "._", blah + ". ")
        print("Sentence: ", index, sentence)

    print()
    print("Full text: ", text)
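A less hacky alternative, essentially the approach from the first question linked below, is to tell Punkt directly which abbreviations it must not split on, instead of rewriting the text. This is only a sketch; I have not checked whether it also cures the closing-quote case from your example:

    from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

    punkt_param = PunktParameters()
    # Abbreviations are stored lower-case and without the trailing dot
    punkt_param.abbrev_types = {'dr', 'ms', 'mr', 'mrs', 'prof', 'inc', 'fr'}

    tokenizer = PunktSentenceTokenizer(punkt_param)
    print(tokenizer.tokenize('Mrs. Hussey served the clams. Mr. Smith ate them.'))

Note that a tokenizer built from bare PunktParameters does not carry the statistics of the pre-trained English model, so on a large text you may still prefer to keep using english.pickle.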

The replace() hack looks ugly, but nothing can be done about it; life is pain. You can also try playing with the quotes themselves, although without knowing the internal kitchen of nltk it is hard to say exactly what the culprit is. I would also advise you right away to think about replacing fancy Unicode quotes with plain " or ' (a small sketch of that follows after the links). Here are some more related questions:

    https://stackoverflow.com/questions/14095971/how-to-tweak-the-nltk-sentence-tokenizer/25375857#25375857

    https://stackoverflow.com/questions/18941997/why-does-nltk-mis-tokenize-quote-at-end-of-sentence
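For completeness, here is a minimal sketch of the quote normalisation mentioned above (my own suggestion, not something built into nltk): it maps the most common Unicode quotes to plain ASCII ones before tokenizing.

    # Map of "fancy" quotes to their plain ASCII counterparts
    QUOTE_MAP = {
        '\u201c': '"',   # left double quotation mark
        '\u201d': '"',   # right double quotation mark
        '\u2018': "'",   # left single quotation mark
        '\u2019': "'",   # right single quotation mark (also used as an apostrophe)
        '\u00ab': '"',   # left-pointing guillemet
        '\u00bb': '"',   # right-pointing guillemet
    }

    def normalize_quotes(text):
        return text.translate(str.maketrans(QUOTE_MAP))

    print(normalize_quotes('\u201cA clam for supper?\u201d'))   # -> "A clam for supper?"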

To establish which tokenizer is the best one, you would need to come up with a formal test set and see how each one behaves on it. I have never seen such tests; usually everyone just uses the default tokenizer, Punkt (the one loaded above). It has to be trained on a large body of text: a pre-trained model already exists and can be downloaded by calling nltk.download(). Of course, you can also train the model yourself if you have a suitable corpus.
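For example (a sketch; Punkt trains unsupervised on raw text, and 'my_corpus.txt' is a made-up file name, take any plain-text English corpus you trust):

    import nltk
    from nltk.tokenize.punkt import PunktSentenceTokenizer

    # One-off download of the pre-trained Punkt models
    nltk.download('punkt')

    # Training your own model: 'my_corpus.txt' is a hypothetical file
    # containing a large amount of raw English text.
    with open('my_corpus.txt', encoding='utf-8') as f:
        raw_text = f.read()

    my_tokenizer = PunktSentenceTokenizer(raw_text)   # training happens in the constructor
    print(my_tokenizer.tokenize('Dr. Brown met Mrs. Hussey. They talked about clams.'))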