SENTENCE implementation in Sphinx

Question

Sphinx can search for words that occur within the same sentence. For example, take this text:

Vasya done, ate a cucumber, because got hungry. So it goes.

The query

done SENTENCE cucumber

finds this text. The query

done SENTENCE hungry

no longer finds it, apparently because Sphinx splits text into sentences in a simple way and treats the first dot it encounters as the end of the sentence. Hence the question.

How can Sphinx be configured to split text into sentences more intelligently when building an index? Any approach is acceptable: setting something in the config, or plugging in an external package to do the sentence splitting, for example Tomita Parser from Yandex.

UPDATE

I had the idea of splitting the text into sentences beforehand with Tomita Parser and telling Sphinx to use a line break as the sentence separator, but judging by the Sphinx source code, this is unlikely to work.

Are stopwords used at indexing time, or only for the search string?
This risks gluing sentences together unnecessarily when a sentence ends with an abbreviation ("One sentence, etc. Another sentence").
I do not know; roughly speaking, I came up with this purely in theory.
Perhaps Sphinx applies them after tokenization (in which case it is worth dropping Sphinx and using Elasticsearch).

Accepted Answer · 2016-09-17T19:25:53

The solution that worked for us.

Use Tomita Parser to split the text into sentences. The result is text in which the sentences are separated by line breaks.

In each resulting sentence, delete all periods, exclamation marks and question marks except the final ".", "?" or "!".

Build the Sphinx index from this processed data. The text is then split into sentences as intended, because Sphinx ends a sentence whenever it finds ".", "?" or "!".
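The cleanup step can be sketched in Python roughly as follows. This is a minimal sketch, not the code actually used: it assumes the sentence splitter (Tomita Parser here) has already produced one sentence per line, and the function names `normalize_sentence` and `prepare_for_index` are hypothetical.

```python
import re

# Characters that, per the answer above, Sphinx treats as sentence terminators.
TERMINATORS = ".?!"


def normalize_sentence(sentence: str) -> str:
    """Delete every terminator in one pre-split sentence except the final one."""
    text = sentence.strip()
    if not text:
        return ""
    # Remember the closing mark if the sentence ends with one.
    last = text[-1] if text[-1] in TERMINATORS else ""
    body = text[:-1] if last else text
    # Remove all remaining ".", "?" and "!" inside the sentence.
    body = re.sub(r"[.?!]", "", body)
    return body + last


def prepare_for_index(split_text: str) -> str:
    """Normalize text that already has one sentence per line (assumed input)."""
    lines = (normalize_sentence(line) for line in split_text.splitlines())
    # Drop empty lines so the output stays one sentence per line.
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = "One sentence, etc. and so on.\nAnother sentence!"
    print(prepare_for_index(sample))
```

Note that deleting inner dots also changes tokens such as decimal numbers ("3.5" becomes "35"); if that matters for your data, replacing the marks with spaces instead is one possible variation.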
