Sphinx can search for words that occur within the same sentence. For example, take this text:

Vasya is great (молодец), ate a cucumber (огурец), because (т.к.) he got hungry (проголодался). So it goes.

If you query for

молодец SENTENCE огурец 

then this text is found. If you query for

 молодец SENTENCE проголодался 

then this text is no longer found. Apparently Sphinx splits text into sentences in a simple way, and the first period it encounters here is treated as the end of the sentence. Hence the question.

How can Sphinx be configured to split text into sentences more intelligently when building an index? Any option will do: setting something in the configs, or plugging in an external package that does the sentence splitting, for example Tomita Parser from Yandex.
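The behaviour described above can be imitated with a small toy model. This is only a sketch of the assumed rule (every ".", "?" or "!" ends a sentence), not Sphinx's actual implementation; the helper names `naive_sentences` and `same_sentence` are made up for illustration, and the Russian sample text is a reconstruction based on the queries and the "т.к." abbreviation mentioned in this thread.

```python
import re


def naive_sentences(text):
    """Split at every '.', '?' or '!' (toy model, not Sphinx's real code)."""
    return [part.strip() for part in re.split(r"[.?!]", text) if part.strip()]


def same_sentence(text, first, second):
    """Return True if both words fall into one naively split sentence."""
    for sentence in naive_sentences(text):
        words = re.findall(r"\w+", sentence.lower())
        if first in words and second in words:
            return True
    return False


# Assumed reconstruction of the original Russian sample text.
text = "Вася молодец, съел огурец, т.к. проголодался. Такие дела."

print(same_sentence(text, "молодец", "огурец"))        # True
print(same_sentence(text, "молодец", "проголодался"))  # False
```

The period inside "т.к." cuts the sentence short, so the second pair of words ends up in different pieces.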

UPDATE

One idea was to split the text into sentences beforehand with Tomita Parser and tell Sphinx to use a line break as the sentence separator, but judging by the Sphinx source code, this is unlikely to work.

  • Have you tried stopwords, or are they applied only to the search string? - Makarenko_I_V
  • @Makarenko_I_V Interesting thought. I just tried stopwords, but it didn't work. It did work with exceptions, though: I tried setting т.к. => тк and it helped. But this is very much a compromise, because it risks gluing sentences together when a sentence ends with an abbreviation ("Одно предложение и т.п. Другое предложение", i.e. "One sentence, etc. Another sentence"). - mnv
  • You need stopwords. - etki
  • @Etki I tried to register them: but it had no effect. How do I use them correctly in this case? - mnv
  • I don't know; roughly speaking, this is pure theory on my part. Perhaps Sphinx applies them after tokenization (in which case it is worth dropping Sphinx and using Elasticsearch). - etki

1 answer

The solution that worked for me.

Use Tomita Parser to split the text into sentences. The result is text in which the sentences are separated by line breaks.

In each resulting sentence, delete all periods, exclamation marks and question marks except the final one (".", "?" or "!").

Build the Sphinx index from this processed data. Sentence splitting then works as intended, since Sphinx ends a sentence whenever it encounters ".", "?" or "!".
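The cleanup step could look roughly like the sketch below. It assumes the text has already been split into one sentence per line (by Tomita Parser or any other splitter); the function names are hypothetical, and adding a period to a line that has no final terminator is my own assumption, not part of the recipe above.

```python
import re

TERMINATORS = ".?!"


def normalize_sentence(sentence):
    """Drop every '.', '?' and '!' except a single final terminator."""
    sentence = sentence.strip()
    if not sentence:
        return ""
    # Assumption: a line without a final terminator gets a period,
    # so that a sentence boundary is still present for the indexer.
    final = sentence[-1] if sentence[-1] in TERMINATORS else "."
    body = re.sub(r"[.?!]", "", sentence).rstrip()
    if not body:
        return ""
    return body + final


def prepare_for_indexing(text):
    """Normalize text that contains one sentence per line."""
    lines = (normalize_sentence(line) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


sample = "Вася молодец, съел огурец, т.к. проголодался.\nТакие дела."
print(prepare_for_indexing(sample))
```

Keep in mind that this also strips periods from things like decimal numbers ("3.14" becomes "314"), so it may need adjusting if such tokens matter for search.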