How are reference systems built?

Question

There is a reference directory of N addresses that is accessed by many users at once.

It would seem simple enough: put the directory into a DBMS and that's it ...

But reference systems usually offer partial-match search, so users generate a lot of LIKE queries, which the DBMS cannot optimize because the indexes are not used.

How do you get out of this situation? Is it really just a matter of buying more powerful hardware?

If the search is anchored at the beginning or end of the string, indexes work perfectly well, for example in Oracle.
That is, queries such as LIKE 'text%' and LIKE '%text' can be served by indexes (the second form typically needs an index on the reversed string).
In addition, directories are usually not very large, so searching even without indexes is fast.
But if you search for a fragment from the middle of a word, indexes no longer help.
It can be embedded directly in the DBMS or be a separate application.
And anyway, does anyone really type letters from the middle of a word?
Can this also be in reverse order, from the middle to the beginning, then from the end to the middle?
Read about ElasticSearch (or SOLR) and Hadoop (as a platform) ...

D-side · Answer 1 · 2016-10-25T11:07:53

There is N-gram search. It treats a string as a set of substrings of length N, and relevance is measured by the number of such substrings shared between the document and the search query.

This approach makes it possible to tolerate minor typos in words, or to find words from just a fragment of any significant length (N + 1 or more).
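To make the idea concrete, here is a minimal sketch in Python, assuming N = 3 and a plain count of shared N-grams as the relevance score. The function names (`ngrams`, `shared_ngrams`, `search`) and the sample street list are invented for illustration; real search engines use more elaborate scoring.

```python
def ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of all substrings of length n in text (lowercased)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def shared_ngrams(query: str, document: str, n: int = 3) -> int:
    """Relevance score: how many distinct n-grams the two strings share."""
    return len(ngrams(query, n) & ngrams(document, n))


def search(query: str, documents: list[str], n: int = 3) -> list[str]:
    """Return documents sharing at least one n-gram, most relevant first."""
    scored = [(shared_ngrams(query, doc, n), doc) for doc in documents]
    scored = [(score, doc) for score, doc in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


# Hypothetical sample data: the query contains a typo ("K" instead of "C").
streets = ["Copernicus Street", "Lenin Avenue", "Kopernik Lane"]
print(search("Kopernicus", streets))
# ['Copernicus Street', 'Kopernik Lane']
```

The misspelled query still ranks "Copernicus Street" first, because seven of its eight trigrams survive the typo.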

There are plenty of implementations to choose from.

There is an N-gram tokenizer in ElasticSearch. There is one in Apache Solr as well. In Sphinx, N-gram search can be enabled (it is claimed to make sense for Korean, Japanese, and Chinese, where splitting text into words is the hard part).

As you can see, practically every well-known search engine has it, and if you need a really powerful search, it is better to use a product built specifically for that purpose. Digging through the documentation of the one you choose, you may find other algorithms that suit you better.

And now something less ordinary.

For PostgreSQL there is a trigram search module (pg_trgm). As you might guess, this is N-gram search with N = 3.

In practice it requires a separate GIN or GiST index, built with an operator class from this module. A GIN index is fairly large and slow to update, but fast to search; a GiST index is more compact and faster to update, but it can return false matches. GIN is therefore the better choice for infrequently updated data.
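Why a trigram index helps with substring search at all can be shown with a toy inverted index in Python. This is only an illustrative model of the idea, not how PostgreSQL implements GIN or GiST internally; the class `TrigramIndex` and its method names are made up for this sketch. The final recheck step shows why candidates coming out of such an index may still need verifying against the actual row.

```python
from collections import defaultdict


def trigrams(text: str) -> set[str]:
    """All distinct 3-character substrings of text, lowercased."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Toy inverted index: trigram -> ids of rows containing it."""

    def __init__(self, rows: list[str]) -> None:
        self.rows = rows
        self.postings: dict[str, set[int]] = defaultdict(set)
        for row_id, row in enumerate(rows):
            for gram in trigrams(row):
                self.postings[gram].add(row_id)

    def find_substring(self, needle: str) -> list[str]:
        """Emulate LIKE '%needle%' without scanning every row."""
        grams = trigrams(needle)
        if not grams:
            # Needle shorter than 3 characters: the index cannot help,
            # so fall back to checking every row.
            candidates = set(range(len(self.rows)))
        else:
            # A matching row must contain every trigram of the needle.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        # Recheck: sharing all trigrams does not guarantee a real match.
        needle = needle.lower()
        return [self.rows[i] for i in sorted(candidates)
                if needle in self.rows[i].lower()]


# Hypothetical sample rows.
index = TrigramIndex(["Copernicus Street", "Lenin Avenue", "Nicolaus Square"])
print(index.find_substring("nic"))
# ['Copernicus Street', 'Nicolaus Square']
```

The lookup touches only the posting lists for the trigrams of the search string, which is what lets a middle-of-the-word search avoid a full scan once the fragment is at least three characters long.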

This is a good option if you already use PostgreSQL, it is not heavily loaded (or you can make sure it isn't), and you don't need the search to be particularly smart.

PostgreSQL also has a full-text search implementation, but that is unrelated to N-grams.

N-gram search is designed for fuzzy search (for example, when you need to get rid of duplicates caused by typos, or to find all more or less similar variants).
@minamoto this kind of search fits addresses very well, because users often get a letter or two wrong when typing place names.
On the subject of addresses that are off by a letter or two, there is a funny anecdote:
xxx: that's it, I've arrived, I'm at the Leningrad station
xxx: I took a taxi, remind me of the address
yyy: streetnikna, house xx
xxx: you Muscovites and your Olbanian
xxx: Copernicus Street, Copernicus
xxx: Nicolaus Copernicus, Polish scientist, great astronomer, aren't you ashamed in front of him
yyy: sure, sure, just get here already
