It uses a GIN index over the documents' tsvectors for full-text search. The tsvectors themselves do not necessarily have to be stored ... with one caveat, more on that below.
A tsvector is a sorted array of word stems (words processed by a stemmer) from a document.
- It is sorted to speed up the search and contains each stem only once, so it also takes up less space.
- A stemmer converts different forms of the same word into identical strings, so that a search takes the different forms of a word into account. It is usually highly specific to a particular language, both in its algorithms and in its dictionaries.
- Usually stop words (which do not reflect the content of the document and mostly serve its structure) are filtered out as well, because searching for them does more harm than good. Of course, this also depends on the language.
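
To make the steps above concrete, here is a rough Python sketch of building such a sorted, de-duplicated array of stems. The stop-word list, the suffix stripper and the name `to_tsvector` are my own toy stand-ins for illustration; a real stemmer and dictionary are much more involved, and this is not how PostgreSQL implements it.

```python
import re

# Toy stop-word list and suffix list: stand-ins for a real language-specific
# dictionary and stemmer, chosen only for illustration.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "in", "on", "to"}
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip one common English suffix; real stemmers are far more careful."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_tsvector(text: str) -> list[str]:
    """Return the sorted, de-duplicated stems of the non-stop words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = {toy_stem(w) for w in words if w not in STOP_WORDS}
    return sorted(stems)


print(to_tsvector("The cats are chasing a cat and jumped on the boxes"))
# prints ['box', 'cat', 'chas', 'jump']
```

Note how "cats" and "cat" collapse into a single entry and the stop words disappear entirely.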
A GIN index is a mapping from the individual elements of a composite value (one that holds many elements, like an array, which a tsvector also is) to the set of rows whose values contain those elements. That is, it is a search tree in which the key is a word stem and the value is a set of document identifiers.
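
As a sketch of that structure, the following Python snippet builds such a mapping from stems to document identifiers. A plain dict stands in for the search tree here, and `build_index` and the sample documents are made up for the example.

```python
from collections import defaultdict


def build_index(docs: dict[int, list[str]]) -> dict[str, set[int]]:
    """Map every stem to the set of ids of the documents that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, stems in docs.items():
        for stem in stems:
            index[stem].add(doc_id)
    return dict(index)


# Hypothetical documents, already reduced to their arrays of stems.
docs = {
    1: ["box", "cat", "jump"],
    2: ["cat", "dog"],
    3: ["box", "dog"],
}
index = build_index(docs)
print(sorted(index["cat"]))  # prints [1, 2]
```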
Given a tsvector for the query, the search over this index is a fold (reduce) that intersects the individual sets the index returns for each tsvector element of the query.
- PostgreSQL actually uses a different type for queries, tsquery, with support for search operators, but I am considering a simplified case.
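
Roughly, that simplified search could look like the Python sketch below. The tiny `index` literal and the `search` function are hypothetical, and starting the fold from the smallest set is my own small optimization, not something taken from PostgreSQL.

```python
import operator
from functools import reduce


def search(index: dict[str, set[int]], query_stems: list[str]) -> set[int]:
    """Return the ids of the documents that contain every stem of the query."""
    if not query_stems:
        return set()
    # A stem missing from the index matches no document at all.
    postings = [index.get(stem, set()) for stem in query_stems]
    # Starting from the smallest set keeps the intermediate results small.
    postings.sort(key=len)
    # Fold: intersect the sets one by one, starting from a copy of the first.
    return reduce(operator.and_, postings[1:], set(postings[0]))


# Hypothetical index: stem -> ids of the documents containing it.
index = {"box": {1, 3}, "cat": {1, 2}, "dog": {2, 3}, "jump": {1}}

print(search(index, ["box", "cat"]))    # prints {1}
print(search(index, ["cat", "zebra"]))  # prints set()
```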
The solution described here, however, does not take ranking by distance between words into account, and that is very hard to compute efficiently. In any case, I can't name anything offhand.
The only thing that comes to mind is something like the Levenshtein distance between the unsorted tsvectors (arrays of the tsvector's words without stop words, in their original order) of the query and of the match. But I see this as a very inefficient solution, especially for large documents (for them, ranking by Levenshtein distance effectively turns into sorting by increasing size).
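
To show what I mean, here is a small Python sketch of the Levenshtein distance computed over whole stems instead of characters. The function name and the sample data are made up. The cost is proportional to the product of the two lengths, and the distance can never be smaller than the difference in lengths, which is why long documents end up ordered mostly by size.

```python
def token_levenshtein(a: list[str], b: list[str]) -> int:
    """Edit distance between two stem sequences, one stem per edit."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if the stems match
            ))
        prev = cur
    return prev[-1]


query = ["quick", "fox"]
print(token_levenshtein(query, ["quick", "brown", "fox"]))  # prints 1

# A long document that contains the query verbatim still ends up far away.
long_doc = ["x"] * 50 + ["quick", "fox"] + ["y"] * 48
print(token_levenshtein(query, long_doc))  # prints 98
```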