I can't yet figure out how (or rather, in what form) to build the index (or indexes) for full-text search over documents. If you have options, please share.

Initial data

  • Documents are stored in a table as BLOBs;
  • Before insertion into the table, each document is compressed with LZMA, PPMd, or PPMz, whichever compresses it better (a minimal sketch of this step follows the list);
  • Search by part of a word must be supported. For example, the query "mom" should find a document containing "Mom washed the frame";
  • The found documents must be sorted by the "distance" between the matched words. For example, for the query "Mom frame", of the matching documents "Mom washed the frame" and "Mom washed the white frame", the first should rank higher, because fewer words separate "Mom" and "frame" in it than in the second;
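
A minimal sketch of that compression step, assuming Python and the standard-library lzma module (PPMd/PPMz are not in the standard library and would need a third-party codec, which is not shown here):

```python
import lzma

def compress_document(text: str) -> bytes:
    """Compress a document before storing it as a BLOB, keeping the raw
    bytes when compression does not actually shrink them. A one-byte tag
    records which codec was applied so the blob can be decoded later."""
    raw = text.encode("utf-8")
    packed = lzma.compress(raw)
    if len(packed) < len(raw):
        return b"L" + packed   # 'L' = LZMA-compressed
    return b"R" + raw          # 'R' = stored as-is

def decompress_document(blob: bytes) -> str:
    tag, body = blob[:1], blob[1:]
    data = lzma.decompress(body) if tag == b"L" else body
    return data.decode("utf-8")
```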

    1 answer

    You can follow PostgreSQL's example.

    For full-text search it builds a GIN index over the documents' tsvector values. The tsvector values themselves don't cost much to store... with one caveat, more on that below.

    • tsvector is a sorted array of word stems (words processed by a stemmer) from a document.
      • It is sorted to speed up searching and does not contain duplicate words, so it also takes up less space.
      • The stemmer converts different forms of the same word into identical strings, so the search takes different forms of the same word into account. It is usually highly specific to a particular language, both in its algorithms and in its dictionaries.
      • Stop words (which do not reflect the content of the document and serve mostly a structural role) are usually filtered out as well, because searching for them does more harm than good. This, of course, also depends on the language.
    • A GIN index is a mapping from the individual elements of a composite value (one that contains many elements, such as an array; a tsvector is also such a value) to the set of rows whose values contain those elements. In other words, it is a search tree in which the key is a word stem and the value is a set of document identifiers.

      Given a tsvector built from the query, a search over this index is performed as a fold (reduce): the document-id sets fetched from the index for each element of the query tsvector are intersected, as in the sketch after this list.

      • PostgreSQL actually uses a separate type for queries, tsquery, which supports search operators, but I am considering a simplified case.
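
    A minimal sketch of these structures in Python, assuming a toy stop-word list and a placeholder stemmer (real systems use language-specific dictionaries and algorithms such as Snowball); it illustrates the idea rather than PostgreSQL's actual implementation:

```python
from collections import defaultdict

# Tiny illustrative stop-word list and "stemmer"; both are placeholders.
STOP_WORDS = {"the", "a", "an", "and", "or"}

def stem(word: str) -> str:
    # Placeholder stemmer: lowercase and strip a few common English suffixes.
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_tsvector(text: str) -> list[str]:
    """Build a tsvector-like value: sorted, de-duplicated word stems
    with stop words removed."""
    stems = {stem(w) for w in text.split() if w.lower() not in STOP_WORDS}
    return sorted(stems)

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """GIN-like inverted index: word stem -> set of document ids."""
    index: defaultdict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for lexeme in to_tsvector(text):
            index[lexeme].add(doc_id)
    return index

def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Fold (reduce) over the query's stems, intersecting the id sets
    fetched from the index; only documents containing all stems survive."""
    result: set[int] | None = None
    for lexeme in to_tsvector(query):
        ids = index.get(lexeme, set())
        result = ids if result is None else result & ids
    return result or set()

docs = {
    1: "Mom washed the frame",
    2: "Mom washed the white frame",
    3: "The frame was white",
}
index = build_index(docs)
print(search(index, "mom frame"))   # {1, 2}
```

    Note that the partial-word requirement from the question ("mom" finding "Mom") is covered here only by lowercasing; genuine prefix search would need something extra (for example, indexing prefixes), which is outside this sketch.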

    The described solution, however, does not account for ranking by the distance between words, and that is very hard to do efficiently. In any case, nothing comes to mind right away.

    The only thing that comes to mind is something like the Levenshtein distance between the unsorted tsvectors (arrays of word stems without stop words) of the query and of the match. But I see that as a very inefficient solution, especially for large documents (for them, Levenshtein effectively degenerates into sorting by increasing document size).
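
    To make that idea concrete, here is a naive word-level Levenshtein sketch in Python (the document ids and stems are made up for illustration); a smaller distance ranks higher, and the inefficiency noted above still applies:

```python
def levenshtein(a: list[str], b: list[str]) -> int:
    """Word-level Levenshtein distance between two sequences of stems."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Rank matching documents by the distance between the query stems and the
# document stems (stop words already removed); smaller distance ranks higher.
query_stems = ["mom", "frame"]
candidates = {
    1: ["mom", "wash", "frame"],
    2: ["mom", "wash", "white", "frame"],
}
ranking = sorted(candidates, key=lambda d: levenshtein(query_stems, candidates[d]))
print(ranking)  # [1, 2] -- but for large documents this effectively
                # just sorts by document length, as noted above
```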

    • Thank you, I'll look into it! On a neighboring resource I was advised to read up on the "Suffix Tree", "Inverted Index", and "Search Index". Although GIN and the inverted index look similar at first glance, they are separate articles on the wiki. I'll sort this out. - Majestio
    • @Majestio it is the purest form of inverted index. The Wikipedia article about GIN is, to be honest, mostly nonsense :D It can also be applied to ordinary arrays, not only to full-text search. - D-side
    • Well, yes) That's what I noticed) - Majestio
    • @Majestio as for the "suffix tree"... yes, you could probably build a suffix tree over the word stems of individual documents (instead of the traditional letters of individual words) and quickly compute a weight on that tree with bounded gaps. But in my head it still looks more like a "vague idea". - D-side