What is the best way to organize indexing and search for Russian words that are transliterations of English words? For example, the word vision may be written in Russian as вижн or вижен, but the meaning stays the same. A search for either vision or вижн should therefore return the same results, without losing any matches. Soundex and Metaphone, as far as I understand, work strictly within a single language and do not cross over between languages.

    1 answer

    It seems to me that in most cases the synonym functionality will suffice. The database most likely has a specific thematic focus, for example medicine or electrical engineering. To tune the search for the transliteration or sound of words in different languages, it is enough to fill in 500-1000 synonyms. Note, however, that the analyzer with the synonym filter should be used not for indexing but for analyzing the search query.

    It is better to put the synonyms in a separate file, and not to store them in the settings.

     {
       "index": {
         "analysis": {
           "analyzer": {
             "synonym": {
               "tokenizer": "whitespace",
               "filter": ["synonym"]
             }
           },
           "filter": {
             "synonym": {
               "type": "synonym",
               "synonyms_path": "analysis/synonym.txt"
             }
           }
         }
       }
     }

    An example of the contents of the synonym.txt file:

    vision, вижн, вижен
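
    To apply the synonym analyzer only to search queries, the field mapping can name it as `search_analyzer` while indexing with a plain analyzer. A minimal sketch, assuming a hypothetical text field called `title` and the mapping syntax of recent Elasticsearch versions (older versions nest the properties under a type name and use the `string` type):

    ```json
    {
      "mappings": {
        "properties": {
          "title": {
            "type": "text",
            "analyzer": "whitespace",
            "search_analyzer": "synonym"
          }
        }
      }
    }
    ```

    With this arrangement, documents are indexed as written, and a query for вижн is expanded to the other members of its synonym group at search time.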


    If this is not enough, the task becomes much harder.

    If the problem is limited to transliteration, look at the ICU Transform Token Filter plug-in. If not, and you need vision, вижн, вижен to match for every possible word, look towards machine learning. One such solution is Rosette for Elasticsearch, which appears to be a paid product.
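
    For the transliteration case, a rough sketch of index settings using the `icu_transform` filter might look like the following. It assumes the analysis-icu plug-in is installed; the names `latin_folded` and `to_latin` are made up for illustration, and the transform ID is the general-purpose one used in the plug-in documentation:

    ```json
    {
      "index": {
        "analysis": {
          "analyzer": {
            "latin_folded": {
              "tokenizer": "whitespace",
              "filter": ["to_latin", "lowercase"]
            }
          },
          "filter": {
            "to_latin": {
              "type": "icu_transform",
              "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC"
            }
          }
        }
      }
    }
    ```

    Keep in mind that this only converts the script: вижн would become something like vizn, not vision, so it helps with letter-for-letter transliteration but not with phonetic transcription.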

    Useful links.

    Soundex and Metaphone are phonetic algorithms that work, as you wrote, within a single language. They are not what you need.

    • Thanks for the links, they are right on point! Most likely the problem will have to be solved through synonyms, or Soundex / Metaphone as advised here: stackoverflow.com/questions/30843475/… - bme
    • I wouldn't bother either. Synonyms work great. But they have a significant disadvantage: this is a global setting, and you cannot specify synonyms for a specific field or group of fields. - Andrey Morozov