Please advise: if the task is to cluster several thousand search queries (short phrases of 2-5 words, not necessarily in Russian), which clustering algorithm is best to use, and what approach is generally taken for such problems (converting the text to some vector representation and then clustering the vectors)? And how can the quality of the implemented approach be assessed, i.e. which metric should be used when there is no labeled data with known clusters?

    1 answer

    What you are asking about is the classic task of clustering natural-language texts. It is well studied and has been described many times. It is solved roughly as you wrote: first map the texts into some multidimensional space, then cluster them, that is, automatically group the data in that space without any training labels. There is a huge amount of material on the topic. For a start, you can look here:
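
    The pipeline above (vectorize, cluster, evaluate without labels) can be sketched with scikit-learn. This is a minimal illustration, not a prescription: the toy query list, the choice of TF-IDF over character n-grams (which sidesteps language-specific tokenization for mixed-language queries), k-means as the clusterer, and the silhouette score as the internal quality metric are all assumptions; other vectorizers (word embeddings) and algorithms (agglomerative, DBSCAN) are equally valid starting points.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Hypothetical sample queries; in practice this would be your several
    # thousand search queries, possibly in multiple languages.
    queries = [
        "buy iphone case", "iphone case cheap", "iphone 8 case buy",
        "python sort list", "sort a list python", "python list sorting",
        "weather moscow today", "moscow weather forecast",
    ]

    # Character n-grams work reasonably for short, mixed-language queries
    # without needing a per-language tokenizer or stemmer.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    X = vectorizer.fit_transform(queries)

    # With no ground-truth labels, pick k by an internal metric such as
    # the silhouette score (ranges from -1 to 1, higher is better).
    best_k, best_score, best_labels = None, -1.0, None
    for k in range(2, 6):
        km = KMeans(n_clusters=k, n_init=10, random_state=0)
        labels = km.fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels

    print(best_k, round(best_score, 3))
    ```

    Besides the silhouette score, other internal metrics such as the Calinski-Harabasz or Davies-Bouldin indices can be used the same way; comparing several of them is more robust than trusting any single one.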

    https://compscicenter.ru/courses/nlp/2014-spring/classes/325/

    https://kelijah.livejournal.com/196774.html

    https://logic.pdmi.ras.ru/~sergey/teaching/mlbeeline16/N16_BeelineTextMining.pdf

    http://habr.com/post/170619/

    https://www.linkedin.com/pulse/nlp-text-analytics-simplified-document-clustering-parsa-ghaffari/

    http://arxiv.org/pdf/1707.02919.pdf

    http://sntbul.bmstu.ru/file/759414.html?__s=1