Which method of measuring the semantic proximity of two sentences gives the highest accuracy when comparing sentences of 3-10 words? Is it better to take the vector sum of all the words in each sentence and then measure the distance between the two resulting vectors, or to compare every word with every word and average the pairwise distances? I have also come across more complex variants, for example partitioning the vector space into clusters and measuring the distance between a cluster's center and each word of the sentences under study, which supposedly yields a more accurate sentence vector. What is the best option in terms of speed plus quality?
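For concreteness, here is a minimal sketch of the first two approaches from the question: summing word vectors versus averaging pairwise cosine similarities. The tiny `word_vectors` dict is a hypothetical stand-in for a real pretrained embedding model (e.g. word2vec or fastText) and is not part of the original question.

```python
# Sketch of the two approaches, assuming pretrained word vectors
# (faked here with a tiny dict for self-containment).
import numpy as np

word_vectors = {  # stand-in for a real embedding model
    "cat": np.array([0.9, 0.1, 0.0]),
    "sits": np.array([0.2, 0.8, 0.1]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "runs": np.array([0.1, 0.7, 0.3]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sum_similarity(s1, s2):
    """Approach 1: sum the word vectors of each sentence, compare the sums."""
    v1 = np.sum([word_vectors[w] for w in s1], axis=0)
    v2 = np.sum([word_vectors[w] for w in s2], axis=0)
    return cosine(v1, v2)

def pairwise_similarity(s1, s2):
    """Approach 2: compare every word with every word, average the result."""
    sims = [cosine(word_vectors[a], word_vectors[b]) for a in s1 for b in s2]
    return float(np.mean(sims))

s1, s2 = ["cat", "sits"], ["dog", "runs"]
print(sum_similarity(s1, s2), pairwise_similarity(s1, s2))
```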

1 answer

Much depends on the specifics of your corpus, IMHO. The most common approach is cosine similarity (scikit-learn handles it fairly quickly). I did it by vectorizing each sentence (bigrams) and then computing the similarity.
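A minimal sketch of what the answer describes, assuming tf-idf vectors over unigrams and bigrams (reading the original "b-grams" as bigrams) and scikit-learn's `cosine_similarity`; the example sentences are made up:

```python
# Vectorize each sentence with tf-idf over word uni- and bigrams,
# then compare the two rows with cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "the cat sits on the mat",
    "a cat is sitting on a mat",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X = vectorizer.fit_transform(sentences)

print(cosine_similarity(X[0], X[1])[0, 0])
```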

You can also try going through PageRank to derive the vectors.
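The PageRank suggestion is not spelled out in the answer; one common reading is a TextRank/LexRank-style construction, sketched here purely under that assumption: sentences become graph nodes, edges are weighted by tf-idf cosine similarity, and networkx's `pagerank` scores each sentence's centrality.

```python
# TextRank/LexRank-style sketch (an assumed reading of "pagerank"):
# sentence-similarity graph scored with PageRank.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "the cat sits on the mat",
    "a cat is sitting on a mat",
    "stock markets fell sharply today",
]

X = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(X)

G = nx.Graph()
G.add_nodes_from(range(len(sentences)))
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        if sim[i, j] > 0:
            G.add_edge(i, j, weight=float(sim[i, j]))

print(nx.pagerank(G, weight="weight"))
```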

As for the use of centroids, there is an interesting approach ( http://www.aclweb.org/anthology/W17-1003 ); the paper includes a link to the Python code.
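As a rough, non-authoritative sketch of the centroid idea: the linked paper builds a centroid vector from the embeddings of the most relevant (top tf-idf) words and embeds each sentence as the sum of its word vectors. Here the toy `word_vectors` dict again stands in for a real embedding model, and for brevity the centroid is simply the mean over the toy vocabulary.

```python
# Centroid sketch: embed sentences as sums of word vectors and compare
# them to a centroid vector (here the mean of the toy vocabulary;
# the paper selects the centroid words by tf-idf).
import numpy as np

word_vectors = {  # stand-in for a real embedding model
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "sits": np.array([0.2, 0.8, 0.1]),
    "runs": np.array([0.1, 0.7, 0.3]),
}

def embed(words):
    return np.sum([word_vectors[w] for w in words], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

centroid = np.mean(list(word_vectors.values()), axis=0)

s1, s2 = ["cat", "sits"], ["dog", "runs"]
# How close each sentence sits to the centroid, and to each other.
print(cosine(embed(s1), centroid), cosine(embed(s2), centroid))
print(cosine(embed(s1), embed(s2)))
```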