Through a lot of searching, filtering, parsing, stemming, and other analysis, I arrived at a TOP-10 word list for each talk on ted.com. The lists are quite distinctive, i.e. across the 2346 sets of words, no more than 50 words repeat between them.

The task is to identify similar talks based on the resulting 2346 sets of 10 words each. A simple pairwise intersection of the sets leads nowhere (the intersections contain only 1-3 words). Synonyms need to be taken into account. I tried working with gensim and nltk but got nowhere, and I have not managed to set up LSA either.

Please point me in the right direction on this difficult matter.

  • Maybe compute the cosine distance between the top words using word2vec? - Lol4t0
  • Would it be hard to show which specific words were found? - sanmai
  • Are you sure there are no different forms of the same word among the words you found? If there are, you could use, for example, Levenshtein distance to identify them. Using a synonym database is also a good idea (perhaps something like en.softonic.com/s/english-synonym-dictionary would do). After that, the number of sets should be reduced several times over. - Ilya
  • I would use word2vec. I would also try to find some database of synonyms. And, of course, lemmatization, if you have not done it yet. - Alexey Lobanov
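
As an illustration of the word2vec suggestion in the comments above, here is a minimal sketch that compares two TOP-10 sets by the cosine similarity of their averaged word vectors via gensim; the model path and the two sample sets are placeholders, not data from the question:

```python
# Minimal sketch of the word2vec idea from the comments: compare two TOP-10
# sets by the cosine similarity of their averaged word vectors.
# "path/to/word2vec_vectors.bin" and the two sample sets are placeholders.
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("path/to/word2vec_vectors.bin", binary=True)

set_a = ["brain", "neuron", "memory", "learning", "mind"]
set_b = ["cognition", "neural", "recall", "education", "thought"]

# Drop words the model does not know, otherwise n_similarity raises an error.
a = [w for w in set_a if w in model]
b = [w for w in set_b if w in model]

if a and b:
    # Cosine similarity between the mean vectors of the two word lists.
    print(model.n_similarity(a, b))
```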

1 answer

If you want to do it without word2vec and the like, you can try an algorithm along these lines:

  1. For each source word from your 2346 sets, build a frequency table of the other words taken from the same sentences in which that word occurs; you do not even have to take all of them, just the words immediately before and after the source word.
  2. For each word found this way, collect its neighbouring words in the same manner (i.e., its context), then merge these into a single list sorted by frequency of occurrence and filtered so that only words present in your 2346 sets remain.
  3. The top 10 of this list will consist of words that can replace the original word.
  4. PROFIT

Of course, you first need to run all of this through stemming and the rest of the preprocessing; a rough sketch of the steps above follows.
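
This is only a sketch of steps 1-3 under the assumption that the transcripts are already available as stemmed token lists; `sentences`, `vocabulary`, and the toy data at the bottom are hypothetical placeholders:

```python
# Rough sketch of steps 1-3 above. `sentences` (stemmed token lists from the
# transcripts) and the toy data at the bottom are hypothetical placeholders.
from collections import Counter, defaultdict

def build_contexts(sentences, window=2):
    """Step 1: for every word, count the neighbouring words within a small window."""
    contexts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    contexts[word][tokens[j]] += 1
    return contexts

def replacements(word, contexts, vocabulary, top_n=10):
    """Steps 2-3: merge the contexts of the word's context words, keep only
    words from the TOP-10 vocabulary, and return the most frequent ones."""
    merged = Counter()
    for neighbour, count in contexts[word].items():
        for candidate, c in contexts[neighbour].items():
            if candidate in vocabulary and candidate != word:
                merged[candidate] += count * c
    return [w for w, _ in merged.most_common(top_n)]

# Toy example: "mind" comes up as a possible replacement for "brain".
sentences = [["brain", "store", "memory"], ["mind", "store", "recall"]]
vocabulary = {"brain", "memory", "mind", "recall"}  # union of the TOP-10 sets
contexts = build_contexts(sentences)
print(replacements("brain", contexts, vocabulary))
```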

As for the simple intersection of sets, here is another idea: intersect not the words themselves but their extracted stems, i.e. treat different forms of the same word as identical.
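
For example, a minimal sketch of such a stem-level intersection, using nltk's SnowballStemmer as one possible choice of stemmer (the two sample lists are made up):

```python
# Minimal sketch of intersecting two TOP-10 sets by stems rather than by the
# exact words; nltk's SnowballStemmer is just one possible choice of stemmer,
# and the two sample lists are made up.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def stem_set(words):
    return {stemmer.stem(w) for w in words}

def stem_overlap(set_a, set_b):
    """Number of stems the two sets have in common."""
    return len(stem_set(set_a) & stem_set(set_b))

# "neurons"/"neuron" and "learning"/"learn" now count as matches.
print(stem_overlap(["neurons", "learning", "memory"],
                   ["neuron", "learn", "mind"]))
```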

And finally, a useful link: https://ru.wikipedia.org/wiki/Дистрибутивная_семантика (distributional semantics).