After a long round of searching, filtering, parsing, stemming, and other analysis, I ended up with the top 10 words for each lecture on ted.com. The sets are fairly distinct: across all 2346 sets, no more than 50 words are repeated.
The task is to find similar lectures based on these 2346 sets of 10 words each. Simply intersecting every set with every other set leads nowhere (the intersections contain only 1-3 words), so synonyms need to be taken into account. I tried to work with gensim and nltk but did not get anywhere, and I could not get LSA configured.
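For reference, the pairwise intersection described above can be sketched roughly as follows. This is only an illustration, not the exact code used: the `top_words` dictionary, the lecture IDs, and the sample words are made up, and the Jaccard score is added here simply as one common way to normalize the overlap.

```python
from itertools import combinations

# Hypothetical sample data: lecture id -> set of stemmed top words
# (shortened to four words each; the real sets have 10 words).
top_words = {
    "talk_a": {"brain", "neuron", "memori", "learn"},
    "talk_b": {"brain", "mind", "learn", "school"},
    "talk_c": {"ocean", "fish", "climat", "water"},
}

def jaccard(a, b):
    """Share of words two sets have in common: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_overlap(sets_by_id):
    """Return (id1, id2, shared_words, score) for every pair, best match first."""
    rows = []
    for id1, id2 in combinations(sorted(sets_by_id), 2):
        s1, s2 = sets_by_id[id1], sets_by_id[id2]
        rows.append((id1, id2, s1 & s2, jaccard(s1, s2)))
    # Stable sort: pairs with equal scores keep their alphabetical order.
    return sorted(rows, key=lambda row: row[3], reverse=True)

for id1, id2, shared, score in pairwise_overlap(top_words):
    print(id1, id2, sorted(shared), round(score, 2))
```

With 2346 sets this loop visits about 2.75 million pairs, which is still manageable, but it only matches identical stems, so two lectures that use different words for the same topic score zero.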
Please point me in the right direction on this difficult problem.