Suppose there are two text documents, each represented as an array of its words: D1 = {word1, word2, word3} and D2 = {word1, word5, word8} .

How do I find the cosine similarity of these two vectors? And how can strings be multiplied (or subjected to other mathematical operations) in a programming language?

And what if the vectors are of different sizes? (For example, one document may contain more words than the other.)

  • The cosine measure for numerical vectors is their dot product divided by the product of their norms: x · y / (|x| · |y|). Apparently, it is assumed that you will somehow characterize the texts with numerical vectors. The question is: how will you do that? - Taras
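To make the formula from the comment concrete, here is a minimal sketch for plain numeric vectors (the function and variable names are mine, not from the question):

    import math

    def cosine(x, y):
        # dot product x . y
        dot = sum(a * b for a, b in zip(x, y))
        # Euclidean norms |x| and |y|
        norm_x = math.sqrt(sum(a * a for a in x))
        norm_y = math.sqrt(sum(b * b for b in y))
        return dot / (norm_x * norm_y)

    print(cosine([1, 0, 1], [1, 1, 0]))  # 0.5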

1 answer

While writing this, it became clear that the topic is far too large to be covered simply and concisely in an answer. I would recommend turning to the classic textbooks - you will learn much more from them. For example, the IR-Book . However:


You cannot perform mathematical operations on strings. You can on vectors. The process of converting text to vectors is called vectorization. Vectorizing natural-language text is a relatively well-studied topic - the best-known example I can think of is word2vec - you can start with it right away instead of reinventing the wheel (or with its sibling doc2vec ). Its one problem is that a detailed understanding of how it works internally requires fairly advanced expertise. But nobody forbids using this tool as a black box.
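For example, a minimal black-box sketch with the gensim library (an assumption on my part - the question does not mention gensim; parameter names follow gensim 4.x, and the toy corpus is made up):

    from gensim.models import Word2Vec

    # A toy corpus: each document is a list of tokens
    sentences = [
        ["я", "люблю", "тортики"],
        ["я", "уважаю", "апельсины"],
    ]

    # Train a tiny model; vector_size is the dimensionality of word vectors
    model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=10)

    # Every word is now a numeric vector...
    vec = model.wv["тортики"]

    # ...and cosine similarity between words comes for free
    print(model.wv.similarity("тортики", "апельсины"))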

Now, about rolling your own. You can take a simple implementation of TF-IDF - with it you can obtain vectors and then operate on those vectors. I will write it in Python without any language-specific tricks (you did not specify a preferred language).

The order of my actions is:

  1. From the corpus, build a dictionary of words, where the key is the word and the value is its index in the vector (each word becomes a separate dimension of the resulting vectors).
  2. For each document in the corpus, compute TF-IDF for all words in the document. After that, each document can be mapped to a vector of dimension n , where n is the number of unique words in the corpus (see step 1). As a result, we obtain a term-document matrix of word occurrences. This approach is called Bag of Words (word order within a document is discarded, as if the words had been shaken up in a bag), and you can replace TF-IDF with any measure you like - for instance, simply count how many times each word occurs in a document. A ready-made shortcut is shown in the sketch right after this list.
  3. Later on, you can compare the resulting vectors with each other or with vectors of new documents.
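As that shortcut: this whole pipeline already exists in scikit-learn (an aside of mine - the question does not mention it). A minimal sketch, assuming scikit-learn is installed:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy corpus; TfidfVectorizer builds the term dictionary (step 1)
    # and the TF-IDF term-document matrix (step 2) in one call
    corpus = [
        "Я люблю тортики больше, чем яблоки",
        "Я уважаю апельсины больше, чем торты",
    ]
    matrix = TfidfVectorizer().fit_transform(corpus)

    # Step 3: pairwise cosine similarity between the document vectors
    print(cosine_similarity(matrix))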

The corpus is as follows:

 str1 = "Я люблю тортики больше, чем яблоки" str2 = "Я уважаю апельсины больше, чем торты" str3 = "Яблочные сады раскинулись над дорогой" str4 = "Ехал Грека через реку" 

As you can see, the number of words differs - and another problem appears as well: the same word is used in different forms/cases (тортики vs торты). This problem is solved in various ways - by stemming, or by normalization, where words are reduced to singular number and nominative case (тортики → тортик, торты → торт) - but that is a separate big topic. The same goes for filtering out punctuation marks: "больше," and "больше" are different words (tokens).
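For illustration, a minimal stemming sketch with NLTK's Snowball stemmer for Russian (an assumption - NLTK is not used below, and the stems it produces are truncated word forms, not guaranteed dictionary forms):

    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("russian")

    # Different surface forms should collapse to the same (or similar) stems
    for word in ["тортики", "торты", "яблоки", "яблочные"]:
        print(word, "->", stemmer.stem(word))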

    from collections import Counter
    import operator
    import math


    def tokenize(doc):
        words = [word.replace(',', '').lower() for word in doc.split()]
        return words


    def build_terms(corpus):
        terms = {}
        current_index = 0
        for doc in corpus:
            for word in tokenize(doc):
                if word not in terms:
                    terms[word] = current_index
                    current_index += 1
        return terms


    def tf(document, terms):
        words = tokenize(document)
        total_words = len(words)
        doc_counter = Counter(words)
        for word in doc_counter:
            # You could skip the division and keep the raw counts
            doc_counter[word] /= total_words
        tfs = [0 for _ in range(len(terms))]
        for term, index in terms.items():
            tfs[index] = doc_counter[term]
        return tfs


    def _count_docs_with_word(word, docs):
        # NB: this is a substring check against the raw document text,
        # not the token list; starting the counter at 1 keeps idf positive
        counter = 1
        for doc in docs:
            if word in doc:
                counter += 1
        return counter


    # documents is the corpus
    def idf(documents, terms):
        idfs = [0 for _ in range(len(terms))]
        total_docs = len(documents)
        for word, index in terms.items():
            docs_with_word = _count_docs_with_word(word, documents)
            # The base of the logarithm does not matter
            # Wary of negative values - keep it positive
            idf = 1 + math.log10(total_docs / docs_with_word)
            idfs[index] = idf
        return idfs


    def _merge_td_idf(tf, idf, terms):
        return [tf[i] * idf[i] for i in range(len(terms))]


    def build_tfidf(corpus, document, terms):
        doc_tf = tf(document, terms)
        doc_idf = idf(corpus, terms)
        return _merge_td_idf(doc_tf, doc_idf, terms)


    def cosine_similarity(vec1, vec2):
        # Taken verbatim from: http://stackoverflow.com/questions/18424228/cosine-similarity-between-2-number-lists
        def dot_product2(v1, v2):
            return sum(map(operator.mul, v1, v2))

        def vector_cos5(v1, v2):
            prod = dot_product2(v1, v2)
            len1 = math.sqrt(dot_product2(v1, v1))
            len2 = math.sqrt(dot_product2(v2, v2))
            return prod / (len1 * len2)

        return vector_cos5(vec1, vec2)


    str1 = "Я люблю тортики больше, чем яблоки"
    str2 = "Я уважаю апельсины больше, чем торты"
    str3 = "Яблочные сады раскинулись над дорогой"
    str4 = "Ехал Грека через реку"

    # Test documents
    check_str1 = "Тортики делают из муки, апельсины и воды"
    check_str2 = "Торты исчезли там, где появился я"
    check_str3 = "Ехал тортик через реку"

    # --------------------- Main code --------------------
    tf_idf_total = []
    corpus = (str1, str2, str3, str4)
    terms = build_terms(corpus)
    for document in corpus:
        tf_idf_total.append(build_tfidf(corpus, document, terms))

    print(terms.keys())
    for doc_rating in tf_idf_total:
        print(doc_rating)

    queries = (check_str1, check_str2, check_str3)
    for query in queries:
        print("QUERY:", query)
        query_tfidf = build_tfidf(corpus, query, terms)
        for index, document in enumerate(tf_idf_total):
            print("Similarity with DOC", index, "=",
                  cosine_similarity(query_tfidf, document))
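As a cross-check, if SciPy happens to be available (an assumption - it is not used above), its built-in cosine distance should agree with the hand-rolled cosine_similarity ; note that SciPy returns a distance, i.e. one minus the similarity:

    from scipy.spatial.distance import cosine

    # scipy's cosine() is a distance: similarity = 1 - distance
    print(1 - cosine([1, 0, 1], [1, 1, 0]))  # 0.5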

I wanted it shorter, but it did not work out. The results are as follows:

    dict_keys(['я', 'люблю', 'тортики', 'больше', 'чем', 'яблоки', 'уважаю', 'апельсины', 'торты', 'яблочные', 'сады', 'раскинулись', 'над', 'дорогой', 'ехал', 'грека', 'через', 'реку'])
    [0.21683833261066354, 0.21683833261066354, 0.21683833261066354, 0.18748978943471664, 0.18748978943471664, 0.21683833261066354, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    [0.21683833261066354, 0.0, 0.0, 0.18748978943471664, 0.18748978943471664, 0.0, 0.21683833261066354, 0.21683833261066354, 0.21683833261066354, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3204119982655925, 0.26020599913279624, 0.26020599913279624, 0.26020599913279624, 0.26020599913279624, 0.0, 0.0, 0.0, 0.0]
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4005149978319906, 0.4005149978319906, 0.3252574989159953, 0.3252574989159953]
    QUERY: Тортики делают из муки, апельсины и воды
    Similarity with DOC 0 = 0.3016416923422043
    Similarity with DOC 1 = 0.3016416923422043
    Similarity with DOC 2 = 0.0
    Similarity with DOC 3 = 0.0
    QUERY: Торты исчезли там, где появился я
    Similarity with DOC 0 = 0.3016416923422042
    Similarity with DOC 1 = 0.6032833846844085
    Similarity with DOC 2 = 0.0
    Similarity with DOC 3 = 0.0
    QUERY: Ехал тортик через реку
    Similarity with DOC 0 = 0.0
    Similarity with DOC 1 = 0.0
    Similarity with DOC 2 = 0.0
    Similarity with DOC 3 = 0.8358857904935546