Suppose there are two text documents, each represented as an array of its words: D1 = {word1, word2, word3} and D2 = {word1, word5, word8} .

How do I find the cosine similarity of these two vectors? And how can strings be multiplied (or subjected to other mathematical operations) in a programming language?

And what if the vectors are of different sizes? (For example, one document may contain more words than the other.)

  • The cosine measure for numerical vectors is their dot product divided by the product of their norms: x · y / (|x| · |y|). Apparently, it is assumed that you will somehow characterize the texts with numerical vectors. The question is: how will you do that? - Taras
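To make the formula from the comment concrete, here is a minimal sketch for plain numeric vectors (the function and variable names are mine, not from the question):

    import math

    def cosine(x, y):
        # dot product x . y
        dot = sum(a * b for a, b in zip(x, y))
        # Euclidean norms |x| and |y|
        norm_x = math.sqrt(sum(a * a for a in x))
        norm_y = math.sqrt(sum(b * b for b in y))
        return dot / (norm_x * norm_y)

    print(cosine([1, 0, 1], [1, 1, 0]))  # 0.5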

1 answer

While writing this, it became clear that the topic is far too large to be covered simply and concisely in an answer. I would recommend turning to the classic textbooks - you will learn much more from them. For example, the IR-Book . However:


You cannot perform mathematical operations on strings. You can on vectors. The process of converting text to vectors is called vectorization. Vectorizing natural-language text is a relatively well-studied topic - the best-known example I can think of is word2vec - you can start with it right away instead of reinventing the wheel (or with its sibling doc2vec ). Its one problem is that a detailed understanding of how it works internally requires fairly advanced expertise. But nobody forbids using this tool as a black box.
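For example, a minimal black-box sketch with the gensim library (an assumption on my part - the question does not mention gensim; parameter names follow gensim 4.x, and the toy corpus is made up):

    from gensim.models import Word2Vec

    # A toy corpus: each document is a list of tokens
    sentences = [
        ["я", "люблю", "тортики"],
        ["я", "уважаю", "апельсины"],
    ]

    # Train a tiny model; vector_size is the dimensionality of word vectors
    model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=10)

    # Every word is now a numeric vector...
    vec = model.wv["тортики"]

    # ...and cosine similarity between words comes for free
    print(model.wv.similarity("тортики", "апельсины"))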

Now, about rolling your own. You can take a simple implementation of TF-IDF - with it you can obtain vectors and then operate on those vectors. I will write it in Python without any language-specific tricks (you did not specify a preferred language).

The order of my actions is:

  1. From the corpus, build a dictionary of words, where the key is the word and the value is its index in the vector (each word becomes a separate dimension of the resulting vectors).
  2. For each document in the corpus, compute TF-IDF for all words in the document. After that, each document can be mapped to a vector of dimension n , where n is the number of unique words in the corpus (see step 1). As a result, we obtain a term-document matrix of word occurrences. This approach is called Bag of Words (word order within a document is discarded, as if the words had been shaken up in a bag), and you can replace TF-IDF with any measure you like - for instance, simply count how many times each word occurs in a document. A ready-made shortcut is shown in the sketch right after this list.
  3. Later on, you can compare the resulting vectors with each other or with vectors of new documents.
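As that shortcut: this whole pipeline already exists in scikit-learn (an aside of mine - the question does not mention it). A minimal sketch, assuming scikit-learn is installed:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy corpus; TfidfVectorizer builds the term dictionary (step 1)
    # and the TF-IDF term-document matrix (step 2) in one call
    corpus = [
        "Я люблю тортики больше, чем яблоки",
        "Я уважаю апельсины больше, чем торты",
    ]
    matrix = TfidfVectorizer().fit_transform(corpus)

    # Step 3: pairwise cosine similarity between the document vectors
    print(cosine_similarity(matrix))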

The corpus is as follows:

 str1 = "Я люблю тортики больше, чем яблоки" str2 = "Я уважаю апельсины больше, чем торты" str3 = "Яблочные сады раскинулись над дорогой" str4 = "Ехал Грека через реку" 

As you can see, the number of words differs - and another problem appears as well: the same word is used in different forms/cases (тортики vs торты). This problem is solved in various ways - by stemming, or by normalization, where words are reduced to singular number and nominative case (тортики → тортик, торты → торт) - but that is a separate big topic. The same goes for filtering out punctuation marks: "больше," and "больше" are different words (tokens).
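For illustration, a minimal stemming sketch with NLTK's Snowball stemmer for Russian (an assumption - NLTK is not used below, and the stems it produces are truncated word forms, not guaranteed dictionary forms):

    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("russian")

    # Different surface forms should collapse to the same (or similar) stems
    for word in ["тортики", "торты", "яблоки", "яблочные"]:
        print(word, "->", stemmer.stem(word))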

    from collections import Counter
    import operator
    import math


    def tokenize(doc):
        words = [word.replace(',', '').lower() for word in doc.split()]
        return words


    def build_terms(corpus):
        terms = {}
        current_index = 0
        for doc in corpus:
            for word in tokenize(doc):
                if word not in terms:
                    terms[word] = current_index
                    current_index += 1
        return terms


    def tf(document, terms):
        words = tokenize(document)
        total_words = len(words)
        doc_counter = Counter(words)
        for word in doc_counter:
            # You could skip the division and keep the raw counts
            doc_counter[word] /= total_words
        tfs = [0 for _ in range(len(terms))]
        for term, index in terms.items():
            tfs[index] = doc_counter[term]
        return tfs


    def _count_docs_with_word(word, docs):
        # NB: this is a substring check against the raw document text,
        # not the token list; starting the counter at 1 keeps idf positive
        counter = 1
        for doc in docs:
            if word in doc:
                counter += 1
        return counter


    # documents is the corpus
    def idf(documents, terms):
        idfs = [0 for _ in range(len(terms))]
        total_docs = len(documents)
        for word, index in terms.items():
            docs_with_word = _count_docs_with_word(word, documents)
            # The base of the logarithm does not matter
            # Wary of negative values - keep it positive
            idf = 1 + math.log10(total_docs / docs_with_word)
            idfs[index] = idf
        return idfs


    def _merge_td_idf(tf, idf, terms):
        return [tf[i] * idf[i] for i in range(len(terms))]


    def build_tfidf(corpus, document, terms):
        doc_tf = tf(document, terms)
        doc_idf = idf(corpus, terms)
        return _merge_td_idf(doc_tf, doc_idf, terms)


    def cosine_similarity(vec1, vec2):
        # Taken verbatim from: http://stackoverflow.com/questions/18424228/cosine-similarity-between-2-number-lists
        def dot_product2(v1, v2):
            return sum(map(operator.mul, v1, v2))

        def vector_cos5(v1, v2):
            prod = dot_product2(v1, v2)
            len1 = math.sqrt(dot_product2(v1, v1))
            len2 = math.sqrt(dot_product2(v2, v2))
            return prod / (len1 * len2)

        return vector_cos5(vec1, vec2)


    str1 = "Я люблю тортики больше, чем яблоки"
    str2 = "Я уважаю апельсины больше, чем торты"
    str3 = "Яблочные сады раскинулись над дорогой"
    str4 = "Ехал Грека через реку"

    # Test documents
    check_str1 = "Тортики делают из муки, апельсины и воды"
    check_str2 = "Торты исчезли там, где появился я"
    check_str3 = "Ехал тортик через реку"

    # --------------------- Main code --------------------
    tf_idf_total = []
    corpus = (str1, str2, str3, str4)
    terms = build_terms(corpus)
    for document in corpus:
        tf_idf_total.append(build_tfidf(corpus, document, terms))

    print(terms.keys())
    for doc_rating in tf_idf_total:
        print(doc_rating)

    queries = (check_str1, check_str2, check_str3)
    for query in queries:
        print("QUERY:", query)
        query_tfidf = build_tfidf(corpus, query, terms)
        for index, document in enumerate(tf_idf_total):
            print("Similarity with DOC", index, "=",
                  cosine_similarity(query_tfidf, document))
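As a cross-check, if SciPy happens to be available (an assumption - it is not used above), its built-in cosine distance should agree with the hand-rolled cosine_similarity ; note that SciPy returns a distance, i.e. one minus the similarity:

    from scipy.spatial.distance import cosine

    # scipy's cosine() is a distance: similarity = 1 - distance
    print(1 - cosine([1, 0, 1], [1, 1, 0]))  # 0.5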

I wanted it shorter, but it did not work out. The results are as follows:

    dict_keys(['я', 'люблю', 'тортики', 'больше', 'чем', 'яблоки', 'уважаю', 'апельсины', 'торты', 'яблочные', 'сады', 'раскинулись', 'над', 'дорогой', 'ехал', 'грека', 'через', 'реку'])
    [0.21683833261066354, 0.21683833261066354, 0.21683833261066354, 0.18748978943471664, 0.18748978943471664, 0.21683833261066354, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    [0.21683833261066354, 0.0, 0.0, 0.18748978943471664, 0.18748978943471664, 0.0, 0.21683833261066354, 0.21683833261066354, 0.21683833261066354, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3204119982655925, 0.26020599913279624, 0.26020599913279624, 0.26020599913279624, 0.26020599913279624, 0.0, 0.0, 0.0, 0.0]
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4005149978319906, 0.4005149978319906, 0.3252574989159953, 0.3252574989159953]
    QUERY: Тортики делают из муки, апельсины и воды
    Similarity with DOC 0 = 0.3016416923422043
    Similarity with DOC 1 = 0.3016416923422043
    Similarity with DOC 2 = 0.0
    Similarity with DOC 3 = 0.0
    QUERY: Торты исчезли там, где появился я
    Similarity with DOC 0 = 0.3016416923422042
    Similarity with DOC 1 = 0.6032833846844085
    Similarity with DOC 2 = 0.0
    Similarity with DOC 3 = 0.0
    QUERY: Ехал тортик через реку
    Similarity with DOC 0 = 0.0
    Similarity with DOC 1 = 0.0
    Similarity with DOC 2 = 0.0
    Similarity with DOC 3 = 0.8358857904935546