I am new to Python, so my algorithms are most likely inefficient. I am implementing a fuzzy search to find the matching answer in a database. My implementation idea is as follows.
Bring the database to a normal form, that is, remove the auxiliary parts of speech and apply lemmatization + stemming. Implementation of removing the auxiliary parts of speech:
```python
import pymorphy2

def serviceWordsPosition(word, wordAnalyser=pymorphy2.MorphAnalyzer()):  # determine the word's part of speech
    return wordAnalyser.parse(word)[0].tag.POS

def SentenceToBase(sentence):  # convert a string to its base form
    accomulatingString = ''
    words = sentence.split()
    functors_QuestionsToBase = {'INTJ', 'PRCL', 'CONJ', 'PREP'}  # interjections, particles, conjunctions, prepositions
    for word in words:
        if serviceWordsPosition(word) not in functors_QuestionsToBase:
            accomulatingString += word + ' '
    return accomulatingString

def QuestionsToBase(QuestionsList):  # convert the whole database to base form
    QuestionsListBaseForm = []
    for sentence in QuestionsList:
        QuestionsListBaseForm.append(SentenceToBase(sentence))
    return QuestionsListBaseForm
```
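For illustration, a minimal usage example (the sample phrases are made up; pymorphy2 works with Russian text, and the exact filtering depends on its most probable parse for each word):

```python
# Hypothetical sample questions, for illustration only.
questions = ['скинь ссылку на учебник', 'как установить библиотеку']
print(QuestionsToBase(questions))
# The preposition 'на' should be filtered out, e.g. 'скинь ссылку учебник '
```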
Bringing the requested phrase to normal form:

```python
def SentenceToRoot(sentence):
    # Replace every word with its most probable lemma.
    morph = pymorphy2.MorphAnalyzer()
    sentence = ' '.join([morph.normal_forms(w)[0] for w in sentence.split()])
    return sentence
```
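To show what the lemmatization step does: morph.normal_forms() returns a list of candidate lemmas, and [0] takes the most probable one (the words below are my own examples):

```python
morph = pymorphy2.MorphAnalyzer()
print(morph.normal_forms('книгами'))  # ['книга'] - the lemma of an inflected noun
print(SentenceToRoot('читал книги'))  # 'читать книга' - every word replaced by its lemma
```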
Using the Levenshtein distance, I compute the edit distance between each line in the normalized database and the query phrase. The library I use expresses the distance as a similarity percentage:

```python
from fuzzywuzzy import fuzz  # fuzz.ratio returns Levenshtein-based similarity as an integer 0-100

SimularitySum = []
for questionsentence in QuestionDataBase:
    s = fuzz.ratio(questionsentence, messagebaseform)
    SimularitySum.append(s)
    print(questionsentence, messagebaseform, s)
```
I output the answer corresponding to the maximum similarity, i.e. the minimum edit distance:

```python
print(AnswerDataBase[SimularitySum.index(max(SimularitySum))])
```
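For reference, a minimal self-contained sketch of how I wire these steps together, combining service-word removal and lemmatization into one normalize() helper for brevity; the question/answer lists and the incoming message are toy placeholders I made up for illustration:

```python
import pymorphy2
from fuzzywuzzy import fuzz

morph = pymorphy2.MorphAnalyzer()
SERVICE_POS = {'INTJ', 'PRCL', 'CONJ', 'PREP'}  # parts of speech to drop

def normalize(sentence):
    # Drop service words, then replace each remaining word with its lemma.
    words = [w for w in sentence.split() if morph.parse(w)[0].tag.POS not in SERVICE_POS]
    return ' '.join(morph.normal_forms(w)[0] for w in words)

# Toy question/answer base, placeholders for the real data.
questions = ['как установить библиотеку', 'где скачать учебник']
AnswerDataBase = ['Используйте pip install.', 'Ссылка на учебник в описании.']

QuestionDataBase = [normalize(q) for q in questions]
messagebaseform = normalize('подскажи как установить библиотеки')

SimularitySum = [fuzz.ratio(q, messagebaseform) for q in QuestionDataBase]
print(AnswerDataBase[SimularitySum.index(max(SimularitySum))])
```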
The problem is that the accuracy of the results is too low (32-44% according to my tests). How should I improve the algorithm to increase the accuracy?