I am new to Python, so my algorithms are most likely inefficient. For a course project I am implementing a fuzzy search that finds the answer to a question in a database. My implementation idea:

  1. Bring the database to a normalized form, that is, remove function words (auxiliary parts of speech) and apply lemmatization + stemming. The code that removes function words:

     import pymorphy2

     def serviceWordsPosition(word, wordAnalyser=pymorphy2.MorphAnalyzer()):
         # Word type analyzer: returns the part-of-speech tag of the word
         return wordAnalyser.parse(word)[0].tag.POS

     def SentenceToBase(sentence):
         # Convert a sentence to base form by dropping function words
         accomulatingString = ''
         words = sentence.split()
         functors_QuestionsToBase = {'INTJ', 'PRCL', 'CONJ', 'PREP'}
         for word in words:
             if serviceWordsPosition(word) not in functors_QuestionsToBase:
                 accomulatingString += word + ' '
         return accomulatingString

     def QuestionsToBase(QuestionsList):
         # Convert the whole database to base form
         QuestionsListBaseForm = []
         for sentence in QuestionsList:
             QuestionsListBaseForm.append(SentenceToBase(sentence))
         return QuestionsListBaseForm
  2. Bring the query phrase to normal form:

     def SentenceToRoot(sentence):
         morph = pymorphy2.MorphAnalyzer()
         sentence = ' '.join([morph.normal_forms(w)[0] for w in sentence.split()])
         return sentence
  3. Using the Levenshtein distance, I compute the edit distance between each line of the normalized database and the query phrase. In the library I use, the distance is expressed as a similarity percentage:

     SimularitySum = []
     for questionsentence in QuestionDataBase:
         s = fuzz.ratio(questionsentence, messagebaseform)
         SimularitySum.append(s)
         print(questionsentence, messagebaseform, s)
  4. I output the answer corresponding to the minimum edit distance (that is, the maximum similarity):

     print(AnswerDataBase[SimularitySum.index(max(SimularitySum))]) 
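
Steps 3 and 4 can be sketched with the standard library alone. This is only an illustration and not the library used above: `levenshtein`, `similarity_percent` and `best_answer` are hypothetical helper names, and the percentage formula (one minus the distance divided by the longer length) is an assumption that may differ from what `fuzz.ratio` computes.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Map the distance to a 0..100 score (assumed formula, see lead-in)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are identical
    return 100.0 * (1 - levenshtein(a, b) / longest)


def best_answer(query, questions, answers):
    """Return the answer whose question scores highest against the query."""
    scores = [similarity_percent(q, query) for q in questions]
    return answers[scores.index(max(scores))]
```

For example, `levenshtein("kitten", "sitting")` is 3. Both the query and the stored questions would be normalized before being passed to `best_answer`, as in steps 1 and 2.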

The problem is that the accuracy of the results is too low (32-44% according to my tests). How can I improve the algorithm to increase the accuracy?
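
For reference, an accuracy figure of this kind is often measured as top-1 accuracy on a set of labelled test queries. The following is a minimal sketch under that assumption, not necessarily how the numbers above were obtained. The names `score` and `top1_accuracy` are hypothetical, and `difflib.SequenceMatcher` is used here only as a standard-library stand-in for the similarity function.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Stand-in similarity in percent, based on the stdlib difflib module."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def top1_accuracy(test_cases, questions, answers):
    """Share of test queries whose best-scoring question gives the expected answer.

    test_cases is a list of (query, expected_answer) pairs (assumed format).
    """
    if not test_cases:
        return 0.0
    hits = 0
    for query, expected in test_cases:
        scores = [score(q, query) for q in questions]
        predicted = answers[scores.index(max(scores))]
        hits += predicted == expected
    return hits / len(test_cases)
```

Running the same test set after each change to the normalization or the scoring shows whether that change actually helps.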

  • What do you mean by a database? And is the task to find a single word, a combination of words, or a sentence? - MaxU
  • Regarding step 3: do I understand correctly that you are comparing a normalized string from the database with a non-normalized query? Try comparing both in normalized form. - insolor
  • @insolor, I compare both strings in normalized form; I just did not include that fragment. - Kirya522
  • @Kirya522, how do you measure the accuracy of the results? 32-34% compared to what? - insolor
  • @Kirya522, look at this answer; a similar approach might suit you. It's hard to say without seeing examples of your data and "inaccurate queries"... - MaxU
