I am new to Python, so my algorithms are most likely inefficient. I am implementing a fuzzy search to find the matching answer in a database. My implementation idea is as follows.
Bring the database to a normal form, that is, remove the auxiliary parts of speech and apply lemmatization + stemming. Implementation of removing the auxiliary parts of speech:
```python
import pymorphy2

def serviceWordsPosition(word, wordAnalyser=pymorphy2.MorphAnalyzer()):  # determine the word's part of speech
    return wordAnalyser.parse(word)[0].tag.POS

def SentenceToBase(sentence):  # convert a string to its base form
    accomulatingString = ''
    words = sentence.split()
    functors_QuestionsToBase = {'INTJ', 'PRCL', 'CONJ', 'PREP'}  # interjections, particles, conjunctions, prepositions
    for word in words:
        if serviceWordsPosition(word) not in functors_QuestionsToBase:
            accomulatingString += word + ' '
    return accomulatingString

def QuestionsToBase(QuestionsList):  # convert the whole database to base form
    QuestionsListBaseForm = []
    for sentence in QuestionsList:
        QuestionsListBaseForm.append(SentenceToBase(sentence))
    return QuestionsListBaseForm
```
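For illustration, a minimal usage example (the sample phrases are made up; pymorphy2 works with Russian text, and the exact filtering depends on its most probable parse for each word):

```python
# Hypothetical sample questions, for illustration only.
questions = ['скинь ссылку на учебник', 'как установить библиотеку']
print(QuestionsToBase(questions))
# The preposition 'на' should be filtered out, e.g. 'скинь ссылку учебник '
```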
Bringing the requested phrase to normal form:

```python
def SentenceToRoot(sentence):
    # Replace every word with its most probable lemma.
    morph = pymorphy2.MorphAnalyzer()
    sentence = ' '.join([morph.normal_forms(w)[0] for w in sentence.split()])
    return sentence
```
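To show what the lemmatization step does: morph.normal_forms() returns a list of candidate lemmas, and [0] takes the most probable one (the words below are my own examples):

```python
morph = pymorphy2.MorphAnalyzer()
print(morph.normal_forms('книгами'))  # ['книга'] - the lemma of an inflected noun
print(SentenceToRoot('читал книги'))  # 'читать книга' - every word replaced by its lemma
```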
Using the Levenshtein distance, I compute the edit distance between each line in the normalized database and the query phrase. The library I use expresses the distance as a similarity percentage:

```python
from fuzzywuzzy import fuzz  # fuzz.ratio returns Levenshtein-based similarity as an integer 0-100

SimularitySum = []
for questionsentence in QuestionDataBase:
    s = fuzz.ratio(questionsentence, messagebaseform)
    SimularitySum.append(s)
    print(questionsentence, messagebaseform, s)
```
I output the answer corresponding to the maximum similarity, i.e. the minimum edit distance:

```python
print(AnswerDataBase[SimularitySum.index(max(SimularitySum))])
```
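For reference, a minimal self-contained sketch of how I wire these steps together, combining service-word removal and lemmatization into one normalize() helper for brevity; the question/answer lists and the incoming message are toy placeholders I made up for illustration:

```python
import pymorphy2
from fuzzywuzzy import fuzz

morph = pymorphy2.MorphAnalyzer()
SERVICE_POS = {'INTJ', 'PRCL', 'CONJ', 'PREP'}  # parts of speech to drop

def normalize(sentence):
    # Drop service words, then replace each remaining word with its lemma.
    words = [w for w in sentence.split() if morph.parse(w)[0].tag.POS not in SERVICE_POS]
    return ' '.join(morph.normal_forms(w)[0] for w in words)

# Toy question/answer base, placeholders for the real data.
questions = ['как установить библиотеку', 'где скачать учебник']
AnswerDataBase = ['Используйте pip install.', 'Ссылка на учебник в описании.']

QuestionDataBase = [normalize(q) for q in questions]
messagebaseform = normalize('подскажи как установить библиотеки')

SimularitySum = [fuzz.ratio(q, messagebaseform) for q in QuestionDataBase]
print(AnswerDataBase[SimularitySum.index(max(SimularitySum))])
```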
The problem is that the accuracy of the results is too low (32-44% according to my tests). How should I improve the algorithm to increase the accuracy?