I want to implement a verification function that takes a string and checks it against templates from the database until a matching one is found, returning false if nothing matches at all.
The template syntax is something like regex, but simpler. For example:
- [Привет,Хеллоу,Добрый день,Здарова] - these are all the words that can be in the input data. Examples:
  - Input string: "Здарова" - true
  - Input string: "Привет" - true
  - Input string: "Пока" - false
- * - any number of characters. Pattern: [Привет,Хеллоу,Добрый день,Здарова], * [дела,жизнь,здоровье,семья]?
  - Input string: "Здарова, как ты?" - false
  - Input string: "Привет, как там твои дела?" - true
  - Input string: "Добрый день, как ваше здоровье?" - true
- [:param] - sets a parameter. Pattern: [Привет,Здарова], я [:name]
  - Input string: "Привет, я Человек" - the function returns JSON {name: "Человек"}
- ~ - a word with an arbitrary ending. Pattern: [Прив~]
  - Input string: "Прувет" - false
  - Input string: "Привет" - true
  - Input string: "Привот" - true
  - Input string: "Прива" - true
  - Input string: "Привандр" - true
There is much more that I would like to add later, but for a start I would like to get at least some of this working.
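To make the intended semantics concrete, the template syntax described above can be translated into an ordinary regular expression. This is only a minimal sketch under my own assumptions: the names compile_template and match are hypothetical, a [:param] is assumed to capture one or more arbitrary characters, and ~ is assumed to mean "any run of word characters".

```python
import re

# Splits a template into [bracket groups], '*' wildcards and literal text.
TOKEN = re.compile(r"\[([^\]]*)\]|(\*)|([^\[*]+)")

def compile_template(template):
    """Translate a template into a compiled regex (hypothetical helper)."""
    parts = []
    for group, star, literal in TOKEN.findall(template):
        if star:
            parts.append(".*")                      # * = any number of characters
        elif literal:
            parts.append(re.escape(literal))        # plain text must match as-is
        elif group.startswith(":"):
            parts.append("(?P<%s>.+)" % group[1:])  # [:name] = named parameter
        else:
            options = []
            for option in group.split(","):
                option = option.strip()
                if option.endswith("~"):
                    # word~ = the given beginning followed by any ending
                    options.append(re.escape(option[:-1]) + r"\w*")
                else:
                    options.append(re.escape(option))
            parts.append("(?:" + "|".join(options) + ")")
    return re.compile("".join(parts), re.IGNORECASE)

def match(template, text):
    """Return captured parameters, True if there are none, or False."""
    m = compile_template(template).fullmatch(text)
    if m is None:
        return False
    return m.groupdict() or True

greeting = "[Привет,Хеллоу,Добрый день,Здарова], * [дела,жизнь,здоровье,семья]?"
print(match(greeting, "Привет, как там твои дела?"))  # True
print(match(greeting, "Здарова, как ты?"))            # False
print(match("[Привет,Здарова], я [:name]", "Привет, я Человек"))  # {'name': 'Человек'}
```

With such a translation, the word index described below would only be needed to pick candidate templates, and the final true/false decision could be left to the regex.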
For the implementation I decided to use something like full-text search. However, since I do not store big data in the database, and the records are basically just template strings to check against, I changed the approach a bit:
- There is a database in which all templates are stored in the form: {id: 1, template: '......'}
- There is another database that stores every word from the templates together with the ids of the templates in which it occurs: {id: 1, word: 'hello', id_temp: '1|5|16|23|43'}
When adding a template to the database, I strip all unnecessary characters, split the string into words, and get an array of strings:
    text = "[[Hello,Hi,Hey, Good day, Good Morning]] * [$[how are]$] [$[you]$]?"
    text = text.lower()
    # Replace every template delimiter and punctuation mark with a space
    for token in ('[[', ']]', '[$[', ']$]', ',', '*', '?'):
        text = text.replace(token, ' ')
    # split() without arguments also collapses runs of whitespace
    words = text.split()
    print(words)
    # ['hello', 'hi', 'hey', 'good', 'day', 'good', 'morning', 'how', 'are', 'you']

Next, the words are added to the database. First I check whether the word is already in the database. If it is, I append the id of the new template to its field {id_temp: '...|id_new_temp'}; otherwise I add the word itself together with the id of the new template:
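    from tests.models import Words


    def writeWord(word_to_write, temp_id):
        # Create a new record for a word that is not in the database yet
        newWord = Words(word=word_to_write, temp_id=temp_id)
        newWord.save()


    def findWord(word_to_find, new_temp_id):
        w = Words.select().where(Words.word == word_to_find)
        if w:
            # The word already exists; for now the query is only printed,
            # appending new_temp_id to the record is not implemented yet
            print(w)
        else:
            writeWord(word_to_find, new_temp_id)


    template_id = 2
    text = 'Hello my friend'
    words = text.lower().split(' ')
    for s in words:
        findWord(s, template_id)

peewee is used as the ORM.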
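The code above only prints something when the word is already known. The "append the new template id" branch from the description could look roughly like this. To keep the sketch independent of the real peewee model, a plain dictionary stands in for the words table (that substitution, and the name add_template_words, are assumptions made for illustration).

```python
def add_template_words(index, words, template_id):
    """index maps word -> pipe-separated template ids, like the id_temp field."""
    for word in set(words):
        # Existing ids for this word, or an empty list for a new word
        ids = index[word].split("|") if word in index else []
        if str(template_id) not in ids:
            ids.append(str(template_id))
        index[word] = "|".join(ids)

index = {}
add_template_words(index, ["hello", "my", "friend"], 1)
add_template_words(index, ["hello", "world"], 2)
print(index["hello"])  # 1|2
```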
The process of adding a template ends here. The next step is searching the template database for an input string. There is no code for this yet, but the algorithm is not complicated:
- I get a string, break it into words and get an array of strings
- I look up every word in the database of words; from the found records I take the id_temp field, split it as well, and write the ids to a dictionary: {'id': count}
- Whenever a new id is about to be added to the dictionary, I check whether that id is already there; if it is, its count is increased, otherwise an entry is added to the dictionary with count=1
- After all the words have been looked up in the database, I take the dictionary and look for the entry with the largest count value; this is the id of the template that most likely matches the input string.
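The lookup steps above can be sketched with collections.Counter. As before, a dictionary with pipe-separated ids stands in for the words table, and best_template is a hypothetical name; this is an illustration of the described counting, not tested against a real database.

```python
from collections import Counter

def best_template(index, text):
    """Return the template id shared by the most input words, or None."""
    counts = Counter()
    for word in text.lower().split():
        # A missing word yields "", which splits into [""] and is skipped
        for template_id in index.get(word, "").split("|"):
            if template_id:
                counts[template_id] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]

index = {"hello": "1|5", "friend": "5", "world": "1|7"}
print(best_template(index, "Hello my friend"))  # 5
```

Note that the winner is only the most likely candidate: it still has to be checked against the actual template, because sharing words does not guarantee a match.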
My question is: is there an established approach to this kind of problem, or how can I optimize the reinvented wheel that I have put together here? For example, if one word occurs in 2000 templates, the record for that word becomes unwieldy: {id: 35, word: 'hello', 'id_temp': '1|2|3|523|2323|32|123|124|345|234|133|234|123|234|123|42|545......'} - how can this be optimized?
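For illustration, one common alternative to a pipe-separated id field is a separate link table with one row per (word, template) pair, so the counting can be done by the database itself. Below is a minimal sketch using the standard library's sqlite3 module rather than peewee; the table names, column names and helper functions are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE word (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE word_template (
        word_id INTEGER REFERENCES word(id),
        template_id INTEGER,
        PRIMARY KEY (word_id, template_id)
    );
""")

def add_word(word, template_id):
    # INSERT OR IGNORE keeps both tables free of duplicates
    conn.execute("INSERT OR IGNORE INTO word (word) VALUES (?)", (word,))
    word_id = conn.execute(
        "SELECT id FROM word WHERE word = ?", (word,)
    ).fetchone()[0]
    conn.execute(
        "INSERT OR IGNORE INTO word_template VALUES (?, ?)", (word_id, template_id)
    )

def rank_templates(words):
    # Count, per template, how many of the input words it contains
    placeholders = ",".join("?" * len(words))
    return conn.execute(
        "SELECT wt.template_id, COUNT(*) AS hits "
        "FROM word_template wt JOIN word w ON w.id = wt.word_id "
        "WHERE w.word IN (%s) "
        "GROUP BY wt.template_id ORDER BY hits DESC, wt.template_id" % placeholders,
        words,
    ).fetchall()

add_word("hello", 1)
add_word("hello", 2)
add_word("friend", 2)
print(rank_templates(["hello", "friend"]))  # [(2, 2), (1, 1)]
```

In this layout a word that occurs in 2000 templates is simply 2000 short rows, and the primary key on (word_id, template_id) serves as the index for the lookup.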