I want to implement a verification function that takes a string and checks it against templates from the database until a matching one is found, and returns false if nothing matches at all.

The template syntax is something like regex, but simpler. For example:

  • "[Привет,Хеллоу,Добрый день,Здарова]" - a list of all the words that may appear in the input. Examples:

    1. Input string: "Здарова" - true
    2. Input string: "Привет" - true
    3. Input string: "Пока" - false
  • * - any number of characters. Pattern: "[Привет,Хеллоу,Добрый день,Здарова], * [дела,жизнь,здоровье,семья]?"

    1. Input string: "Здарова, как ты?" - false
    2. Input string: "Привет, как там твои дела?" - true
    3. Input string: "Добрый день, как ваше здоровье?" - true
  • [:param] - captures a parameter. Pattern: "[Привет,Здарова], я [:name]". Input string: "Привет, я Человек" - the function returns the JSON {name: "Человек"}

  • ~ - a word with an arbitrary ending. Pattern: "[Прив~]"

    1. Input string: "Прувет" - false
    2. Input string: "Привет" - true
    3. Input string: "Привот" - true
    4. Input string: "Прива" - true
    5. Input string: "Привандр" - true
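The syntax above maps fairly directly onto ordinary regular expressions, so one possible way to check a single template is to translate it into a Python `re` pattern. The following is only a sketch under assumptions: the function names `compile_template` and `check` are hypothetical, the whole input must match the template, matching is case-sensitive, and a match without parameters returns `True` while a match with parameters returns a dict.

```python
import re

# Tokenizer for the template mini-language (grammar assumed from the examples):
#   [:name] -> named parameter, [a,b,c] -> alternatives, * -> any characters,
#   everything else -> literal text.
_TOKEN = re.compile(r"\[:([A-Za-z_]\w*)\]|\[([^\]]*)\]|(\*)|([^\[*]+|\[)")


def _word_to_regex(word):
    """Escape one alternative; '~' stands for an arbitrary word ending."""
    return r"\w*".join(re.escape(part) for part in word.split("~"))


def compile_template(template):
    """Translate a template string into a compiled regular expression."""
    parts = []
    for m in _TOKEN.finditer(template):
        if m.lastindex == 1:    # [:name] -> named capture group
            parts.append(f"(?P<{m.group(1)}>.+)")
        elif m.lastindex == 2:  # [a,b,c] -> non-capturing alternation
            words = [w.strip() for w in m.group(2).split(",")]
            parts.append("(?:" + "|".join(_word_to_regex(w) for w in words) + ")")
        elif m.lastindex == 3:  # * -> any number of characters
            parts.append(".*")
        else:                   # literal text between the special tokens
            parts.append(re.escape(m.group(4)))
    return re.compile("".join(parts))


def check(template, text):
    """Return the captured parameters (or True if there are none), else False."""
    m = compile_template(template).fullmatch(text)
    return (m.groupdict() or True) if m else False


print(check("[Привет,Здарова], я [:name]", "Привет, я Человек"))
```

With this sketch the examples listed above behave as described, for instance `check("[Прив~]", "Прувет")` is `False` and `check("[Прив~]", "Привандр")` is `True`. In practice the compiled patterns would probably be cached per template rather than rebuilt on every call.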

There is much more that I would like to add later, but for a start I would like to get at least some of this working.

For the implementation I decided to use something like full-text search. However, since I do not store large amounts of data, and the database records are basically just template strings to check against, I changed the approach a little:

  1. There is a table in which all templates are stored in the form: {id: 1, template: '......'}
  2. There is a second table that stores every word from the templates together with the ids of the templates in which it occurs: {id: 1, word: 'hello', id_temp: '1|5|16|23|43'}

When adding a template to the database, I remove all unnecessary characters and split the string into words:

    template = "[[Hello,Hi,Hey, Good day, Good Morning]] * [$[how are]$] [$[you]$]?"

    # Lowercase the template and replace its markup with spaces
    cleaned = (
        template.lower()
        .replace('[[', ' ')
        .replace(']]', ' ')
        .replace('[$[', ' ')
        .replace(']$]', ' ')
        .replace(',', ' ')
        .replace('*', ' ')
    )

    # Collapse runs of spaces into a single space
    while '  ' in cleaned:
        cleaned = cleaned.replace('  ', ' ')

    print(cleaned)  # words separated by single spaces: hello hi hey good day good morning how are you ?
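As an alternative to the chain of `replace` calls, the same word list can be obtained with a single regular expression. This is only a sketch: the helper name `template_words` is hypothetical, and it assumes that every run of letters, digits or underscores counts as a word, so a parameter name such as `name` in `[:name]` would also be picked up.

```python
import re


def template_words(template):
    """Return the lowercase words of a template, ignoring markup and punctuation."""
    return re.findall(r"\w+", template.lower())


words = template_words(
    "[[Hello,Hi,Hey, Good day, Good Morning]] * [$[how are]$] [$[you]$]?"
)
print(words)
# ['hello', 'hi', 'hey', 'good', 'day', 'good', 'morning', 'how', 'are', 'you']
```

Unlike the `replace` version, this also drops the trailing `?` and returns a list directly.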

Next, the words are added to the database. For each word I check whether it is already stored. If it is, I append the id of the new template to its field {id_temp: '...|id_new_temp'}; otherwise I add the word itself together with the id of the new template:

    from tests.models import Words


    def writeWord(word_to_write, temp_id):
        # Create a new record for a word that is not in the table yet
        newWord = Words(word=word_to_write, temp_id=temp_id)
        newWord.save()


    def findWord(word_to_find, new_temp_id):
        w = Words.select().where(Words.word == word_to_find)
        if w:
            # The word already exists; appending new_temp_id is not implemented yet
            print(w)
        else:
            writeWord(word_to_find, new_temp_id)


    template_id = 2
    text = 'Hello my friend'
    words = text.lower().split(' ')
    for word in words:
        findWord(word, template_id)
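The "append the id if the word already exists" branch is described above but not yet written. A minimal sketch of that logic is shown below. To keep it self-contained it models the word table as a plain dict mapping each word to its `'1|5|16'` string instead of using peewee, and the names `add_word` and `index_template` are hypothetical.

```python
def add_word(word_index, word, template_id):
    """Append template_id to the word's id string, creating the entry if needed."""
    ids = word_index.get(word, "")
    existing = ids.split("|") if ids else []
    if str(template_id) not in existing:  # avoid storing the same id twice
        existing.append(str(template_id))
    word_index[word] = "|".join(existing)


def index_template(word_index, words, template_id):
    """Register every word of one template in the word index."""
    for word in words:
        add_word(word_index, word, template_id)


index = {}
index_template(index, "hello my friend".split(), 2)
index_template(index, "hello world".split(), 5)
print(index["hello"])  # 2|5
```

With a real table the same steps would become "fetch the row for the word, update its id field, save", or "insert a new row" when the word is missing.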

The ORM used here is peewee.

That completes adding a template. Next comes searching the template database for a string. There is no code for this yet, but the algorithm is not complicated:

  1. I take the input string and split it into words, getting an array of strings
  2. I look up every word in the word table; from each record found I take the id_temp field, split it as well, and write the ids into a dictionary: {'id': count}
  3. Each time an id is added to the dictionary, I check whether it is already there: if it is, its count is incremented; otherwise an entry is added with count=1
  4. After all the words have been looked up, I take the dictionary entry with the largest count; its id is the template most likely to fit the input string.
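The four steps above can be sketched as follows. This is an illustration under assumptions rather than a finished implementation: the word table is again modelled as a dict from word to an `'1|5|16'` style string, the function name `best_template` is hypothetical, and ties between templates are resolved by whichever id was counted first.

```python
import re
from collections import Counter


def best_template(word_index, text):
    """Return the id of the template sharing the most words with text, or False."""
    counts = Counter()
    # Step 1: split the input string into lowercase words
    for word in re.findall(r"\w+", text.lower()):
        ids = word_index.get(word)
        if ids:
            # Steps 2-3: split the id field and increment the count of every id
            counts.update(ids.split("|"))
    if not counts:
        return False
    # Step 4: the id with the largest count is the most likely template
    template_id, _ = counts.most_common(1)[0]
    return int(template_id)


word_index = {"hello": "1|5", "how": "1", "are": "1|7", "you": "1|7", "bye": "7"}
print(best_template(word_index, "Hello, how are you?"))  # 1
```

The result is only the most likely candidate; the input would presumably still have to be checked against that template's full pattern before returning true.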

The question is: is there an established approach to this problem, or how can I optimize the homegrown algorithm (a reinvented wheel) that I have put together here? For example, if one word occurs in 2000 templates, its record in the database becomes unwieldy: {id: 35, word: 'hello', 'id_temp': '1|2|3|523|2323|32|123|124|345|234|133|234|123|234|123|42|545......'} - how can this be optimized?
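One common way to avoid the ever-growing id string is to store one row per (word, template) pair and let the database do the counting. The sketch below uses the standard `sqlite3` module purely for illustration; the table name `template_words`, its columns, and the helper functions are all assumptions, not part of the existing schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (word, template) pair instead of a '1|2|3|...' string per word
conn.execute(
    "CREATE TABLE template_words ("
    " word TEXT NOT NULL,"
    " template_id INTEGER NOT NULL,"
    " PRIMARY KEY (word, template_id))"
)


def index_template(template_id, words):
    """Store every word of a template; duplicate pairs are silently skipped."""
    conn.executemany(
        "INSERT OR IGNORE INTO template_words (word, template_id) VALUES (?, ?)",
        [(word, template_id) for word in words],
    )


def best_template(words):
    """Return the id of the template matching the most input words, or None."""
    words = list(set(words))
    if not words:
        return None
    placeholders = ",".join("?" * len(words))
    row = conn.execute(
        "SELECT template_id, COUNT(*) AS hits FROM template_words"
        f" WHERE word IN ({placeholders})"
        " GROUP BY template_id ORDER BY hits DESC, template_id LIMIT 1",
        words,
    ).fetchone()
    return row[0] if row else None


index_template(1, ["hello", "how", "are", "you"])
index_template(2, ["hello", "bye"])
print(best_template(["hello", "how", "you"]))  # 1
```

Here the primary key doubles as an index on `word`, so a word that occurs in 2000 templates simply has 2000 short rows, and adding a template never requires rewriting an existing record.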

  • What is the problem? Show the code. - Dmitry Erohin
  • @DmitryErohin The problem is the approach and finding the most optimal way to implement it. I have expanded the question. - Huffy Grams
