Here is a somewhat non-standard question. I need a library, or at least an approach, for building a chatbot: there is an input string and a database of ready-made templates with placeholders. The input text is compared against the template database, and if there is a match, some result is returned.

For example, the input string is "Hello, how are you?"

The database contains templates like these:

[ {id: 0, text: "Hi, how are you?"}, {id: 1, text: "Hello, how are you?"}, {id: 2, text: "Bro, how's it going?"}, {id: 3, text: "Good afternoon, how are you?"}, {id: 4, text: "Hey, how's life?"} ... ] 

After comparing the input string, we should get {id: 1}. That much is simple to implement. Notice that the five records in the database are nearly identical: they could be merged, and instead of plain text we could store a regex, so that we take the input text and check it against the database records until some regular expression matches. But what if you need 10,000 templates? Writing regular expressions by hand at that scale is not practical.
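To illustrate, here is a minimal sketch of that regex approach in Python; the merged pattern and the template id are hypothetical, loosely matching the example records above:

    import re

    # Several of the near-identical templates above merged into one regex.
    # The pattern and the id are illustrative only.
    TEMPLATES = [
        (re.compile(r"^(hi|hello|good afternoon),?\s+how\s+(are you|is life)\??$",
                    re.IGNORECASE), 1),
    ]

    def match_template(text):
        """Return {id: ...} for the first template whose regex matches, else None."""
        for pattern, template_id in TEMPLATES:
            if pattern.match(text.strip()):
                return {"id": template_id}
        return None

    print(match_template("Hello, how are you?"))  # {'id': 1}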

So I need something like a simplified kind of regular expression based on similarity. For example, if anyone is familiar with the service iii.ru, it implements exactly this kind of feature, but they say everything there runs on AI, and I can't exactly build an AI myself. There is also the site Flow.ai, where all of this is done very simply; I don't know how they implemented it.

For example, I want something like this.

There is a pattern: {Hi, Good afternoon, Hello, Howdy}, how [{are, is}, [optional]] [{you, life, everything}, [optional]]?

And this template should match lines such as these (one possible way to compile such patterns is sketched after the list):

  • Hi, how are you?

  • Hello how are you?

  • Hello, how are you?

  • Good afternoon, how are you?

  • Hello, how is life?

    etc.
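Here is a sketch of one way to compile such simplified patterns into regexes, assuming a made-up syntax where {a, b, c} is a required alternation and [a, b, c] an optional one. This is not what iii.ru or Flow.ai actually do, just one possible reading:

    import re

    def compile_simple_pattern(pattern):
        """Translate a simplified template into a regex.

        Assumed (hypothetical) syntax:
          {a, b, c} -> required alternation
          [a, b, c] -> optional alternation
        Whitespace and commas between parts are treated as flexible,
        and trailing punctuation in the input is ignored.
        """
        def alternation(m):
            options = "|".join(re.escape(o.strip()) for o in m.group(2).split(","))
            group = f"(?:{options})"
            return group + ("?" if m.group(1) == "[" else "")

        # Replace every {...} or [...] with a regex group.
        body = re.sub(r"([{\[])([^}\]]*)[}\]]", alternation, pattern)
        body = body.replace(" ", r"[\s,]*")
        return re.compile(rf"^{body}[\s?!.]*$", re.IGNORECASE)

    greeting = compile_simple_pattern("{Hi, Hello, Good afternoon} how [are you, is life]")
    for line in ["Hi, how are you?", "Hello how are you?", "Good afternoon, how is life?"]:
        print(line, "->", bool(greeting.match(line)))  # all True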

I don't think I explained the essence of the question well enough, but I hope someone will understand and be able to help.

  • Well, human language is generally not so simple that every question and answer can be foreseen. What if someone writes an abbreviation or makes a typo? But in principle a map container may suit you: use the input phrases as keys and the answers as values. Lookup is fast thanks to the tree structure. - Andrej Levkovitch
  • Do you mean suggestions that appear while typing a message and automatically substitute continuation options? If so, then, as said above: build a factory of such phrases (a plain array will do), check whether the word list contains the input, e.g. list.contains(from_value), and quietly show it among the response options. - GenCloud
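For the exact-match lookup the commenters describe, a plain dictionary is enough. A minimal sketch with made-up phrases and answers (note that a Python dict is hash-based; the "tree structure" remark refers to ordered maps such as C++ std::map):

    # Phrases are keys, answers are values; lookup is a single operation.
    replies = {
        "hi, how are you?": "Fine, thanks!",
        "hello, how are you?": "Doing well!",
    }

    def answer(text):
        return replies.get(text.strip().lower(), "Sorry, I don't understand.")

    print(answer("Hello, how are you?"))  # Doing well!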

1 answer

For the fastest lookups, string comparison should be reduced to comparing shorter numbers: each word can be represented by a number computed from its characters. The bot also has to recognize the meaning of input phrases well and find similar ones in the database. For example, "Hello!", "Hey!", and "Yo!" all mean the same thing. All similar questions can be translated into a vector representation, where words/questions with the same meaning end up close to each other in a coordinate plane/space. For example:

 { {a:"Йо", id:(1 1)} {a:"Привет", id:(1 2)} {a:"Здарова", id: (2 1)} {a:"Пока", id:(340 400)} } 

Split the user's input into words, compute its coordinates (the vector of the question), find the nearest question with a similar meaning, and return its answer. That is, each string is represented by a vector in a space of meanings. This is called a "vector representation of text".
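A toy illustration of this nearest-vector lookup, using the 2D coordinates from the example above (real models use hundreds of dimensions):

    import math

    # Toy "space of meanings": each phrase is a point; similar phrases are close.
    phrases = {
        "Yo": (1, 1),
        "Hi": (1, 2),
        "Hey": (2, 1),
        "Bye": (340, 400),
    }

    def nearest(vector):
        """Return the phrase whose point is closest by Euclidean distance."""
        return min(phrases, key=lambda p: math.dist(phrases[p], vector))

    print(nearest((1.2, 1.9)))  # Hi -- a greeting, far away from "Bye"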

Representation methods vary. The most popular models are word2vec and GloVe. Using them, you can get a more or less decent chatbot, but not a perfect one. There are more advanced models, but then a number of problems arise:

Distributional methods of representing text as vectors (like word2vec) rely on co-occurrence statistics, following the hypothesis that words which occur in similar contexts have similar meanings. That is, a corpus of text is loaded and analyzed, probabilities are computed, and vectors are built from them. The more text and the more words, the better the analysis. However, in rare cases it is hard to compute the "meaning" of a text, and the model suddenly turns out to be undertrained.

That is, training is in fact required, and in the end the result is still imperfect.

And if you try to improve such a model further, it trivially degenerates into a neural network.
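For concreteness, here is a minimal word2vec sketch using the gensim library (4.x API). The toy corpus is hypothetical and far too small to learn real meanings, and averaging word vectors is just one simple way to get a phrase vector:

    import numpy as np
    from gensim.models import Word2Vec

    # Hypothetical toy corpus; a real model needs vastly more text.
    corpus = [
        ["hi", "how", "are", "you"],
        ["hello", "how", "is", "life"],
        ["good", "afternoon", "how", "are", "you"],
        ["bye", "see", "you", "later"],
    ]
    model = Word2Vec(corpus, vector_size=20, min_count=1, seed=1)

    def phrase_vector(words):
        """Average the vectors of the words the model knows."""
        known = [model.wv[w] for w in words if w in model.wv]
        return np.mean(known, axis=0)

    v1 = phrase_vector(["hi", "how", "are", "you"])
    v2 = phrase_vector(["hello", "how", "is", "life"])
    # Cosine similarity between the two phrase vectors.
    print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))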

I advise you to read the articles "Fuzzy search in text and dictionary" and "Correction of errors in text and search queries".

In your case, a fuzzy search over the text and a dictionary is all you need: strings are compared by computing the edit distance between them (counting insertions, deletions, substitutions, and transpositions), and the closest string is selected.

"Distance Damerau-Levenshteyn" - this is exactly what you need.

  • Very useful information, thank you, but this is not quite what I need; I have updated the question. - Huffy Grams
  • @HuffyGrams the answer remains the same; I will only tweak it slightly. - Askalite