
Clean and label: how we taught a chatbot to tell client questions apart


Anton Chaynikov, Data Science developer at Redmadrobot
Hi, Habr! Today I'll talk about the thorny path to a chatbot that makes life easier for the chat operators of an insurance company. More precisely: how we taught the bot to tell requests apart using machine learning, which models we experimented with and what results we got, and how we went through four approaches to cleaning and enriching data of decent quality and five attempts to clean data of "indecent" quality.


Task


Every day, 100500+ customer requests come into the insurance company's chat. Most of the questions are simple and repetitive, but that doesn't make the operators' job any easier, and customers still have to wait five to ten minutes for an answer. How do we improve the quality of service and optimize labor costs, so that operators have less routine work and users get the pleasant feeling of having their questions solved quickly?


So let's build a chatbot. It will read users' messages, give simple instructions for the simple cases, and for the complex ones ask the standard questions needed to collect the information the operator requires. A live operator has a script tree — a script (or flowchart) that describes which questions users may ask and how to respond to them. We would happily take this scheme and put it into the chatbot, but there's a catch: the chatbot doesn't understand human language and can't match a user's question to a branch of the script.


So we'll teach it, with the help of good old machine learning. But you can't just take a chunk of user-generated data and train a model of decent quality on it. For that you have to experiment with the model architecture, clean the data, and sometimes collect it all over again.


How we trained the bot:



The raw material


We had two clients — insurance companies with online chats — and two chatbot training projects (we won't name them, it doesn't matter) with dramatically different data quality. At best, half of the second project's problems could be solved with the manipulations from the first. Details below.


From a technical point of view, our task is text classification. It is done in two stages: first the texts are vectorized (using tf-idf, doc2vec, etc.), then a classification model — random forest, SVM, neural network and so on — is trained on the vectors (and class labels).
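A minimal sketch of that two-stage scheme in scikit-learn might look like this (the vectorizer settings and the choice of classifier are purely illustrative, not the configuration we shipped):

```python
# Sketch of the two-stage scheme: tf-idf vectorization + a classifier.
# Parameter values are illustrative defaults, not our production settings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),  # character n-grams are robust to typos
    ("clf", RandomForestClassifier(n_estimators=300, random_state=42)),
])
```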


Where the data came from:



There's no getting anywhere without validation, of course. All models were trained on 70% of the data and evaluated on the remaining 30%.
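With the hypothetical pipeline from the sketch above, that 70/30 evaluation takes a few more lines (`texts` and `labels` stand in for the real data):

```python
# Hypothetical evaluation on a 70/30 split; `texts` and `labels` are placeholders for the real data.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=42)

pipeline.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, pipeline.predict(X_test)))
```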


The quality metrics we used for the models:



Experiments with models


It's rare that you can tell right away which model will give the best results on a task. This case was no exception: there was no getting anywhere without experiments.


The vectorization options we tried:



Against this background, the classification options look rather modest: SVM, XGBoost, LSTM, random forest, naive Bayes, and a random forest on top of the SVM and XGBoost predictions.
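That last option is ordinary stacking; a sketch of how it can be wired up with scikit-learn and xgboost (the hyperparameters here are illustrative, not the ones we actually tuned):

```python
# Sketch of stacking: SVM and XGBoost as base models, a random forest on their predictions.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

stack = StackingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),   # probability=True so the meta-model sees class probabilities
        ("xgb", XGBClassifier()),
    ],
    final_estimator=RandomForestClassifier(n_estimators=200),
    stack_method="predict_proba",
)
# stack.fit(X_train_vectors, y_train); stack.predict(X_test_vectors)
```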


And although we checked the reproducibility of the results on three independently collected datasets and their fragments, we won't vouch for their wide applicability.


The results of the experiments:



Case 1. Cleaning the data, or what to do with the labels


Chat operators are only human. When assigning a category to a user's query, they often make mistakes and draw the boundaries between categories differently. So the source data has to be cleaned ruthlessly and intensively.


The data we had for training the model on the first project:



We formulated our ideas about what exactly was wrong with the data as hypotheses, then tested them and, where we could, fixed the problems. Here's what came of it:


First approach. Out of the whole huge list of classes, you can safely keep just 5-10.
We discard the small classes (<1% of the sample): little data + little impact. We merge tricky classes that the operators respond to identically anyway. For example:
'dms' + 'how to sign up for a doctor' + 'question about filling in the program'
'cancellation' + 'cancellation status' + 'cancellation of a paid policy'
'policy renewal question' + 'how to renew the policy?'


Then we throw out classes like “other”, “miscellaneous” and the like: they are useless for the chatbot (it redirects those to an operator anyway), and they also spoil the accuracy, because 20% (or 30, 50, 90) of the requests that don't fit anywhere else get dumped into them. At the same time we throw out the classes the chatbot can't work with (yet).
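In code, the merging and the frequency cut-off boil down to a small mapping and a filter; a sketch with hypothetical column names and the class names from the example above:

```python
# Merge near-duplicate classes and drop tiny and useless ones (sketch; column names are hypothetical).
import pandas as pd

merge_map = {
    "how to sign up for a doctor": "dms",
    "question about filling in the program": "dms",
    "cancellation status": "cancellation",
    "cancellation of a paid policy": "cancellation",
    "how to renew the policy?": "policy renewal question",
}
df["label"] = df["label"].replace(merge_map)

drop_classes = {"other", "miscellaneous"}            # useless for the bot
freq = df["label"].value_counts(normalize=True)
small = set(freq[freq < 0.01].index)                 # classes under 1% of the sample
df = df[~df["label"].isin(drop_classes | small)]
```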


Result: in one case, accuracy grows from 0.40 to 0.69; in the other, from 0.66 to 0.77.


Second approach. In the early days of the chat, the operators themselves had a poor idea of how to pick a class for a user's request, so there is a lot of “noise” and many errors in the data.


Experiment: we take only the last two (three, six, ...) months of dialogues and train the model on them.


Result: in one memorable case, accuracy increased from 0.40 to 0.60; in the other, from 0.69 to 0.78.


Third approach. Sometimes an accuracy of 0.70 means not “the model is wrong in 30% of cases” but “in 30% of cases the labels are lying, and the model corrects them quite sensibly”.


You can't test this hypothesis with metrics like accuracy or logloss. For the purposes of the experiment we limited ourselves to a data scientist's eyeballing, but ideally you would carefully re-label the datasets here, not forgetting about cross-training.


To work with samples like these, we came up with an “iterative enrichment” process (a sketch in code follows the list):


  1. Split the dataset into 3-4 fragments.
  2. Train the model on the first fragment.
  3. Use the trained model to predict classes for the second one.
  4. Look closely at the predicted classes and the model's confidence in them; choose a confidence threshold.
  5. Remove from the second fragment the texts (objects) predicted with confidence below the threshold, and train the model on what remains.
  6. Repeat until you get bored or the fragments run out.
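A sketch of that loop, where `make_model` is a placeholder factory (e.g. the tf-idf + classifier pipeline), the 0.9 threshold is arbitrary, and keeping the original labels of the confidently predicted objects is our reading of step 5:

```python
# Sketch of the "iterative enrichment" loop; `fragments` is a list of (texts, labels) pairs.
import numpy as np

def iterative_enrichment(fragments, make_model, threshold=0.9):
    models = []
    train_texts, train_labels = fragments[0]
    for frag_texts, frag_labels in fragments[1:]:
        model = make_model().fit(train_texts, train_labels)
        models.append(model)
        confidence = model.predict_proba(frag_texts).max(axis=1)
        keep = confidence >= threshold                 # drop objects the model is unsure about
        train_texts = np.asarray(frag_texts)[keep]
        train_labels = np.asarray(frag_labels)[keep]
    models.append(make_model().fit(train_texts, train_labels))
    return models
```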

On the one hand, the results are excellent: the first-iteration model has an accuracy of 70%, the second 95%, the third 99+%. A close look at the predictions largely confirms these numbers.


On the other hand, how do you systematically make sure that, in this process, later models aren't just learning the mistakes of the earlier ones? There's an idea to test the process on artificially “noised” datasets with high-quality source labels, such as MNIST. But, alas, there wasn't enough time for that. And without such verification we didn't dare put iterative enrichment and the resulting models into production.


Fourth approach. The dataset can be expanded — increasing accuracy and reducing overfitting — by adding copies of the existing texts with lots of typos.
Types of typos: doubling a letter, dropping a letter, swapping two adjacent letters, replacing a letter with one next to it on the keyboard.


Experiment: the proportion p of letters that get a typo: 2%, 4%, 6%, 8%, 10%, 12%. Dataset growth: usually up to 60,000 messages. Depending on the initial size (after the filters), this meant a 3-30x increase.
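A typo generator of this kind fits in a dozen lines; the sketch below is ours, and the (truncated) keyboard-neighbour map and the equal mix of typo types are assumptions:

```python
# Sketch of the typo generator: p is the probability that any given letter gets a typo.
# The QWERTY neighbour map is truncated here for brevity.
import random

NEIGHBOURS = {"a": "qwsz", "s": "awedxz", "d": "serfcx", "o": "iklp"}  # ...and so on for the full layout

def add_typos(text, p=0.04):
    out = []
    for ch in text:
        if random.random() >= p or not ch.isalpha():
            out.append(ch)
            continue
        kind = random.choice(["double", "skip", "swap", "neighbour"])
        if kind == "double":
            out.append(ch * 2)
        elif kind == "skip":
            pass                                       # drop the letter
        elif kind == "swap" and out:
            out[-1], ch = ch, out[-1]                  # swap with the previous letter
            out.append(ch)
        else:
            out.append(random.choice(NEIGHBOURS.get(ch.lower(), ch)))
    return "".join(out)

# augmented = [add_typos(t, p=0.04) for t in texts for _ in range(k)]  # grow the dataset k times
```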


Result: it depends on the dataset. On a small dataset (~300 messages), 4-6% typos give a stable and noticeable gain in accuracy (0.40 → 0.60). On larger ones it's worse: with a typo share of 8% or more, the texts turn into gibberish and accuracy drops; with a typo rate of 2-8%, accuracy fluctuates within a few percent, rarely exceeds the typo-free accuracy and, by feel, isn't worth the severalfold increase in training time.


As a result, we get a model that distinguishes 5 classes of requests with an accuracy of 0.86. Together with the client we agree on the texts of questions and answers for each of the five branches, hook the texts up to the chatbot, and send it off to QA.


Case 2. Knee-deep in data, or what to do without labels


Having gotten good results on the first project, we approached the second full of confidence. But, fortunately, we hadn't forgotten how to be surprised.


What we encountered:



First of all, we look at the classes: in the script tree, in the training sample of the SVM model, in the main sample. And here's what we see:



What do you do in a case like this? We rolled up our sleeves and set about extracting classes and labels from the data ourselves.


First attempt. Let's try clustering user questions, i.e. the first messages in each dialogue, excluding greetings.


Let's check. We vectorize the messages by counting character 3-grams. We reduce the dimensionality to the first ten components of a truncated SVD. We cluster with agglomerative clustering, using Euclidean distance and Ward's linkage criterion. We reduce the dimensionality once more with t-SNE (down to two dimensions, so the results can be inspected by eye). We plot the message points on the plane, colored by cluster.
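Roughly, that pipeline looks like this (the component count, cluster count and perplexity below are illustrative, not tuned values):

```python
# Sketch of the clustering pipeline: char 3-grams -> truncated SVD -> Ward clustering -> t-SNE plot.
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import TSNE

X = CountVectorizer(analyzer="char", ngram_range=(3, 3)).fit_transform(messages)
X10 = TruncatedSVD(n_components=10).fit_transform(X)

# Ward linkage uses Euclidean distance by definition
labels = AgglomerativeClustering(n_clusters=30, linkage="ward").fit_predict(X10)

X2 = TSNE(n_components=2, perplexity=30).fit_transform(X10)
plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=5, cmap="tab20")
plt.show()
```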


Result: fear and horror. It's safe to say there are no sane clusters:



Well, almost none — there is one, the orange one on the left, and only because every message in it contains the 3-gram " @ ". That 3-gram is a preprocessing artifact: somewhere in the punctuation-filtering step, "@" not only wasn't filtered out but also got padded with spaces. Still, the artifact is useful: this cluster holds the users whose first message is their email address. Unfortunately, the mere presence of an email says nothing about what the user actually wants. Moving on.


Second attempt. What if operators often respond with more or less standard links?
Let's check. We pull link-like substrings out of operator messages, lightly normalize links that differ in spelling but are identical in meaning (http/https, /search?city=%city%), and count the link frequencies.
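Something along these lines (the regex, the normalization rules and `operator_messages` are simplified placeholders):

```python
# Sketch: pull link-like substrings out of operator messages, normalize them, count frequencies.
import re
from collections import Counter

LINK_RE = re.compile(r"https?://\S+")

def normalize(url):
    url = url.replace("https://", "http://")
    url = re.sub(r"/search\?city=[^&\s]+", "/search?city=%city%", url)
    return url.rstrip(".,)")

link_counts = Counter(
    normalize(link)
    for message in operator_messages
    for link in LINK_RE.findall(message)
)
print(link_counts.most_common(30))
```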


Result: unpromising. First, operators respond with links to only a small share of requests (<10%). Second, even after manual cleaning and filtering out links seen only once, more than thirty remain. Third, there's no particular similarity in the behavior of users whose dialogues end with the same link.


Third attempt. Let's look for standard operator answers — what if they point to some natural classification of the messages?


Let's check. In each dialogue we take the operator's last message (not counting farewells like “Is there anything else I can help with?”) and count the frequencies of the unique messages.
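In code that's a single frequency count (`dialogues` and the farewell filter below are placeholders):

```python
# Sketch: frequency of operators' final (non-farewell) messages per dialogue.
# `dialogues` is assumed to be a list of operator-message lists.
from collections import Counter

FAREWELLS = ("is there anything else i can help with", "have a nice day")

def last_meaningful_reply(operator_messages):
    for msg in reversed(operator_messages):
        if not any(f in msg.lower() for f in FAREWELLS):
            return msg.strip()
    return None

reply_counts = Counter(
    reply for d in dialogues if (reply := last_meaningful_reply(d)) is not None
)
print(reply_counts.most_common(20))
```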


Result: promising, but inconvenient. 50% of operator responses are unique, another 10-20% occur twice, and the remaining 30-40% are covered by a relatively small number of popular templates. Relatively small means about three hundred. A close look at these templates shows that many of them are variants of one and the same answer — differing by a letter here, a word there, a paragraph elsewhere. We'd like to group such similar answers together.


Fourth attempt. We cluster the operators' last messages. These clusters turn out much better:



Now this is something we can work with.


We cluster the messages and plot them on the plane, as in the first attempt, manually pick out the most clearly separated clusters, remove them from the dataset and cluster again. Once roughly half of the dataset has been separated out, the clear clusters run out, and we start thinking about which classes to assign to them. We distribute the clusters over the original five classes — the sample comes out skewed, and three of the five original classes don't get a single cluster. Bad. We distribute the clusters over five classes we invent on the spot: “call us”, “come in person”, “wait a day for an answer”, “problems with the captcha”, “other”. The skew is smaller, but the accuracy is only 0.4-0.5. Bad again. We assign each of the 30+ clusters its own class. The sample is skewed once more, and the accuracy is again 0.5, although about five of the chosen classes have decent precision and recall (0.8 and above). But the result is still unimpressive.


Fifth attempt. We need the full picture of the clustering. We extract the entire clustering dendrogram instead of just the top thirty clusters, save it in a format the client's analysts can work with, and help them do the labeling by sketching out a list of classes.


For each message we compute the chain of clusters that contain it, starting from the root. We build a table with the columns: text, id of the first cluster in the chain, id of the second cluster in the chain, ..., id of the cluster corresponding to the text itself. We save the table as csv/xls. After that it can be handled with ordinary office tools.
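A sketch of how such a table can be assembled from a SciPy dendrogram (the column names and the depth cut-off are our own choices for illustration):

```python
# Sketch: export the dendrogram as "cluster chains" so analysts can label texts in a spreadsheet.
import pandas as pd
from scipy.cluster.hierarchy import linkage, to_tree

def cluster_chain_table(texts, vectors, depth=6):
    Z = linkage(vectors, method="ward")
    root = to_tree(Z)
    chains = {}
    stack = [(root, [])]                  # iterative walk, so deep trees don't blow the stack
    while stack:
        node, path = stack.pop()
        path = path + [node.get_id()]
        if node.is_leaf():
            chain = path[:depth]
            chains[node.get_id()] = chain + [None] * (depth - len(chain))
        else:
            stack.append((node.get_left(), path))
            stack.append((node.get_right(), path))
    rows = [[texts[i]] + chains[i] for i in range(len(texts))]
    cols = ["text"] + [f"cluster_level_{k}" for k in range(depth)]
    return pd.DataFrame(rows, columns=cols)

# cluster_chain_table(first_messages, reduced_vectors).to_csv("clusters_for_labeling.csv", index=False)
```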


We hand the data and the sketched list of classes over to the client for labeling. The client's analysts labeled ~10,000 first user messages. Having learned from experience, we asked for each message to be labeled at least twice. And for good reason: 4,000 of those 10,000 had to be thrown out because the two analysts labeled them differently. On the remaining 6,000 we fairly quickly repeated the successes of the first project:
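The agreement filter itself is a one-liner (the DataFrame and column names are hypothetical):

```python
# Sketch: keep only the messages where both analysts agree; `labeled` is the assumed labeled DataFrame.
agreed = labeled[labeled["label_analyst_1"] == labeled["label_analyst_2"]]
training_set = agreed.rename(columns={"label_analyst_1": "label"})[["text", "label"]]
```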



The model is ready; now we need to draw up the script tree. For reasons we won't go into, we didn't have access to the operators' response scripts. Undeterred, we posed as users and over a couple of hours of field work collected the operators' response templates and clarifying questions for every occasion. We arranged them into a tree, packed them into the bot and went off to test. The client approved.


Conclusions, or what experience has shown:



To be continued.



Source: https://habr.com/ru/post/436072/