
NLP. The basics. Techniques. Self-development. Part 1

Hello! My name is Ivan Smurov, and I head the NLP research team at ABBYY; you can read about what our group does here. I recently gave a lecture on Natural Language Processing (NLP) at the Deep Learning School, a club at the MIPT Phystech School of Applied Mathematics and Computer Science for high school students interested in programming and mathematics. Perhaps the notes from my lecture will be useful to someone, so I am sharing them on Habr.

Since we cannot cover everything at once, the article is divided into two parts. Today I will talk about how neural networks (or deep learning) are used in NLP. In the second part we will focus on one of the most common NLP tasks, named entity recognition (NER), and analyze in detail the architectures used to solve it.



What is NLP?


NLP is a wide range of tasks that deal with natural language, i.e. the language that people speak and write. There is a set of classic NLP problems whose solutions are of practical use.


This is certainly not the entire list of NLP tasks; there are dozens of them. By and large, anything that can be done with text in natural language can be considered an NLP task; the topics listed above are simply the best known and have the most obvious practical applications.

Why is it difficult to solve NLP problems?


The problem statements are not very complex, but the tasks themselves are far from simple, because we work with natural language. Polysemy (a word having several related senses that share a common original meaning) and homonymy (words that differ in meaning but are pronounced and written identically) are characteristic of any natural language. A native Russian speaker knows perfectly well that a warm reception has little in common with a combat move on the one hand, or with warm beer on the other, but an automatic system takes a long time to learn this. That is also why "Press space bar to continue" is better translated by the boring "To continue, press the space bar" than by "The space press bar will continue to work."


By the way, this year the "Dialogue" conference will host shared tasks on both anaphora and gapping (a kind of ellipsis) for Russian. For both tasks, corpora were assembled that are several times larger than the currently existing ones (and for gapping the corpus is an order of magnitude larger than existing corpora not only for Russian, but for all languages in general). If you want to take part in competitions on these corpora, click here (registration required, no SMS).

How NLP problems are solved


Unlike image processing, in NLP you can still find papers that describe solutions based not on neural networks but on classical algorithms such as SVM or XGBoost, and that show results not far behind state-of-the-art solutions.
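As an illustration of my own (not from the original lecture), here is a minimal sketch of such a classical baseline: a TF-IDF bag-of-words representation fed into a linear SVM with scikit-learn. The texts and labels are made up for the example.

```python
# A hypothetical classical baseline: TF-IDF features + linear SVM (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data just for illustration (1 = positive review, 0 = negative review).
texts = ["great movie, loved it", "terrible plot and acting", "wonderful film", "boring and too long"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["what a wonderful, great film"]))  # expected: [1]
```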

However, a few years ago neural networks began to beat the classical models. It is important to note that for the majority of problems, the solutions based on classical methods were unique: as a rule, they did not resemble solutions to other problems, either in architecture or in the way features were collected and processed.

Neural network architectures, by contrast, are much more general. The architecture of the network itself most likely still differs from task to task, but much less so; there is a trend towards full universalization. Moreover, which features we work with and how exactly we work with them is almost the same for most NLP tasks; only the last layers of the neural networks differ. Thus, we can say that a single NLP pipeline has emerged. We will now describe how it works in more detail.

The NLP pipeline


This is a way of working with features that is more or less the same for all tasks.

When it comes to language, the basic unit we work with is the word, or, more formally, the "token". We use this term because it is not obvious whether, say, 2128506 is a word or not. A token is usually separated from other tokens by spaces or punctuation marks. And as the difficulties described above suggest, the context of each token is very important. There are different approaches, but in 95% of cases the context taken into account while the model is working is the sentence that contains the token in question.

Many problems are solved at the sentence level altogether. For example, in machine translation we most often simply translate one sentence and do not use any broader context. There are tasks where this is not the case, for example dialogue systems, where it is important to remember what the system was asked earlier so that it can answer follow-up questions. Nevertheless, the sentence is still the basic unit we work with.

Therefore, the first two steps of the pipeline, performed for virtually any task, are segmentation (splitting the text into sentences) and tokenization (splitting sentences into tokens, i.e. individual words). This is done with simple algorithms.
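As a minimal sketch (my own illustration, not from the lecture), here is how these two steps can look with NLTK, assuming its pretrained "punkt" tokenizer models; the example text is made up.

```python
# Sentence segmentation and tokenization with NLTK (one possible tool; any tokenizer would do).
import nltk

nltk.download("punkt", quiet=True)  # pretrained sentence/word tokenizer models

text = "Press space bar to continue. The year 2128506 is not a word, but it is a token."
sentences = nltk.sent_tokenize(text)                  # segmentation: text -> sentences
tokens = [nltk.word_tokenize(s) for s in sentences]   # tokenization: sentence -> tokens

print(sentences)
print(tokens)
```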

Next we need to compute features for each token. As a rule, this happens in two stages. The first is to compute context-independent token features, i.e. a set of features that do not depend on the other tokens surrounding ours. Common context-independent features are:

- word (word-form) embeddings;
- character-level (symbolic) features;
- additional features (part-of-speech tags, morphology, position in the text, and so on).


We will talk about embeddings and character-level features in detail later (character-level features not today, but in the second part of the article); for now, let us give some possible examples of additional features.

One of the most frequently used features is the part of speech, or POS tag. Such features can be important for many problems, for example syntactic parsing. For languages with rich morphology, such as Russian, morphological features also matter: for example, which case a noun is in, or what gender an adjective has. From this we can draw various conclusions about the structure of the sentence. Morphology is also needed for lemmatization (reducing words to their initial forms), which lets us reduce the dimensionality of the feature space; this is why morphological analysis is actively used in most NLP tasks.
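As an illustration of my own (the lecture does not prescribe a specific tool), here is a sketch of extracting POS tags, morphological features and lemmas with spaCy; the small English model and the example sentence are assumptions.

```python
# POS tags, morphological features and lemmas with spaCy (one possible library choice).
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("The cats were sleeping on the warm windowsill.")

for token in doc:
    # token.pos_   - coarse part-of-speech tag
    # token.morph  - morphological features (number, tense, ...)
    # token.lemma_ - the initial (dictionary) form of the word
    print(f"{token.text:12} {token.pos_:6} {str(token.morph):30} {token.lemma_}")
```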

When we solve a problem where the interaction between different objects matters (for example, relation extraction or building a question-answering system), we need to know a lot about the structure of the sentence. This requires syntactic parsing. At school everyone parsed sentences into subject, predicate, object and so on; syntactic parsing is something similar, only more complicated.
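Continuing the same hypothetical spaCy example, a dependency parse is one way to expose this sentence structure; the model name and sentence are, again, assumptions.

```python
# Dependency parsing with spaCy: for each token, its syntactic role and its head word.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cats were sleeping on the warm windowsill.")

for token in doc:
    # token.dep_ is the dependency label, token.head is the governing word.
    print(f"{token.text:12} --{token.dep_:8}--> {token.head.text}")
# e.g. "cats" should come out as the nominal subject (nsubj) of "sleeping".
```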

Another example of an additional feature is the position of the token in the text: we may know a priori that some entity occurs more often at the beginning of the text or, conversely, at the end.

Taken together, the embeddings, character-level features and additional features form the context-independent feature vector of a token.

Context-sensitive features


Context-dependent token features are a feature set that contains information not only about the token itself, but also about its neighbors. There are different ways to compute such features. In classical algorithms people often simply slid a window: they took several (for example, three) tokens before the token in question and several after, and computed all the features within that window. This approach is unreliable, since information important for the analysis may lie farther away than the window reaches, so we may miss something.
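A toy sketch of the window approach (my own illustration; the helper name and padding token are made up): collect the tokens within a fixed radius of position i and compute features only from them.

```python
# Toy illustration of window-based context features (the classical approach).
def window_context(tokens, i, radius=3, pad="<PAD>"):
    """Return the tokens within `radius` positions of token i, padding at the edges."""
    left = [pad] * max(0, radius - i) + tokens[max(0, i - radius):i]
    right = tokens[i + 1:i + 1 + radius]
    right += [pad] * (radius - len(right))
    return left + right

tokens = "the quick brown fox jumps over the lazy dog".split()
print(window_context(tokens, 0))  # ['<PAD>', '<PAD>', '<PAD>', 'quick', 'brown', 'fox']
print(window_context(tokens, 4))  # ['quick', 'brown', 'fox', 'over', 'the', 'lazy']
```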

Therefore, nowadays all context-dependent features are computed at the sentence level in a standard way: with bidirectional recurrent neural networks (LSTM or GRU). To obtain context-dependent token features from context-independent ones, the context-independent features of all tokens in the sentence are fed into a bidirectional RNN (one or several layers). The output of the bidirectional RNN at time step i is the context-dependent feature of the i-th token; it contains information both about the preceding tokens (this information is in the i-th value of the forward RNN) and about the following ones (this information is in the corresponding value of the backward RNN).
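A minimal sketch of this step in PyTorch (my own illustration; the vocabulary size, embedding size and hidden size are arbitrary assumptions):

```python
# Context-independent embeddings -> bidirectional LSTM -> context-dependent token features.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10_000, 100, 128

embedding = nn.Embedding(vocab_size, emb_dim)              # context-independent features
bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (1, 7))           # one sentence of 7 tokens
context_independent = embedding(token_ids)                 # shape: (1, 7, 100)
context_dependent, _ = bilstm(context_independent)         # shape: (1, 7, 256)

# Each of the 7 output vectors concatenates the forward and backward LSTM states,
# so it carries information about tokens both to the left and to the right.
print(context_dependent.shape)  # torch.Size([1, 7, 256])
```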

Then, for each individual task, we do something different, but the first few layers, up to and including the bidirectional RNN, can be reused for almost any task.

This method of obtaining features is called an NLP pipeline.



It is worth noting that over the last two years researchers have been actively trying to improve the NLP pipeline, both in terms of speed (for example, the Transformer, an architecture based on self-attention, contains no RNN and can therefore be trained and applied faster) and in terms of the features used (features based on pre-trained language models, such as ELMo, are now widely used, or the first layers of a pre-trained language model are taken and fine-tuned on the corpus available for the task, as in ULMFit and BERT).
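As a rough sketch of how such pre-trained contextual features can be obtained (my own example using the Hugging Face transformers library; the model name is just one common choice, not prescribed by the lecture):

```python
# Contextual token features from a pre-trained BERT model (Hugging Face transformers).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Press space bar to continue.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One context-dependent vector per (sub)token, which can replace or augment the BiRNN features.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 8, 768])
```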

Word form embeddings


Let's take a closer look at what an embedding is. Roughly speaking, an embedding is a compact representation of a word's context. Why is it important to know a word's context? Because we believe in the distributional hypothesis: words with similar meanings are used in similar contexts.

Let's now try to give a strict definition of embedding. Embedding is a mapping from a discrete vector of categorical features into a continuous vector with a predetermined dimension.

A canonical example of embedding is the embedding of a word (word embedding).

What usually acts as a discrete feature vector? A Boolean vector corresponding to all possible values of a category (for example, all possible parts of speech or all possible words from some limited vocabulary).

For word-form embeddings this category is usually the word's index in the vocabulary. Suppose we have a vocabulary of 100 thousand words. Then each word has a discrete feature vector: a Boolean vector of dimension 100 thousand, with a one in a single position (the word's index in our vocabulary) and zeros everywhere else.

Why do we want to map our discrete feature vectors into continuous vectors of a given dimension? Because a vector of dimension 100 thousand is not very convenient to compute with, whereas a vector of real numbers of dimension 100, 200 or, say, 300 is much more convenient.

In principle, we do not have to impose any additional constraints on such a mapping. But since we are building it anyway, let's try to ensure that the vectors of words with similar meanings are also close in some sense. This is done with a simple feed-forward network.

Embedding Training


How are embeddings trained? We try to solve the problem of recovering a word from its context (or, vice versa, the context from the word). In the simplest case, the input is the vocabulary index of the previous word (a Boolean vector of the vocabulary dimension), and we try to predict the vocabulary index of our word. This is done with a network of extremely simple architecture: two fully connected layers. First comes a fully connected layer from the Boolean vector of the vocabulary dimension to a hidden layer of the embedding dimension (i.e. simply multiplying the Boolean vector by a matrix of the required shape). Then, the other way round, a fully connected layer with softmax from the hidden layer of the embedding dimension back to a vector of the vocabulary dimension. Thanks to the softmax activation function, we obtain a probability distribution over words and can pick the most likely one.
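A minimal sketch of this two-layer architecture in PyTorch (my own illustration with toy sizes; the text's example uses a vocabulary of 100 thousand and embeddings of dimension 100 to 300, and here the cross-entropy loss applies the softmax implicitly):

```python
# Simplest embedding-training setup: previous word in -> probability of the current word out.
import torch
import torch.nn as nn

vocab_size, emb_dim = 10_000, 100   # toy sizes

model = nn.Sequential(
    nn.Linear(vocab_size, emb_dim, bias=False),  # Boolean vector -> hidden layer; its weight is the matrix W (transposed in PyTorch's convention)
    nn.Linear(emb_dim, vocab_size),              # hidden layer -> scores over the vocabulary
)
loss_fn = nn.CrossEntropyLoss()                  # applies the softmax internally

prev_word = torch.zeros(1, vocab_size)
prev_word[0, 42] = 1.0                           # one-hot vector of the previous word
target = torch.tensor([17])                      # index of the word we want to predict

loss = loss_fn(model(prev_word), target)
loss.backward()                                  # an optimizer step over a large corpus would follow
```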


The embedding of the i-th word is simply the i-th row of the transition matrix W.
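In other words, multiplying a one-hot vector by W just selects one row of W, so the first layer is effectively a lookup table. A tiny numpy check of that identity (toy sizes, my own example):

```python
# Multiplying a one-hot vector by W selects the i-th row of W: that row is the word's embedding.
import numpy as np

vocab_size, emb_dim = 8, 4
W = np.random.randn(vocab_size, emb_dim)   # transition matrix of the first layer

i = 3
one_hot = np.zeros(vocab_size)
one_hot[i] = 1.0

assert np.allclose(one_hot @ W, W[i])      # the embedding of word i is the i-th row of W
```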

In the models used in practice the architecture is more complicated, but not by much. The main difference is that we use not one context vector to predict our word, but several (for example, everything within a window of size 3). A somewhat more popular option is the reverse situation, where we try to predict not the word from its context, but the context from the word. This approach is called Skip-gram.

Let us give an example of the task solved during embedding training (in the CBOW variant, predicting a word from its context). For example, suppose the token's context consists of the 2 previous words. If we trained on a corpus of texts about modern Russian literature and the context consists of the words "poet Marina", then the most likely next word is probably "Tsvetaeva".
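In practice one rarely implements this training loop by hand; a library such as gensim does it. A hedged sketch of my own (the two-sentence "corpus" is a made-up toy, and the parameter names follow gensim 4.x):

```python
# Training word2vec embeddings with gensim (CBOW by default; sg=1 switches to Skip-gram).
from gensim.models import Word2Vec

# Toy "corpus": a list of tokenized sentences (in reality, e.g. all of Wikipedia).
sentences = [
    ["поэт", "марина", "цветаева", "писала", "стихи"],
    ["поэт", "анна", "ахматова", "писала", "стихи"],
]

model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)  # sg=0 -> CBOW
print(model.wv["цветаева"][:5])               # first 5 components of the embedding
print(model.wv.most_similar("поэт", topn=3))  # nearest neighbors in the (toy) vector space
```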

Let us emphasize once more: embeddings are only trained on the task of predicting a word from its context (or vice versa), but they can then be used in any situation where we need token features.

Whichever option we choose, the embedding architecture is very simple, and its big advantage is that embeddings can be trained on unlabeled data (indeed, we only use information about a token's neighbors, and for that only the text itself is needed). The resulting embeddings are an averaged context over that particular corpus.

Word-form embeddings are usually trained on the largest corpus available for training. Usually this is all of Wikipedia in the given language, because all of it can be downloaded, plus any other corpora you can get.

Similar considerations underlie the pre-training of the modern architectures mentioned above: ELMo, ULMFit, BERT. They also use unlabeled data during training and therefore learn from the largest available corpus (although the architectures themselves are, of course, more complicated than those of classical embeddings).

Why do we need embeddings?


As already mentioned, there are two main reasons to use embeddings: they drastically reduce the dimensionality of the feature space, and they place words with similar meanings close to each other in the vector space (so close that some semantic relations even turn into vector arithmetic).



But one should not think that this kind of vector arithmetic works reliably. The paper that introduced embeddings gave examples such as Angela relating to Merkel in roughly the same way as Barack to Obama, Nicolas to Sarkozy and Putin to Medvedev. So you should not rely too heavily on this arithmetic, although it is still important: it is much easier for the computer when it has this information, even if it contains inaccuracies.
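A sketch of what such analogy queries look like in practice (my own example; it assumes internet access to fetch a pre-trained model via gensim's downloader, and the exact answers depend on the model):

```python
# Vector arithmetic over pre-trained embeddings: "woman + king - man" should land near "queen".
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # a small pre-trained model (one arbitrary choice)

print(vectors.most_similar(positive=["woman", "king"], negative=["man"], topn=3))
print(vectors.most_similar(positive=["merkel", "barack"], negative=["angela"], topn=3))
# The second query may or may not return "obama": the arithmetic is not fully reliable.
```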

In the next part of this article we will talk about the NER task: what it is, why it is needed and what pitfalls its solution can hide. We will describe in detail how this problem used to be solved with classical methods, how it is solved with neural networks, and cover the modern architectures created for it.

Source: https://habr.com/ru/post/437008/