
Neural network speech synthesis using the Tacotron 2 architecture, or "Get alignment or die tryin'"



Our team was given the task of reproducing the results of the Tacotron 2 speech synthesis neural network by DeepMind. This is the story of the thorny path we traveled during the project.

The problem of computer speech synthesis has long interested scientists and engineers. However, classical methods cannot synthesize speech that is indistinguishable from a human voice. And here, as in many other areas, deep learning has come to the rescue.

Let's look at the classic methods of synthesis.

Concatenative speech synthesis


This method is based on pre-recorded short audio fragments, which are then combined into coherent speech. The result is very clean and clear, but completely devoid of emotional and intonational components; that is, it sounds unnatural. The reason is that it is impossible to record every possible word pronounced with every possible combination of emotion and prosody. Concatenative systems require huge databases and hard-coded rules for combining units into words. Developing a reliable system takes a lot of time.

Parametric speech synthesis


The use of concatenative TTS is limited by its large data requirements and development time. Therefore, a statistical method was developed that models the very nature of the data. It generates speech by combining parameters such as frequency, the amplitude spectrum, and so on.

Parametric synthesis consists of several stages.

  1. First, linguistic features, such as phonemes, durations, etc., are extracted from the text.
  2. Then, features that represent the corresponding speech signal are extracted for the vocoder (the system that generates the waveform): the cepstrum, frequency, linear spectrogram, mel spectrogram.
  3. These manually engineered parameters, together with the linguistic features, are passed to the vocoder model, which performs many complex transformations to generate the sound wave. Along the way, the vocoder estimates speech parameters such as phase, prosody, and intonation.

If we can approximate the parameters that define speech at each of its units, then we can build a parametric model. Parametric synthesis requires significantly less data and manual work than concatenative systems.

In theory, everything is simple, but in practice there are many artifacts that lead to muffled speech with a "buzzing" tone, which does not sound natural at all.

The problem is that at each stage of the synthesis we hard-code certain features and hope to get realistic speech. But the chosen features are based on our understanding of speech, and human knowledge is not absolute, so the selected features will not necessarily be the best possible solution.

And here Deep Learning takes the stage in all its glory.

Deep neural networks are a powerful tool that can, in theory, approximate an arbitrarily complex function, that is, map some input space X into an output space Y. In the context of our task, these are the text and the audio with speech, respectively.

Data preprocessing


To begin with, let's define what we have as input and what we want to get at the output.

The input will be the text, and the output will be the mel spectrogram: a low-level representation obtained by applying a short-time Fourier transform to the discrete audio signal and mapping the resulting frequencies onto the mel scale. It should be noted right away that spectrograms obtained this way still need to be normalized by compressing the dynamic range. This reduces the natural ratio between the loudest and the quietest sound in the recording. In our experiments, spectrograms scaled to the [-4; 4] range worked best of all.


Figure 1: Mel spectrogram of a speech audio signal, scaled to the range [-4; 4].
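For illustration, here is a minimal sketch of this preprocessing step using librosa; the exact STFT parameters (n_fft, hop length, number of mel bands) and the dB reference values are our assumptions, not necessarily the ones used in the original pipeline.

```python
import numpy as np
import librosa

# Assumed parameters -- a sketch, not the exact values of the original pipeline.
SAMPLE_RATE = 22050
N_FFT = 1024
HOP_LENGTH = 256
N_MELS = 80

def wav_to_mel(path, min_db=-100.0, ref_db=20.0, max_abs=4.0):
    """Load a wav file and convert it to a mel spectrogram normalized to [-4, 4]."""
    wav, _ = librosa.load(path, sr=SAMPLE_RATE)
    # Short-time Fourier transform followed by a mel filterbank.
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
    )
    # Dynamic range compression: move to the log (dB) domain.
    mel_db = 20.0 * np.log10(np.maximum(mel, 1e-5))
    # Scale to the [-max_abs, max_abs] range.
    mel_norm = np.clip((mel_db - ref_db - min_db) / (-min_db), 0.0, 1.0) * 2.0 * max_abs - max_abs
    return mel_norm.astype(np.float32)  # shape: (n_mels, n_frames)
```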

We chose the LJSpeech dataset as the training data; it contains 13,100 audio clips of 2-10 seconds each and a file with the texts corresponding to the English speech recorded in the audio.

Using the transformations described above, the audio is encoded into a mel spectrogram. The text is tokenized and turned into a sequence of integers. Let us emphasize right away that the texts are normalized: all numbers are written out in words, and abbreviations are expanded, for example: "Mrs. Robinson" becomes "Missis Robinson".

Thus, after preprocessing, we obtain sets of numpy arrays of numeric sequences and mel spectrograms saved as npy files on disk.
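A toy sketch of the text side of this preprocessing is shown below; the symbol set, the abbreviation table, and the helper names are illustrative assumptions rather than our actual normalizer.

```python
# Illustrative symbol set: index 0 is reserved for padding (see below).
_SYMBOLS = ['_pad'] + list("abcdefghijklmnopqrstuvwxyz !',.?-")
_SYMBOL_TO_ID = {s: i for i, s in enumerate(_SYMBOLS)}

# A tiny example of abbreviation expansion; a real normalizer has many more rules.
_ABBREVIATIONS = {'mrs.': 'missis', 'mr.': 'mister', 'dr.': 'doctor'}

def normalize_text(text: str) -> str:
    text = text.lower()
    for abbr, full in _ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Numbers are assumed to be spelled out by a separate step (e.g. num2words).
    return text

def text_to_sequence(text: str) -> list:
    """Turn normalized text into a sequence of integer symbol ids."""
    return [_SYMBOL_TO_ID[c] for c in normalize_text(text) if c in _SYMBOL_TO_ID]

# Example: text_to_sequence("Mrs. Robinson") -> a list of integer symbol ids
```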

So that all the dimensions within a batch tensor match at training time, we pad the short sequences. For text sequences the padding is the reserved symbol 0, and for spectrograms it is frames whose values are slightly below the minimum spectrogram value we defined. This helps to separate the padding frames from noise and silence.
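A minimal sketch of such a padding (collate) function in PyTorch; the padding value of -4.5 for spectrograms is an assumption, any value slightly below the spectrogram minimum would do.

```python
import torch

def pad_batch(text_seqs, mel_specs, mel_pad_value=-4.5):
    """Pad integer text sequences with 0 and mel spectrograms with a value
    slightly below the minimum of the normalized spectrograms."""
    max_text = max(len(t) for t in text_seqs)
    max_frames = max(m.shape[1] for m in mel_specs)
    n_mels = mel_specs[0].shape[0]

    texts = torch.zeros(len(text_seqs), max_text, dtype=torch.long)  # 0 = padding id
    mels = torch.full((len(mel_specs), n_mels, max_frames), mel_pad_value)
    for i, (t, m) in enumerate(zip(text_seqs, mel_specs)):
        texts[i, :len(t)] = torch.tensor(t, dtype=torch.long)
        mels[i, :, :m.shape[1]] = torch.from_numpy(m)
    return texts, mels
```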

Now we have text and audio representations that are suitable for processing by an artificial neural network. Let's consider the architecture of the feature prediction net, which we will simply call Tacotron 2, since it is the central element of the entire synthesis system.

Architecture


Tacotron 2 is not one network, but two: the feature prediction net and the WaveNet neural vocoder. Both the original article and our own view of the work done suggest that the feature prediction net plays first violin, while the WaveNet vocoder plays the role of a peripheral system.

Tacotron 2 is a sequence-to-sequence architecture. It consists of an encoder, which creates an internal representation of the input signal (character tokens), and a decoder, which turns this representation into a mel spectrogram. Another extremely important element of the network is the so-called PostNet, designed to improve the spectrogram generated by the decoder.


Figure 2: Tacotron 2 network architecture.

Let us consider in more detail the network units and their modules.

The first layer of the encoder is an embedding layer. It creates multidimensional (512-dimensional) vectors from the sequence of natural numbers representing the characters.

Next, the embedding vectors are fed into a block of three one-dimensional convolutional layers. Each layer includes 512 filters with a kernel size of 5. This is a good filter size in this context because it captures a given character together with its two preceding and two following neighbors. Each convolutional layer is followed by mini-batch normalization and ReLU activation.

The tensors obtained from the convolutional block are fed into bidirectional LSTM layers with 256 neurons each; the forward and backward results are concatenated.
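Putting these pieces together, the encoder could be sketched in PyTorch roughly as follows; details such as the convolution padding and the dropout after each convolutional layer are our assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """A sketch of the encoder: embedding -> 3 conv layers -> bidirectional LSTM.
    Layer sizes follow the description above; the rest are assumptions."""
    def __init__(self, n_symbols, emb_dim=512, n_convs=3, kernel_size=5, lstm_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, emb_dim, padding_idx=0)
        convs = []
        for _ in range(n_convs):
            convs += [
                nn.Conv1d(emb_dim, emb_dim, kernel_size, padding=kernel_size // 2),
                nn.BatchNorm1d(emb_dim),
                nn.ReLU(),
                nn.Dropout(0.5),
            ]
        self.convs = nn.Sequential(*convs)
        # Bidirectional LSTM: 256 units per direction -> 512-dimensional output.
        self.lstm = nn.LSTM(emb_dim, lstm_dim, batch_first=True, bidirectional=True)

    def forward(self, text_ids):                      # (batch, n_chars)
        x = self.embedding(text_ids).transpose(1, 2)  # (batch, emb_dim, n_chars)
        x = self.convs(x).transpose(1, 2)             # (batch, n_chars, emb_dim)
        outputs, _ = self.lstm(x)                     # (batch, n_chars, 2 * lstm_dim)
        return outputs
```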

The decoder has a recurrent architecture, that is, the output of the previous step is used at each subsequent step; here that output is one frame of the spectrogram. Another important, if not the key, element of this system is the mechanism of soft (trainable) attention, a relatively new and increasingly popular technique. At each decoder step, the attention mechanism uses the previous decoder state, the encoder outputs, and the attention weights accumulated over the preceding steps to form the context vector and update the attention weights.


The idea of attention can be summed up as: "which part of the encoder output should be used at the current decoder step".


Figure 3: Diagram of the attention mechanism.

At each decoder step, the context vector $c_i$ is computed (in the figure above it is labeled as the weighted encoder outputs), which is the weighted sum of the encoder outputs $h_j$ with the attention weights $\alpha_{ij}$:

$$c_i = \sum_j \alpha_{ij} h_j$$

where $\alpha_{ij}$ are the attention weights, calculated by the formula:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$$

where $e_{ij}$ is the so-called "energy", whose calculation depends on the type of attention mechanism used (in our case it is a hybrid type, using both location-based and content-based attention). The energy is calculated by the formula:

$$e_{ij} = v_a^T \tanh(W s_{i-1} + V h_j + U f_{i,j} + b)$$

where $s_{i-1}$ is the decoder state at the previous step, $h_j$ is the $j$-th encoder output, $f_{i,j}$ are location features obtained by convolving the attention weights accumulated over the previous steps, and $W$, $V$, $U$, $v_a$ and $b$ are trainable parameters.
For a clear understanding of what is happening, we add that some of the modules described below use information from the previous decoder step. At the first step, this information is replaced by zero-valued tensors, which is a common practice when building recurrent structures.
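A possible PyTorch sketch of such a hybrid (location-sensitive) attention module is shown below; the attention dimension and the parameters of the convolution that produces the location features are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """A sketch of hybrid (content + location) attention implementing
    e_ij = v_a^T tanh(W s_{i-1} + V h_j + U f_{i,j} + b)."""
    def __init__(self, query_dim=1024, enc_dim=512, attn_dim=128,
                 n_filters=32, kernel_size=31):
        super().__init__()
        self.W = nn.Linear(query_dim, attn_dim, bias=False)   # decoder state s_{i-1}
        self.V = nn.Linear(enc_dim, attn_dim, bias=False)     # encoder outputs h_j
        self.U = nn.Linear(n_filters, attn_dim, bias=False)   # location features f_{i,j}
        self.location_conv = nn.Conv1d(1, n_filters, kernel_size,
                                       padding=kernel_size // 2, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=True)            # v_a and b

    def forward(self, query, memory, cum_attn_weights):
        # query: (batch, query_dim), memory: (batch, T_enc, enc_dim),
        # cum_attn_weights: (batch, T_enc), accumulated over previous decoder steps.
        f = self.location_conv(cum_attn_weights.unsqueeze(1)).transpose(1, 2)
        energies = self.v(torch.tanh(
            self.W(query).unsqueeze(1) + self.V(memory) + self.U(f)
        )).squeeze(-1)                                         # (batch, T_enc)
        attn_weights = F.softmax(energies, dim=1)
        context = torch.bmm(attn_weights.unsqueeze(1), memory).squeeze(1)
        return context, attn_weights
```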

Now let's look at how the decoder works.

First, the decoder output from the previous time step is fed into a small PreNet module, which is a stack of two fully connected layers of 256 neurons each, alternating with dropout layers with a rate of 0.5. A distinctive feature of this module is that dropout is applied in it not only during training, but also at inference.
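A minimal PreNet sketch; the trick is to call dropout with training=True so it stays active at inference as well.

```python
import torch.nn as nn
import torch.nn.functional as F

class PreNet(nn.Module):
    """Two 256-unit fully connected layers with dropout that stays on at inference."""
    def __init__(self, in_dim=80, hidden=256, p=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.p = p

    def forward(self, x):
        # training=True keeps dropout active even when the model is in eval() mode.
        x = F.dropout(F.relu(self.fc1(x)), p=self.p, training=True)
        x = F.dropout(F.relu(self.fc2(x)), p=self.p, training=True)
        return x
```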

The output of PreNet, concatenated with the context vector obtained from the attention mechanism, is fed into a unidirectional two-layer LSTM network with 1024 neurons in each layer.

Then the concatenation of the LSTM output with the same (or possibly an updated) context vector is fed into a fully connected layer with 80 neurons, which corresponds to the number of spectrogram channels. This final decoder layer forms the predicted spectrogram frame by frame, and its output is fed into PreNet as the input of the next decoder time step.

Why did we mention in the previous paragraph that the context vector might already be different? One possible approach is to recompute the context vector after the hidden state of the LSTM network has been obtained at the current step. However, in our experiments this approach did not pay off.

In addition to the projection onto the 80-neuron fully connected layer, the concatenation of the LSTM output with the context vector is fed into a fully connected layer with a single neuron followed by a sigmoid activation: this is the stop token prediction layer. It predicts the probability that the frame created at this decoder step is the final one. This layer makes it possible to generate a spectrogram of arbitrary rather than fixed length at inference time; that is, at inference it determines the number of decoder steps. It can be thought of as a binary classifier.
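To illustrate how the stop token controls generation, here is a sketch of an inference loop; decoder_step is a hypothetical function (not part of any library) that returns a mel frame, a stop-token logit and the updated decoder state, and the 0.5 threshold is an assumption.

```python
import torch

def infer_spectrogram(decoder_step, initial_state, max_steps=1000, stop_threshold=0.5):
    """Run the decoder frame by frame until the stop token fires."""
    frames, state = [], initial_state
    frame = torch.zeros(1, 80)               # all-zero "go" frame for the first step
    for _ in range(max_steps):
        frame, stop_logit, state = decoder_step(frame, state)
        frames.append(frame)
        if torch.sigmoid(stop_logit).item() > stop_threshold:
            break                            # the classifier says this frame is the last one
    return torch.cat(frames, dim=0)          # (n_frames, 80)
```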

The decoder output across all of its steps is the predicted spectrogram. However, that is not all. To improve the quality of the spectrogram, it is passed through the PostNet module, a stack of five one-dimensional convolutional layers with 512 filters each and a kernel size of 5. Each layer (except the last) is followed by batch normalization and tanh activation. To return to the spectrogram dimensionality, we pass the PostNet output through a fully connected layer with 80 neurons and add the result to the initial decoder output. We obtain the mel spectrogram generated from the text. Profit.
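A PostNet sketch along these lines; the exact placement of the final 80-neuron projection and the dropout after each layer are our reading of the description above.

```python
import torch.nn as nn

class PostNet(nn.Module):
    """Five 1-D conv layers (512 filters, kernel size 5); batch norm and tanh
    after every layer except the last, then a projection back to 80 mel channels."""
    def __init__(self, n_mels=80, channels=512, kernel_size=5, n_layers=5):
        super().__init__()
        layers = [nn.Conv1d(n_mels, channels, kernel_size, padding=kernel_size // 2),
                  nn.BatchNorm1d(channels), nn.Tanh(), nn.Dropout(0.5)]
        for _ in range(n_layers - 2):
            layers += [nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
                       nn.BatchNorm1d(channels), nn.Tanh(), nn.Dropout(0.5)]
        layers += [nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
                   nn.Dropout(0.5)]
        self.convs = nn.Sequential(*layers)
        self.proj = nn.Linear(channels, n_mels)

    def forward(self, mel):                      # (batch, n_mels, n_frames)
        x = self.convs(mel)                      # (batch, channels, n_frames)
        residual = self.proj(x.transpose(1, 2)).transpose(1, 2)
        return mel + residual                    # refined spectrogram
```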

All convolutional modules are regularized with dropout layers with a rate of 0.5, and the recurrent layers with the newer Zoneout method with a rate of 0.1. Zoneout is quite simple: instead of passing to the next LSTM time step the hidden state and cell state obtained at the current step, we replace part of their values with the values from the previous step. This is done both at training and at inference. At the same time, only the hidden state that is passed to the next LSTM step is affected by Zoneout, while the output of the LSTM cell at the current step remains unchanged.
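A minimal Zoneout sketch for a single state tensor:

```python
import torch

def zoneout(prev_state, new_state, p=0.1):
    """With probability p, each unit keeps its value from the previous time step
    instead of the freshly computed one. As described above, the caller applies
    this only to the state passed to the next LSTM step, leaving the current
    step's output unchanged."""
    mask = torch.bernoulli(torch.full_like(new_state, p))
    return mask * prev_state + (1.0 - mask) * new_state
```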

We chose PyTorch as our deep learning framework. Although it was still in pre-release when we implemented the network, it was already a very powerful tool for building and training artificial neural networks. We also use other frameworks in our work, such as TensorFlow and Keras. The latter was rejected because we needed to implement non-standard custom structures, and when comparing TensorFlow with PyTorch, the latter does not give the impression that the model is torn out of the Python language. However, we do not claim that one of them is better and the other worse; the choice of framework may depend on various factors.

The network is trained with error backpropagation. Adam is used as the optimizer, and the error function is the mean squared error computed before and after PostNet, plus the binary cross-entropy between the actual and predicted values of the stop token prediction layer. The resulting error is the simple sum of these three terms.
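The total loss can be sketched as follows; the commented-out optimizer line uses a placeholder model and an assumed learning rate.

```python
import torch.nn as nn

mse = nn.MSELoss()
bce = nn.BCEWithLogitsLoss()

def tacotron2_loss(mel_before, mel_after, mel_target, stop_logits, stop_target):
    """MSE on the decoder output (before PostNet), MSE on the refined output
    (after PostNet) and BCE on the stop-token predictions, summed without weighting."""
    return (mse(mel_before, mel_target)
            + mse(mel_after, mel_target)
            + bce(stop_logits, stop_target))

# Optimizer: Adam over all model parameters (learning rate is an assumption).
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```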

We trained the model on a single GeForce 1080 Ti GPU with 11 GB of memory.

Visualization


When working with such a large model, it is important to see how training is going, and TensorBoard proved to be a convenient tool here. We tracked the error value on both training and validation iterations. In addition, we displayed the target spectrograms, the spectrograms predicted during training, the spectrograms predicted during validation, as well as the alignment, which represents the attention weights accumulated over all decoder steps.
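For reference, a sketch of this kind of logging with the TensorBoard writer; the tags and the exact variables (step, losses, the alignment matrix) are illustrative.

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/tacotron2")

def log_step(step, train_loss, val_loss=None, alignment=None):
    writer.add_scalar("loss/train", train_loss, step)
    if val_loss is not None:
        writer.add_scalar("loss/val", val_loss, step)
    if alignment is not None:
        # alignment: (decoder_steps, encoder_steps) attention weights,
        # displayed as an image so the diagonal is easy to spot.
        writer.add_image("alignment", alignment.unsqueeze(0), step)
```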

It is possible that at first your alignment will not be too informative:


Figure 4: Example of poorly trained attention weights.

But once all your modules start working like a Swiss watch, you will finally get something like this:


Figure 5: Example of well-trained attention weights.

What does this chart mean? At each decoder step, we try to decode one frame of the spectrogram. However, it is not at all clear which encoder information should be used at each decoder step. One might assume that the correspondence is direct: for example, if the input text sequence is 200 characters long and the corresponding spectrogram has 800 frames, then each character gets 4 frames. However, you must agree that speech generated from such a spectrogram would be completely devoid of naturalness. We pronounce some words faster and some slower, we pause in some places and not in others, and it is impossible to account for every possible context. That is why attention is a key element of the entire system: it establishes the correspondence between a decoder step and the encoder information needed to generate a specific frame. And the larger the attention weight, the more "attention should be paid" to the corresponding part of the encoder output when generating a spectrogram frame.

At the training stage, it is also useful to generate audio rather than only visually assess the quality of the spectrograms and the attention. However, anyone who has worked with WaveNet will agree that using it as a vocoder during training would be an unaffordable luxury in terms of time. Therefore, it is recommended to use the Griffin-Lim algorithm, which allows the signal to be partially recovered after the Fourier transforms. Why partially? The point is that by converting the signal into spectrograms we lose the phase information. However, the quality of the audio obtained this way is quite enough to understand which direction you are moving in.
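A sketch of such a rough inversion with librosa's Griffin-Lim implementation; it mirrors the assumed parameters from the preprocessing sketch above and is meant for listening checks only.

```python
import numpy as np
import librosa

def mel_to_audio_griffin_lim(mel_norm, max_abs=4.0, ref_db=20.0, min_db=-100.0,
                             sr=22050, n_fft=1024, hop_length=256):
    """Undo the [-4, 4] normalization, map the mel spectrogram back to a linear
    spectrogram and run Griffin-Lim to estimate the lost phase."""
    mel_db = (mel_norm + max_abs) / (2.0 * max_abs) * (-min_db) + min_db + ref_db
    mel_power = np.power(10.0, mel_db / 20.0)
    linear = librosa.feature.inverse.mel_to_stft(mel_power, sr=sr, n_fft=n_fft)
    return librosa.griffinlim(linear, hop_length=hop_length)
```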

Lessons learned


Here we will share some thoughts on organizing the development process, presented in the format of tips. Some are fairly general, others are more specific.

About the organization of the workflow:


Building and training models:


And at the end of the article, we share examples of speech generated from texts that were not in the training set.

Source: https://habr.com/ru/post/436312/