
Our team was tasked with reproducing the results of Google's Tacotron 2 speech-synthesis neural network. This is the story of the thorny path we traveled during the project.
The problem of computer speech synthesis has long interested scientists and engineers. However, classical methods cannot synthesize speech that is indistinguishable from a human voice. And here, as in many other areas, deep learning has come to the rescue.
Let's look at the classic methods of synthesis.
Concatenative speech synthesis
This method is based on pre-recorded short audio fragments that are then concatenated into coherent speech. The result is very clean and clear, but completely devoid of emotional and intonational nuance, so it sounds unnatural. The reason is that it is impossible to record every possible word pronounced with every possible combination of emotion and prosody. Concatenative systems require huge databases and hard-coded rules for combining fragments into words, and developing a reliable system takes a long time.
Parametric speech synthesis
The use of concatenative TTS is limited by its large data requirements and long development time. A statistical method was therefore developed that models the nature of the data itself. It generates speech by combining parameters such as fundamental frequency, amplitude spectrum, and so on.
Parametric synthesis consists of two stages: feature extraction and waveform generation.
- First, linguistic features, such as phonemes, duration, etc., are extracted from the text.
- Then, for the vocoder (the system that generates the waveform), features describing the corresponding speech signal are extracted: cepstrum, fundamental frequency, linear spectrogram, mel spectrogram.
- These hand-crafted parameters, together with the linguistic features, are passed to the vocoder model, which performs many complex transformations to generate the sound wave. Along the way the vocoder estimates speech parameters such as phase, prosody, and intonation.
If we can approximate the parameters that define speech for each of its units, we can build a parametric model. Parametric synthesis requires far less data and manual work than concatenative systems.
In theory everything is simple, but in practice there are many artifacts that lead to muffled speech with a "buzzing" tone, which sounds nothing like natural speech.
The problem is that at each stage of the synthesis we hard-code certain features and hope to obtain realistic speech. But these features are chosen based on our understanding of speech, and human knowledge is not absolute, so the chosen features will not necessarily be the best possible solution.
And here Deep Learning takes the stage in all its glory.
Deep neural networks are a powerful tool that can, in theory, approximate an arbitrarily complex function, that is, map some input space X to an output space Y. In our task these are, respectively, text and audio with speech.
Data preprocessing
To begin with, let's define what we have as input and what we want to get at the output.
The input will be text, and the output will be a mel spectrogram: a low-level representation obtained by applying a short-time Fourier transform to the discrete audio signal and mapping the result onto the mel scale. It should be noted right away that the spectrograms obtained this way still need to be normalized by compressing the dynamic range, which reduces the natural ratio between the loudest and quietest sounds in the recording. In our experiments, spectrograms scaled to the [-4; 4] range worked best.
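To make this concrete, here is a minimal sketch, assuming librosa, of how such a mel spectrogram could be computed and compressed to the [-4; 4] range. The STFT parameters, the number of mel channels and the min_level_db threshold are our assumptions, not values fixed by the text.

```python
import numpy as np
import librosa

def wav_to_mel(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80,
               min_level_db=-100, scale=4.0):
    """Load audio, compute a mel spectrogram and compress it to roughly [-scale, scale]."""
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length,
                                         n_mels=n_mels, power=1.0)
    mel_db = 20.0 * np.log10(np.maximum(mel, 1e-5))        # dynamic range compression
    mel_norm = np.clip((mel_db - min_level_db) / -min_level_db, 0.0, 1.0)
    return (mel_norm * 2.0 - 1.0) * scale                  # shape: (n_mels, frames)
```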
Figure 1: Mel spectrogram of a speech audio signal, scaled to the range [-4; 4].
As the training data we chose the LJSpeech dataset, which contains 13,100 audio clips of 2-10 seconds each and a file with the English text corresponding to the recorded speech.
The audio is encoded into mel spectrograms using the transformations described above. The text is tokenized and converted into a sequence of integers. Note right away that the texts are normalized: all numbers are spelled out and abbreviations are expanded, for example, "Mrs. Robinson" becomes "Missis Robinson".
Thus, after preprocessing we obtain numpy arrays of integer sequences and mel spectrograms, saved as npy files on disk.
So that all the dimensions in a batch tensor match at training time, we pad the short sequences. For text sequences the padding is the reserved value 0, and for spectrograms it is frames whose values are slightly below the minimum spectrogram value we defined. This makes it easier to separate the padding from noise and silence.
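As an illustration of the tokenization and padding just described, here is a small sketch; the character inventory and the helper names are hypothetical.

```python
import numpy as np

# a hypothetical character inventory; index 0 is reserved for padding
SYMBOLS = ["<pad>"] + list("abcdefghijklmnopqrstuvwxyz !'(),-.:;?")
CHAR_TO_ID = {c: i for i, c in enumerate(SYMBOLS)}

def text_to_ids(text):
    """Convert normalized text into a sequence of integer token IDs."""
    return np.array([CHAR_TO_ID[c] for c in text.lower() if c in CHAR_TO_ID],
                    dtype=np.int64)

def pad_batch(sequences, pad_value=0):
    """Right-pad variable-length 1-D sequences to a common length."""
    max_len = max(len(s) for s in sequences)
    batch = np.full((len(sequences), max_len), pad_value, dtype=sequences[0].dtype)
    for i, s in enumerate(sequences):
        batch[i, :len(s)] = s
    return batch
```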
Now we have text and audio data in a form suitable for a neural network. Let's look at the architecture of the feature prediction net, which we will call Tacotron 2 after the central element of the whole synthesis system.
Architecture
Tacotron 2 is not one network but two: the feature prediction net and the WaveNet neural vocoder. Both the original paper and our own view of the work suggest that the feature prediction net plays first violin, while the WaveNet vocoder acts as a peripheral system.
Tacotron 2 is a sequence-to-sequence architecture. It consists of an encoder, which builds an internal representation of the input signal (the character tokens), and a decoder, which turns this representation into a mel spectrogram. An extremely important element of the network is the so-called PostNet, designed to refine the spectrogram generated by the decoder.
Figure 2: Tacotron 2 network architecture.
Let us consider the network's blocks and their modules in more detail.
The first layer of the encoder is an embedding layer. It creates 512-dimensional vectors from the sequence of integers representing the characters.
Next, the embedding vectors are fed into a block of three one-dimensional convolutional layers. Each layer has 512 filters of length 5; this is a good filter size in this context because it covers a given character along with its two previous and two subsequent neighbors. Each convolutional layer is followed by batch normalization and ReLU activation.
The tensors produced by the convolutional block are fed into a bidirectional LSTM layer with 256 units per direction; the forward and backward outputs are concatenated.
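To tie the encoder description together, here is a minimal PyTorch sketch of such an encoder. The dropout in the convolutional block and the padding_idx choice are our assumptions.

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_symbols, emb_dim=512, kernel_size=5, lstm_units=256):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(emb_dim, emb_dim, kernel_size, padding=kernel_size // 2),
                nn.BatchNorm1d(emb_dim),
                nn.ReLU(),
                nn.Dropout(0.5),
            )
            for _ in range(3)
        ])
        # bidirectional: 256 units per direction -> 512-dimensional output
        self.lstm = nn.LSTM(emb_dim, lstm_units, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                          # (batch, seq_len, emb_dim)
        outputs, _ = self.lstm(x)                      # (batch, seq_len, 512)
        return outputs
```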
The decoder has a recurrent architecture: at each step it uses the output of the previous step, which here is one frame of the spectrogram. Another important, if not the key, element of the system is the soft (trainable) attention mechanism, a relatively new and increasingly popular technique. At each decoder step, attention forms the context vector and updates the attention weights using:
- the projection of the previous hidden state of the decoder's RNN onto a fully connected layer,
- the projection of the encoder outputs onto a fully connected layer,
- the cumulative attention weights, accumulated over the previous decoder time steps.
The idea of attention can be summarized as: "which part of the encoder data should be used at the current decoder step?"
Figure 3: Scheme of the attention mechanism.
At each decoder step the context vector $c_i$ is computed (in the figure above it is labeled "attended encoder outputs"); it is a weighted sum of the encoder outputs $h_j$ with the attention weights $\alpha_{ij}$:

$$c_i = \sum_j \alpha_{ij} h_j$$

where the attention weights $\alpha_{ij}$ are computed from the energies by a softmax:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$$
where $e_{ij}$ is the so-called "energy", whose formula depends on the type of attention mechanism used (in our case it is a hybrid that combines location-based and content-based attention). The energy is computed as:

$$e_{ij} = v_a^\top \tanh(W s_{i-1} + V h_j + U f_{i,j} + b)$$

where:
- $s_{i-1}$ is the previous hidden state of the decoder's LSTM network,
- $\alpha_{i-1}$ are the previous attention weights,
- $h_j$ is the j-th hidden state of the encoder,
- $W$, $V$, $U$, $v_a$ and $b$ are trainable parameters,
- $f_{i,j}$ are location features, computed as

$$f_i = F * \alpha_{i-1}$$

where $F$ is a convolution operation.
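To make the formulas above more tangible, here is a sketch of such a hybrid attention module in PyTorch. The attention dimension (128) and the location-convolution parameters (32 filters of width 31) are common choices from the literature, not values given in the text, and the bias term b is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    def __init__(self, query_dim=1024, enc_dim=512, attn_dim=128,
                 n_filters=32, kernel_size=31):
        super().__init__()
        self.W = nn.Linear(query_dim, attn_dim, bias=False)    # projects s_{i-1}
        self.V = nn.Linear(enc_dim, attn_dim, bias=False)      # projects h_j
        self.U = nn.Linear(n_filters, attn_dim, bias=False)    # projects f_{i,j}
        # plays the role of F in the formula above
        self.location_conv = nn.Conv1d(1, n_filters, kernel_size,
                                       padding=kernel_size // 2, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, enc_outputs, prev_weights):
        # query:        (batch, query_dim)     -- previous decoder LSTM state
        # enc_outputs:  (batch, T_enc, enc_dim)
        # prev_weights: (batch, T_enc)         -- cumulative attention weights
        f = self.location_conv(prev_weights.unsqueeze(1)).transpose(1, 2)
        energies = self.v(torch.tanh(
            self.W(query).unsqueeze(1) + self.V(enc_outputs) + self.U(f)
        )).squeeze(-1)                                          # (batch, T_enc)
        weights = F.softmax(energies, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context, weights
```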
For clarity, note that some of the modules described below use information from the previous decoder step. On the first step that information is simply zero-valued tensors, which is common practice when building recurrent structures.
Now let's walk through how the decoder works.
First, the decoder output from the previous time step is fed into a small PreNet module, a stack of two fully connected layers of 256 neurons each, alternating with dropout layers at a rate of 0.5. A distinctive feature of this module is that dropout is applied not only during training but also at inference time.
The PreNet output, concatenated with the context vector produced by the attention mechanism, is fed into a unidirectional two-layer LSTM network with 1024 units in each layer.
Then the concatenation of the LSTM output with the same (or possibly a recomputed) context vector is fed into a fully connected layer with 80 neurons, matching the number of spectrogram channels. This final decoder layer produces the predicted spectrogram frame by frame, and its output is fed into the PreNet at the next decoder time step.
Why did we say that the context vector might be different? One option is to recompute the context vector after the LSTM hidden state has been obtained at this step. In our experiments, however, this approach did not pay off.
In addition to the projection onto the 80-neuron fully connected layer, the concatenation of the LSTM output with the context vector is fed into a fully connected layer with a single neuron followed by sigmoid activation: the "stop token prediction" layer. It predicts the probability that the frame produced at this decoder step is the last one. This layer lets the model generate spectrograms of arbitrary rather than fixed length at inference time; in other words, at inference it determines the number of decoder steps. It can be viewed as a binary classifier.
The decoder outputs from all its steps form the predicted spectrogram. But that is not all: to improve its quality, the spectrogram is passed through the PostNet module, a stack of five one-dimensional convolutional layers with 512 filters each and a filter size of 5. Every layer except the last is followed by batch normalization and tanh activation. To return to the spectrogram dimension, we pass the PostNet output through a fully connected layer with 80 neurons and add the result to the initial decoder output. We get a mel spectrogram generated from the text. Profit.
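Below is a compact PyTorch sketch of the decoder-side modules just described: the PreNet with always-on dropout, a single decoder step producing a frame and a stop-token probability, and the PostNet with its residual connection. This is our illustration of the description above, not a reference implementation; the dropout inside PostNet and the exact projection sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Prenet(nn.Module):
    """Two 256-unit fully connected layers; dropout stays active even at inference."""
    def __init__(self, in_dim=80, hidden=256, p=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.p = p

    def forward(self, x):
        x = F.dropout(F.relu(self.fc1(x)), self.p, training=True)
        return F.dropout(F.relu(self.fc2(x)), self.p, training=True)

class DecoderStep(nn.Module):
    """One decoder step: PreNet(previous frame) + context -> 2-layer LSTM -> frame and stop token."""
    def __init__(self, n_mels=80, context_dim=512, lstm_dim=1024):
        super().__init__()
        self.prenet = Prenet(n_mels)
        self.lstm = nn.LSTM(256 + context_dim, lstm_dim, num_layers=2, batch_first=True)
        self.frame_proj = nn.Linear(lstm_dim + context_dim, n_mels)
        self.stop_proj = nn.Linear(lstm_dim + context_dim, 1)

    def forward(self, prev_frame, context, state=None):
        x = torch.cat([self.prenet(prev_frame), context], dim=-1).unsqueeze(1)
        out, state = self.lstm(x, state)            # state is None (zeros) on the first step
        out = torch.cat([out.squeeze(1), context], dim=-1)
        frame = self.frame_proj(out)                # one 80-channel spectrogram frame
        stop_prob = torch.sigmoid(self.stop_proj(out))
        return frame, stop_prob, state

class Postnet(nn.Module):
    """Five 1-D conv layers (512 filters, width 5), then a linear layer back to 80 channels."""
    def __init__(self, n_mels=80, channels=512, kernel_size=5):
        super().__init__()
        self.convs = nn.ModuleList()
        in_ch = n_mels
        for i in range(5):
            layers = [nn.Conv1d(in_ch, channels, kernel_size, padding=kernel_size // 2),
                      nn.BatchNorm1d(channels)]
            if i < 4:                                # no activation after the last conv
                layers.append(nn.Tanh())
            layers.append(nn.Dropout(0.5))
            self.convs.append(nn.Sequential(*layers))
            in_ch = channels
        self.proj = nn.Linear(channels, n_mels)

    def forward(self, mel):                          # mel: (batch, n_mels, frames)
        x = mel
        for conv in self.convs:
            x = conv(x)
        residual = self.proj(x.transpose(1, 2)).transpose(1, 2)
        return mel + residual                        # decoder output + refinement
```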
All convolutional modules are regularized with dropout at a rate of 0.5, and the recurrent layers with the newer Zoneout method at a rate of 0.1. The idea is simple: instead of passing to the next LSTM time step the hidden state and cell state computed at the current step, we replace part of their values with the values from the previous step. This is done both at training time and at inference time. Only the hidden state that is passed to the next LSTM step is affected by Zoneout; the output of the LSTM cell at the current step remains unchanged.
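For clarity, here is a minimal sketch of how Zoneout might be wrapped around an LSTM cell. The deterministic mixing at inference follows the original Zoneout paper; keeping it stochastic at inference, as the paragraph above suggests, is the other option.

```python
import torch
import torch.nn as nn

class ZoneoutLSTMCell(nn.Module):
    """LSTM cell whose hidden state is partially carried over from the previous time step."""
    def __init__(self, input_size, hidden_size, zoneout_rate=0.1):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.rate = zoneout_rate

    def forward(self, x, state):
        prev_h, prev_c = state
        h, c = self.cell(x, state)
        if self.training:
            # randomly keep ~10% of the hidden units from the previous step
            mask = torch.rand_like(h) < self.rate
            h_next = torch.where(mask, prev_h, h)
        else:
            # deterministic mixing at inference (as in the original Zoneout paper)
            h_next = self.rate * prev_h + (1.0 - self.rate) * h
        # only the state passed to the next step is zoned out;
        # the cell output used at the current step is the freshly computed h
        return h, (h_next, c)
```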
We chose PyTorch as our deep learning framework. Although it was still in pre-release at the time we implemented the network, it was already a very powerful tool for building and training neural networks. In our work we also use other frameworks, such as TensorFlow and Keras. Keras was rejected because we needed to implement non-standard custom structures, and comparing TensorFlow with PyTorch, the latter does not give the impression that the model is torn out of the Python language. That said, we do not claim that one is better and the other worse; the choice of framework can depend on many factors.
The network is trained with backpropagation of error.
Adam is used as the optimizer. The loss is the mean squared error of the spectrogram before and after PostNet, plus binary cross-entropy between the actual and predicted values of the stop token prediction layer. The total loss is the simple sum of these three terms.
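A sketch of this combined loss might look as follows; the learning rate is a placeholder, and masking of the padded frames (discussed in the tips at the end) is omitted here.

```python
import torch.nn as nn

mse = nn.MSELoss()
bce = nn.BCELoss()     # the stop-token layer already applies a sigmoid

def tacotron2_loss(mel_before_postnet, mel_after_postnet, stop_probs,
                   mel_target, stop_target):
    """Simple sum of the three error terms described above."""
    return (mse(mel_before_postnet, mel_target)
            + mse(mel_after_postnet, mel_target)
            + bce(stop_probs, stop_target))

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # hypothetical learning rate
```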
We trained the model on a single GeForce 1080 Ti GPU with 11 GB of memory.
Visualization
When working with such a large model, it is important to see how training is going, and here TensorBoard proved a convenient tool. We tracked the loss on both training and validation iterations. In addition, we displayed the target spectrograms, the spectrograms predicted during training, the spectrograms predicted during validation, and the alignment, which shows the accumulated attention weights.
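A minimal sketch of such logging with PyTorch's TensorBoard writer; the tags, variable names and log directory are placeholders, not the project's actual ones.

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("logs/tacotron2")          # hypothetical log directory

# inside the training loop (names are placeholders):
# writer.add_scalar("loss/train", loss.item(), global_step)
# writer.add_image("alignment", attention_weights.unsqueeze(0), global_step)   # (1, T_dec, T_enc)
# writer.add_image("mel/predicted", mel_pred.unsqueeze(0), global_step)
writer.close()
```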
At first, your attention plot will probably not be very informative:
Figure 4: Example of poorly trained attention weights.
But once all your modules start working like a Swiss watch, you will finally get something like this:
Figure 5: Example of well-trained attention weights.
What does this chart mean? At each decoder step we try to decode one frame of the spectrogram, but it is not obvious which encoder information should be used at each of those steps. One might assume the correspondence is direct: for example, if the input text is 200 characters and the corresponding spectrogram has 800 frames, then each character gets 4 frames. But you will agree that speech generated from such a spectrogram would be completely devoid of naturalness: we pronounce some words faster and some slower, we pause in some places and not in others, and it is impossible to account for every possible context. That is why attention is a key element of the whole system: it establishes the correspondence between a decoder step and the encoder information needed to generate the current frame. The larger the attention weight, the more "attention should be paid" to the corresponding part of the encoder output when generating that spectrogram frame.
During training it is also useful to generate audio, not just visually assess the quality of the spectrograms and the attention. However, anyone who has worked with WaveNet will agree that using it as a vocoder during training is an unaffordable luxury in terms of time. Therefore it is recommended to use the Griffin-Lim algorithm, which partially recovers the signal after the Fourier transforms. Why partially? Because when we convert the signal into spectrograms, we lose the phase information. Still, the quality of the audio obtained this way is quite enough to understand which direction you are moving in.
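Here is a rough sketch of such a reconstruction with librosa's Griffin-Lim, undoing the normalization from our earlier hypothetical preprocessing example; all parameters are assumptions, not the project's actual settings.

```python
import numpy as np
import librosa

def mel_to_audio(mel_norm, sr=22050, n_fft=1024, hop_length=256,
                 min_level_db=-100, scale=4.0, n_iter=60):
    """Undo the [-scale, scale] normalization and reconstruct audio with Griffin-Lim."""
    mel_db = ((mel_norm / scale + 1.0) / 2.0) * -min_level_db + min_level_db
    mel = np.power(10.0, mel_db / 20.0)
    # approximate inversion of the mel filterbank, then iterative phase reconstruction
    linear = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=n_fft, power=1.0)
    return librosa.griffinlim(linear, n_iter=n_iter, hop_length=hop_length)
```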
Lessons learned
Here we share some thoughts on organizing the development process, in the form of tips. Some are quite general, others more specific.
On organizing the workflow:
- Use a version control system and describe all changes clearly. This may seem like an obvious recommendation, but still: while searching for the optimal architecture, changes happen constantly, and once you get a satisfactory intermediate result, make yourself a checkpoint so you can make further changes boldly.
- In our view, with architectures like this you should follow the principle of encapsulation: one class per Python module. This approach is rarely seen in ML projects, but it helps you structure the code and speeds up debugging and development. Both in the code and in your mental model of the architecture, split it into blocks, blocks into modules, and modules into layers. If a module contains code performing a particular role, wrap it in a method of the module's class. These are truisms, but we were not too lazy to repeat them.
- Document your classes with numpy-style docstrings. This will greatly simplify the work for both you and the colleagues who will read your code.
- Always draw the architecture of your model. First, it will help you understand it; second, an outside look at the architecture and the hyperparameters lets you quickly spot inconsistencies in your approach.
- It is better to work as a team. Even if you work alone, gather colleagues and discuss your work. At the very least they may ask a question that sparks an idea; at best they will point out a specific mistake that is preventing the model from training.
- Another useful trick relates to data preprocessing. Suppose you want to test a hypothesis and make the corresponding changes to the model, but restarting training, especially before a weekend, feels risky: the approach may be wrong from the start and you will waste time. What to do? Increase the size of the Fourier transform window. The default is 1024; increase it 4 or even 8 times. This "compresses" the spectrograms by roughly the same factor and significantly speeds up training. The reconstructed audio will be of lower quality, but that is not the point right now. In 2-3 hours you can already get an alignment (the attention weights line up, as shown in the figure above), which indicates that the architecture is sound and can then be tested on the full data. A small sketch of this setting follows below.
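A tiny sketch of that trick, reusing the hypothetical wav_to_mel parameters from the preprocessing example above:

```python
# Hypothetical "fast debug" preprocessing settings: a 4x larger FFT window and hop
# yield roughly 4x fewer spectrogram frames, so training iterations are much faster.
FAST_DEBUG = True

n_fft = 4096 if FAST_DEBUG else 1024
hop_length = 1024 if FAST_DEBUG else 256

# mel = wav_to_mel(path, n_fft=n_fft, hop_length=hop_length)  # see the earlier sketch
```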
On building and training models:
- We assumed that forming batches by sequence length rather than at random would speed up training and produce better spectrograms. It is a logical assumption, based on the idea that the more useful signal (rather than padding) the network sees, the better. However, the approach did not pay off: in our experiments we could not train the network this way, probably because of the lost randomness in the choice of training examples.
- Use modern weight initialization schemes; in our experiments we used Xavier uniform initialization. If a module needs both batch normalization and an activation function, apply them in exactly that order: if we apply, say, ReLU first, we immediately lose all the negative signal that should take part in normalizing the batch.
- Starting from a certain training step, use a dynamic (decaying) learning rate. It really helps to reduce the loss and improve the quality of the generated spectrograms.
- After building a model and failing to train it on batches from the whole dataset, it is useful to try to overfit it on a single batch. If that works, you will get alignment, and audio reconstructed from the generated spectrograms will contain speech (or at least something resembling it). This confirms that the overall architecture is correct and only small details need fixing.
Speaking of those details: mistakes in building a model can be of all kinds. For example, in early experiments we made a classic one: the wrong activation function after a fully connected layer. So always ask yourself why you want a particular activation function in a particular layer. Here it helps to have everything decomposed into separate modules, so each element of the model is easier to inspect.
- When working with RNNs, we tried passing the hidden and cell states from one training iteration as initialization for the next. This approach did not pay off. Yes, it gives the network some hidden notion of the whole dataset, but is that needed for this task? A more interesting and relevant approach may be to learn the initial hidden state of the LSTM layers in exactly the same way as ordinary weight parameters.
- Another practical note about recurrent structures, specifically LSTM networks, that we learned from the book "Deep Learning": "Usually the weights of a neural network are initialized with small random numbers, and this works great for almost all weights of LSTM cells. A special case is the bias of the forget gate, b_f. If this bias is initialized around zero, all LSTM cells will initially have a forget gate value f_t of about 1/2, and that means the constant error carousel stops working: we effectively introduce a forgetting factor of 1/2 into every cell, so errors and memory decay exponentially. Therefore the bias b_f should be initialized with large values, around 1 or even 2: then the forget gate values f_t at the start of training will be close to one, and the gradients will flow freely across the expanses of our recurrent architecture."
- Working with seq2seq models you face the problem of different sequence lengths within a batch. It is easily solved by adding padding: reserved symbols for the encoder input and frames with specific values for the decoder. But how do you correctly apply the loss function to the predicted and real spectrograms? In our experiments, using a mask in the loss function worked well, so that the error is computed only on the useful signal, excluding the padding.
- Now a PyTorch-specific recommendation. Although the LSTM layer in the decoder is essentially an LSTM cell that at each decoder step receives only one sequence element, it is better to use the torch.nn.LSTM class rather than torch.nn.LSTMCell. The reason is that the LSTM backend is implemented in the cuDNN library in C, while LSTMCell is implemented in Python. This trick significantly speeds up the system (a minimal illustration follows below).
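A minimal illustration of that tip: drive nn.LSTM one step at a time with a sequence length of one instead of using nn.LSTMCell. The sizes are placeholders, and the last lines also illustrate the earlier tip about initializing the forget-gate bias with a positive value.

```python
import torch
import torch.nn as nn

batch, input_dim, hidden_dim = 16, 768, 1024   # hypothetical sizes

# cuDNN-backed nn.LSTM, even though we feed it a single frame per decoder step
lstm = nn.LSTM(input_dim, hidden_dim, num_layers=2, batch_first=True)

# initialize the forget-gate bias with a positive value (the b_f tip above);
# in PyTorch's LSTM the gate order is (input, forget, cell, output)
for name, param in lstm.named_parameters():
    if "bias" in name:
        n = param.size(0) // 4
        param.data[n:2 * n].fill_(1.0)

state = None                              # zero-initialized (h, c) on the first step
x = torch.randn(batch, 1, input_dim)      # sequence dimension of length 1
out, state = lstm(x, state)               # out: (batch, 1, hidden_dim)
```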
And to wrap up the article, we share examples of speech generated from texts that were not in the training set.