Specialists from the University of Texas at Austin (UT Austin) have developed a neural network that takes the mono audio track of a video and recreates “surround” sound for it. Here is how it works.
Photo marneejill / CC BY-SA
New method for creating 3D sound
Surround sound is common in games and movies, but 3D sound is rare in ordinary videos on the web. Recording it requires expensive equipment that video creators do not always have: often a smartphone is the only device used for shooting.
An audio track recorded this way limits our perception of the video: it cannot convey where sound sources are located in space or how they move. Because of this, the sound of the video can feel "flat."
Professor Kristen Grauman and student Ruohan Gao of UT Austin took up this problem. They created a system based on machine learning algorithms that turns the mono audio track of a video into “volumetric” sound. The technology is called "2.5D Visual Sound".
This is not full-fledged spatial sound but a “simulated” one. However, according to the developers, an ordinary listener will hardly notice the difference.
How the technology works
The system developed at UT Austin uses two neural networks.
The first neural network is based on the ResNet architecture, which researchers from Microsoft presented in 2015. It recognizes objects in the video and collects information about their movement in the frame. As output, the network produces a matrix, called a feature map, with the coordinates of the objects in each frame of the video.
This information is passed to the second neural network, Mono2Binaural, which was also developed at the University of Texas. As a second input, the network takes spectrograms of the audio recording, obtained with the short-time (windowed) Fourier transform using the Hann window function.
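To make the spectrogram step concrete, here is a minimal sketch of a short-time Fourier transform with a Hann window, written with the Python standard library only. The function names and the frame and hop sizes (`frame_len=32`, `hop=16`) are illustrative assumptions, not values taken from the UT Austin system.

```python
import cmath
import math

def hann(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def stft(signal, frame_len=32, hop=16):
    """Return one list of complex bins (0 .. frame_len // 2) per frame."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Taper the frame with the window, then take a plain DFT of it.
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        bins = [
            sum(chunk[t] * cmath.exp(-2j * cmath.pi * k * t / frame_len)
                for t in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(bins)
    return frames

def magnitudes(frames):
    """Magnitude spectrogram: one row of bin magnitudes per frame."""
    return [[abs(b) for b in frame] for frame in frames]

# Usage: a pure tone that completes 4 cycles per 32-sample frame.
tone = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
spectrogram = magnitudes(stft(tone))
```

In this toy example the energy of every frame concentrates around frequency bin 4. A practical implementation would use an FFT routine from a numerical library instead of the direct DFT shown here.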
Mono2Binaural consists of ten convolutional layers. Each of these layers is followed by a batch normalization block, which increases the prediction accuracy of the algorithm, and a rectified linear unit (ReLU) activation function.
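The "convolution, then batch normalization, then ReLU" pattern described above can be sketched in a few lines. This is a one-dimensional toy version in plain Python, assumed for illustration only: the real network works on two-dimensional spectrograms, and its batch normalization also has learned scale and shift parameters, which are omitted here.

```python
import math

def conv1d(x, kernel):
    """'Valid' sliding dot product (cross-correlation, as in deep learning libraries)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def batch_norm(values, eps=1e-5):
    """Normalize to zero mean and (almost) unit variance; no learned scale or shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def relu(values):
    """Rectified linear unit: negative values become zero."""
    return [max(0.0, v) for v in values]

def conv_block(x, kernel):
    """One block in the order the article describes: conv -> batch norm -> ReLU."""
    return relu(batch_norm(conv1d(x, kernel)))

# Usage with a hypothetical input and an edge-detecting kernel.
features = conv1d([0, 1, 4, 9, 16], [1, 0, -1])   # [-4, -8, -12]
activated = conv_block([0, 1, 4, 9, 16], [1, 0, -1])
```

Normalizing before the activation keeps the values fed to ReLU in a stable range, which is the usual reason batch normalization makes training easier.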
The convolutional layers analyze the frequency changes in the spectrogram and build a matrix that describes which part of the spectrogram belongs to the left audio channel and which part belongs to the right one. A new audio recording is then generated with the inverse short-time Fourier transform.
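The idea of splitting a spectrum between two channels and transforming it back can be illustrated as follows. This sketch assumes a simple real-valued mask between 0 and 1 per frequency bin and processes a single frame without windowing or overlap; the mask that Mono2Binaural actually predicts may be formulated differently, and all names here are hypothetical.

```python
import cmath

def dft(x):
    """Discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Inverse DFT; keeps the real part, which is the audio signal."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def split_channels(mono, mask):
    """Send mask[k] of each bin to the left channel and the rest to the right."""
    spec = dft(mono)
    left_spec = [m * s for m, s in zip(mask, spec)]
    right_spec = [(1 - m) * s for m, s in zip(mask, spec)]
    return idft(left_spec), idft(right_spec)

# Usage: one 8-sample frame and a hypothetical mask (symmetric in k and n - k).
mono = [0.0, 1.0, 0.0, -1.0, 0.5, 0.25, -0.5, 0.0]
mask = [0.5, 0.9, 0.2, 0.7, 0.5, 0.7, 0.2, 0.9]
left, right = split_channels(mono, mask)
```

Because the two masks add up to one in every bin, the left and right signals in this sketch sum back to the original mono frame.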
Mono2Binaural can also reproduce spatial sound for each object in the video separately. For example, the network can recognize two instruments in a video clip, a drum and a pipe, and create a separate audio track for each of them.
Opinions on "2.5D Visual Sound"
According to the developers, they managed to create a technology that recreates a "realistic spatial sensation." Mono2Binaural performed well in testing, so the authors are confident that the project has great potential.
To prove the effectiveness of the technology, the researchers conducted a series of experiments. They invited a group of people to compare the sound of two tracks: one created with Mono2Binaural and the other with the Ambisonics method.
The latter was developed at the University of California at San Diego. This method also creates “surround” audio from mono sound, but, unlike the new technology, it works only with 360-degree video.
Most listeners chose Mono2Binaural audio as closest to the actual sound. Testing also showed that in 60% of cases, users correctly identified the location of the sound source by ear.
The algorithm still has some drawbacks. For example, the neural network has trouble distinguishing the sounds of a large number of objects. It also, obviously, cannot determine the position of a sound source that is not visible in the video. However, the developers plan to solve these problems.
Similar technologies
Several similar projects work on recovering sound from video. We wrote about one of them earlier: the “visual microphone” from MIT specialists. Their algorithm detects, in silent video, the microscopic vibrations of objects caused by acoustic waves and uses these data to restore the sound that was heard in the room. The scientists managed to “read” the melody of the song Mary Had a Little Lamb from a bag of chips, a houseplant, and even a brick.
Photo by Quinn Dombrowski / CC BY-SA
Other projects are developing technologies for recording sound in 360-degree videos. One of them is Ambisonics, which we mentioned earlier. The algorithm works on a principle similar to Mono2Binaural: it analyzes the movement of objects in the frame and relates it to changes in the sound. However, Ambisonics has several limitations: the neural network works only with 360-degree video and does not produce sound if there is an echo in the recording.
Another project in this area is G-Audio Sol VR360. Unlike other developments, the technology has already been implemented in Sol, a user-facing sound processing service. It creates spatial audio for 360-degree videos of concerts or sports events. The drawback of the service is that the generated videos can be played only in Sol applications.
Conclusions
The developers of systems for creating spatial sound see VR and AR applications as the main area of use, where the technology can immerse a person as fully as possible in the atmosphere of a game or film. If the developers manage to overcome the difficulties they face, the technology could also be used to help visually impaired people. With such systems, they would be able to understand in more detail what is happening in the frame of a video clip.