Smartphones have long been more than just phones. Every day we use them to chat with friends, watch videos on YouTube, and message on Telegram, and for many people a phone can now replace a full-fledged computer. I'm not joking: I'm ready to show by my own example how Google's neural networks can write an article for me. Until recently I was skeptical about the voice input built into the Gboard keyboard, but I decided to give it a try and was extremely surprised at how well it recognizes my speech. In this article we will look at how the company achieved such high-quality speech recognition and how this feature can help us in our work.
How I wrote an article with voice input
Until now I have written all my articles on a laptop or PC. Still, I have always found it easier to express my thoughts by voice than with my fingers: the process feels more natural, flows better, and goes faster. When typing, I very often lost my train of thought. I am now about to publish my second article dictated to my phone, and it does not require fast ten-finger touch typing (although, I should note, I type well). Being able to write by voice makes me happy about how far technology has come. An article used to take me an hour or two; now I spend about half that, simply because speaking my thoughts is faster than typing them.
I looked into how Gboard voice input works and, to be honest, was surprised. Previously, the company relied on fairly old speech recognition methods based on Gaussian mixture models, which had been in use for about 30 years. That changed in 2012, when neural networks started to gain popularity. They existed before, of course, but 2012 marked a new stage in their development: deep neural networks, recurrent networks, and other types came into use. It is recurrent networks that underlie the voice recognition technology. Google currently uses the Recurrent Neural Network Transducer (RNN-T) architecture for speech recognition, and Pixel smartphone owners can now use Gboard voice input without an Internet connection. This was achieved through several stages of optimization, the last of which was compression that reduced the size of the original model from 2 gigabytes to 80 megabytes.
Traditional speech recognition systems have several components: an acoustic model that maps short segments of audio (typically 10-millisecond frames) to phonemes, a pronunciation model that links phonemes together to form words, and a language model that estimates how likely a given phrase is. In early systems, these components were built and tuned independently of each other. Around 2014, researchers began focusing on training a single neural network that takes an audio recording as input and produces a finished sentence as output. This sequence-to-sequence approach made recognition more accurate, but it could only produce a result after the full sentence had been spoken. Meanwhile, a separate technique called connectionist temporal classification (CTC) already existed. It reduced the recognition delay and, at the time, was a serious step toward the RNN-T architecture. From that point on, it became possible to recognize speech accurately while the user is still speaking.
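To make the CTC idea a little more concrete, here is a minimal sketch of its simplest decoding step. It is an illustration under stated assumptions, not Google's implementation: the function names, the "_" blank symbol, and the toy scores are all made up for the example. A CTC model emits one label per audio frame, including a special "blank" label, and the text is recovered by merging repeated labels and then dropping the blanks.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing in for the CTC "blank" label


def collapse_path(path):
    """Turn a per-frame label sequence into text using the CTC rule."""
    # Merge runs of identical labels: h h _ i i -> h _ i
    merged = [label for label, _ in groupby(path)]
    # Drop blanks; a blank between two equal letters keeps them separate.
    return "".join(label for label in merged if label != BLANK)


def ctc_greedy_decode(frame_scores, alphabet):
    """Pick the best label for every frame, then collapse the result.

    frame_scores: one list of scores per audio frame, aligned with alphabet.
    """
    best_path = []
    for scores in frame_scores:
        best_index = max(range(len(scores)), key=scores.__getitem__)
        best_path.append(alphabet[best_index])
    return collapse_path(best_path)


# Toy example: five frames, three possible labels (invented numbers).
alphabet = [BLANK, "h", "i"]
frames = [
    [0.1, 0.8, 0.1],    # "h"
    [0.2, 0.7, 0.1],    # "h" again (same sound, next frame)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # "i"
    [0.1, 0.2, 0.7],    # "i" again
]
print(ctc_greedy_decode(frames, alphabet))
```

Because each frame is labeled as soon as it arrives, this kind of output can be produced while the audio is still coming in, which is why CTC helped reduce delay. Real systems use a more careful search than this greedy pick, and RNN-T builds further on the idea.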
Recurrent Neural Network Transducers
What conclusions can be drawn from all this? You can already use voice input to recognize Russian text accurately, which did not work nearly as well before. Unfortunately, the neural network still cannot work out where to put punctuation marks, but the recognition itself is quite accurate, which gives hope that even more will be possible in the future. I would not rule out that within the next two years Google will adapt its new neural network to work with Russian offline. In the meantime, we will make do with what we have.
Share your opinion in the comments using voice input.
Based on materials from Google