Speech Recognition and Synthesis at Human Parity

Microsoft Research dominates Speech-to-Text

Achieving Human Parity in Conversational Speech Recognition (W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig) [Published: Oct 2016]

Microsoft Research made a number of advances to reach this milestone. First, they measured the human error rate on NIST’s 2000 conversational telephone speech recognition task, which consists of two parts: the Switchboard and CallHome subsets, with 5.9% and 11.3% error rates respectively. Significantly, they found a great deal of variability between the two datasets, which highlights that it would be misleading to use a single number to represent "human-level" accuracy - it really depends on the dataset. With that said, MS claims that for the first time, automatic recognition performance is on par with human performance on this task.
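
The error rates above are word error rates. For anyone who hasn't seen the metric, here is a minimal sketch of how it is usually computed (edit distance over words, divided by the reference length). The function name and the plain whitespace tokenization are my own assumptions; the official NIST scoring also normalizes the text, which is not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Assumption: words are split on whitespace, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion (reference word missed)
                          curr[j - 1] + 1,     # insertion (extra hypothesis word)
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

So a 5.9% error rate means roughly 6 word-level mistakes per 100 reference words.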

This was accomplished through a variety of SOTA tricks from deep learning. To start, a trio of CNNs is fused together for acoustic modeling, namely (1) a VGGNet for senone prediction, (2) a ResNet with ReLUs and batch norm, and (3) a LACE network. What is a senone? Well, a senone is a set of tied triphone states, normally in an HMM, but this time in a CNN. From CMU Sphinx: "Speech is a continuous audio stream where rather stable states mix with dynamically changed states. In this sequence of states, one can define more or less similar classes of sounds, or phones.
Words are understood to be built of phones, but this is certainly not true. The acoustic properties of a waveform corresponding to a phone can vary greatly depending on many factors - phone context, speaker, style of speech and so on. The so called coarticulation makes phones sound very different from their “canonical” representation. Next, since transitions between words are more informative than stable regions, developers often talk about diphones - parts of phones between two consecutive phones. Sometimes developers talk about subphonetic units - different substates of a phone. Often three or more regions of a different nature can easily be found.
The number three is easily explained. The first part of the phone depends on its preceding phone, the middle part is stable, and the next part depends on the subsequent phone. That's why there are often three states in a phone selected for HMM recognition.
Sometimes phones are considered in context. There are triphones or even quinphones. But note that unlike phones and diphones, they are matched with the same range in waveform as just phones. They just differ by name. That's why we prefer to call this object senone. A senone's dependence on context could be more complex than just left and right context. It can be a rather complex function defined by a decision tree, or in some other way."
What is LACE? Well, it stands for layer-wise context expansion with attention, but that doesn't help too much either. LACE is a variation on time-delay neural networks where each higher layer looks at a broader context, which "expands" the context of the lower layers. Put another way, lower layers focus on extracting simple local patterns while higher layers extract complex patterns that cover broader contexts. Since not all frames in a window carry the same importance, an attention mechanism is applied to tell the network which parts of the window to focus on.

A whole bunch of LSTMs are used for acoustic and language modeling as well. To squeeze the most out of the data, many processing steps are performed. The authors trained both standard, forward-predicting RNN-LMs and backward RNN-LMs that predict words in reverse temporal order. The log probabilities from both models are added. There is also some interpolation with N-grams. Then some work is done to account for words that are out-of-domain of the training data. Some other techniques are used for scoring and rescoring the potential candidate terms. To quote: "Here we also used a two-phase training schedule to train the LSTM LMs. First we train the model on the combination of in-domain and out-domain data for four data passes without any learning rate adjustment. We then start from the resulting model and train on in-domain data until convergence." As is now clear, speech recognition is a deep learning task beyond my understanding, but this is just too great a result to not report :P
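
To make the "add the log probabilities, interpolate with N-grams" part a bit more concrete, here is a rough sketch of how I picture the score for one candidate transcript being put together. Everything specific here is an assumption on my part: the function names, the per-word interpolation in the probability domain, the 0.5 weight, and the idea that all three lists are aligned by word position. The paper's actual pipeline has more moving parts.

```python
import math


def interpolate(logp_rnn, logp_ngram, ngram_weight):
    """Linear interpolation of two models in the probability domain.

    Takes and returns log-probabilities. ngram_weight is a hypothetical
    tuning knob, not a value from the paper.
    """
    p = (1.0 - ngram_weight) * math.exp(logp_rnn) + ngram_weight * math.exp(logp_ngram)
    return math.log(p)


def sentence_score(forward_logps, backward_logps, ngram_logps, ngram_weight=0.5):
    """Score one hypothesis from per-word log-probabilities.

    Assumption: the three lists hold one log-probability per word, aligned
    by word position. Forward and backward scores are summed in log space.
    """
    fwd = sum(interpolate(r, n, ngram_weight)
              for r, n in zip(forward_logps, ngram_logps))
    bwd = sum(interpolate(r, n, ngram_weight)
              for r, n in zip(backward_logps, ngram_logps))
    return fwd + bwd


# Rescoring then amounts to picking the candidate with the highest score.
candidates = {
    "hypothesis A": sentence_score([-1.0, -2.0], [-1.5, -1.5], [-1.2, -2.2]),
    "hypothesis B": sentence_score([-3.0, -4.0], [-3.5, -3.5], [-3.2, -4.2]),
}
best = max(candidates, key=candidates.get)
```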

One final note: while I don't think the "human parity" claim is marketing fluff, I do think the rebranding of CNTK to "Cognitive Toolkit" is PR from MS to combat the popularity of Torch (FB) and TensorFlow (GOOG). Why is this worth mentioning? Because they devoted an entire section of the paper to saying how great their library is. Anyway, great achievement for the team!

Google DeepMind dominates Text-to-Speech

WaveNet: A Generative Model for Raw Audio (A. van den Oord, K. Simonyan, N. Kalchbrenner, S. Dieleman, O. Vinyals, A. Senior, H. Zen, A. Graves, K. Kavukcuoglu) [Published: Sept 2016]

Traditionally, speech waveforms are created using concatenative methods or statistical parametric methods. For the former, there's just a giant library of sound snippets (likely phonemes) that are glued together to form words and sentences. For the latter, the combinations and transitions from one sound snippet to another are handled a bit smarter using Hidden Markov Models and other statistical techniques. As of 2016, these represented the state of the art. Well, they did until WaveNet came along and cut the gap to human performance by over 50%!

WaveNet is a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones. Human listeners rate the speech as significantly more natural sounding than the best parametric and concatenative systems for both English and Chinese. This was supposed to be very hard because the training audio runs at 16,000 samples per second, which means too much data for most other models to handle. But what works in PixelRNNs for generating images also works for generating waveforms, once the convolution filters are tweaked a bit for this new purpose.
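
Written out, "each sample conditioned on all previous ones" is just the chain-rule factorization of the joint probability of a waveform. The symbols are my notation: x_t is the audio sample at time t and T is the total number of samples.

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```

At 16,000 samples per second, even one second of audio means a product over 16,000 conditional distributions, which is why the long-range context matters so much.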

At each timestep, the model takes in the text it is trying to say, plus the output samples from all previous timesteps, to generate the current output. But there are no timesteps since it's a CNN, not an RNN! OK, true, but basically the CNN "looks back" at previous timesteps by expanding the receptive field - higher layers have higher dilation, leading to larger receptive fields. A normal convolution looks in both directions (forward and backward), so to prevent the CNN from looking ahead, the authors use causal convolutions. (Actually, a convolution over an image expands in four directions - up, down, left, right - but the waveform is 1D, so it can only expand left and right.) From what I can gather, for images this means applying a mask tensor to the convolution kernel through elementwise multiplication (blocking signal from the "future") before applying the kernel at that layer, while for 1D audio the paper notes that the same effect can be had more simply by shifting the output of a normal convolution by a few timesteps.
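
Here is a tiny pure-Python sketch of a single causal, dilated filter, just to show what "no peeking at the future" means. This is my own toy illustration and not the paper's implementation: the function name is made up, there are no nonlinearities or gating, and samples before the start of the signal are assumed to be zero.

```python
def causal_dilated_conv(signal, kernel, dilation=1):
    """1-D causal convolution with a dilated filter.

    kernel[0] weights the current sample and kernel[k] weights the sample
    k * dilation steps in the past. Samples before the start of the signal
    are treated as zeros, so the output has the same length as the input.
    """
    out = []
    for t in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation  # only the current or earlier samples
            if idx >= 0:
                total += weight * signal[idx]
        out.append(total)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = causal_dilated_conv(x, kernel=[0.5, 0.5], dilation=2)
# y == [0.5, 1.0, 2.0, 3.0, 4.0]

# Changing "future" samples must leave earlier outputs untouched.
x_changed = x[:3] + [100.0, 100.0]
y_changed = causal_dilated_conv(x_changed, kernel=[0.5, 0.5], dilation=2)
assert y[:3] == y_changed[:3]
```

Note that the output has the same length as the input, which is what lets these layers stack so easily.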

The output of the model has the same time dimensionality as the input, which makes taking the output from one layer as the input of the next layer really easy (no deconvolutions necessary). From the image though, it's clear each output comes from multiple inputs. However, this is not achieved with pooling layers (as is typical in CNNs). Instead, dilated convolutions are used: each filter skips over input values with a fixed step, which is equivalent to a larger filter with zeros padded between its taps, so the receptive field grows with depth without shrinking the time dimension. To quote: "The intuition behind this configuration is two-fold. First, exponentially increasing the dilation factor results in exponential receptive field growth with depth (Yu & Koltun, 2016). For example each 1, 2, 4, . . . , 512 block has receptive field of size 1024, and can be seen as a more efficient and discriminative (non-linear) counterpart of a 1×1024 convolution. Second, stacking these blocks further increases the model capacity and the receptive field size."
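
The 1024 figure in that quote is easy to check by hand. A quick sketch, assuming each layer uses a filter of length 2 (my assumption, chosen because it reproduces the quoted number); the function name is made up for illustration.

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field, in samples, of a stack of dilated causal convolutions.

    Each layer extends the field by (kernel_size - 1) * dilation samples.
    kernel_size=2 is an assumption that matches the quoted 1024 figure.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


block = [2 ** i for i in range(10)]       # dilations 1, 2, 4, ..., 512
one_block = receptive_field(block)         # 1024 samples
three_blocks = receptive_field(block * 3)  # 3070 samples
```

At 16,000 samples per second, 1024 samples is only 64 ms of audio, which is why stacking several of these blocks matters.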


Also interesting, by tweaking one of the inputs, WaveNet can even generate the same speech using different speakers, such as female or male, when conditioned on a speaker identity. Just for fun, it can also generate "novel and often highly realistic musical fragments". Finally, it might just be me, but do all the sound clips sound like they have some underlying static in the background? Might it have to do with the checkerboard artifacts in other generative models?

Perhaps most promising is the use of WaveNet as a discriminative model, since it points to the possibility of semi-supervised learning that takes advantage of unlabeled data. Here, the output is chopped off and replaced with a mean-pool followed by some more non-causal convs. (Was there a softmax at the end?) Applying this to TIMIT, the authors achieved 18.8 PER on the test set, which is SOTA when trained on raw audio (as opposed to log mel-filterbank energies or mel-frequency cepstral coefficients). The official blog post is pretty easy to understand, and includes sound clips from the system. Definitely worth a look.