Sesame Speech Model: How This Viral AI Model Generates Human-Like Speech
https://towardsdatascience.com/sesame-speech-model-how-this-viral-ai-model-generates-human-like-speech/ (April 12, 2025)

A deep dive into residual vector quantizers, conversational speech AI, and talkative transformers.

Recently, Sesame AI published a demo of their latest speech-to-speech model: a conversational AI agent that is really good at speaking. It gives relevant answers, speaks with expression, and honestly, it is just very fun and interactive to play with.

Note that a technical paper is not out yet, but they do have a short blog post that provides a lot of information about the techniques they used and previous algorithms they built upon. 

Thankfully, they provided enough information for me to write this article and make a YouTube video out of it. Read on!

Training a Conversational Speech Model

Sesame is a Conversational Speech Model, or CSM. It takes both text and audio as input and generates speech as audio. While Sesame hasn't revealed its training data sources, we can still make a solid guess. The blog post heavily cites another CSM, 2024's Moshi, and fortunately, the creators of Moshi did reveal their data sources in their paper. Moshi uses 7 million hours of unsupervised speech data, 170 hours of natural and scripted conversations (for multi-stream training), and 2,000 more hours of telephone conversations (the Fisher dataset).


Sesame builds upon the Moshi Paper (2024)

But what does it really take to generate audio?

In raw form, audio is just a long sequence of amplitude values — a waveform. For example, if you’re sampling audio at 24 kHz, you are capturing 24,000 float values every second.

There are 24000 values here to represent 1 second of speech! (Image generated by author)

Of course, it is quite resource-intensive to process 24000 float values for just one second of data, especially because transformer computations scale quadratically with sequence length. It would be great if we could compress this signal and reduce the number of samples required to process the audio.

We will take a deep dive into the Mimi encoder and specifically Residual Vector Quantizers (RVQ), which are the backbone of Audio/Speech modeling in Deep Learning today. We will end the article by learning about how Sesame generates audio using its special dual-transformer architecture.

Preprocessing audio

Compression and feature extraction are where convolution helps us. Sesame uses the Mimi speech encoder to process audio. Mimi was introduced in the aforementioned Moshi paper. It is a self-supervised audio encoder-decoder model that first converts audio waveforms into discrete "latent" tokens and then reconstructs the original signal. Sesame uses only the encoder section of Mimi to tokenize the input audio. Let's learn how.

Mimi takes the raw speech waveform at 24 kHz and passes it through several strided convolution layers to downsample the signal, with stride factors of 4, 5, 6, 8, and 2. This means the first CNN block downsamples the audio by 4x, the next by 5x, then 6x, and so on. In the end, the signal is downsampled by a factor of 1920, reducing it to just 12.5 frames per second.

The convolution blocks also project the original float values to an embedding dimension of 512. Each embedding aggregates the local features of the original 1D waveform. One second of audio is now represented by roughly 12 vectors of size 512. This way, Mimi reduces the sequence length from 24,000 to about 12 and converts the signal into dense continuous vectors.

Before applying any quantization, the Mimi encoder downsamples the input 24 kHz audio by a factor of 1920 and embeds it into 512 dimensions. In other words, you get 12.5 frames per second, each frame being a 512-dimensional vector. (Image from author's video)
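To make the downsampling arithmetic concrete, here is a minimal PyTorch sketch of a strided convolution stack with the stride factors 4, 5, 6, 8, and 2. This is not Mimi's actual implementation; the kernel sizes, padding, and activations are my own assumptions, chosen only to show how the frame rate falls out.

```python
# A minimal sketch (not Mimi's actual code) of how a stack of strided 1D convolutions
# turns a 24 kHz waveform into roughly 12.5 frames/s of 512-dimensional vectors.
import torch
import torch.nn as nn

strides = [4, 5, 6, 8, 2]          # overall downsampling factor: 4*5*6*8*2 = 1920
layers, in_ch = [], 1
for s in strides:
    layers += [nn.Conv1d(in_ch, 512, kernel_size=2 * s, stride=s, padding=s // 2),
               nn.ELU()]
    in_ch = 512
encoder = nn.Sequential(*layers)

wave = torch.randn(1, 1, 24_000)   # 1 second of audio at 24 kHz
frames = encoder(wave)             # -> roughly (1, 512, 12)
print(frames.shape)
```

Multiplying the strides gives 4 × 5 × 6 × 8 × 2 = 1920, which is exactly the ratio between 24,000 samples per second and 12.5 frames per second.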

What is Audio Quantization?

Given the continuous embeddings obtained after the convolution layers, we want to tokenize the input speech. If we can represent speech as a sequence of tokens, we can apply standard language-modeling transformers to train generative models.

Mimi uses a Residual Vector Quantizer, or RVQ tokenizer, to achieve this. We will talk about the residual part soon, but first, let's look at what a simple vanilla vector quantizer does.

Vector Quantization

The idea behind Vector Quantization is simple: you train a codebook, which is a collection of, say, 1,000 random vector codes, all of size 512 (the same as your embedding dimension).

A Vanilla Vector Quantizer. A codebook of embeddings is trained. Given an input embedding, we map/quantize it to the nearest codebook entry. (Screenshot from author’s video)

Then, given the input vector, we will map it to the closest vector in our codebook — basically snapping a point to its nearest cluster center. This means we have effectively created a fixed vocabulary of tokens to represent each audio frame, because whatever the input frame embedding may be, we will represent it with the nearest cluster centroid. If you want to learn more about Vector Quantization, check out my video on this topic where I go much deeper with this.

More about Vector Quantization! (Video by author)
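Here is a toy sketch of that snapping step. The codebook below is random and purely illustrative; a real VQ layer also learns the codebook during training (typically with a commitment loss), which this sketch skips.

```python
# Toy vanilla vector quantizer: snap each input embedding to its nearest codebook entry.
import torch

codebook = torch.randn(1000, 512)        # 1,000 code vectors of dimension 512

def quantize(x):                         # x: (batch, 512) continuous embeddings
    dists = torch.cdist(x, codebook)     # (batch, 1000) pairwise L2 distances
    ids = dists.argmin(dim=-1)           # index of the nearest code vector (the "token")
    return ids, codebook[ids]            # discrete token + its quantized embedding

frame = torch.randn(1, 512)
token_id, quantized = quantize(frame)
```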

Residual Vector Quantization

The problem with simple vector quantization is that the loss of information may be too high, because we map each vector to its cluster's centroid. This "snap" is rarely perfect, so there is always an error between the original embedding and the nearest codebook entry.

The big idea of Residual Vector Quantization is that it doesn’t stop at having just one codebook. Instead, it tries to use multiple codebooks to represent the input vector.

  1. First, you quantize the original vector using the first codebook.
  2. Then, you subtract that centroid from your original vector. What you’re left with is the residual — the error that wasn’t captured in the first quantization.
  3. Now take this residual, and quantize it again, using a second codebook full of brand new code vectors — again by snapping it to the nearest centroid.
  4. Subtract that too, and you get a smaller residual. Quantize again with a third codebook… and you can keep doing this for as many codebooks as you want.
Residual Vector Quantizers (RVQ) hierarchically encode the input embeddings by using a new codebook and VQ layer to represent the previous codebook’s error. (Illustration by the author)

Each step hierarchically captures a little more detail that was missed in the previous round. If you repeat this for, let’s say, N codebooks, you get a collection of N discrete tokens from each stage of quantization to represent one audio frame.
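A minimal sketch of this encode/decode loop, with random codebooks standing in for trained ones, might look like the following. The number of codebooks and codebook size are arbitrary choices for illustration.

```python
# Toy residual VQ: each stage quantizes the residual left by the previous stage,
# yielding N discrete tokens per audio frame.
import torch

N, dim, codes = 8, 512, 1024
codebooks = [torch.randn(codes, dim) for _ in range(N)]

def rvq_encode(x):                       # x: (dim,) one frame embedding
    residual, tokens = x.clone(), []
    for cb in codebooks:
        idx = torch.cdist(residual[None], cb).argmin()
        tokens.append(int(idx))
        residual = residual - cb[idx]    # what this codebook failed to capture
    return tokens                        # N discrete tokens for one audio frame

def rvq_decode(tokens):
    return sum(cb[i] for cb, i in zip(codebooks, tokens))  # approximate embedding

print(rvq_encode(torch.randn(dim)))
```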

The coolest thing about RVQs is that they are designed to have a high inductive bias towards capturing the most essential content in the very first quantizer. In the subsequent quantizers, they learn more and more fine-grained features.

If you’re familiar with PCA, you can think of the first codebook as containing the primary principal components, capturing the most critical information. The subsequent codebooks represent higher-order components, containing information that adds more details.

Residual Vector Quantizers (RVQ) use multiple codebooks to encode the input vector — one entry from each codebook. (Screenshot from author's video)

Acoustic vs Semantic Codebooks

Since Mimi is trained on the task of audio reconstruction, the encoder compresses the signal to the discretized latent space, and the decoder reconstructs it back from the latent space. When optimizing for this task, the RVQ codebooks learn to capture the essential acoustic content of the input audio inside the compressed latent space. 

Mimi also separately trains a single codebook (vanilla VQ) that only focuses on embedding the semantic content of the audio. This is why Mimi is called a split-RVQ tokenizer – it divides the quantization process into two independent parallel paths: one for semantic information and another for acoustic information.

The Mimi Architecture (Source: Moshi paper) License: Free

To train semantic representations, Mimi used knowledge distillation with an existing speech model called WavLM as a semantic teacher. Basically, Mimi introduces an additional loss function that decreases the cosine distance between the semantic RVQ code and the WavLM-generated embedding.
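Conceptually, that distillation term looks something like the sketch below. The tensor shapes and how the term is weighted against the reconstruction loss are assumptions on my part, not details from the paper.

```python
# Sketch of the distillation idea: pull the semantic quantizer's output toward a
# frozen teacher embedding (WavLM-style) by minimizing cosine distance.
import torch
import torch.nn.functional as F

semantic_code = torch.randn(4, 512, requires_grad=True)  # output of the semantic VQ path
teacher_embed = torch.randn(4, 512)                      # frozen teacher embedding

distill_loss = 1 - F.cosine_similarity(semantic_code, teacher_embed, dim=-1).mean()
distill_loss.backward()   # in practice, added to the main reconstruction objective
```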


Audio Decoder

Given a conversation containing text and audio, we first convert them into a sequence of token embeddings using the text and audio tokenizers. This token sequence is then input into a transformer model as a time series. In the blog post, this model is referred to as the Autoregressive Backbone Transformer. Its task is to process this time series and output the “zeroth” codebook token.

A lighter-weight transformer, called the audio decoder, then reconstructs the remaining codebook tokens conditioned on this zeroth code generated by the backbone transformer. Note that the zeroth code already contains a lot of information about the history of the conversation, since the backbone transformer has visibility of the entire past sequence. The lightweight audio decoder operates only on the zeroth token and generates the other N-1 codes. These codes are produced by N-1 distinct linear layers that output the probability of choosing each code from their corresponding codebooks.

You can imagine this process as predicting a text token from the vocabulary in a text-only LLM. The difference is that a text-based LLM has a single vocabulary, whereas the RVQ tokenizer has multiple vocabularies in the form of the N codebooks, so you need to train a separate linear layer to model the codes for each.

The Sesame Architecture (Illustration by the author)
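A rough sketch of those N-1 classification heads is shown below. The layer sizes and the number of codebooks are illustrative placeholders, not Sesame's actual configuration.

```python
# One linear classifier per remaining RVQ level, applied to the audio decoder's state.
import torch
import torch.nn as nn

N, codes, d_model = 32, 1024, 1024        # illustrative sizes only
heads = nn.ModuleList([nn.Linear(d_model, codes) for _ in range(N - 1)])

decoder_state = torch.randn(1, d_model)   # audio decoder output for one frame
codewords = []
for head in heads:                        # one classifier per remaining codebook
    logits = head(decoder_state)          # distribution over this codebook's entries
    codewords.append(int(logits.argmax(dim=-1)))  # greedy pick; sampling also works
```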

Finally, after all the codewords are generated, we aggregate them to form the combined continuous audio embedding. The final job is to convert this embedding back into a waveform. For this, we apply transposed convolution layers to upsample the embedding from 12.5 Hz back to a 24 kHz waveform, basically reversing the transforms we applied during audio preprocessing.

In Summary

Check out the accompanying video on this article! (Video by author)

So, here is the overall summary of the Sesame model in some bullet points.

  1. Sesame is built on a multimodal Conversational Speech Model (CSM).
  2. Text and audio are tokenized together to form a sequence of tokens and input into the backbone transformer, which autoregressively processes the sequence.
  3. While the text is processed like in any other text-based LLM, the audio is processed directly from its waveform representation. Sesame uses the Mimi encoder to convert the waveform into latent codes using a split-RVQ tokenizer.
  4. The multimodal backbone transformer consumes the sequence of tokens and predicts the next zeroth codeword.
  5. Another lightweight transformer, called the audio decoder, predicts the remaining codewords from the zeroth codeword.
  6. The final audio frame representation is generated by combining all the predicted codewords and is upsampled back to the waveform representation.

Thanks for reading!

References and Must-read papers

Check out my ML YouTube Channel

Sesame Blogpost and Demo

Relevant papers: 
Moshi: https://arxiv.org/abs/2410.00037 
SoundStream: https://arxiv.org/abs/2107.03312 
HuBERT: https://arxiv.org/abs/2106.07447 
SpeechTokenizer: https://arxiv.org/abs/2308.16692


Overcoming Automatic Speech Recognition Challenges: The Next Frontier
https://towardsdatascience.com/overcoming-automatic-speech-recognition-challenges-the-next-frontier-e26c31d643cc/ (March 30, 2023)

Advancements, Opportunities, and Impacts of Automatic Speech Recognition Technology in Various Domains

TL;DR:

This post focuses on the advancements in Automatic Speech Recognition (ASR) technology and its impact on various domains. ASR has become prevalent in multiple industries, with improved accuracy driven by scaling model size and constructing larger labeled and unlabeled training datasets.

Looking ahead, ASR technology is expected to continue improving with the scaling of the acoustic model size and the enhancement of the internal language model. Additionally, self-supervised and multi-task training techniques will enable low-resource languages to benefit from ASR technology, while multilingual training will boost performance even further, allowing for basic usage such as voice commands in many low-resource languages.

ASR will also play a significant role in Generative AI, as interaction with avatars will be via an audio/text interface. With the emergence of textless NLP, some end-tasks, such as speech-2-speech translation, may be solved without using any explicit ASR model. Multimodal models that can be prompted using text, audio, or both will be released and generate text or synthesize audio as an output.

Furthermore, open-ended dialogue systems with voice-based human-machine interfaces will improve robustness to transcription errors and differences between written and spoken forms. This will provide robustness to challenging accents and children’s speech, enabling ASR technology to become an essential tool for many applications.

An end-to-end speech enhancement-ASR-diarization system is set to be released, enabling the personalization of ASR models and improving performance on overlapped speech and challenging acoustic scenarios. This is a significant step towards solving ASR technology’s challenges in real-world scenarios.

Lastly, a wave of speech APIs is expected. Still, there are opportunities for small startups to outperform big tech companies in domains with more legal or regulatory restrictions on the use of technology and data acquisition, and in populations with low technology adoption rates.

2022 In Review

Automatic Speech Recognition (ASR) technology is gaining momentum across various industries such as education, podcasts, social media, telemedicine, call centers, and more. A great example is the growing prevalence of voice-based human-machine interface (HMI) in consumer products, such as smart cars, smart homes, smart assistive technology [1], smartphones, and even artificial intelligence (AI) assistants in hotels [2]. In order to meet the increasing demand for fast and accurate responses, low-latency ASR models have been deployed for tasks like keyword spotting [3], endpointing [4], and transcription [5]. Speaker-attributed ASR models [6–7] are also gaining attention as they enable product personalization, providing greater value to end-users.

Prevalence of Data. Streaming audio and video platforms such as social media and YouTube have led to the easy acquisition of unlabeled audio data [8]. New self-supervised techniques have been introduced to utilize this audio without needing ground truth [9–10]. These techniques improve the performance of ASR systems in the target domain, even without fine-tuning on labeled data for that domain [11]. Another approach gaining attention due to its ability to utilize this unlabeled data is self-training using pseudo-labeling [12–13]. The main concept is to automatically transcribe unlabeled audio data using an automatic speech recognition (ASR) system and then use the generated transcription as ground truth for training a different ASR system in a supervised fashion. OpenAI took a different approach, assuming they could find human-generated transcripts at scale online. They generated a high-quality, large-scale (680K hours) training dataset by crawling publicly available audio data with human-generated subtitles. Using this dataset, they trained an ASR model (a.k.a. Whisper) in a fully supervised manner, achieving state-of-the-art (SoTA) results on several benchmarks in zero-shot settings [14].
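As a concrete illustration of the pseudo-labeling recipe, here is a minimal sketch. The objects `teacher_asr` and `student_asr` and their methods are hypothetical placeholders, not an actual toolkit API.

```python
# Illustrative self-training loop: a "teacher" ASR transcribes unlabeled audio,
# and its transcripts become supervised training targets for a "student" ASR.
def self_train(teacher_asr, student_asr, unlabeled_audio, confidence_threshold=0.9):
    pseudo_labeled = []
    for clip in unlabeled_audio:
        text, confidence = teacher_asr.transcribe(clip)   # hypothetical interface
        if confidence >= confidence_threshold:            # keep only reliable pseudo-labels
            pseudo_labeled.append((clip, text))
    student_asr.fit(pseudo_labeled)                       # supervised training step
    return student_asr
```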

Losses. Although end-to-end (E2E) losses dominate SoTA ASR models [15–17], new losses are still being published. A new technique called the hybrid autoregressive transducer (HAT) [18] has been introduced, making it possible to measure the quality of the internal language model (ILM) by separating the blank and label posteriors. Later work [19] used this factorization to effectively adapt the ILM using only textual data, which improved the overall performance of ASR systems, particularly the transcription of named entities, slang terms, and nouns, which are major pain points for ASR systems. New metrics have also been developed to better align with human perception and overcome the semantic shortcomings of word error rate (WER) [20].
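For reference, the standard WER metric that these newer semantic metrics try to improve upon is just a normalized edit distance over words, which is exactly why it can disagree with human judgments of meaning. A minimal sketch:

```python
# Word error rate: (substitutions + insertions + deletions) / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1] / max(len(ref), 1)

print(wer("turn the lights off", "turn the light off"))  # 0.25
```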

Architecture Choice. Regarding the acoustic model's architectural choices, the Conformer [21] remained preferred for streaming models, while the Transformer [22] is the default architecture for non-streaming models. As for the latter, encoder-only (wav2vec2-based [23–24]) and encoder-decoder (Whisper [14]) multilingual models were introduced and improved over the SoTA results across several benchmarks in zero-shot settings. These models outperform their streaming counterparts due to model size, training data size, and their larger context.

Multilingual AI Developments from Tech Giants. Google has announced its "1,000 Languages Initiative" to build an AI model that supports the 1,000 most spoken languages [25], while Meta AI has announced its long-term effort to build language and machine translation (MT) tools that include most of the world’s languages [26].

Spoken Language Breakthrough. Multi-modal (speech/text) and multi-task pre-trained seq-2-seq (encoder-decoder) models such as SpeechT5 [27] were released, showing great success on a wide variety of spoken language processing tasks, including ASR, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.

These advancements in ASR technology are expected to drive further innovation and impact a wide range of industries in the years to come.

A Look Ahead

Despite its challenges, the field of Automatic Speech Recognition (ASR) is expected to make significant advancements in various domains, ranging from acoustic and semantic modeling to conversational and generative AI, and even speaker-attributed ASR. This section provides detailed insights into these areas and shares my predictions for the future of ASR technology.

Photo by Nik on Unsplash

General Improvements:

The improvement of ASR systems is expected on both the acoustic and semantic parts.

On the acoustic model side, larger model and training data sizes are anticipated to enhance the overall performance of ASR systems, similar to the progress observed in LLMs. Although scaling Transformer encoders, such as Wav2Vec or Conformer, poses a challenge, a breakthrough is expected to enable their scaling or see a shift towards encoder-decoder architectures as in Whisper. However, encoder-decoder architectures have drawbacks that need to be addressed, such as hallucinations. Optimizations such as faster-whisper [28] and NVIDIA-wav2vec2 [29] will reduce training and inference time, lowering the barrier to deploying large ASR models.

On the semantic side, researchers will focus on improving ASR models by incorporating larger acoustic or textual contexts. Injecting large-scale unpaired text into the ILM during E2E training, as in JEIT [30], will also be explored. These efforts will help to overcome key challenges such as accurately transcribing named entities, slang terms, and nouns.

Although Whisper and Google’s universal speech model (USM) [31] have improved ASR system performances over several benchmarks, some benchmarks still need to be solved as the word error rate (WER) remains around 20% [32]. Using speech foundation models, adding more diverse training data, and applying multi-task learning will significantly improve performance in such scenarios, opening up new business opportunities. Moreover, new metrics and benchmarks are expected to emerge to better align new end-tasks and domains, such as non-lexical conversational sounds [33] in the medical domain and filler word detection and classification [34] in media editing and educational domains. Task-specific fine-tuned models may be developed for this purpose. Finally, with the growth of multi-modality, more models, training datasets, and new benchmarks for several tasks are also expected to be released [35–36].

As progress continues, a wave of speech APIs is expected, similar to natural language processing (NLP). Google’s USM, OpenAI’s Whisper, and Assembly’s Conformer-1 [37] are some of the early examples.

Although it sounds silly, forced alignment is still challenging for many companies. Open-source code for it could help many achieve accurate alignment between audio segments and their corresponding transcripts.

Low-Resource Languages:

Advancements in self-supervised learning, multi-task learning, and multi-lingual models are expected to improve performance on low-resource and unwritten languages significantly. These methods will achieve acceptable performances by utilizing pre-trained models and fine-tuning on a relatively small number of labeled samples [24]. Another promising approach is dual learning [38], a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks (text-to-speech (TTS) and ASR in our case) at once. In this method, each model produces pseudo-labels for unlabeled examples, which are used to train the other model.

Additionally, improving the ILM using unpaired text can enhance model robustness, which will be especially advantageous for closed-set challenges such as voice commands. The performance will be acceptable but not flawless in some applications, such as captioning YouTube videos, while in others, such as generating verbatim transcripts in court, it may take more time for models to meet the threshold. We anticipate that companies will gather data based on these models while manually correcting transcripts in 2023, and we will see significant improvements in low-resource languages after fine-tuning on proprietary data in 2024.

Generative AI:

The use of avatars is expected to revolutionize human interaction with digital assets. In the short term, ASR will serve as one of the foundations of Generative AI, as these avatars will communicate through a textual/auditory interface.

But in the future, changes could occur as attention shifts towards new research directions. For example, an emerging technology that is likely to be adopted is textless NLP, which represents a new language modeling approach to audio generation [39]. This approach uses learnable discrete audio units [40] and auto-regressively generates the next discrete audio unit one unit at a time, similar to text generation. These discrete units can later be decoded back to the audio domain. Thus far, this technology has been able to generate syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers, as can be seen in GSLM/AudioLM [39, 41]. The potential of this technology is enormous, as one can skip the ASR component (and its errors) in many tasks. For example, traditional speech-2-speech (S2S) translation methods work as follows: they transcribe the utterance in the source language, then translate the text to the target language using a machine translation model, and finally generate the audio in the target language using a TTS engine. Using textless-NLP technology, S2S translation can be done with a single encoder-decoder architecture that works directly on discrete audio units, without using any explicit ASR model [42]. We predict that future textless NLP models will solve many other tasks without going through explicit transcription, such as question-answering systems. However, the main drawback of this method is that error tracing and debugging become harder, as things get less intuitive when working in the discrete-unit space rather than on transcriptions.
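Schematically, the textless pipeline replaces text tokens with discrete audio units. The sketch below is conceptual only; `unit_lm` and `unit_decoder` are hypothetical stand-ins for a trained unit language model and a unit-to-waveform decoder.

```python
# Conceptual "textless" generation: autoregressively extend a sequence of discrete
# audio units, then decode the units back to a waveform (no transcript involved).
import torch

def continue_speech(unit_lm, unit_decoder, prompt_units, n_new=100):
    units = list(prompt_units)                       # discrete audio units so far
    for _ in range(n_new):
        logits = unit_lm(torch.tensor(units)[None])  # predict a distribution over the next unit
        next_unit = int(logits[0, -1].argmax())
        units.append(next_unit)
    return unit_decoder(units)                       # decode units back to audio
```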

T5 [43] and T0 [44] showed great success in NLP by utilizing multi-task training and demonstrating zero-shot task generalization. In 2021, SpeechT5 [27] was published, showing great success in various spoken language processing tasks. Earlier this year, VALL-E [45] and VALL-E X [46] were released. They showed impressive in-context learning capabilities for TTS by using textless NLP technology, enabling the cloning of a speaker's voice from only a few seconds of their audio without requiring any fine-tuning, even in cross-lingual settings.

By joining the concepts from SpeechT5 and VALL-E, we can expect the release of T0-like models that can be prompted using text, audio, or both, and that generate text or synthesize audio as output, depending on the task. A new era of models will begin, as in-context learning will enable generalization to new tasks in zero-shot settings. This will allow semantic search over audio, transcribing a target speaker using speaker-attributed ASR, or describing the speaker in free text, e.g., "What did the young kid that coughed say?". Furthermore, it will enable us to classify or synthesize audio using audio or a textual description, and to solve NLP tasks directly from audio using explicit/implicit ASR.

Conversational AI:

Conversational AI has been adopted mainly through task-oriented dialogue systems, namely AI personal assistants (PAs) such as Amazon's Alexa and Apple's Siri. These PAs have become popular due to their ability to provide quick access to features and information through voice commands. As big tech companies dominate this technology, new regulations on AI assistants will force them to offer third-party options for voice assistants, opening up competition [47]. As this happens, we can expect interoperability between personal assistants, meaning they will start communicating with one another. This will be great, as one will be able to use any device to connect to any conversational agent anywhere in the world [48]. From the ASR perspective, this will pose new challenges, as the contextualization will be much broader, and assistants will have to be robust to different accents and possibly support multiple languages.

Over the past few years, a great technological leap has happened in text-based open-ended dialogue systems, e.g., BlenderBot and LaMDA [49–50]. Initially, these dialogue systems were text-based, meaning they were fed text and trained to output text, all in the written-form domain. As ASR performance improved, open-ended dialogue systems were augmented with voice-based HMIs, which resulted in misalignment between modalities due to differences between the spoken and written forms. One of the main challenges is to bridge this gap by overcoming new types of errors introduced by the audio-related processing, e.g., differences between spoken and written forms such as disfluencies and entity resolution, and transcription errors such as pronunciation errors [51–52].

Possible solutions can be derived from improved transcription quality and robust NLP models that can effectively handle transcription and pronunciation errors. A reliable acoustic model’s confidence score [53] will serve as a key player in these systems, enabling it to point out speaker errors or serve as another input to the NLP model or decoding logic. Furthermore, we expect that ASR models will predict non-verbal cues such as sarcasm, enabling agents to understand the conversation more deeply and provide better responses.

These improvements will make it possible to push dialogue systems with an auditory HMI further, supporting challenging accents and children's speech, as in Loora [54] and Speaks [55].

Pushing the limits even further, we expect the release of an E2E multi-task learning framework for spoken language tasks using joint modeling of the speech and NLP problems as in MTL-SLT [56]. These models will train in an E2E fashion that will reduce the cumulative error between sequential modules and will address tasks such as spoken language understanding, spoken summarization, and spoken question answering, by taking speech as input and emitting various outputs such as transcription, intent, named entities, summaries, and answers to text queries.

Personalization will play a huge factor for AI assistants and open-ended dialogue systems, leading us to the next point: speaker-attributed ASR.

Speaker-Attributed ASR:

There is still a challenge in transcribing distant conversations involving multiple microphones and parties in home environments. Even state-of-the-art (SoTA) systems can only achieve around 35% WER [57].

Early joint ASR and diarization systems were released in 2019 [58]. This year, we can expect the release of an end-to-end speech enhancement-ASR-diarization system, which will improve performance on overlapped speech and enable better performance in challenging acoustic scenarios such as reverberant rooms, far-field settings, and low signal-to-noise ratios (SNR). The improvement will be achieved through joint task optimization, improved pre-training methods (such as WavLM [10]), architectural changes [59], data augmentation, and training on in-domain data during pre-training and fine-tuning [11]. Moreover, we can expect the deployment of speaker-attributed ASR systems for personalized speech recognition. This will further improve the transcription accuracy of the target speaker's voice and bias the transcript towards user-defined words, such as contact names, proper nouns, and other named entities, which are crucial for smart assistants [60]. Additionally, low-latency models will continue to be a significant area of focus to enhance edge devices' overall experience and response time [61–62].

The Role of Startups Compared to Big Tech Companies in The ASR Landscape

Although big tech companies are expected to continue dominating the market with their APIs, small startups can still outperform them in specific domains. These include areas that are underrepresented in big tech's training data due to regulations, such as the medical domain and children's speech, and populations that have not yet adopted the technology, such as immigrants with challenging accents or individuals learning English worldwide. In markets where there isn't enough demand for big tech companies to invest, such as languages that are not widely spoken, small startups may find opportunities to succeed and generate profit.

To create a win-win situation, big tech companies can provide APIs that offer full access to the output of their acoustic models while allowing others to write the decoding logic (WFST/beam-search) instead of merely adding customizable vocabulary or using current model adaptation features [63–64]. This approach will enable small startups to excel in their domains by incorporating priming or multiple language models during inference on top of the given acoustic model, rather than having to train the acoustic models themselves, which can be costly in terms of human capital and domain knowledge. In turn, big tech companies will benefit from broader adoption of their paid models.

How Does ASR Fit Into The Broader Machine Learning Landscape?

On one hand, ASR is on par with the importance of computer vision (CV) and NLP when considering it as the end task. This is the current situation in low-resource languages and domains where the transcript is the main business, e.g., court, medical records, movie subtitles, etc.

On the other hand, ASR is no longer the bottleneck in domains where it has passed a certain usability threshold. In these cases, NLP is the bottleneck, which means that improving ASR performance toward perfection is not essential for extracting insights for the end task. For example, meeting summarization or action item extraction can be achieved in many cases using current ASR quality.

Closing Remarks

The advancements in ASR technology have brought us closer to achieving seamless communication between humans and machines, for example in Conversational AI and Generative AI. With the continued development of speech enhancement-ASR-diarization systems and the emergence of textless NLP, we are poised to witness exciting breakthroughs in this field. As we look to the future, we can't help but anticipate the endless possibilities that ASR technology will unlock.

Thank you for taking the time to read this post! Your thoughts and feedback on these projections are highly valued and appreciated. Please feel free to share your comments and ideas.

References:

[1] https://www.orcam.com/en/home/

[2] https://voicebot.ai/2022/12/01/hey-disney-custom-alexa-assistant-rolls-out-at-disney-world/

[3] Jose, Christin, et al. "Latency Control for Keyword Spotting." ArXiv, 2022, https://doi.org/10.21437/Interspeech.2022-10608.

[4] Bijwadia, Shaan, et al. "Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems." ArXiv, 2022, https://doi.org/10.1109/SLT54892.2023.10022338.

[5] Yoon, Ji, et al. "HuBERT-EE: Early Exiting HuBERT for Efficient Speech Recognition." ArXiv, 2022, https://doi.org/10.48550/arXiv.2204.06328.

[6] Kanda, Naoyuki, et al. "Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers Using End-to-End Speaker-Attributed ASR." ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.03151.

[7] Kanda, Naoyuki, et al. "Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings." ArXiv, 2022, https://doi.org/10.48550/arXiv.2203.16685.

[8] https://www.fiercevideo.com/video/video-will-account-for-82-all-internet-traffic-by-2022-cisco-says

[9] Chiu, Chung, et al. "Self-Supervised Learning with Random-Projection Quantizer for Speech Recognition." ArXiv, 2022, https://doi.org/10.48550/arXiv.2202.01855.

[10] Chen, Sanyuan, et al. "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing." ArXiv, 2021, https://doi.org/10.1109/JSTSP.2022.3188113.

[11] Hsu, Wei, et al. "Robust Wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training." ArXiv, 2021, https://doi.org/10.48550/arXiv.2104.01027.

[12] Lugosch, Loren, et al. "Pseudo-Labeling for Massively Multilingual Speech Recognition." ArXiv, 2021, https://doi.org/10.48550/arXiv.2111.00161.

[13] Berrebbi, Dan, et al. "Continuous Pseudo-Labeling from the Start." ArXiv, 2022, https://doi.org/10.48550/arXiv.2210.08711.

[14] Radford, Alec, et al. "Robust Speech Recognition via Large-Scale Weak Supervision." ArXiv, 2022, https://doi.org/10.48550/arXiv.2212.04356.

[15] Graves, Alex, et al. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML, 2006, https://www.cs.toronto.edu/~graves/icml_2006.pdf

[16] Graves, Alex. "Sequence Transduction with Recurrent Neural Networks." ArXiv, 2012, https://doi.org/10.48550/arXiv.1211.3711.

[17] Chan, William, et al. "Listen, Attend and Spell." ArXiv, 2015, https://doi.org/10.48550/arXiv.1508.01211.

[18] Variani, Ehsan, et al. "Hybrid Autoregressive Transducer (Hat)." ArXiv, 2020, https://doi.org/10.48550/arXiv.2003.07705.

[19] Meng, Zhong, et al. "Modular Hybrid Autoregressive Transducer." ArXiv, 2022, https://doi.org/10.48550/arXiv.2210.17049.

[20] Kim, Suyoun, et al. "Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric." ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.05376.

[21] Gulati, Anmol, et al. "Conformer: Convolution-Augmented Transformer for Speech Recognition." ArXiv, 2020, https://doi.org/10.48550/arXiv.2005.08100.

[22] Vaswani, Ashish, et al. "Attention Is All You Need." ArXiv, 2017, https://doi.org/10.48550/arXiv.1706.03762.

[23] Baevski, Alexei, et al. "Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." ArXiv, 2020, https://doi.org/10.48550/arXiv.2006.11477.

[24] Babu, Arun, et al. "XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale." ArXiv, 2021, https://doi.org/10.48550/arXiv.2111.09296.

[25] https://blog.google/technology/ai/ways-ai-is-scaling-helpful/

[26] https://ai.facebook.com/blog/teaching-ai-to-translate-100s-of-spoken-and-written-languages-in-real-time/

[27] Ao, Junyi, et al. "SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing." ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.07205.

[28] https://github.com/guillaumekln/faster-whisper

[29] https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/wav2vec2

[30] Meng, Zhong, et al. "JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition." ArXiv, 2023, https://doi.org/10.48550/arXiv.2302.08583.

[31] Zhang, Yu, et al. "Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages." ArXiv, 2023, https://doi.org/10.48550/arXiv.2303.01037.

[32] Kendall, T. and Farrington, C. "The Corpus of Regional African American Language." Version 2021.07. Eugene, OR: The Online Resources for African American Language Project. http://oraal.uoregon.edu/coraal, 2021

[33] Brian, D Tran, et al. ‘"Mm-hm," "Uh-uh": are non-lexical conversational sounds deal breakers for the ambient clinical documentation technology?,’ Journal of the American Medical Informatics Association, 2023, https://doi.org/10.1093/jamia/ocad001

[34] Zhu, Ge, et al. "Filler Word Detection and Classification: A Dataset and Benchmark." ArXiv, 2022, https://doi.org/10.48550/arXiv.2203.15135.

[35] Anwar, Mohamed, et al. "MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation." ArXiv, 2023, https://doi.org/10.48550/arXiv.2303.00628.

[36] Jaegle, Andrew, et al. "Perceiver IO: A General Architecture for Structured Inputs & Outputs." ArXiv, 2021, https://doi.org/10.48550/arXiv.2107.14795.

[37] https://www.assemblyai.com/blog/conformer-1/

[38] Peyser, Cal, et al. "Dual Learning for Large Vocabulary On-Device ASR." ArXiv, 2023, https://doi.org/10.48550/arXiv.2301.04327.

[39] Lakhotia, Kushal, et al. "Generative Spoken Language Modeling from Raw Audio." ArXiv, 2021, https://doi.org/10.48550/arXiv.2102.01192.

[40] Zeghidour, Neil, et al. "SoundStream: An End-to-End Neural Audio Codec." ArXiv, 2021, https://doi.org/10.48550/arXiv.2107.03312.

[41] Borsos, Zalán, et al. "AudioLM: a Language Modeling Approach to Audio Generation." ArXiv, 2022, https://doi.org/10.48550/arXiv.2209.03143.

[42] https://about.fb.com/news/2022/10/hokkien-ai-speech-translation/

[43] Raffel, Colin, et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." ArXiv, 2019, https://arxiv.org/abs/1910.10683.

[44] Sanh, Victor, et al. "Multitask Prompted Training Enables Zero-Shot Task Generalization." ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.08207.

[45] Wang, Chengyi, et al. "Neural Codec Language Models Are Zero-Shot Text to Speech Synthesizers." ArXiv, 2023, https://doi.org/10.48550/arXiv.2301.02111.

[46] Zhang, Ziqiang, et al. "Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling." ArXiv, 2023, https://doi.org/10.48550/arXiv.2303.03926.

[47] https://voicebot.ai/2022/07/05/eu-passes-new-regulations-for-voice-ai-and-digital-technology/

[48] https://www.speechtechmag.com/Articles/ReadArticle.aspx?ArticleID=154094

[49] Thoppilan, Romal, et al. "LaMDA: Language Models for Dialog Applications." ArXiv, 2022, https://doi.org/10.48550/arXiv.2201.08239.

[50] Shuster, Kurt, et al. "BlenderBot 3: a Deployed Conversational Agent that Continually Learns to Responsibly Engage." ArXiv, 2022, https://doi.org/10.48550/arXiv.2208.03188.

[51] Xiaozhou, Zhou, et al. "Phonetic Embedding for ASR Robustness in Entity Resolution." Proc. Interspeech 2022, 3268–3272, doi: 10.21437/Interspeech.2022-10956

[52] Chen, Angelica, et al. "Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection." ArXiv, 2022, https://doi.org/10.48550/arXiv.2205.00620.

[53] Li, Qiujia, et al. "Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition." ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.03327.

[54] https://loora.ai/

[55] https://techcrunch.com/2022/11/17/speak-lands-investment-from-openai-to-expand-its-language-learning-platform/

[56] Zhiqi, Huang, et al. "MTL-SLT: Multi-Task Learning for Spoken Language Tasks." NLP4ConvAI, 2022, https://aclanthology.org/2022.nlp4convai-1.11

[57] Watanabe, Shinji, et al. "CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings." ArXiv, 2020, https://doi.org/10.48550/arXiv.2004.09249.

[58] Shafey, Laurent, et al. "Joint Speech Recognition and Speaker Diarization via Sequence Transduction." ArXiv, 2019, https://doi.org/10.48550/arXiv.1907.05337.

[59] Kim, Juntae, and Lee, Jeehye. "Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers." ArXiv, 2021, https://doi.org/10.48550/arXiv.2108.10752.

[60] Sathyendra, Kanthashree, et al. "Contextual Adapters for Personalized Speech Recognition in Neural Transducers." ArXiv, 2022, https://doi.org/10.48550/arXiv.2205.13660.

[61] Tian, Jinchuan, et al. "Bayes Risk CTC: Controllable CTC Alignment in Sequence-to-Sequence Tasks." ArXiv, 2022, https://doi.org/10.48550/arXiv.2210.07499.

[62] Tian, Zhengkun, et al. "Peak-First CTC: Reducing the Peak Latency of CTC Models by Applying Peak-First Regularization." ArXiv, 2022, https://doi.org/10.48550/arXiv.2211.03284.

[63] https://docs.rev.ai/api/custom-vocabulary/

[64] https://cloud.google.com/speech-to-text/docs/adaptation-model

Choosing the right language model for your NLP use case
https://towardsdatascience.com/choosing-the-right-language-model-for-your-nlp-use-case-1288ef3c4929/ (September 26, 2022)

A guide to understanding, selecting and deploying Large Language Models

Large Language Models (LLMs) are Deep Learning models trained to produce text. With this impressive ability, LLMs have become the backbone of modern Natural Language Processing (NLP). Traditionally, they are pre-trained by academic institutions and big tech companies such as OpenAI, Microsoft and NVIDIA. Most of them are then made available for public use. This plug-and-play approach is an important step towards large-scale AI adoption – instead of spending huge resources on the training of models with general linguistic knowledge, businesses can now focus on fine-tuning existing LLMs for specific use cases.

However, picking the right model for your application can be tricky. Users and other stakeholders have to make their way through a vibrant landscape of language models and related innovations. These improvements address different components of the language model including its training data, pre-training objective, architecture and fine-tuning approach – you could write a book on each of these aspects. On top of all this research, the marketing buzz and the intriguing aura of Artificial General Intelligence around huge language models obfuscate things even more.

In this article, I explain the main concepts and principles behind LLMs. The goal is to provide non-technical stakeholders with an intuitive understanding as well as a language for efficient interaction with developers and AI experts. For broader coverage, the article includes analyses that are rooted in a large number of NLP-related publications. While we will not dive into mathematical details of language models, these can be easily retrieved from the references.

The article is structured as follows: first, I situate language models in the context of the evolving NLP landscape. The second section explains how LLMs are built and pre-trained. Finally, I describe the fine-tuning process and provide some guidance on model selection.

The world of language models

Bridging the human-machine gap

Language is a fascinating skill of the human mind – it is a universal protocol for communicating our rich knowledge of the world, and also more subjective aspects such as intents, opinions and emotions. In the history of AI, there have been multiple waves of research to approximate ("model") human language with mathematical means. Before the era of Deep Learning, representations were based on simple algebraic and probabilistic concepts such as one-hot representations of words, sequential probability models and recursive structures. With the evolution of Deep Learning in the past years, linguistic representations have increased in precision, complexity and expressiveness.

In 2018, BERT was introduced as the first LLM on the basis of the new Transformer architecture. Since then, Transformer-based LLMs have gained strong momentum. Language modelling is especially attractive due to its universal usefulness. While many real-world NLP tasks such as sentiment analysis, information retrieval and information extraction do not need to generate language, the assumption is that a model that produces language also has the skills to solve a variety of more specialised linguistic challenges.

Size matters

Learning happens based on parameters – variables that are optimized during the training process to achieve the best prediction quality. As the number of parameters increases, the model is able to acquire more granular knowledge and improve its predictions. Since the introduction of the first LLMs in 2017–2018, we saw an exponential explosion in parameter sizes – while breakthrough BERT was trained with 340M parameters, Megatron-Turing NLG, a model released in 2022, is trained with 530B parameters – a more than thousand-fold increase.

Figure 1: The parameter sizes of language models increase exponentially over time [11]

Thus, the mainstream keeps wowing the public with ever bigger amounts of parameters. However, there have been critical voices pointing out that model performance is not increasing at the same rate as model size. On the other side, model pre-training can leave a considerable carbon footprint. Downsizing efforts have countered the brute-force approach to make progress in language modelling more sustainable.

The life of a language model

The LLM landscape is competitive and innovations are short-lived. The following chart shows the top-15 most popular LLMs in the timespan 2018–2022, along with their share-of-voice over time:

Figure 2: Mentions and share-of-voice of the top-15 most popular language models [12]

We can see that most models fade in popularity after a relatively short time. To stay cutting-edge, users should monitor the current innovations and evaluate whether an upgrade would be worthwhile.

Most LLMs follow a similar lifecycle: first, at the "upstream", the model is pre-trained. Due to the heavy requirements on data size and compute, it is mostly a privilege of large tech companies and universities. Recently, there have also been some collaborative efforts (e.g. the BigScience workshop) for the joint advancement of the LLM field. A handful of well-funded startups such as Cohere and AI21 Labs also provide pre-trained LLMs.

After the release, the model is adopted and deployed at the "downstream" by application-focussed developers and businesses. At this stage, most models require an extra fine-tuning step to specific domains and tasks. Others, like GPT-3, are more convenient in that they can learn a variety of linguistic tasks directly during prediction (zero- or few-shot prediction).

Finally, time knocks at the door and a better model comes around the corner – either with an even larger number of parameters, more efficient use of hardware or a more fundamental improvement to the modelling of human language. Models that brought about substantial innovations can give birth to whole model families. For example, BERT lives on in BERT-QA, DistilBERT and RoBERTa, which are all based on the original architecture.

In the next sections, we will look at the first two phases in this lifecycle – the pre-training and the fine-tuning for deployment.

Pre-training: how LLMs are born

Most teams and NLP practitioners will not be involved in the pre-training of LLMs, but rather in their fine-tuning and deployment. However, to successfully pick and use a model, it is important to understand what is going on "under the hood". In this section, we will look at the basic ingredients of an LLM:

  • Training data
  • Input representation
  • Pre-training objective
  • Model architecture (encoder-decoder)

Each of these will affect not only the choice, but also the fine-tuning and deployment of your LLM.

Training data

The data used for LLM training is mostly text data covering different styles, such as literature, user-generated content and news data. After seeing a variety of different text types, the resulting models become aware of the fine details of language. Other than text data, code is regularly used as input, teaching the model to generate valid programs and code snippets.

Unsurprisingly, the quality of the training data has a direct impact on model performance – and also on the required size of the model. If you are smart in preparing the training data, you can improve model quality while reducing its size. One example is the T0 model, which is 16 times smaller than GPT-3 but outperforms it on a range of benchmark tasks. Here is the trick: instead of just using any text as training data, it works directly with task formulations, thus making its learning signal much more focussed. Figure 3 illustrates some training examples.

Figure 3: T0 is trained on explicit task formulations for a wide range of linguistic tasks

A final note on training data: we often hear that language models are trained in an unsupervised manner. While this makes them appealing, it is technically wrong. Instead, the training is self-supervised: well-formed text already provides the necessary learning signals, sparing us the tedious process of manual data annotation. The labels to be predicted correspond to past and/or future words in a sentence. Thus, annotation happens automatically and at scale, making possible the relatively quick progress in the field.
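A tiny illustration of how (context, next-word) training pairs fall out of raw text with no human labeling; the sentence is a toy example of my own.

```python
# Self-supervision from raw text: every prefix/next-word pair is a free training example.
text = "language models learn from raw text".split()
pairs = [(text[:i], text[i]) for i in range(1, len(text))]
print(pairs[:2])   # [(['language'], 'models'), (['language', 'models'], 'learn')]
```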

Input representation

Once the training data is assembled, we need to pack it into a form that can be digested by the model. Neural networks are fed with algebraic structures (vectors and matrices), and the optimal algebraic representation of language is an ongoing quest – reaching from simple sets of words to representations containing highly differentiated context information. Each new step confronts researchers with the endless complexity of natural language, exposing the limitations of the current representation.

The basic unit of language is the word. In the beginnings of NLP, this gave rise to the naive bag-of-words representation that throws all words from a text together, irrespective of their ordering. Consider these two examples: "The dog chased the cat" and "The cat chased the dog".

In the bag-of-words world, these sentences would get exactly the same representation since they consist of the same words. Clearly, it embraces only a small part of their meaning.
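A two-line check makes the point: both sentences collapse to the same multiset of words, so a bag-of-words model cannot tell them apart.

```python
# Identical bag-of-words representations for two sentences with opposite meanings.
from collections import Counter

s1 = "the dog chased the cat"
s2 = "the cat chased the dog"
print(Counter(s1.split()) == Counter(s2.split()))   # True
```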

Sequential representations accommodate information about word order. In Deep Learning, the processing of sequences was originally implemented in order-aware Recurrent Neural Networks (RNN).[2] However, going one step further, the underlying structure of language is not purely sequential but hierarchical. In other words, we are not talking about lists, but about trees. Words that are farther apart can actually have stronger syntactic and semantic ties than neighbouring words. Consider the following example: "The girl, who had been waiting at the bus stop for over an hour, finally looked at her watch."

Here, her refers to the girl. When an RNN reaches the end of the sentence and finally sees her, its memory of the beginning of the sentence might already be fading, thus not allowing it to recover this relationship.

To solve these long-distance dependencies, more complex neural structures were proposed to build up a more differentiated memory of the context. The idea is to keep words that are relevant for future predictions in memory while forgetting the other words. This was the contribution of Long Short-Term Memory (LSTM)[3] cells and Gated Recurrent Units (GRUs)[4]. However, these models don’t optimise for specific positions to be predicted, but rather for a generic future context. Moreover, due to their complex structure, they are even slower to train than traditional RNNs.

Finally, people have done away with recurrence and proposed the attention mechanism, as incorporated in the Transformer architecture.[5] Attention allows the model to focus back and forth between different words during prediction. Each word is weighted according to its relevance for the specific position to be predicted. For the above sentence, once the model reaches the position of her, girl will have a higher weight than at, despite the fact that it is much farther away in the linear order.

To date, the attention mechanism comes closest to the biological workings of the human brain during information processing. Studies have shown that attention learns hierarchical syntactic structures, incl. a range of complex syntactic phenomena (cf. the Primer on BERTology and the papers referenced therein). It also allows for parallel computation and, thus, faster and more efficient training.
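For intuition, here is a minimal single-head, unmasked scaled dot-product attention. Real Transformers add learned query/key/value projections, multiple heads, masking and positional information, which this sketch omits.

```python
# Minimal scaled dot-product self-attention: every position attends to every other
# position, weighted by query-key similarity.
import torch
import torch.nn.functional as F

def attention(q, k, v):                      # shapes: (seq_len, d)
    scores = q @ k.T / k.shape[-1] ** 0.5    # pairwise relevance between positions
    weights = F.softmax(scores, dim=-1)      # e.g. "her" can weight "girl" highly
    return weights @ v                       # context-aware representations

x = torch.randn(10, 64)                      # 10 token embeddings of size 64
out = attention(x, x, x)                     # self-attention over the sequence
```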

Pre-training objectives

With the appropriate training data representation in place, our model can start learning. There are three generic objectives used for pre-training language models: sequence-to-sequence transduction, autoregression and auto-encoding. All of them require the model to master broad linguistic knowledge.

The original task addressed by the encoder-decoder architecture as well as the Transformer model is sequence-to-sequence transduction: a sequence is transduced into a sequence in a different representation framework. The classical sequence-to-sequence task is machine translation, but other tasks such as summarisation are frequently formulated in this manner. Note that the target sequence is not necessarily text – it can also be other unstructured data such as images as well as structured data such as programming languages. An example of sequence-to-sequence LLMs is the BART family.

The second task is autoregression, which is also the original language modelling objective. In autoregression, the model learns to predict the next output (token) based on previous tokens. The learning signal is restricted by the unidirectionality of the enterprise – the model can only use information from the right or from the left of the predicted token. This is a major limitation since words can depend both on past as well as on future positions. As an example, consider how the verb written impacts the following sentence in both directions: "The student has written a paper."

Here, the position of paper is restricted to something that is writable, while the position of student is restricted to a human or, anyway, another intelligent entity capable of writing.

Many of the LLMs making today’s headlines are autoregressive, incl. the GPT family, PaLM and BLOOM.

The third task – auto-encoding – solves the issue of unidirectionality. Auto-encoding is very similar to the learning of classical word embeddings.[6] First, we corrupt the training data by hiding a certain portion of tokens – typically 10–20% – in the input. The model then learns to reconstruct the correct inputs based on the surrounding context, taking into account both the preceding and the following tokens. The typical example of auto-encoders is the BERT family, where BERT stands for Bidirectional Encoder Representations from Transformers.
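A toy version of the corruption step used in auto-encoding is shown below. BERT's actual masking scheme has a few more details (e.g. sometimes replacing tokens with random words), which this sketch leaves out.

```python
# Toy masked-language-modeling corruption: hide ~15% of tokens; the model is trained
# to reconstruct them from both left and right context.
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            corrupted.append(mask_token)
            targets.append(tok)        # the model must recover this token
        else:
            corrupted.append(tok)
            targets.append(None)       # no loss on unmasked positions
    return corrupted, targets

print(mask_tokens("the student has written a paper".split()))
```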

Model architecture (encoder-decoder)

The basic building blocks of a language model are the encoder and the decoder. The encoder transforms the original input into a high-dimensional algebraic representation, also called a "hidden" vector. Wait a minute – hidden? Well, in reality there are no big secrets at this point. Of course you can look at this representation, but a lengthy vector of numbers will not convey anything meaningful to a human. It takes the mathematical intelligence of our model to deal with it. The decoder reproduces the hidden representation in an intelligible form such as another language, programming code, an image etc.

Figure 4: Basic schema of an encoder-decoder architecture (example of English-German translation)

The encoder-decoder architecture was originally introduced for Recurrent Neural Networks. Since the introduction of the attention-based Transformer model, traditional recurrence has lost its popularity while the encoder-decoder idea lives on. Most Natural Language Understanding (NLU) tasks rely on the encoder, while Natural Language Generation (NLG) tasks need the decoder and sequence-to-sequence transduction requires both components.

We will not go into the details of the Transformer architecture and the attention mechanism here. For those who want to master the details, be prepared to spend a good amount of time to wrap your head around it. Beyond the original paper, [7] and [8] provide excellent explanations. For a lightweight introduction, I recommend the corresponding sections in Andrew Ng’s Sequence models course.
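As a rough illustration of the encoder-decoder split, and assuming the Hugging Face transformers library with a BART checkpoint as an example, the two roles can be seen directly in code: the encoder produces the "hidden" vectors discussed above, and the decoder (driven by generate) turns them back into readable text.

```python
# Sketch of the encoder/decoder roles; checkpoint name is an illustrative choice.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

text = "Large language models are pre-trained on huge text corpora and then adapted to downstream tasks."
inputs = tokenizer(text, return_tensors="pt")

# Encoder: input tokens -> high-dimensional "hidden" vectors.
hidden = model.get_encoder()(**inputs).last_hidden_state
print(hidden.shape)  # (batch, sequence_length, hidden_size)

# Decoder (via generate): hidden representation -> an intelligible output sequence.
summary_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```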

Using language models in the real world

Fine-tuning

Language modelling is a powerful upstream task – if you have a model that successfully generates language, congratulations – it is an intelligent model. However, the business value of having a model bubbling with random text is limited. Instead, NLP is mostly used for more targeted downstream tasks such as sentiment analysis, question answering and information extraction. This is the time to apply transfer learning and reuse the existing linguistic knowledge for more specific challenges. During fine-tuning, a portion of the model is "frozen" and the rest is further trained with domain- or task-specific data.
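A minimal fine-tuning sketch, assuming PyTorch and the Hugging Face transformers library; the checkpoint and label count are placeholders. The pre-trained encoder body is frozen and only the newly added classification head is trained.

```python
# Sketch: freeze the pre-trained encoder, train only the task head.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g. positive / negative sentiment
)

# Freeze the pre-trained encoder so its linguistic knowledge is reused as-is.
for param in model.bert.parameters():
    param.requires_grad = False

# Only the (randomly initialised) classification head remains trainable.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```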

Explicit fine-tuning adds complexity on the path towards LLM deployment. It can also lead to model explosion, where each business task requires its own fine-tuned model, escalating to an unmaintainable variety of models. So, folks have made an effort to get rid of the fine-tuning step using few- or zero-shot learning (e.g. in GPT-3 [9]). This learning happens on-the-fly during prediction: the model is fed with a "prompt" – a task description and potentially a few training examples – to guide its predictions for future examples.

While much quicker to implement, the convenience factor of zero- or few-shot learning is counterbalanced by its lower prediction quality. Besides, many of these models need to be accessed via cloud APIs. This might be a welcome opportunity at the beginning of your development – however, at more advanced stages, it can turn into another unwanted external dependency.

Picking the right model for your downstream task

Looking at the continuous supply of new language models on the AI market, selecting the right model for a specific downstream task and staying in sync with the state of the art can be tricky.

Research papers normally benchmark each model against specific downstream tasks and datasets. Standardised task suites such as SuperGLUE and BIG-bench allow for unified benchmarking against a multitude of NLP tasks and provide a basis for comparison. Still, we should keep in mind that these tests are prepared in a highly controlled setting. As of today, the generalisation capacity of language models is rather limited – thus, the transfer to real-life datasets might significantly affect model performance. The evaluation and selection of an appropriate model should involve experimentation on data that is as close as possible to the production data.

As a rule of thumb, the pre-training objective provides an important hint: autoregressive models perform well on text generation tasks such as Conversational AI, question answering and text summarisation, while auto-encoders excel at "understanding" and structuring language, for example for sentiment analysis and various information extraction tasks. Models intended for zero-shot learning can theoretically perform all kinds of tasks as long as they receive appropriate prompts – however, their accuracy is generally lower than that of fine-tuned models.

To make things more concrete, the following chart shows how popular NLP tasks are associated with prominent language models in the NLP literature. The associations are computed based on multiple similarity and aggregation metrics, incl. embedding similarity and distance-weighted co-occurrence. Model-task pairs with higher scores, such as BART / Text Summarization and LaMDA / Conversational AI, indicate a good fit based on historical data.

Figure 5: Association strengths between language models and downstream tasks [12]

Key takeaways

In this article, we have covered the basic notions of LLMs and the main dimensions where innovation is happening. The following table provides a summary of the key features for the most popular LLMs:

Table 1: Summary of the features of the most popular Large Language Models

Let’s summarise some general guidelines for the selection and deployment of LLMs:

  1. When evaluating potential models, be clear about where you are in your AI journey:
  • At the beginning, it might be a good idea to experiment with LLMs deployed via cloud APIs.
  • Once you have found product-market fit, consider hosting and maintaining the model on your side to gain more control and further sharpen its performance for your application.
  2. To align with your downstream task, your AI team should create a short-list of models based on the following criteria:
  • Benchmarking results in the academic literature, with a focus on your downstream task
  • Alignment between the pre-training objective and the downstream task: consider auto-encoding for NLU and autoregression for NLG
  • Previous experience reported for this model-task combination (cf. Figure 5)
  3. The short-listed models should then be tested against your real-world task and dataset to get a first feeling for their performance.
  4. In most cases, you are likely to achieve better quality with dedicated fine-tuning. However, consider few-/zero-shot learning if you don’t have the internal tech skills or budget for fine-tuning, or if you need to cover a large number of tasks.
  5. LLM innovations and trends are short-lived. When using language models, keep an eye on their lifecycle and the overall activity in the LLM landscape, and watch out for opportunities to step up your game.

Finally, be aware of the limitations of LLMs. While they have the amazing, human-like capacity to produce language, their overall cognitive power is galaxies away from us humans. The world knowledge and reasoning capacity of these models are strictly limited to the information they find at the surface of language. They also can’t situate facts in time and might provide you with outdated information without blinking an eye. If you are building an application that relies on generating up-to-date or even original knowledge, consider combining your LLM with additional multimodal, structured or dynamic knowledge sources.

References

[1] Victor Sanh et al. 2021. Multitask prompted training enables zero-shot task generalization. CoRR, abs/2110.08207.
[2] Yoshua Bengio et al. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.
[3] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
[4] Kyunghyun Cho et al. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar.
[5] Ashish Vaswani et al. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
[6] Tomas Mikolov et al. 2013. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.
[7] Jay Alammar. 2018. The illustrated transformer.
[8] Alexander Rush et al. 2018. The annotated transformer.
[9] Tom B. Brown et al. 2020. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA. Curran Associates Inc.
[10] Jacob Devlin et al. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.
[11] Julien Simon. 2021. Large Language Models: A New Moore’s Law?
[12] Underlying dataset: more than 320k articles on AI and NLP published 2018–2022 in specialised AI resources, technology blogs and publications by the leading AI think tanks.

All images unless otherwise noted are by the author.

The post Choosing the right language model for your NLP use case appeared first on Towards Data Science.

]]>
Decrease your operational costs by blending the Artificial Intelligence (AI) & Human Intelligence… https://towardsdatascience.com/decrease-your-operational-costs-by-blending-the-artificial-intelligence-ai-human-intelligence-650e815ac785/ Mon, 27 Jun 2022 16:05:51 +0000 https://towardsdatascience.com/decrease-your-operational-costs-by-blending-the-artificial-intelligence-ai-human-intelligence-650e815ac785/ Tighter collaboration of AI-HI decreases the operational costs and time to handle your customers.

The post Decrease your operational costs by blending the Artificial Intelligence (AI) & Human Intelligence… appeared first on Towards Data Science.

]]>
An effective way to improve your bots & agents performance

Decrease your operational costs by blending the Artificial Intelligence (AI) & Human Intelligence (HI)

Photo by Andrea De Santis on Unsplash

Using conversational AI bot technologies is one of the best ways companies can decrease operational costs and build a long-lasting business.

Conversational AI is a branch of machine learning that understands a user’s query and provides responses to resolve it. However, this AI has not yet reached a state where it can solve the complex questions that require the skill, intuition, and empathy of a human. So most companies now realize that, to provide a great customer experience, it is essential to augment conversational AI-based chat or voice bots with human agents.

Having AI at the forefront and human agents as a fallback decreases costs by 30%¹ compared with relying on human agents alone.

If we improve the working relationship between artificial intelligence and human agents, we can resolve customer queries faster, decrease costs even further, and increase revenues.

Bots and agents work independently to resolve customer queries.

Most companies are now familiar with the ‘bots & agents’ setup in resolving customer queries.

A conversational AI-powered bot operates as the first touchpoint, and when it does not understand a query, it transfers the conversation to a human agent. The transfer is usually triggered by limitations in the AI powering the bot, and sometimes business rules are set up to route to a human agent automatically. These rules are typically defined for queries that generate high revenue or that need context from multiple systems that the AI cannot infer. The rules can be much more complex, and vendors offer dedicated tools to configure them; a simplified routing sketch is shown below.
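To make this concrete, here is a hypothetical sketch of such a routing rule in Python. The intent names, the confidence threshold and the `route` function are invented for illustration and are not tied to any particular bot platform.

```python
# Hypothetical bot-to-agent routing rule; values are illustrative only.
HIGH_VALUE_INTENTS = {"mortgage_switch", "large_purchase"}
CONFIDENCE_THRESHOLD = 0.6

def route(intent: str, confidence: float) -> str:
    """Decide whether the bot keeps the conversation or hands it to a human."""
    if confidence < CONFIDENCE_THRESHOLD:
        return "human_agent"   # the AI does not understand the query well enough
    if intent in HIGH_VALUE_INTENTS:
        return "human_agent"   # business rule: high-revenue queries go to humans
    return "bot"

print(route("order_status", 0.92))      # -> bot
print(route("mortgage_switch", 0.95))   # -> human_agent
```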

Bots can handle an unlimited number of conversations, but a human agent typically handles about 6–10 conversations at a time in the chat world and only 1 or 2 in the voice world. So there is usually a wait time for the user when a transfer is triggered. As chat interactions are more cost-effective, most voice bot-enabled organizations prefer to transfer to chat-enabled human agents.

Upon transfer, the information collected by the bot is passed to the agent so the conversation can continue where it left off. This information comes not only from the user’s interactions with the bot but also from the channel, the user’s profile, and the history of their interactions.

As we can see in these interactions, the bot and the agent work independently for the most part. There can be much greater cost savings if we bring in tighter collaboration between the bot and the human agent.

Many top chat and voice bot providers are experimenting with tighter collaboration between bots and human agents. Here are a few things most bot providers are working on to bring in better collaboration and save more costs for businesses.

These are a few ways one could bring in tighter collaboration between artificial intelligence and human intelligence.

Bots Assist human agents when they are resolving the user’s query

Companies want human agents to handle high revenue-generating queries for various reasons. In these situations, bots can assist the agents in many ways:

  • Auto-populate the responses
  • Show the context based on the conversation history
  • Suggest knowledge articles where to find the responses
  • Auto-tune the text for more empathy, etc.

For example, when a customer buys a painting, the bot can assist the agent by showing the various discounts applicable for closing the sale.

Human agents disambiguate when the bot is confused

When the bot is confused by a user question, the agent can disambiguate the specific query and guide the bot to complete the conversation. For example, when the user types a long sentence that the bot is not trained to handle, the human agent can quickly disambiguate the query and put the bot back on track. Having the human agent disambiguate the question and pass it back to the bot to handle the rest of the conversation yields higher cost savings. Also, the user does not have to wait in line for their query to be resolved.

Bot to handle the mundane or straightforward tasks

After the conversation is transferred to the agent, many mundane tasks, such as collecting payment information, address and basic profile details, can be passed back to the bot, with the agent taking over as soon as this information is collected.

This process of collecting mundane information not only helps the agent resolve the customer query faster but also helps auto-store the information so that it can be leveraged later.

Auto-tune the agent conversations

One possible futuristic solution: while the agent is resolving the query, the bot can auto-tune the agent’s responses to match the user’s conversation style. This auto-tuning helps provide a personalized experience that could increase users’ engagement with the brand.

This is a new-age technology that is still in the research stage, and it might take a few years for conversational AI-based bot providers to adopt it.

Summary

Human intelligence is still needed to handle complex queries, even though conversational AI bot technology has improved drastically over the last decade. Most applications have a bot that helps solve user queries and transfers to a human agent when the bot cannot handle the query. However, the performance of these applications becomes much better when artificial intelligence and human intelligence collaborate more tightly.


[1] Danica Jovic, The Future is Now – 37 Fascinating Chatbot Statistics (2022), Smallbizgenius.net

The post Decrease your operational costs by blending the Artificial Intelligence (AI) & Human Intelligence… appeared first on Towards Data Science.

]]>
Developing a Conversational AI Program https://towardsdatascience.com/developing-a-conversational-ai-program-db94bb89e37f/ Sat, 17 Jul 2021 03:46:16 +0000 https://towardsdatascience.com/developing-a-conversational-ai-program-db94bb89e37f/ How to avoid common pitfalls and unlock customer and business value

The post Developing a Conversational AI Program appeared first on Towards Data Science.

]]>
How to avoid common pitfalls and unlock customer and business value

_By Rachel Bimbi, Product Manager at EPAM Systems, Inc._

Credit: EPAM Systems

Conversational AI technologies have evolved rapidly in the last decade, with chatbots, virtual agents, voice assistants and conversational user interfaces now part of our daily lives. This explosive transformation toward AI assistance hasn’t come from an individual technological innovation, but rather multiple innovations developed as an assistive layer between our lives and our digital services, whether we’re asking for directions, purchasing online or banking. In fact, IDC predicts global spend on AI will double from 2020 to 2024, growing to more than $110 billion, with retail banking expected to spend the most.

Surprisingly, for all the benefits conversational AI offers, many projects fail as a result of poor discovery done at the beginning, which is why spending time upfront to examine what’s being built and the value it will deliver to customers is critical. With lessons learned in the field instrumental in improving odds for success, the following seven steps can serve as a guide for enterprises in nearly every industry embarking on or advancing an existing conversational AI platform:

1. Identify the Problems You’re Trying to Solve

Just by numbers, the business value for chatbots is often apparent. But numbers alone will not guarantee success. When deciding where to start or what to do next, it’s vital to balance ROI with customer needs. Many banks, for example, will spend months building a system only to find that customers have no interest in what is delivered. There are a variety of ways to test ideas early with customers. With Botmock or Botsociety, for instance, rapid prototypes can be created and put into customers’ hands in a research setting. Many enterprises use the Wizard of Oz approach, which allows a user in a live environment to interact with a real agent acting as a virtual assistant to test hypotheses and validate risky assumptions they’ve made, from using chatbots to switch their mortgage deal to receiving their account balance from their Alexa. In the long run, this can save months of wasted design and development time.

2. Align the Organization on a Conversational AI Vision

To ensure your conversational AI program isn’t viewed as a digital side-project, create a shared vision and ensure it’s a vital pillar of the broader contact strategy. Whether experimenting with a first conversational interface or developing more sophisticated platforms and experience capabilities, it’s important that stakeholders across the business embrace conversational AI as not just an interface type but as a means for achieving larger organizational goals. To ensure the business is aligned around a common vision and organizationally prepared to make that vision real, proactively address these four core areas:

  1. Articulating a realistic yet ambitious future vision
  2. Addressing organizational siloes
  3. Knowing when to pivot
  4. Building future roadmaps with flexibility

3. Think Strategically About Your Conversational Platform Architecture

Beware of the sales pitch from organizations that say implementing this technology is quick and easy. Despite advances in Natural Language Generation (NLG), implementing and training conversational AI is relatively manual and time-consuming without the right approach. A flexible architecture with the right building blocks and a data-driven approach is key to automating as many processes as possible and delivering value fast. Where possible, leverage existing investments and consider the needs of all business units that may want to deploy conversational AI solutions in the future. It is helpful to think about virtual agents as having two distinct architectural phases:

  1. A conversational platform that integrates with critical communication channels and can seamlessly hand over to human agents within those channels. It doesn’t have any integrations into back-end enterprise systems, but it can already deliver significant value.
  2. A conversational platform with authentication that can hook into back-end enterprise systems to unlock end-to-end use cases, such as transactional queries.

It’s important to select the right NLU (IBM Watson, Google Dialogflow or Amazon Alexa, etc.) and, if the platform providing the NLU does not include one that meets the required needs, the right dialogue-building tooling. There is no single best-in-market option, as the best solution will depend on a business’s requirements, broader ecosystem, and technology and cloud providers.

4. Secure the Right Funding and Generate Momentum

While creating a conversational AI program may prove essential, some can be put off by the time and money needed to implement and run a successful one. So, start with a low initial investment, demonstrate the benefits and then scale up to millions of customers if that’s the goal. To help in decision making, answer the following questions:

  • Can this technology help leverage data to understand customers better than before?
  • How might staff benefit from virtual agents?
  • How could conversational AI help boost revenue?
  • Are there key complaint drivers that could be solved with automation?
  • How can this technology be utilized to leverage other forms of AI and automation to really solve business and customer problems?
  • If capacity from human agents is freed up by conversational AI, how can it be reinvested?
  • Can a virtual agent enable the launch or scaling of more channels that customers want?
  • How could a virtual agent help execute proactive marketing or engagement?
  • How could conversational AI help reduce fraud or ensure compliance?

5. Staff the Right Talent: Starting with a Conversational Analyst

One of the most important learnings is that the roles and skillsets needed to deliver great conversational experiences differ from those of web or app teams. Expect to have to build new role profiles and hire externally. The hard part is finding talent with relevant experience in this field when it is in such high demand across the industry. It may be necessary to create new role types that may not currently exist, such as a conversational analyst, who uses machine learning algorithms and natural language processing to study the way your customers speak and uses these insights to train your virtual agent, and a conversational designer, a copywriter/UX designer who creates the conversational flows, writes the dialogue, utilizes rich features (such as quick replies, buttons and carousels) and optimizes the experience over time.

6. Create a Persona for Your Conversational AI Early

The ultimate goal for implementing conversational AI is to create a virtual agent that is a brand ambassador with an engaging persona. Start by thinking about the demographics and psychographics of the typical customer. Use customer personas if available or create them from scratch if not. Then create a backstory as a guide for this conversational AI program. Think about age, gender, ethnicity, family background, experience, job title, likes, dislikes and personality traits. Run customer testing to gain insight into a multitude of factors that go into a customer’s perception of personality including the size, style and layout of the entry point and interface, the use of rich features, the length of messages being sent and the delivery speed of text bubbles to name just a few. Persona is important from an engagement point of view, but it’s also the only way to encourage customers to talk to your Virtual Agent using natural language and unlock the real power of this technology.

7. Optimize Your Virtual Agent Through Agile Design and Delivery

To deliver a successful conversational AI solution, adopt an agile mindset and embrace design thinking. Many conversational AI teams are still heavily reliant upon process mapping tools, like Visio or Lucid Chart, to create designs. Instead, opt for designing in a no-code, rapid prototyping conversation design tool. This allows designers to create mock-ups quickly and even interact with prototypes using natural language. The most powerful benefit of this is the ability to test the virtual assistant with real customers in hours and shortcut learnings, totally independent from the development team.

Once launched, keep monitoring and make improvements as necessary so that customers’ needs are met – and exceeded. There is no doubt that today’s organizations have the opportunity to be at the forefront of the next wave of transformation, seamlessly integrating conversational AI to solve the complex challenge of serving and assisting time-challenged customers at every step of their journey. Of course, conversational AI is not the solution for everything, but there are almost certainly quick wins to be gained by identifying customer interactions that will deliver maximum value with the lowest effort.

The post Developing a Conversational AI Program appeared first on Towards Data Science.

]]>
Breakthroughs in speech recognition achieved with the use of transformers https://towardsdatascience.com/breakthroughs-in-speech-recognition-achieved-with-the-use-of-transformers-6aa7c5f8cb02/ Wed, 10 Mar 2021 14:53:38 +0000 https://towardsdatascience.com/breakthroughs-in-speech-recognition-achieved-with-the-use-of-transformers-6aa7c5f8cb02/ Let's talk about key breakthroughs that have occurred in speech recognition thanks to transformers!

The post Breakthroughs in speech recognition achieved with the use of transformers appeared first on Towards Data Science.

]]>
Thoughts and Theory

Breakthroughs in Speech Recognition Achieved with the Use of Transformers

Let’s talk about key breakthroughs that have occurred in speech recognition thanks to transformers!

Image by Author

Dear reader. Let’s talk about something that cannot but excite your imagination – let’s talk about key breakthroughs that have occurred in speech recognition thanks to transformers.

This article provides an overview of the main tricks used in Transformer-based architectures for speech recognition. Every particularly exciting idea is highlighted in bold. Along the way, there will be many links that will allow you to dig deeper into the details of the described techniques. At the end of the article, you will find benchmarks of Transformer-based speech recognition models.

A bit about speech recognition

Developers use speech recognition to create user experiences for a variety of products. Smart voice AI assistants, call center agent enhancement and conversational voice AI are just a few of the most common uses. Analysts like Gartner expect the use of speech to text (STT) to only increase in the next decade.

The task of speech recognition (speech-to-text, STT) is seemingly simple – to convert a speech (voice) signal into text data.

There are many approaches to solving this problem, and new breakthrough techniques are constantly emerging. To date, the most successful approaches can be divided into hybrid and end-to-end solutions.

In hybrid approaches to STT, the recognition system consists of several components, usually an acoustic machine learning model, a pronunciation ML model, and a language ML model. The training of individual components is performed independently, and a decoding graph is built for inference, in which the search for the best transcription is performed.

End-to-end approaches are systems whose parts are all trained together. At inference time, such systems often return text directly. End-to-end approaches are categorized according to learning criteria and architecture type.

It is interesting that transformer-based solutions not only found application in both hybrid and end-to-end systems but also turned out to be better than many other modern solutions!

A bit about transformers

The Transformer architecture appeared in 2017 in the following paper [1] to solve the problem of machine translation. There are awesome papers that explain in detail how this architecture works – check out these two (1. 2).

Image by Author

Later on, there was a boom in NLP: transformer architectures evolved, the range of tasks they solved increased, and the results of transformer-based solutions pulled further and further ahead.

Having taken over NLP, transformers have been introduced into other machine learning areas: speech recognition, speech synthesis, computer vision, and so on.

Now let’s get to the point.

Speech-Transformer

The first mentions of the transformer in speech recognition date back to 2018, when a group of Chinese scientists published a research paper [2].

Image by Author

The changes to the architecture are minimal – convolutional neural network (CNN) layers have been added before the features are fed into the transformer. This makes it possible to reduce the mismatch between the lengths of the input and output sequences (the number of frames in an audio clip is significantly higher than the number of tokens in its transcript), which has a beneficial effect on training.

Even though the results were not dizzying, this work confirmed that transformers can indeed be successfully used in speech recognition!
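As a rough sketch (not the paper's exact configuration), such a convolutional front-end can be written in PyTorch as follows; the layer sizes, the 80-bin mel input and the downsampling factor are illustrative assumptions.

```python
# Sketch: a convolutional front-end that downsamples the time axis
# before the transformer encoder. Dimensions are illustrative.
import torch
import torch.nn as nn

class ConvFrontend(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        # Two stride-2 convolutions shrink the time axis by a factor of 4.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # 80 mel bins -> 20 after two stride-2 convs; project to d_model.
        self.proj = nn.Linear(32 * 20, d_model)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, time, 80) log-mel features
        x = self.conv(spec.unsqueeze(1))            # (batch, 32, time/4, 20)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.proj(x)                         # (batch, time/4, d_model)

frontend = ConvFrontend()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=4,
)
spec = torch.randn(2, 400, 80)        # 2 utterances, 400 frames, 80 mel bins
print(encoder(frontend(spec)).shape)  # torch.Size([2, 100, 256])
```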

First improvements

In 2019 there were several key Speech-Transformer improvements in different directions:

  • the authors of this paper [3] proposed a way to integrate CTC loss into Speech-Transformer. CTC loss has been used in speech recognition for a long time and has several advantages (a minimal usage sketch follows this list).

First, it allows us to take into account the correspondence of specific audio frames to specific transcription characters, due to the allowable alignments using the blank character.

Secondly, and this is the second improvement of Speech-Transformer, it simplifies the integration of the language model into the learning process.

  • moving away from sinusoidal positional encoding (PE). The problems associated with long sequences are more acute in speech recognition. This happened in different ways – some papers switched from absolute positional encoding to relative PE [4], others replaced PE with pooling layers [5], and others replaced positional encoding with trainable convolution layers [6]. Much later work has confirmed the superiority of these techniques over sinusoidal PE.
  • the first adaptations of the transformer for streaming recognition. The authors of [5] and [7] did this in two stages – first, they adapted the encoder so that it could receive its input in blocks while preserving the global context, and then they used the Monotonic Chunkwise Attention (MoChA) technique for online decoding.
  • using only the Encoder blocks of the transformer. Some systems (for example, hybrid approaches or transducer-based solutions) require the acoustic model to work purely as an encoder. This technique made it possible to use transformers in hybrid systems [8] and in transducer-based recognition systems [9].
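As promised above, here is a minimal illustration of the CTC objective (not the paper's training setup): PyTorch's built-in `nn.CTCLoss` applied directly to encoder outputs. The vocabulary size, batch size and sequence lengths are made up.

```python
# Sketch: CTC loss on top of encoder log-probabilities; shapes are illustrative.
import torch
import torch.nn as nn

vocab_size = 32            # characters + 1 reserved blank symbol (index 0)
T, B, U = 100, 2, 12       # encoder frames, batch size, max transcript length

log_probs = torch.randn(T, B, vocab_size).log_softmax(dim=-1)  # (time, batch, vocab)
targets = torch.randint(1, vocab_size, (B, U))                 # character ids, no blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.tensor([12, 9])

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```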

In October 2019, a research paper [10] carried out an extensive comparison of transformers with other approaches based on the ESPNet framework, which confirmed the recognition quality of transformer-based models: in 13 out of 15 tasks, the transformer-based architecture turned out to be better than recurrent systems.

Hybrid Speech Recognition with Transformers

In late 2019 – early 2020, transformers achieved SOTA results in hybrid speech recognition (as seen in [8]).

As mentioned earlier, one of the components of the hybrid approach is the acoustic model, which today uses neural networks. The acoustic model in this paper consists of several layers of the transformer encoder. A diagram of one such layer is shown in Figure 3.

Image by Author

Among the most interesting things in this work, I would like to highlight that the authors again demonstrate the advantage of trainable convolutional (namely, VGG-like) embeddings compared to sinusoidal PE. They also use an iterated loss to improve convergence when training deep transformers. The topic of deep transformers will be discussed further on.

Transformer Transducer

More precisely, two Transformer Transducers – one from Facebook [9] and one from Google [11] – appeared at the end of 2019 and in the first half of 2020. Formally, the Facebook model is called Transformer-Transducer (with a hyphen), but the essence of both works is the same: the integration of the transformer into the RNN-Transducer architecture.

Image by Author

The integration does not involve the entire transformer, only its encoder, which serves as the audio encoder in the RNN-T framework. In [11], the predictor network is also transformer-based, although with fewer layers – this component has to be called frequently during inference, so there is no need for a more complex architecture.

RNN-T loss, unlike CTC loss, conditions the output probabilities not only on the input sequence but also on the previously predicted labels. In addition, one of the advantages of the Transformer Transducer architecture is that this approach is much easier to adapt for streaming recognition, because only the encoder part of the transformer is used.

In the summer of 2020, another paper [12], named Conv-Transformer Transducer, was published, in which the audio encoder consists of three blocks, each containing convolution layers followed by transformer layers. And in the fall of 2020, in [13] (a continuation of [11]), the authors proposed the variable context layers technique, which allows training a single model capable of using a variable amount of future context, providing a latency/quality trade-off at inference time.

Local & global context

One of the strengths of transformer-based architectures is their high efficiency in taking the global context into account. In the audio signal, however, local connections play a greater role than global ones. In the summer of 2020, several works were published that address these aspects and once again pushed transformer-based models ahead:

  • the authors of [14] proposed to change the architecture of the transformer block by adding a Convolution module after the Multi-Head Attention (MHA) block. Convolutions are better at capturing local information, while the transformer is good at extracting global information. The authors named the resulting model the Conformer (a simplified block sketch follows this list). Also, inspired by Macaron-Net, the authors used half-step feed-forward networks.
  • the paper [15] introduced the weak-attention suppression technique: a form of sparse attention in which weights below a dynamically chosen threshold are zeroed out, so that the model scatters its attention less across the entire context and focuses more on meaningful frames.
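The following is a simplified sketch of the Conformer block ordering (half-step FFN, self-attention, convolution module, half-step FFN) in PyTorch. Dimensions, kernel size and normalisation details are illustrative assumptions and do not reproduce the paper's exact implementation.

```python
# Simplified Conformer-style block: local context via depthwise convolutions,
# global context via self-attention, Macaron-style half-step FFNs.
import torch
import torch.nn as nn
import torch.nn.functional as F

def half_step_ffn(d_model: int) -> nn.Sequential:
    return nn.Sequential(nn.LayerNorm(d_model),
                         nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                         nn.Linear(4 * d_model, d_model))

class ConvModule(nn.Module):
    """Pointwise + depthwise convolutions that capture local context."""
    def __init__(self, d_model: int, kernel_size: int = 31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)

    def forward(self, x):                        # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)         # -> (batch, d_model, time)
        y = F.glu(self.pointwise1(y), dim=1)
        y = F.silu(self.bn(self.depthwise(y)))
        return self.pointwise2(y).transpose(1, 2)

class ConformerBlockSketch(nn.Module):
    """Half-step FFN -> self-attention -> conv module -> half-step FFN."""
    def __init__(self, d_model: int = 256, nhead: int = 4):
        super().__init__()
        self.ffn1, self.ffn2 = half_step_ffn(d_model), half_step_ffn(d_model)
        self.norm_mha = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.conv = ConvModule(d_model)
        self.norm_out = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)               # Macaron-style half-step FFN
        y = self.norm_mha(x)
        x = x + self.mha(y, y, y, need_weights=False)[0]
        x = x + self.conv(x)                     # local context
        x = x + 0.5 * self.ffn2(x)               # second half-step FFN
        return self.norm_out(x)

block = ConformerBlockSketch()
print(block(torch.randn(2, 100, 256)).shape)     # torch.Size([2, 100, 256])
```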

Streamable Transformers

As noted above, the Transducer approach allows the system to be used for streaming speech recognition, i.e. audio enters the system in real time, is processed immediately, and the system returns responses as soon as they are ready. Streaming recognition is a prerequisite for voice Conversational AI tasks.

However, for the system to be streaming, it is necessary that the transformer model itself be able to process audio sequentially. In the original transformer, the attention mechanism looks at the entire input sequence.

There are the following streaming data processing techniques in Speech Recognition that are used with transformer-based solutions:

  • time-restricted self-attention is used, for example, in [11]. Each transformer layer has a limited forward-looking context (a mask sketch follows this list). The disadvantage of this approach is that latency grows as the number of layers increases, since the total look-ahead context accumulates.
  • block processing – the idea can be seen in [5], [16] and [17]: segments/blocks/chunks are fed as input to the transformer. The disadvantage of this method is that the context is limited to the segment. In order not to lose the global context, it can be passed along as a separate embedding, as in [5], or via architectures with recurrent connections in which embeddings from previous segments are carried over to the current one, as in [16], or by using information from all previously processed segments stored in a memory bank – an approach called augmented memory, proposed in [17].
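As promised above, here is a small sketch of how such a time-restricted (limited look-ahead) attention mask could be built. The sequence length and right-context size are arbitrary, and the boolean convention matches PyTorch's `nn.MultiheadAttention`, where `True` marks positions a query may not attend to.

```python
# Sketch: a time-restricted self-attention mask. Each frame may attend to the
# full past but only `right_context` future frames.
import torch

def time_restricted_mask(seq_len: int, right_context: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    # mask[i][j] is True when position j lies too far in the future of query i.
    return idx.unsqueeze(0) > (idx.unsqueeze(1) + right_context)

mask = time_restricted_mask(seq_len=6, right_context=2)
print(mask.int())
# Row i shows which frames query i may NOT attend to (1 = masked), e.g.:
# [[0, 0, 0, 1, 1, 1],
#  [0, 0, 0, 0, 1, 1],
#  [0, 0, 0, 0, 0, 1], ...]
# The mask can be passed as attn_mask to nn.MultiheadAttention.
```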

Emformer

In the following research paper [18], a model that is suitable for streaming recognition is presented, both in a hybrid setup and in a transducer system.

Emformer continues to develop the idea presented in [17]. Like its predecessor, Emformer uses augmented memory. It adds computational optimizations and caching of computations, takes the memory bank from the previous transformer layer rather than the current one, and introduces GPU parallelization.

As a result, it was possible to achieve significant acceleration of system training and a reduction in inference time. In addition, the model converges better as a result of fewer useless computations.

Unsupervised speech representation learning

Another area in which Transformers have found successful use is the construction of high-level audio representations based on unlabeled data, on which even a simple model will produce good results.

Here I would like to note a number of works – Mockingjay [19], Speech-XLNet [20], Audio ALBERT [21], TERA [22], and especially wav2vec 2.0 [23].

One of the ideas for constructing such a representation is to corrupt the spectrogram (by masking it along the time axis, as in Mockingjay and Audio ALBERT, along both the time and frequency axes, as in TERA, or by shuffling some frames, as in Speech-XLNet) and train the model to restore it. The latent representation of such a model can then be used as the high-level representation. The transformer – or rather its encoder plus additional modules before and after it – acts as the model here.
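A minimal sketch of this corrupt-and-reconstruct idea follows; the mask position, mask width and L1 reconstruction loss are chosen purely for illustration.

```python
# Sketch: mask a span of spectrogram frames along the time axis and train a
# model to reconstruct the hidden frames. Values are illustrative.
import torch

spec = torch.randn(4, 400, 80)          # (batch, time, mel bins)
masked = spec.clone()
start, width = 120, 20                  # e.g. a randomly chosen time span
masked[:, start:start + width, :] = 0.0 # hide 20 consecutive frames

# A pre-training model (e.g. a transformer encoder) would take `masked` as
# input and be trained to reconstruct the original frames, for instance with:
# loss = torch.nn.functional.l1_loss(model(masked)[:, start:start + width],
#                                    spec[:, start:start + width])
```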

Image by Author

The resulting representations can be used for downstream tasks. Moreover, the model weights can either be frozen or fine-tuned for the downstream task.

Another idea is implemented in wav2vec 2.0. It is the continuation of vq-wav2vec [24].

First, latent representations are built from the audio signal using convolutional neural network layers. These latent representations are fed to the transformer input and are also used to construct discrete (quantized) representations. Some of the frames at the transformer input are masked, and the transformer is trained to identify the correct quantized representation for each masked frame by means of a contrastive loss. Unlike vq-wav2vec, the learning of the discrete and latent representations now happens jointly (end-to-end).

Image by Author

In [25], the authors used the idea of wav2vec pre-training in conjunction with the Conformer architecture. The authors have used LibriLight data for pre-training and obtained SOTA on the LibriSpeech corpus at the time of this writing.

Large-scale Settings

Most scientific publications report results for models trained on small corpora of roughly 1,000 hours, such as LibriSpeech.

Nevertheless, there are studies, such as [26] and [27], which show that transformer-based models retain their advantage even on large amounts of data.

Conclusion

This article examined the techniques that are encountered when using transformer-based models in speech recognition.

Of course, not all papers related to transformers in the field of speech recognition are reflected here (the number of works on transformers in STT is growing exponentially!), but I tried to collect the most interesting ideas for you.

And finally – WER results of Transformer-based models on the LibriSpeech corpus:

Image by Author

References

  1. A.Vaswani et al., 2017, "Attention Is All You Need", https://arxiv.org/abs/1706.03762
  2. L.Dong et al., 2018 "Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition", https://ieeexplore.ieee.org/document/8462506
  3. S.Karita et al., 2019, "Improving Transformer-based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration", https://pdfs.semanticscholar.org/ffe1/416bcfde82f567dd280975bebcfeb4892298.pdf
  4. P.Zhou at al., 2019, "Improving Generalization of Transformer for Speech Recognition with Parallel Schedule Sampling and Relative Positional Embedding", https://arxiv.org/abs/1911.00203
  5. E.Tsunoo et al., 2019, "Transformer ASR with Contextual Block Processing", https://arxiv.org/abs/1910.07204
  6. A. Mohamed et al., 2019, "Transformers with convolutional context for ASR", https://arxiv.org/abs/1904.11660
  7. E.Tsunoo et al., 2019, "Towards Online End-to-end Transformer Automatic Speech Recognition", https://arxiv.org/abs/1910.11871
  8. Y.Wang et al., 2019, "Transformer-based Acoustic Modeling for Hybrid Speech Recognition", https://arxiv.org/abs/1910.09799
  9. C.Yeh et al., 2019, "Transformer-Transducer: End-to-End Speech Recognition with Self-Attention", https://arxiv.org/abs/1910.12977
  10. S.Karita et al., 2019, "A Comparative Study on Transformer vs RNN in Speech Applications", https://arxiv.org/abs/1909.06317
  11. Q.Zhang et al., 2020, "Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss", https://arxiv.org/abs/2002.02562
  12. W.Huang et al., 2020, "Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition", https://arxiv.org/abs/2008.05750
  13. A.Tripathi et al., 2020, "Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition", https://arxiv.org/abs/2010.03192
  14. A.Gulati et al., 2020, "Conformer: Convolution-augmented Transformer for Speech Recognition", https://arxiv.org/abs/2005.08100
  15. Y.Shi et al., 2020, "Weak-Attention Suppression For Transformer Based Speech Recognition", https://arxiv.org/abs/2005.09137
  16. Z.Tian et al., 2020, "Synchronous Transformers for End-to-End Speech Recognition", https://arxiv.org/abs/1912.02958
  17. C.Wu et al., 2020, "Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory", https://arxiv.org/abs/2005.08042
  18. Y.Shi et al., 2020, "Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition", https://arxiv.org/abs/2010.10759v3
  19. A.T.Liu et al., 2019, "Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders", https://arxiv.org/abs/1910.12638
  20. X.Song et al., 2020, "Speech-XLNet: Unsupervised Acoustic Model Pretraining For Self-Attention Networks", https://arxiv.org/abs/1910.10387
  21. P.Chi et al., 2020, "Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation", https://arxiv.org/abs/2005.08575
  22. A.T.Liu et al., 2020, "TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech", https://arxiv.org/abs/2007.06028
  23. A.Baevski et al., 2020, "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations", https://arxiv.org/abs/2006.11477
  24. A.Baevski et al., 2020 "vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations", https://arxiv.org/abs/1910.05453
  25. Y.Zhang et al., 2020, "Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition", https://arxiv.org/abs/2010.10504
  26. L.Lu et al., 2020, "Exploring Transformers for Large-Scale Speech Recognition", https://arxiv.org/abs/2005.09684
  27. Y.Wang et al., 2020, "Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications", https://arxiv.org/abs/2010.14665

The post Breakthroughs in speech recognition achieved with the use of transformers appeared first on Towards Data Science.

]]>