Overcoming Automatic Speech Recognition Challenges: The Next Frontier

Advancements, Opportunities, and Impacts of Automatic Speech Recognition Technology in Various Domains

TL;DR:

This post focuses on the advancements in Automatic Speech Recognition (ASR) technology and its impact on various domains. ASR has become prevalent in multiple industries, with improved accuracy driven by scaling model size and constructing larger labeled and unlabeled training datasets.

Looking ahead, ASR technology is expected to continue improving with the scaling of the acoustic model size and the enhancement of the internal language model. Additionally, self-supervised and multi-task training techniques will enable low-resource languages to benefit from ASR technology, while multilingual training will boost performance even further, allowing for basic usage such as voice commands in many low-resource languages.

ASR will also play a significant role in Generative AI, as interaction with avatars will be via an audio/text interface. With the emergence of textless NLP, some end-tasks, such as speech-2-speech translation, may be solved without using any explicit ASR model. Multimodal models that can be prompted using text, audio, or both will be released and generate text or synthesize audio as an output.

Furthermore, open-ended dialogue systems with voice-based human-machine interfaces will improve robustness to transcription errors and differences between written and spoken forms. This will provide robustness to challenging accents and children’s speech, enabling ASR technology to become an essential tool for many applications.

An end-to-end speech enhancement-ASR-diarization system is set to be released, enabling the personalization of ASR models and improving performance on overlapped speech and challenging acoustic scenarios. This is a significant step towards solving ASR technology’s challenges in real-world scenarios.

Lastly, a wave of speech APIs is expected. Still, there are opportunities for small startups to outperform big tech companies in domains with more legal or regulatory restrictions on technology use and data acquisition, and in populations with low technology adoption rates.

2022 In Review

Automatic Speech Recognition (ASR) technology is gaining momentum across various industries such as education, podcasts, social media, telemedicine, call centers, and more. A great example is the growing prevalence of voice-based human-machine interface (HMI) in consumer products, such as smart cars, smart homes, smart assistive technology [1], smartphones, and even artificial intelligence (AI) assistants in hotels [2]. In order to meet the increasing demand for fast and accurate responses, low-latency ASR models have been deployed for tasks like keyword spotting [3], endpointing [4], and transcription [5]. Speaker-attributed ASR models [6–7] are also gaining attention as they enable product personalization, providing greater value to end-users.

Prevalence of Data. Streaming audio and video platforms such as social media and YouTube have led to the easy acquisition of unlabeled audio data [8]. New self-supervised techniques have been introduced to utilize this audio without needing ground truth [9–10]. These techniques improve the performance of ASR systems in the target domain, even without fine-tuning on labeled data for that domain [11]. Another approach gaining attention due to its ability to utilize this unlabeled data is self-training using pseudo-labeling [12–13]. The main concept is to automatically transcribe unlabeled audio data using an ASR system and then use the generated transcription as ground truth for training a different ASR system in a supervised fashion. OpenAI took a different approach, assuming they could find human-generated transcripts at scale online. They generated a high-quality, large-scale (680K hours) training dataset by crawling publicly available audio data with human-generated subtitles. Using this dataset, they trained an ASR model (a.k.a. Whisper) in a fully supervised manner, achieving state-of-the-art (SoTA) results on several benchmarks in zero-shot settings [14].
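The self-training recipe described above can be sketched in a few lines. This is a minimal illustration of the general idea rather than the exact procedure of [12–13]; `teacher_asr`, `student_asr`, and their methods are hypothetical placeholders.

```python
# Minimal self-training sketch: a teacher ASR model pseudo-labels unlabeled audio,
# and a student ASR model is trained on the resulting (audio, transcript) pairs.
# `teacher_asr`, `student_asr`, and their methods are hypothetical placeholders.

def pseudo_label(teacher_asr, unlabeled_audio, confidence_threshold=0.9):
    """Transcribe unlabeled audio and keep only confident hypotheses."""
    pseudo_labeled = []
    for audio in unlabeled_audio:
        hypothesis, confidence = teacher_asr.transcribe_with_confidence(audio)
        if confidence >= confidence_threshold:   # drop noisy pseudo-labels
            pseudo_labeled.append((audio, hypothesis))
    return pseudo_labeled

def self_training_round(teacher_asr, student_asr, labeled_data, unlabeled_audio):
    """One round of self-training: supervised data plus filtered pseudo-labels."""
    pseudo_labeled = pseudo_label(teacher_asr, unlabeled_audio)
    student_asr.train(labeled_data + pseudo_labeled)  # standard supervised training
    return student_asr
```

In practice this loop is often repeated, with the improved student becoming the next round's teacher.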

Losses. Although end-to-end (E2E) losses dominate SoTA ASR models [15–17], new losses are still being published. A technique called the hybrid autoregressive transducer (HAT) [18] has been introduced, making it possible to measure the quality of the internal language model (ILM) by separating the blank and label posteriors. Later work [19] used this factorization to effectively adapt the ILM using only textual data, which improved the overall performance of ASR systems, particularly the transcription of named entities, slang terms, and nouns, which are major pain points for ASR systems. New metrics have also been developed to better align with human perception and overcome the semantic shortcomings of word error rate (WER) [20].
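For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference transcript, normalized by the number of reference words; its blindness to meaning is exactly what the semantic metrics above try to address. A minimal implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] is the edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.33
```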

Architecture Choice. Regarding the acoustic model's architectural choices, the Conformer [21] remained the preferred choice for streaming models, while the Transformer [22] is the default architecture for non-streaming models. For the latter, encoder-only (wav2vec2-based [23–24]) and encoder-decoder (Whisper [14]) multilingual models were introduced and improved on SoTA results across several benchmarks in zero-shot settings. These models outperform their streaming counterparts thanks to their model size, training data size, and larger context.

Multilingual AI Developments from Tech Giants. Google has announced its "1,000 Languages Initiative" to build an AI model that supports the 1,000 most spoken languages [25], while Meta AI has announced its long-term effort to build language and machine translation (MT) tools that include most of the world’s languages [26].

Spoken Language Breakthrough. Multi-modal (speech/text) and multi-task pre-trained seq-2-seq (encoder-decoder) models such as SpeechT5 [27] were released, showing great success on a wide variety of spoken language processing tasks, including ASR, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.

These advancements in ASR technology are expected to drive further innovation and impact a wide range of industries in the years to come.

A Look Ahead

Despite its challenges, the field of Automatic Speech Recognition (ASR) is expected to make significant advancements in various domains, ranging from acoustic and semantic modeling to conversational and generative AI, and even speaker-attributed ASR. This section provides detailed insights into these areas and shares my predictions for the future of ASR technology.

Photo by Nik on Unsplash

General Improvements:

Improvements to ASR systems are expected on both the acoustic and semantic fronts.

On the acoustic model side, larger model and training data sizes are anticipated to enhance the overall performance of ASR systems, similar to the progress observed in LLMs. Although scaling Transformer encoders such as Wav2Vec or Conformer poses a challenge, a breakthrough is expected to enable their scaling, or the field will shift towards encoder-decoder architectures as in Whisper. However, encoder-decoder architectures have drawbacks that need to be addressed, such as hallucinations. Optimizations such as faster-whisper [28] and NVIDIA's wav2vec2 implementation [29] will reduce training and inference time, lowering the barrier to deploying large ASR models.
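As a rough illustration of how low that deployment barrier has become, transcribing a file with the open-source whisper package takes only a few lines. The API is shown as of the time of writing; the model size and audio path are placeholders, and faster-whisper exposes a similar interface with lower latency.

```python
# Minimal transcription sketch with the open-source `whisper` package
# (pip install openai-whisper). "meeting.wav" and the model size are placeholders.
import whisper

model = whisper.load_model("medium")      # larger checkpoints trade speed for accuracy
result = model.transcribe("meeting.wav")  # language detection + decoding in one call
print(result["text"])
```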

On the semantic side, researchers will focus on improving ASR models by incorporating larger acoustic or textual contexts. Injecting large-scale unpaired text into the ILM during E2E training, as in JEIT [30], will also be explored. These efforts will help to overcome key challenges such as accurately transcribing named entities, slang terms, and nouns.

Although Whisper and Google’s universal speech model (USM) [31] have improved ASR performance over several benchmarks, some benchmarks remain unsolved, with the WER staying around 20% [32]. Using speech foundation models, adding more diverse training data, and applying multi-task learning will significantly improve performance in such scenarios, opening up new business opportunities. Moreover, new metrics and benchmarks are expected to emerge to better align with new end-tasks and domains, such as non-lexical conversational sounds [33] in the medical domain and filler word detection and classification [34] in media editing and educational domains. Task-specific fine-tuned models may be developed for this purpose. Finally, with the growth of multi-modality, more models, training datasets, and new benchmarks for several tasks are also expected to be released [35–36].

As progress continues, a wave of speech APIs is expected, similar to natural language processing (NLP). Google’s USM, OpenAI’s Whisper, and Assembly’s Conformer-1 [37] are some of the early examples.

Although it may sound trivial, forced alignment is still challenging for many companies. Open-source code for it could help many achieve accurate alignment between audio segments and their corresponding transcripts.

Low-Resource Languages:

Advancements in self-supervised learning, multi-task learning, and multilingual models are expected to significantly improve performance on low-resource and unwritten languages. These methods will achieve acceptable performance by utilizing pre-trained models and fine-tuning on a relatively small number of labeled samples [24]. Another promising approach is dual learning [38], a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks (text-to-speech (TTS) and ASR in our case) at once. In this method, each model produces pseudo-labels for unlabeled examples, which are used to train the other model.
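A rough sketch of that dual-learning loop, where each model supervises the other on unpaired data, is shown below. The `asr_model` and `tts_model` objects and their methods are hypothetical placeholders, not the actual recipe of [38].

```python
# Dual-learning sketch: ASR and TTS bootstrap each other on unpaired data.
# `asr_model`, `tts_model`, and their methods are hypothetical placeholders.

def dual_learning_step(asr_model, tts_model, unpaired_audio, unpaired_text):
    # ASR pseudo-labels raw audio -> (audio, text) pairs that supervise TTS.
    audio_pairs = [(audio, asr_model.transcribe(audio)) for audio in unpaired_audio]
    # TTS synthesizes audio from raw text -> (audio, text) pairs that supervise ASR.
    text_pairs = [(tts_model.synthesize(text), text) for text in unpaired_text]

    tts_model.train(audio_pairs)   # text targets come from the ASR model
    asr_model.train(text_pairs)    # transcript targets are the original text
    return asr_model, tts_model
```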

Additionally, improving the ILM using unpaired text can enhance model robustness, which will be especially advantageous for closed-set challenges such as voice commands. Performance will be acceptable but not flawless in some applications, such as captioning YouTube videos, while in others, such as generating verbatim transcripts in court, it may take more time for models to meet the threshold. We anticipate that companies will gather data based on these models while manually correcting transcripts in 2023, and that we will see significant improvements in low-resource languages after fine-tuning on proprietary data in 2024.

Generative AI:

The use of avatars is expected to revolutionize human interaction with digital assets. In the short term, ASR will serve as one of the foundations of Generative AI, as these avatars will communicate through a textual/auditory interface.

But in the future, changes could occur as attention shifts towards new research directions. For example, an emerging technology that is likely to be adopted is textless NLP, which represents a new language-modeling approach to audio generation [39]. This approach uses learnable discrete audio units [40] and auto-regressively generates the next discrete audio unit one unit at a time, similar to text generation. These discrete units can later be decoded back to the audio domain. Thus far, this technology has been able to generate syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers, as can be seen in GSLM/AudioLM [39, 41]. The potential of this technology is enormous, as one can skip the ASR component (and its errors) in many tasks. For example, traditional speech-2-speech (S2S) translation methods work as follows: they transcribe the utterance in the source language, then translate the text to the target language using a machine translation model, and finally generate the audio in the target language using a TTS engine. Using textless-NLP technology, S2S translation can be done with a single encoder-decoder architecture that works directly on discrete audio units without using any explicit ASR model [42]. We predict that future textless NLP models will solve many other tasks without going through explicit transcription, such as question answering. However, the main drawback of this method is in tracing errors and debugging, as things get less intuitive when working in the discrete-unit space rather than on the transcription.
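The textless-NLP pipeline can be pictured as three stages: encode audio into discrete units, continue the unit sequence autoregressively, and decode the units back to a waveform. The sketch below is purely conceptual; the `unit_encoder`, `unit_lm`, and `unit_vocoder` objects are hypothetical placeholders, not the GSLM/AudioLM APIs.

```python
# Conceptual sketch of GSLM-style "textless" generation [39]:
# audio -> discrete units -> autoregressive continuation -> audio.
# All three components are hypothetical placeholders, not a real library API.

def continue_speech(prompt_audio, unit_encoder, unit_lm, unit_vocoder, max_new_units=200):
    units = unit_encoder.encode(prompt_audio)      # e.g., quantized self-supervised codes
    for _ in range(max_new_units):
        next_unit = unit_lm.predict_next(units)    # autoregressive, one unit at a time
        if next_unit == unit_lm.eos_id:
            break
        units.append(next_unit)
    return unit_vocoder.decode(units)              # synthesize the waveform back
```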

T5 [43] and T0 [44] showed great success in NLP by utilizing multi-task training and demonstrating zero-shot task generalization. In 2021, SpeechT5 [27] was published, showing great success on a wide variety of spoken language processing tasks. Earlier this year, VALL-E [45] and VALL-E X [46] were released. They showed impressive in-context learning capabilities for TTS models by using textless NLP technology, enabling cloning of a speaker's voice from only a few seconds of their audio, without requiring any fine-tuning, even in cross-lingual settings.

By combining concepts from SpeechT5 and VALL-E, we can expect the release of T0-like models that can be prompted using either text, audio, or both, and that generate text or synthesize audio as output, depending on the task. A new era of models will begin, as in-context learning will enable generalization to new tasks in zero-shot settings. This will allow semantic search over audio and transcribing a target speaker using speaker-attributed ASR or a free-text description, e.g., "What did the young kid who coughed say?". Furthermore, it will enable us to classify or synthesize audio using audio or a textual description and to solve NLP tasks directly from audio using explicit/implicit ASR.

Conversational AI:

Conversational AI has been adopted mainly through task-oriented dialogue systems, namely AI personal assistants (PAs) such as Amazon's Alexa and Apple's Siri. These PAs have become popular due to their ability to provide quick access to features and information through voice commands. As big tech companies dominate this technology, new regulations on AI assistants will force them to offer third-party options for voice assistants, opening up competition [47]. As this happens, we can expect interoperability between personal assistants, meaning they will start communicating with one another. This will be great, as one will be able to use any device to connect to any conversational agent anywhere in the world [48]. From the ASR perspective, this will pose new challenges, as the contextualization will be much broader, and assistants must be robust to different accents and possibly support multiple languages.

Over the past few years, a great technological leap has happened in text-based open-ended dialogue systems, e.g., BlenderBot and LaMDA [49–50]. Initially, these dialogue systems were text-based, meaning they were fed text and trained to output text, all in the written-form domain. As ASR performance improved, open-ended dialogue systems were augmented with voice-based HMIs, which resulted in misalignment between modalities due to differences between the spoken and written forms. One of the main challenges is to bridge this gap by overcoming the new types of errors introduced by audio-related processing, e.g., differences between spoken and written forms such as disfluencies and entity resolution, and transcription errors such as pronunciation errors [51–52].

Possible solutions can be derived from improved transcription quality and robust NLP models that can effectively handle transcription and pronunciation errors. A reliable acoustic-model confidence score [53] will serve as a key component in these systems, enabling them to point out speaker errors or serving as another input to the NLP model or decoding logic. Furthermore, we expect that ASR models will predict non-verbal cues such as sarcasm, enabling agents to understand the conversation more deeply and provide better responses.

These improvements will make it possible to push dialogue systems with an auditory HMI even further, supporting challenging accents and children's speech, as in Loora [54] and Speaks [55].

Pushing the limits even further, we expect the release of an E2E multi-task learning framework for spoken language tasks that jointly models the speech and NLP problems, as in MTL-SLT [56]. These models will be trained in an E2E fashion, reducing the cumulative error between sequential modules, and will address tasks such as spoken language understanding, spoken summarization, and spoken question answering by taking speech as input and emitting various outputs such as transcriptions, intents, named entities, summaries, and answers to text queries.

Personalization will play a huge factor for AI assistants and open-ended dialogue systems, leading us to the next point: speaker-attributed ASR.

Speaker Attributed ASR:

There is still a challenge in transcribing distant conversations involving multiple microphones and parties in home environments. Even state-of-the-art (SoTA) systems can only achieve around 35% WER [57].

Early versions of joint ASR and diarization systems were released in 2019 [58]. This year, we can expect the release of an end-to-end speech enhancement-ASR-diarization system, which will improve performance on overlapped speech and enable better performance in challenging acoustic scenarios such as reverberant rooms, far-field settings, and low signal-to-noise ratios (SNR). The improvement will be achieved through joint task optimization, improved pre-training methods (such as WavLM [10]), architectural changes [59], data augmentation, and training on in-domain data during pre-training and fine-tuning [11]. Moreover, we can expect the deployment of speaker-attributed ASR systems for personalized speech recognition. This will further improve the transcription accuracy of the target speaker's voice and bias the transcript towards user-defined words, such as contact names, proper nouns, and other named entities, which are crucial for smart assistants [60]. Additionally, low-latency models will continue to be a significant area of focus to enhance edge devices' overall experience and response time [61–62].

The Role of Startups Compared to Big Tech Companies in The ASR Landscape

Although big tech companies are expected to continue dominating the market with their APIs, small startups can still outperform them in specific domains. These include areas that are underrepresented in big tech's training data due to regulations, such as the medical domain and children's speech, and populations that have not yet adopted technology, such as immigrants with challenging accents or individuals learning English worldwide. In markets where there isn't enough demand for big tech companies to invest, such as languages that are not widely spoken, small startups may find opportunities to succeed and generate profit.

To create a win-win situation, big tech companies can provide APIs that offer full access to the output of their acoustic models while allowing others to write the decoding logic (WFST/beam-search) instead of merely adding customizable vocabulary or using current model adaptation features [63–64]. This approach will enable small startups to excel in their domains by incorporating priming or multiple language models during inference on top of the given acoustic model, rather than having to train the acoustic models themselves, which can be costly in terms of human capital and domain knowledge. In turn, big tech companies will benefit from broader adoption of their paid models.
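The kind of decoding-level customization described above is often implemented as shallow fusion, where the acoustic model's score is interpolated with one or more domain language models during beam search. Below is a hedged sketch of that scoring rule; `acoustic_model` and the entries of `domain_lms` are hypothetical placeholders that expose log-probabilities.

```python
# Shallow-fusion sketch: at each beam-search step, the acoustic model's score is
# interpolated with one or more domain language models.
# `acoustic_model` and the entries of `domain_lms` are hypothetical placeholders.

def fused_score(acoustic_model, domain_lms, lm_weights, prefix, candidate, audio_features):
    score = acoustic_model.log_prob(candidate, prefix, audio_features)
    for lm, weight in zip(domain_lms, lm_weights):
        # Each domain LM acts as a prior, e.g., boosting contact names or product terms.
        score += weight * lm.log_prob(candidate, prefix)
    return score

# During beam search, each hypothesis extension would be ranked by `fused_score`
# instead of by the acoustic model's score alone.
```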

How Does ASR Fit Into The Broader Machine Learning Landscape?

On the one hand, ASR is on par in importance with computer vision (CV) and NLP when it is itself the end task. This is the current situation in low-resource languages and in domains where the transcript is the main business, e.g., court proceedings, medical records, movie subtitles, etc.

On the other hand, ASR is no longer the bottleneck in domains where it has passed a certain usability threshold. In these cases, NLP is the bottleneck, which means that improving ASR performance toward perfection is not essential for extracting insights for the end task. For example, meeting summarization or action-item extraction can in many cases be achieved with current ASR quality.

Closing Remarks

The advancements in ASR technology have brought us closer to achieving seamless communication between humans and machines, for example in Conversational AI and Generative AI. With the continued development of speech enhancement-ASR-diarization systems and the emergence of textless NLP, we are poised to witness exciting breakthroughs in this field. As we look to the future, we can't help but anticipate the endless possibilities that ASR technology will unlock.

Thank you for taking the time to read this post! Your thoughts and feedback on these projections are highly valued and appreciated. Please feel free to share your comments and ideas.

References:

[1] https://www.orcam.com/en/home/

[2] https://voicebot.ai/2022/12/01/hey-disney-custom-alexa-assistant-rolls-out-at-disney-world/

[3] Jose, Christin, et al. "Latency Control for Keyword Spotting." ArXiv, 2022, https://doi.org/10.21437/Interspeech.2022-10608.

[4] Bijwadia, Shaan, et al. "Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems." ArXiv, 2022, https://doi.org/10.1109/SLT54892.2023.10022338.

[5] Yoon, Ji, et al. "HuBERT-EE: Early Exiting HuBERT for Efficient Speech Recognition." ArXiv, 2022, https://doi.org/10.48550/arXiv.2204.06328.

[6] Kanda, Naoyuki, et al. "Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers Using End-to-End Speaker-Attributed ASR." ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.03151.

[7] Kanda, Naoyuki, et al. "Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings." ArXiv, 2022, https://doi.org/10.48550/arXiv.2203.16685.

[8] https://www.fiercevideo.com/video/video-will-account-for-82-all-internet-traffic-by-2022-cisco-says

[9] Chiu, Chung, et al. "Self-Supervised Learning with Random-Projection Quantizer for Speech Recognition." ArXiv, 2022, https://doi.org/10.48550/arXiv.2202.01855.

[10] Chen, Sanyuan, et al. "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing." ArXiv, 2021, https://doi.org/10.1109/JSTSP.2022.3188113.

[11] Hsu, Wei, et al. "Robust Wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training." ArXiv, 2021, https://doi.org/10.48550/arXiv.2104.01027.

[12] Lugosch, Loren, et al. "Pseudo-Labeling for Massively Multilingual Speech Recognition." ArXiv, 2021, https://doi.org/10.48550/arXiv.2111.00161.

[13] Berrebbi, Dan, et al. "Continuous Pseudo-Labeling from the Start." ArXiv, 2022, https://doi.org/10.48550/arXiv.2210.08711.

[14] Radford, Alec, et al. "Robust Speech Recognition via Large-Scale Weak Supervision." ArXiv, 2022, https://doi.org/10.48550/arXiv.2212.04356.

[15] Graves, Alex, et al. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML, 2006, https://www.cs.toronto.edu/~graves/icml_2006.pdf

[16] Graves, Alex. "Sequence Transduction with Recurrent Neural Networks." ArXiv, 2012, https://doi.org/10.48550/arXiv.1211.3711.

[17] Chan, William, et al. "Listen, Attend and Spell." ArXiv, 2015, https://doi.org/10.48550/arXiv.1508.01211.

[18] Variani, Ehsan, et al. "Hybrid Autoregressive Transducer (Hat)." ArXiv, 2020, https://doi.org/10.48550/arXiv.2003.07705.

[19] Meng, Zhong, et al. "Modular Hybrid Autoregressive Transducer." ArXiv, 2022, https://doi.org/10.48550/arXiv.2210.17049.

[20] Kim, Suyoun, et al. "Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric." ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.05376.

[21] Gulati, Anmol, et al. "Conformer: Convolution-Augmented Transformer for Speech Recognition." ArXiv, 2020, https://doi.org/10.48550/arXiv.2005.08100.

[22] Vaswani, Ashish, et al. "Attention Is All You Need." ArXiv, 2017, https://doi.org/10.48550/arXiv.1706.03762.

[23] Baevski, Alexei, et al. "Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." ArXiv, 2020, https://doi.org/10.48550/arXiv.2006.11477.

[24] Babu, Arun, et al. "XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale." ArXiv, 2021, https://doi.org/10.48550/arXiv.2111.09296.

[25] https://blog.google/technology/ai/ways-ai-is-scaling-helpful/

[26] https://ai.facebook.com/blog/teaching-ai-to-translate-100s-of-spoken-and-written-languages-in-real-time/

[27] Ao, Junyi, et al. "SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing." ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.07205.

[28] https://github.com/guillaumekln/faster-whisper

[29] https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/wav2vec2

[30] Meng, Zhong, et al. "JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition." ArXiv, 2023, https://doi.org/10.48550/arXiv.2302.08583.

[31] Zhang, Yu, et al. "Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages." ArXiv, 2023, https://doi.org/10.48550/arXiv.2303.01037.

[32] Kendall, T. and Farrington, C. "The corpus of regional african american language". Version 2021.07. Eugene, OR: The Online Resources for African American Language Project. http://oraal.uoregon.edu/coraal, 2021

[33] Brian, D Tran, et al. ‘"Mm-hm," "Uh-uh": are non-lexical conversational sounds deal breakers for the ambient clinical documentation technology?,’ Journal of the American Medical Informatics Association, 2023, https://doi.org/10.1093/jamia/ocad001

[34] Zhu, Ge, et al. "Filler Word Detection and Classification: A Dataset and Benchmark." ArXiv, 2022, https://doi.org/10.48550/arXiv.2203.15135.

[35] Anwar, Mohamed, et al. "MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation." ArXiv, 2023, https://doi.org/10.48550/arXiv.2303.00628.

[36] Jaegle, Andrew, et al. "Perceiver IO: A General Architecture for Structured Inputs & Outputs." ArXiv, 2021, https://doi.org/10.48550/arXiv.2107.14795.

[37] https://www.assemblyai.com/blog/conformer-1/

[38] Peyser, Cal, et al. "Dual Learning for Large Vocabulary On-Device ASR." ArXiv, 2023, https://doi.org/10.48550/arXiv.2301.04327.

[39] Lakhotia, Kushal, et al. "Generative Spoken Language Modeling from Raw Audio." ArXiv, 2021, https://doi.org/10.48550/arXiv.2102.01192.

[40] Zeghidour, Neil, et al. "SoundStream: An End-to-End Neural Audio Codec." ArXiv, 2021, https://doi.org/10.48550/arXiv.2107.03312.

[41] Borsos, Zalán, et al. "AudioLM: a Language Modeling Approach to Audio Generation." ArXiv, 2022, https://doi.org/10.48550/arXiv.2209.03143.

[42] https://about.fb.com/news/2022/10/hokkien-ai-speech-translation/

[43] Raffel, Colin, et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." ArXiv, 2019, https://doi.org/10.48550/arXiv.1910.10683.

[44] Sanh, Victor, et al. "Multitask Prompted Training Enables Zero-Shot Task Generalization." ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.08207.

[45] Wang, Chengyi, et al. "Neural Codec Language Models Are Zero-Shot Text to Speech Synthesizers." ArXiv, 2023, https://doi.org/10.48550/arXiv.2301.02111.

[46] Zhang, Ziqiang, et al. "Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling." ArXiv, 2023, https://doi.org/10.48550/arXiv.2303.03926.

[47] https://voicebot.ai/2022/07/05/eu-passes-new-regulations-for-voice-ai-and-digital-technology/

[48] https://www.speechtechmag.com/Articles/ReadArticle.aspx?ArticleID=154094

[49] Thoppilan, Romal, et al. "LaMDA: Language Models for Dialog Applications." ArXiv, 2022, https://doi.org/10.48550/arXiv.2201.08239.

[50] Shuster, Kurt, et al. "BlenderBot 3: a Deployed Conversational Agent that Continually Learns to Responsibly Engage." ArXiv, 2022, https://doi.org/10.48550/arXiv.2208.03188.

[51] Zhou, Xiaozhou, et al. "Phonetic Embedding for ASR Robustness in Entity Resolution." Proc. Interspeech 2022, 3268–3272, doi: 10.21437/Interspeech.2022-10956

[52] Chen, Angelica, et al. "Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection." ArXiv, 2022, https://doi.org/10.48550/arXiv.2205.00620.

[53] Li, Qiujia, et al. "Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition." ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.03327.

[54] https://loora.ai/

[55] https://techcrunch.com/2022/11/17/speak-lands-investment-from-openai-to-expand-its-language-learning-platform/

[56] Zhiqi, Huang, et al. "MTL-SLT: Multi-Task Learning for Spoken Language Tasks." NLP4ConvAI, 2022, https://aclanthology.org/2022.nlp4convai-1.11

[57] Watanabe, Shinji, et al. "CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings." ArXiv, 2020, https://doi.org/10.48550/arXiv.2004.09249.

[58] Shafey, Laurent, et al. "Joint Speech Recognition and Speaker Diarization via Sequence Transduction." ArXiv, 2019, https://doi.org/10.48550/arXiv.1907.05337.

[59] Kim, Juntae, and Lee, Jeehye. "Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers." ArXiv, 2021, https://doi.org/10.48550/arXiv.2108.10752.

[60] Sathyendra, Kanthashree, et al. "Contextual Adapters for Personalized Speech Recognition in Neural Transducers." ArXiv, 2022, https://doi.org/10.48550/arXiv.2205.13660.

[61] Tian, Jinchuan, et al. "Bayes Risk CTC: Controllable CTC Alignment in Sequence-to-Sequence Tasks." ArXiv, 2022, https://doi.org/10.48550/arXiv.2210.07499.

[62] Tian, Zhengkun, et al. "Peak-First CTC: Reducing the Peak Latency of CTC Models by Applying Peak-First Regularization." ArXiv, 2022, https://doi.org/10.48550/arXiv.2211.03284.

[63] https://docs.rev.ai/api/custom-vocabulary/

[64] https://cloud.google.com/speech-to-text/docs/adaptation-model

2023 Predictions: What’s Next for AI Research?

Excited by the past year, we looked forward to 2023 and wondered what it would look like.

This blog post was co-authored with Guy Eyal, an NLP team leader at Gong.

TL;DR:

In 2022, large models achieved state-of-the-art results in various tasks and domains. A significant breakthrough in natural language processing (NLP) was achieved when models were trained to align with user intent and human preferences, leading to improved generation quality. Looking ahead to 2023, we can expect to see new methods to improve the alignment process (such as reinforcement learning with AI feedback), the development of automatic metrics for understanding alignment effectiveness, and the emergence of personalized aligned models, even in an online manner. There may also be a focus on addressing factuality issues, as well as on developing open-source tools and specialized compute resources that allow aligning models at industrial scale. In addition to NLP, there will likely be progress in other modalities such as audio processing, computer vision, and robotics, and the development of multimodal models.


2022 AI Research Progress: A Year in Review

2022 was an excellent year for Artificial Intelligence/machine learning, with numerous large language models (LLMs) published and achieving state-of-the-art results across various benchmarks. These LLMs demonstrated their superior performance through few-shot learning, surpassing smaller models that had been fine-tuned on the same tasks [1–3]. This has the potential to reduce the need for specialized, in-domain datasets. Techniques like Chain of Thoughts [4] and Self Consistency [5] also helped to improve the reasoning capabilities of LLMs, leading to significant gains on reasoning benchmarks.

There were also notable advancements in dialogue systems resulting in more helpful, safe, and faithful models that could stay up-to-date through fine-tuning on annotated data and the use of retrieval from external knowledge sources [6–7].

In Automatic Speech Recognition (ASR), the use of an encoder-decoder transformer architecture allowed for more efficient scaling of model size, leading to a 50% reduction in word error rate on multiple ASR benchmarks without any domain adaptation [8].

Diffusion models [9–10], trained on large image datasets, made impressive strides in computer vision and sparked a new trend in AI art. Additionally, we saw the beginnings of multimodal models that use pre-trained LLMs to improve performance in tasks ranging from vision to robotics [11–13].

Finally, the release of ChatGPT [14] gave users a glimpse into the future of working with AI assistants in various fields and domains.


Photo by Moritz Knöringer on Unsplash

2023 Predictions: The Year of Alignment

Excited by the past year, we looked forward to 2023, and wondered what it would look like. Here are our thoughts:

Reinforcement Learning with Human Feedback (RLHF), an approach that aligns models with user intent and human preferences through supervised instruction data and reward modeling, has become increasingly popular in recent months [15]. This approach shows promising results in generation quality, as can be seen by comparing the outputs of vanilla GPT-3 [16] and ChatGPT. RLHF is so effective that a model trained with instruction tuning outperforms a model that is more than 100 times larger in size.

Image source – InstructGPT [15]

Going further with this year’s trend, we, like many others, expect that alignment will continue to be a significant factor. However, we also predict active research work on additional core functionalities that are currently lacking in most models and therefore limit their applicability in many fields.

Reinforcement Learning with AI Feedback (RLAIF)

Currently, RLHF requires human-curated data. Although small compared to the pre-training data size, it requires extensive and expensive human labor. For example, OpenAI used 40 annotators to write and label 64K samples for instruction tuning [15]. An interesting and exciting alternative that we think will be utilized this year is to use different LLMs as the instructors and labelers – Reinforcement Learning with AI Feedback (RLAIF). RLAIF will enable cost reduction and fast scaling of the alignment process, as machines will do everything end-2-end. Interesting recent work by Anthropic [17] showed that with good prompting, one can guide an LM to classify harmful outputs. These, in turn, are used to train the reward model necessary for RLHF.
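A minimal sketch of how AI feedback could replace human preference labels is shown below; the `llm` and `policy_model` objects and the judging prompt are hypothetical placeholders, loosely inspired by the idea in [17] rather than reproducing it.

```python
# RLAIF sketch: an LLM, rather than a human annotator, judges pairs of candidate
# responses; the resulting preference pairs train the reward model used for RLHF.
# `llm.complete`, `policy_model.generate`, and the prompt are hypothetical placeholders.

JUDGE_PROMPT = (
    "Which of the two responses to the prompt below is more helpful and harmless?\n"
    "Prompt: {prompt}\nResponse A: {a}\nResponse B: {b}\n"
    "Answer with a single letter, 'A' or 'B'."
)

def ai_preference(llm, prompt, response_a, response_b):
    """Return (chosen, rejected) according to the judging LLM."""
    verdict = llm.complete(JUDGE_PROMPT.format(prompt=prompt, a=response_a, b=response_b))
    if verdict.strip().upper().startswith("A"):
        return response_a, response_b
    return response_b, response_a

def build_preference_dataset(llm, policy_model, prompts):
    """Collect AI-labeled preference pairs for reward-model training."""
    dataset = []
    for prompt in prompts:
        first, second = policy_model.generate(prompt), policy_model.generate(prompt)
        chosen, rejected = ai_preference(llm, prompt, first, second)
        dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset
```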

Metrics For Alignment

We assume that many methods will be developed to achieve better alignment between a model's outputs and user intent. This, in turn, will further improve generation quality.

In order to understand which method is superior, automatic metrics should be developed alongside current human evaluation methods. This is a long-standing issue in NLP, as previous metrics fail to align with human annotations.

Recently, two promising approaches have been introduced: MAUVE [18], which compares the distributions of human-generated and model-generated outputs using divergence frontiers, and model-written evaluations [19], which utilize other language models to assess the quality of the generated output. Further research in these areas may be a valuable direction during 2023.

Personalized Aligned Models

Expecting a single model to be aligned with the entire society does not make sense, as we are not aligned with each other. Therefore, we expect to see many different models aligned with different usages and users. We term these Personalized Aligned Models.

We’ll see various companies align models with their own needs, and big companies align many models with their different users. This will greatly improve the end user’s experience when using LLMs in personal assistants, internet searches, text editing, and more.

Open Source and Specialized Compute

To achieve personalization of aligned models at the industry scale, two components that don’t/partially exist today will have to be available for public use: models that can be aligned and compute resources.

Models to be aligned and open source: models that are candidates for alignment will have to be developed, as current models, such as Meta's OPT [20], are not sufficient since they are not on par with paid APIs. Alternatively, we'll see paid APIs for model alignment: non-public models by Google / OpenAI / Cohere / AI21 with full serving options for consumers will be available and will serve as a valid business model.

Computational resources: although alignment is much cheaper than pre-training, it still requires very specialized computational resources. Therefore, we predict a race towards making such infrastructure accessible to the public, probably in the cloud.

Handling Factuality Issues

The apparent fluency of the output produced by LLMs may lead individuals to perceive the model as factually accurate and confident. However, a known limitation of LLMs that still needs to be solved by alignment is their tendency to generate hallucinated content. Therefore, we see two important research directions that will flourish this year: Outputting sources for text (citations) and outputting the model’s confidence.

Outputting sources for the current output can be achieved in many ways. One interesting direction is to connect LLMs with text retrieval mechanisms that help ground/relate the outputs to known sources [21]. This may also help models stay relevant, even though their training process stopped at some point in the past. Another recently suggested idea is to do this in post-processing by searching for the documents that are most proximal to the output [22]. While the latter will not solve hallucinations, it will make it easier for the user to validate the results.
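One way to picture the retrieval-grounding direction: fetch supporting documents, condition the model on them, and return the sources alongside the answer. The `retriever` and `llm` objects below are hypothetical placeholders.

```python
# Sketch of retrieval-grounded generation with citations: retrieve supporting
# documents, condition the LLM on them, and return the sources with the answer.
# `retriever` and `llm` are hypothetical placeholders.

def grounded_answer(llm, retriever, question, top_k=3):
    documents = retriever.search(question, top_k=top_k)
    context = "\n\n".join(f"[{i + 1}] {doc.text}" for i, doc in enumerate(documents))
    prompt = (
        f"Answer the question using only the sources below and cite them as [1], [2], ...\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    answer = llm.complete(prompt)
    return answer, [doc.url for doc in documents]  # citations let the user verify the claims
```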

Recent works in different domains (ASR, for example [23]) have trained models with two outputs: token predictions and a per-token confidence score. Using similar methods, while extending the confidence score to cover the entire output, will help the user take the results with a grain of salt.
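A minimal PyTorch sketch of the two-output idea, with one head for token prediction and one for a per-token confidence score; this is an illustration only, not the architecture used in [23].

```python
import torch
import torch.nn as nn

class TokenAndConfidenceHead(nn.Module):
    """Toy two-headed output layer: next-token logits plus a per-token confidence score.
    An illustration of the idea only, not the architecture used in [23]."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.token_head = nn.Linear(hidden_dim, vocab_size)   # next-token logits
        self.confidence_head = nn.Linear(hidden_dim, 1)        # per-token confidence in [0, 1]

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_dim) from any encoder/decoder backbone
        token_logits = self.token_head(hidden_states)
        confidence = torch.sigmoid(self.confidence_head(hidden_states)).squeeze(-1)
        return token_logits, confidence

# Usage on random features, just to show the shapes:
model = TokenAndConfidenceHead(hidden_dim=256, vocab_size=1000)
logits, confidence = model(torch.randn(2, 10, 256))
print(logits.shape, confidence.shape)  # torch.Size([2, 10, 1000]) torch.Size([2, 10])
```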

Online Alignment

As people change over time, with shifts in interests, beliefs, jobs, and family status, it makes sense that their personal assistants should adapt as well. One very promising research direction we’re predicting is online alignment. Users will be able to continue personalizing their models after deployment. The alignment process will be continuously updated using an online learning algorithm by giving feedback to the models [24].

What About Other Modalities?

We expect to see considerable improvements in the audio and speech recognition domains. We assume that Whisper [8] will be able to utilize unlabeled data (as Wav2Vec 2.0 [25] / HuBERT [26] do), which will significantly improve performance in challenging acoustic scenarios.

SpeechT5 [27] was an early bird, so we assume that T0-like models [28] for audio will be trained at scale (in both training data and model size), resulting in improved audio embeddings. This will enable a unified speech enhancement, diarization, and transcription system. In the longer term, we expect auditory models to answer questions similarly to natural language processing (NLP) models. The grounding context of these auditory models will be an audio segment, which will serve as the context for the query without the need for explicit transcription.

Multi-Modal Models

An important paradigm for the next year would be large multimodal models. What will they look like? We suspect they may look very similar to language models. By that we mean that the user will prompt the model with a given modality, and the model will be able to generate its output in a different modality (as in Unified-IO [29]).

Although very exciting, diffusion models [9] currently cannot classify images. This could be solved easily by outputting text, similar to how we use LLMs in classification tasks today. Similarly, these models will be able to transcribe, generate, and enhance audio and video with good prompting.

What about aligning multimodal models? This is for the far future! Or, as we call it at the current pace of our field, in a few months.


Closing Thoughts

This post presents our predictions regarding the needed advances in AI research in 2023. Large models can perform a wide range of academic tasks, as shown by their impressive performance on standard benchmarks. However, their applicability needs to be improved, as in real-world scenarios these models still encounter embarrassing failures (untruthful, toxic, or simply not helpful to the user). We believe that aligning the models with user needs and keeping them up-to-date can address many of these issues. To that end, we have focused on the scalability and adaptability of the alignment process. If our hypothesis is correct, the field of generative language models will undergo significant changes soon. The potential uses of these models are vast, ranging from editing tools to domain-specific AI assistants that can automate manual labor in industries such as law, accounting, and engineering. Combining the above with the predicted progress in computing (GPT-4), and applying the same methods to domains such as vision and audio processing, promises another exciting year.

Thank you for reading!! If you have any thoughts regarding this 2023 projection, we warmly welcome them in the comments.


References:

[1] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., . . . Fiedel, N. (2022). PaLM: Scaling Language Modeling with Pathways. arXiv. https://doi.org/10.48550/arXiv.2204.02311

[2] Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., Tow, J., Rush, A. M., Biderman, S., Webson, A., Ammanamanchi, P. S., Wang, T., Sagot, B., Muennighoff, N., . . . Wolf, T. (2022). BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv. https://doi.org/10.48550/arXiv.2211.05100

[3] Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., & Zettlemoyer, L. (2022). OPT: Open Pre-trained Transformer Language Models. arXiv. https://doi.org/10.48550/arXiv.2205.01068

[4] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain of Thought Prompting Elicits Reasoning in Large Language Models. arXiv. https://doi.org/10.48550/arXiv.2201.11903

[5] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv. https://doi.org/10.48550/arXiv.2203.11171

[6] Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H., Jin, A., Bos, T., Baker, L., Du, Y., Li, Y., Lee, H., Zheng, H. S., Ghafouri, A., Menegali, M., Huang, Y., Krikun, M., Lepikhin, D., Qin, J., . . . Le, Q. (2022). LaMDA: Language Models for Dialog Applications. arXiv. https://doi.org/10.48550/arXiv.2201.08239

[7] Shuster, K., Xu, J., Komeili, M., Ju, D., Smith, E. M., Roller, S., Ung, M., Chen, M., Arora, K., Lane, J., Behrooz, M., Ngan, W., Poff, S., Goyal, N., Szlam, A., Boureau, Y., Kambadur, M., & Weston, J. (2022). BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage. arXiv. https://doi.org/10.48550/arXiv.2208.03188

[8] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. OpenAI. https://cdn.openai.com/papers/whisper.pdf

[9] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2021). High-Resolution Image Synthesis with Latent Diffusion Models. arXiv. https://doi.org/10.48550/arXiv.2112.10752

[10] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv. https://doi.org/10.48550/arXiv.2204.06125

[11] Tang, Z., Yang, Z., Wang, G., Fang, Y., Liu, Y., Zhu, C., Zeng, M., Zhang, C., & Bansal, M. (2022). Unifying Vision, Text, and Layout for Universal Document Processing. arXiv. https://doi.org/10.48550/arXiv.2212.02623

[12] Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., & Yang, H. (2022). OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. arXiv. https://doi.org/10.48550/arXiv.2202.03052

[13] Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Ho, D., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jang, E., Ruano, R. J., Jeffrey, K., . . . Zeng, A. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. arXiv. https://doi.org/10.48550/arXiv.2204.01691

[14] https://chat.openai.com/chat

[15] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., . . . Lowe, R. (2022). Training language models to follow instructions with human feedback. arXiv. https://doi.org/10.48550/arXiv.2203.02155

[16] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., . . . Amodei, D. (2020). Language Models are Few-Shot Learners. arXiv. https://doi.org/10.48550/arXiv.2005.14165

[17] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Perez, E., Kerr, J., . . . Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv. https://doi.org/10.48550/arXiv.2212.08073

[18] Pillutla, K., Swayamdipta, S., Zellers, R., Thickstun, J., Welleck, S., Choi, Y., & Harchaoui, Z. (2021). MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers. arXiv. https://doi.org/10.48550/arXiv.2102.01454

[19] Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, A., Mann, B., Israel, B., Seethor, B., McKinnon, C., Olah, C., Yan, D., Amodei, D., . . . Kaplan, J. (2022). Discovering Language Model Behaviors with Model-Written Evaluations. arXiv. https://doi.org/10.48550/arXiv.2212.09251

[20] Iyer, S., Lin, X. V., Pasunuru, R., Mihaylov, T., Simig, D., Yu, P., Shuster, K., Wang, T., Liu, Q., Koura, P. S., Li, X., Pereyra, G., Wang, J., Dewan, C., Celikyilmaz, A., Zettlemoyer, L., & Stoyanov, V. (2022). OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization. arXiv. https://doi.org/10.48550/arXiv.2212.12017

[21] He, H., Zhang, H., & Roth, D. (2022). Rethinking with Retrieval: Faithful Large Language Model Inference. arXiv. https://doi.org/10.48550/arXiv.2301.00303

[22] Bohnet, B., Tran, V. Q., Verga, P., Aharoni, R., Andor, D., Soares, L. B., Eisenstein, J., Ganchev, K., Herzig, J., Hui, K., Kwiatkowski, T., Ma, J., Ni, J., Schuster, T., Cohen, W. W., Collins, M., Das, D., Metzler, D., Petrov, S., . . . Webster, K. (2022). Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models. arXiv. https://doi.org/10.48550/arXiv.2212.08037

[23] Gekhman, Z., Zverinski, D., Mallinson, J., & Beryozkin, G. (2022). RED-ACE: Robust Error Detection for ASR using Confidence Embeddings. arXiv. https://doi.org/10.48550/arXiv.2203.07172

[24] Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., Elhage, N., Hernandez, D., Hume, T., Johnston, S., Kravec, S., . . . Kaplan, J. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv. https://doi.org/10.48550/arXiv.2204.05862

[25] Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv. https://doi.org/10.48550/arXiv.2006.11477

[26] Hsu, W., Bolte, B., Tsai, Y., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. arXiv. https://doi.org/10.48550/arXiv.2106.07447

[27] Ao, J., Wang, R., Zhou, L., Wang, C., Ren, S., Wu, Y., Liu, S., Ko, T., Li, Q., Zhang, Y., Wei, Z., Qian, Y., Li, J., & Wei, F. (2021). SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. arXiv. https://doi.org/10.48550/arXiv.2110.07205

[28] Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., Dey, M., Bari, M. S., Xu, C., Thakker, U., Sharma, S. S., Szczechla, E., Kim, T., Chhablani, G., Nayak, N., . . . Rush, A. M. (2021). Multitask Prompted Training Enables Zero-Shot Task Generalization. arXiv. https://doi.org/10.48550/arXiv.2110.08207

[29] Lu, J., Clark, C., Zellers, R., Mottaghi, R., & Kembhavi, A. (2022). Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks. arXiv. https://doi.org/10.48550/arXiv.2206.08916

[30] Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2020). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. arXiv. https://doi.org/10.48550/arXiv.2003.08934

[31] Chen, C., Gao, R., Calamia, P., & Grauman, K. (2022). Visual Acoustic Matching. arXiv. https://doi.org/10.48550/arXiv.2202.06875

[32] Zhu, Z., Peng, S., Larsson, V., Xu, W., Bao, H., Cui, Z., Oswald, M. R., & Pollefeys, M. (2021). NICE-SLAM: Neural Implicit Scalable Encoding for SLAM. arXiv. https://doi.org/10.48550/arXiv.2112.12130

A Step-By-Step Guide to Approaching Complex Research Projects

How to Drive Focused Research in Data Science

Thoughts and Theory

When you look at a championship sports team, it’s easy to attribute the team’s success to the star players. And while much of the credit surely does belong to the team, there’s a key player leading them along the journey: the coach. Although the coach can’t score a single point, (s)he has to manage the team and devise strategies. In essence, a coach creates the blueprint for winning.

Similar to coaching, managing research projects requires a team leader who is responsible for the team’s players, adequate planning, and fostering an adaptive mindset to execute the work (which often changes on the fly). Like a coach devising a roadmap for winning, having a guide for how to conduct research can completely reshape how you approach your projects and maximize your chances of achieving successful outcomes.

To better understand what I mean, let’s rip a page from my seasons of being a team leader.

In 2016, my R&D team at OrCam started developing an Automatic Speech Recognition (ASR) engine from scratch. Since then, we have been consistently conducting a (relatively) long and iterative process of collecting insights and refining our research accordingly.

From our experience learning from our own mistakes, we have developed a methodology for how to perform research that:

  1. Encourages us to take the time, think, and then decide what the best solution is, instead of executing the first idea that crosses our minds (which, in most cases, is not the ideal one).
  2. Promotes focused research.
  3. Enables the delivery of highly complex research projects and minimizes their delivery time.

We often get asked: Does this methodology slow down project onboarding and cause the work to be exhaustive and delayed?

The answer to that question is simple: our goal is to finish the project ASAP, not to start executing it ASAP. By sticking to the methodology outlined below, initiating projects will not take longer than expected, and if it does, you can rest assured that it will be for a good reason.

A World of Possibilities

At times, kicking off a research project can feel like being a kid in a candy shop.

With so many options in front of you, it can be hard to choose how to proceed. In addition, as you dive deeper into the project, each of the options sounds more fascinating than the last, and the number of options seemingly grows exponentially. But, at the end of the day, you need to stay focused and deliver whatever was asked of you (for example, a feature for a product or a research paper).

The Overall Picture

Roughly speaking, a research project should consist of four (4) sequential stages:

  • Step 0: Motivation – understand why we are going to work on the project
  • Step 1: Goals – define what we are trying to achieve
  • Step 2: Planning – define how we are going to achieve the goal(s)
  • Step 3: Execution – implementing the selected plan

We would be remiss if we didn’t acknowledge that these steps hold true in theory. In practice, we know that complex research projects tend to be cyclic. This means that during execution, we may need to revisit the goals, rethink our plans, and shift our execution. To explain the methodology simply, let’s assume that these steps take shape in this defined order. But, remember that in practice, these steps will likely be iterative.

Here’s a look at the methodology, as well as some theoretical insights with practical examples.

Step 0: Motivation

"The two most important days in your life are the day you are born and the day you find out why." [Mark Twain]

Photo by Emily Morter on Unsplash

Many people prefer to know Why they are going to work on a project. By answering this question, we will have to understand the motivation behind the project, which leads us to its context.

This step is labeled as zero (0) because, while it is not strictly necessary, it serves as a bonus. Understanding the motive behind a project can positively impact how the research and development will be executed.

Research is a dynamic field where things might change rapidly; when things don’t proceed as expected, understanding the motivation keeps us focused and enables us to converge towards a solution that solves the problem, rather than diverging.

Furthermore, when people from other departments ask for a feature or some help, they might be unaware of the dynamics at play. By understanding the reasoning behind their request, we can either agree or suggest an alternative solution that is easier to implement.

Step 1: Goals

"If you do not know where you are going, every road will get you nowhere" [Henry A. Kissinger]

Photo by Markus Winkler on Unsplash
Photo by Markus Winkler on Unsplash

We want to know that when we deliver something, we solve the originally proposed issue and not a different one (or, even worse, no issue at all). Therefore, after the motivation is made known, it’s time to be very clear and answer: What are the objectives of the project?

It’s critical here to define measurements for success, i.e., selecting the benchmarks and evaluation metrics and designing them to serve as good indicators of progress as you move forward [Ruder, 2021].

Here is a (partial) list of high-level tips that we’ve found to be useful in constructing benchmarks and selecting evaluation metrics:

Validity: A benchmark should reflect your end goals.

We define our benchmarks such that when we reach the desired Key Performance Indicator (KPI), we achieve our goals. In other words, hitting your benchmark implies that you’ve been successful in accomplishing what you set out to do [Bowman, 2021].

For example, in automatic speech recognition (ASR), researchers often use LibriSpeech [Panayotov, 2015] as a benchmark. LibriSpeech is derived from audiobooks and consists of audio segments paired with their corresponding transcripts. If one wants to develop an ASR engine for meeting transcription, they should choose a different benchmark, as LibriSpeech doesn’t reflect their end goal: spontaneous speech has different characteristics from the structured reading of audiobooks [Aksenova, 2021]. This example shows why it’s important to choose benchmarks that are aligned with, and relevant to, your goals.
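To make the "measurement for success" concrete, here is a minimal sketch of scoring an ASR system on an in-domain test set with word error rate (WER). It assumes the open-source jiwer package, and the reference/hypothesis transcripts are purely hypothetical:

```python
# Minimal sketch: scoring ASR hypotheses against references with WER.
# Assumes `pip install jiwer`; the transcripts below are hypothetical.
import jiwer

references = [
    "let's move the weekly sync to thursday afternoon",
    "please share the quarterly numbers before the call",
]
hypotheses = [
    "lets move the weekly sink to thursday afternoon",
    "please share the quarterly numbers before the fall",
]

wer = jiwer.wer(references, hypotheses)
print(f"word error rate: {wer:.2%}")
```

The point is not the specific metric but the pairing: a meeting-transcription goal calls for meeting-style references, not whichever benchmark happens to be most popular.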

Reliability: Labels must be correct and consistent [Bowman, 2021].

In order to trust benchmark results, the labels must be correct. You may develop your benchmarks in-house or even outsource this stage. Since instructions can be interpreted in different ways, it’s of paramount importance to be concise and accurate when communicating your requirements to others. Make sure you review randomly sampled data and labels before you start working.

When thinking about evaluation metrics, objectivity is preferred over subjectivity as it reduces biases and enables reproducible evaluation.

We prefer metrics that correlate highly with human perception. Although it sounds simple, finding such metrics can be challenging in many domains, as human perception is hard to imitate.

Furthermore, think about whether all errors should be weighted the same in your statistics. For example, when evaluating language-related tasks, remember that the words that actually matter are often rare, while filler words can end up biasing your statistics; it’s important to rectify this by creating specific benchmarks that address these concerns.
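As a small, hypothetical illustration of unequal error weighting, the sketch below (again assuming the jiwer package) compares WER computed over all words with WER computed after stripping a filler-word list, one quick way to check whether fillers are skewing the statistic:

```python
import jiwer

# Hypothetical single utterance; "uh"/"um"/"mm" are treated as filler words.
reference = "uh i think um the model needs a larger validation set"
hypothesis = "uh i think uh the model need a larger validation set"

FILLERS = {"uh", "um", "mm"}

def strip_fillers(text: str) -> str:
    """Drop filler words so only content words contribute to the score."""
    return " ".join(word for word in text.split() if word not in FILLERS)

overall_wer = jiwer.wer(reference, hypothesis)
content_wer = jiwer.wer(strip_fillers(reference), strip_fillers(hypothesis))

print(f"WER over all words:     {overall_wer:.2%}")
print(f"WER over content words: {content_wer:.2%}")
```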

If human evaluation is required, do your best to reduce its biases. A common solution is crowdsourcing, but it should be approached with great attention to detail, as even the guidelines themselves may carry inherent bias [Hansen, 2021]. Thus, crowdsourced judgments should be collected from a diverse population, and the number of annotators should be properly determined.

Define success: Performance with respect to the ground truth is not the only measure of success, and benchmarks should reflect that [Aksenova, 2021].

Achieving 100% accuracy is not always preferable to 95%. Take always-on keyword spotting, such as ‘Hey Siri’ on the iPhone, as an example: a short response time with slightly lower accuracy is preferable to a device that always wakes up, but only after a 30-second delay.

Benchmarks should approximate reality and evolve as the project progresses.

Benchmarks should be sufficiently large and representative in order to approximate reality [Kiela, 2021].

Furthermore, benchmarks should evolve as the project progresses.

The way we work is to define a benchmark at the beginning of the project, based on our understanding of the problem at that specific point in time. As we work, new challenges arise. To make sure we solve these challenges, new and specific benchmarks are developed.

Step 2: Planning

"By failing to prepare, you are preparing to fail." [Benjamin Franklin]

Photo by Kaleidico on Unsplash

Planning answers: How are we going to achieve our goals? or What is the best method to solve the challenge?

The process can be abstractly seen as a breadth-first search: at every decision point, we examine all possible actions before moving on to the next decision. Doing this recursively (bounded in time, of course) results in a clear view of all the possible pathways the project could follow.

Then, we can choose the solution expected to be best given the constraints (which could include product requirements, release dates, the technological stack, etc.). The planning stage is crucial and should be conducted while we are relaxed and open-minded, because every wrong decision made here has the potential to cost a considerable amount of time and money.

Planning has two main objectives, namely:

Reducing uncertainty

Eliminating as many unknowns as possible by offering different solutions, taking advice from relevant teams (e.g. product and infrastructure teams), etc.

You’ll often reach a point where two options look promising, and it’s hard to choose between them. To remove unknowns, existing tools (like open-source software) can help you develop a quick proof of concept, even if they don’t solve your exact problem. For example, if one wants to automatically detect lions in images and finds an existing tool that detects dogs, horses, sheep, cows, elephants, etc., they can safely assume that lions can be detected too, as there shouldn’t be a meaningful difference between the groups other than the available training data. Furthermore, keep your existing toolset in mind while planning: in many cases, it’s quicker to modify existing tools than to develop new ones from scratch.

Setting up the roadmap for the project

List and detail the technical steps that must be completed to achieve the project goals, including a timeline. It helps to break complex research projects into small steps. Although no single step will lead us to the final goal on its own, working this way lets us steer the research in the right direction along the way instead of only at the end, which tends to be more challenging and time-consuming. Furthermore, it makes finding and fixing bugs more straightforward.

We can demonstrate how complex projects are broken into small steps by comparing the process to how contractors construct a house. The house, or the end goal, is ready only after all construction stages are finished: foundations and framing, infrastructure (plumbing, electricity, etc.), and interior trim, to name a few. The construction stages serve as the small tasks; although no single stage creates a house on its own, the combination of all finished stages will. Furthermore, if there is a problem at a particular stage of the execution, it’s easy to pinpoint the issue and assign the right professional to fix it. The same can be said when you break a complex research project into its parts.


It is also important to include a solid baseline. We could discuss the importance of baselines and how to construct them in a separate post, but to summarize briefly:

  1. Baselines assess the quality of your method by comparing its outcome to prior work or a naive approach. Do you consider the performance of a highly complex model that achieves 95% accuracy to be good? If a naive baseline reaches 97%, the answer is no. Baselines are sometimes given, such as an existing component that one wants to improve. But the main takeaway here is that even if a baseline doesn’t exist, make an effort to create one, even if it is naive and will not serve as the ultimate solution (see the sketch below). Examples of naive baselines for text summarization can be found in [Ferriera, 2013].
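To illustrate, one classic naive baseline for extractive text summarization is a ‘lead-N’ summary: simply return the first N sentences of the document. A minimal sketch, with a hypothetical article and a deliberately crude sentence splitter, might look like this:

```python
# Naive "lead-N" summarization baseline: take the first N sentences as the summary.
# The sentence splitting here is deliberately crude and only for illustration.

def lead_n_summary(document: str, n: int = 3) -> str:
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return ". ".join(sentences[:n]) + "."

article = (
    "The city council approved the new budget on Tuesday. "
    "The vote passed seven to two after a long debate. "
    "Most of the funds are earmarked for public transit. "
    "Opponents argued the plan ignores road maintenance."
)

print(lead_n_summary(article, n=2))
```

If a carefully engineered summarizer cannot clearly beat a baseline like this on your benchmark, that is important to know before investing further effort.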

To make the planning as effective as possible, we usually brainstorm. We gather and briefly walk through the motivation and project goals. We then ask each of the participants to think about their preferred solution independently.

Naturally, some people are prone to perform a literature review, while others like to think independently. We encourage this diversity since it helps in keeping some of us more open-minded and less biased towards a certain solution based on our respective discipline and expertise.

After everyone completes their "homework," we meet again, and each of the participants presents their point of view while everyone else solely listens. This way, judgments like "Who is the most assertive person in the room?" or "My idea sounds silly, I will not mention it" are minimized [Kahneman, 2011].

After everyone shares their solutions, a technical discussion begins. Criticism is welcomed, but it should be specific and well-explained. It might take more than one meeting to agree on the optimal solution, since complex research questions often require further thought.

When the planning stage is finished, we know (up to some extent) the steps we need to execute to achieve our goals, as well as an estimated timeline.

Step 3: Execution

"Doing the right thing is more important than doing the thing right." [Peter F. Drucker]

Photo by Todd Quackenbush on Unsplash

First, make something work, then make it work better.

We prefer to tackle problems with an increasing level of difficulty: first solve the challenge without constraints, and only then add the constraints and perform optimizations. For example, suppose we want to develop a low-latency app for a mobile device. In that case, we will start by developing an offline (non-streaming) version that runs with unlimited computational power (on servers). Once that is solved, we will add the streaming and on-device constraints.

In order to trust your results, you must minimize flaws in your experimental methodology. Musgrave [Musgrave, 2020] showed that after fixing flaws such as the lack of a validation set for hyper-parameter tuning (which results in training with test-set feedback), metric learning papers published between 2016 and 2020 showed marginal improvements at best, although they claimed great advances in accuracy.
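The fix for exactly that flaw is worth spelling out. Below is a minimal sketch, assuming scikit-learn and a toy dataset, of tuning hyper-parameters on a held-out validation split and touching the test set only once, at the very end:

```python
# Minimal sketch: tune hyper-parameters on a validation split so the test set
# is evaluated only once (avoiding test-set feedback).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# 60% train / 20% validation / 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

best_score, best_c = -1.0, None
for c in (0.01, 0.1, 1.0, 10.0):  # hyper-parameter search uses the validation set only
    model = LogisticRegression(C=c, max_iter=2000).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_c = score, c

final_model = LogisticRegression(C=best_c, max_iter=2000).fit(X_train, y_train)
print(f"best C={best_c}, test accuracy={final_model.score(X_test, y_test):.3f}")  # reported once
```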

Once you’ve minimized flaws and can trust that your module is high-performing and accurate, it’s best to perform an ablation study to understand why that is the case. The ablation can serve to build intuition for future projects, to point to where optimization of the module should be applied, and to ensure that independent effects aren’t conflated.

A great example of the need for an ablation study can be found in Tay et al. [Tay, 2021]. Their paper claims that tying pre-training and architectural advances (i.e. Transformers) together in natural language processing (NLP) is misguided, and that both advances should be considered independently. Thus, when one applies the pre-train-fine-tune paradigm and wants to improve a model’s performance and memory consumption, they can search for new architectural changes that weren’t tested before because of this assumed tie.
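To make the mechanics of an ablation study concrete, here is a minimal, hypothetical sketch of an ablation loop; train_and_evaluate, the component names, and the scores are placeholders standing in for a real training pipeline:

```python
# Hypothetical ablation loop: measure each component's contribution by disabling
# it and re-running the full train/evaluate cycle. The scores are dummy values
# used purely for illustration; in practice, train_and_evaluate trains a real model.

def train_and_evaluate(config: dict) -> float:
    contributions = {"pretraining": 0.06, "data_augmentation": 0.02, "label_smoothing": 0.01}
    return 0.85 + sum(v for k, v in contributions.items() if config.get(k))

full_config = {"pretraining": True, "data_augmentation": True, "label_smoothing": True}
baseline = train_and_evaluate(full_config)

for component in full_config:
    ablated = {**full_config, component: False}  # turn off one component at a time
    score = train_and_evaluate(ablated)
    print(f"without {component}: {score:.3f} (delta vs. full system: {score - baseline:+.3f})")
```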

Remember that things exist within a context, meaning that you will eventually need to deploy your project inside a bigger system. To prevent embarrassing outcomes, you should include a ‘fail-safe’ mechanism.

Keep in mind: although we want the original plan to succeed, end users don’t care how you got to the solution; they just care that the product or service works as promised. So, if you see a new approach or technology that outperforms the solution born of your hard work and sleepless nights, don’t hesitate to embrace it. In the same vein, try to understand why things went wrong or shifted so you can avoid repeating the same mistake twice.

If you decide to diverge from the original plan, make sure your decision to do so is data-driven. As Jim Barksdale said, "If we have data, let’s look at data. If all we have are opinions, let’s go with mine."

Finally, build everything on a solid infrastructure that you can modify easily and that can serve multiple purposes. There is a constant tension between moving as fast as possible and building something that will last.

We try to stick to the following rule of thumb: 70–80% of our code is written in a way that even if the project is discarded, we can use it for future projects. This has two benefits: the execution time of future projects is shortened (because we already have part of the building blocks ready), and it enables us to focus on the task-specific core issues.

Let’s take LEGO as an example. LEGO models share most of the same building blocks and differ only in block size, color, and quantity. Having a unified block set saves the company money by cutting manufacturing costs, among other expenses.

From the R&D aspect, it significantly reduces the amount of work and attention needed for execution (as only a tiny portion of model-specific building blocks are needed), which eventually reduces the delivery time of new models and enables the creation of more models, including complex LEGO sets.

Closing Thoughts

This post aims to present a step-by-step methodology that helps to keep research focused. The methodology can serve to answer three main questions: Why, What, and How, which all guide implementation.

In retrospect, having a research-management guide like this one would have helped my team and me avoid a plethora of mistakes over the years (despite the fact that I have always had great executive supervision and oversight).

Throughout my professional career, I have made a habit of revisiting feedback I’ve received from colleagues and executives. A comment that sticks out is that I "run too fast" once something has to be done, without taking adequate time to plan. This comment has stuck with me and served as a major motivator for creating and sharing this guide to conducting research.

Practically speaking, once my team and I started adhering to this methodology, we witnessed greater efficiency and fewer mistakes in our research and projects compared to when we simply hit the ground running in the past.

Hopefully, the outlined methodology will benefit you the way a coach benefits a team: by helping you choose the right pathway forward and directing you along your research journey(s). If you have any suggestions or thoughts regarding this step-by-step research guide, I warmly welcome them in the comments.

The post A Step-By-Step Guide to Approaching Complex Research Projects appeared first on Towards Data Science.
