Sesame Speech Model: How This Viral AI Model Generates Human-Like Speech
A deep dive into residual vector quantizers, conversational speech AI, and talkative transformers.

Recently, Sesame AI published a demo of their latest Speech-to-Speech model. It is a conversational AI agent that is really good at speaking: it provides relevant answers, speaks with expression, and honestly, it is just very fun and interactive to play with.

Note that a technical paper is not out yet, but they do have a short blog post that provides a lot of information about the techniques they used and previous algorithms they built upon. 

Thankfully, they provided enough information for me to write this article and make a YouTube video out of it. Read on!

Training a Conversational Speech Model

Sesame is a Conversational Speech Model, or CSM. It inputs both text and audio and generates speech as audio. While they haven’t revealed their training data sources, we can still take a solid guess. The blog post heavily cites another CSM, 2024’s Moshi, and fortunately, the creators of Moshi did reveal their data sources in their paper. Moshi uses 7 million hours of unsupervised speech data, 170 hours of natural and scripted conversations (for multi-stream training), and 2,000 more hours of telephone conversations (the Fisher dataset).


Sesame builds upon the Moshi Paper (2024)

But what does it really take to generate audio?

In raw form, audio is just a long sequence of amplitude values — a waveform. For example, if you’re sampling audio at 24 kHz, you are capturing 24,000 float values every second.

There are 24000 values here to represent 1 second of speech! (Image generated by author)

Of course, it is quite resource-intensive to process 24000 float values for just one second of data, especially because transformer computations scale quadratically with sequence length. It would be great if we could compress this signal and reduce the number of samples required to process the audio.

We will take a deep dive into the Mimi encoder and specifically Residual Vector Quantizers (RVQ), which are the backbone of Audio/Speech modeling in Deep Learning today. We will end the article by learning about how Sesame generates audio using its special dual-transformer architecture.

Preprocessing audio

Compression and feature extraction are where convolution helps us. Sesame uses the Mimi speech encoder to process audio. Mimi was introduced in the aforementioned Moshi paper as well. Mimi is a self-supervised audio encoder-decoder model that converts audio waveforms into discrete “latent” tokens first, and then reconstructs the original signal. Sesame only uses the encoder section of Mimi to tokenize the input audio. Let’s learn how.

Mimi inputs the raw speech waveform at 24 kHz and passes it through several strided convolution layers to downsample the signal, with stride factors of 4, 5, 6, 8, and 2. This means that the first CNN block downsamples the audio by 4x, the next by 5x, then 6x, and so on. In total, it downsamples by a factor of 1920, reducing the signal to just 12.5 frames per second.

The convolution blocks also project the original float values to an embedding dimension of 512. Each embedding aggregates the local features of the original 1D waveform. 1 second of audio is now represented as around 12 vectors of size 512. This way, Mimi reduces the sequence length from 24000 to just 12 and converts them into dense continuous vectors.

Before applying any quantization, the Mimi encoder downsamples the input 24 kHz audio by a factor of 1920 and embeds it into 512 dimensions. In other words, you get 12.5 frames per second, with each frame as a 512-dimensional vector. (Image from author’s video)
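To make the downsampling arithmetic concrete, here is a minimal PyTorch sketch of this kind of strided-convolution encoder. It is not Mimi's actual code: the kernel sizes, channel progression, and activation are assumptions; only the strides (4, 5, 6, 8, 2) and the final 512-dimensional embedding come from the description above.

```python
import torch
import torch.nn as nn

strides = [4, 5, 6, 8, 2]                 # from the description above (4*5*6*8*2 = 1920)
channels = [1, 64, 128, 256, 512, 512]    # hypothetical channel progression

layers = []
for i, s in enumerate(strides):
    layers += [
        nn.Conv1d(channels[i], channels[i + 1], kernel_size=2 * s, stride=s, padding=s // 2),
        nn.ELU(),
    ]
encoder = nn.Sequential(*layers)

waveform = torch.randn(1, 1, 24000)       # 1 second of audio sampled at 24 kHz
frames = encoder(waveform)
print(frames.shape)                       # torch.Size([1, 512, 12]) -> ~12 frames of 512 dims per second
```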

What is Audio Quantization?

Given the continuous embeddings obtained after the convolution layer, we want to tokenize the input speech. If we can represent speech as a sequence of tokens, we can apply standard language learning transformers to train generative models.

Mimi uses a Residual Vector Quantizer, or RVQ tokenizer, to achieve this. We will talk about the residual part soon, but first, let’s look at what a simple vanilla vector quantizer does.

Vector Quantization

The idea behind Vector Quantization is simple: you train a codebook, which is a collection of, say, 1000 random vector codes, all of size 512 (same as your embedding dimension).

A Vanilla Vector Quantizer. A codebook of embeddings is trained. Given an input embedding, we map/quantize it to the nearest codebook entry. (Screenshot from author’s video)

Then, given the input vector, we will map it to the closest vector in our codebook — basically snapping a point to its nearest cluster center. This means we have effectively created a fixed vocabulary of tokens to represent each audio frame, because whatever the input frame embedding may be, we will represent it with the nearest cluster centroid. If you want to learn more about Vector Quantization, check out my video on this topic where I go much deeper with this.

More about Vector Quantization! (Video by author)
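Here is a minimal sketch of that snapping operation in PyTorch. The codebook size (1000) and dimension (512) follow the example above; a real quantizer learns its codebook during training rather than keeping it random.

```python
import torch

codebook = torch.randn(1000, 512)          # 1000 code vectors of size 512

def quantize(x):
    """x: (num_frames, 512) continuous embeddings -> (token ids, quantized vectors)."""
    dists = torch.cdist(x, codebook)       # distance from each frame to every code
    indices = dists.argmin(dim=-1)         # nearest code id = the discrete "token"
    return indices, codebook[indices]

frame_embeddings = torch.randn(12, 512)    # ~1 second of audio after the encoder
tokens, quantized = quantize(frame_embeddings)
print(tokens.shape, quantized.shape)       # torch.Size([12]) torch.Size([12, 512])
```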

Residual Vector Quantization

The problem with simple vector quantization is that the loss of information may be too high because we are mapping each vector to its cluster’s centroid. This “snap” is rarely perfect, so there is always an error between the original embedding and the nearest codebook entry.

The big idea of Residual Vector Quantization is that it doesn’t stop at having just one codebook. Instead, it tries to use multiple codebooks to represent the input vector.

  1. First, you quantize the original vector using the first codebook.
  2. Then, you subtract that centroid from your original vector. What you’re left with is the residual — the error that wasn’t captured in the first quantization.
  3. Now take this residual, and quantize it again, using a second codebook full of brand new code vectors — again by snapping it to the nearest centroid.
  4. Subtract that too, and you get a smaller residual. Quantize again with a third codebook… and you can keep doing this for as many codebooks as you want.
Residual Vector Quantizers (RVQ) hierarchically encode the input embeddings by using a new codebook and VQ layer to represent the previous codebook’s error. (Illustration by the author)

Each step hierarchically captures a little more detail that was missed in the previous round. If you repeat this for, let’s say, N codebooks, you get one discrete token from each stage of quantization (N tokens in total) to represent one audio frame.
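In code, the residual loop is only a few lines. The sketch below is illustrative (the codebook count and sizes are assumptions, and real RVQs learn their codebooks during training), but it shows the quantize-subtract-repeat pattern:

```python
import torch

N, codebook_size, dim = 8, 1000, 512
codebooks = [torch.randn(codebook_size, dim) for _ in range(N)]   # one codebook per stage

def rvq_encode(x):
    """x: (num_frames, dim) -> N token tensors, one per codebook."""
    residual = x
    all_tokens = []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=-1)   # quantize with this stage's codebook
        residual = residual - cb[idx]                    # keep only what this stage failed to capture
        all_tokens.append(idx)
    return all_tokens

tokens = rvq_encode(torch.randn(12, dim))   # N tensors of shape (12,): N discrete tokens per frame
```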

The coolest thing about RVQs is that they are designed to have a high inductive bias towards capturing the most essential content in the very first quantizer. In the subsequent quantizers, they learn more and more fine-grained features.

If you’re familiar with PCA, you can think of the first codebook as containing the primary principal components, capturing the most critical information. The subsequent codebooks represent higher-order components, containing information that adds more details.

Residual Vector Quantizers (RVQ) use multiple codebooks to encode the input vector, with one entry chosen from each codebook. (Screenshot from author’s video)

Acoustic vs Semantic Codebooks

Since Mimi is trained on the task of audio reconstruction, the encoder compresses the signal to the discretized latent space, and the decoder reconstructs it back from the latent space. When optimizing for this task, the RVQ codebooks learn to capture the essential acoustic content of the input audio inside the compressed latent space. 

Mimi also separately trains a single codebook (vanilla VQ) that only focuses on embedding the semantic content of the audio. This is why Mimi is called a split-RVQ tokenizer – it divides the quantization process into two independent parallel paths: one for semantic information and another for acoustic information.

The Mimi Architecture (Source: Moshi paper) License: Free

To train semantic representations, Mimi used knowledge distillation with an existing speech model called WavLM as a semantic teacher. Basically, Mimi introduces an additional loss function that decreases the cosine distance between the semantic RVQ code and the WavLM-generated embedding.
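As a rough sketch of this distillation loss (the projection layer and the 768-dimensional teacher features are assumptions for illustration, not Mimi's exact setup):

```python
import torch
import torch.nn.functional as F

proj = torch.nn.Linear(512, 768)   # maps codec-space vectors into an (assumed 768-dim) teacher space

def semantic_distillation_loss(semantic_codes, teacher_embeddings):
    """semantic_codes: (T, 512) from the semantic VQ; teacher_embeddings: (T, 768) from WavLM."""
    student = proj(semantic_codes)
    cos_sim = F.cosine_similarity(student, teacher_embeddings, dim=-1)
    return (1.0 - cos_sim).mean()   # 0 when the semantic codes align perfectly with the teacher

loss = semantic_distillation_loss(torch.randn(10, 512), torch.randn(10, 768))
```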


Audio Decoder

Given a conversation containing text and audio, we first convert them into a sequence of token embeddings using the text and audio tokenizers. This token sequence is then input into a transformer model as a time series. In the blog post, this model is referred to as the Autoregressive Backbone Transformer. Its task is to process this time series and output the “zeroth” codebook token.

A lightweight transformer called the audio decoder then reconstructs the next codebook tokens conditioned on this zeroth code generated by the backbone transformer. Note that the zeroth code already contains a lot of information about the history of the conversation, since the backbone transformer has visibility of the entire past sequence. The lightweight audio decoder only operates on the zeroth token and generates the other N-1 codes. These codes are generated using N-1 distinct linear layers that output the probability of choosing each code from their corresponding codebooks.

You can imagine this process as predicting a text token from the vocabulary in a text-only LLM. The difference is that a text-based LLM has a single vocabulary, whereas the RVQ tokenizer has multiple vocabularies in the form of the N codebooks, so you need to train a separate linear layer to model the codes for each.
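A minimal sketch of that idea, with illustrative sizes rather than Sesame's actual configuration, looks like this:

```python
import torch
import torch.nn as nn

N_CODEBOOKS, CODEBOOK_SIZE, D_MODEL = 32, 1024, 1024   # illustrative sizes

class AudioDecoderHeads(nn.Module):
    def __init__(self):
        super().__init__()
        # one classification head per remaining codebook (the zeroth code comes from the backbone)
        self.heads = nn.ModuleList(
            nn.Linear(D_MODEL, CODEBOOK_SIZE) for _ in range(N_CODEBOOKS - 1)
        )

    def forward(self, hidden_states):
        """hidden_states: (batch, N-1, d_model) decoder features, one per codebook level."""
        logits = [head(hidden_states[:, i]) for i, head in enumerate(self.heads)]
        return torch.stack(logits, dim=1)               # (batch, N-1, codebook_size)

heads = AudioDecoderHeads()
print(heads(torch.randn(2, N_CODEBOOKS - 1, D_MODEL)).shape)   # torch.Size([2, 31, 1024])
```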

The Sesame Architecture (Illustration by the author)

Finally, after the codewords are all generated, we aggregate them to form the combined continuous audio embedding. The final job is to convert this audio back to a waveform. For this, we apply transposed convolutional layers to upscale the embedding from 12.5 frames per second back to a 24 kHz waveform, basically reversing the transforms we applied during audio preprocessing.

In Summary

Check out the accompanying video on this article! (Video by author)

So, here is the overall summary of the Sesame model in some bullet points.

  1. Sesame is built on a multimodal Conversational Speech Model, or CSM.
  2. Text and audio are tokenized together to form a sequence of tokens and input into the backbone transformer, which autoregressively processes the sequence.
  3. While the text is processed like in any other text-based LLM, the audio is processed directly from its waveform representation. They use the Mimi encoder to convert the waveform into latent codes using a split-RVQ tokenizer.
  4. The multimodal backbone transformer consumes a sequence of tokens and predicts the next zeroth codeword.
  5. Another lightweight transformer called the Audio Decoder predicts the next codewords from the zeroth codeword.
  6. The final audio frame representation is generated by combining all the generated codewords and is upsampled back to the waveform representation.

Thanks for reading!

References and Must-read papers

Check out my ML YouTube Channel

Sesame Blogpost and Demo

Relevant papers: 
Moshi: https://arxiv.org/abs/2410.00037 
SoundStream: https://arxiv.org/abs/2107.03312 
HuBERT: https://arxiv.org/abs/2106.07447
Speech Tokenizer: https://arxiv.org/abs/2308.16692


The Ultimate Guide to RAGs – Each Component Dissected

A visual tour of what it takes to build production-ready LLM pipelines
Let’s learn RAGs! (Image by Author)

If you have worked with Large Language Models, there is a great chance that you have at least heard the term RAG – Retrieval Augmented Generation. The idea of RAG is pretty simple – suppose you want to ask an LLM a question. Instead of just relying on the LLM’s pre-trained knowledge, you first retrieve relevant information from an external knowledge base. This retrieved information is then provided to the LLM along with the question, allowing it to generate a more informed and up-to-date response.

Comparing standard LLM calls with RAG (Source: Image by Author)

So, why use Retrieval Augmented Generation?

When providing accurate and up-to-date information is key, you cannot rely on the LLM’s inbuilt knowledge. RAGs are a cheap, practical way to use LLMs to generate content about recent or niche topics without needing to finetune them on your own and burn away your life’s savings. Even when an LLM’s internal knowledge may be enough to answer a question, it might be a good idea to use RAG anyway, since recent studies have shown that it could help reduce LLM hallucinations.

The different components of a bare-bones RAG

Before we dive into the advanced portion of this article, let’s review the basics. Generally, RAGs consist of two pipelines – preprocessing and inferencing.

Inferencing is all about using data from your existing database to answer questions from a user query. Preprocessing is the process of setting up the database in the correct way so that retrieval is done correctly later on.

Here is a diagrammatic look into the entire basic barebones RAG pipeline.

The Basic RAG pipeline (Image by Author)

The Indexing or Preprocessing Steps

This is the offline preprocessing stage, where we would set up our database.

  1. Identify Data Source: Choose a relevant data source based on the application, such as Wikipedia, books, or manuals. Since this is domain dependent, I am going to skip over this step in this article. Go choose any data you want to use, knock yourself out!
  2. Chunking the Data: Break down the dataset into smaller, manageable documents or chunks.
  3. Convert to Searchable Format: Transform each chunk into a numerical vector or similar searchable representation.
  4. Insert into Database: Store these searchable chunks in a custom database, though external databases or search engines could also be used.

The Inferencing Steps

During the Query Inferencing stage, the following components stand out.

  1. Query Processing: A method to convert the user’s query into a format suitable for search.
  2. Retrieval/Search Strategy: A similarity search mechanism to retrieve the most relevant documents.
  3. Post-Retrieval Answer Generation: Use retrieved documents as context to generate the answer with an LLM.

Great – so we identified several key modules required to build a RAG. Believe it or not, each of these components has a lot of additional research behind it that can turn this simple RAG into CHAD-RAG. Let’s look into each of the major components in this list, starting with chunking.

By the way, this article is based on this 17-minute Youtube video I made on the same topic, covering all the topics in this article. Feel free to check it out after reading this Medium article!


1. Chunking

Chunking is the process of breaking down large documents into smaller, manageable pieces. It might sound simple, but trust me, the way you chunk your data can make or break your RAG pipeline. Whatever chunks you create during preprocessing will eventually get retrieved during inference. If you make the size of chunks too small – like each sentence – then it might be difficult to retrieve them through search because they capture very little information. If the chunk size is too big – like inserting entire Wikipedia articles – the retrieved passages might end up confusing the LLM because you are sending large bodies of texts at once.

Depending on the different levels of chunking, your results may vary! (Image by author)

Some frameworks use LLMs to do chunking, for example by extracting simple factoids or propositions from the text corpus, and treat them as documents. This could be expensive because the larger your dataset, the more LLM calls you’ll have to make.

Proposition Chunking (Source: Dense X Retrieval Paper. License: Free)

Structural Chunking

If your data has inherent boundaries (like HTML or Code), sometimes it is best to just utilize it. (Image by Author)

Quite often we may also deal with datasets that inherently have a known structure or format. For example, if you want to insert code into your database, you can simply split each script by the function names or class definitions. For HTML pages like Wikipedia articles, you can split by the heading tags – for example, split by the H2 tags to isolate each sub-chapter.

Contextual Chunking

But there are some glaring issues with the types of chunking we have discussed so far. Suppose your dataset consists of tens of thousands of paragraphs extracted from all the Sherlock Holmes books. Now the user queries something general like "What was the first crime in Study in Scarlet?" What do you think is going to happen?

The problem is that since each document is an isolated piece of information, we don’t know which chunks are from the book Study in Scarlet. Therefore, later on during retrieval, we will end up fetching a bunch of passages about the topic "crime" without knowing if they are relevant to the book. To resolve this, we can use something known as contextual chunking.

Enter Contextual Chunking

A recent blogpost from Anthropic describes it as prepending chunk-specific explanatory context to each chunk before embedding. Basically, while we are indexing, we also include additional information relevant to the chunk – like the name of the book, the chapter, and maybe a summary of the events in the book. Adding this context will allow the retriever to find references to Study in Scarlet and crimes when searching, hopefully getting the right documents from the database!

Contextual Chunking adds additional information to the chunks than just the text body (Image by author)
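As a minimal sketch of this idea (the `embed` and `db.insert` calls are hypothetical placeholders for your embedding model and vector database client, and the chunk text is illustrative):

```python
def build_contextual_chunk(chunk_text, book, chapter, summary):
    context = f"Book: {book}. Chapter: {chapter}. Context: {summary}"
    return f"{context}\n\n{chunk_text}"

chunk = "The body was discovered in an abandoned house at Lauriston Gardens..."
contextual_chunk = build_contextual_chunk(
    chunk,
    book="A Study in Scarlet",
    chapter="Chapter 3",
    summary="Holmes is called in to investigate the first murder of the novel.",
)
# vector = embed(contextual_chunk)   # embed the context together with the chunk text
# db.insert(vector, metadata={"book": "A Study in Scarlet", "text": chunk})
```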

There are other ways to solve the problem of finding the right documents – like metadata filtering. We will talk about this later when we discuss Databases.


2. Data Conversion

Next, we come to the data-conversion stage. Note that whatever strategy we used to convert the documents during preprocessing, we need to use it to search for similarity later, so these two components are tightly coupled.

Two of the most common approaches that have emerged in this space are embedding based methods and keyword-frequency based methods like TF-IDF or BM-25.

Embedding Based Methods

We’ll start with embedding-based methods. Here, we use pretrained transformer models to transform the text into high-dimensional vector representations that capture semantic meaning. Embeddings are great for capturing semantic relationships, handling synonyms, and understanding context-dependent meanings. However, embeddings can be computationally intensive to produce, and they can sometimes overlook exact matches that simpler methods would easily catch.

Embeddings (Image by Author)

When does Semantic Search fail?

For example, suppose you have a database of manuals containing information about specific refrigerators. When you ask a query mentioning a very specific niche model or a serial number, embeddings will fetch documents that kind of resemble your query, but may fail to exactly match it. This brings us to the alternative of embeddings retrieval – keyword based retrieval.

Keyword Based Methods

Two popular keyword-based methods are TF-IDF and BM25. These algorithms focus on statistical relationships between terms in documents and queries.

TF-IDF weighs the importance of a word based on its frequency in a document relative to its frequency in the entire corpus. Every document in our dataset is represented by an array of TF-IDF scores, one for each word in the vocabulary. The indices of the high values in this document vector tell us which words are likely to be most characteristic of that document’s content, because these words appear more frequently in this document and less frequently in others. For example, the documents related to this Godrej A241gX will have high TF-IDF scores for the terms Godrej and A241gX, making it more likely for us to retrieve them using TF-IDF.

TF-IDF relies on the ratio of the occurrence of terms in a document compared to the entire corpus. (Image by author)
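As a minimal sketch of keyword retrieval using scikit-learn's TfidfVectorizer (BM25 works along the same lines; the documents and query below are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Godrej A241gX service manual: compressor troubleshooting",
    "General guide to cleaning refrigerator coils",
    "Godrej A241gX installation and warranty information",
]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)     # one sparse TF-IDF vector per document

query_vector = vectorizer.transform(["Godrej A241gX compressor noise"])
scores = cosine_similarity(query_vector, doc_vectors)[0]
print(scores.argsort()[::-1])                    # exact keyword overlap pushes docs 0 and 2 to the top
```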

BM25, an evolution of TF-IDF, incorporates document length normalization and term saturation. Length normalization means that it adjusts the score based on whether the document itself is longer or shorter than the average document length in the collection. Term saturation means that as a particular word appears more and more times in a document, its contribution to the score stops growing linearly.

TF-IDF and BM25 are great at finding documents that contain specific keywords exactly. Embeddings are great for finding documents with similar semantic meaning.

A common thing these days is to retrieve using both keyword and embedding based methods, and combine them, giving us the best of both worlds. Later on when we discuss Reciprocal Rank Fusion and Deduplication, we will look into how to combine these different retrieval methods.


3. Databases

Up next, let’s talk about Databases. The most common type of database used in RAGs is the vector database. Vector databases store documents by indexing them with their vector representation, be it from an embedding or TF-IDF. Vector databases specialize in fast similarity checks against query vectors, making them ideal for RAG. Popular vector databases that you may want to look into are Pinecone, Milvus, ChromaDB, and MongoDB; they all have their own pros, cons, and pricing models.

An alternative to vector databases are graph databases. Graph databases store information as a network of documents with each document connected to others through relationships.

Modern Vector Databases allow attribute filtering with semantic search (Image by Author)

Many modern vector and graph databases also allow properties from relational databases, most notably metadata or attribute filtering. If you know the question is about the 5th Harry Potter book, it would be really nice to first filter your entire database to only contain documents from the 5th Harry Potter book, rather than running an embedding search through the entire dataset. Optimal metadata filtering in vector databases is a pretty amazing area of Computer Science research, and a separate article would be best for an in-depth discussion about it.


4. Query transformation

Next, let’s move to inferencing starting with query transformation – which is any preprocessing step we do to the user’s actual query before doing any similarity search. Think of it like improving the user’s question to get better answers.

In general, we want to avoid searching directly with the user query. User inputs are usually very noisy and they can type random stuff – we want an additional transformation layer that interprets the user query and turns it into a search query.

A simple example why Query Rewriting is important

The most common technique to do this transformation is query rewriting. Imagine someone asks, "What happened to the artist who painted the Mona Lisa?" If we do semantic or keyword searches, the retrieved information will be all about the Mona Lisa, not about the artist. A query rewriting system would use an LLM to rewrite this query. The LLM might transform this into "Leonardo da Vinci Mona Lisa artist", which will be a much more fruitful search.

Direct Retrieval vs Query Rewriting (Image by Author)
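A minimal sketch of such a rewriting step might look like the following, where `call_llm` is a hypothetical placeholder for whatever LLM client you use and the prompt wording is just illustrative:

```python
REWRITE_PROMPT = """Rewrite the user's question into a short keyword-style search query.
Resolve indirect references to their actual entities.

User question: {question}
Search query:"""

def rewrite_query(question, call_llm):
    # call_llm is a hypothetical placeholder: any function that takes a prompt and returns text
    return call_llm(REWRITE_PROMPT.format(question=question)).strip()

# rewrite_query("What happened to the artist who painted the Mona Lisa?", call_llm)
# -> e.g. "Leonardo da Vinci Mona Lisa artist"
```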

Sometimes we also use contextual query rewriting, where we bring in additional context – for example, the user’s earlier conversation transcript. Or, if we know that our application covers documents from 10 different books, we can have a classifier LLM detect which of the 10 books the user query is about. If our database is in a different language, we can also translate the query.

There are also powerful techniques like HYDE, which stands for Hypothetical Document Embeddings. HYDE uses a language model to generate a hypothetical answer to the query, and then does a similarity search with this hypothetical answer to retrieve relevant documents.

Hypothetical Document Embeddings (Image by Author)

Another technique is Multi-Query Expansion, where we generate multiple queries from the single user query and perform parallel searches to retrieve multiple sets of documents. The retrieved documents can then go through a de-duplication step or rank fusion to remove redundant documents.

A recent approach called Astute RAG tries to consolidate externally input knowledge with the LLM’s own internal knowledge before generating answers. There are also Multi-Hop techniques like Baleen programs. They work by performing an initial search, analyzing the top results to find frequently co-occurring terms, and then adding these terms to the original query. This adaptive approach can help bridge the vocabulary gap between user queries and document content, and help retrieve better documents.


5. Post Retrieval Processing

Now that we’ve retrieved our potentially relevant documents, we can add another post-retrieval processing step before feeding information to our language model for generating the answer.

For example, we can do information selection and emphasis, where an LLM selects the portions of the retrieved documents that could be useful for finding the answer. We might highlight key sentences, do semantic filtering where we remove unimportant paragraphs, or do context summarization by fusing multiple documents into one. The goal here is to avoid overwhelming our LLM with too much information, which could lead to less focused or accurate responses.

You can use smaller LLMs to flag relevant info from retrieved documents before consolidating the context prompt for the final LLM call (Image by Author)

Often we do multiple queries with query expansion, or use multiple retrieval algorithms like Embeddings+BM-25 to separately fetch multiple documents. To remove duplicates, we often use reranking methods like Reciprocal Rank Fusion. RRF combines the rankings from all the different approaches, giving higher weight to documents that consistently rank well across multiple methods. In the end, the top K high ranking documents are passed to the LLM.

Reciprocal Rank Fusion is a classic Search Engine algorithm to combine item ranks obtained from multiple ranking algorithms (Image by author)
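RRF itself is only a few lines of code. Here is a minimal sketch (k=60 is the constant commonly used in the original RRF formulation; the document ids are illustrative):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """ranked_lists: list of rankings (each a list of doc ids ordered best-first)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)   # fused ranking, best first

embedding_results = ["doc3", "doc1", "doc7"]
bm25_results = ["doc1", "doc9", "doc3"]
print(reciprocal_rank_fusion([embedding_results, bm25_results]))
# ['doc1', 'doc3', 'doc9', 'doc7'] - documents that rank well across both methods rise to the top
```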

FLARE, or Forward-Looking Active Retrieval augmented generation, is an iterative post-retrieval strategy. Starting with the user input and initial retrieval results, an LLM iteratively guesses the next sentence. Then we check if the generated guess contains any low-probability tokens – if so, we call the retriever to fetch useful documents from the dataset and make the necessary corrections.


Final Thoughts

For a more visual breakdown of the different components of RAGs, do check out my Youtube video on this topic. The field of LLMs and RAGs is rapidly evolving – a thorough understanding of the RAG framework is essential to appreciate the pros and cons of each approach and to weigh which approaches work best for YOUR use-case. The next time you are thinking of designing a RAG system, do stop and ask yourself these questions –

  • What are my data sources?
  • How should I chunk my data? Is there inherent structure that comes with my data domain? Do my chunks need additional context (contextual chunking)?
  • Do I need semantic retrieval (embeddings) or more exact-match retrieval (BM-25)? What type of queries am I expecting from the user?
  • What database should I use? Is my data a graph? Does it need metadata-filtering? How much money do I want to spend on databases?
  • How can I best rewrite the user query for easy search hits? Can an LLM rewrite the queries? Should I use HYDE? If LLMs already have enough domain knowledge about my target field, can I use Astute?
  • Can I combine multiple different retrieval algorithms and then do rank fusion? (honestly, just do it if you can afford it cost-wise and latency-wise)

The Author

Check out my Youtube channel where I post content about Deep Learning, Machine Learning, Paper Reviews, Tutorials, and just about anything related to AI (except news, there are WAY too many Youtube channels for AI news). Here are some of my links:

Youtube Channel: https://www.youtube.com/@avb_fj

Patreon: https://www.patreon.com/c/NeuralBreakdownwithAVB

Give me a follow on Medium and a clap if you enjoyed this!

References

Vector Databases: https://superlinked.com/vector-db-comparison

Metadata Filtering: https://www.pinecone.io/learn/vector-search-filtering/

Contextual Chunking: https://www.anthropic.com/news/contextual-retrieval

Propositions / Dense X Retrieval: https://arxiv.org/pdf/2312.06648

Hypothetical Document Embeddings (HYDE): https://arxiv.org/abs/2212.10496

FLARE: https://arxiv.org/abs/2305.06983

The Evolution of Text to Video Models
Simplifying the neural nets behind Generative Video Diffusion

We’ve witnessed remarkable strides in AI image generation. But what happens when we add the dimension of time? Videos are moving images, after all.

Text-to-video generation is a complex task that requires AI to understand not just what things look like, but how they move and interact over time. It is an order of magnitude more complex than text-to-image.

To produce a coherent video, a neural network must:

  1. Comprehend the input prompt
  2. Understand how the world works
  3. Know how objects move and how physics applies
  4. Generate a sequence of frames that make sense spatially, temporally, and logically

Despite these challenges, today’s diffusion neural networks are making impressive progress in this field. In this article, we will cover the main ideas behind video diffusion models – main challenges, approaches, and the seminal papers in the field.

Also, this article is based on this larger YouTube video I made. If you enjoy this read, you will enjoy watching the video too.

Text to Image Overview

To understand text-to-video generation, we need to start with its predecessor: text-to-image diffusion models. These models have a singular goal – to transform random noise and a text prompt into a coherent image. In general, all generative image models do this – Variational Autoencoders (VAE), Generative Adversarial Neural Nets (GANs), and yes, Diffusion too.

The basic goal of all image generation models is to convert random noise into an image, often conditioned on additional conditioning prompts (like text). [Image by Author]

Diffusion, in particular, relies on a gradual denoising process to generate images.

  1. Start with a randomly generated noisy image
  2. Use a neural network to progressively remove noise
  3. Condition the denoising process on text input
  4. Repeat until a clear image emerges
How Diffusion Models generate images – A neural network progressively removes noise from a pure noise image conditioned on a text prompt, eventually revealing a clear image. [Illustration by Author] (Image generated by a neural network)

But how are these denoising neural networks trained?

During training, we start with real images and progressively add noise to them in small steps – this is called forward diffusion. This generates many samples of clear images and their slightly noisier versions. The neural network is then trained to reverse this process by inputting the noisy image and predicting how much noise to remove to retrieve the clearer version. In text-conditional models, we train attention layers to attend to the inputted prompt for guided denoising.

During training, we add noise to clear images (left) – this is called Forward Diffusion. The neural network is trained to reverse this noise addition process – a process known as Reverse Diffusion. Images generated using a neural network. [Image by Author]
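As a minimal sketch of one such training step (the `model`, the linear noise schedule, and the tensor shapes are placeholders, not any specific paper's configuration):

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # a common linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)     # cumulative signal-retention factors

def training_step(model, clean_images, text_embeddings):
    t = torch.randint(0, T, (clean_images.shape[0],))
    noise = torch.randn_like(clean_images)
    a = alpha_bars[t].view(-1, 1, 1, 1)
    noisy_images = a.sqrt() * clean_images + (1 - a).sqrt() * noise   # forward diffusion
    predicted_noise = model(noisy_images, t, text_embeddings)         # the network learns reverse diffusion
    return F.mse_loss(predicted_noise, noise)
```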

This iterative approach allows for the generation of highly detailed and diverse images. You can watch the following YouTube video where I explain text to image in much more detail – concepts like Forward and Reverse Diffusion, U-Net, CLIP models, and how I implemented them in Python and Pytorch from scratch.

If you are comfortable with the core concepts of Text-to-Image Conditional Diffusion, let’s move to videos next.

The Temporal Dimension: A New Frontier

In theory, we could follow the same conditioned noise-removal idea to do text-to-video diffusion. However, adding time into the equation introduces several new challenges:

  1. Temporal Consistency: Ensuring objects, backgrounds, and motions remain coherent across frames.
  2. Computational Demands: Generating multiple frames per second instead of a single image.
  3. Data Scarcity: While large image-text datasets are readily available, high-quality video-text datasets are scarce.
Some commonly used video-text datasets [Image by Author]

The Evolution of Video Diffusion

Because of the lack of high-quality datasets, text-to-video cannot rely just on supervised training. That is why people usually also combine two more data sources to train video diffusion models: one, paired image-text data, which is much more readily available; and two, unlabelled video data, which is super-abundant and contains lots of information about how the world works. Several groundbreaking models have emerged to tackle these challenges. Let’s discuss some of the important milestone papers one by one.

We are about to get into the technical nitty gritty! If you find the material ahead difficult, feel free to watch this companion video as a visual side-by-side guide while reading the next section.

Video Diffusion Model (VDM) – 2022

VDM uses a 3D U-Net architecture with factorized spatio-temporal convolution layers. Each term is explained in the picture below.

What each of the terms mean (Image by Author)

VDM is jointly trained on both image and video data. VDM replaces the 2D UNets from Image Diffusion models with 3D UNet models. The video is input into the model as a time sequence of 2D frames. The term Factorized basically means that the spatial and temporal layers are decoupled and processed separately from each other. This makes the computations much more efficient.

What is a 3D-UNet?

3D U-Net is a unique Computer Vision neural network that first downsamples the video through a series of these factorized spatio-temporal convolutional layers, basically extracting video features at different resolutions. Then, there is an upsampling path that expands the low-dimensional features back to the shape of the original video. While upsampling, skip connections are used to reuse the generated features during the downsampling path.

The 3D Factorized UNet Architecture [Image by Author]

Remember that in any convolutional neural network, the earlier layers always capture detailed information about local sections of the image, while later layers pick up global-level patterns by accessing larger sections – so by using skip connections, U-Net combines local details with global features, making it a super-awesome network for feature learning and denoising.
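To make "factorized" concrete, here is a minimal sketch of one such block: a spatial convolution followed by a temporal one, instead of a single full 3D convolution. The channel sizes are illustrative, not VDM's exact configuration.

```python
import torch
import torch.nn as nn

class FactorizedSpatioTemporalConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))    # per-frame
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))  # across frames

    def forward(self, x):                       # x: (batch, channels, time, height, width)
        return self.temporal(torch.relu(self.spatial(x)))

video = torch.randn(1, 3, 16, 64, 64)           # 16 RGB frames of 64x64
print(FactorizedSpatioTemporalConv(3, 32)(video).shape)   # torch.Size([1, 32, 16, 64, 64])
```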

VDM is jointly trained on paired image-text and video-text datasets. While it’s a great proof of concept, VDM generates quite low-resolution videos for today’s standards.

You can read more about VDM here.

Make-A-Video (Meta AI) – 2022

Make-A-Video by Meta AI takes the bold approach of claiming that we don’t necessarily need labeled-video data to train video Diffusion Models. WHHAAA?! Yes, you read that right.

Adding temporal layers to Image Diffusion

Make A Video first trains a regular text-to-image diffusion model, just like Dall-E or Stable Diffusion with paired image-text data. Next, unsupervised learning is done on unlabelled video data to teach the model temporal relationships. The additional layers of the network are trained using a technique called masked spatio-temporal decoding, where the network learns to generate missing frames by processing the visible frames. Note that no labelled video data is needed in this pipeline (although further video-text fine-tuning is possible as an additional third step), because the model learns spatio-temporal relationships with paired text-image and raw unlabelled video data.

Make-A-Video in a nutshell [Image by Author]

The video outputted by the above model is 64×64 with 16 frames. This video is then upsampled along the time and pixel axis using separate neural networks called Temporal Super Resolution or TSR (insert new frames between existing frames to increase frames-per-second (fps)), and Spatial Super Resolution or SSR (super-scale the individual frames of the video to be higher resolution). After these steps, Make-A-Video outputs 256×256 videos with 76 frames.

You can learn more about Make-A-Video right here.

Imagen Video (Google) – 2022

Imagen Video employs a cascade of seven models for video generation and enhancement. The process starts with a base video generation model that creates low-resolution video clips. This is followed by a series of super-resolution models – three SSR (Spatial Super Resolution) models for spatial upscaling and three TSR (Temporal Super Resolution) models for temporal upscaling. This cascaded approach allows Imagen Video to generate high-quality, high-resolution videos with impressive temporal consistency.

The Imagen workflow [Source: Imagen paper: https://imagen.research.google/video/paper.pdf]

VideoLDM (NVIDIA) – 2023

Models like Nvidia’s VideoLDM try to address the temporal consistency issue by using latent diffusion modelling. First, they train a latent diffusion image generator. The basic idea is to train a Variational Autoencoder, or VAE. The VAE consists of an encoder network that can compress input frames into a low-dimensional latent space and a decoder network that can reconstruct them back into the original images. The diffusion process is done entirely in this low-dimensional space instead of the full pixel space, making it much more computationally efficient and semantically powerful.

A typical Autoencoder. The input frames are individually downsampled into a low dimensional compressed latent space. A Decoder network then learns to reconstruct the image back from this low resolution space. [Image by Author]

What are Latent Diffusion Models?

The diffusion model is trained entirely in the low-dimensional latent space, i.e. the diffusion model learns to denoise the low-dimensional latent-space images instead of the full-resolution frames. This is why we call them Latent Diffusion Models. The resulting latent-space outputs are then passed through the VAE decoder to convert them back to pixel space.

The decoder of the VAE is enhanced by adding new temporal layers in between its spatial layers. These temporal layers are fine-tuned on video data, making the VAE produce temporally consistent and flicker-free videos from the latents generated by the image diffusion model. This is done by freezing the spatial layers of the decoder and adding new trainable temporal layers that are conditioned on previously generated frames.

The VAE Decoder is finetuned with temporal information so that it can produce consistent videos from the latents generated by the Latent Diffusion Model (LDM) [Source: Video LDM Paper https://arxiv.org/abs/2304.08818]

You can learn more about Video LDMs here.

SORA (OpenAI) – 2024

While Video LDM compresses individual frames of the video to train an LDM, SORA compresses video both spatially and temporally. Recent papers like CogVideoX have demonstrated that 3D Causal VAEs are great at compressing videos making diffusion training computationally efficient, and able to generate flicker-free consistent videos.

3D VAEs compress videos spatio-temporally to generate compressed 4D representations of video data [Image by Author]

Transformers for Diffusion

A transformer model is used as the diffusion network instead of the more traditional U-Net model. Of course, transformers need the input data to be presented as a sequence of tokens. That’s why the compressed video encodings are flattened into a sequence of patches. Observe that each patch and its location in the sequence represents a spatio-temporal feature of the original video.

OpenAI SORA Video Preprocessing [Source: OpenAI (https://openai.com/index/sora/)] (License: Free)
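A minimal sketch of this patchification step (the latent shape and patch size are illustrative, since OpenAI has not published SORA's exact configuration):

```python
import torch

latent = torch.randn(1, 16, 8, 32, 32)   # (batch, channels, time, height, width) from a 3D VAE
p = 2                                    # patch size along time, height, and width

def patchify(x, p):
    b, c, t, h, w = x.shape
    x = x.reshape(b, c, t // p, p, h // p, p, w // p, p)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)          # group each patch's values together
    return x.reshape(b, (t // p) * (h // p) * (w // p), c * p * p * p)

tokens = patchify(latent, p)
print(tokens.shape)                      # torch.Size([1, 1024, 128]): a 1D sequence of patch tokens
```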

It is speculated that OpenAI has collected a rather large annotation dataset of video-text data which they are using to train conditional video generation models.

Combining all the strengths listed below, plus more tricks that the ironically-named OpenAI may never disclose, SORA promises to be a giant leap in video generation AI models.

  1. Massive video-text annotated dataset + pretraining techniques with image-text data and unlabelled data
  2. General architectures of Transformers
  3. Huge compute investment (thanks Microsoft)
  4. The representation power of Latent Diffusion Modeling.

What’s next

The future of AI is easy to predict. In 2024, Data + Compute = Intelligence. Large corporations will invest computing resources to train large diffusion transformers. They will hire annotators to label high-quality video-text data. Large-scale text-video datasets probably already exist in the closed-source domain (looking at you OpenAI), and they may become open-source within the next 2–3 years, especially with recent advancements in AI video understanding. It remains to be seen if the upcoming huge computing and financial investments could on their own solve video generation. Or will further architectural and algorithmic advancements be needed from the research community?

Links

Author’s Youtube Channel: https://www.youtube.com/@avb_fj

Video on this topic: https://youtu.be/KRTEOkYftUY

15-step Zero-to-Hero on Conditional Image Diffusion: https://youtu.be/w8YQc

Papers and Articles

Video Diffusion Models: https://arxiv.org/abs/2204.03458

Imagen: https://imagen.research.google/video/

Make A Video: https://makeavideo.studio/

Video LDM: https://research.nvidia.com/labs/toronto-ai/VideoLDM/index.html

CogVideoX: https://arxiv.org/abs/2408.06072

OpenAI SORA article: https://openai.com/index/sora/

Diffusion Transformers: https://arxiv.org/abs/2212.09748

Useful article: https://lilianweng.github.io/posts/2024-04-12-diffusion-video/

Segment Anything 2: What Is the Secret Sauce? (A Deep Learner’s Guide)
Foundation + Promptable + Interactive + Video. How?

Meta just released the Segment Anything 2 model or SAM 2 – a neural network that can segment not just images, but entire videos as well. SAM 2 is a promptable interactive foundation segmentation model. Being promptable means you can click or drag bounding boxes on one or more objects you want to segment, and SAM2 will be able to predict a mask singling out the object and track it across the input clip. Being interactive means you can edit the prompts on the fly, like adding new prompts in different frames – and the segments will adjust accordingly! Lastly, being a foundation segmentation model means that it is trained on a massive corpus of data and can be applied for a large variety of use-cases.

Note that this article is a "Deep Learning Guide", so we will primarily be focusing on the network architecture behind SAM-2. If you are a visual learner, you might want to check out the YouTube video that this article is based on.

Promptable Visual Segmentation (PVS)

SAM-2 focuses on the PVS or Prompt-able Visual Segmentation task. Given an input video and a user prompt – like point clicks, boxes, or masks – the network must predict a masklet, which is another term for a spatio-temporal mask. Once a masklet is predicted, it can be iteratively refined by providing more prompts in additional frames – through positive or negative clicks – to interactively update the segmented mask.

The Original Segment Anything Model

SAM-2 builds up on the original SAM architecture which was an image segmentation model. Let’s do a quick recap about the basic architecture of the original SAM model.

SAM-1 architecture – The Image Encoder encodes the input image. The Prompt Encoder encodes the input prompt, and the Mask Decoder contextualizes the prompt and image embeddings together to predict the segmentation mask of the queried object. Mask Decoder also outputs IOU Scores, which are not shown above for simplicity. (Illustration by the author)
  1. The Image Encoder processes the input image to create general image embeddings. These embeddings are unconditional on the prompts.
  2. The Prompt Encoder processes the user input prompts to create prompt embeddings. Prompt embeddings are unconditional on the input image.
  3. The Mask Decoder inputs the unconditional image and prompt embeddings and applies cross-attention and self-attention blocks to contextualize them with each other. From the resulting contextual embeddings, multiple segmentation masks are generated
  4. Multiple Output Segmentation Masks are predicted by the Mask Decoder. These masks often represent the whole, part, or subpart of the queried object and help resolve ambiguity that might arise due to user prompts (image below).
  5. Intersection-Over-Union scores are predicted for each output segmentation mask. These IoU scores denote the confidence score SAM has for each predicted mask to be "correct". Meaning if SAM predicts a high IoU score for Mask 1, then Mask 1 is probably the right mask.
SAM predicts 3 segmentation masks often denoting the "whole", "part", and "subpart" of the queried object. For each predicted mask SAM predicts an IOU score as well to estimate the amount of overlap the generated mask will have with the object of interest (Source: Image by author)

So what does SAM-2 do differently to adapt the above architecture for videos? Let’s discuss.

Frame Encoder

The input video is first divided into multiple frames, and each of the frames is independently encoded using a Vision Transformer-based masked autoencoder Computer Vision model, called the Hiera architecture. We will worry about the exact architecture of this transformer later; for now, just imagine it as a black box that inputs a single frame as an image and outputs a multi-channel feature map of shape 256x64x64. All the frames of the video are processed the same way by the encoder.

The frames from the input video are embedded using a pretrained Vision Transformer (Image by the Author)

Notice that these embeddings do not consider the video sequence – they are just independent frame embeddings, meaning they don’t have access to other frames in the video. Secondly, just like SAM-1 they do not consider the input prompt at all – meaning that the resultant output is treated as a universal representation of the input frame and completely unconditional on the input prompts.

The advantage of this is that if the user adds new prompts in future frames, we do not need to run the frames through the image encoder again. The image encoder runs just once for each frame, and the results are cached and reused for all types of input prompts. This design decision makes SAM-2 run at interactive speeds – because the heavy work of encoding images only needs to happen once per video input.

Quick notes about the Hiera Architecture

The exact nature of the image encoder is an implementation detail – as long as the encoder is good enough and trained on a huge corpus of images, it is fine. The Hiera architecture is a hierarchical vision transformer, meaning that spatial resolution is reduced and the feature dimensions are increased as the network deepens. These models are trained on the task of masked autoencoding, where the image inputted into the network is divided into multiple patches and some patches are randomly grayed out – the Hiera model then learns to reconstruct the original image back from the remaining patches that it can see.

The Hiera Model returns general purpose image embeddings that can be used for a variety of downstream computer vision tasks! (Source: here)

Because masked autoencoding is self-supervised, meaning all the labels are generated from the source image itself, we can easily train these large encoders on massive image datasets without needing to manually label them. Masked autoencoders tend to learn general embeddings about images that can be used for a bunch of downstream tasks. These general-purpose embedding capabilities make this the architecture of choice for SAM’s frame encoder.

Prompt Encoder

Just like the original SAM model, input prompts can come from point clicks, boxes, and segmentation masks. The prompt encoder’s job is to encode these prompts by converting them into a representative vector representation. Here’s a video of the original SAM architecture that goes into the details about how prompt encoders work.

The Prompt Encoder converts each prompt into a sequence of shape N_tokens x 256. For example:

  • To encode a click – The positional encoding of the x and y coordinate of the click is used as one of the tokens in the prompt sequence. The "type" of click (foreground/positive or background/negative) is also included in the representation.
  • To encode a bounding box – The positional encoding for the top-left and bottom-right points is used.
Encoding Prompts (Illustration by the Author)

A note on Dense Prompt Encodings

Point clicks and bounding boxes are sparse prompt encodings, but we can also input entire segmentation masks, which are a form of dense prompt encoding. Dense prompt encodings are rarely used during inference – but during training, they are used to iteratively train the SAM model. The training methods of SAM are beyond the scope of this article, but for those curious, here is my attempt to explain an entire article in one paragraph.

SAM is trained using iterative segmentation. During training, when SAM outputs a segmentation mask, we input it back into SAM as a dense prompt along with refinement clicks (sparse prompts) simulated from the ground truth and the prediction. The mask decoder uses the sparse and dense prompts and learns to output a new, refined segmentation mask. During inference, only sparse prompts are used, and segmentation masks are predicted in one pass (without feeding back the dense mask).

Maybe one day I’ll write an article about iterative segmentation training, but for now, let’s move on with our exploration of SAM’s network architecture.

Prompt encodings have largely remained the same in SAM-2. The only difference is that they must run separately for all frames that the user prompts.

Mask Decoder (Teaser)

Umm… before we get into mask decoders, let’s talk about the concept of Memory in SAM-2. For now, let’s just assume that the Mask Decoder inputs a bunch of things (trust me we will talk about what these things are in a minute) and outputs segmentation masks.

Memory Encoder and Memory Bank

After the mask decoder generates an output mask, the mask is passed through a memory encoder to obtain a memory embedding. A new memory is created after each frame is processed. These memory embeddings are appended to a Memory Bank, which is a first-in-first-out (FIFO) queue of the latest memories generated during video decoding.

For each frame, the Memory Encoder inputs the output of the Mask Decoder Output Mask and converts it into a memory. This memory is then inserted into a FIFO queue called the Memory Bank. Note that this image does not show the Memory Attention module – we will talk about it later in the article (Source: Image by author)

Memory Encoder

The output masks are first downsampled using a convolutional layer, the unconditional image encoding is added to this output, and the sum is passed through lightweight convolution layers to fuse the information. The resulting spatial feature map is called the memory. You can think of the memory as a representation of the original input frame and the mask generated for it at a given time in the video.

Memory Encoder (Image by the Author)
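Here is a minimal sketch of that flow, assuming 256-channel image embeddings at 1/16th resolution and 64-channel memories. The exact channel counts, layer counts, and downsampling schedule inside SAM-2 are not reproduced here, so treat these numbers as placeholders.

```python
import torch
import torch.nn as nn

class MemoryEncoderSketch(nn.Module):
    """Rough sketch: downsample the predicted mask, add the (unconditioned)
    image embedding, and fuse with lightweight convolutions to form a memory."""
    def __init__(self, embed_dim=256, mem_dim=64):
        super().__init__()
        # Downsample the full-resolution mask to the (assumed) 1/16 embedding resolution
        self.mask_downsampler = nn.Conv2d(1, embed_dim, kernel_size=16, stride=16)
        self.fuse = nn.Sequential(                 # lightweight fusion convs
            nn.Conv2d(embed_dim, mem_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(mem_dim, mem_dim, kernel_size=3, padding=1),
        )

    def forward(self, pred_mask, image_embedding):
        # pred_mask: (B, 1, H, W); image_embedding: (B, 256, H/16, W/16)
        x = self.mask_downsampler(pred_mask) + image_embedding
        return self.fuse(x)                        # the "memory": (B, 64, H/16, W/16)

mem_enc = MemoryEncoderSketch()
memory = mem_enc(torch.rand(1, 1, 1024, 1024), torch.randn(1, 256, 64, 64))
print(memory.shape)                                # torch.Size([1, 64, 64, 64])
```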

Memory Bank

The memory bank contains the following:

  • The most recent N memories are stored in a queue
  • The last M prompts input by the user, to keep track of multiple previous prompts.
  • The mask decoder output tokens for each frame are also stored – which are like object pointers that capture high-level semantic information about the object to segment.
Memory Bank (Image by the Author)

Memory Attention

We now have a way to save historical information in a Memory Bank. Now we need to use this information while generating segmentation masks for future frames. This is achieved using Memory Attention. The role of memory attention is to condition the current frame features on the Memory Bank features before they are input into the Mask Decoder.

The Memory Attention block contextualizes the image encodings with the memory bank. These contextualized image embeddings are then inputted into the mask decoder to generate segmentation masks. (Image by Author)

The Memory attention block first performs self-attention with the frame embeddings and then performs cross-attention between the image embeddings and the contents of the memory bank. The unconditional image embeddings therefore get contextualized with the previous output masks, previous input prompts, and object pointers.

The Memory Attention Block (Image by Author)

During the self-attention and cross-attention layers, in addition to the usual sinusoidal position embeddings, 2D rotary positional embeddings are also used. Without getting into extra details – rotary positional embeddings capture relative relationships between tokens, and 2D rotary positional embeddings work well for images because they help to model spatial relationships both horizontally and vertically.
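A rough sketch of one such block is shown below. It only shows the self-attention / cross-attention / MLP skeleton: the rotary positional embeddings, the number of blocks, and the exact way memories and object pointers are flattened and projected into a token sequence are all simplifications and assumptions on my part.

```python
import torch
import torch.nn as nn

class MemoryAttentionBlockSketch(nn.Module):
    """Sketch of one memory-attention block: self-attention over the current frame's
    tokens, then cross-attention into the memory bank contents."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, frame_tokens, memory_tokens):
        # frame_tokens:  (B, N_img, 256) flattened image embeddings of the current frame
        # memory_tokens: (B, N_mem, 256) flattened memories + object pointers
        #                (assumed already projected to 256 channels)
        x = frame_tokens
        x = x + self.self_attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        x = x + self.cross_attn(self.norm2(x), memory_tokens, memory_tokens)[0]
        x = x + self.mlp(self.norm3(x))
        return x                                   # memory-conditioned image tokens

block = MemoryAttentionBlockSketch()
out = block(torch.randn(1, 1024, 256), torch.randn(1, 2048, 256))  # toy sizes
print(out.shape)                                   # torch.Size([1, 1024, 256])
```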

Mask Decoder (For real this time)

The Mask Decoder inputs the memory-contextualized image encodings output by the Memory Attention block, plus the prompt encodings, and outputs the segmentation masks, IOU scores, and (the brand new) occlusion scores.

SAM-2 Mask Decoder (Source: SAM-2 Paper here)

The Mask Decoder uses self-attention and cross-attention mechanisms to contextualize the prompt tokens with the (memory-conditioned) image embeddings. This allows the image and prompts to be "married" together into one context-aware sequence. These embeddings are then used to produce segmentation masks, IOU scores, and occlusion scores. During a video, the queried object can become occluded when it is blocked by another object in the scene – the occlusion score, a new addition in SAM-2, predicts whether the queried object is present at all in the current frame.

To recap, just like the IOU scores, SAM generates an occlusion score for each of the three predicted masks. The three IOU scores tell us how confident SAM is for each of the three predicted masks – and the three occlusion scores tell us how likely SAM thinks that the corresponding object is present in the scene.

The outputs from SAM-2 Mask Decoders (Image by Author)

Final Thoughts

So… that was an outline of the network architecture behind SAM-2. For the frames extracted from a video (often at 6 FPS), each frame is encoded using the Frame/Image Encoder; Memory Attention contextualizes the image encodings with the Memory Bank; prompts, if present, are encoded; the Mask Decoder then combines the image and prompt embeddings and produces output masks, IOU scores, and occlusion scores; the Memory Encoder turns the result into a memory that is appended to the Memory Bank; and the whole process repeats for the next frame.

The SAM-2 Algorithm (Image by the Author)

There are still many things to discuss about SAM-2 like how it trains interactively to be a promptable model, and how their data engine works to create training data. I hope to cover these topics in a separate article! You can watch the SAM-2 video on my YouTube channel above for more information and a visual tour of the systems that empower SAM-2.

Thanks for reading! Throw a clap and a follow!

References / Where to go next

The post Segment Anything 2: What Is the Secret Sauce? (A Deep Learner’s Guide) appeared first on Towards Data Science.

]]>
Monocular Depth Estimation with Depth Anything V2 https://towardsdatascience.com/monocular-depth-estimation-with-depth-anything-v2-54b6775abc9f/ Wed, 24 Jul 2024 06:24:11 +0000 https://towardsdatascience.com/monocular-depth-estimation-with-depth-anything-v2-54b6775abc9f/ How do neural networks learn to estimate depth from 2D images?

The post Monocular Depth Estimation with Depth Anything V2 appeared first on Towards Data Science.

]]>
What is Monocular Depth Estimation?
The Depth Anything V2 Algorithm (Illustration by Author)

Monocular Depth Estimation (MDE) is the task of training a neural network to determine depth information from a single image. This is an exciting and challenging area of Machine Learning and Computer Vision because predicting a depth map requires the neural network to form a 3-dimensional understanding from just a 2-dimensional image.

In this article, we will discuss a new model called Depth Anything V2 and its precursor, Depth Anything V1. Depth Anything V2 has outperformed nearly all other models in Depth Estimation, showing impressive results on tricky images.

Depth Anything V2 Demo (Source: Screen recording by the author from Depth Anything V2 DEMO page)

This article is based on a video I made on the same topic. Here is a video link for learners who prefer a visual medium. For those who prefer reading, continue!

Why should we even care about MDE models?

Good MDE models have many practical uses, such as aiding navigation and obstacle avoidance for robots, drones, and autonomous vehicles. They can also be used in video and image editing, background replacement, object removal, and creating 3D effects. Additionally, they are useful for AR and VR headsets to create interactive 3D spaces around the user.

There are two main approaches for doing MDE (this article only covers one)

Two main approaches have emerged for training MDE models – one, discriminative approaches where the network tries to predict depth as a supervised learning objective, and two, generative approaches like conditional diffusion where depth prediction is an iterative image generation task. Depth Anything belongs to the first category of discriminative approaches, and that's what we will be discussing today. Welcome to Neural Breakdown, and let's go deep with Depth Estimation!

Traditional Datasets and the MiDAS paper

To fully understand Depth Anything, let’s first revisit the MiDAS paper from 2019, which serves as a precursor to the Depth Anything algorithm.

Source: Screenshot taken from the MIDAS Paper (License: Free)

MiDAS trains an MDE model using a combination of different datasets containing labeled depth information. For instance, the KITTI dataset for autonomous driving provides outdoor images, while the NYU-Depth V2 dataset offers indoor scenes. Understanding how these datasets are collected is crucial because newer models like Depth Anything and Depth Anything V2 address several issues inherent in the data collection process.

How real-world depth datasets are collected

These datasets are typically collected using stereo cameras, where two or more cameras placed at fixed distances capture images simultaneously from slightly different perspectives, allowing for depth information extraction. The NYU-Depth V2 dataset uses RGB-D cameras that capture depth values along with pixel colors. Some datasets utilize LiDAR, projecting laser beams to capture 3D information about a scene.

However, these methods come with several problems. The amount of labeled data is limited due to the high operational costs of obtaining these datasets. Additionally, the annotations can be noisy and low-resolution. Stereo cameras struggle under various lighting conditions and can’t reliably identify transparent or highly reflective surfaces. LiDAR is expensive, and both LiDAR and RGB-D cameras have limited range and generate low-resolution, sparse depth maps.

Can we use Unlabelled Images to learn Depth Estimation?

It would be beneficial to use unlabeled images to train depth estimation models, given the abundance of such images available online. The major innovation proposed in the original Depth Anything paper from 2023 was the incorporation of these unlabeled datasets into the training pipeline. In the next section, we’ll explore how this was achieved.


Depth Anything Architecture

The original Depth Anything (V1) model from 2023 was trained in a three-step process. Let’s get a high-level overview of the algorithm before diving into each section.

Depth Anything V1 Algorithm (Illustration made by the Author)

Step 1: Teacher Training

First, a neural network called the TEACHER model is trained for supervised depth estimation using five different publicly available datasets.

Converting from Depth to Disparity Space

The TEACHER model is initialized with a pre-trained Dino-V2 encoder and then trained on the combined labeled dataset. A major challenge with training on multiple datasets is the variability in absolute depths. To address this, the depths are inverted into disparity space (d = 1 / t) and normalized between 0 and 1 for each depth map – 1 for the nearest pixel and 0 for the farthest. This way, all datasets share the same output space, allowing the model to predict disparity.

Different Depth Estimation datasets provide depths at different scales. We need to align them to have the same output space. Disparity lets us normalize all depth values between 0 and 1 (Illustration by Author)
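Here is a small sketch of this conversion; the epsilon guard and the per-map min-max normalization are my own simplifications of the idea.

```python
import numpy as np

def depth_to_normalized_disparity(depth, eps=1e-6):
    """Convert a metric depth map to disparity (1 / depth) and normalize it to [0, 1]
    per map: 1 for the nearest pixel, 0 for the farthest."""
    disparity = 1.0 / np.maximum(depth, eps)
    d_min, d_max = disparity.min(), disparity.max()
    return (disparity - d_min) / (d_max - d_min + eps)

# Two datasets with wildly different depth scales now share the same output space.
indoor = np.random.uniform(0.5, 10.0, size=(480, 640))      # metres
outdoor = np.random.uniform(2.0, 80.0, size=(375, 1242))    # metres
print(depth_to_normalized_disparity(indoor).max(),          # 1.0
      depth_to_normalized_disparity(outdoor).min())         # 0.0
```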

Two loss functions are used to train these models: a scale-shift invariant loss and a gradient-matching loss, both also utilized in the MiDAS paper from 2019.

  1. Scale-shift invariant loss

There is a problem with using a simple mean square error loss between the predicted and ground truth images. Let’s say the ground truth depth values of three pixels in an image are 1, 0.5, and 0.1, while our network predicts 0.9, 0.6, and 0.3. Although the predictions aren’t exact, the relationship between the predicted and ground truth depths is similar, differing only by a multiplicative and additive factor. We don’t want this scale and shift to affect our loss function – we need to align the two maps before applying the mean square error loss.

Scale and Shift Invariant Loss (Illustration by Author)

The MiDaS paper proposes normalizing the ground truth and predicted depths to have zero translation and unit scale. The median and deviation are calculated, and the depth maps are scaled and shifted accordingly. Once aligned, the mean square error loss is applied.

The SSI Loss (Source: MiDAS Paper) (License: Free)
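Below is a minimal sketch of this loss: each map is aligned using its median and mean absolute deviation before the mean squared error is applied, as described above. The trimming and masking details of the actual MiDaS loss are omitted.

```python
import torch

def align(d):
    # Align each disparity map to zero translation and unit scale using its
    # median and mean absolute deviation, as described in the MiDaS paper.
    d = d.flatten(1)                                      # (B, H*W)
    t = d.median(dim=1, keepdim=True).values              # per-map shift
    s = (d - t).abs().mean(dim=1, keepdim=True) + 1e-6    # per-map scale
    return (d - t) / s

def ssi_loss(pred, target):
    # Scale-and-shift invariant loss: align both maps, then mean squared error.
    return ((align(pred) - align(target)) ** 2).mean()

pred = torch.rand(4, 192, 256)
target = pred * 0.7 + 0.1                                 # same structure, different scale/shift
print(ssi_loss(pred, target).item())                      # ~0: the affine difference is ignored
print(ssi_loss(pred, torch.rand(4, 192, 256)).item())     # much larger for unrelated maps
```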

2. Gradient Matching Loss

Without Gradient Matching Loss depth maps may become too smudgy and less sharp (Illustration by Author)

Using only the SSI loss might result in smoothed depth maps that fail to capture sharp distinctions between adjacent pixels. Gradient Matching Loss helps preserve these details by aligning the gradients of the predicted depth map with those of the ground truth.

First, we calculate the gradients of the predicted and ground truth depth maps across the x and y axes, then apply the loss at the gradient level. MiDaS also uses a multi-scale gradient matching loss with four scale levels. The predicted and ground truth depth maps are downsampled four times, and the loss is applied at each resolution.

Gradient Matching Loss. This loss is applied at multiple downscaled depth maps (not shown above) (Illustration by Author)
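A compact sketch of the multi-scale gradient matching term is below; the exact downsampling operator and per-scale weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def gradient_matching_loss(pred, target, scales=4):
    """Multi-scale gradient matching loss (sketch). At each scale, the x/y gradients
    of the prediction error are penalized, then both maps are downsampled 2x."""
    loss = 0.0
    for _ in range(scales):
        diff = pred - target
        grad_x = (diff[:, :, 1:] - diff[:, :, :-1]).abs().mean()
        grad_y = (diff[:, 1:, :] - diff[:, :-1, :]).abs().mean()
        loss = loss + grad_x + grad_y
        # Downsample both maps for the next scale
        pred = F.avg_pool2d(pred.unsqueeze(1), 2).squeeze(1)
        target = F.avg_pool2d(target.unsqueeze(1), 2).squeeze(1)
    return loss

print(gradient_matching_loss(torch.rand(2, 192, 256), torch.rand(2, 192, 256)))
```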

The final loss is the weighted sum of the scale-and-shift invariant loss and the multi-scale gradient matching loss. While the SSI loss encourages the model to learn general relative depth relationships, the gradient matching loss helps preserve sharp edges and fine-grained information in the scene.

The Loss functions used to train Depth Estimation models in MIDAS and Depth Anything V1 (Illustration by Author)

Step 2 – Pseudo-Labelling Unlabelled Dataset

With our trained TEACHER model, we can now annotate millions of unlabeled images to create a massive pseudo-depth label dataset. These labels are called pseudo because they are AI-generated and may not represent the actual ground truth depth. We now have a lot of (pseudo) labeled images to train a new network.

Pseudo – Labelling images (Note that this screen is actually from the Depth Anything V2 paper and not V1) Source: Depth Anything V2 Paper (License: Free)

Step 3 – Training Student Network

Flashback to the Depth Anything V1 algorithm. We are in Step 3 now. (Illustration made by the author)

We will be training a new neural network (the student network) on the combination of the labeled and pseudo-labeled datasets. However, simply training the network on the annotations provided by the Teacher Network won’t improve the model beyond the capabilities of the base Teacher model. To make the student network more capable, two strategies were employed: heavy perturbations with image augmentations and introducing an auxiliary semantic preservation loss.

Heavy Perturbations

One interesting perturbation used was the Cut Mix operation. This involves combining a random pair of unlabeled images using a binary mask, replacing a rectangular portion of image A with image B. The final loss is the combined SSI and Gradient Matching loss of the two sections from the two ground truth depth maps. These spatial distortions are also combined with color distortions to help the Student Network handle the diversity of open-world images.

The Cut Mix Operation (Illustration by Author)
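Here is a rough sketch of the idea. The rectangle sampling and the way the mask is returned for region-wise loss computation are my own simplifications, not the exact recipe from the paper.

```python
import torch

def cutmix_depth(img_a, img_b, depth_a, depth_b):
    """Sketch of the CutMix perturbation for depth training: paste a random rectangle
    of image B onto image A, and return the binary mask so the SSI / gradient losses
    can be computed separately on the two regions against their own (pseudo) labels."""
    _, H, W = img_a.shape
    h = torch.randint(H // 4, H // 2, (1,)).item()
    w = torch.randint(W // 4, W // 2, (1,)).item()
    y = torch.randint(0, H - h, (1,)).item()
    x = torch.randint(0, W - w, (1,)).item()

    mask = torch.zeros(H, W)
    mask[y:y + h, x:x + w] = 1.0                           # 1 where image B is pasted

    mixed_img = img_a * (1 - mask) + img_b * mask
    mixed_depth = depth_a * (1 - mask) + depth_b * mask    # simplification: labels mix the same way
    return mixed_img, mixed_depth, mask

img = torch.rand(3, 224, 224)
mixed, depth, m = cutmix_depth(img, torch.rand(3, 224, 224),
                               torch.rand(224, 224), torch.rand(224, 224))
print(mixed.shape, depth.shape, m.sum() > 0)
```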

Auxiliary Semantic Preservation Loss

The network is also trained with an auxiliary task called Semantic Assisted Perception. A strong pre-trained computer vision model like Dino-V2, which has been trained on millions of images in a self-supervised manner, is used. Given an image, we aim to reduce the cosine distance between the embeddings produced by our new Student model and the pre-trained Dino-V2 encoder. This enables our Student model to capture some of the semantic perception capabilities of the larger and more general Dino-V2 model, which it uses to predict the depth map.

Semantic Assisted Perception (Illustration by author)
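A minimal sketch of this auxiliary loss is below, assuming the student and the frozen DINOv2 encoder produce token features of the same shape (in practice a projection layer would be needed if they don't).

```python
import torch
import torch.nn.functional as F

def feature_alignment_loss(student_features, dino_features):
    """Auxiliary semantic preservation loss (sketch): minimize the cosine distance
    between the student's features and frozen DINOv2 features.
    The (B, N_tokens, C) shape here is an assumption."""
    cos = F.cosine_similarity(student_features, dino_features, dim=-1)  # (B, N_tokens)
    return (1.0 - cos).mean()

student = torch.randn(2, 256, 768, requires_grad=True)
with torch.no_grad():                      # the DINOv2 encoder is frozen
    dino = torch.randn(2, 256, 768)
loss = feature_alignment_loss(student, dino)
loss.backward()
print(loss.item())
```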

By combining spatial distortions, semantic-assisted perception, and the power of both labeled and unlabeled datasets, the Student Network generalizes better and outperforms the original Teacher Network in depth estimation! Here are some incredible results from the Depth Anything V1 model!


Depth Anything V2

As impressive as Depth Anything V1’s results are, it struggles with transparent objects and capturing fine-grained details. The authors of Depth Anything V2 suggest that the biggest bottleneck for model performance isn’t the architecture itself, but the quality of the data. Most labeled datasets captured with sensors can be quite noisy, ignore fine-grained details, generate low-resolution depth maps, and struggle with lighting conditions and reflective/transparent objects.

Issues with real-world sensor datasets (Illustration by Author)

Depth Anything V2 discards labeled datasets from real-world sensors like stereo cameras, LiDAR, and RGB-D cameras, instead using only synthetic datasets. Synthetic datasets are generated through graphics engines, not captured with equipment. An example is the Virtual KITTI dataset, which uses the Unity Game Engine to create rendered images and depth maps for automated driving. There are also indoor datasets like IRS and Hyper-sim. Depth Anything V2 uses five synthetic datasets containing close to 595K photorealistic images.

Synthetic Datasets vs Real World Sensor Datasets

Synthetic images do have their pros and cons. They are super accurate, they have high-resolution outputs that capture the finest of the details, and the depth of transparent and reflective surfaces can be easily obtained. Synthetic datasets have direct access to all the 3D information needed since the graphics engine itself creates the scene.

On the cons side, these images may not faithfully capture the scenes we will encounter in real-world scenarios. The scene coverage of these datasets isn't particularly diverse either, and covers a much smaller subset of real-world imagery. Depth Anything V2 combines the power of synthetic images with millions of unlabelled images to train an MDE model that outperforms pretty much everything else we have seen so far.

The pros and cons of Synthetic or Computer Generated Datasets (Illustration by Author)

Much like V1, the Teacher model in V2 is first trained on labeled datasets. However, in V2, it is exclusively trained on synthetic datasets. In Step 2, the Teacher model assigns pseudo-depth labels to all unlabeled images. Finally, in Step 3, the Student model is trained exclusively on pseudo-labeled images – no real labeled datasets and no synthetic datasets. The synthetic datasets are not used at this stage due to the distribution shift mentioned earlier. The Student network is trained on real-world images annotated by the Teacher model. Just like in V1, the auxiliary semantic preservation loss is used along with the Scale-and-Shift invariant and gradient matching loss.

The Depth Anything V2 architecture (Illustration by the Author)

Video link explaining the concepts here visually

Here is a video that explains all the concepts discussed in this article in a step-by-step manner.


Depth Anything V1 vs Depth Anything V2

The original Depth Anything emphasized the importance of using unlabeled images in the MDE training pipeline. It introduced the knowledge distillation pipeline with Teacher training, pseudo-labeling unlabeled images, and then training the Student network on a combination of labeled and unlabeled images. The use of strong spatial and color distortions, and a semantic-assisted perception loss, helped create more general and robust embeddings. This resulted in efficient and high-quality depth maps for complex scenes. However, Depth Anything V1 still struggled with reflective surfaces and fine details due to noisy and low-resolution depth labels from real-world sensors.

Depth Anything V2 improved performance by ignoring real-world sensor datasets and only using synthetic images generated with graphics engines to train the Teacher Network. The Teacher Network then annotates millions of unlabeled images, and the Student Network is trained solely on these pseudo-labeled datasets with real-world images. With these techniques, Depth Anything V2 can now predict fine-level depth maps and handle transparent and reflective surfaces more effectively.


Relevant Links

MiDAS: https://arxiv.org/abs/1907.01341
Depth Anything: https://depth-anything.github.io/
Depth Anything V2: https://depth-anything-v2.github.io/

KITTI Dataset: https://www.cvlibs.net/datasets/kitti/
NYU-Depth V2: https://cs.nyu.edu/~fergus/datasets/nyu_depth_v2.html
Virtual KITTI: https://datasetninja.com/virtual-kitti

Youtube Video: https://youtu.be/sz30TDttIBA

The post Monocular Depth Estimation with Depth Anything V2 appeared first on Towards Data Science.

]]>
The History of Convolutional Neural Networks for Image Classification (1989- Today) https://towardsdatascience.com/the-history-of-convolutional-neural-networks-for-image-classification-1989-today-5ea8a5c5fe20/ Fri, 28 Jun 2024 01:51:23 +0000 https://towardsdatascience.com/the-history-of-convolutional-neural-networks-for-image-classification-1989-today-5ea8a5c5fe20/ A tour through the history of Computer Vision!

The post The History of Convolutional Neural Networks for Image Classification (1989- Today) appeared first on Towards Data Science.

]]>
The History of Convolutional Neural Networks for Image Classification (1989 – Today)

A visual tour of the greatest innovations in Deep Learning and Computer Vision.

Before CNNs, the standard way to train a neural network to classify images was to flatten the image into a list of pixels and pass it through a feed-forward neural network to output the image's class. The problem with flattening the image is that the essential spatial information in the image is discarded.

In 1989, Yann LeCun and team introduced Convolutional Neural Networks – the backbone of Computer Vision research for the last 15 years! Unlike feedforward networks, CNNs preserve the 2D nature of images and are capable of processing information spatially!

In this article, we are going to go through the history of CNNs specifically for Image Classification tasks – starting from those early research years in the 90's to the golden era of the mid-2010s when many of the most ingenious Deep Learning architectures ever were conceived, and finally discuss the latest trends in CNN research now as they compete with attention and vision transformers.

_Check out the YouTube video that explains all the concepts in this article visually with animations. Unless otherwise specified, all the images and illustrations used in this article were generated by myself while creating the video version._

The papers we will be discussing today!

The Basics of Convolutional Neural Networks

At the heart of a CNN is the convolution operation. We scan the filter across the image and calculate the dot product of the filter with the image at each overlapping location. This resulting output is called a feature map and it captures how much and where the filter pattern is present in the image.

How Convolution works – The kernel slides over the input image and calculates the overlap (dot-product) at each location – outputting a feature map in the end!
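For intuition, here is a naive NumPy version of the operation described above. The hand-written vertical-edge kernel is used purely as an example – in a real CNN the kernel values are learned.

```python
import numpy as np

def convolve2d(image, kernel):
    """Naive 2D convolution (really cross-correlation, as used in CNNs):
    slide the kernel over the image and take a dot product at every location."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)
vertical_edge_kernel = np.array([[1, 0, -1],
                                 [1, 0, -1],
                                 [1, 0, -1]])
feature_map = convolve2d(image, vertical_edge_kernel)
print(feature_map.shape)    # (6, 6): strong responses where vertical edges occur
```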

In a convolution layer, we train multiple filters that extract different feature maps from the input image. When we stack multiple convolutional layers in sequence with some non-linearity, we get a convolutional neural network (CNN).

So each convolution layer simultaneously does 2 things –

  1. spatial filtering with the convolution operation between images and kernels, and
  2. combining the multiple input channels and output a new set of channels.

90 percent of the research in CNNs has been to modify or to improve just these two things.

The two main things CNN do

The 1989 Paper

This 1989 paper taught us how to train non-linear CNNs from scratch using backpropagation. The authors input 16×16 grayscale images of handwritten digits and pass them through two convolutional layers with 12 filters of size 5×5. The filters also move with a stride of 2 during scanning – strided convolution is useful for downsampling the input image. After the conv layers, the output maps are flattened and passed through two fully connected layers to output the probabilities for the 10 digits. Using the softmax cross-entropy loss, the network is optimized to predict the correct labels for the handwritten digits. After each layer, the tanh nonlinearity is also used – allowing the learned feature maps to be more complex and expressive. With just 9760 parameters, this was a very small network compared to today's networks, which contain hundreds of millions of parameters.

The OG CNN architecture from 1989
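Here is a modern PyTorch approximation of the network described above. The original used hand-designed connection tables and locally shared weights, so the parameter count won't match the 9,760 figure exactly, and the hidden layer size below is an assumption.

```python
import torch
import torch.nn as nn

# Approximation of the 1989 network: 16x16 grayscale input, two strided 5x5 conv
# layers with 12 filters each, tanh nonlinearities, and two fully connected layers
# producing 10 digit scores.
net_1989 = nn.Sequential(
    nn.Conv2d(1, 12, kernel_size=5, stride=2, padding=2),   # 16x16 -> 8x8
    nn.Tanh(),
    nn.Conv2d(12, 12, kernel_size=5, stride=2, padding=2),  # 8x8 -> 4x4
    nn.Tanh(),
    nn.Flatten(),
    nn.Linear(12 * 4 * 4, 30),                               # hidden size assumed
    nn.Tanh(),
    nn.Linear(30, 10),                                       # one score per digit
)

logits = net_1989(torch.randn(1, 1, 16, 16))
print(logits.shape)                                          # torch.Size([1, 10])
print(sum(p.numel() for p in net_1989.parameters()), "parameters")
```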

Inductive Bias

Inductive Bias is a concept in Machine Learning where we deliberately introduce specific rules and limitations into the learning process to move our models away from generalizations and steer more toward solutions that follow our human-like understanding.

When humans classify images, we also do spatial filtering to look for common patterns to form multiple representations and then combine them together to form our predictions. The CNN architecture is designed to replicate just that. In feedforward networks, each pixel is treated like its own isolated feature, as each neuron in the layers connects with all the pixels – in CNNs there is more parameter-sharing because the same filter scans the entire image. Inductive biases also make CNNs less data-hungry, because they get local pattern recognition for free due to the network design, whereas feedforward networks need to spend their training cycles learning it from scratch.


Le-Net 5 (1998)

Lenet-5 architecture (Credit: Le-Net-5 paper)

In 1998, Yann LeCun and team published Le-Net 5 – a deeper and larger 7-layer CNN. They also use Max Pooling, which downsamples the image by grabbing the maximum values from a 2×2 sliding window.

How Max Pooling works (LEFT) and how Local Receptive Fields increase as CNNs add more layers (RIGHT)

Local Receptive Field

Notice when you train a 3×3 conv layer, each neuron is connected to a 3×3 region in the original image – this is the neuron’s local receptive field – the region of the image where this neuron extracts patterns from.

When we pass this feature map through another 3×3 layer, each neuron in the new feature map indirectly has a receptive field covering a larger 5×5 region of the original image. Additionally, when we downsample the image through max-pooling or strided convolution, the receptive field also increases – making deeper layers access the input image more and more globally.

For this reason, earlier layers in a CNN can only pick up low-level details like specific edges or corners, and the later layers pick up more spread-out, global-level patterns.


The Drought (1998–2012)

As impressive as Le-Net-5 was, researchers in the early 2000s still deemed neural networks too computationally expensive and data-hungry to train. There were also problems with overfitting – where a complex neural network will just memorize the entire training dataset and fail to generalize on new, unseen datasets. The researchers instead focused on traditional machine learning algorithms like support vector machines, which were showing much better performance on the smaller datasets of the time with much less computational demand.

ImageNet Dataset (2009)

The ImageNet dataset was open-sourced in 2009 – it contained 3.2 million annotated images at the time, covering over 1000 different classes. Today it has over 14 million images and over 20,000 different annotated classes. Every year from 2010 to 2017 we got this massive competition called the ILSVRC, where different research groups would publish models to beat the benchmarks on a subset of the ImageNet dataset. In 2010 and 2011, traditional ML methods like Support Vector Machines were winning – but starting from 2012 it was all about CNNs. The metric used to rank different networks was generally the top-5 error rate – measuring the percentage of times that the true class label was not in the top 5 classes predicted by the network.


AlexNet (2012)

AlexNet, introduced by Dr. Geoffrey Hinton and his team, was the winner of ILSVRC 2012 with a top-5 test set error of 17%. Here are the three main contributions from AlexNet.

1. Multi-scaled Kernels

AlexNet trained on 224×224 RGB images and used multiple kernel sizes in the network – an 11×11, a 5×5, and a 3×3 kernel. Models like Le-Net 5 only used 5×5 kernels. Larger kernels are more computationally expensive because they train more weights, but also capture more global patterns from the image. Because of these large kernels, AlexNet had over 60 million trainable parameters. All that complexity can however lead to overfitting.

AlexNet starts with larger kernels (11×11) and reduces the size (to 5×5 and 3×3) for deeper layers (Image by the author)

2. Dropout

To alleviate overfitting, AlexNet used a regularization technique called Dropout. During training, a random fraction of the neurons in each layer is set to zero. This prevents the network from being too reliant on specific neurons or groups of neurons for generating a prediction and instead encourages all the neurons to learn general, meaningful features useful for classification.

3. RELU

AlexNet also replaced the tanh nonlinearity with ReLU. ReLU is an activation function that turns negative values to zero and keeps positive values as-is. The tanh function tends to saturate for deep networks because the gradients get small when the value of x goes too high or too low, making optimization slow. ReLU offers a steady gradient signal and trained the network about 6 times faster than tanh.

RELU, TANH, and How much difference RELU makes (Image credits: Middle: Artificial Intelligence for Big Data, Right: Alex-Net paper)

AlexNet also introduced the concept of Local Response Normalization and strategies for distributed CNN training.


GoogleNet / Inception (2014)

In 2014, the GoogleNet paper achieved an ImageNet top-5 error rate of 6.67%. The core component of GoogLeNet was the inception module. Each inception module consists of parallel convolutional layers with different filter sizes (1×1, 3×3, 5×5) and max-pooling layers. Inception applies these kernels to the same input and then concatenates their outputs, combining both low-level and medium-level features.

An Inception Module

1×1 Convolution

They also use 1×1 convolutional layers. Each 1×1 kernel first scales the input channels and then combines them. 1×1 kernels multiply each pixel with a fixed value – which is why they are also called pointwise convolutions.

While larger kernels like 3×3 and 5×5 kernels do both spatial filtering and channel combination, 1×1 kernels are only good for channel mixing, and they do so very efficiently with a lower number of weights. For example, a 1×1 convolution layer with 3 input channels and 4 output channels trains only (1×1 x 3×4 =) 12 weights – but with 3×3 kernels, we would train (3×3 x 3×4 =) 108 weights.

1×1 kernels versus larger kernels (LEFT) and Dimensionality reduction with 1×1 kernels (RIGHT)

Dimensionality Reduction

GoogleNet uses 1×1 conv layers as a dimensionality reduction method to reduce the number of channels before running spatial filtering with the 3×3 and 5×5 convolutions on these lower dimensional feature maps. This helps them to cut down on the number of trainable weights compared to AlexNet.

VGGNet (2014)

The VGG Network claims that we do not need larger kernels like 5×5 or 7×7 – all we need are 3×3 kernels. Two stacked 3×3 convolutional layers have the same receptive field over the image that a single 5×5 layer does. Three 3×3 layers have the same receptive field that a single 7×7 layer does.

Deep 3×3 Convolution Layers capture the same receptive field as larger kernels but with fewer parameters!

One 5×5 filter trains 25 weights – while two 3×3 filters train 18 weights. Similarly, one 7×7 filter trains 49 weights, while three 3×3 filters train just 27. Training with deep 3×3 convolution layers became the standard for a long time in CNN architectures.

Batch Normalization (2015)

Deep neural networks can suffer from a problem known as "Internal Covariate Shift" during training. Since the earlier layers of the network are constantly training, the later layers need to continuously adapt to the constantly shifting input distribution they receive from the previous layers.

Batch Normalization aims to counteract this problem by normalizing the inputs of each layer to have zero mean and unit standard deviation during training. A batch normalization, or BN, layer can be applied after any convolution layer. During training, it subtracts the mean of each feature map along the minibatch dimension and divides it by the standard deviation. This means that each layer will now see a more stationary, unit-Gaussian-like distribution during training.
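Here is what that normalization looks like in code. This is a training-time sketch only; a real nn.BatchNorm2d layer also keeps running averages of these statistics for use at inference time.

```python
import torch

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """What a BN layer does at training time (sketch): normalize each channel over the
    (batch, height, width) dimensions, then re-scale and re-shift with learned gamma/beta."""
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

x = torch.randn(8, 16, 32, 32) * 5 + 3                     # feature maps with shifted statistics
gamma = torch.ones(1, 16, 1, 1)                            # learnable scale (identity here)
beta = torch.zeros(1, 16, 1, 1)                            # learnable shift (zero here)
y = batch_norm_2d(x, gamma, beta)
print(y.mean().item(), y.std().item())                     # ~0 and ~1
```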

Advantages of Batch Norm

  1. converge around 14 times faster
  2. let us use higher learning rates, and
  3. makes the network robust to the initial weights of the network.

ResNets (2016)

Deep Networks struggle to do Identity Mapping

Imagine you have a shallow neural network that has great accuracy on a classification task. Turns out that if we added 100 new convolution layers on top of this network, the training accuracy of the model could go down!

This is quite counter-intuitive because all these new layers need to do is copy the output of the shallow network at each layer – and at least be able to match the original accuracy. In reality, deep networks can be notoriously difficult to train because gradients can saturate or become unstable when backpropagating through many layers. With ReLU and batch norm, we were able to train 22-layer deep CNNs at this point – then the good folks at Microsoft introduced ResNets in 2015, which allowed us to stably train 150-layer CNNs. What did they do?

Residual learning

The input passes through one or more CNN layers as usual, but at the end, the original input is added back to the final output. These blocks are called residual blocks because they don't need to learn the final output feature maps in the traditional sense – they only learn the residual features that must be added to the input to get the final feature maps. If the weights in the middle layers were to turn themselves to ZERO, then the residual block would just return the identity function – meaning it would be able to easily copy the input X.

Residual Networks
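Here is a minimal residual block in PyTorch. The channel count and the conv/BN/ReLU ordering follow the common "basic block" pattern rather than being a line-by-line reproduction of the paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block: two 3x3 convs whose output is *added* to the input.
    If the convs learn nothing (weights near zero), the block falls back to identity."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(x + residual)            # shortcut: add the input back

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)                 # torch.Size([1, 64, 56, 56])
```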

Easy Gradient Flow

During backpropagation, gradients can directly flow through these shortcut paths to reach the earlier layers of the model faster, helping to prevent vanishing gradient issues. ResNet stacks many of these blocks together to form really deep networks without any loss of accuracy!

From the ResNet paper

And with this remarkable improvement, ResNets managed to train a 152-layered model that got a top-5 error rate that shattered all previous records!


DenseNet (2017)

DenseNets also add shortcut paths connecting earlier layers with later layers in the network. A DenseNet block trains a series of convolution layers, and the output of every layer is concatenated with the feature maps of every previous layer in the block before passing to the next layer. Each layer adds only a small number of new feature maps to the "collective knowledge" of the network as the image flows through the network. DenseNets have an improved flow of information and gradients throughout the network because each layer has direct access to the gradients from the loss function.

Dense Nets
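A compact sketch of a dense block is below. The growth rate, the layer count, and the omission of the 1×1 bottleneck convolutions that real DenseNets use are simplifications.

```python
import torch
import torch.nn as nn

class DenseBlockSketch(nn.Module):
    """Sketch of a dense block: every layer receives the concatenation of all previous
    feature maps and contributes `growth_rate` new channels to the collective knowledge."""
    def __init__(self, in_channels, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate, 3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))   # each layer sees all previous maps
            features.append(out)
        return torch.cat(features, dim=1)

x = torch.randn(1, 64, 28, 28)
print(DenseBlockSketch(64)(x).shape)                   # torch.Size([1, 192, 28, 28])
```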

Squeeze and Excitation Network (2017)

SEN-NET, which introduced the Squeeze and Excitation layer into CNNs, was the winner of the final ILSVRC competition in 2017. The SE block is designed to explicitly model the dependencies between all the channels of a feature map. In normal CNNs, each channel of a feature map is computed independently of the others; SEN-Net applies a self-attention-like method to make each channel of a feature map contextually aware of the global properties of the input image. One of the 154-layered SenNet + ResNet models got a ridiculous top-5 error rate of 4.47%.

SEN-NET

Squeeze Operation

The squeeze operation compresses the spatial dimensions of the input feature map into a channel descriptor using global average pooling. Since each channel contains neurons that capture local properties of the image, the squeeze operation accumulates global information about each channel.

Excitation Operation

The excitation operation rescales the input feature maps by channel-wise multiplication with the channel descriptors obtained from the squeeze operation. This effectively propagates global-level information to each channel – contextualizing each channel with the rest of the channels in the feature map.

Squeeze and Excitation Block
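Here is a minimal SE block; the reduction ratio of 16 is a commonly used default, and everything else is a straightforward reading of the squeeze/excite description above.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: squeeze spatial dims with global average pooling,
    excite by predicting a per-channel weight, and rescale the input feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)                 # (B, C, H, W) -> (B, C, 1, 1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                           # channel-wise rescaling

x = torch.randn(2, 256, 14, 14)
print(SEBlock(256)(x).shape)                                   # torch.Size([2, 256, 14, 14])
```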

MobileNet (2017)

Convolution layers do two things – 1) filtering spatial information and 2) combining them channel-wise. The MobileNet paper uses Depthwise Separable Convolution, a technique that separates these two operations into two different layers – depthwise convolution for filtering and pointwise convolution for channel combination.

Depthwise Convolution

Depthwise Separable Convolution

Given an input set of feature maps with M channels, first, they use depthwise convolution layers that train M 3×3 convolutional kernels. Unlike normal convolution layers that perform convolution on all feature maps, depthwise convolution layers train filters that perform convolution on just one feature map each. Secondly, they use 1×1 pointwise convolution filters to mix all these feature maps. Separating the filtering and combining steps like this drastically reduces the number of weights, making it super lightweight while still retaining the performance.

Why Depthwise Separable Layers reduce training weights
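The parameter savings are easy to verify in code; the layer sizes below are arbitrary examples.

```python
import torch
import torch.nn as nn

def depthwise_separable_conv(in_channels, out_channels):
    """Depthwise separable convolution: a per-channel 3x3 depthwise conv for spatial
    filtering, followed by a 1x1 pointwise conv for channel mixing."""
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, 3, padding=1, groups=in_channels),  # depthwise
        nn.Conv2d(in_channels, out_channels, 1),                                 # pointwise
    )

standard = nn.Conv2d(64, 128, 3, padding=1)
separable = depthwise_separable_conv(64, 128)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), "vs", count(separable))   # 73,856 vs 8,960 parameters
x = torch.randn(1, 64, 32, 32)
print(separable(x).shape)                        # torch.Size([1, 128, 32, 32])
```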

MobileNetV2 (2019)

In 2018, MobileNetV2 improved the MobileNet architecture by introducing two more innovations: Linear Bottlenecks and Inverted residuals.

Linear Bottlenecks

MobileNetV2 uses 1×1 pointwise convolution for dimensionality reduction, followed by depthwise convolution layers for spatial filtering, and another 1×1 pointwise convolution layer to expand the channels back. These bottlenecks don't pass through ReLU and are instead kept linear. ReLU zeros out all the negative values that come out of the dimensionality reduction step – and this can cause the network to lose valuable information, especially if the bulk of this lower-dimensional subspace is negative. Linear layers prevent the loss of excessive information during this bottleneck.

The width of each feature map is intended to show the relative channel dimensions.

Inverted Residuals

The second innovation is called Inverted Residuals. Generally, residual connections occur between the layers with the most channels, but the authors add shortcuts between the bottleneck layers. The bottleneck captures the relevant information within a low-dimensional latent space, and the free flow of information and gradients between these layers is the most crucial.


Vision Transformers (2020)

Vision Transformers, or ViTs, established that transformers can indeed beat state-of-the-art CNNs in Image Classification. Transformers and attention mechanisms provide a highly parallelizable, scalable, and general architecture for modeling sequences. Neural Attention is a whole different area of Deep Learning, which we won't get into in this article, but feel free to learn more in this YouTube video.

ViTs use Patch Embeddings and Self-Attention

The input image is first divided into a sequence of fixed-size patches. Each patch is independently embedded into a fixed-size vector either through a CNN or passing through a linear layer. These patch embeddings and their positional encodings are then inputted as a sequence of tokens into a self-attention-based transformer encoder. Self-attention models the relationships between all the patches, and outputs new updated patch embeddings that are contextually aware of the entire image.

Vision Transformers. Each self-attention layer further contextualizes each patch embedding with the global context of the image
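Here is a minimal sketch of the patchification step. The 16×16 patch size and 768-dimensional tokens follow the standard ViT-Base configuration; the positional embedding is initialized to zeros purely for brevity, and the class token used for classification is omitted.

```python
import torch
import torch.nn as nn

class PatchEmbeddingSketch(nn.Module):
    """Sketch of ViT patchification: a strided conv cuts the image into 16x16 patches
    and linearly projects each one into a token; positional embeddings are then added."""
    def __init__(self, img_size=224, patch_size=16, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, 196, 768)
        return tokens + self.pos_embed                     # ready for the transformer encoder

tokens = PatchEmbeddingSketch()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                                        # torch.Size([1, 196, 768])
```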

Inductive Bias vs Generality

Where CNNs introduce several inductive biases about images, Transformers do the opposite – no localization, no sliding kernels – they rely on generality and raw compute to model the relationships between all the patches of the image. The self-attention layers allow global connectivity between all patches of the image irrespective of how far apart they are spatially. Inductive biases are great on smaller datasets, but the promise of Transformers is that on massive training datasets, a general framework will eventually beat out the inductive biases offered by CNNs.

Convolution Layers vs Self-Attention Layers

ConvNext – A ConvNet for the 2020s (2022)

A great choice to include in this article would be Swin Transformers, but that is a topic for a different day! Since this is a CNN article, let’s focus on one last CNN paper.

Patchifying Images like VITs

The input of ConvNext follows the patching strategy inspired by Vision Transformers. A 4×4 convolution kernel with a stride of 4 creates a downsampled image which is inputted into the rest of the network.

Depthwise Separable Convolution

Inspired by MobileNet, ConvNext uses depthwise separable convolution layers. The authors also hypothesize that depthwise convolution is similar to the weighted-sum operation in self-attention, which operates on a per-channel basis by only mixing information in the spatial dimension. Also, the 1×1 pointwise convolutions are similar to the channel-mixing steps in self-attention.

Larger Kernel Sizes

While ConvNets have been using 3×3 kernels ever since VGG, ConvNext proposes larger 7×7 filters to capture a wider spatial context, trying to come close to the fully global context that ViTs capture, while retaining the localization spirits of CNNs.

There are also some other tweaks, like using MobileNetV2-inspired inverted bottlenecks, the GELU activations, layer norms instead of batch norms, and more that shape up the rest of the ConvNext architecture.

Scalability

ConvNext is more computationally efficient thanks to its depthwise separable convolutions and is more scalable than transformers on high-resolution images – this is because self-attention scales quadratically with sequence length and convolution doesn't.


Final Thoughts!

The history of CNNs teaches us so much about Deep Learning, Inductive Bias, and the nature of computation itself. It’ll be interesting to see what wins out in the end – the inductive biases of ConvNets or the Generality of Transformers. Do check out the companion YouTube video for a visual tour of this article, and the individual papers as listed below.

References

CNN with Backprop (1989): http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf

LeNet-5: http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf

AlexNet:https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf

GoogleNet: https://arxiv.org/abs/1409.4842

VGG: https://arxiv.org/abs/1409.1556

Batch Norm: https://arxiv.org/pdf/1502.03167

ResNet: https://arxiv.org/abs/1512.03385

DenseNet: https://arxiv.org/abs/1608.06993

MobileNet: https://arxiv.org/abs/1704.04861

MobileNet-V2: https://arxiv.org/abs/1801.04381

Vision Transformers: https://arxiv.org/abs/2010.11929

ConvNext: https://arxiv.org/abs/2201.03545

Squeeze-and-Excitation Network: https://arxiv.org/abs/1709.01507

Swin Transformers: https://arxiv.org/abs/2103.14030

The post The History of Convolutional Neural Networks for Image Classification (1989- Today) appeared first on Towards Data Science.

]]>