Sesame Speech Model: How This Viral AI Model Generates Human-Like Speech
https://towardsdatascience.com/sesame-speech-model-how-this-viral-ai-model-generates-human-like-speech/
A deep dive into residual vector quantizers, conversational speech AI, and talkative transformers.

Recently, Sesame AI published a demo of their latest speech-to-speech model: a conversational AI agent that is remarkably good at speaking. It provides relevant answers, speaks with expression, and, honestly, it is just very fun and interactive to play with.

Note that a technical paper is not out yet, but they do have a short blog post that provides a lot of information about the techniques they used and previous algorithms they built upon. 

Thankfully, they provided enough information for me to write this article and make a YouTube video out of it. Read on!

Training a Conversational Speech Model

Sesame is a Conversational Speech Model, or CSM. It takes both text and audio as input and generates speech as audio. While they haven't revealed their training data sources in their blog post, we can still make a solid guess. The blog post heavily cites another CSM, 2024's Moshi, and fortunately, the creators of Moshi did reveal their data sources in their paper. Moshi uses 7 million hours of unsupervised speech data, 170 hours of natural and scripted conversations (for multi-stream training), and 2,000 more hours of telephone conversations (the Fisher dataset).


Sesame builds upon the Moshi Paper (2024)

But what does it really take to generate audio?

In raw form, audio is just a long sequence of amplitude values — a waveform. For example, if you’re sampling audio at 24 kHz, you are capturing 24,000 float values every second.

There are 24000 values here to represent 1 second of speech! (Image generated by author)

Of course, it is quite resource-intensive to process 24000 float values for just one second of data, especially because transformer computations scale quadratically with sequence length. It would be great if we could compress this signal and reduce the number of samples required to process the audio.

We will take a deep dive into the Mimi encoder and specifically Residual Vector Quantizers (RVQ), which are the backbone of Audio/Speech modeling in Deep Learning today. We will end the article by learning about how Sesame generates audio using its special dual-transformer architecture.

Preprocessing audio

Compression and feature extraction are where convolution helps us. Sesame uses the Mimi speech encoder to process audio. Mimi was introduced in the aforementioned Moshi paper as well. Mimi is a self-supervised audio encoder-decoder model that converts audio waveforms into discrete “latent” tokens first, and then reconstructs the original signal. Sesame only uses the encoder section of Mimi to tokenize the input audio. Let’s learn how.

Mimi takes the raw speech waveform at 24 kHz and passes it through several strided convolution layers to downsample the signal, with stride factors of 4, 5, 6, 8, and 2. This means that the first CNN block downsamples the audio by 4x, then 5x, then 6x, and so on. In the end, it downsamples by a factor of 1920 (4 × 5 × 6 × 8 × 2), reducing the signal to just 12.5 frames per second.

The convolution blocks also project the original float values to an embedding dimension of 512. Each embedding aggregates the local features of the original 1D waveform. One second of audio is now represented as roughly 12 vectors of size 512. This way, Mimi reduces the sequence length from 24,000 samples to about 12 frames and converts them into dense continuous vectors.
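To make the arithmetic concrete, here is a minimal PyTorch sketch of such a strided convolution stack. The strides match the ones quoted above, but the kernel sizes, padding, activation, and channel choices are illustrative assumptions, not the actual Mimi encoder:

import torch
import torch.nn as nn

# Strided 1D convolutions with strides 4, 5, 6, 8, and 2: a combined downsampling of 4*5*6*8*2 = 1920x
strides = [4, 5, 6, 8, 2]
layers, in_channels = [], 1
for s in strides:
    layers += [nn.Conv1d(in_channels, 512, kernel_size=2 * s, stride=s, padding=s // 2), nn.ELU()]
    in_channels = 512
encoder = nn.Sequential(*layers)

waveform = torch.randn(1, 1, 24000)   # one second of audio sampled at 24 kHz
frames = encoder(waveform)
print(frames.shape)                   # torch.Size([1, 512, 12]): about 12 frames, each a 512-dim vector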

Before applying any quantization, the Mimi Encoder downsamples the input 24KHz audio by 1920 times, and embeds it into 512 dimensions. In other words, you get 12.5 frames per second with each frame as a 512-dimensional vector. (Image from author’s video)

What is Audio Quantization?

Given the continuous embeddings obtained after the convolution layer, we want to tokenize the input speech. If we can represent speech as a sequence of tokens, we can apply standard language-modeling transformers to train generative models.

Mimi uses a Residual Vector Quantizer or RVQ tokenizer to achieve this. We will talk about the residual part soon, but first, let’s look at what a simple vanilla Vector quantizer does.

Vector Quantization

The idea behind Vector Quantization is simple: you train a codebook, which is a collection of, say, 1,000 code vectors, all of size 512 (the same as your embedding dimension).

A Vanilla Vector Quantizer. A codebook of embeddings is trained. Given an input embedding, we map/quantize it to the nearest codebook entry. (Screenshot from author’s video)

Then, given the input vector, we will map it to the closest vector in our codebook — basically snapping a point to its nearest cluster center. This means we have effectively created a fixed vocabulary of tokens to represent each audio frame, because whatever the input frame embedding may be, we will represent it with the nearest cluster centroid. If you want to learn more about Vector Quantization, check out my video on this topic where I go much deeper with this.

More about Vector Quantization! (Video by author)
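Here is a minimal sketch of that lookup step in PyTorch, purely for illustration (the codebook here is random and untrained, with the sizes from the example above):

import torch

codebook = torch.randn(1000, 512)          # 1,000 code vectors of size 512
frame = torch.randn(1, 512)                # one continuous frame embedding

distances = torch.cdist(frame, codebook)   # distance from the frame to every codebook entry
token_id = distances.argmin(dim=-1)        # discrete token: index of the nearest entry
quantized = codebook[token_id]             # the centroid that now represents the frame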

Residual Vector Quantization

The problem with simple vector quantization is that the loss of information may be too high because we are mapping each vector to its cluster’s centroid. This “snap” is rarely perfect, so there is always an error between the original embedding and the nearest codebook entry.

The big idea of Residual Vector Quantization is that it doesn’t stop at having just one codebook. Instead, it tries to use multiple codebooks to represent the input vector.

  1. First, you quantize the original vector using the first codebook.
  2. Then, you subtract that centroid from your original vector. What you’re left with is the residual — the error that wasn’t captured in the first quantization.
  3. Now take this residual, and quantize it again, using a second codebook full of brand new code vectors — again by snapping it to the nearest centroid.
  4. Subtract that too, and you get a smaller residual. Quantize again with a third codebook… and you can keep doing this for as many codebooks as you want.
Residual Vector Quantizers (RVQ) hierarchically encode the input embeddings by using a new codebook and VQ layer to represent the previous codebook’s error. (Illustration by the author)

Each step hierarchically captures a little more detail that was missed in the previous round. If you repeat this for, let’s say, N codebooks, you get a collection of N discrete tokens, one from each stage of quantization, to represent one audio frame.
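Below is a minimal, illustrative sketch of this loop (random, untrained codebooks; the number and size of the codebooks are arbitrary choices):

import torch

num_codebooks, codebook_size, dim = 8, 1000, 512
codebooks = [torch.randn(codebook_size, dim) for _ in range(num_codebooks)]

frame = torch.randn(1, dim)              # the continuous frame embedding to be tokenized
residual = frame.clone()
tokens, reconstruction = [], torch.zeros(1, dim)

for cb in codebooks:
    idx = torch.cdist(residual, cb).argmin(dim=-1)   # nearest entry in this codebook
    code = cb[idx]
    tokens.append(idx.item())            # one discrete token per codebook
    reconstruction += code               # the running sum approximates the original frame
    residual = residual - code           # the error passed on to the next codebook

print(tokens)                            # N tokens that together represent one audio frame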

The coolest thing about RVQs is that they are designed to have a high inductive bias towards capturing the most essential content in the very first quantizer. In the subsequent quantizers, they learn more and more fine-grained features.

If you’re familiar with PCA, you can think of the first codebook as containing the primary principal components, capturing the most critical information. The subsequent codebooks represent higher-order components, containing information that adds more details.

Residual Vector Quantizers (RVQ) uses multiple codebooks to encode the input vector — one entry from each codebook. (Screenshot from author’s video)

Acoustic vs Semantic Codebooks

Since Mimi is trained on the task of audio reconstruction, the encoder compresses the signal to the discretized latent space, and the decoder reconstructs it back from the latent space. When optimizing for this task, the RVQ codebooks learn to capture the essential acoustic content of the input audio inside the compressed latent space. 

Mimi also separately trains a single codebook (vanilla VQ) that only focuses on embedding the semantic content of the audio. This is why Mimi is called a split-RVQ tokenizer – it divides the quantization process into two independent parallel paths: one for semantic information and another for acoustic information.

The Mimi Architecture (Source: Moshi paper) License: Free

To train semantic representations, Mimi used knowledge distillation with an existing speech model called WavLM as a semantic teacher. Basically, Mimi introduces an additional loss function that decreases the cosine distance between the semantic RVQ code and the WavLM-generated embedding.


Audio Decoder

Given a conversation containing text and audio, we first convert them into a sequence of token embeddings using the text and audio tokenizers. This token sequence is then input into a transformer model as a time series. In the blog post, this model is referred to as the Autoregressive Backbone Transformer. Its task is to process this time series and output the “zeroth” codebook token.

A lighter-weight transformer, called the audio decoder, then reconstructs the remaining codebook tokens conditioned on this zeroth code generated by the backbone transformer. Note that the zeroth code already contains a lot of information about the history of the conversation, since the backbone transformer has visibility of the entire past sequence. The lightweight audio decoder only operates on the zeroth token and generates the other N-1 codes. These codes are generated by N-1 distinct linear layers that output the probability of choosing each code from their corresponding codebooks.

You can imagine this process as predicting a text token from the vocabulary in a text-only LLM. Just that a text-based LLM has a single vocabulary, but the RVQ-tokenizer has multiple vocabularies in the form of the N codebooks, so you need to train a separate linear layer to model the codes for each.

The Sesame Architecture (Illustration by the author)

Finally, after the codewords are all generated, we aggregate them to form the combined continuous audio embedding. The final job is to convert this audio back to a waveform. For this, we apply transposed convolutional layers to upsample the embedding from 12.5 frames per second back to a 24 kHz waveform. Basically, this reverses the transforms we applied during audio preprocessing.
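As a rough sketch (with the same caveats as before: assumed kernel sizes, padding, and channels, not Mimi's real decoder), the upsampling stage mirrors the encoder by running the strides in reverse:

import torch
import torch.nn as nn

strides = [2, 8, 6, 5, 4]                # the encoder strides, applied in reverse order
layers, in_channels = [], 512
for i, s in enumerate(strides):
    out_channels = 1 if i == len(strides) - 1 else 512
    layers.append(nn.ConvTranspose1d(in_channels, out_channels, kernel_size=2 * s, stride=s, padding=s // 2))
    in_channels = out_channels
decoder = nn.Sequential(*layers)

frames = torch.randn(1, 512, 12)         # roughly one second of encoded audio at 12.5 frames per second
waveform = decoder(frames)
print(waveform.shape)                    # on the order of 24,000 samples; the exact length depends on padding choices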

In Summary

Check out the accompanying video on this article! (Video by author)

So, here is the overall summary of the Sesame model in some bullet points.

  1. Sesame is built on a multimodal Conversational Speech Model, or CSM.
  2. Text and audio are tokenized together to form a sequence of tokens and input into the backbone transformer, which autoregressively processes the sequence.
  3. While the text is processed like in any other text-based LLM, the audio is processed directly from its waveform representation. They use the Mimi encoder to convert the waveform into latent codes using a split RVQ tokenizer.
  4. The multimodal backbone transformer consumes the sequence of tokens and predicts the zeroth codeword of the next frame.
  5. Another lightweight transformer, called the Audio Decoder, predicts the remaining codewords from the zeroth codeword.
  6. The final audio frame representation is generated by combining all the generated codewords, and is then upsampled back to a waveform.

Thanks for reading!

References and Must-read papers

Check out my ML YouTube Channel

Sesame Blogpost and Demo

Relevant papers: 
Moshi: https://arxiv.org/abs/2410.00037 
SoundStream: https://arxiv.org/abs/2107.03312 
HuBERT: https://arxiv.org/abs/2106.07447
SpeechTokenizer: https://arxiv.org/abs/2308.16692


Learnings from a Machine Learning Engineer — Part 6: The Human Side
https://towardsdatascience.com/learnings-from-a-machine-learning-engineer-part-6-the-human-side/
Practical advice for the humans involved with machine learning

In my previous articles, I have spent a lot of time talking about the technical aspects of an image classification problem, from data collection, model evaluation, and performance optimization, to a detailed look at model training.

These elements require a certain degree of in-depth expertise, and they (usually) have well-defined metrics and established processes that are within our control.

Now it’s time to consider…

The human aspects of machine learning

Yes, this may seem like an oxymoron! But it is the interaction with people — the ones you work with and the ones who use your application — that helps bring the technology to life and provides a sense of fulfillment to your work.

These human interactions include:

  • Communicating technical concepts to a non-technical audience.
  • Understanding how your end-users engage with your application.
  • Providing clear expectations on what the model can and cannot do.

I also want to touch on the impact on people’s jobs, both positive and negative, as AI becomes a part of our everyday lives.

Overview

As in my previous articles, I will gear this discussion around an image classification application. With that in mind, these are the groups of people involved with your project:

  • AI/ML Engineer (that’s you) — bringing life to the Machine Learning application.
  • MLOps team — your peers who will deploy, monitor, and enhance your application.
  • Subject matter experts — the ones who will provide the care and feeding of labeled data.
  • Stakeholders — the ones who are looking for a solution to a real world problem.
  • End-users — the ones who will be using your application. These could be internal and external customers.
  • Marketing — the ones who will be promoting usage of your application.
  • Leadership — the ones who are paying the bill and need to see business value.

Let’s dive right in…

AI/ML Engineer

You may be a part of a team or a lone wolf. You may be an individual contributor or a team leader.

Photo by Christina @ wocintechchat.com on Unsplash

Whatever your role, it is important to see the whole picture — not only the coding, the data science, and the technology behind AI/ML — but the value that it brings to your organization.

Understand the business needs

Your company faces many challenges to reduce expenses, improve customer satisfaction, and remain profitable. Position yourself as someone who can create an application that helps achieve their goals.

  • What are the pain points in a business process?
  • What is the value of using your application (time savings, cost savings)?
  • What are the risks of a poor implementation?
  • What is the roadmap for future enhancements and use-cases?
  • What other areas of the business could benefit from the application, and what design choices will help future-proof your work?

Communication

Deep technical discussions with your peers are probably your comfort zone. However, to be a more successful AI/ML Engineer, you should be able to clearly explain the work you are doing to different audiences.

With practice, you can explain these topics in ways that your non-technical business users can follow along with, and understand how your technology will benefit them.

To help you get comfortable with this, try creating a PowerPoint with 2–3 slides that you can cover in 5–10 minutes. For example, explain how a neural network can take an image of a cat or a dog and determine which one it is.

Practice giving this presentation in your mind, to a friend — even your pet dog or cat! This will get you more comfortable with the transitions, tighten up the content, and ensure you cover all the important points as clearly as possible.

  • Be sure to include visuals — pure text is boring, graphics are memorable.
  • Keep an eye on time — respect your audience’s busy schedule and stick to the 5–10 minutes you are given.
  • Put yourself in their shoes — your audience is interested in how the technology will benefit them, not in how smart you are.

Creating a technical presentation is a lot like the Feynman Technique — explaining a complex subject to your audience by breaking it into easily digestible pieces, with the added benefit of helping you understand it more completely yourself.

MLOps team

These are the people that deploy your application, manage data pipelines, and monitor infrastructure that keeps things running.

Without them, your model lives in a Jupyter notebook and helps nobody!

Photo by airfocus on Unsplash

These are your technical peers, so you should be able to connect with their skillset more naturally. You speak in jargon that sounds like a foreign language to most people. Even so, it is extremely helpful for you to create documentation to set expectations around:

  • Process and data flows.
  • Data quality standards.
  • Service level agreements for model performance and availability.
  • Infrastructure requirements for compute and storage.
  • Roles and responsibilities.

It is easy to have a more informal relationship with your MLOps team, but remember that everyone is trying to juggle many projects at the same time.

Email and chat messages are fine for quick-hit issues. But for larger tasks, you will want a system to track things like user stories, enhancement requests, and break-fix issues. This way you can prioritize the work and ensure you don’t forget something. Plus, you can show progress to your supervisor.

Some great tools exist, such as:

  • Jira, GitHub, Azure DevOps Boards, Asana, Monday, etc.

We are all professionals, so having a more formal system to avoid miscommunication and mistrust is good business.

Subject matter experts

These are the team members that have the most experience working with the data that you will be using in your AI/ML project.

Photo by National Cancer Institute on Unsplash

SMEs are very skilled at dealing with messy data — they are human, after all! They can handle one-off situations by considering knowledge outside of their area of expertise. For example, a doctor may recognize metal inserts in a patient’s X-ray that indicate prior surgery. They may also notice a faulty X-ray image due to equipment malfunction or technician error.

However, your machine learning model only knows what it knows, which comes from the data it was trained on. So, those one-off cases may not be appropriate for the model you are training. Your SMEs need to understand that clear, high quality training material is what you are looking for.

Think like a computer

In the case of an image classification application, the output from the model communicates to you how well it was trained on the data set. This comes in the form of error rates, which is very much like when a student takes an exam and you can tell how well they studied by seeing how many questions — and which ones — they get wrong.

In order to reduce error rates, your image data set needs to be objectively “good” training material. To do this, put yourself in an analytical mindset and ask yourself:

  • What images will the computer get the most useful information out of? Make sure all the relevant features are visible.
  • What is it about an image that confused the model? When it makes an error, try to understand why — objectively — by looking at the entire picture.
  • Is this image a “one-off” or a typical example of what the end-users will send? Consider creating a new subclass of exceptions to the norm.

Be sure to communicate to your SMEs that model performance is directly tied to data quality and give them clear guidance:

  • Provide visual examples of what works.
  • Provide counter-examples of what does not work.
  • Ask for a wide variety of data points. In the X-ray example, be sure to get patients with different ages, genders, and races.
  • Provide options to create subclasses of your data for further refinement. Use that X-ray from a patient with prior surgery as a subclass, and eventually as you can get more examples over time, the model can handle them.

This also means that you should become familiar with the data they are working with — perhaps not expert level, but certainly above a novice level.

Lastly, when working with SMEs, be cognizant of the impression they may have that the work you are doing is somehow going to replace their job. It can feel threatening when someone asks you how to do your job, so be mindful.

Ideally, you are building a tool with honest intentions and it will enable your SMEs to augment their day-to-day work. If they can use the tool as a second opinion to validate their conclusions in less time, or perhaps even avoid mistakes, then this is a win for everyone. Ultimately, the goal is to allow them to focus on more challenging situations and achieve better outcomes.

I have more to say on this in my closing remarks.

Stakeholders

These are the people you will have the closest relationship with.

Stakeholders are the ones who created the business case to have you build the machine learning model in the first place.

Photo by Ninthgrid on Unsplash

They have a vested interest in having a model that performs well. Here are some key points to keep in mind when working with your stakeholders:

  • Be sure to listen to their needs and requirements.
  • Anticipate their questions and be prepared to respond.
  • Be on the lookout for opportunities to improve your model performance. Your stakeholders may not be as close to the technical details as you are and may not think there is any room for improvement.
  • Bring issues and problems to their attention. They may not want to hear bad news, but they will appreciate honesty over evasion.
  • Schedule regular updates with usage and performance reports.
  • Explain technical details in terms that are easy to understand.
  • Set expectations on regular training and deployment cycles and timelines.

Your role as an AI/ML Engineer is to bring to life the vision of your stakeholders. Your application is making their lives easier, which justifies and validates the work you are doing. It’s a two-way street, so be sure to share the road.

End-users

These are the people who are using your application. They may also be your harshest critics, but you may never even hear their feedback.

Photo by Alina Ruf on Unsplash

Think like a human

Recall above when I suggested to “think like a computer” when analyzing the data for your training set. Now it’s time to put yourself in the shoes of a non-technical user of your application.

End-users of an image classification model communicate their understanding of what’s expected of them by way of poor images. These are like the students that didn’t study for the exam, or worse didn’t read the questions, so their answers don’t make sense.

Your model may be really good, but if end-users misuse the application or are not satisfied with the output, you should be asking:

  • Are the instructions confusing or misleading? Did the user focus the camera on the subject being classified, or is it more of a wide-angle image? You can’t blame the user if they follow bad instructions.
  • What are their expectations? When the results are presented to the user, are they satisfied or are they frustrated? You may notice repeated images from frustrated users.
  • Are the usage patterns changing? Are they trying to use the application in unexpected ways? This may be an opportunity to improve the model.

Inform your stakeholders of your observations. There may be simple fixes to improve end-user satisfaction, or there may be more complex work ahead.

If you are lucky, you may discover an unexpected way to leverage the application that leads to expanded usage or exciting benefits to your business.

Explainability

Most AI/ML models are considered “black boxes” that perform millions of calculations on extremely high-dimensional data and produce a rather simplistic result without any explanation behind it.

The Answer to the Ultimate Question of Life, the Universe, and Everything is 42.
— The Hitchhiker’s Guide to the Galaxy

Depending on the situation, your end-users may require more explanation of the results, such as with medical imaging. Where possible, you should consider incorporating model explainability techniques such as LIME, SHAP, and others. These responses can help put a human touch to cold calculations.

Now it’s time to switch gears and consider higher-ups in your organization.

Marketing team

These are the people who promote the use of your hard work. If your end-users are completely unaware of your application, or don’t know where to find it, your efforts will go to waste.

The marketing team controls where users can find your app on your website and link to it through social media channels. They also see the technology through a different lens.

Gartner hype cycle. Image from Wikipedia – https://en.wikipedia.org/wiki/Gartner_hype_cycle

The above hype cycle is a good representation of how technical advancements tend to flow. At the beginning, there can be an unrealistic expectation of what your new AI/ML tool can do — it’s the greatest thing since sliced bread!

Then the “new” wears off and excitement wanes. You may face a lack of interest in your application as the marketing team (as well as your end-users) moves on to the next thing. In reality, the value of your efforts is somewhere in the middle.

Understand that the marketing team’s interest is in promoting the use of the tool because of how it will benefit the organization. They may not need to know the technical inner workings. But they should understand what the tool can do, and be aware of what it cannot do.

Honest and clear communication up-front will help smooth out the hype cycle and keep everyone interested longer. This way the crash from peak expectations to the trough of disillusionment is not so severe that the application is abandoned altogether.

Leadership team

These are the people that authorize spending and have the vision for how the application fits into the overall company strategy. They are driven by factors that you have no control over and you may not even be aware of. Be sure to provide them with the key information about your project so they can make informed decisions.

Photo by Adeolu Eletu on Unsplash

Depending on your role, you may or may not have direct interaction with executive leadership in your company. Your job is to summarize the costs and benefits associated with your project, even if that is just with your immediate supervisor who will pass this along.

Your costs will likely include:

  • Compute and storage — training and serving a model.
  • Image data collection — both real-world and synthetic or staged.
  • Hours per week — SME, MLOps, AI/ML engineering time.

Highlight the savings and/or value added:

  • Provide measures on speed and accuracy.
  • Translate efficiencies into FTE hours saved and customer satisfaction.
  • Bonus points if you can find a way to produce revenue.

Business leaders, much like the marketing team, may follow the hype cycle:

  • Be realistic about model performance. Don’t try to oversell it, but be honest about the opportunities for improvement.
  • Consider creating a human benchmark test to measure accuracy and speed for an SME. It is easy to say human accuracy is 95%, but it’s another thing to measure it.
  • Highlight short-term wins and how they can become long-term success.

Conclusion

I hope you can see that, beyond the technical challenges of creating an AI/ML application, there are many humans involved in a successful project. Being able to interact with these individuals, and meet them where they are in terms of their expectations from the technology, is vital to advancing the adoption of your application.

Photo by Vlad Hilitanu on Unsplash

Key takeaways:

  • Understand how your application fits into the business needs.
  • Practice communicating to a non-technical audience.
  • Collect measures of model performance and report these regularly to your stakeholders.
  • Expect that the hype cycle could help and hurt your cause, and that setting consistent and realistic expectations will ensure steady adoption.
  • Be aware that factors outside of your control, such as budgets and business strategy, could affect your project.

And most importantly…

Don’t let machines have all the fun learning!

Human nature gives us the curiosity we need to understand our world. Take every opportunity to grow and expand your skills, and remember that human interaction is at the heart of machine learning.

Closing remarks

Advancements in AI/ML have the potential (assuming they are properly developed) to do many tasks as well as humans. It would be a stretch to say “better than” humans because it can only be as good as the training data that humans provide. However, it is safe to say AI/ML can be faster than humans.

The next logical question would be, “Well, does that mean we can replace human workers?”

This is a delicate topic, and I want to be clear that I am not an advocate of eliminating jobs.

I see my role as an AI/ML Engineer as one that creates tools that aid in someone else’s job or enhance their ability to complete their work successfully. When used properly, the tools can validate difficult decisions and speed through repetitive tasks, allowing your experts to spend more time on the one-off situations that require more attention.

There may also be new career opportunities, from the care-and-feeding of data, quality assessment, user experience, and even to new roles that leverage the technology in exciting and unexpected ways.

Unfortunately, business leaders may make decisions that impact people’s jobs, and this is completely out of your control. But all is not lost — even for us AI/ML Engineers…

There are things we can do

  • Be kind to the fellow human beings that we call “coworkers”.
  • Be aware of the fear and uncertainty that comes with technological advancements.
  • Be on the lookout for ways to help people leverage AI/ML in their careers and to make their lives better.

This is all part of being human.

A Data Scientist’s Guide to Docker Containers
https://towardsdatascience.com/a-data-scientists-guide-to-docker-containers/
How to enable your ML model to run anywhere

For an ML model to be useful, it needs to run somewhere. This somewhere is most likely not your local machine. A not-so-good model that runs in a production environment is better than a perfect model that never leaves your local machine.

However, the production machine is usually different from the one you developed the model on. So, you ship the model to the production machine, but somehow the model doesn’t work anymore. That’s weird, right? You tested everything on your local machine and it worked fine. You even wrote unit tests.

What happened? Most likely the production machine differs from your local machine. Perhaps it does not have all the needed dependencies installed to run your model. Perhaps installed dependencies are on a different version. There can be many reasons for this.

How can you solve this problem? One approach could be to exactly replicate the production machine. But that is very inflexible as for each new production machine you would need to build a local replica.

A much nicer approach is to use Docker containers.

Docker is a tool that helps us to create, manage, and run code and applications in containers. A container is a small isolated computing environment in which we can package an application with all its dependencies. In our case our ML model with all the libraries it needs to run. With this, we do not need to rely on what is installed on the host machine. A Docker Container enables us to separate applications from the underlying infrastructure.

For example, we package our ML model locally and push it to the cloud. With this, Docker helps us to ensure that our model can run anywhere and anytime. Using Docker has several advantages for us. It helps us to deliver new models faster, improve reproducibility, and make collaboration easier. All because we have exactly the same dependencies no matter where we run the container.

As Docker is widely used in the industry, Data Scientists need to be able to build and run containers using Docker. Hence, in this article, I will go through the basic concepts of containers. I will show you all you need to know about Docker to get started. After we have covered the theory, I will show you how you can build and run your own Docker container.


What is a container?

A container is a small, isolated environment in which everything is self-contained. The environment packages up all code and dependencies.

A container has five main features.

  1. self-contained: A container isolates the application/software from its environment/infrastructure. Due to this isolation, we do not need to rely on any pre-installed dependencies on the host machine. Everything we need is part of the container. This ensures that the application can always run regardless of the infrastructure.
  2. isolated: The container has a minimal influence on the host and other containers and vice versa.
  3. independent: We can manage containers independently. Deleting a container does not affect other containers.
  4. portable: As a container isolates the software from the hardware, we can run it seamlessly on any machine. With this, we can move it between machines without a problem.
  5. lightweight: Containers are lightweight as they share the host machine’s OS. As they do not require their own OS, we do not need to partition the hardware resource of the host machine.

This might sound similar to virtual machines. But there is one big difference. The difference is in how they use their host computer’s resources. Virtual machines are an abstraction of the physical hardware. They partition one server into multiple. Thus, a VM includes a full copy of the OS which takes up more space.

In contrast, containers are an abstraction at the application layer. All containers share the host’s OS but run in isolated processes. Because containers do not contain an OS, they are more efficient in using the underlying system and resources by reducing overhead.

Containers vs. Virtual Machines (Image by the author based on docker.com)

Now we know what containers are. Let’s get some high-level understanding of how Docker works. I will briefly introduce the technical terms that are used often.


What is Docker?

To understand how Docker works, let’s have a brief look at its architecture.

Docker uses a client-server architecture containing three main parts: A Docker client, a Docker daemon (server), and a Docker registry.

The Docker client is the primary way to interact with Docker through commands. We use the client to communicate through a REST API with as many Docker daemons as we want. Often used commands are docker run, docker build, docker pull, and docker push. I will explain later what they do.

The Docker daemon manages Docker objects, such as images and containers. The daemon listens for Docker API requests. Depending on the request the daemon builds, runs, and distributes Docker containers. The Docker daemon and client can run on the same or different systems.

The Docker registry is a centralized location that stores and manages Docker images. We can use them to share images and make them accessible to others.

Sounds a bit abstract? No worries, once we get started it will be more intuitive. But before that, let’s run through the needed steps to create a Docker container.

Docker Architecture (Image by author based on docker.com)

What do we need to create a Docker container?

It is simple. We only need to do three steps:

  1. create a Dockerfile
  2. build a Docker Image from the Dockerfile
  3. run the Docker Image to create a Docker container

Let’s go step-by-step.

A Dockerfile is a text file that contains instructions on how to build a Docker Image. In the Dockerfile we define what the application looks like and its dependencies. We also state what process should run when launching the Docker container. The resulting image is composed of layers, each representing a portion of the image’s file system. Each layer either adds, removes, or modifies the layer below it.

Based on the Dockerfile we create a Docker Image. The image is a read-only template with instructions to run a Docker container. Images are immutable. Once we create a Docker Image we cannot modify it anymore. If we want to make changes, we can only add changes on top of existing images or create a new image. When we rebuild an image, Docker is clever enough to rebuild only layers that have changed, reducing the build time.

A Docker Container is a runnable instance of a Docker Image. The container is defined by the image and any configuration options that we provide when creating or starting the container. When we remove a container all changes to its internal states are also removed if they are not stored in a persistent storage.


Using Docker: An example

With all the theory, let’s get our hands dirty and put everything together.

As an example, we will package a simple ML model with Flask in a Docker container. We can then run requests against the container and receive predictions in return. We will train a model locally and only load the artifacts of the trained model in the Docker Container.

I will go through the general workflow needed to create and run a Docker container with your ML model. I will guide you through the following steps:

  1. build model
  2. create requirements.txt file containing all dependencies
  3. create Dockerfile
  4. build docker image
  5. run container

Before we get started, we need to install Docker Desktop. We will use it to view and run our Docker containers later on. 

1. Build a model

First, we will train a simple RandomForestClassifier on scikit-learn’s Iris dataset and then store the trained model.
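As a rough sketch, the training step could look like this (exporting to ONNX via skl2onnx, the model.onnx filename, and the hyperparameters are assumptions made here to line up with the ONNX session mentioned in the serving step below):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Train a simple classifier on the Iris dataset
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Convert the trained model to ONNX and store the artifact
onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 4]))])
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())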

Second, we build a script making our model available through a Rest API, using Flask. The script is also simple and contains three main steps:

  1. extract and convert the data we want to pass into the model from the payload JSON
  2. load the model artifacts and create an onnx session and run the model
  3. return the model’s predictions as json

I took most of the code from here and here and made only minor changes.
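As a rough sketch, such a script (the example.py referenced later) could look like the following. The /invocations route, the payload format, and the model.onnx filename are assumptions chosen to line up with the curl request shown later:

import numpy as np
import onnxruntime as rt
from flask import Flask, request, jsonify

app = Flask(__name__)
session = rt.InferenceSession("model.onnx")    # load the stored model artifact once at startup
input_name = session.get_inputs()[0].name

@app.route("/invocations", methods=["POST"])
def invocations():
    # 1. extract and convert the input data from the JSON payload
    payload = request.get_json()
    data = np.array(payload["data"], dtype=np.float32)

    # 2. run the ONNX session on the input
    predictions = session.run(None, {input_name: data})[0]

    # 3. return the model's predictions as JSON
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)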

2. Create requirements

Once we have created the Python file we want to execute when the Docker container is running, we must create a requirements.txt file containing all dependencies. In our case, it looks like this:
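At a minimum, it would include something like the following, assuming the Flask/ONNX serving sketch above (training-only packages such as scikit-learn and skl2onnx stay on the local machine and out of the image):

flask
numpy
onnxruntime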

3. Create Dockerfile

The last thing we need to prepare before being able to build a Docker Image and run a Docker container is to write a Dockerfile.

The Dockerfile contains all the instructions needed to build the Docker Image. The most common instructions are

  • FROM <image> — this specifies the base image that the build will extend.
  • WORKDIR <path> — this instruction specifies the “working directory” or the path in the image where files will be copied and commands will be executed.
  • COPY <host-path><image-path> — this instruction tells the builder to copy files from the host and put them into the container image.
  • RUN <command> — this instruction tells the builder to run the specified command.
  • ENV <name><value> — this instruction sets an environment variable that a running container will use.
  • EXPOSE <port-number> — this instruction sets the configuration on the image that indicates a port the image would like to expose.
  • USER <user-or-uid> — this instruction sets the default user for all subsequent instructions.
  • CMD ["<command>", "<arg1>"] — this instruction sets the default command a container using this image will run.

With these, we can create the Dockerfile for our example. We need to follow the following steps:

  1. Determine the base image
  2. Install application dependencies
  3. Copy in any relevant source code and/or binaries
  4. Configure the final image

Let’s go through them step by step. Each of these steps results in a layer in the Docker Image.

First, we specify the base image that we then build upon. As we have written in the example in Python, we will use a Python base image.

Second, we set the working directory into which we will copy all the files we need to be able to run our ML model.

Third, we refresh the package index files to ensure that we have the latest available information about packages and their versions.

Fourth, we copy in and install the application dependencies.

Fifth, we copy in the source code and all other files we need. Here, we also expose port 8080, which we will use for interacting with the ML model.

Sixth, we set a user, so that the container does not run as the root user

Seventh, we define that the example.py file will be executed when we run the Docker container. With this, we create the Flask server to run our requests against.
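Putting those seven steps together, a Dockerfile along these lines would do the job (the base image tag, file names, and user are illustrative assumptions):

# 1. Base image to build upon
FROM python:3.11-slim

# 2. Working directory inside the image
WORKDIR /app

# 3. Refresh the package index files
RUN apt-get update && rm -rf /var/lib/apt/lists/*

# 4. Copy in and install the application dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# 5. Copy in the source code and model artifact, and expose the port used by the Flask server
COPY example.py model.onnx ./
EXPOSE 8080

# 6. Do not run the container as the root user
RUN useradd --create-home appuser
USER appuser

# 7. Run the Flask server when the container starts
CMD ["python", "example.py"]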

Besides creating the Dockerfile, we can also create a .dockerignore file to improve the build speed. Similar to a .gitignore file, we can exclude directories from the build context.

If you want to know more, please go to docker.com.

4. Create Docker Image

After creating all the files we need, we can build the Docker Image.

To build the image we first need to open Docker Desktop. You can check if Docker Desktop is running by running docker ps in the command line. This command shows you all running containers.

To build a Docker Image, we need to be at the same level as our Dockerfile and requirements.txt file. We can then run docker build -t our_first_image . The -t flag indicates the name of the image, i.e., our_first_image, and the . tells Docker to build from the current directory.

Once we built the image we can do several things. We can

  • view the image by running docker image ls
  • view the history or how the image was created by running docker image history <image_name>
  • push the image to a registry by running docker push <image_name>

5. Run Docker Container

Once we have built the Docker Image, we can run our ML model in a container.

For this, we only need to execute docker run -p 8080:8080 <image_name> in the command line. With -p 8080:8080 we connect the local port (8080) with the port in the container (8080).

If the Docker Image doesn’t expose a port, we could simply run docker run <image_name>. Instead of using the image_name, we can also use the image_id.

Okay, once the container is running, let’s run a request against it. For this, we will send a payload to the endpoint by running curl -X POST http://localhost:8080/invocations -H "Content-Type: application/json" -d @./path/to/sample_payload.json


Conclusion

In this article, I showed you the basics of Docker Containers, what they are, and how to build them yourself. Although I only scratched the surface, it should be enough to get you started and able to package your next model. With this knowledge, you should be able to avoid the “it works on my machine” problems.

I hope that you find this article useful and that it will help you become a better Data Scientist.

See you in my next article and/or leave a comment.

How I Would Learn To Code (If I Could Start Over)
https://towardsdatascience.com/how-i-would-learn-to-code-if-i-could-start-over/
How to learn to code in 2025

According to various sources, the average salary for Coding jobs is ~£47.5k in the UK, which is ~35% higher than the median salary of about £35k.

So, coding is a very valuable skill that will earn you more money, not to mention it’s really fun.

I have been coding professionally now for 4 years, working as a data scientist and machine learning engineer, and in this post, I will explain how I would learn to code if I had to do it all over again.

My journey

I still remember the time I wrote my first bit of code.
It was 9am on the first day of my physics undergrad, and we were in the computer lab.

The professor explained that computation is an integral part of modern physics as it allows us to run large-scale simulations of everything from subatomic particle collisions to the movement of galaxies.

It sounded amazing.

And the way we started this process was by going through a textbook to learn Fortran.

Yes, you heard that right.

My first programming language was Fortran, specifically Fortran 90.
I learned DO loops before FOR loops. I am definitely a rarity in this case.

In that first lab session, I remember writing “Hello World” as is the usual rite of passage and thinking, “Big woop.”

This is how you write “Hello World” in Fortran in case you are interested. 

program hello
print *, 'Hello World!'
end program hello

I actually really struggled to code in Fortran and didn’t do that well on tests we had, which put me off coding.

I still have some old coding projects in Fortran on my GitHub that you can check out.

Looking back, the learning curve to coding is quite steep, but it really does compound, and eventually, it will just click.

I didn’t realise this at the time and actively avoided programming modules in my physics degree, which I regret in hindsight as my progress would have been much quicker.

During my third year, I had to do a research placement as part of my master’s. The company I chose to work for/with used a graphical programming language called LabVIEW to run and manage their experiments.

LabVIEW is based on something called “G” and taught me to think about programming differently from the usual script-based approach.

However, I haven’t used it since and probably never will, but it was cool to learn then.

I did enjoy the research year somewhat, but the pace at which research moves, at least in physics, is painfully slow. Nothing like the “heyday” from the early 20th century I envisioned.

One day after work a video was recommended to me on my YouTube home page.

For those of you unaware, this was a documentary about DeepMind’s AI AlphaGo, which beat the best Go player in the world. Most people thought that an AI could never be good at Go.

From the video, I started to understand how AI worked and learn about neural networks, reinforcement learning, and deep learning.
I found it all so interesting, similar to physics research in the early 20th century.

Ultimately, this is when I started studying for a career in Data Science and machine learning, where I needed to teach myself Python and SQL.

This is where I so-called “fell in love” with coding.
I saw its real potential in actually solving problems, but the main thing was that I had a motivated reason to learn. I was studying to break into a career I wanted to be in, which really drove me.

I then became a data scientist for three years and am now a Machine Learning engineer. During this time, I worked extensively with Python and SQL.

Until a few months ago, those were the only programming languages I knew. I did learn other tools, such as Bash/Z shell, AWS, Docker, Databricks, Snowflake, etc., but not any other “proper” programming languages.

In my spare time, I dabbled a bit with C a couple of years ago, but I have forgotten virtually all of it now. I have some basic scripts on my GitHub if you are interested.

However, in my new role that I started a couple of months ago, I will be using Rust and Go, which I am very much looking forward to learning.

If you are interested in my entire journey to becoming a data scientist and machine learning engineer, you can read about it below:

Choose a language

I always recommend starting with a single language.

According to TestGorilla, there are over 8,000 programming languages, so how do you pick one?

Well, I would argue that many of these are useless for most jobs and have probably been developed as pet projects or for really niche cases.

You could choose your first language based on popularity. The Stack Overflow 2024 survey has great information on this. The most popular languages are JavaScript, Python, SQL, and Java.

However, the way I recommend you choose your first language should be based on what you want to do or work as.

  • Front-end web — JavaScript, HTML, CSS
  • Back-end web — Java, C#, Python, PHP, or Go
  • iOS/macOS apps — Swift
  • Android apps — Kotlin or Java
  • Games — C++ or C
  • Embedded Systems — C or C++
  • Data science/machine learning / AI — Python and SQL

As I wanted to work in the AI/ML space, I focused my energy mainly on Python and some on SQL. It was probably a 90% / 10% split as SQL is smaller and easier to learn.

To this day, I still only know Python and SQL to a “professional” standard, but that’s fine, as pretty much the whole machine-learning community requires these languages.

This shows that you don’t need to know many languages; I have progressed quite far in my career, only knowing two to a significant depth. Of course, it would vary by sector, but the main point still stands.

So, pick a field you want to enter and choose the most in-demand and relevant language in that field.

Learn the bare minimum

The biggest mistake I see beginners make is getting stuck in “tutorial hell.”

This is where you take course after course but never branch out on your own.

I recommend taking a maximum of two courses on a language — literally any intro course would do — and then starting to build immediately.

And I literally mean, build your own projects and experience the struggle because that’s where learning is done.

You won’t know how to write functions until you do it yourself, you won’t know how to create classes until you do it yourself, and you literally won’t understand loops until you implement them yourself.

So, learn the bare minimum and immediately start experimenting; I promise it will at least 2x your learning curve.

You probably have heard this advice a lot, but in reality it is that simple. 

I always say that most things in life are simple but hard to do, especially in programming.

Avoid trends

When I say avoid trends, I don’t mean not to focus on areas that are doing well or in demand in the market.

What I am saying is that when you pick a certain language or specialism, stick with it.

Programming languages all share similar concepts and patterns, so when you learn one, you indirectly improve your ability to pick up another later.

But you still should focus on one language for at least a few months.

Don’t develop “shiny object syndrome” and chase the latest technologies; it’s a game that you will unfortunately lose.

There have been so many “distracting” technologies, such as blockchain, Web3, AI, the list goes on.

Instead, focus on the fundamentals:

  • Data types
  • Design patterns
  • Object-oriented programming
  • Data structures and algorithms
  • Problem-solving skills

These topics transcend individual programming languages and are much better to master than the latest JavaScript framework!

It’s much better to have a strong understanding of one area than try to learn everything. Not only is this more manageable, but it is also better for your long-term career.

As I said earlier, I have progressed quite well in my career by only knowing Python and SQL, as I learned the required technologies for the field and didn’t get distracted.

I can’t stress enough how much leverage you will have in your career if you document your learning publicly.

Document your learning

I don’t know why more people don’t do this. Sharing what I have learned online has been the biggest game changer for my career.

Literally committing your code on GitHub is enough, but I really recommend posting on LinkedIn or X, and ideally, you should create blog posts to help you cement your understanding and show off your knowledge to employers.

When I interview candidates, if they have some sort of online presence showing their learnings, that’s immediately a tick in my box and an extra edge over other applicants.

It shows enthusiasm and passion, not to mention increasing your surface area of serendipity. 

I know many people are scared to do this, but you are suffering from the spotlight effect. Wikipedia defines this as:

The spotlight effect is the psychological phenomenon by which people tend to believe they are being noticed more than they really are.

No one really cares if you post online, or thinks about you even 1% as much as you think.

So, start posting.

What about AI?

I could spend hours discussing why AI is not an immediate risk for anyone who wants to work in the coding profession.

You should embrace AI as part of your toolkit, but that’s as far as it will go, and it will definitely not replace programmers in 5 years.

Unless an AGI breakthrough suddenly occurs in the next decade, which is highly unlikely.

I personally doubt the answer to AGI is the cross-entropy loss function, which is what is used in most LLMs nowadays.

It has been shown time and time again that these AI models lack strong mathematical reasoning abilities, which is one of the most fundamental skills to being a good coder.

Even the so-called “software engineer killer” Devin is not as good as the creators initially marketed it. 

Most companies are simply trying to boost their investment by hyping AI, and their results are often over-exaggerated with controversial benchmark testing.

When I was building a website, ChatGPT even struggled with simple HTML and CSS, which you can argue is its bread and butter!

Overall, don’t worry about AI if you want to work as a coder; there are much, much bigger fish to fry before we cross that bridge!

NeetCode has done a great video explaining how current AI is incapable of replacing programmers.

Another thing!

Join my free newsletter, Dishing the Data, where I share weekly tips, insights, and advice from my experience as a practicing data scientist. Plus, as a subscriber, you’ll get my FREE Data Science Resume Template!

Connect with me

The Case for Centralized AI Model Inference Serving
https://towardsdatascience.com/the-case-for-centralized-ai-model-inference-serving/
Optimizing highly parallel AI algorithm execution

As AI models continue to increase in scope and accuracy, even tasks once dominated by traditional algorithms are gradually being replaced by Deep Learning models. Algorithmic pipelines — workflows that take an input, process it through a series of algorithms, and produce an output — increasingly rely on one or more AI-based components. These AI models often have significantly different resource requirements than their classical counterparts, such as higher memory usage, reliance on specialized hardware accelerators, and increased computational demands.

In this post, we address a common challenge: efficiently processing large-scale inputs through algorithmic pipelines that include deep learning models. A typical solution is to run multiple independent jobs, each responsible for processing a single input. This setup is often managed with job orchestration frameworks (e.g., Kubernetes). However, when deep learning models are involved, this approach can become inefficient as loading and executing the same model in each individual process can lead to resource contention and scaling limitations. As AI models become increasingly prevalent in algorithmic pipelines, it is crucial that we revisit the design of such solutions.

In this post we evaluate the benefits of centralized Inference serving, where a dedicated inference server handles prediction requests from multiple parallel jobs. We define a toy experiment in which we run an image-processing pipeline based on a ResNet-152 image classifier on 1,000 individual images. We compare the runtime performance and resource utilization of the following two implementations:

  1. Decentralized inference — each job loads and runs the model independently.
  2. Centralized inference — all jobs send inference requests to a dedicated inference server.

To keep the experiment focused, we make several simplifying assumptions:

  • Instead of using a full-fledged job orchestrator (like Kubernetes), we implement parallel process execution using Python’s multiprocessing module.
  • While real-world workloads often span multiple nodes, we run everything on a single node.
  • Real-world workloads typically include multiple algorithmic components. We limit our experiment to a single component — a ResNet-152 classifier running on a single input image.
  • In a real-world use case, each job would process a unique input image. To simplify our experiment setup, each job will process the same kitten.jpg image.
  • We will use a minimal deployment of a TorchServe inference server, relying mostly on its default settings. Similar results are expected with alternative inference server solutions such as NVIDIA Triton Inference Server or LitServe.

The code is shared for demonstrative purposes only. Please do not interpret our choice of TorchServe — or any other component of our demonstration — as an endorsement of its use.

Toy Experiment

We conduct our experiments on an Amazon EC2 c5.2xlarge instance, with 8 vCPUs and 16 GiB of memory, running a PyTorch Deep Learning AMI (DLAMI). We activate the PyTorch environment using the following command:

source /opt/pytorch/bin/activate

Step 1: Creating a TorchScript Model Checkpoint

We begin by creating a ResNet-152 model checkpoint. Using TorchScript, we serialize both the model definition and its weights into a single file:

import torch
from torchvision.models import resnet152, ResNet152_Weights

model = resnet152(weights=ResNet152_Weights.DEFAULT)
model = torch.jit.script(model)
model.save("resnet-152.pt")

Step 2: Model Inference Function

Our inference function performs the following steps:

  1. Load the ResNet-152 model.
  2. Load an input image.
  3. Preprocess the image to match the input format expected by the model, following the implementation defined here.
  4. Run inference to classify the image.
  5. Post-process the model output to return the top five label predictions, following the implementation defined here.

We define a constant MAX_THREADS hyperparameter that we use to restrict the number of threads used for model inference in each process. This is to prevent resource contention between the multiple jobs.

import os, time, psutil
import multiprocessing as mp
import torch
import torch.nn.functional as F
import torchvision.transforms as transforms
from PIL import Image


def predict(image_id):
    # Limit each process to 1 thread
    MAX_THREADS = 1
    os.environ["OMP_NUM_THREADS"] = str(MAX_THREADS)
    os.environ["MKL_NUM_THREADS"] = str(MAX_THREADS)
    torch.set_num_threads(MAX_THREADS)

    # load the model
    model = torch.jit.load('resnet-152.pt').eval()

    # Define image preprocessing steps
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                             std=[0.229, 0.224, 0.225])
    ])

    # load the image
    image = Image.open('kitten.jpg').convert("RGB")
    
    # preproc
    image = transform(image).unsqueeze(0)

    # perform inference
    with torch.no_grad():
        output = model(image)

    # postproc
    probabilities = F.softmax(output[0], dim=0)
    probs, classes = torch.topk(probabilities, 5, dim=0)
    probs = probs.tolist()
    classes = classes.tolist()

    return dict(zip(classes, probs))

Step 3: Running Parallel Inference Jobs

We define a function that spawns parallel processes, each processing a single image input. This function:

  • Accepts the total number of images to process and the maximum number of concurrent jobs.
  • Dynamically launches new processes when slots become available.
  • Monitors CPU and memory usage throughout execution.

def process_image(image_id):
    print(f"Processing image {image_id} (PID: {os.getpid()})")
    predict(image_id)

def spawn_jobs(total_images, max_concurrent):
    start_time = time.time()
    max_mem_utilization = 0.
    max_utilization = 0.

    processes = []
    index = 0
    while index < total_images or processes:

        while len(processes) < max_concurrent and index < total_images:
            # Start a new process
            p = mp.Process(target=process_image, args=(index,))
            index += 1
            p.start()
            processes.append(p)

        # sample memory utilization
        mem_usage = psutil.virtual_memory().percent
        max_mem_utilization = max(max_mem_utilization, mem_usage)
        cpu_util = psutil.cpu_percent(interval=0.1)
        max_utilization = max(max_utilization, cpu_util)

        # Remove completed processes from list
        processes = [p for p in processes if p.is_alive()]

    total_time = time.time() - start_time
    print(f"\nTotal Processing Time: {total_time:.2f} seconds")
    print(f"Max CPU Utilization: {max_utilization:.2f}%")
    print(f"Max Memory Utilization: {max_mem_utilization:.2f}%")

spawn_jobs(total_images=1000, max_concurrent=32)

Estimating the Maximum Number of Processes

While the optimal number of maximum concurrent processes is best determined empirically, we can estimate an upper bound based on the 16 GiB of system memory and the size of the resnet-152.pt file, 231 MB.
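
As a rough back-of-the-envelope check (a sketch only; the per-process overhead figure is an assumption, and actual memory use will also include activations and preprocessing buffers), we might estimate the bound like this:

# Rough upper bound on concurrent processes for the decentralized setup,
# assuming each process holds its own ~231 MB copy of the model plus some
# per-process overhead (the overhead value below is an assumption).
TOTAL_MEMORY_MIB = 16 * 1024          # 16 GiB of system memory
MODEL_SIZE_MIB = 231                  # size of resnet-152.pt
PER_PROCESS_OVERHEAD_MIB = 80         # assumed Python/runtime overhead per process

max_processes = TOTAL_MEMORY_MIB // (MODEL_SIZE_MIB + PER_PROCESS_OVERHEAD_MIB)
print(f"Estimated upper bound: ~{max_processes} concurrent processes")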

The table below summarizes the runtime results for several configurations:

Decentralized Inference Results (by Author)

Although memory becomes fully saturated at 50 concurrent processes, we observe that maximum throughput is achieved at 8 concurrent jobs — one per vCPU. This indicates that beyond this point, resource contention outweighs any potential gains from additional parallelism.

The Inefficiencies of Independent Model Execution

Running parallel jobs that each load and execute the model independently introduces significant inefficiencies and waste:

  1. Each process needs to allocate the appropriate memory resources for storing its own copy of the AI model.
  2. AI models are compute-intensive. Executing them in many processes in parallel can lead to resource contention and reduced throughput.
  3. Loading the model checkpoint file and initializing the model in each process adds overhead and can further increase latency. In the case of our toy experiment, model initialization accounts for roughly 30% (!!) of the overall inference processing time.

A more efficient alternative is to centralize inference execution using a dedicated model inference server. This approach would eliminate redundant model loading and reduce overall system resource utilization.

In the next section we will set up an AI model inference server and assess its impact on resource utilization and runtime performance.

Note: We could have modified our multiprocessing-based approach to share a single model across processes (e.g., using torch.multiprocessing or another solution based on shared memory). However, the inference server demonstration better aligns with real-world production environments, where jobs often run in isolated containers.
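
For reference, below is a minimal sketch of that shared-memory alternative (not the approach benchmarked in this post). It uses the eager torchvision model rather than our TorchScript checkpoint and relies on the default fork start method on Linux; treat it as an illustration, not production code.

import torch
import torch.multiprocessing as mp
from torchvision.models import resnet152, ResNet152_Weights

def worker(model, image):
    # Each worker reuses the parent's shared weights instead of loading its own copy
    with torch.no_grad():
        output = model(image)
    print(f"Predicted class index: {output.argmax(dim=1).item()}")

if __name__ == "__main__":
    # Load the model once and move its parameters to shared memory
    model = resnet152(weights=ResNet152_Weights.DEFAULT).eval()
    model.share_memory()

    dummy_batch = torch.rand(1, 3, 224, 224)  # stand-in for a preprocessed image
    processes = [mp.Process(target=worker, args=(model, dummy_batch)) for _ in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()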

TorchServe Setup

The TorchServe setup described in this section loosely follows the resnet tutorial. Please refer to the official TorchServe documentation for more in-depth guidelines.

Installation

The PyTorch environment of our DLAMI comes preinstalled with TorchServe executables. If you are running in a different environment, run the following installation command:

pip install torchserve torch-model-archiver

Creating a Model Archive

The TorchServe Model Archiver packages the model and its associated files into a “.mar” file archive, the format required for deployment on TorchServe. We create a TorchServe model archive file based on our model checkpoint file and using the default image_classifier handler:

mkdir model_store
torch-model-archiver \
    --model-name resnet-152 \
    --serialized-file resnet-152.pt \
    --handler image_classifier \
    --version 1.0 \
    --export-path model_store

TorchServe Configuration

We create a TorchServe config.properties file to define how TorchServe should operate:

model_store=model_store
load_models=resnet-152.mar
models={\
  "resnet-152": {\
    "1.0": {\
        "marName": "resnet-152.mar"\
    }\
  }\
}

# Number of workers per model
default_workers_per_model=1

# Job queue size (default is 100)
job_queue_size=100

After completing these steps, our working directory should look like this:

├── config.properties
├── kitten.jpg
├── model_store
│   ├── resnet-152.mar
├── multi_job.py

Starting TorchServe

In a separate shell we start our TorchServe inference server:

source /opt/pytorch/bin/activate
torchserve \
    --start \
    --disable-token-auth \
    --enable-model-api \
    --ts-config config.properties

Inference Request Implementation

We define an alternative prediction function that calls our inference service:

import requests

def predict_client(image_id):
    with open('kitten.jpg', 'rb') as f:
        image = f.read()
    response = requests.post(
        "http://127.0.0.1:8080/predictions/resnet-152",
        data=image,
        headers={'Content-Type': 'application/octet-stream'}
    )

    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error from inference server: {response.text}")

Scaling Up the Number of Concurrent Jobs

Now that inference requests are being processed by a central server, we can scale up parallel processing. Unlike the earlier approach, where each process loaded and executed its own model, the job processes are now lightweight, leaving sufficient CPU resources to allow for many more concurrent processes. Here we choose 100 processes, in accordance with the default job_queue_size capacity of the inference server:

spawn_jobs(total_images=1000, max_concurrent=100)
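
Note that, for the requests to actually reach the server, process_image should call predict_client rather than the local predict function. One simple way to wire this up (a sketch; the shared code may organize it differently) is to redefine process_image before spawning the jobs:

def process_image(image_id):
    print(f"Processing image {image_id} (PID: {os.getpid()})")
    predict_client(image_id)  # send the request to the TorchServe endpoint instead of running locally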

Results

The performance results are captured in the table below. Keep in mind that the comparative results can vary greatly based on the details of the AI model and the runtime environment.

Inference Server Results (by Author)

By using a centralized inference server, not only have we increased overall throughput by more than 2X, but we have also freed up significant CPU resources for other computation tasks.

Next Steps

Now that we have effectively demonstrated the benefits of a centralized inference serving solution, we can explore several ways to enhance and optimize the setup. Recall that our experiment was intentionally simplified to focus on demonstrating the utility of inference serving. In real-world deployments, additional enhancements may be required to tailor the solution to your specific needs.

  1. Custom Inference Handlers: While we used TorchServe’s built-in image_classifier handler, defining a custom handler provides much greater control over the details of the inference implementation.
  2. Advanced Inference Server Configuration: Inference server solutions will typically include many features for tuning the service behavior according to the workload requirements. In the next sections we will explore some of the features supported by TorchServe.
  3. Expanding the Pipeline: Real-world pipelines will typically include more algorithmic blocks and more sophisticated AI models than we used in our experiment.
  4. Multi-Node Deployment: While we ran our experiments on a single compute instance, production setups will typically include multiple nodes.
  5. Alternative Inference Servers: While TorchServe is a popular choice and relatively easy to set up, there are many alternative inference server solutions that may provide additional benefits and may better suit your needs. Importantly, it was recently announced that TorchServe would no longer be actively maintained. See the documentation for details.
  6. Alternative Orchestration Frameworks: In our experiment we use Python multiprocessing. Real-world workloads will typically use more advanced orchestration solutions.
  7. Utilizing Inference Accelerators: While we executed our model on a CPU, using an AI accelerator (e.g., an NVIDIA GPU, a Google Cloud TPU, or an AWS Inferentia) can drastically improve throughput.
  8. Model Optimization: Optimizing your AI models can greatly increase efficiency and throughput.
  9. Auto-Scaling for Inference Load: In some use cases inference traffic will fluctuate, requiring an inference server solution that can scale its capacity accordingly.

In the next sections we explore two simple ways to enhance our TorchServe-based inference server implementation. We leave the discussion on other enhancements to future posts.

Batch Inference with TorchServe

Many model inference service solutions support the option of grouping inference requests into batches. This usually results in increased throughput, especially when the model is running on a GPU.

We extend our TorchServe config.properties file to support batch inference with a batch size of up to 8 samples. Please see the official documentation for details on batch inference with TorchServe.

model_store=model_store
load_models=resnet-152.mar
models={\
  "resnet-152": {\
    "1.0": {\
        "marName": "resnet-152.mar",\
        "batchSize": 8,\
        "maxBatchDelay": 100,\
        "responseTimeout": 200\
    }\
  }\
}

# Number of workers per model
default_workers_per_model=1

# Job queue size (default is 100)
job_queue_size=100

Results

We append the results in the table below:

Batch Inference Server Results (by Author)

Enabling batched inference increases the throughput by an additional 26.5%.

Multi-Worker Inference with TorchServe

Many model inference service solutions will support creating multiple inference workers for each AI model. This enables fine-tuning the number of inference workers based on expected load. Some solutions support auto-scaling of the number of inference workers.

We extend our own TorchServe setup by increasing the default_workers_per_model setting that controls the number of inference workers assigned to our image classification model.

Importantly, we must limit the number of threads allocated to each worker to prevent resource contention. This is controlled by the number_of_netty_threads setting and by the OMP_NUM_THREADS and MKL_NUM_THREADS environment variables. Here we have set the number of threads to equal the number of vCPUs (8) divided by the number of workers.

model_store=model_store
load_models=resnet-152.mar
models={\
  "resnet-152": {\
    "1.0": {\
        "marName": "resnet-152.mar"\
        "batchSize": 8,\
        "maxBatchDelay": 100,\
        "responseTimeout": 200\
    }\
  }\
}

# Number of workers per model
default_workers_per_model=2 

# Job queue size (default is 100)
job_queue_size=100

# Number of threads per worker
number_of_netty_threads=4

The modified TorchServe startup sequence appears below:

export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
torchserve \
    --start \
    --disable-token-auth \
    --enable-model-api \
    --ts-config config.properties

Results

In the table below we append the results of running with 2, 4, and 8 inference workers:

Multi-Worker Inference Server Results (by Author)

By configuring TorchServe to use multiple inference workers, we are able to increase the throughput by an additional 36%. This amounts to a 3.75X improvement over the baseline experiment.

Summary

This experiment highlights the potential impact of inference server deployment on multi-job deep learning workloads. Our findings suggest that using an inference server can improve system resource utilization, enable higher concurrency, and significantly increase overall throughput. Keep in mind that the precise benefits will greatly depend on the details of the workload and the runtime environment.

Designing the inference serving architecture is just one part of optimizing AI model execution. Please see some of our many posts covering a wide range of AI model optimization techniques.

The post The Case for Centralized AI Model Inference Serving appeared first on Towards Data Science.

]]>
A Simple Implementation of the Attention Mechanism from Scratch https://towardsdatascience.com/a-simple-implementation-of-the-attention-mechanism-from-scratch/ Tue, 01 Apr 2025 01:05:51 +0000 https://towardsdatascience.com/?p=605368 How attention helped models like RNNs mitigate the vanishing gradient problem and capture long-range dependencies among words

The post A Simple Implementation of the Attention Mechanism from Scratch appeared first on Towards Data Science.

]]>
Introduction

The Attention Mechanism is often associated with the transformer architecture, but it was already used in RNNs. In Machine Translation (MT) tasks (e.g., English-Italian), when you want to predict the next Italian word, you need your model to focus on, or pay attention to, the English words that are most useful for making a good translation.

Attention in RNNs

I will not go into details of RNNs, but attention helped these models to mitigate the vanishing gradient problem and to capture more long-range dependencies among words.

At a certain point, we understood that the only important thing was the attention mechanism, and the entire RNN architecture was overkill. Hence, Attention is All You Need!

Self-Attention in Transformers

Classical attention indicates where words in the output sequence should focus in relation to the words in the input sequence. This is important in sequence-to-sequence tasks like MT.

Self-attention is a specific type of attention. It operates between any two elements of the same sequence and provides information on how “correlated” the words in the same sentence are.

For a given token (or word) in a sequence, self-attention generates a list of attention weights corresponding to all other tokens in the sequence. This process is applied to each token in the sentence, obtaining a matrix of attention weights (as in the picture).

This is the general idea. In practice, things are a bit more complicated because we want to add many learnable parameters to our neural network. Let’s see how.

K, V, Q representations

Our model input is a sentence like “my name is Marcello Politi”. With the process of tokenization, a sentence is converted into a list of numbers like [2, 6, 8, 3, 1].

Before feeding the sentence into the transformer we need to create a dense representation for each token.

How to create this representation? We multiply each token by a matrix. The matrix is learned during training.

Let’s add some complexity now.

For each token, we create 3 vectors instead of one; we call these vectors key, value, and query. (We will see later how we create these 3 vectors.)

Conceptually, these 3 vectors have a particular meaning:

  • The vector key represents the core information captured by the token
  • The vector value captures the full information of a token
  • The vector query is a question about the token’s relevance for the current task.

So the idea is that we focus on a particular token i and ask how important the other tokens in the sentence are with respect to it.

This means that we take the vector q_i for token i (the question we ask about i) and perform some mathematical operations with the key vectors k_j of all the other tokens (j != i). This is like asking, at first glance, which other tokens in the sequence look really important for understanding the meaning of token i.

What is this magical mathematical operation?

We take the dot product of the query vector with each key vector and divide by a scaling factor. We do this for each token k_j.

In this way, we obtain a score for each pair (q_i, k_j). We turn this list of scores into a probability distribution by applying a softmax operation. Great, now we have obtained the attention weights!

With the attention weights, we know how important each token k_j is for understanding token i. So now we multiply the value vector v_j associated with each token by its weight and sum the vectors. In this way, we obtain the final context-aware vector of token_i.

If we are computing the contextual dense vector of token_1, we calculate:

z1 = a11*v1 + a12*v2 + … + a15*v5

where a_1j are the computed attention weights and v_j are the value vectors.
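
In matrix form, stacking the query, key, and value vectors of all tokens into matrices Q, K, and V (one row per token), this is the standard scaled dot-product attention formula:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where d_k is the dimension of the key vectors and sqrt(d_k) is the scaling factor mentioned above.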

Done! Almost…

I didn’t cover how we obtained the vectors k, v and q of each token. We need to define some matrices w_k, w_v and w_q so that when we multiply:

  • token * w_k -> k
  • token * w_q -> q
  • token * w_v -> v

These 3 matrices are initialized at random and learned during training; this is why we have so many parameters in modern models such as LLMs.

Multi-head Self-Attention in Transformers (MHSA)

Are we sure that the previous self-attention mechanism is able to capture all important relationships among tokens (words) and create dense vectors of those tokens that really make sense?

It might not always work perfectly. What if, to mitigate the error, we re-run the entire thing twice with new w_q, w_k and w_v matrices and somehow merge the two dense vectors obtained? In this way, maybe one self-attention head manages to capture some relationships and the other captures different ones.

Well, this is exactly what happens in MHSA. The case we just discussed uses two heads because it has two sets of w_q, w_k and w_v matrices. We can have even more heads: 4, 8, 16, etc.

The only complicated thing is that all these heads are managed in parallel; we process them all in the same computation using tensors.

The way we merge the dense vectors of each head is simple: we concatenate them (hence, the dimension of each head’s vector must be smaller so that, when concatenated, we recover the original dimension we wanted), and we pass the resulting vector through another learnable matrix, w_o.

Hands-on

Python">import torch

Suppose you have a sentence. After tokenization, each token (word for simplicity) corresponds to an index (number):

tokenized_sentence = torch.tensor([
    2, #my
    6, #name
    8, #is
    3, #marcello
    1  #politi
])
tokenized_sentence

Before feeding the sentence into the transformer, we need to create a dense representation for each token.

How do we create this representation? We multiply each token by a matrix. This matrix is learned during training.

Let’s build this embedding matrix.

torch.manual_seed(0) # set a fixed seed for reproducibility
embed = torch.nn.Embedding(10, 16)

If we pass our tokenized sentence through the embedding layer, we obtain a dense representation of dimension 16 for each token.

sentence_embed = embed(tokenized_sentence).detach()
sentence_embed

In order to use the attention mechanism, we need to create 3 new vectors for each token. We define 3 matrices, w_q, w_k and w_v. When we multiply one input token by w_q, we obtain the vector q. Same with w_k and w_v.

d = sentence_embed.shape[1] # let's base our matrix on a shape (16,16)

w_key = torch.rand(d,d)
w_query = torch.rand(d,d)
w_value = torch.rand(d,d)

Compute attention weights

Let’s now compute the attention weights for only the first input token of the sentence.

token1_embed = sentence_embed[0]

# compute the three vectors associated with token1: q, k, v
key_1 = w_key.matmul(token1_embed)
query_1 = w_query.matmul(token1_embed)
value_1 = w_value.matmul(token1_embed)

print("key vector for token1: \n", key_1)   
print("query vector for token1: \n", query_1)
print("value vector for token1: \n", value_1)

We need to multiply the query vector associated with token1 (query_1) with the key vectors of all the other tokens.

So now we need to compute all the keys (key_2, key_3, key_4, key_5). But wait, we can compute all of these at once by multiplying sentence_embed by the w_key matrix.

keys = sentence_embed.matmul(w_key.T)
keys[0] #contains the key vector of the first token and so on

Let’s do the same thing with the values

values = sentence_embed.matmul(w_value.T)
values[0] #contains the value vector of the first token and so on

Let’s compute the first part of the attention formula.

import torch.nn.functional as F
# the following are the attention weights of the first tokens to all the others
a1 = F.softmax(query_1.matmul(keys.T)/d**0.5, dim = 0)
a1

With the attention weights, we know the importance of each token. So now we multiply the value vector associated with each token by its weight and sum them up, obtaining the final context-aware vector of token_1.

z1 = a1.matmul(values)
z1

In the same way, we could compute the context-aware dense vectors of all the other tokens. Note that we are always using the same matrices w_k, w_q, w_v. We say that we use one head.

But we can have multiple triplets of matrices, i.e., multiple heads. That’s why it is called multi-head attention.

The dense vectors that each head outputs for an input token are, at the end, concatenated and linearly transformed to get the final dense vector.

Implementing Multi-head Self-Attention

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0) # fixed seed for reproducibility

Same steps as before…

# Tokenized sentence (same as yours)
tokenized_sentence = torch.tensor([2, 6, 8, 3, 1])  # [my, name, is, marcello, politi]

# Embedding layer: vocab size = 10, embedding dim = 16
embed = nn.Embedding(10, 16)
sentence_embed = embed(tokenized_sentence).detach()  # Shape: [5, 16] (seq_len, embed_dim)

We’ll define a multi-head attention mechanism with h heads (let’s say 4 heads for this example). Each head will have its own w_q, w_k, and w_v matrices, and the output of each head will be concatenated and passed through a final linear layer.

Since the outputs of the heads will be concatenated, and we want a final dimension of d, the dimension of each head needs to be d/h. Additionally, each concatenated vector will go through a linear transformation, so we need another matrix, w_output, as you can see in the formula.

d = sentence_embed.shape[1]  # embed dimension 16
h = 4  # Number of heads
d_k = d // h  # Dimension per head (16 / 4 = 4)

Since we have 4 heads, we want 4 copies of each matrix. Instead of copies, we add a dimension, which is the same thing, but this way we only do one operation. (Imagine stacking the matrices on top of each other; it’s the same thing.)

# Define weight matrices for each head
w_query = torch.rand(h, d, d_k)  # Shape: [4, 16, 4] (one d x d_k matrix per head)
w_key = torch.rand(h, d, d_k)    # Shape: [4, 16, 4]
w_value = torch.rand(h, d, d_k)  # Shape: [4, 16, 4]
w_output = torch.rand(d, d)  # Final linear layer: [16, 16]

For simplicity, I’m using torch’s einsum. If you’re not familiar with it, check out my blog post.

The einsum operation torch.einsum('sd,hde->hse', sentence_embed, w_query) in PyTorch uses letters to define how to multiply and rearrange numbers. Here’s what each part means:

  1. Input Tensors:
    • sentence_embed with the notation 'sd':
      • s represents the number of words (sequence length), which is 5.
      • d represents the number of numbers per word (embedding size), which is 16.
      • The shape of this tensor is [5, 16].
    • w_query with the notation 'hde':
      • h represents the number of heads, which is 4.
      • d represents the embedding size, which again is 16.
      • e represents the new number size per head (d_k), which is 4.
      • The shape of this tensor is [4, 16, 4].
  2. Output Tensor:
    • The output has the notation 'hse':
      • h represents 4 heads.
      • s represents 5 words.
      • e represents 4 numbers per head.
      • The shape of the output tensor is [4, 5, 4].

# Compute Q, K, V for all tokens and all heads
# sentence_embed: [5, 16] -> Q: [4, 5, 4] (h, seq_len, d_k)
queries = torch.einsum('sd,hde->hse', sentence_embed, w_query)  # h heads, seq_len tokens, d dim
keys = torch.einsum('sd,hde->hse', sentence_embed, w_key)       # h heads, seq_len tokens, d dim
values = torch.einsum('sd,hde->hse', sentence_embed, w_value)   # h heads, seq_len tokens, d dim

This einsum equation performs a dot product between the queries (hse) and the transposed keys (hek) to obtain scores of shape [h, seq_len, seq_len], where:

  • h -> Number of heads.
  • s and k -> Sequence length (number of tokens).
  • e -> Dimension of each head (d_k).

The division by (d_k ** 0.5) scales the scores to stabilize gradients. Softmax is then applied to obtain attention weights:

# Compute attention scores
scores = torch.einsum('hse,hek->hsk', queries, keys.transpose(-2, -1)) / (d_k ** 0.5)  # [4, 5, 5]
attention_weights = F.softmax(scores, dim=-1)  # [4, 5, 5]
# Apply attention weights
head_outputs = torch.einsum('hij,hjk->hik', attention_weights, values)  # [4, 5, 4]
head_outputs.shape

Now we concatenate the outputs of all the heads (for every token at once):

# Concatenate heads
concat_heads = head_outputs.permute(1, 0, 2).reshape(sentence_embed.shape[0], -1)  # [5, 16]
concat_heads.shape

Let’s finally multiply by the last w_output matrix, as in the formula above:

multihead_output = concat_heads.matmul(w_output)  # [5, 16] @ [16, 16] -> [5, 16]
print("Multi-head attention output for token1:\n", multihead_output[0])

Final Thoughts

In this blog post, I’ve implemented a simple version of the attention mechanism. This is not how it is really implemented in modern frameworks, but my goal is to provide some insight so that anyone can understand how it works. In future articles, I’ll go through the entire implementation of a transformer architecture.

Follow me on TDS if you like this article! 😁

💼 Linkedin | 🐦 X (Twitter) | 💻 Website


Unless otherwise noted, images are by the author

The post A Simple Implementation of the Attention Mechanism from Scratch appeared first on Towards Data Science.

]]>
Talk to Videos https://towardsdatascience.com/talk-to-videos/ Thu, 27 Mar 2025 19:06:24 +0000 https://towardsdatascience.com/?p=605312 Developing an interactive AI application for video-based learning in education and business

The post Talk to Videos appeared first on Towards Data Science.

]]>
Large language models (LLMs) are improving in efficiency and are now able to understand different data formats, offering possibilities for a myriad of applications across different domains. Initially, LLMs were inherently able to process only text. The image understanding feature was integrated by coupling an LLM with another image encoding model. However, gpt-4o was trained on both text and images, making it the first true multimodal LLM that can understand both. Other modalities, such as audio, are integrated into modern LLMs through other AI models, e.g., OpenAI’s Whisper models.

LLMs are now being used more as information processors that can handle data in different formats. Integrating multiple modalities into LLMs opens up numerous applications in education, business, and other sectors. One such application is the processing of educational videos, documentaries, webinars, presentations, business meetings, lectures, and other content using LLMs and interacting with this content more naturally. The audio modality in these videos contains rich information that could be used in a number of applications. In educational settings, it can be used for personalized learning, enhancing accessibility for students with special needs, study aid creation, remote learning support without requiring a teacher’s presence to understand content, and assessing students’ knowledge about a topic. In business settings, it can be used for training new employees with onboarding videos, extracting and generating knowledge from recorded meetings and presentations, creating customized learning materials from product demonstration videos, and extracting insights from recorded industry conferences without watching hours of video, to name a few.

This article discusses the development of an application to interact with videos in a natural way and create learning content from them. The application has the following features:

  • It takes an input video either through a URL or from a local path and extracts audio from the video
  • Transcribes the audio using OpenAI’s state-of-the-art model gpt-4o-transcribe, which has demonstrated improved Word Error Rate (WER) performance over existing Whisper models across multiple established benchmarks
  • Creates a vector store of the transcript and develops a retrieval-augmented generation (RAG) system to establish a conversation with the video transcript
  • Responds to users’ questions in text and speech, using different voices selectable from the application’s UI.
  • Creates learning content such as:
    • Hierarchical representation of the video contents to provide users with quick insights into the main concepts and supporting details
    • Generates quizzes to transform passive video watching into active learning by challenging users to recall and apply information presented in the video.
    • Generates flashcards from the video content that support active recall and spaced repetition learning techniques

The entire workflow of the application is shown in the following figure.

Application workflow (image by author)

The whole codebase, along with detailed instructions for installation and usage, is available on GitHub.

Here is the structure of the GitHub repository. The main Streamlit application implements the GUI interface and calls several other functions from other feature and helper modules (.py files).

GitHub code structure (image by author)

In addition, you can visualize the codebase by opening the “codebase visualization” HTML file in a browser, which describes the structures of each module.

Codebase visualization (image by author)

Let’s delve into the step-by-step development of this application. I will not discuss the entire code, but only its major part. The whole code in the GitHub repository is adequately commented.

Video Input and Processing

Video input and processing logic are implemented in transcriber.py. When the application loads, it verifies whether FFMPEG is present (verify_ffmpeg) in the application’s root directory. FFMPEG is required for downloading a video (if the input is a URL) and extracting audio from the video which is then used to create a transcript.

def verify_ffmpeg():
    """Verify that FFmpeg is available and print its location."""
    # Add FFmpeg to PATH
    os.environ['PATH'] = FFMPEG_LOCATION + os.pathsep + os.environ['PATH']
    # Check if FFmpeg binaries exist
    ffmpeg_path = os.path.join(FFMPEG_LOCATION, 'ffmpeg.exe')
    ffprobe_path = os.path.join(FFMPEG_LOCATION, 'ffprobe.exe')
    if not os.path.exists(ffmpeg_path):
        raise FileNotFoundError(f"FFmpeg executable not found at: {ffmpeg_path}")
    if not os.path.exists(ffprobe_path):
        raise FileNotFoundError(f"FFprobe executable not found at: {ffprobe_path}")
    print(f"FFmpeg found at: {ffmpeg_path}")
    print(f"FFprobe found at: {ffprobe_path}")
    # Try to execute FFmpeg to make sure it works
    try:
        # Add shell=True for Windows and capture errors properly
        result = subprocess.run([ffmpeg_path, '-version'], 
                               stdout=subprocess.PIPE, 
                               stderr=subprocess.PIPE,
                               shell=True,  # This can help with permission issues on Windows
                               check=False)
        if result.returncode == 0:
            print(f"FFmpeg version: {result.stdout.decode().splitlines()[0]}")
        else:
            error_msg = result.stderr.decode()
            print(f"FFmpeg error: {error_msg}")
            # Check for specific permission errors
            if "Access is denied" in error_msg:
                print("Permission error detected. Trying alternative approach...")
                # Try an alternative approach - just check file existence without execution
                if os.path.exists(ffmpeg_path) and os.path.exists(ffprobe_path):
                    print("FFmpeg files exist but execution test failed due to permissions.")
                    print("WARNING: The app may fail when trying to process videos.")
                    # Return paths anyway and hope for the best when actually used
                    return ffmpeg_path, ffprobe_path
                
            raise RuntimeError(f"FFmpeg execution failed: {error_msg}")
    except Exception as e:
        print(f"Error checking FFmpeg: {e}")
        # Fallback option if verification fails but files exist
        if os.path.exists(ffmpeg_path) and os.path.exists(ffprobe_path):
            print("WARNING: FFmpeg files exist but verification failed.")
            print("Attempting to continue anyway, but video processing may fail.")
            return ffmpeg_path, ffprobe_path 
        raise
    return ffmpeg_path, ffprobe_path

The video input is either a URL (for instance, a YouTube URL) or a local file path. The process_video function determines the input type and routes it accordingly. If the input is a URL, the helper functions get_video_info and get_video_id use the yt_dlp package to extract video metadata (title, description, duration) without downloading the video.

#Function to determine the input type and route it appropriately
def process_video(youtube_url, output_dir, api_key, model="gpt-4o-transcribe"):
    """
    Process a YouTube video to generate a transcript
    Wrapper function that combines download and transcription
    Args:
        youtube_url: URL of the YouTube video
        output_dir: Directory to save the output
        api_key: OpenAI API key
        model: The model to use for transcription (default: gpt-4o-transcribe)
    Returns:
        dict: Dictionary containing transcript and file paths
    """
    # First download the audio
    print("Downloading video...")
    audio_path = process_video_download(youtube_url, output_dir)
    
    print("Transcribing video...")
    # Then transcribe the audio
    transcript, transcript_path = process_video_transcribe(audio_path, output_dir, api_key, model=model)
    
    # Return the combined results
    return {
        'transcript': transcript,
        'transcript_path': transcript_path,
        'audio_path': audio_path
    }

def get_video_info(youtube_url):
    """Get video information without downloading."""
    # Check local cache first
    global _video_info_cache
    if youtube_url in _video_info_cache:
        return _video_info_cache[youtube_url]
        
    # Extract info if not cached
    with yt_dlp.YoutubeDL() as ydl:
        info = ydl.extract_info(youtube_url, download=False)
        # Cache the result
        _video_info_cache[youtube_url] = info
        # Also cache the video ID separately
        _video_id_cache[youtube_url] = info.get('id', 'video')
        return info

def get_video_id(youtube_url):
    """Get just the video ID without re-extracting if already known."""
    global _video_id_cache
    if youtube_url in _video_id_cache:
        return _video_id_cache[youtube_url]
    
    # If not in cache, extract from URL directly if possible
    if "v=" in youtube_url:
        video_id = youtube_url.split("v=")[1].split("&")[0]
        _video_id_cache[youtube_url] = video_id
        return video_id
    elif "youtu.be/" in youtube_url:
        video_id = youtube_url.split("youtu.be/")[1].split("?")[0]
        _video_id_cache[youtube_url] = video_id
        return video_id
    
    # If we can't extract directly, fall back to full info extraction
    info = get_video_info(youtube_url)
    video_id = info.get('id', 'video')
    return video_id

After the video input is given, the code in app.py checks whether a transcript for the input video already exists (in the case of URL input). This is done by calling the following two helper functions from transcriber.py.

def get_transcript_path(youtube_url, output_dir):
    """Get the expected transcript path for a given YouTube URL."""
    # Get video ID with caching
    video_id = get_video_id(youtube_url)
    # Return expected transcript path
    return os.path.join(output_dir, f"{video_id}_transcript.txt")

def transcript_exists(youtube_url, output_dir):
    """Check if a transcript already exists for this video."""
    transcript_path = get_transcript_path(youtube_url, output_dir)
    return os.path.exists(transcript_path)

If transcript_exists finds an existing transcript, the next step is to create the vector store for the RAG system. If no existing transcript is found, the next step is to download the audio from the URL and convert it to a standard audio format. The function process_video_download downloads the audio from the URL using yt_dlp with FFMPEG and converts it to .mp3 format. If the input is a local video file, app.py proceeds to convert it to an .mp3 file.

def process_video_download(youtube_url, output_dir):
    """
    Download audio from a YouTube video
    Args:
        youtube_url: URL of the YouTube video
        output_dir: Directory to save the output
        
    Returns:
        str: Path to the downloaded audio file
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Extract video ID from URL
    video_id = None
    if "v=" in youtube_url:
        video_id = youtube_url.split("v=")[1].split("&")[0]
    elif "youtu.be/" in youtube_url:
        video_id = youtube_url.split("youtu.be/")[1].split("?")[0]
    else:
        raise ValueError("Could not extract video ID from URL")
    # Set output paths
    audio_path = os.path.join(output_dir, f"{video_id}.mp3")
    
    # Configure yt-dlp options
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
        'outtmpl': os.path.join(output_dir, f"{video_id}"),
        'quiet': True
    }
    
    # Download audio
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([youtube_url])
    
    # Verify audio file exists
    if not os.path.exists(audio_path):
        # Try with an extension that yt-dlp might have used
        potential_paths = [
            os.path.join(output_dir, f"{video_id}.mp3"),
            os.path.join(output_dir, f"{video_id}.m4a"),
            os.path.join(output_dir, f"{video_id}.webm")
        ]
        
        for path in potential_paths:
            if os.path.exists(path):
                # Convert to mp3 if it's not already
                if not path.endswith('.mp3'):
                    ffmpeg_path = verify_ffmpeg()[0]
                    output_mp3 = os.path.join(output_dir, f"{video_id}.mp3")
                    subprocess.run([
                        ffmpeg_path, '-i', path, '-c:a', 'libmp3lame', 
                        '-q:a', '2', output_mp3, '-y'
                    ], check=True, capture_output=True)
                    os.remove(path)  # Remove the original file
                    audio_path = output_mp3
                else:
                    audio_path = path
                break
        else:
            raise FileNotFoundError(f"Could not find downloaded audio file for video {video_id}")
    return audio_path

Audio Transcription Using OpenAI’s gpt-4o-transcribe Model

After extracting the audio and converting it to a standard format, the next step is to transcribe the audio to text. For this purpose, I used OpenAI’s newly launched gpt-4o-transcribe speech-to-text model, accessible through the speech-to-text API. This model has outperformed OpenAI’s Whisper models in terms of both transcription accuracy and language coverage.

The function process_video_transcribe in transcriber.py receives the converted audio file and interfaces with the gpt-4o-transcribe model through OpenAI’s speech-to-text API. The gpt-4o-transcribe model currently has a file size limit of 25 MB and a duration limit of 1,500 seconds per request. To overcome this limitation, I split longer files into multiple chunks and transcribe the chunks separately. The process_video_transcribe function checks whether the input file exceeds the size and/or duration limit. If either threshold is exceeded, it calls the split_and_transcribe function, which first calculates the number of chunks needed based on size and on duration, and takes the maximum of the two as the final number of chunks for transcription. It then finds the start and end times for each chunk and extracts these chunks from the audio file. Subsequently, it transcribes each chunk using the gpt-4o-transcribe model with OpenAI’s speech-to-text API and combines the transcripts of all chunks to generate the final transcript.

def process_video_transcribe(audio_path, output_dir, api_key, progress_callback=None, model="gpt-4o-transcribe"):
    """
    Transcribe an audio file using OpenAI API, with automatic chunking for large files
    Always uses the selected model, with no fallback
    
    Args:
        audio_path: Path to the audio file
        output_dir: Directory to save the transcript
        api_key: OpenAI API key
        progress_callback: Function to call with progress updates (0-100)
        model: The model to use for transcription (default: gpt-4o-transcribe)
        
    Returns:
        tuple: (transcript text, transcript path)
    """
    # Extract video ID from audio path
    video_id = os.path.basename(audio_path).split('.')[0]
    transcript_path = os.path.join(output_dir, f"{video_id}_transcript.txt")
    
    # Setup OpenAI client
    client = OpenAI(api_key=api_key)
    
    # Update progress
    if progress_callback:
        progress_callback(10)
    
    # Get file size in MB
    file_size_mb = os.path.getsize(audio_path) / (1024 * 1024)
    
    # Universal chunking thresholds - apply to both models
    max_size_mb = 25  # 25MB chunk size for both models
    max_duration_seconds = 1500  # 1500 seconds chunk duration for both models
    
    # Load the audio file to get its duration
    try:
        audio = AudioSegment.from_file(audio_path)
        duration_seconds = len(audio) / 1000  # pydub uses milliseconds
    except Exception as e:
        print(f"Error loading audio to check duration: {e}")
        audio = None
        duration_seconds = 0
    
    # Determine if chunking is needed
    needs_chunking = False
    chunking_reason = []
    
    if file_size_mb > max_size_mb:
        needs_chunking = True
        chunking_reason.append(f"size ({file_size_mb:.2f}MB exceeds {max_size_mb}MB)")
    
    if duration_seconds > max_duration_seconds:
        needs_chunking = True
        chunking_reason.append(f"duration ({duration_seconds:.2f}s exceeds {max_duration_seconds}s)")
    
    # Log the decision
    if needs_chunking:
        reason_str = " and ".join(chunking_reason)
        print(f"Audio needs chunking due to {reason_str}. Using {model} for transcription.")
    else:
        print(f"Audio file is within limits. Using {model} for direct transcription.")
    
    # Check if file needs chunking
    if needs_chunking:
        if progress_callback:
            progress_callback(15)
        
        # Split the audio file into chunks and transcribe each chunk using the selected model only
        full_transcript = split_and_transcribe(
            audio_path, client, model, progress_callback, 
            max_size_mb, max_duration_seconds, audio
        )
    else:
        # File is small enough, transcribe directly with the selected model
        with open(audio_path, "rb") as audio_file:
            if progress_callback:
                progress_callback(30)
                
            transcript_response = client.audio.transcriptions.create(
                model=model, 
                file=audio_file
            )
            
            if progress_callback:
                progress_callback(80)
            
            full_transcript = transcript_response.text
    
    # Save transcript to file
    with open(transcript_path, "w", encoding="utf-8") as f:
        f.write(full_transcript)
    
    # Update progress
    if progress_callback:
        progress_callback(100)
    
    return full_transcript, transcript_path

def split_and_transcribe(audio_path, client, model, progress_callback=None, 
                         max_size_mb=25, max_duration_seconds=1500, audio=None):
    """
    Split an audio file into chunks and transcribe each chunk 
    
    Args:
        audio_path: Path to the audio file
        client: OpenAI client
        model: Model to use for transcription (will not fall back to other models)
        progress_callback: Function to call with progress updates
        max_size_mb: Maximum file size in MB
        max_duration_seconds: Maximum duration in seconds
        audio: Pre-loaded AudioSegment (optional)
        
    Returns:
        str: Combined transcript from all chunks
    """
    # Load the audio file if not provided
    if audio is None:
        audio = AudioSegment.from_file(audio_path)
    
    # Get audio duration in seconds
    duration_seconds = len(audio) / 1000
    
    # Calculate the number of chunks needed based on both size and duration
    file_size_mb = os.path.getsize(audio_path) / (1024 * 1024)
    
    chunks_by_size = math.ceil(file_size_mb / (max_size_mb * 0.9))  # Use 90% of max to be safe
    chunks_by_duration = math.ceil(duration_seconds / (max_duration_seconds * 0.95))  # Use 95% of max to be safe
    num_chunks = max(chunks_by_size, chunks_by_duration)
    
    print(f"Splitting audio into {num_chunks} chunks based on size ({chunks_by_size}) and duration ({chunks_by_duration})")
    
    # Calculate chunk duration in milliseconds
    chunk_length_ms = len(audio) // num_chunks
    
    # Create temp directory for chunks if it doesn't exist
    temp_dir = os.path.join(os.path.dirname(audio_path), "temp_chunks")
    os.makedirs(temp_dir, exist_ok=True)
    
    # Split the audio into chunks and transcribe each chunk
    transcripts = []
    
    for i in range(num_chunks):
        if progress_callback:
            # Update progress: 20% for splitting, 60% for transcribing
            progress_percent = 20 + int((i / num_chunks) * 60)
            progress_callback(progress_percent)
        
        # Calculate start and end times for this chunk
        start_ms = i * chunk_length_ms
        end_ms = min((i + 1) * chunk_length_ms, len(audio))
        
        # Extract the chunk
        chunk = audio[start_ms:end_ms]
        
        # Save the chunk to a temporary file
        chunk_path = os.path.join(temp_dir, f"chunk_{i}.mp3")
        chunk.export(chunk_path, format="mp3")
        
        # Log chunk information
        chunk_size_mb = os.path.getsize(chunk_path) / (1024 * 1024)
        chunk_duration = len(chunk) / 1000
        print(f"Chunk {i+1}/{num_chunks}: {chunk_size_mb:.2f}MB, {chunk_duration:.2f}s")
        
        # Transcribe the chunk 
        try:
            with open(chunk_path, "rb") as chunk_file:
                transcript_response = client.audio.transcriptions.create(
                    model=model,
                    file=chunk_file
                )
                
                # Add to our list of transcripts
                transcripts.append(transcript_response.text)
        except Exception as e:
            print(f"Error transcribing chunk {i+1} with {model}: {e}")
            # Add a placeholder for the failed chunk
            transcripts.append(f"[Transcription failed for segment {i+1}]")
        
        # Clean up the temporary chunk file
        os.remove(chunk_path)
    
    # Clean up the temporary directory
    try:
        os.rmdir(temp_dir)
    except:
        print(f"Note: Could not remove temporary directory {temp_dir}")
    
    # Combine all transcripts with proper spacing
    full_transcript = " ".join(transcripts)
    
    return full_transcript

The following screenshot of the Streamlit app shows the video processing and transcribing workflow for one of my webinars, “Integrating LLMs into Business,” available on my YouTube channel.

Snapshot of the Streamlit app showing the process of extracting audio and transcribing (image by author)

Retrieval Augmented Generation (RAG) for Interactive Conversations

After generating the video transcript, the application builds a RAG system to facilitate both text- and speech-based interactions. The conversational intelligence is implemented through the VideoRAG class in rag_system.py, which initializes the chunk size and overlap, OpenAI embeddings, a ChatOpenAI instance to generate responses with the gpt-4o model, and a ConversationBufferMemory to maintain chat history for contextual continuity.

The create_vector_store method splits the documents into chunks and creates a vector store using the FAISS vector database. The handle_question_submission method processes text questions and appends each new question and its answer to the conversation history. The handle_speech_input function implements the complete voice-to-text-to-voice pipeline. It first records the question audio, transcribes the question, processes the query through the RAG system, and synthesizes speech for the response.

class VideoRAG:
    def __init__(self, api_key=None, chunk_size=1000, chunk_overlap=200):
        """Initialize the RAG system with OpenAI API key."""
        # Use provided API key or try to get from environment
        self.api_key = api_key if api_key else st.secrets["OPENAI_API_KEY"]
        if not self.api_key:
            raise ValueError("OpenAI API key is required either as parameter or environment variable")
            
        self.embeddings = OpenAIEmbeddings(openai_api_key=self.api_key)
        self.llm = ChatOpenAI(
            openai_api_key=self.api_key,
            model="gpt-4o",
            temperature=0
        )
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.vector_store = None
        self.chain = None
        self.memory = ConversationBufferMemory(
            memory_key="chat_history",
            return_messages=True
        )
    
    def create_vector_store(self, transcript):
        """Create a vector store from the transcript."""
        # Split the text into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            separators=["nn", "n", " ", ""]
        )
        chunks = text_splitter.split_text(transcript)
        
        # Create vector store
        self.vector_store = FAISS.from_texts(chunks, self.embeddings)
        
        # Create prompt template for the RAG system
        system_template = """You are a specialized AI assistant that answers questions about a specific video. 
        
        You have access to snippets from the video transcript, and your role is to provide accurate information ONLY based on these snippets.
        
        Guidelines:
        1. Only answer questions based on the information provided in the context from the video transcript, otherwise say that "I don't know. The video doesn't cover that information."
        2. The question may ask you to summarize the video or tell what the video is about. In that case, present a summary of the context. 
        3. Don't make up information or use knowledge from outside the provided context
        4. Keep your answers concise and directly related to the question
        5. If asked about your capabilities or identity, explain that you're an AI assistant that specializes in answering questions about this specific video
        
        Context from the video transcript:
        {context}
        
        Chat History:
        {chat_history}
        """
        user_template = "{question}"
        
        # Create the messages for the chat prompt
        messages = [
            SystemMessagePromptTemplate.from_template(system_template),
            HumanMessagePromptTemplate.from_template(user_template)
        ]
        
        # Create the chat prompt
        qa_prompt = ChatPromptTemplate.from_messages(messages)
        
        # Initialize the RAG chain with the custom prompt
        self.chain = ConversationalRetrievalChain.from_llm(
            llm=self.llm,
            retriever=self.vector_store.as_retriever(
                search_kwargs={"k": 5}
            ),
            memory=self.memory,
            combine_docs_chain_kwargs={"prompt": qa_prompt},
            verbose=True
        )
        
        return len(chunks)
    
    def set_chat_history(self, chat_history):
        """Set chat history from external session state."""
        if not self.memory:
            return
            
        # Clear existing memory
        self.memory.clear()
        
        # Convert standard chat history format to LangChain message format
        for message in chat_history:
            if message["role"] == "user":
                self.memory.chat_memory.add_user_message(message["content"])
            elif message["role"] == "assistant":
                self.memory.chat_memory.add_ai_message(message["content"])
    
    def ask(self, question, chat_history=None):
        """Ask a question to the RAG system."""
        if not self.chain:
            raise ValueError("Vector store not initialized. Call create_vector_store first.")
        
        # If chat history is provided, update the memory
        if chat_history:
            self.set_chat_history(chat_history)
        
        # Get response
        response = self.chain.invoke({"question": question})
        return response["answer"]
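With these pieces in place, the class can be exercised end-to-end. Here is a minimal usage sketch; the class name RAGSystem, the constructor defaults, and the transcript variable are assumptions for illustration and may differ from the actual repo:

# Hypothetical usage sketch (class name and argument values assumed, not taken from the repo)
rag = RAGSystem(api_key="sk-...", chunk_size=1000, chunk_overlap=100)

transcript = "..."  # transcript text produced earlier by the transcription step
num_chunks = rag.create_vector_store(transcript)
print(f"Indexed {num_chunks} chunks")

answer = rag.ask("What is the main topic of the video?")
print(answer)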

See the following snapshot of the Streamlit app, showing the interactive conversation interface with the video.

Snapshot showing conversational interface and interactive learning content (image by author)

The following snapshot shows a conversation with the video with speech input and text+speech output.

Conversation with video (image by author)

Feature Generation

The application generates three features: hierarchical summary, quiz, and flashcards. Please refer to their respective commented codes in the GitHub repo.

The SummaryGenerator class in summary.py provides structured content summarization by creating a hierarchical representation of the video content to provide users with quick insights into the main concepts and supporting details. The system retrieves key contextual segments from the transcript using RAG. Using a prompt (see generate_summary), it creates a hierarchical summary with three levels: main points, sub-points, and additional details. The create_summary_popup_html method transforms the generated summary into an interactive visual representation using CSS and JavaScript.

# summary.py
import streamlit as st
from langchain_openai import ChatOpenAI


class SummaryGenerator:
    def __init__(self):
        pass
    
    def generate_summary(self, rag_system, api_key, model="gpt-4o", temperature=0.2):
        """
        Generate a hierarchical bullet-point summary from the video transcript
        
        Args:
            rag_system: The RAG system with vector store
            api_key: OpenAI API key
            model: Model to use for summary generation
            temperature: Creativity level (0.0-1.0)
            
        Returns:
            str: Hierarchical bullet-point summary text
        """
        if not rag_system:
            st.error("Please transcribe the video first before creating a summary!")
            return ""
        
        with st.spinner("Generating hierarchical summary..."):
            # Create LLM for summary generation
            summary_llm = ChatOpenAI(
                openai_api_key=api_key,
                model=model,
                temperature=temperature  # Lower temperature for more factual summaries
            )
            
            # Use the RAG system to get relevant context
            try:
                # Get broader context since we're summarizing the whole video
                relevant_docs = rag_system.vector_store.similarity_search(
                    "summarize the main points of this video", k=10
                )
                context = "nn".join([doc.page_content for doc in relevant_docs])
                
                prompt = """Based on the video transcript, create a hierarchical bullet-point summary of the content.
                Structure your summary with exactly these levels:
                
                • Main points (use • or * at the start of the line for these top-level points)
                  - Sub-points (use - at the start of the line for these second-level details)
                    * Additional details (use spaces followed by * for third-level points)
                
                For example:
                • First main point
                  - Important detail about the first point
                  - Another important detail
                    * A specific example
                    * Another specific example
                • Second main point
                  - Detail about second point
                
                Be consistent with the exact formatting shown above. Each bullet level must start with the exact character shown (• or *, -, and spaces+*).
                Create 3-5 main points with 2-4 sub-points each, and add third-level details where appropriate.
                Focus on the most important information from the video.
                """
                
                # Use the LLM with context to generate the summary
                messages = [
                    {"role": "system", "content": f"You are given the following context from a video transcript:nn{context}nnUse this context to create a hierarchical summary according to the instructions."},
                    {"role": "user", "content": prompt}
                ]
                
                response = summary_llm.invoke(messages)
                return response.content
            except Exception as e:
                # Fallback to the regular RAG system if there's an error
                # (use a literal question here because `prompt` may not be defined if the
                # error occurred before it was built)
                st.warning(f"Using standard summary generation due to error: {str(e)}")
                return rag_system.ask("Create a hierarchical bullet-point summary of the main points of this video.")
    
    def create_summary_popup_html(self, summary_content):
        """
        Create HTML for the summary popup with properly formatted hierarchical bullets
        
        Args:
            summary_content: Raw summary text with markdown bullet formatting
            
        Returns:
            str: HTML for the popup with properly formatted bullets
        """
        # Instead of relying on markdown conversion, let's manually parse and format the bullet points
        lines = summary_content.strip().split('\n')
        formatted_html = []
        
        in_list = False
        list_level = 0
        
        for line in lines:
            line = line.strip()
            
            # Skip empty lines
            if not line:
                continue
                
            # Detect if this is a markdown header
            if line.startswith('# '):
                if in_list:
                    # Close any open lists
                    for _ in range(list_level):
                        formatted_html.append('</ul>')
                    in_list = False
                    list_level = 0
                formatted_html.append(f'<h1>{line[2:]}</h1>')
                continue
                
            # Check line for bullet point markers
            if line.startswith('• ') or line.startswith('* '):
                # Top level bullet
                content = line[2:].strip()
                
                if not in_list:
                    # Start a new list
                    formatted_html.append('<ul class="top-level">')
                    in_list = True
                    list_level = 1
                elif list_level > 1:
                    # Close nested lists to get back to top level
                    for _ in range(list_level - 1):
                        formatted_html.append('</ul></li>')
                    list_level = 1
                else:
                    # Close previous list item if needed
                    if formatted_html and not formatted_html[-1].endswith('</ul></li>') and in_list:
                        formatted_html.append('</li>')
                        
                formatted_html.append(f'<li class="top-level-item">{content}')
                
            elif line.startswith('- '):
                # Second level bullet
                content = line[2:].strip()
                
                if not in_list:
                    # Start new lists
                    formatted_html.append('<ul class="top-level"><li class="top-level-item">Second level items')
                    formatted_html.append('<ul class="second-level">')
                    in_list = True
                    list_level = 2
                elif list_level == 1:
                    # Add a nested list
                    formatted_html.append('<ul class="second-level">')
                    list_level = 2
                elif list_level > 2:
                    # Close deeper nested lists to get to second level
                    for _ in range(list_level - 2):
                        formatted_html.append('</ul></li>')
                    list_level = 2
                else:
                    # Close previous list item if needed
                    if formatted_html and not formatted_html[-1].endswith('</ul></li>') and list_level == 2:
                        formatted_html.append('</li>')
                        
                formatted_html.append(f'<li class="second-level-item">{content}')
                
            elif line.startswith('  * ') or line.startswith('    * '):
                # Third level bullet
                content = line.strip()[2:].strip()
                
                if not in_list:
                    # Start new lists (all levels)
                    formatted_html.append('<ul class="top-level"><li class="top-level-item">Top level')
                    formatted_html.append('<ul class="second-level"><li class="second-level-item">Second level')
                    formatted_html.append('<ul class="third-level">')
                    in_list = True
                    list_level = 3
                elif list_level == 2:
                    # Add a nested list
                    formatted_html.append('<ul class="third-level">')
                    list_level = 3
                elif list_level < 3:
                    # We missed a level, adjust
                    formatted_html.append('<li>Missing level</li>')
                    formatted_html.append('<ul class="third-level">')
                    list_level = 3
                else:
                    # Close previous list item if needed
                    if formatted_html and not formatted_html[-1].endswith('</ul></li>') and list_level == 3:
                        formatted_html.append('</li>')
                        
                formatted_html.append(f'<li class="third-level-item">{content}')
            else:
                # Regular paragraph
                if in_list:
                    # Close any open lists
                    for _ in range(list_level):
                        formatted_html.append('</ul>')
                        if list_level > 1:
                            formatted_html.append('</li>')
                    in_list = False
                    list_level = 0
                formatted_html.append(f'<p>{line}</p>')
        
        # Close any open lists
        if in_list:
            # Close final item
            formatted_html.append('</li>')
            # Close any open lists
            for _ in range(list_level):
                if list_level > 1:
                    formatted_html.append('</ul></li>')
                else:
                    formatted_html.append('</ul>')
        
        summary_html = '\n'.join(formatted_html)
        
        html = f"""
        <div id="summary-popup" class="popup-overlay">
            <div class="popup-content">
                <div class="popup-header">
                    <h2>Hierarchical Summary</h2>
                    <button onclick="closeSummaryPopup()" class="close-button">×</button>
                </div>
                <div class="popup-body">
                    {summary_html}
                </div>
            </div>
        </div>
        
        <style>
        .popup-overlay {{
            position: fixed;
            top: 0;
            left: 0;
            width: 100%;
            height: 100%;
            background-color: rgba(0, 0, 0, 0.5);
            z-index: 1000;
            display: flex;
            justify-content: center;
            align-items: center;
        }}
        
        .popup-content {{
            background-color: white;
            padding: 20px;
            border-radius: 10px;
            width: 80%;
            max-width: 800px;
            max-height: 80vh;
            overflow-y: auto;
            box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);
        }}
        
        .popup-header {{
            display: flex;
            justify-content: space-between;
            align-items: center;
            border-bottom: 1px solid #ddd;
            padding-bottom: 10px;
            margin-bottom: 15px;
        }}
        
        .close-button {{
            background: none;
            border: none;
            font-size: 24px;
            cursor: pointer;
            color: #555;
        }}
        
        .close-button:hover {{
            color: #000;
        }}
        
        /* Style for hierarchical bullets */
        .popup-body ul {{
            padding-left: 20px;
            margin-bottom: 5px;
        }}
        
        .popup-body ul.top-level {{
            list-style-type: disc;
        }}
        
        .popup-body ul.second-level {{
            list-style-type: circle;
            margin-top: 5px;
        }}
        
        .popup-body ul.third-level {{
            list-style-type: square;
            margin-top: 3px;
        }}
        
        .popup-body li.top-level-item {{
            margin-bottom: 12px;
            font-weight: bold;
        }}
        
        .popup-body li.second-level-item {{
            margin-bottom: 8px;
            font-weight: normal;
        }}
        
        .popup-body li.third-level-item {{
            margin-bottom: 5px;
            font-weight: normal;
            font-size: 0.95em;
        }}
        
        .popup-body p {{
            margin-bottom: 10px;
        }}
        </style>
        
        <script>
        function closeSummaryPopup() {{
            document.getElementById('summary-popup').style.display = 'none';
            
            // Send message to Streamlit
            window.parent.postMessage({{
                type: "streamlit:setComponentValue",
                value: true
            }}, "*");
        }}
        </script>
        """
        return html
Hierarchical summary (image by author)
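One plausible way to render the returned HTML inside the Streamlit app is through Streamlit's components API. This is a sketch of the idea rather than the app's actual rendering code; the rag_system and api_key variables are assumed to exist, and the height value is an arbitrary choice:

import streamlit.components.v1 as components

summary_gen = SummaryGenerator()
summary_text = summary_gen.generate_summary(rag_system, api_key)
popup_html = summary_gen.create_summary_popup_html(summary_text)
components.html(popup_html, height=600, scrolling=True)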

The Talk-to-Videos app generates quizzes from the video through the QuizGenerator class in quiz.py. The quiz generator creates multiple-choice questions targeting specific facts and concepts presented in the video. Unlike the RAG system, where I use a temperature of zero, I increased the LLM temperature to 0.4 to encourage some creativity in quiz generation. A structured prompt guides the quiz generation process. The parse_quiz_response method extracts and validates the generated quiz elements to make sure that each question has all the required components. To prevent users from recognizing a pattern and to promote real understanding, the quiz generator shuffles the answer options. Questions are presented one at a time, followed by immediate feedback on each answer. After completing all questions, the calculate_quiz_results method assesses the user's answers, and the user is presented with an overall score, a visual breakdown of correct versus incorrect answers, and feedback on their performance level. In this way, the quiz generation functionality transforms passive video watching into active learning by challenging users to recall and apply information presented in the video.

# quiz.py
import random

import streamlit as st
from langchain_openai import ChatOpenAI


class QuizGenerator:
    def __init__(self):
        pass
    
    def generate_quiz(self, rag_system, api_key, transcript=None, model="gpt-4o", temperature=0.4):
        """
        Generate quiz questions based on the video transcript
        
        Args:
            rag_system: The RAG system with vector store
            api_key: OpenAI API key
            transcript: The full transcript text (optional)
            model: Model to use for question generation
            temperature: Creativity level (0.0-1.0)
            
        Returns:
            list: List of question objects
        """
        if not rag_system:
            st.error("Please transcribe the video first before creating a quiz!")
            return []
        
        # Create a temporary LLM with slightly higher temperature for more creative questions
        creative_llm = ChatOpenAI(
            openai_api_key=api_key,
            model=model,
            temperature=temperature
        )

        num_questions = 10
        
        # Prompt to generate quiz
        prompt = f"""Based on the video transcript, generate {num_questions} multiple-choice questions to test understanding of the content.
        For each question:
        1. The question should be specific to information mentioned in the video
        2. Include 4 options (A, B, C, D)
        3. Clearly indicate the correct answer
        
        Format your response exactly as follows for each question:
        QUESTION: [question text]
        A: [option A]
        B: [option B]
        C: [option C]
        D: [option D]
        CORRECT: [letter of correct answer]
       
        Make sure all questions are based on facts from the video."""
        
        try:
            if transcript:
                # If we have the full transcript, use it
                messages = [
                    {"role": "system", "content": f"You are given the following transcript from a video:nn{transcript}nnUse this transcript to create quiz questions according to the instructions."},
                    {"role": "user", "content": prompt}
                ]
                
                response = creative_llm.invoke(messages)
                response_text = response.content
            else:
                # Fallback to RAG approach if no transcript is provided
                relevant_docs = rag_system.vector_store.similarity_search(
                    "what are the main topics covered in this video?", k=5
                )
                context = "nn".join([doc.page_content for doc in relevant_docs])
                
                # Use the creative LLM with context to generate questions
                messages = [
                    {"role": "system", "content": f"You are given the following context from a video transcript:nn{context}nnUse this context to create quiz questions according to the instructions."},
                    {"role": "user", "content": prompt}
                ]
                
                response = creative_llm.invoke(messages)
                response_text = response.content
        except Exception as e:
            # Fallback to the regular RAG system if there's an error
            st.warning(f"Using standard question generation due to error: {str(e)}")
            response_text = rag_system.ask(prompt)
        
        return self.parse_quiz_response(response_text)

    # The rest of the class remains unchanged
    def parse_quiz_response(self, response_text):
        """
        Parse the LLM response to extract questions, options, and correct answers
        
        Args:
            response_text: Raw text response from LLM
            
        Returns:
            list: List of parsed question objects
        """
        quiz_questions = []
        current_question = {}
        
        for line in response_text.strip().split('\n'):
            line = line.strip()
            if line.startswith('QUESTION:'):
                if current_question and 'question' in current_question and 'options' in current_question and 'correct' in current_question:
                    quiz_questions.append(current_question)
                current_question = {
                    'question': line[len('QUESTION:'):].strip(),
                    'options': [],
                    'correct': None
                }
            elif line.startswith(('A:', 'B:', 'C:', 'D:')):
                option_letter = line[0]
                option_text = line[2:].strip()
                current_question.setdefault('options', []).append((option_letter, option_text))
            elif line.startswith('CORRECT:'):
                current_question['correct'] = line[len('CORRECT:'):].strip()
        
        # Add the last question
        if current_question and 'question' in current_question and 'options' in current_question and 'correct' in current_question:
            quiz_questions.append(current_question)
        
        # Randomize options for each question
        randomized_questions = []
        for q in quiz_questions:
            # Get the original correct answer
            correct_letter = q['correct']
            correct_option = None
            
            # Find the correct option text
            for letter, text in q['options']:
                if letter == correct_letter:
                    correct_option = text
                    break
            
            if correct_option is None:
                # If we can't find the correct answer, keep the question as is
                randomized_questions.append(q)
                continue
                
            # Create a list of options texts and shuffle them
            option_texts = [text for _, text in q['options']]
            
            # Create a copy of the original letters
            option_letters = [letter for letter, _ in q['options']]
            
            # Create a list of (letter, text) pairs
            options_pairs = list(zip(option_letters, option_texts))
            
            # Shuffle the pairs
            random.shuffle(options_pairs)
            
            # Find the new position of the correct answer
            new_correct_letter = None
            for letter, text in options_pairs:
                if text == correct_option:
                    new_correct_letter = letter
                    break
            
            # Create a new question with randomized options
            new_q = {
                'question': q['question'],
                'options': options_pairs,
                'correct': new_correct_letter
            }
            
            randomized_questions.append(new_q)
        
        return randomized_questions
    
    def calculate_quiz_results(self, questions, user_answers):
        """
        Calculate quiz results based on user answers
        
        Args:
            questions: List of question objects
            user_answers: Dictionary of user answers keyed by question_key
            
        Returns:
            tuple: (results dict, correct count)
        """
        correct_count = 0
        results = {}
        
        for i, question in enumerate(questions):
            question_key = f"quiz_q_{i}"
            user_answer = user_answers.get(question_key)
            correct_answer = question['correct']
            
            # Only count as correct if user selected an answer and it matches
            is_correct = user_answer is not None and user_answer == correct_answer
            if is_correct:
                correct_count += 1
            
            results[question_key] = {
                'user_answer': user_answer,
                'correct_answer': correct_answer,
                'is_correct': is_correct
            }
        
        return results, correct_count
Quiz result (image by author)
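To make the scoring logic concrete, here is a small, self-contained example of calculate_quiz_results with two hand-made questions in the format produced by parse_quiz_response (the question content below is invented purely for illustration):

# Illustrative only: hand-made questions in the parse_quiz_response format
questions = [
    {"question": "What does RAG stand for?",
     "options": [("A", "Retrieval-Augmented Generation"), ("B", "Random Answer Generator"),
                 ("C", "Recurrent Attention Graph"), ("D", "Rapid API Gateway")],
     "correct": "A"},
    {"question": "Which vector store does the app use?",
     "options": [("A", "Pinecone"), ("B", "FAISS"), ("C", "Chroma"), ("D", "Weaviate")],
     "correct": "B"},
]
user_answers = {"quiz_q_0": "A", "quiz_q_1": "C"}

quiz_gen = QuizGenerator()
results, correct_count = quiz_gen.calculate_quiz_results(questions, user_answers)
print(f"Score: {correct_count}/{len(questions)}")  # Score: 1/2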

Talk-to-Videos also generates flashcards from the video content, which support active recall and spaced-repetition learning techniques. This is done through the FlashcardGenerator class in flashcards.py, which creates a mix of flashcards focusing on key term definitions, conceptual questions, fill-in-the-blank statements, and True/False questions with explanations. A prompt guides the LLM to output flashcards in a structured JSON format, with each card containing distinct “front” and “back” elements. The shuffle_flashcards method produces a randomized presentation, and each flashcard is validated to ensure that it contains both front and back components before being presented to the user. The answer to each flashcard is initially hidden and is revealed on the user’s request, following the classic flashcard reveal pattern. Users can generate a new set of flashcards for more practice. The flashcard and quiz systems are interconnected so that users can switch between them as needed.

# flashcards.py
import json
import random
from typing import Dict, List


class FlashcardGenerator:
    """Class to generate flashcards from video content using the RAG system."""
    
    def __init__(self):
        """Initialize the flashcard generator."""
        pass
    
    def generate_flashcards(self, rag_system, api_key, transcript=None, num_cards=10, model="gpt-4o") -> List[Dict[str, str]]:
        """
        Generate flashcards based on the video content.
        
        Args:
            rag_system: The initialized RAG system with video content
            api_key: OpenAI API key
            transcript: The full transcript text (optional)
            num_cards: Number of flashcards to generate (default: 10)
            model: The OpenAI model to use
            
        Returns:
            List of flashcard dictionaries with 'front' and 'back' keys
        """
        # Import here to avoid circular imports
        from langchain_openai import ChatOpenAI
        
        # Initialize language model
        llm = ChatOpenAI(
            openai_api_key=api_key,
            model=model,
            temperature=0.4
        )
        
        # Create the prompt for flashcard generation
        prompt = f"""
        Create {num_cards} educational flashcards based on the video content.
        
        Each flashcard should have:
        1. A front side with a question, term, or concept
        2. A back side with the answer, definition, or explanation
        
        Focus on the most important and educational content from the video. 
        Create a mix of different types of flashcards:
        - Key term definitions
        - Conceptual questions
        - Fill-in-the-blank statements
        - True/False questions with explanations
        
        Format your response as a JSON array of objects with 'front' and 'back' properties.
        Example:
        [
            {{"front": "What is photosynthesis?", "back": "The process by which plants convert light energy into chemical energy."}},
            {{"front": "The three branches of government are: Executive, Legislative, and _____", "back": "Judicial"}}
        ]
        
        Make sure your output is valid JSON format with exactly {num_cards} flashcards.
        """
        
        try:
            # Determine the context to use
            if transcript:
                # Use the full transcript if provided
                # Create messages for the language model
                messages = [
                    {"role": "system", "content": f"You are an educational content creator specializing in creating effective flashcards. Use the following transcript from a video to create educational flashcards:nn{transcript}"},
                    {"role": "user", "content": prompt}
                ]
            else:
                # Fallback to RAG system if no transcript is provided
                relevant_docs = rag_system.vector_store.similarity_search(
                    "key points and educational concepts in the video", k=15
                )
                context = "nn".join([doc.page_content for doc in relevant_docs])
                
                # Create messages for the language model
                messages = [
                    {"role": "system", "content": f"You are an educational content creator specializing in creating effective flashcards. Use the following context from a video to create educational flashcards:nn{context}"},
                    {"role": "user", "content": prompt}
                ]
            
            # Generate flashcards
            response = llm.invoke(messages)
            content = response.content
            
            # Extract JSON content in case there's text around it
            json_start = content.find('[')
            json_end = content.rfind(']') + 1
            
            if json_start >= 0 and json_end > json_start:
                json_content = content[json_start:json_end]
                flashcards = json.loads(json_content)
            else:
                # Fallback in case of improper JSON formatting
                raise ValueError("Failed to extract valid JSON from response")
            
            # Verify we have the expected number of cards (or adjust as needed)
            actual_cards = min(len(flashcards), num_cards)
            flashcards = flashcards[:actual_cards]
            
            # Validate each flashcard has required fields
            validated_cards = []
            for card in flashcards:
                if 'front' in card and 'back' in card:
                    validated_cards.append({
                        'front': card['front'],
                        'back': card['back']
                    })
            
            return validated_cards
        
        except Exception as e:
            # Handle errors gracefully
            print(f"Error generating flashcards: {str(e)}")
            # Return a few basic flashcards in case of error
            return [
                {"front": "Error generating flashcards", "back": f"Please try again. Error: {str(e)}"},
                {"front": "Tip", "back": "Try regenerating flashcards or using a different video"}
            ]
    
    def shuffle_flashcards(self, flashcards: List[Dict[str, str]]) -> List[Dict[str, str]]:
        """Shuffle the order of flashcards"""
        shuffled = flashcards.copy()
        random.shuffle(shuffled)
        return shuffled
Flashcards (image by author)
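For context, the hide/reveal interaction on the front end could be wired up with Streamlit session state along the lines of the sketch below. This is a sketch of the pattern only, not the app's actual widget code; the session-state keys and layout are assumptions:

# Hypothetical sketch of a flashcard reveal UI (keys and layout assumed)
import streamlit as st

cards = st.session_state.get("flashcards", [])  # assumed session-state key
for i, card in enumerate(cards):
    st.markdown(f"**Card {i + 1}:** {card['front']}")
    if st.button("Reveal answer", key=f"reveal_{i}"):
        st.session_state[f"revealed_{i}"] = True
    if st.session_state.get(f"revealed_{i}"):
        st.info(card["back"])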

Potential Extensions and Improvements

This application can be extended and improved in a number of ways. For instance:

  • Integration of visual features in the video (such as keyframes) could be explored alongside the audio to extract more meaningful information.
  • Team-based learning experiences could be enabled, where office colleagues or classmates share notes, quiz scores, and summaries.
  • Creating navigable transcripts that allow users to click on a specific section to jump to that point in the video.
  • Creating step-by-step action plans for implementing concepts from the video in real business settings.
  • Modifying the RAG prompt to elaborate on answers and provide simpler explanations of difficult concepts.
  • Generating questions that build metacognitive skills by prompting learners to reflect on their thinking process and learning strategies while engaging with the video content.

That’s all folks! If you liked the article, please follow me on Medium and LinkedIn.

The post Talk to Videos appeared first on Towards Data Science.

]]>
Attractors in Neural Network Circuits: Beauty and Chaos https://towardsdatascience.com/attractors-in-neural-network-circuits-beauty-and-chaos/ Tue, 25 Mar 2025 19:26:57 +0000 https://towardsdatascience.com/?p=605254 Neural networks under a different lens: generating basins of attraction in a shift register NN

The post Attractors in Neural Network Circuits: Beauty and Chaos appeared first on Towards Data Science.

]]>
The state space of the first two neuron activations over time follows an attractor.

What do memories, oscillating chemical reactions and double pendulums have in common? All these systems have a basin of attraction for possible states, like a magnet that draws the system towards certain trajectories. Complex systems with multiple inputs usually evolve over time, generating intricate and sometimes chaotic behaviors. Attractors represent the long-term behavioral pattern of a dynamical system — a pattern to which the system converges over time regardless of its initial conditions.

Neural networks have become ubiquitous in our current Artificial Intelligence era, typically serving as powerful tools for representation extraction and pattern recognition. However, these systems can also be viewed through another fascinating lens: as dynamical systems that evolve and converge to a manifold of states over time. When implemented with feedback loops, even simple neural networks can produce strikingly beautiful attractors, ranging from limit cycles to chaotic structures.

Neural Networks as Dynamical Systems

While neural networks in the general sense are most commonly known for embedding extraction tasks, they can also be viewed as dynamical systems. A dynamical system describes how points in a state space evolve over time according to a fixed set of rules or forces. In the context of neural networks, the state space consists of the activation patterns of neurons, and the evolution rule is determined by the network’s weights, biases, activation functions, and other tricks.

Traditional NNs are optimized via gradient descent to find their end state of convergence. However, when we introduce feedback — connecting the output back to the input — the network becomes a recurrent system with a different kind of temporal dynamic. These dynamics can exhibit a wide range of behaviors, from simple convergence to a fixed point, to complex chaotic patterns.

Understanding Attractors

An attractor is a set of states toward which a system tends to evolve from a wide variety of starting conditions. Once a system reaches an attractor, it remains within that set of states unless perturbed by an external force. Attractors are indeed deeply involved in forming memories [1], oscillating chemical reactions [2], and other nonlinear dynamical systems. 

Types of Attractors

Dynamical Systems can exhibit several types of attractors, each with distinct characteristics:

  • Point Attractors: the simplest form, where the system converges to a single fixed point regardless of starting conditions. This represents a stable equilibrium state.
  • Limit Cycles: the system settles into a repeating periodic orbit, forming a closed loop in phase space. This represents oscillatory behavior with a fixed period.
  • Toroidal (Quasiperiodic) Attractors: the system follows trajectories that wind around a donut-like structure in the phase space. Unlike limit cycles, these trajectories never really repeat but they remain bound to a specific region.
  • Strange (Chaotic) Attractors: characterized by aperiodic behavior that never repeats exactly yet remains bounded within a finite region of phase space. These attractors exhibit sensitive dependence on initial conditions, where a tiny difference will introduce significant consequences over time — a hallmark of chaos. Think butterfly effect.

Setup

In the following section, we will dive deeper into an example of a very simple NN architecture capable of such behavior and demonstrate some pretty examples. We will touch on Lyapunov exponents and provide an implementation for those who wish to experiment with generating their own Neural Network attractor art (and not in the generative AI sense).

Figure 1. NN schematic and components that we will use for the attractor generation. [all figures are created by the author, unless stated otherwise]

We will use a grossly simplified one-layer NN with a feedback loop. The architecture consists of:

  1. Input Layer:
    • Array of size D (here 16-32) inputs
    • We will unconventionally label them as y₁, y₂, y₃, …, yD to highlight that these are mapped from the outputs
    • Acts as a shift register that stores previous outputs
  2. Hidden Layer:
    • Contains N neurons (here fewer than D, ~4-8)
    • We will label them x₁, x₂, …, xN
    • tanh() activation is applied for squashing
  3. Output Layer
    • Single output neuron (y₀)
    • Combines the hidden layer outputs with biases — typically, we use biases to offset outputs by adding them; here, we use them for scaling, so they effectively act as an array of weights
  4. Connections:
    • Input to Hidden: Weight matrix w[i,j] (randomly initialized between -1 and 1)
    • Hidden to Output: Bias weights b[i] (randomly initialized between 0 and s)
  5. Feedback Loop:
    • The output y₀ is fed back to the input layer, creating a dynamic map
    • Acts as a shift register (y₁ = previous y₀, y₂ = previous y₁, etc.)
    • This feedback is what creates the dynamical system behavior
  6. Key Formulas:
    • Hidden layer: u[i] = Σ(w[i,j] * y[j]); x[i] = tanh(u[i])
    • Output: y₀ = Σ(b[i] * x[i])

The critical aspects that make this network generate attractors:

  • The feedback loop turns a simple feedforward network into a dynamical system
  • The nonlinear activation function (tanh) enables complex behaviors
  • The random weight initialization (controlled by the random seed) creates different attractor patterns
  • The scaling factor s affects the dynamics of the system and can push it into chaotic regimes

In order to investigate how prone the system is to chaos, we will calculate the Lyapunov exponents for different sets of parameters. The Lyapunov exponent is a measure of the instability of a dynamical system:

\[|\delta Z(t)| \approx e^{\lambda t} |\delta Z(0)|\]

\[\lambda = \frac{1}{n_t} \sum_{k=0}^{n_t-1} \ln \frac{|\Delta y_{k+1}|}{|\Delta y_k|}\]

…where n_t is the number of time steps, Δy_k is the distance between the states y(x_i) and y(x_i + ε) at a given time step; δZ(0) represents an initial infinitesimal (very small) separation between two nearby starting points, and δZ(t) is the separation after time t. For stable systems converging to a fixed point or a stable attractor, this exponent is less than 0; for unstable (diverging, and therefore chaotic) systems, it is greater than 0.
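To make the formula concrete, the exponent can be estimated numerically by following two trajectories that start a tiny distance apart and averaging the logarithm of how quickly they separate, renormalising the perturbation at each step. The sketch below is not the code used for the attractors in this article (that implementation is linked further down); it uses a generic one-dimensional map, and the renormalisation scheme, step counts, and guard values are choices made here purely for illustration:

import numpy as np

def estimate_lyapunov(step_fn, y0, eps=1e-8, n_steps=10000, discard=1000):
    """Estimate the largest Lyapunov exponent of a 1-D map y_{t+1} = step_fn(y_t)."""
    y_a, y_b = y0, y0 + eps
    # Let the initial transient die out before measuring
    for _ in range(discard):
        y_a, y_b = step_fn(y_a), step_fn(y_b)
        d = max(abs(y_b - y_a), 1e-300)
        y_b = y_a + (y_b - y_a) / d * eps   # keep the separation at eps
    log_growth = 0.0
    for _ in range(n_steps):
        y_a, y_b = step_fn(y_a), step_fn(y_b)
        d = max(abs(y_b - y_a), 1e-300)     # distance after one step
        log_growth += np.log(d / eps)       # local expansion (or contraction) rate
        y_b = y_a + (y_b - y_a) / d * eps   # renormalise the separation back to eps
    return log_growth / n_steps

# Sanity check with the logistic map in its chaotic regime: expect roughly ln 2 ≈ 0.69
lam = estimate_lyapunov(lambda y: 4.0 * y * (1.0 - y), y0=0.3)
print(lam)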

Let’s code it up! We will only use NumPy and default Python libraries for the implementation.

import numpy as np
from typing import Tuple, List, Optional


class NeuralAttractor:
    """
    
    N : int
        Number of neurons in the hidden layer
    D : int
        Dimension of the input vector
    s : float
        Scaling factor for the output

    """
    
    def __init__(self, N: int = 4, D: int = 16, s: float = 0.75, seed: Optional[int] = None):
        self.N = N
        self.D = D
        self.s = s
        
        if seed is not None:
            np.random.seed(seed)
        
        # Initialize weights and biases
        self.w = 2.0 * np.random.random((N, D)) - 1.0  # Uniform in [-1, 1]
        self.b = s * np.random.random(N)  # Uniform in [0, s]
        
        # Initialize state vector structures
        self.x = np.zeros(N)  # Neuron states
        self.y = np.zeros(D)  # Input vector

We initialize the NeuralAttractor class with some basic parameters — number of neurons in the hidden layer, number of elements in the input array, scaling factor for the output, and random seed. We proceed to initialize the weights and biases randomly, and x and y states. These weights and biases will not be optimized — they will stay put, no gradient descent this time.

    def reset(self, init_value: float = 0.001):
        """Reset the network state to initial conditions."""
        self.x = np.ones(self.N) * init_value
        self.y = np.zeros(self.D)
        
    def iterate(self) -> np.ndarray:
        """
        Perform one iteration of the network and return the neuron outputs.
        
        """
        # Calculate the output y0
        y0 = np.sum(self.b * self.x)
        
        # Shift the input vector
        self.y[1:] = self.y[:-1]
        self.y[0] = y0
        
        # Calculate the neuron inputs and apply activation fn
        for i in range(self.N):
            u = np.sum(self.w[i] * self.y)
            self.x[i] = np.tanh(u)
            
        return self.x.copy()

Next, we will define the iteration logic. We start every iteration with the feedback loop — we implement the shift register circuit by shifting all y elements to the right, and compute the most recent y0 output to place it into the first element of the input.

    def generate_trajectory(self, tmax: int, discard: int = 0) -> Tuple[np.ndarray, np.ndarray]:
        """
        Generate a trajectory of the states for tmax iterations.
        
        Parameters:
        -----------
        tmax : int
            Total number of iterations
        discard : int
            Number of initial iterations to discard

        """
        self.reset()
        
        # Discard initial transient
        for _ in range(discard):
            self.iterate()
        
        x1_traj = np.zeros(tmax)
        x2_traj = np.zeros(tmax)
        
        for t in range(tmax):
            x = self.iterate()
            x1_traj[t] = x[0]
            x2_traj[t] = x[1]
            
        return x1_traj, x2_traj

Now, we define the function that will iterate our network map over the tmax number of time steps and output the states of the first two hidden neurons for visualization. We can use any hidden neurons, and we could even visualize 3D state space, but we will limit our imagination to two dimensions.

This is the gist of the system. Now, we will just define some line and segment magic for pretty visualizations.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.collections as mcoll
import matplotlib.path as mpath
from typing import Tuple, Optional, Callable


def make_segments(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """
    Create a list of line segments from x and y coordinates.
    
    Parameters:
    -----------
    x : np.ndarray
        X coordinates
    y : np.ndarray
        Y coordinates

    """
    points = np.array([x, y]).T.reshape(-1, 1, 2)
    segments = np.concatenate([points[:-1], points[1:]], axis=1)
    return segments


def colorline(
    x: np.ndarray,
    y: np.ndarray,
    z: Optional[np.ndarray] = None,
    cmap = plt.get_cmap("jet"),
    norm = plt.Normalize(0.0, 1.0),
    linewidth: float = 1.0,
    alpha: float = 0.05,
    ax = None
):
    """
    Plot a colored line with coordinates x and y.
    
    Parameters:
    -----------
    x : np.ndarray
        X coordinates
    y : np.ndarray
        Y coordinates

    """
    if ax is None:
        ax = plt.gca()
        
    if z is None:
        z = np.linspace(0.0, 1.0, len(x))
    
    segments = make_segments(x, y)
    lc = mcoll.LineCollection(
        segments, array=z, cmap=cmap, norm=norm, linewidth=linewidth, alpha=alpha
    )
    ax.add_collection(lc)
    
    return lc


def plot_attractor_trajectory(
    x: np.ndarray,
    y: np.ndarray,
    skip_value: int = 16,
    color_function: Optional[Callable] = None,
    cmap = plt.get_cmap("Spectral"),
    linewidth: float = 0.1,
    alpha: float = 0.1,
    figsize: Tuple[float, float] = (10, 10),
    interpolate_steps: int = 3,
    output_path: Optional[str] = None,
    dpi: int = 300,
    show: bool = True
):
    """
    Plot an attractor trajectory.
    
    Parameters:
    -----------
    x : np.ndarray
        X coordinates
    y : np.ndarray
        Y coordinates
    skip_value : int
        Number of points to skip for sparser plotting

    """
    fig, ax = plt.subplots(figsize=figsize)
    
    if interpolate_steps > 1:
        path = mpath.Path(np.column_stack([x, y]))
        verts = path.interpolated(steps=interpolate_steps).vertices
        x, y = verts[:, 0], verts[:, 1]
    
    x_plot = x[::skip_value]
    y_plot = y[::skip_value]
    
    if color_function is None:
        z = abs(np.sin(1.6 * y_plot + 0.4 * x_plot))
    else:
        z = color_function(x_plot, y_plot)
    
    colorline(x_plot, y_plot, z, cmap=cmap, linewidth=linewidth, alpha=alpha, ax=ax)
    
    ax.set_xlim(x.min(), x.max())
    ax.set_ylim(y.min(), y.max())
    
    ax.set_axis_off()
    ax.set_aspect('equal')
    
    plt.tight_layout()
    
    if output_path:
        fig.savefig(output_path, dpi=dpi, bbox_inches='tight')

    return fig

The functions written above take the generated state space trajectories and visualize them. Because the state space may be densely filled, we skip every 8th, 16th, or 32nd time point to sparsify our vectors. We also don’t want to plot these in one solid color, so we encode the color as a periodic function (np.sin(1.6 * y_plot + 0.4 * x_plot)) of the x and y coordinates of the figure axis. The multipliers for the coordinates are arbitrary; they happen to generate nice, smooth color maps, so tweak them to your liking.

N = 4
D = 32
s = 0.22
seed=174658140

tmax = 100000
discard = 1000

nn = NeuralAttractor(N, D, s, seed=seed)

# Generate trajectory
x1, x2 = nn.generate_trajectory(tmax, discard)

plot_attractor_trajectory(
    x1, x2,
    output_path='trajectory.png',
)

After defining the NN and iteration parameters, we can generate the state space trajectories. If we spend enough time poking around with the parameters, we will find something cool (I promise!). If manual parameter grid search labor is not exactly our thing, we could add a function that checks what proportion of the state space is covered over time (a minimal sketch of such a check appears after the next figure). If after t = 100,000 iterations (excluding the initial 1,000 “warm-up” time steps) we have only touched a narrow range of values of the state space, we are likely stuck at a fixed point. Once we find an attractor that is not too shy to take up more of the state space, we can plot it using the default plotting parameters:

Figure 2. Limit cycle attractor.

One of the stable types of attractors is the limit cycle attractor (parameters: N = 4, D = 32, s = 0.22, seed = 174658140). It looks like a single, closed-loop trajectory in phase space, and the orbit follows a regular, periodic path over time. I will not include the exact code used for the Lyapunov exponent calculations here, to keep the focus on the visual aspect of the generated attractors, but you can find it under this link if interested. The Lyapunov exponent for this attractor (λ=−3.65) is negative, indicating stability: mathematically, a negative exponent means the state of the system decays, or converges, to this basin of attraction over time.
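Returning to the coverage check mentioned before the figure, one simple sketch is to histogram the trajectory onto a coarse two-dimensional grid and measure the fraction of occupied cells; a trajectory stuck at a fixed point occupies almost none of them. The grid size and threshold below are arbitrary choices for illustration, not values from the original project:

def state_space_coverage(x1_traj, x2_traj, bins=64):
    """Fraction of cells in a bins x bins grid that the trajectory visits."""
    hist, _, _ = np.histogram2d(x1_traj, x2_traj, bins=bins)
    return np.count_nonzero(hist) / hist.size

coverage = state_space_coverage(x1, x2)
if coverage < 0.01:  # arbitrary threshold: almost no cells visited
    print("Likely stuck at a fixed point - try another seed or scaling factor s.")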

If we keep increasing the scaling factor, we amplify the values circulating in the circuit and become more likely to find something interesting.

Figure 3. Toroidal attractor.

Here is the toroidal (quasiperiodic) attractor (parameters: N = 4, D = 32, s = 0.55, seed = 3160697950). It still has an ordered structure of sheets that wrap around in organized, quasiperiodic patterns. The Lyapunov exponent for this attractor has a higher value, but is still negative (λ=−0.20).

As we further increase the scaling factor s, the system becomes more prone to chaos.

Figure 4. Strange attractor.

As we further increase the scaling factor s, the system becomes more prone to chaos. The strange (chaotic) attractor emerges with the following parameters: N = 4, D = 16, s = 1.4, seed = 174658140. It is characterized by an erratic, unpredictable pattern of trajectories that never repeat. The Lyapunov exponent for this attractor is positive (λ=0.32), indicating instability (divergence from an initially very close state over time) and chaotic behavior. This is the “butterfly effect” attractor.

Just another confirmation that aesthetics can be very mathematical, and vice versa. The most visually compelling attractors often exist at the edge of chaos — think about it for a second! These structures are complex enough to exhibit intricate behavior, yet ordered enough to maintain coherence. This resonates with observations from various art forms, where balance between order and unpredictability often creates the most engaging experiences.

An interactive widget to generate and visualize these attractors is available here. The source code is available, too, and invites further exploration. The ideas behind this project were largely inspired by the work of J.C. Sprott [3]. 

References

[1] B. Poucet and E. Save, Attractors in Memory (2005), Science DOI:10.1126/science.1112555.

[2] Y.J.F. Kpomahou et al., Chaotic Behaviors and Coexisting Attractors in a New Nonlinear Dissipative Parametric Chemical Oscillator (2022), Complexity DOI:10.1155/2022/9350516.

[3] J.C. Sprott, Artificial Neural Net Attractors (1998), Computers & Graphics DOI:10.1016/S0097-8493(97)00089-7.

The post Attractors in Neural Network Circuits: Beauty and Chaos appeared first on Towards Data Science.

]]>
What Do Machine Learning Engineers Do? https://towardsdatascience.com/what-do-machine-learning-engineers-do/ Tue, 25 Mar 2025 07:45:20 +0000 https://towardsdatascience.com/?p=605223 Breaking down my role as a machine learning engineer

The post What Do Machine Learning Engineers Do? appeared first on Towards Data Science.

]]>
In this article, I want to explain precisely what I do as a machine learning engineer. 

The aim is to help anyone looking to enter the field gain a truthful view of what a machine learning engineer is, how we work, what we do, and what a typical day in life is like. 

I hope it can help you pinpoint if a career in machine learning is indeed for you.

What is a machine learning engineer?

Due to the rapid acceleration of the tech/AI space, a machine learning engineer is still not well-defined and varies between companies and geographies to a certain extent.

However, it generally refers to someone who:

Mixes machine learning, statistics and software engineering skills to train and deploy models into production.

At some companies, there will be a large cross-over with data scientists. Still, the main distinction between the two roles is that machine learning engineers deliver the solution into production. Often, data scientists won’t do this and focus more on helping in the model-building stage.

The need for a machine learning engineer came from the fact that models in Jupyter Notebooks have zero value. So, a role well-versed in machine learning and software engineering was needed to help bring the models “to life” and ensure they generate business value.

Because of this broad skillset, machine learning engineering is not an entry-level role, and you would typically need to be a data scientist or software engineer for a couple of years first.

So, to summarise:

  • Responsibilities: Train, build and deploy machine learning models.
  • Skills & Tech: Python, SQL, AWS, Bash/Zsh, PyTorch, Docker, Kubernetes, MLOps, Git, distributed computing (not an exhaustive list).
  • Experience: A couple of years as a data scientist or software engineer, and then up-skill yourself in the other areas.

If you want a better understanding of the different data and machine learning roles, I recommend checking out some of my previous articles.

The Difference Between ML Engineers and Data Scientists
Helping you decide whether you want to be a data scientist or machine learning engineer (medium.com)

Should You Become A Data Scientist, Data Analyst Or Data Engineer?
Explaining the differences and requirements between the various data roles (medium.com)

What do I do?

I work as a machine learning engineer within a cross-functional team. My squad specialises in classical machine learning and combinatorial optimisation-based problems.

Much of my work revolves around improving our machine learning models and optimisation solutions to improve the customer experience and generate financial value for the business.

The general workflow for most of my projects is as follows:

  • Idea — Someone may have an idea or hypothesis about how to improve one of our models.
  • Data — We check if the data to prove or disprove this hypothesis is readily available so we can start the research.
  • Research — If the data is available, we start building or testing this new hypothesis in the model.
  • Analysis — The results of the research stage are analysed to determine if we have improved the model.
  • Ship — The improvement is “productionised” in the codebase and goes live.

Along this process, there is a lot of interaction with other functions and roles within the team and broader company.

  • The idea phase is a collaborative discussion with a product manager who can provide business insight and any critical impacts we may have missed in the initial scoping.
  • Data, Build, and Analysis can be done in collaboration with data analysts and engineers to ensure the quality of our ETL pipelines and the use of the right data sources.
  • The research section would use the help of data scientists to use statistics and machine learning skills when looking to improve our model.
  • The ship phase is a joint effort with our dedicated software engineers, ensuring our deployment is robust and up to standard with best coding practices.

From experience, I know that this type of workflow is prevalent among machine learning engineers in numerous companies, although I am sure there are slight variations depending on where you are.

My job is also not just to write code day in and day out. I have other responsibilities, like conducting workshops, presenting to stakeholders, and mentoring more junior members.

What is the structure of machine learning teams?

Machine learning engineers work in many different ways across an organisation, but there are three distinct options, and the rest are a mix of them.

  • Embedded— In this case, machine learning engineers are embedded in cross-functional teams with analysts, product managers, software engineers and data scientists, where the team solves problems in one domain within the company. This is how I work, and I really like it because you get to pick up lots of valuable skills and abilities from other team members who are specialists in their own right.
  • Consultancy— This is the flip side, where machine learning engineers are part of an “in-house consultancy” and are part of their own team. In this scenario, the machine learning engineers work on problems based on their perceived value to the business. You are technically less specialised in this option as you may need to change the type of problems you work on.
  • Infrastructure/Platform — Instead of solving business problems directly, these machine learning engineers develop in-house tools and a deployment platform to make productionising the algorithms much easier.

All ways of working have pros and cons, and in reality, I wouldn’t say one is better than the other; it’s really a matter of personal preference. You still do exciting work, nonetheless!

What is a typical day in a life?

People online often glamourise working in tech, like it’s all coffee breaks, chats, and coding for an hour a day, and you make well over six figures.

This is definitely not the case, and I wish it was true, but it’s still a fun and enjoyable workday compared to many other professions.

My general experience has been:

  • 9:00 am — 9:30 am. Start at 9 am with a morning standup to catch up with the team regarding the previous day’s work and what you are doing today. A “standup” meeting is very common across tech.
  • 9:30 am — 10:30 am. After the standup, there may be another hour-long meeting with stakeholders, engineers, an all-hands, or other company meetings.
  • 10:30 am — 1:00 pm. Then, it’s a work/code block of a couple of hours where I focus on my projects. Depending on my work, I may pair with another data scientist, machine learning engineer or software engineer.
  • 1:00 pm — 2:00 pm. Lunch.
  • 2:00 pm — 5:45 pm. Afternoons are normally free of meetings, and there is a large block of focus time to work on your projects. This is mainly for individual contributors like myself.
  • 5:45 pm — 6:00 pm. Reply to emails and Slack messages and wrap up for the day.

Every day is different, but this is what you can expect. As you can tell, it’s nothing “extraordinary.”

This is also the workday for a junior / mid-level individual contributor (IC) like myself. Senior positions, especially managerial roles, typically have more meetings.

An important thing to note is that I don’t always code in my work blocks. I may have a presentation to prepare for stakeholders, some ad-hoc analysis for our product manager, or some writing up of my latest research. I may not even code for the whole day!

On average, I spend 3–4 hours hard coding; the rest is meetings or ad-hoc work. Of course, this varies between companies and at different times of the year.

Why am I a machine learning engineer?

The reason I am a machine learning engineer can be boiled down to four main reasons:

  • Interesting. As a machine learning engineer, I get to be right at the forefront of the latest tech trends like AI, LLMs, and pretty much anything that is going viral in the field. There is always something new and exciting to learn, which I love! So, if you want to constantly learn new skills and apply them, this may be a career you would be interested in.
  • Work-Life Balance. Tech jobs generally provide better work-life balance than other professions like banking, law or consulting. Most machine learning jobs are 9–6, and you can often spend a few days working from home. This flexibility allows me to pursue other passions, projects, and hobbies outside of work, such as this blog!
  • Compensation. It’s no secret that tech jobs provide some of the highest salaries. According to levels.fyi, the median salary of a machine learning engineer in the UK is £93k, which is a remarkable figure for the median.
  • Range of Industries. As a machine learning engineer, you can work in loads of different industries during your career. However, to become a real specialist, you must find and stick to one industry you love.

I hope this article gave you more insight into machine learning engineering. If you have any questions, let me know in the comments.

Another thing!

Join my free newsletter, Dishing the Data, where I share weekly tips, insights, and advice from my experience as a practicing data scientist. Plus, as a subscriber, you’ll get my FREE Data Science Resume Template!

Dishing The Data | Egor Howell | Substack: newsletter.egorhowell.com


The post What Do Machine Learning Engineers Do? appeared first on Towards Data Science.

]]>
Build Your Own AI Coding Assistant in JupyterLab with Ollama and Hugging Face https://towardsdatascience.com/build-your-own-ai-coding-assistant-in-jupyterlab-with-ollama-and-hugging-face/ Mon, 24 Mar 2025 19:04:48 +0000 https://towardsdatascience.com/?p=605214 A step-by-step guide to creating a local coding assistant without sending your data to the cloud

The post Build Your Own AI Coding Assistant in JupyterLab with Ollama and Hugging Face appeared first on Towards Data Science.

]]>
Jupyter AI brings generative AI capabilities right into the Jupyter interface. Having a local AI assistant ensures privacy, reduces latency, and provides offline functionality, making it a powerful tool for developers. In this article, we’ll learn how to set up a local AI coding assistant in JupyterLab using Jupyter AI, Ollama and Hugging Face. By the end of this article, you’ll have a fully functional coding assistant in JupyterLab capable of autocompleting code, fixing errors, creating new notebooks from scratch, and much more, as shown in the screenshot below.

Coding assistant in Jupyter Lab via Jupyter AI | Image by Author

⚠ Jupyter AI is still under heavy development, so some features may break. As of writing this article, I’ve tested the setup to confirm it works, but expect potential changes as the project evolves. Also, the performance of the assistant depends on the model you select, so make sure you choose one that fits your use case.

What is Jupyter AI

First things first — what is Jupyter AI? As the name suggests, Jupyter AI is a JupyterLab extension for generative AI. This powerful tool transforms your standard Jupyter notebooks or JupyterLab environment into a generative AI playground. The best part? It also works seamlessly in environments like Google Colaboratory and Visual Studio Code. This extension does all the heavy lifting, providing access to a variety of model providers (both open and closed source) right within your Jupyter environment. 

Installation and Setup

Flow diagram of the installation process | Image by Author

Setting up the environment involves three main components, plus one optional extra:

  • JupyterLab
  • The Jupyter AI extension
  • Ollama (for Local Model Serving)
  • [Optional] Hugging Face (for GGUF models)

Honestly, getting the assistant to resolve coding errors is the easy part. What is tricky is ensuring all the installations have been done correctly. It’s therefore essential that you follow the steps carefully.

1. Installing the Jupyter AI Extension

It’s recommended to create a new environment specifically for Jupyter AI to keep your existing environment clean and organised. Once that’s done, follow the steps below. Jupyter AI requires JupyterLab 4.x or Jupyter Notebook 7+, so make sure you have the latest version of JupyterLab installed. You can install/upgrade JupyterLab with pip or conda:

# Install JupyterLab 4 using pip
pip install jupyterlab~=4.0

Next, install the Jupyter AI extension as follows.

pip install "jupyter-ai[all]"

This is the easiest method for installation, as it includes all provider dependencies (so it supports Hugging Face, Ollama, etc., out of the box). To date, Jupyter AI supports the following model providers:

Supported Model providers in Jupyter AI along with the dependencies | Created by Author from the documentation

If you encounter errors during the Jupyter AI installation, manually install Jupyter AI using pip without the [all] optional dependency group. This way you can control which models are available in your Jupyter AI environment. For example, to install Jupyter AI with only added support for Ollama models, use the following:

pip install jupyter-ai langchain-ollama

The required dependencies vary by model provider (see the table above). Next, restart your JupyterLab instance. If you see a chat icon in the left sidebar, everything has been installed correctly. With Jupyter AI, you can chat with models or use inline magic commands directly within your notebooks.

Native chat UI in JupyterLab | Image by Author

2. Setting Up Ollama for Local Models

Now that Jupyter AI is installed, we need to configure it with a model. While Jupyter AI integrates with Hugging Face models directly, some models may not work properly. Instead, Ollama provides a more reliable way to load models locally.

Ollama is a handy tool for running Large Language Models locally. It lets you download pre-configured AI models from its library. Ollama supports all major platforms (macOS, Windows, Linux), so download and install the version for your OS from the official website. After installation, verify that it is set up correctly by running:

ollama --version
------------------------------
ollama version is 0.6.2

Also, ensure that your Ollama server is running, which you can check by calling ollama serve in the terminal:

$ ollama serve
Error: listen tcp 127.0.0.1:11434: bind: address already in use

If the server is already active, you will see an error like the one above, confirming that Ollama is running and in use.
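If you prefer a programmatic check, the short sketch below (an illustrative example, assuming the default Ollama endpoint at localhost:11434 and that the requests library is installed) queries the local REST API and lists any models already stored on your machine:

import requests

# The Ollama server exposes a REST API on localhost:11434 by default.
# The /api/tags endpoint returns the models currently stored locally.
response = requests.get("http://localhost:11434/api/tags")
response.raise_for_status()

models = response.json().get("models", [])
print(f"Ollama is up, with {len(models)} local model(s):")
for model in models:
    print(" -", model["name"])

If the request fails with a connection error, the server isn’t running yet.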


Using models via Ollama

Option 1: Using Pre-Configured Models

Ollama provides a library of pre-trained models that you can download and run locally. To start using a model, download it using the pull command. For example, to use qwen2.5-coder:1.5b, run:

ollama pull qwen2.5-coder:1.5b

This will download the model in your local environment. To confirm if the model has been downloaded, run:

ollama list

This will list all the models you’ve downloaded and stored locally on your system using Ollama.
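Before wiring the model into Jupyter AI, you can optionally confirm that it responds. The snippet below is a minimal sketch, assuming the qwen2.5-coder:1.5b model pulled above and the default local Ollama endpoint; it sends a single non-streaming prompt to Ollama’s generate API:

import requests

# One-off smoke test: ask the locally pulled model a trivial question.
payload = {
    "model": "qwen2.5-coder:1.5b",  # must match a name shown by `ollama list`
    "prompt": "Write a one-line Python expression that reverses a string.",
    "stream": False,                # return the full answer as a single JSON object
}
response = requests.post("http://localhost:11434/api/generate", json=payload)
response.raise_for_status()
print(response.json()["response"])

If this prints a sensible answer, the model is ready to be used from JupyterLab.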

Option 2: Loading a Custom Model

If the model you need isn’t available in Ollama’s library, you can load a custom model by creating a Modelfile that specifies the model’s source. For detailed instructions on this process, refer to the Ollama import documentation.

Option 3: Running GGUF Models directly from Hugging Face

Ollama now supports GGUF models directly from the Hugging Face Hub, including both public and private models. This means that if you want to use a GGUF model directly from the Hugging Face Hub, you can do so without creating a custom Modelfile, as described in Option 2 above.

For example, to load a 4-bit quantized Qwen2.5-Coder-1.5B-Instruct model from Hugging Face:

1. First, enable Ollama under your Local Apps settings.

How to enable Ollama under your Local Apps settings on Hugging Face | Image by Author

2. On the model page, choose Ollama from the Use this model dropdown as shown below.

Accessing GGUF model from HuggingFace Hub via Ollama | Image by Author

Configuring Jupyter AI with Ollama 

We are almost there. In JupyterLab, open the Jupyter AI chat interface in the sidebar. At the top of the chat panel or in its settings (gear icon), there is a dropdown or field to select the model provider and model ID. Choose Ollama as the provider, and enter the model name exactly as shown by ollama list in the terminal (e.g. qwen2.5-coder:1.5b). Jupyter AI will connect to the local Ollama server and load that model for queries. No API keys are needed since everything runs locally.

  • Set Language model, Embedding model and inline completions models based on the models of your choice.
  • Save the settings and return to the chat interface.
Configure Jupyter AI with Ollama | Image by Author

This configuration links Jupyter AI to the locally running model via Ollama. Inline completions should be enabled by this process; if they aren’t, you can enable them manually by clicking the Jupyternaut icon in the bottom bar of the JupyterLab interface, to the left of the Mode indicator (e.g., Mode: Command). This opens a dropdown menu where you can select Enable completions by Jupyternaut to activate the feature.

Enabling code completions in notebook | Image by Author

Using the AI Coding Assistant 

Once set up, you can use the AI coding assistant for various tasks like code autocompletion, debugging help, and generating new code from scratch. It’s important to note that you can interact with the assistant either through the chat sidebar or directly in notebook cells using %%ai magic commands. Let’s look at both ways.

Coding assistant via Chat interface

This is pretty straightforward. You can simply chat with the model to perform an action. For instance, by selecting code in the notebook, we can ask the model to explain the error in the code and then fix it.

Debugging Assistance Example using Jupyter AI via Chat | Image by Author
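To make this concrete, here is a hypothetical snippet (written for illustration, not taken from the screenshot) of the kind of code you might select in a cell before asking Jupyternaut to explain and fix it:

# Buggy on purpose: the loop index runs one past the end of the list,
# so the final iteration raises an IndexError.
def average(values):
    total = 0
    for i in range(len(values) + 1):
        total += values[i]
    return total / len(values)

print(average([4, 8, 15, 16, 23, 42]))

Highlighting the cell and asking the assistant why it fails should surface both the off-by-one explanation and a corrected loop.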

You can also ask the AI to generate code for a task from scratch, just by describing what you need in natural language. Here is a Python function that returns all prime numbers up to a given positive integer N, generated by Jupyternaut.

Generating New Code from Prompts using Jupyter AI via Chat | Image by Author
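For reference, the kind of function the assistant produces for this prompt looks roughly like the following (a hand-written sketch for illustration, not the exact output shown in the screenshot):

def primes_up_to(n):
    """Return all prime numbers up to and including a positive integer n."""
    if n < 2:
        return []
    # Sieve of Eratosthenes: mark multiples of each prime as composite.
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            for multiple in range(i * i, n + 1, i):
                is_prime[multiple] = False
    return [i for i in range(2, n + 1) if is_prime[i]]

print(primes_up_to(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]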

Coding assistant via notebook cell or IPython shell

You can also interact with models directly within a Jupyter notebook. First, load the IPython extension:

%load_ext jupyter_ai_magics

Now, you can use the %%ai cell magic to interact with your chosen language model using a specified prompt. Let’s replicate the above example, but this time within the notebook cells.

Generating New Code from Prompts using Jupyter AI in the notebook | Image by Author
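As a concrete illustration, a cell along these lines (assuming the Ollama provider ID and the qwen2.5-coder:1.5b model configured earlier; check the Jupyter AI documentation for the exact provider:model syntax) sends the prompt to the local model and renders its answer below the cell:

%%ai ollama:qwen2.5-coder:1.5b
Write a Python function that returns all prime numbers up to a given positive integer N.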

For more details and options you can refer to the official documentation.

Conclusion

As you can gauge from this article, Jupyter AI makes it easy to set up a coding assistant, provided you have the right installations and setup in place. I used a relatively small model, but you can choose from a variety of models supported by Ollama or Hugging Face. The key advantage is that running a model locally enhances privacy, reduces latency, and decreases dependence on proprietary model providers. However, running large models locally with Ollama can be resource-intensive, so ensure that you have sufficient RAM. With the rapid pace at which open-source models are improving, you can achieve comparable performance even with these alternatives.

The post Build Your Own AI Coding Assistant in JupyterLab with Ollama and Hugging Face appeared first on Towards Data Science.

]]>