The Invisible Revolution: How Vectors Are (Re)defining Business Success https://towardsdatascience.com/the-invisible-revolution-how-vectors-are-redefining-business-success/ Thu, 10 Apr 2025 20:52:15 +0000 https://towardsdatascience.com/?p=605712 The hidden force behind AI is powering the next wave of business transformation

In a world that increasingly runs on data, business leaders must understand vector thinking. At first, vectors may appear as complicated as algebra was in school, but they serve as a fundamental building block. Vectors are as essential to today's systems as algebra is to tasks like splitting a bill or computing interest. They underpin our digital systems for decision making, customer engagement, and data protection.

They represent a radically different concept of relationships and patterns. Vectors do not simply divide data into rigid categories. Instead, they offer a dynamic, multidimensional view of the underlying connections. "Similar", for two customers, may mean more than matching demographics or purchase histories; it's their behaviors, preferences, and habits that distinctly align. Such associations can be defined and measured accurately in a vector space. But for many modern businesses, that logic feels too complex, so leaders tend to fall back on old, learned, rule-based patterns instead. Not long ago, fraud detection, for example, still relied on simple rules about transaction limits. Since then, we have evolved to recognize patterns and anomalies.

While it might have been common just a few years ago to block any transaction that used 50% of your credit card limit at once, we are now able to analyze your retailer-specific spending history, look at the average baskets of other customers at the very same retailers, and run some basic logic checks, such as the physical location of your previous purchases.

So a $7,000 transaction at a McDonald's in Dubai might simply not go through if you just spent $3 on a bike rental in Amsterdam. Even $20 wouldn't work, since logical vector patterns can rule out the physical distance as plausible. Meanwhile, the $7,000 transaction for your new e-bike at a retailer near Amsterdam's city center may go through flawlessly. Welcome to the insight of living in a world managed by vectors.

The danger of ignoring the paradigm of vectors is huge. Not mastering algebra can lead to bad financial decisions; similarly, not knowing vectors can leave you vulnerable as a business leader. While the average customer may stay as unaware of vectors as the average passenger in a plane is of aerodynamics, a business leader should at least be aware of what kerosene is and how many seats need to be occupied for a specific flight to break even. You may not need to fully understand the systems you rely on, but a basic understanding helps you know when to reach out to the experts. And this is exactly my aim in this little journey into the world of vectors: become aware of the basic principles and know when to ask for more, so you can better steer and manage your business.

In the hushed hallways of research labs and tech companies, a revolution was brewing that would change how computers understood the world. This revolution had nothing to do with processing power or storage capacity. It was all about teaching machines to understand context, meaning, and nuance in words, using mathematical representations called vectors. Before we can appreciate the magnitude of this shift, we first need to understand what it differs from.

Think about the way humans take in information. When we look at a cat, we don't just process a checklist of components: whiskers, fur, four legs. Instead, our brains work through a network of relationships, contexts, and associations. We know a cat is more like a lion than a bicycle, and not from memorizing this fact; our brains have naturally learned these relationships. Vector representations let computers consume content in a similarly human-like way, and we ought to understand how and why this is true. It's as fundamental as knowing algebra in the time of an impending AI revolution.

In this brief jaunt into the vector realm, I will explain how vector-based computing works and why it's so transformative. The code examples are just for illustration and have no stand-alone functionality. You don't have to be an engineer to understand these concepts. All you have to do is follow along as I walk you through examples with plain-language commentary explaining each one, step by step. I don't aim to be a world-class mathematician. I want to make vectors understandable to everyone: business leaders, managers, engineers, musicians, and others.


What are vectors, anyway?

Photo by Pete F on Unsplash

It is not that the vector-based computing journey started recently. Its roots go back to the 1980s, with the development of distributed representations in cognitive science. James McClelland and David Rumelhart, among other researchers, theorized that the brain holds concepts not as individual entities but as the combined activity patterns of neural networks. This insight paved the way for contemporary vector representations.

The real breakthrough was three things coming together:
The exponential growth in computational power, the development of sophisticated neural network architectures, and the availability of massive datasets for training.

It is the combination of these elements that makes vector-based systems theoretically possible and practically implementable at scale. Mainstream AI as people have come to know it (with the likes of ChatGPT and others) is the direct consequence of this.

To better understand, let me put this in context: conventional computing systems work on discrete, human-readable symbols and rules. A traditional system, for instance, might represent a customer as a record:

customer = {
    'id': '12345',
    'age': 34,
    'purchase_history': ['electronics', 'books'],
    'risk_level': 'low'
}

This representation may be readable and logical, but it misses subtle patterns and relationships. In contrast, vector representations encode information in a high-dimensional space where relationships arise naturally through geometric proximity. That same customer might be represented as a 384-dimensional vector, where each of these dimensions contributes to a rich, nuanced profile. Simple code allows flat, tabular customer data to be transformed into vectors. Let's take a look at how simple this is:

from sentence_transformers import SentenceTransformer
import numpy as np

class CustomerVectorization:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        
    def create_customer_vector(self, customer_data):
        """
        Transform customer data into a rich vector representation
        that captures subtle patterns and relationships
        """
        # Combine various customer attributes into a meaningful text representation
        customer_text = f"""
        Customer profile: {customer_data['age']} year old,
        interested in {', '.join(customer_data['purchase_history'])},
        risk level: {customer_data['risk_level']}
        """
        
        # Generate base vector from text description
        base_vector = self.model.encode(customer_text)
        
        # Enrich vector with numerical features
        numerical_features = np.array([
            customer_data['age'] / 100,  # Normalized age
            len(customer_data['purchase_history']) / 10,  # Purchase history length
            self._risk_level_to_numeric(customer_data['risk_level'])
        ])
        
        # Combine text-based and numerical features
        combined_vector = np.concatenate([
            base_vector,
            numerical_features
        ])
        
        return combined_vector
    
    def _risk_level_to_numeric(self, risk_level):
        """Convert categorical risk level to normalized numeric value"""
        risk_mapping = {'low': 0.1, 'medium': 0.5, 'high': 0.9}
        return risk_mapping.get(risk_level.lower(), 0.5)

I trust that this code example has helped demonstrate how easily complex customer data can be encoded into meaningful vectors. The method seems complex at first, but it is simple: we merge text and numerical data about customers, which gives us rich, information-dense vectors that capture each customer's essence. What I love most about this technique is its simplicity and flexibility. Just as we encoded age, purchase history, and risk level here, you could replicate this pattern to capture any other customer attributes relevant to your use case. Just recall the credit card spending patterns we described earlier: it's similar data being turned into vectors, giving it a meaning far greater than it could ever have had it stayed flat and been fed into traditional rule-based logic.
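To make this concrete, here is a minimal usage sketch that feeds the customer record from the earlier example through the class above. The printed shape assumes the 384-dimensional all-MiniLM-L6-v2 model downloads successfully; it is an illustration, not production code:

# A hypothetical call using the customer record from the example above.
vectorizer = CustomerVectorization()

customer = {
    'id': '12345',
    'age': 34,
    'purchase_history': ['electronics', 'books'],
    'risk_level': 'low'
}

customer_vector = vectorizer.create_customer_vector(customer)

# 384 text-embedding dimensions + 3 numerical features = 387 dimensions in total
print(customer_vector.shape)   # (387,)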

What our little code example gives us is two very suggestive representations combined, one in a semantically rich embedding space and one in a normalized value space, mapping every record to a point in a space with direct comparison properties.

This allows systems to identify complex patterns and relations that traditional data structures cannot reflect adequately. Thanks to the geometric nature of vector spaces, the shape of these structures tells the story of similarities, differences, and relationships, allowing for an inherently standardized yet flexible representation of complex data.

Going forward, you will see this structure repeated across other applications of vector-based customer analysis: take relevant data, aggregate it into a format we can work with, and build a meta-representation that combines heterogeneous data into a common vector understanding. Whether it's recommendation systems, customer segmentation models, or predictive analytics tools, this fundamental approach to thoughtful vectorization underpins all of it. So it is significant to know and understand, even if you consider yourself non-technical and more on the business side.

Just keep in mind: the key is considering which parts of your data carry meaningful signals and how to encode them in a way that preserves their relationships. It is nothing more than following your business logic in a different way of thinking than algebra. A more modern, multi-dimensional way.


The Mathematics of Meaning (Kings and Queens)

Photo by Debbie Fan on Unsplash

All human communication carries rich networks of meaning that our brains are wired to make sense of automatically. These are meanings we can capture mathematically using vector-based computing: we can represent words as points in a multi-dimensional word space. This geometric treatment allows us to think in spatial terms about the abstract semantic relations we are interested in, as distances and directions.

For instance, the relationship “King is to Queen as Man is to Woman” is encoded in a vector space in such a way that the direction and distance between the words “King” and “Queen” are similar to those between the words “Man” and “Woman.”

Let’s take a step back to understand why this might be: the key component that makes this system work is word embeddings — numerical representations that encode words as vectors in a dense vector space. These embeddings are derived from examining co-occurrences of words across large snippets of text. Just as we learn that “dog” and “puppy” are related concepts by observing that they occur in similar contexts, embedding algorithms learn to embed these words close to each other in a vector space.

Word embeddings reveal their real power when we look at how they encode analogical relationships. Think about what we know about the relationship between “king” and “queen.” We can tell through intuition that these words are different in gender but share associations related to the palace, authority, and leadership. Through a wonderful property of vector space systems — vector arithmetic — this relationship can be captured mathematically.

One does this beautifully in the classic example:

vector('king') - vector('man') + vector('woman') ≈ vector('queen')

This equation tells us that if we have the vector for “king,” and we subtract out the “man” vector (we remove the concept of “male”), and then we add the “woman” vector (we add the concept of “female”), we get a new point in space very close to that of “queen.” That’s not some mathematical coincidence — it’s based on how the embedding space has arranged the meaning in a sort of structured way.

We can apply this idea of context in Python with pre-trained word embeddings:

import gensim.downloader as api

# Load a pre-trained model that contains word vectors learned from Google News
model = api.load('word2vec-google-news-300')

# Define our analogy words
source_pair = ('king', 'man')
target_word = 'woman'

# Find which word completes the analogy using vector arithmetic
result = model.most_similar(
    positive=[target_word, source_pair[0]], 
    negative=[source_pair[1]], 
    topn=1
)

# Display the result
print(f"{source_pair[0]} is to {source_pair[1]} as {target_word} is to {result[0][0]}")

The structure of this vector space exposes many basic principles:

  1. Semantic similarity is present as spatial proximity. Related words congregate: the neighborhoods of ideas. “Dog,” “puppy,” and “canine” would be one such cluster; meanwhile, “cat,” “kitten,” and “feline” would create another cluster nearby.
  2. Relationships between words become directions in the space. The vector from “man” to “woman” encodes a gender relationship, and other such relationships (for example, “king” to “queen” or “actor” to “actress”) typically point in the same direction.
  3. The magnitude of vectors can carry meaning about word importance or specificity. Common words often have shorter vectors than specialized terms, reflecting their broader, less specific meanings.

Working with relationships between words in this way gave us a geometric encoding of meaning and the mathematical precision needed to reflect the nuances of natural language processing to machines. Instead of treating words as separate symbols, vector-like systems can recognize patterns, make analogies, and even uncover relationships that were never programmed.
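To see the first principle in numbers, here is a small follow-up sketch reusing the pre-trained word2vec model loaded in the previous snippet. The exact scores depend on the model, so treat the numbers as illustrative of the ordering (related pairs score higher than unrelated ones):

# Reusing the word2vec model loaded in the previous snippet.
# Related words sit close together in the space (higher similarity) ...
print(model.similarity('dog', 'puppy'))    # relatively high
print(model.similarity('cat', 'kitten'))   # relatively high

# ... while unrelated words sit far apart (lower similarity).
print(model.similarity('dog', 'bicycle'))  # noticeably lower

# Nearest neighbors show the "neighborhoods of ideas" directly.
print(model.most_similar('puppy', topn=3))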

To better grasp what was just discussed, I took the liberty of mapping the words we mentioned before ("King, Man, Woman"; "Dog, Puppy, Canine"; "Cat, Kitten, Feline") to corresponding 2D vectors. These vectors numerically represent semantic meaning.

Visualization of the before-mentioned example terms as 2D word embeddings. Showing grouped categories for explanatory purposes. Data is fabricated and axes are simplified for educational purposes.
  • Human-related words have high positive values on both dimensions.
  • Dog-related words have negative x-values and positive y-values.
  • Cat-related words have positive x-values and negative y-values.

Be aware that those values are fabricated by me for better illustration. As shown in the 2D space where the vectors are plotted, you can observe groups based on the positions of the dots representing the vectors. The three dog-related words, for example, can be clustered as the "Dog" category, and so on.
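For the curious, this is roughly how such a toy visualization could be produced. The coordinates below are fabricated, exactly like the ones in the figure, and serve only to reproduce the grouping described above:

import matplotlib.pyplot as plt

# Fabricated 2D "embeddings", chosen only to reproduce the grouping described above.
toy_embeddings = {
    'king':   (0.8, 0.9),  'man':    (0.6, 0.7),  'woman':  (0.7, 0.8),   # human-related
    'dog':    (-0.7, 0.6), 'puppy':  (-0.8, 0.7), 'canine': (-0.6, 0.5),  # dog-related
    'cat':    (0.6, -0.7), 'kitten': (0.7, -0.8), 'feline': (0.5, -0.6),  # cat-related
}

for word, (x, y) in toy_embeddings.items():
    plt.scatter(x, y)
    plt.annotate(word, (x, y))

plt.xlabel('dimension 1')
plt.ylabel('dimension 2')
plt.title('Fabricated 2D word embeddings (illustrative only)')
plt.show()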

Grasping these basic principles gives us insight into both the capabilities and limitations of modern language AI, such as large language models (LLMs). Though these systems can do amazing analogical and relational gymnastics, they ultimately operate on geometric patterns based on the ways that words appear in proximity to one another in a body of text: an elaborate but, by definition, partial reflection of human linguistic comprehension. As such, an LLM, being based on vectors, can only generate as output what it has received as input. That doesn't mean it generates only what it has been trained on 1:1 (we all know about the fantastic hallucination capabilities of LLMs); it means that LLMs, unless specifically instructed, wouldn't come up with neologisms or new language to describe things. This basic understanding is still lacking among many business leaders who, unaware of the underlying principles of vectors, expect LLMs to be miracle machines.


A Tale of Distances, Angles, and Dinner Parties

Photo by OurWhisky Foundation on Unsplash

Now, let’s assume you’re throwing a dinner party and it’s all about Hollywood and the big movies, and you want to seat people based on what they like. You could just calculate “distance” between their preferences (genres, perhaps even hobbies?) and find out who should sit together. But deciding how you measure that distance can be the difference between compelling conversations and annoyed participants. Or awkward silences. And yes, that company party flashback is repeating itself. Sorry for that!

The same is true in the world of vectors. The distance metric defines how "similar" two vectors look, and therefore, ultimately, how well your system performs at predicting an outcome.

Euclidean Distance: Straightforward, but Limited

Euclidean distance measures the straight-line distance between two points in space, making it easy to understand:

  • Euclidean distance is fine as long as vectors are physical locations.
  • However, in high-dimensional spaces (like vectors representing user behavior or preferences), this metric often falls short. Differences in scale or magnitude can skew results, focusing on scale over actual similarity.

Example: Two vectors might represent how much your dinner guests consume across streaming genres:

vec1 = [5, 10, 5]
# Dinner guest A likes action, drama, and comedy as genres equally.

vec2 = [1, 2, 1] 
# Dinner guest B likes the same genres but consumes less streaming overall.

While their preferences align, Euclidean distance would make them seem vastly different because of the disparity in overall activity.

But in higher-dimensional spaces, such as user behavior or textual meaning, Euclidean distance becomes increasingly less informative. It overweights magnitude, which can obscure comparisons. Consider two moviegoers: one has seen 200 action movies, the other has seen 10, but they both like the same genres. Because of the sheer difference in activity level, the second viewer would appear much less similar to the first under Euclidean distance, even though all they ever watch is Bruce Willis movies.
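A quick sketch makes that distortion visible. The genre counts are made up for illustration, and the second vector is an exact scaled-down copy of the first, i.e., the same taste profile:

import numpy as np

# Fabricated viewing counts: [action, drama, comedy]
heavy_viewer = np.array([200, 20, 20])   # has watched a lot
light_viewer = np.array([10, 1, 1])      # same taste profile, far less viewing

# The straight-line distance is dominated by the difference in volume.
print(np.linalg.norm(heavy_viewer - light_viewer))   # ~192, despite identical taste

Cosine similarity, introduced next, is designed to ignore exactly this kind of scale difference.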

Cosine Similarity: Focused on Direction

The cosine similarity method takes a different approach. It focuses on the angle between vectors, not their magnitudes. It’s like comparing the path of two arrows. If they point the same way, they are aligned, no matter their lengths. This shows that it’s perfect for high-dimensional data, where we care about relationships, not scale.

  • If two vectors point in the same direction, they’re considered similar (cosine similarity approx of 1).
  • When opposing (so pointing in opposite directions), they differ (cosine similarity ≈ -1).
  • If they’re perpendicular (at a right angle of 90° to one another), they are unrelated (cosine similarity close to 0).

This normalizing property ensures that the similarity score correctly measures alignment, regardless of how one vector is scaled in comparison to another.

Example: Returning to our streaming preferences, let's take a look at how our dinner guests' preferences would look as vectors:

vec1 = [5, 10, 5]
# Dinner guest A likes action, drama, and comedy as genres equally.

vec2 = [1, 2, 1] 
# Dinner guest B likes the same genres but consumes less streaming overall.

Let us discuss why cosine similarity is really effective in this case. When we compute cosine similarity for vec1 [5, 10, 5] and vec2 [1, 2, 1], we're essentially measuring the angle between these vectors.

Cosine similarity normalizes the vectors first, dividing each component by the vector's length, before taking the dot product. This operation "cancels" the differences in magnitude:

  • So for vec1: normalization gives us approximately [0.41, 0.82, 0.41].
  • For vec2: normalization also gives approximately [0.41, 0.82, 0.41].

And now we also understand why these vectors would be considered identical with regard to cosine similarity because their normalized versions are identical!

This tells us that even though dinner guest A views more total content, the proportions they allocate to each genre perfectly mirror dinner guest B's preferences. It's like saying both your guests dedicate 25% of their time to action, 50% to drama, and 25% to comedy, no matter the total hours viewed.
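If you want to verify those numbers yourself, a few lines of NumPy reproduce the normalization, the proportions, and the resulting similarity (values rounded):

import numpy as np

vec1 = np.array([5, 10, 5])   # dinner guest A
vec2 = np.array([1, 2, 1])    # dinner guest B

print(np.round(vec1 / np.linalg.norm(vec1), 2))   # [0.41 0.82 0.41]
print(np.round(vec2 / np.linalg.norm(vec2), 2))   # [0.41 0.82 0.41]

print(vec1 / vec1.sum())   # [0.25 0.5  0.25] -> 25% action, 50% drama, 25% comedy

# Cosine similarity of the two original vectors: effectively 1.0 (same direction)
cos = vec1 @ vec2 / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print(round(cos, 4))   # 1.0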

It’s this normalization that makes cosine similarity particularly effective for high-dimensional data such as text embeddings or user preferences.

When dealing with data of many dimensions (think hundreds or thousands of components of a vector for various features of a movie), it is often the relative significance of each dimension corresponding to the complete profile rather than the absolute values that matter most. Cosine similarity identifies precisely this arrangement of relative importance and is a powerful tool to identify meaningful relationships in complex data.


Hiking up the Euclidean Mountain Trail

Photo by Christian Mikhael on Unsplash

In this part, we will see how different approaches to measuring similarity behave in practice, with a concrete real-world example and a little code. Even if you are a non-techie, the code will be easy to follow; it's there to illustrate the simplicity of it all. No fear!

How about we quickly discuss a 10-mile-long hiking trail? Two friends, Alex and Blake, write trail reviews of the same hike, but each ascribes it a different character:

The trail gained 2,000 feet in elevation over just 2 miles! Easily doable with some high spikes in between!
Alex

and

Beware, we hiked 100 straight feet up in the forest terrain at the spike! Overall, 10 beautiful miles of forest!
Blake

These descriptions can be represented as vectors:

alex_description = [2000, 2]  # [elevation_gain, trail_distance]
blake_description = [100, 10]  # [elevation_gain, trail_distance]

Let’s combine both similarity measures and see what it tells us:

import numpy as np

def cosine_similarity(vec1, vec2):
    """
    Measures how similar the pattern or shape of two descriptions is,
    ignoring differences in scale. Returns 1.0 for perfectly aligned patterns.
    """
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

def euclidean_distance(vec1, vec2):
    """
    Measures the direct 'as-the-crow-flies' difference between descriptions.
    Smaller numbers mean descriptions are more similar.
    """
    return np.linalg.norm(np.array(vec1) - np.array(vec2))

# Alex focuses on the steep part: 2000ft elevation over 2 miles
alex_description = [2000, 2]  # [elevation_gain, trail_distance]

# Blake describes the whole trail: 100ft average elevation per mile over 10 miles
blake_description = [100, 10]  # [elevation_gain, trail_distance]

# Let's see how different these descriptions appear using each measure
print("Comparing how Alex and Blake described the same trail:")
print("\nEuclidean distance:", euclidean_distance(alex_description, blake_description))
print("(A larger number here suggests very different descriptions)")

print("\nCosine similarity:", cosine_similarity(alex_description, blake_description))
print("(A number close to 1.0 suggests similar patterns)")

# Let's also normalize the vectors to see what cosine similarity is looking at
alex_normalized = alex_description / np.linalg.norm(alex_description)
blake_normalized = blake_description / np.linalg.norm(blake_description)

print("\nAlex's normalized description:", alex_normalized)
print("Blake's normalized description:", blake_normalized)

So now, running this code, something magical happens:

Comparing how Alex and Blake described the same trail:

Euclidean distance: 1900.0168
(A larger number here suggests very different descriptions)

Cosine similarity: 0.9951
(A number close to 1.0 suggests similar patterns)

Alex's normalized description: [0.9999995 0.001]
Blake's normalized description: [0.9950372 0.0995037]

This output shows why, depending on what you are measuring, the same trail may appear different or similar.

The large Euclidean distance (roughly 1,900) suggests these are very different descriptions. It's understandable: 2,000 is a lot different from 100, and 2 is a lot different from 10. It's like taking the raw difference between these numbers without understanding their meaning.

But the high cosine similarity (0.995) tells us something more interesting: both descriptions capture a similar pattern.

If we look at the normalized vectors, we can see it too; both Alex and Blake are describing a trail in which elevation gain is the dominant feature. The first number in each normalized vector (elevation gain) is much larger relative to the second (trail distance). Normalizing by proportion rather than volume reveals that both descriptions share the same defining trait of the trail.

Perfectly true to life: Alex and Blake hiked the same trail but focused on different parts of it when writing their reviews. Alex focused on the steeper section and described a 2,000-foot climb over two miles, while Blake described the profile of the entire trail, roughly 100 feet of elevation gain per mile over 10 miles. Cosine similarity identifies these descriptions as variations of the same basic trail pattern, whereas Euclidean distance regards them as completely different trails.

This example highlights the need to select the appropriate similarity measure. In real use cases, normalizing and using cosine similarity surfaces meaningful relationships that raw Euclidean distances would miss.


Real-World Impacts of Metric Choices

Photo by fabio on Unsplash

The metric you pick doesn’t merely change the numbers; it influences the results of complex systems. Here’s how it breaks down in various domains:

  • In Recommendation Engines: When it comes to cosine similarity, we can group users who have the same tastes, even if they are doing different amounts of overall activity. A streaming service could use this to recommend movies that align with a user’s genre preferences, regardless of what is popular among a small subset of very active viewers.
  • In Document Retrieval: When querying a database of documents or research papers, cosine similarity ranks documents according to whether their content is similar in meaning to the user’s query, rather than their text length. This enables systems to retrieve results that are contextually relevant to the query, even though the documents are of a wide range of sizes.
  • In Fraud Detection: Patterns of behavior are often more important than pure numbers. Cosine similarity can be used to detect anomalies in spending habits, as it compares the direction of the transaction vectors — type of merchant, time of day, transaction amount, etc. — rather than the absolute magnitude.

And these differences matter because they shape how systems "think". Let's get back to that credit card example one more time: a system using Euclidean distance might flag a high-value $7,000 transaction for your new e-bike as suspicious, even if that transaction is normal for you, given you have an average spend of $20,000 a month.

A cosine-based system, on the other hand, understands that the transaction is consistent with what the user typically spends their money on, thus avoiding unnecessary false notifications.
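As a toy illustration of that idea, a cosine-based check might compare the direction of a period's spending mix against the customer's usual mix. All category shares and the 0.8 threshold below are invented for the example:

import numpy as np

def cosine(a, b):
    """Direction-only similarity between two spending-mix vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented spend-share features: [groceries, transport, electronics, restaurants]
usual_pattern   = np.array([0.40, 0.20, 0.25, 0.15])   # the customer's historical mix
month_with_bike = np.array([0.30, 0.15, 0.45, 0.10])   # a big e-bike purchase shifts the mix a little
dubai_fastfood  = np.array([0.05, 0.05, 0.00, 0.90])   # a very different pattern

for label, mix in [("E-bike month", month_with_bike), ("Dubai fast food", dubai_fastfood)]:
    score = cosine(usual_pattern, mix)
    verdict = "consistent with this customer" if score > 0.8 else "flag for review"
    print(f"{label}: similarity {score:.2f} -> {verdict}")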

But measures like Euclidean distance and cosine similarity are not merely theoretical. They’re the blueprints on which real-world systems stand. Whether it’s recommendation engines or fraud detection, the metrics we choose will directly impact how systems make sense of relationships in data.

Vector Representations in Practice: Industry Transformations

Photo by Louis Reed on Unsplash

This capacity for abstraction is what makes vector representations so powerful: they transform complex, domain-specific data into concepts that can be scored and acted upon. These insights are catalyzing fundamental transformations in business processes, decision-making, and customer value delivery across sectors.

Next, we will explore a concrete use case to see how vectors free up time to solve big problems and create new opportunities with real impact. I picked one industry to show what vector-based approaches to a challenge can achieve, so here is a healthcare example from a clinical setting. Why? Because it matters to us all and is easier to relate to than digging into the depths of the financial system, insurance, renewable energy, or chemistry.

Healthcare Spotlight: Pattern Recognition in Complex Medical Data

The healthcare industry poses a perfect storm of challenges that vector representations can uniquely solve. Think of the complexities of patient data: medical histories, genetic information, lifestyle factors, and treatment outcomes all interact in nuanced ways that traditional rule-based systems are incapable of capturing.

At Massachusetts General Hospital, researchers implemented a vector-based early detection system for sepsis, a condition in which every hour of early detection increases the chances of survival by 7.6% (see the full study at pmc.ncbi.nlm.nih.gov/articles/PMC6166236/).

In this new methodology, spontaneous neutrophil velocity profiles (SVP) are used to describe the movement patterns of neutrophils from a drop of blood. We won't get too medically detailed here, because we're vector-focused today, but a neutrophil is an immune cell that acts as a kind of first responder in the body's fight against infections.

The system then encodes each neutrophil's motion as a vector that captures not just its magnitude (i.e., speed) but also its direction. By converting these biological patterns into high-dimensional vector spaces, the researchers could capture subtle differences and show that healthy individuals and sepsis patients exhibit statistically significant differences in movement. These numeric vectors were then processed by a machine learning model trained to detect early signs of sepsis. The result was a diagnostic tool that reached impressive sensitivity (97%) and specificity (98%) for rapid and accurate identification of this fatal condition, probably using the cosine similarity we just learned about a moment ago (the paper doesn't go into much detail, so this is pure speculation, but it would be the most suitable).

This is just one example of how medical data can be encoded into vector representations and turned into malleable, actionable insights. This approach made it possible to re-contextualize complex relationships and, together with machine learning, worked around the limitations of previous diagnostic modalities, proving to be a potent tool for clinicians to save lives. It's a powerful reminder that vectors aren't merely theoretical constructs; they're practical, life-saving solutions that are powering the future of healthcare as much as your credit card risk detection software and, hopefully, also your business.


Lead and understand, or face disruption. The naked truth.

Photo by Hunters Race on Unsplash

With all you have read by now, consider a decision as small as the choice of metric under which data relationships are evaluated. Leaders risk making assumptions that are subtle yet disastrous. Making leadership decisions without understanding the fundamentals of vectors is like using a calculator without knowing what formulas you are applying: you get some result, but you cannot know whether it is right.

The good news is this doesn’t mean that business leaders have to become data scientists. Vectors are delightful because, once the core ideas have been grasped, they become very easy to work with. An understanding of a handful of concepts (for example, how vectors encode relationships, why distance metrics are important, and how embedding models function) can fundamentally change how you make high-level decisions. These tools will help you ask better questions, work with technical teams more effectively, and make sound decisions about the systems that will govern your business.

The returns on this small investment in comprehension are huge. There is much talk about personalization, yet few organizations use vector-based thinking in their business strategies. Doing so could help them leverage personalization to its full potential, delighting customers with tailored experiences and building loyalty. You could innovate in areas like fraud detection and operational efficiency by leveraging subtle patterns in data that traditional methods miss, or perhaps even save lives, as described above. Equally important, you can avoid expensive missteps that happen when leaders defer to others for key decisions without understanding what they mean.

The truth is, vectors are here now, quietly driving the vast majority of the hyped AI technology behind the scenes and helping to create the world we navigate today and tomorrow. Companies that do not adapt their leadership to think in vectors risk falling behind in a competitive landscape that becomes ever more data-driven. Those who adopt this new paradigm will not just survive but prosper in an age of never-ending AI innovation.

Now is the moment to act. Start to view the world through vectors. Study their language, examine their principles, and ask how this new way of thinking could change your tactics and your lodestars. Much in the way that algebra became an essential tool for working through practical life challenges, vectors will soon serve as the literacy of the data age. Actually, they already do. It is a future the powerful know how to take control of. The question is not whether vectors will define the next era of business; it is whether you are prepared to lead it.

Circuit Tracing: A Step Closer to Understanding Large Language Models https://towardsdatascience.com/circuit-tracing-a-step-closer-to-understanding-large-language-models/ Tue, 08 Apr 2025 18:38:39 +0000 https://towardsdatascience.com/?p=605686 Reverse-engineering large languages models' computation circuit to understand their decision-making processes

Context

Over the years, Transformer-based large language models (LLMs) have made substantial progress across a wide range of tasks, evolving from simple information retrieval systems to sophisticated agents capable of coding, writing, conducting research, and much more. But despite their capabilities, these models are still largely black boxes. Given an input, they accomplish the task, but we lack intuitive ways to understand how the task was actually accomplished.

LLMs are designed to predict the statistically best next word/token. But do they only focus on predicting the next token, or do they plan ahead? For instance, when we ask a model to write a poem, is it generating one word at a time, or is it anticipating rhyme patterns before outputting the word? Or when asked a basic reasoning question, such as what the capital is of the state where Dallas is located, models often produce results that look like a chain of reasoning, but did the model actually use that reasoning? We lack visibility into the model's internal thought process. To understand LLMs, we need to trace their underlying logic.

The study of LLMs' internal computation falls under "Mechanistic Interpretability," which aims to uncover the computational circuits of models. Anthropic is one of the leading AI companies working on interpretability. In March 2025, they published a paper titled "Circuit Tracing: Revealing Computational Graphs in Language Models," which aims to tackle the problem of circuit tracing.

This post aims to explain the core ideas behind their work and build a foundation for understanding circuit tracing in LLMs.

What is a circuit in LLMs?

Before we can define a "circuit" in language models, we first need to look inside the LLM. An LLM is a neural network built on the transformer architecture, so it seems obvious to treat neurons as the basic computational unit and interpret the patterns of their activations across layers as the model's computation circuit.

However, the "Towards Monosemanticity" paper revealed that tracking neuron activations alone doesn't provide a clear understanding of why those neurons are activated. This is because individual neurons are often polysemantic: they respond to a mix of unrelated concepts.

The paper further showed that neurons are composed of more fundamental units called features, which capture more interpretable information. In fact, a neuron can be seen as a combination of features. So rather than tracing neuron activations, we aim to trace feature activations, the actual units of meaning driving the model's outputs.

With that, we can define a circuit as a sequence of feature activations and connections used by the model to transform a given input into an output.

Now that we know what we’re looking for, let’s dive into the technical setup.

Technical Setup

We’ve established that we need to trace feature activations rather than neuron activations. To enable this, we need to convert the neurons of the existing LLM models into features, i.e. build a replacement model that represents computations in terms of features.

Before diving into how this replacement model is constructed, let’s briefly review the architecture of Transformer-based large language models.

The following diagram illustrates how transformer-based language models operate. The idea is to convert the input into tokens using embeddings. These tokens are passed to the attention block, which calculates the relationships between tokens. Then, each token is passed to the multi-layer perceptron (MLP) block, which further refines the token using a non-linear activation and linear transformations. This process is repeated across many layers before the model generates the final output.

Image by Author
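For readers who like to see this flow in code, below is a heavily simplified single transformer block in PyTorch. It is only a sketch of the attention-plus-MLP pattern described above; the dimensions, GELU activation, and layer normalization placement are generic assumptions, not those of any particular model:

import torch
import torch.nn as nn

class SimplifiedTransformerBlock(nn.Module):
    """One attention block followed by one MLP block, as in the diagram above."""
    def __init__(self, d_model=512, n_heads=8, d_mlp=2048):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(            # the part a transcoder would later replace
            nn.Linear(d_model, d_mlp),
            nn.GELU(),
            nn.Linear(d_mlp, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model) token embeddings
        attn_out, _ = self.attention(x, x, x)   # relationships between tokens
        x = self.norm1(x + attn_out)            # residual connection + normalization
        x = self.norm2(x + self.mlp(x))         # per-token refinement by the MLP
        return x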

Now that we have laid out the structure of a transformer-based LLM, let's look at what transcoders are. The authors used a "transcoder" to develop the replacement model.

Transcoders

A transcoder is itself a neural network (generally with a much higher dimension than the LLM's) designed to replace the MLP block in a transformer model with a more interpretable, functionally equivalent component (features).

Image by Author

It processes tokens from the attention block in three stages: encoding, sparse activation, and decoding. Effectively, it scales the input to a higher-dimensional space, applies an activation that forces only a sparse set of features to fire, and then compresses the output back to the original dimension in the decoding stage.

Image by Author
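Here is a minimal sketch of that encode, sparse activation, decode pipeline. The dimensions and the use of a plain ReLU for sparsity are illustrative assumptions rather than the exact setup from the paper:

import torch
import torch.nn as nn

class SimpleTranscoder(nn.Module):
    """Drop-in stand-in for an MLP block: encode wider, activate sparsely, decode back."""
    def __init__(self, d_model=512, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)    # scale up into feature space
        self.decoder = nn.Linear(d_features, d_model)    # compress back to model width

    def forward(self, x):
        features = torch.relu(self.encoder(x))           # only some features fire
        mlp_equivalent_output = self.decoder(features)   # trained (not shown) to imitate the MLP
        return mlp_equivalent_output, features           # the features are what we trace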

With a basic understanding of transformer-based LLMs and transcoder, let’s look at how a transcoder is used to build a replacement model.

Construct a replacement model

As mentioned earlier, a transformer block typically consists of two main components: an attention block and an MLP block (feedforward network). To build a replacement model, the MLP block in the original transformer model is replaced with a transcoder. This integration is seamless because the transcoder is trained to mimic the output of the original MLP, while also exposing its internal computations through sparse and modular features.

While standard transcoders are trained to imitate the MLP behavior within a single transformer layer, the authors of the paper used a cross-layer transcoder (CLT), which captures the combined effects of transcoder blocks across several layers. This is important because it allows us to track whether a feature is spread across multiple layers, which is needed for circuit tracing.

The image below illustrates how the cross-layer transcoder (CLT) setup is used to build a replacement model. The transcoder output at layer 1 contributes to constructing the MLP-equivalent output in all the upper layers until the end.

Image by Author
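A schematic sketch of that cross-layer idea is shown below: each layer's features own one decoder per downstream layer. The dimensions, the ReLU activation, and this one-decoder-per-downstream-layer layout are assumptions for illustration, not the paper's exact implementation:

import torch
import torch.nn as nn

class CrossLayerTranscoderSketch(nn.Module):
    """Features read at layer l contribute to the MLP-equivalent output of every layer >= l."""
    def __init__(self, n_layers=4, d_model=512, d_features=16384):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Linear(d_model, d_features) for _ in range(n_layers)
        )
        # decoders[l][k] maps layer-l features into the output of layer l + k
        self.decoders = nn.ModuleList(
            nn.ModuleList(nn.Linear(d_features, d_model) for _ in range(n_layers - l))
            for l in range(n_layers)
        )

    def forward(self, layer_inputs):                     # list of (batch, seq, d_model) tensors
        outputs = [torch.zeros_like(x) for x in layer_inputs]
        for l, x in enumerate(layer_inputs):
            features = torch.relu(self.encoders[l](x))   # sparse-ish features at layer l
            for k, decode in enumerate(self.decoders[l]):
                outputs[l + k] += decode(features)       # feed layers l, l+1, ..., n_layers-1
        return outputs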

Side note: the following image is from the paper and shows how a replacement model is constructed. It replaces the neurons of the original model with features.

Image from https://transformer-circuits.pub/2025/attribution-graphs/methods.html#graphs-constructing

Now that we understand the architecture of the replacement model, let's look at how an interpretable representation is built on top of the replacement model's computational path.

Interpretable representation of the model's computation: Attribution graph

To build the interpretable representation of the model's computational path, we start from the model's output feature and trace backward through the feature network to uncover which earlier features contributed to it. This is done using the backward Jacobian, which tells us how much a feature in the previous layer contributed to the current feature's activation, and it is applied recursively until we reach the input. Each feature is treated as a node and each influence as an edge. This process can lead to a complex graph with millions of nodes and edges, so pruning is done to keep the graph compact and manually interpretable.

The authors refer to this computational graph as an attribution graph and have also developed a tool to inspect it. This forms the core contribution of the paper.
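To make the backward-tracing idea concrete, here is a toy sketch built around the Dallas example from earlier. The feature names, influence scores, and pruning threshold are all invented; real attribution graphs are derived from the replacement model's backward Jacobians and then pruned, but the recursive backward walk is the same in spirit:

# Toy "influence" scores: influences[child][parent] = how strongly the parent feature
# (one step earlier) contributed to the child feature. All numbers are made up.
influences = {
    "output: 'Austin'": {"feature: state capital": 0.8, "feature: Texas": 0.6},
    "feature: state capital": {"feature: capital-of question": 0.9},
    "feature: Texas": {"feature: token 'Dallas'": 0.7},
}

def trace(node, weight=1.0, prune_below=0.1, depth=0):
    """Walk backward from an output feature, keeping only edges above the pruning threshold."""
    for parent, strength in influences.get(node, {}).items():
        contribution = weight * strength
        if contribution >= prune_below:               # pruning keeps the graph readable
            print("  " * depth + f"{node} <- {parent} ({contribution:.2f})")
            trace(parent, contribution, prune_below, depth + 1)

trace("output: 'Austin'")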

The image below illustrates a sample attribution graph.

Image from https://transformer-circuits.pub/2025/attribution-graphs/methods.html#graphs

Now, with all this understanding, we can go to feature interpretability.

Feature interpretability using an attribution graph

The researchers used attribution graphs on Anthropic’s Claude 3.5 Haiku model to study how it behaves across different tasks. In the case of poem generation, they discovered that the model doesn’t just generate the next word. It engages in a form of planning, both forward and backward. Before generating a line, the model identifies several possible rhyming or semantically appropriate words to end with, then works backward to craft a line that naturally leads to that target. Surprisingly, the model appears to hold multiple candidate end words in mind simultaneously, and it can restructure the entire sentence based on which one it ultimately chooses.

This technique offers a clear, mechanistic view of how language models generate structured, creative text. This is a significant milestone for the AI community. As we develop increasingly powerful models, the ability to trace and understand their internal planning and execution will be essential for ensuring alignment, safety, and trust in AI systems.

Limitations of the current approach

Attribution graphs offer a way to trace model behavior for a single input, but they don’t yet provide a reliable method for understanding global circuits or the consistent mechanisms a model uses across many examples. This analysis relies on replacing MLP computations with transcoders, but it is still unclear whether these transcoders truly replicate the original mechanisms or simply approximate the outputs. Additionally, the current approach highlights only active features, but inactive or inhibitory ones can be just as important for understanding the model’s behavior.

Conclusion

Circuit tracing via attribution graph is an early but important step toward understanding how language models work internally. While this approach still has a long way to go, the introduction of circuit tracing marks a major milestone on the path to true interpretability.


AI in Social Research and Polling https://towardsdatascience.com/ai-in-social-research-and-polling/ Wed, 02 Apr 2025 00:42:53 +0000 https://towardsdatascience.com/?p=605381 What do we do when nobody answers the phone?

This month, I’m going to be discussing a really interesting topic that I came across in a recent draft paper by a professor at the University of Maryland named M. R. Sauter. In the paper, they discuss (among other things) the phenomenon of social scientists and pollsters trying to employ AI tools to help overcome some of the challenges in conducting social science human subjects research and polling, and point out some major flaws with these approaches. I had some additional thoughts that were inspired by the topic, so let’s talk about it!

Hi, can I ask you a short series of questions?

Let’s start with a quick discussion of why this would be necessary in the first place. Doing social science research and polling is extraordinarily difficult in the modern day. A huge part of this is simply due to the changes in how people connect and communicate — namely, cellphones — making it incredibly hard to get access to a random sampling of individuals who will participate in your research.

To contextualize this, when I was an undergraduate sociology student almost 25 years ago, in research methods class we were taught that a good way to randomly sample people for large research studies was to just take the area code and 3 digit phone number prefix for an area, and randomly generate selections of four digits to complete them, and call those numbers. In those days, before phone scammers became the bane of all our lives, people would answer and you could ask your research questions. Today, on the other hand, this kind of method for trying to get a representative sample of the public is almost laughable. Scarcely anyone answers calls from unknown numbers in their daily lives, outside of very special situations (like when you’re waiting for a job interviewer to call).

So, what do researchers do now? Today, you can sometimes pay gig workers for poll participation, although Amazon MTurk workers or Upworkers are not necessarily representative of the entire population. The sample you can get will have some bias, which has to be accounted for with sampling and statistical methods. A bigger barrier is that these people’s time and effort costs money, which pollsters and academics are loath to part with (and in the case of academics, increasingly they do not have).

What else? If you’re like me, you’ve probably gotten an unsolicited polling text before as well — these are interesting, because they could be legitimate, or they could be scammers out to get your data or money, and it’s tremendously difficult to tell the difference. As a sociologist, I have a soft spot for doing polls and answering surveys to help other social scientists, and even I don’t trust these enough to click through, more often than not. They’re also a demand on your time, and many people are too busy even if they trust the source.


Regardless of the attempted solutions and their flaws, this problem matters. The entire industry of polling depends on being able to get a diverse sample of people from all walks of life on the telephone, and convincing them to give you their opinion about something. This is more than just a problem for social scientists doing academic work, because polling is a massive industry unto itself with a lot of money on the line.

Do we really need the humans?

Can AI help with this problem in some way? If we involve generative AI in this task, what might that look like? Before we get to practical ways to attack this, I want to discuss a concept Sauter proposes called “AI imaginaries” — essentially, the narratives and social beliefs we hold about what AI really is, what it can do, and what value it can create. This is hard to pin down, partly because of a “strategic vagueness” about the whole idea of AI. Longtime readers will know I have struggled mightily with figuring out whether and how to even reference the term “AI” because it is such an overloaded and conflicted term.

However, we can all think of potentially problematic beliefs and expectations about AI that we encounter implicitly or explicitly in society, such as the idea that AI is inherently a channel for social progress, or that using AI instead of employing human people for tasks is inherently good, because of “efficiency”. I’ve talked about many of these concepts in my other columns, because I think challenging the accuracy of our assumptions is important to help us suss out what the true contributions of AI to our world can really be. Flawed assumptions can lead us to buy into undeserved hype or overpromising that the tech industry can be unfortunately prone to.

In the context of applying AI to social science research, some of Sauter’s components of the AI imaginary include:

  • expectations that AI can be relied upon as a source of truth
  • believing that everything meaningful can be measured or quantified, and
  • (perhaps most problematically) asserting that there is some equivalency between the output of human intelligence or creativity and the output of AI models


What have they tried?

With this framework of thinking in mind, let’s look at a few of the specific approaches people have taken to fixing the difficulties in finding real human beings to involve in research using AI. Many of these techniques have a common thread in that they give up on trying to actually get access to individuals for the research, and just ask LLMs to answer the questions instead.

In one case, an AI startup offers to use LLMs to run your polling for you, instead of actually asking any people at all. They mimic electoral demographics as closely as possible and build samples almost like “digital twin” entities. (Notably, they were predicting the eventual US general election result incorrectly in a September 2024 article.)

Sauter cites a number of other research approaches applying similar techniques, including testing whether an LLM would change its answers to opinion questions when exposed to media with particular leanings or opinions (e.g., replicating the effect of news on public opinion), attempting to specifically emulate human subgroups using LLMs in the belief that this can overcome algorithmic bias, and testing whether the poll responses of LLMs are distinguishable from human answers to a layperson.

Does it work?

Some defend these strategies by arguing that their LLMs can be made to produce answers that approximately match the results of real human polling, but simultaneously argue that human polling is no longer accurate enough to be usable. This brings up the obvious question of, if the human polling is not trustworthy, how is it trustworthy enough to be the benchmark standard for the LLMs?

Furthermore, if the LLM’s output today can be made to match what we think we know about human opinions, that doesn’t mean that output will continue to match human beliefs or the opinions of the public in the future. LLMs are constantly being retrained and developed, and the dynamics of public opinions and perspectives are fluid and variable. One validation today, even if successful, doesn’t promise anything about another set of questions, on another topic, at another time or in another context. Assumptions about this future dependability are consequences of the fallacious expectation that LLMs can be trusted and relied upon as sources of truth, when that is not now and never has been the purpose of these models.

We should always take a step back and remember what LLMs are built for, and what their actual objectives are. As Sanders et al notes, “LLMs generate a response predicted to be most acceptable to the user on the basis of a training process such as reinforcement learning with human feedback”. They’re trying to estimate the next word that will be appealing to you, based on the prompt you have provided — we should not start to fall into mythologizing that suggests the LLM is doing anything else.

When an LLM produces an unexpected response, it’s essentially because a certain amount of randomness is built in to the model — from time to time, in order to sound more “human” and dynamic, instead of choosing the next word with the highest probability, it’ll choose a different one further down the rankings. This randomness is not based on an underlying belief, or opinion, but is just built in to avoid the text sounding robotic or dull. However, when you use an LLM to replicate human opinions, these become outliers that are absorbed into your data. How does this methodology interpret such responses? In real human polling, the outliers may contain useful information about minority perspectives or the fringes of belief — not the majority, but still part of the population. This opens up a lot of questions about how our interpretation of this artificial data can be conducted, and what inferences we can actually draw.

On synthetic data

This topic overlaps with the broader concept of synthetic data in the AI space. As the quantities of unseen organically human generated content available for training LLMs dwindle, studies have attempted to see whether you could bootstrap your way to better models, namely by making an LLM generate new data, then using that to train on. This fails, causing models to collapse, in a form that Jathan Sadowski named “Habsburg AI”.

What this teaches us is that there is more that differentiates the stuff that LLMs produce from organically human generated content than we can necessarily detect. Something is different about the synthetic stuff, even if we can’t totally identify or measure what it is, and we can tell this is the case because the end results are so drastically different. I’ve talked before about the complications and challenges around human detection of synthetic content, and it’s clear that just because humans may not be able to easily and obviously tell the difference, that doesn’t mean there is none.


We might also be tempted by the argument that, well, polling is increasingly unreliable and inaccurate, because we have no more easy, free access to the people we want to poll, so this AI mediated version might be the best we can do. If it’s better than the status quo, what’s wrong with trying?

Is it a good idea?

Whether or not it works, is this the right thing to do? This is the question that most users and developers of such technology don’t take much note of. The tech industry broadly is often guilty of this — we ask whether something is effective, for the immediate objective we have in mind, but we may skip over the question of whether we should do it anyway.

I’ve spent a lot of time recently thinking about why these approaches to polling and research worry me. Sauter makes the argument that this is inherently corrosive to social participation, and I’m inclined to agree in general. There’s something troubling about determining that because people are difficult or expensive to use, that we toss them aside and use technological mimicry to replace them. The validity of this depends heavily on what the task is, and what the broader impact on people and society would be. Efficiency is not the unquestionable good that we might sometimes think.

For one thing, people have increasingly begun to learn that our data (including our opinions) has monetary and social value, and it isn’t outrageous for us to want to get a piece of that value. We’ve been giving our opinions away for free for a long time, but I sense that’s evolving. These days retailers regularly offer discounts and deals in exchange for product reviews, and as I noted earlier, MTurkers and other gig workers can rent out their time and get paid for polling and research projects. In the case of commercial polling, where a good deal of the energy for this synthetic polling comes from, substituting LLMs sometimes feels like a method for making an end run around the pesky human beings who don’t want to contribute to someone else’s profits for free.

If we assume that the LLM can generate accurate polls, we are assuming a state of determinism that runs counter to the democratic project.

But setting this aside, there’s a social message behind these efforts that I don’t think we should minimize. Teaching people that their beliefs and opinions are replaceable with technology sets a precedent that can unintentionally spread. If we assume that the LLM can generate accurate polls, we are assuming a state of determinism that runs counter to the democratic project, and expects democratic choices to be predictable. We may think we know what our peers believe, maybe even just by looking at them or reading their profiles, but in the US, at least, we still operate under a voting model that lets that person have a secret ballot to elect their representation. They are at liberty to make their choice based on any reasoning, or none at all. Presuming that we don’t actually have the free will to change our mind in the privacy of the voting booth just feels dangerous. What’s the argument, if we accept the LLMs instead of real polls, that this can’t be spread to the voting process itself?

I haven’t even touched on the issue of trust that keeps people from honestly responding to polls or research surveys, which is an additional sticking point. Instead of going to the source and really interrogating what it is in our social structure that makes people unwilling to honestly state their sincerely held beliefs to peers, we again see the approach of just throwing up our hands and eliminating people from the process altogether.

Sweeping social problems under an LLM rug

It just seems really troubling that we are considering using LLMs to paper over the social problems getting in our way. It feels similar to a different area I’ve written about, the fact that LLM output replicates and mimics the bigotry and harmful content that it finds in training data. Instead of taking a deeper look at ourselves, and questioning why this is in the organically human created content, some people propose censoring and heavily filtering LLM output, as an attempt to hide this part of our real social world.

I guess it comes down to this: I’m not in favor of resorting to LLMs to avoid trying to solve real social problems. I’m not convinced we’ve really tried, in some cases, and in other cases like the polling, I’m deeply concerned that we’re going to create even more social problems by using this strategy. We have a responsibility to look beyond the narrow scope of the issue we care about at this particular moment, and anticipate cascading externalities that may result.


Read more of my work at www.stephaniekirmer.com.


Further Reading

M. R. Sauter, 2025. https://oddletters.com/files/2025/02/Psychotic-Ecologies-working-paper-Jan-2025.pdf

https://www.stephaniekirmer.com/writing/howdoweknowifaiissmokeandmirrors/

https://hdsr.mitpress.mit.edu/pub/dm2hrtx0/release/1

https://www.semafor.com/article/09/20/2024/ai-startup-aaru-uses-chatbots-instead-of-humans-for-political-polls

https://www.stephaniekirmer.com/writing/theculturalimpactofaigeneratedcontentpart1

https://www.cambridge.org/core/journals/political-analysis/article/abs/out-of-one-many-using-language-models-to-simulate-human-samples/035D7C8A55B237942FB6DBAD7CAA4E49

https://www.jathansadowski.com/

https://futurism.com/ai-trained-ai-generated-data-interview

https://www.stephaniekirmer.com/writing/seeingourreflectioninllms

The post AI in Social Research and Polling appeared first on Towards Data Science.

]]>
Mastering Prompt Engineering with Functional Testing: A Systematic Guide to Reliable LLM Outputs  https://towardsdatascience.com/mastering-prompt-engineering-with-functional-testing-a-systematic-guide-to-reliable-llm-outputs/ Fri, 14 Mar 2025 22:48:00 +0000 https://towardsdatascience.com/?p=599600 How prompt evaluation with a systematic approach composed of algorithmic testing with input/output data fixtures can make prompt engineering for complex AI tasks more reliable.

The post Mastering Prompt Engineering with Functional Testing: A Systematic Guide to Reliable LLM Outputs  appeared first on Towards Data Science.

]]>
Creating efficient prompts for large language models often starts as a simple task… but it doesn’t always stay that way. Initially, following basic best practices seems sufficient: adopt the persona of a specialist, write clear instructions, require a specific response format, and include a few relevant examples. But as requirements multiply, contradictions emerge, and even minor modifications can introduce unexpected failures. What was working perfectly in one prompt version suddenly breaks in another.

If you have ever felt trapped in an endless loop of trial and error, adjusting one rule only to see another one fail, you’re not alone! The reality is that traditional prompt optimisation is clearly missing a structured, more scientific approach that will help to ensure reliability.

That’s where functional testing for prompt engineering comes in! This approach, inspired by methodologies of experimental science, leverages automated input-output testing with multiple iterations and algorithmic scoring to turn prompt engineering into a measurable, data-driven process. 

No more guesswork. No more tedious manual validation. Just precise and repeatable results that allow you to fine-tune prompts efficiently and confidently.

In this article, we will explore a systematic approach for mastering prompt engineering, which ensures your LLM outputs will be efficient and reliable even for the most complex AI tasks.

Balancing precision and consistency in prompt optimisation

Adding a large set of rules to a prompt can introduce partial contradictions between rules and lead to unexpected behaviors. This is especially true when following a pattern of starting with a general rule and following it with multiple exceptions or specific contradictory use cases. Adding specific rules and exceptions can cause conflict with the primary instruction and, potentially, with each other.

What might seem like a minor modification can unexpectedly impact other aspects of a prompt. This is not only true when adding a new rule but also when adding more detail to an existing rule, like changing the order of the set of instructions or even simply rewording it. These minor modifications can unintentionally change the way the model interprets and prioritizes the set of instructions.

The more details you add to a prompt, the greater the risk of unintended side effects. By trying to specify every aspect of your task in too much detail, you also increase the risk of getting unexpected or malformed results. It is, therefore, essential to find the right balance between clarity and a high level of specification to maximise the relevance and consistency of the response. At a certain point, fixing one requirement can break two others, creating the frustrating feeling of taking one step forward and two steps backward in the optimization process.

Testing each change manually becomes quickly overwhelming. This is especially true when one needs to optimize prompts that must follow numerous competing specifications in a complex AI task. The process cannot simply be about modifying the prompt for one requirement after the other, hoping the previous instruction remains unaffected. It also can’t be a system of selecting examples and checking them by hand. A better process with a more scientific approach should focus on ensuring repeatability and reliability in prompt optimization.

From laboratory to AI: Why testing LLM responses requires multiple iterations

Science teaches us to use replicates to ensure reproducibility and build confidence in an experiment’s results. I have been working in academic research in chemistry and biology for more than a decade. In those fields, experimental results can be influenced by a multitude of factors that can lead to significant variability. To ensure the reliability and reproducibility of experimental results, scientists mostly employ a method known as triplicates. This approach involves conducting the same experiment three times under identical conditions, so that experimental variation has only a minor influence on the result. Statistical analysis (mean and standard deviation) conducted on the results, mostly in biology, allows the author of an experiment to determine the consistency of the results and strengthens confidence in the findings.

Just like in biology and chemistry, this approach can be used with LLMs to achieve reliable responses. With LLMs, the generation of responses is non-deterministic, meaning that the same input can lead to different outputs due to the probabilistic nature of the models. This variability is challenging when evaluating the reliability and consistency of LLM outputs.

In the same way that biological/chemical experiments require triplicates to ensure reproducibility, testing LLMs requires multiple iterations to measure reproducibility. A single test per use case is, therefore, not sufficient because it does not represent the inherent variability in LLM responses. At least five iterations per use case allow for a better assessment. By analyzing the consistency of the responses across these iterations, we can better evaluate the reliability of the model and identify any potential issues or variations. This ensures that the output of the model is properly controlled.

Multiply this across 10 to 15 different prompt requirements, and one can easily understand how, without a structured testing approach, we end up spending time in trial-and-error testing with no efficient way to assess quality.

A systematic approach: Functional testing for prompt optimization

To address these challenges, a structured evaluation methodology can be used to ease and accelerate the testing process and enhance the reliability of LLM outputs. This approach has several key components:

  • Data fixtures: The core of the approach is a set of data fixtures, which are predefined input-output pairs specifically created for prompt testing. These fixtures serve as controlled scenarios that represent the various requirements and edge cases the LLM must handle. By using a diverse set of fixtures, the performance of the prompt can be evaluated efficiently across different conditions.
  • Automated test validation: This approach automates the validation of the requirements on a set of data fixtures by comparison between the expected outputs defined in the fixtures and the LLM response. This automated comparison ensures consistency and reduces the potential for human error or bias in the evaluation process. It allows for quick identification of discrepancies, enabling fine and efficient prompt adjustments.
  • Multiple iterations: To assess the inherent variability of the LLM responses, this method runs multiple iterations for each test case. This iterative approach mimics the triplicate method used in biological/chemical experiments, providing a more robust dataset for analysis. By observing the consistency of responses across iterations, we can better assess the stability and reliability of the prompt.
  • Algorithmic scoring: The results of each test case are scored algorithmically, reducing the need for long and laborious “human” evaluation. This scoring system is designed to be objective and quantitative, providing clear metrics for assessing the performance of the prompt. By focusing on measurable outcomes, we can make data-driven decisions to optimize the prompt effectively.

Step 1: Defining test data fixtures

Selecting or creating compatible test data fixtures is the most challenging step of our systematic approach because it requires careful thought. A fixture is not just any input-output pair; it must be crafted meticulously to evaluate the LLM’s performance as accurately as possible for a specific requirement. This process requires:

1. A deep understanding of the task and the behavior of the model to make sure the selected examples effectively test the expected output while minimizing ambiguity or bias.

2. Foresight into how the evaluation will be conducted algorithmically during the test.

The quality of a fixture, therefore, depends not only on how representative the example is but also on ensuring it can be efficiently tested algorithmically.

A fixture consists of:

    • Input example: This is the data that will be given to the LLM for processing. It should represent a typical or edge-case scenario that the LLM is expected to handle. The input should be designed to cover a wide range of possible variations that the LLM might have to deal with in production.

    • Expected output: This is the expected result that the LLM should produce with the provided input example. It is used for comparison with the actual LLM response output during validation.

Step 2: Running automated tests

Once the test data fixtures are defined, the next step involves the execution of automated tests to systematically evaluate the performance of the LLM response on the selected use cases. As previously stated, this process makes sure that the prompt is thoroughly tested against various scenarios, providing a reliable evaluation of its efficiency.

Execution process

    1. Multiple iterations: For each test use case, the same input is provided to the LLM multiple times. A simple for loop over nb_iter iterations, with nb_iter = 5, and voilà!

    2. Response comparison: After each iteration, the LLM response is compared to the expected output of the fixture. This comparison checks whether the LLM has correctly processed the input according to the specified requirements.

    3. Scoring mechanism: Each comparison results in a score:

        ◦ Pass (1): The response matches the expected output, indicating that the LLM has correctly handled the input.

        ◦ Fail (0): The response does not match the expected output, signaling a discrepancy that needs to be fixed.

    4. Final score calculation: The scores from all iterations are aggregated to calculate the overall final score. This score represents the proportion of successful responses out of the total number of iterations. A high score, of course, indicates high prompt performance and reliability.
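To make the loop concrete, here is a minimal Python sketch of the execution process described above. It is an illustration rather than the author’s actual implementation: run_prompt and validate are hypothetical placeholders for your LLM call and for the comparison against the fixture’s expected output.

```python
def score_fixture(fixture, run_prompt, validate, nb_iter=5):
    """Run one fixture nb_iter times and return the proportion of passes."""
    # Assumed fixture shape: {"name": ..., "input": ..., "expected": ...}
    passes = 0
    for _ in range(nb_iter):
        response = run_prompt(fixture["input"])                  # 1. call the LLM
        passes += int(validate(response, fixture["expected"]))   # 2-3. compare and score 1/0
    return passes / nb_iter                                      # 4. final score


def score_all(fixtures, run_prompt, validate, nb_iter=5):
    """Aggregate the scores of all fixtures into a single report."""
    return {f["name"]: score_fixture(f, run_prompt, validate, nb_iter) for f in fixtures}
```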

Example: Removing author signatures from an article

Let’s consider a simple scenario where an AI task is to remove author signatures from an article. To efficiently test this functionality, we need a set of fixtures that represent the various signature styles. 

A dataset for this example could be:

Example Input                             | Expected Output
A long article, signed “Jean Leblanc”     | The long article
A long article, signed “P. W. Hartig”     | The long article
A long article, signed “MCZ”              | The long article

Validation process:

  • Signature removal check: The validation function checks if the signature is absent from the rewritten text. This is easily done programmatically by searching for the signature needle in the haystack output text.
  • Test failure criteria: If the signature is still present in the output, the test fails, indicating that the LLM did not correctly remove the signature and that further adjustments to the prompt are required. Otherwise, the test passes.
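As an illustration of the validation step, here is one possible check for the signature-removal fixtures. It is a sketch under the assumption that each fixture stores the signature string to look for; it is not taken from the article.

```python
def signature_removed(llm_output: str, signature: str) -> bool:
    """Pass (True) if the signature string no longer appears in the output."""
    return signature.lower() not in llm_output.lower()

# Scoring a single iteration against the fixtures from the table above:
print(signature_removed("The long article", "Jean Leblanc"))   # True  -> score 1
print(signature_removed("The long article\nMCZ", "MCZ"))        # False -> score 0
```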

The test evaluation provides a final score that allows a data-driven assessment of the prompt efficiency. If it scores perfectly, there is no need for further optimization. However, in most cases, you will not get a perfect score because either the consistency of the LLM response to a case is low (for example, 3 out of 5 iterations scored positive) or there are edge cases that the model struggles with (0 out of 5 iterations). 

The feedback clearly indicates that there is still room for further improvements and it guides you to reexamine your prompt for ambiguous phrasing, conflicting rules, or edge cases. By continuously monitoring your score alongside your prompt modifications, you can incrementally reduce side effects, achieve greater efficiency and consistency, and approach an optimal and reliable output. 

A perfect score is, however, not always achievable with the selected model. Changing the model might just fix the situation. If it doesn’t, you know the limitations of your system and can take this fact into account in your workflow. With luck, this situation might just be solved in the near future with a simple model update. 

Benefits of this method 

  • Reliability of the result: Running five to ten iterations provides reliable statistics on the performance of the prompt. A single test run may succeed once but not twice, and consistent success for multiple iterations indicates a robust and well-optimized prompt.
  • Efficiency of the process: Unlike traditional scientific experiments that may take weeks or months to replicate, automated testing of LLMs can be carried out quickly. By setting a high number of iterations and waiting for a few minutes, we can obtain a high-quality, reproducible evaluation of the prompt efficiency.
  • Data-driven optimization: The score obtained from these tests provides a data-driven assessment of the prompt’s ability to meet requirements, allowing targeted improvements.
  • Side-by-side evaluation: Structured testing allows for an easy assessment of prompt versions. By comparing the test results, one can identify the most effective set of parameters for the instructions (phrasing, order of instructions) to achieve the desired results.
  • Quick iterative improvement: The ability to quickly test and iterate on prompts is a real advantage for carefully constructing the prompt, ensuring that previously validated requirements remain satisfied as the prompt increases in complexity and length.

By adopting this automated testing approach, we can systematically evaluate and enhance prompt performance, ensuring consistent and reliable outputs with the desired requirements. This method saves time and provides a robust analytical tool for continuous prompt optimization.

Systematic prompt testing: Beyond prompt optimization

Implementing a systematic prompt testing approach offers more advantages than just the initial prompt optimization. This methodology is valuable for other aspects of AI tasks:

    1. Model comparison:

        ◦ Provider evaluation: This approach allows the efficient comparison of different LLM providers, such as ChatGPT, Claude, Gemini, Mistral, etc., on the same tasks. It becomes easy to evaluate which model performs best for your specific needs.

        ◦ Model version: State-of-the-art model versions are not always necessary when a prompt is well-optimized, even for complex AI tasks. A lightweight, faster version can provide the same results with a faster response. This approach allows a side-by-side comparison of the different versions of a model, such as Gemini 1.5 flash vs. 1.5 pro vs. 2.0 flash or ChatGPT 3.5 vs. 4o mini vs. 4o, and allows the data-driven selection of the model version.

    2. Version upgrades:

        ◦ Compatibility verification: When a new model version is released, systematic prompt testing helps validate if the upgrade maintains or improves the prompt performance. This is crucial for ensuring that updates do not unintentionally break the functionality.

        ◦ Seamless Transitions: By identifying key requirements and testing them, this method can facilitate better transitions to new model versions, allowing fast adjustment when necessary in order to maintain high-quality outputs.

    3. Cost optimization:

        ◦ Performance-to-cost ratio: Systematic prompt testing helps in choosing the best cost-effective model based on the performance-to-cost ratio. We can efficiently identify the most efficient option between performance and operational costs to get the best return on LLM costs.

Overcoming the challenges

The biggest challenge of this approach is the preparation of the set of test data fixtures, but the effort invested in this process will pay off significantly as time passes. Well-prepared fixtures save considerable debugging time and enhance model efficiency and reliability by providing a robust foundation for evaluating the LLM response. The initial investment is quickly returned by improved efficiency and effectiveness in LLM development and deployment.

Quick pros and cons

Key advantages:

  • Continuous improvement: The ability to add more requirements over time while ensuring existing functionality stays intact is a significant advantage. This allows for the evolution of the AI task in response to new requirements, ensuring that the system remains up-to-date and efficient.
  • Better maintenance: This approach enables the easy validation of prompt performance with LLM updates. This is crucial for maintaining high standards of quality and reliability, as updates can sometimes introduce unintended changes in behavior.
  • More flexibility: With a set of quality control tests, switching LLM providers becomes more straightforward. This flexibility allows us to adapt to changes in the market or technological advancements, ensuring we can always use the best tool for the job.
  • Cost optimization: Data-driven evaluations enable better decisions on performance-to-cost ratio. By understanding the performance gains of different models, we can choose the most cost-effective solution that meets the needs.
  • Time savings: Systematic evaluations provide quick feedback, reducing the need for manual testing. This efficiency allows us to iterate quickly on prompt improvement and optimization, accelerating the development process.

Challenges

  • Initial time investment: Creating test fixtures and evaluation functions can require a significant investment of time. 
  • Defining measurable validation criteria: Not all AI tasks have clear pass/fail conditions. Defining measurable criteria for validation can sometimes be challenging, especially for tasks that involve subjective or nuanced outputs. This requires careful consideration and may involve a difficult selection of the evaluation metrics.
  • Cost associated with multiple tests: Multiple test use cases combined with 5 to 10 iterations can generate a high number of LLM requests for a single test automation run. But if the cost of a single LLM call is negligible, as it is in most cases for text input/output calls, the overall cost of a test remains minimal.

Conclusion: When should you implement this approach?

Implementing this systematic testing approach is, of course, not always necessary, especially for simple tasks. However, for complex AI workflows in which precision and reliability are critical, this approach becomes highly valuable by offering a systematic way to assess and optimize prompt performance, preventing endless cycles of trial and error.

By incorporating functional testing principles into prompt engineering, we transform a traditionally subjective and fragile process into one that is measurable, scalable, and robust. Not only does it enhance the reliability of LLM outputs, but it also helps achieve continuous improvement and efficient resource allocation.

The decision to implement systematic prompt testing should be based on the complexity of your project. For scenarios demanding high precision and consistency, investing the time to set up this methodology can significantly improve outcomes and speed up the development process. However, for simpler tasks, a more classical, lightweight approach may be sufficient. The key is to balance the need for rigor with practical considerations, ensuring that your testing strategy aligns with your goals and constraints.

Thanks for reading!

The post Mastering Prompt Engineering with Functional Testing: A Systematic Guide to Reliable LLM Outputs  appeared first on Towards Data Science.

]]>
Effortless Spreadsheet Normalisation With LLM https://towardsdatascience.com/effortless-spreadsheet-normalisation-with-llm/ Fri, 14 Mar 2025 18:10:04 +0000 https://towardsdatascience.com/?p=599590 Clean data, clear insights: The LLM workflow for reshaping spreadsheets

The post Effortless Spreadsheet Normalisation With LLM appeared first on Towards Data Science.

]]>
This article is part of a series of articles on automating Data Cleaning for any tabular dataset.

You can test the feature described in this article on your own dataset using the CleanMyExcel.io service, which is free and requires no registration.

Tidy and untidy examples of a spreadsheet

Start with the why

A spreadsheet containing information about awards given to films

Let’s consider this Excel spreadsheet, which contains information on awards given to films. It is sourced from the book Cleaning Data for Effective Data Science and is available here.

This is a typical and common spreadsheet that everyone may own and deal with in their daily tasks. But what is wrong with it?

To answer that question, let us first recall the end goal of using data: to derive insights that help guide our decisions in our personal or business lives. This process requires at least two crucial things:

  • Reliable data: clean data without issues, inconsistencies, duplicates, missing values, etc.
  • Tidy data: a well-normalised data frame that facilitates processing and manipulation.

The second point is the primary foundation of any analysis, including dealing with data quality.

Returning to our example, imagine we want to perform the following actions:

1. For each film involved in multiple awards, list the award and year it is associated with.

2. For each actor/actress winning multiple awards, list the film and award they are associated with.

3. Check that all actor/actress names are correct and well-standardised.

Naturally, this example dataset is small enough to derive those insights by eye or by hand if we structure it (as quickly as coding). But imagine now that the dataset contains the entire awards history; this would be time-consuming, painful, and error-prone without any automation.

Reading this spreadsheet and directly understanding its structure is difficult for a machine, as it does not follow good practices of data arrangement. That is why tidying data is so important. By ensuring that data is structured in a machine-friendly way, we can simplify parsing, automate quality checks, and enhance business analysis—all without altering the actual content of the dataset.

Example of a reshaping of the data from the previous spreadsheet:

Now, anyone can use low/no-code tools or code-based queries (SQL, Python, etc.) to interact easily with this dataset and derive insights.

The main challenge is how to turn a shiny and human-eye-pleasant spreadsheet into a machine-readable tidy version.

Moving on to the what? The tidy data guidelines

What is tidy data? A well-shaped data frame?

The term tidy data was described in a well‐known article named Tidy Data by Hadley Wickham and published in the Journal of Statistical Software in 2014. Below are the key quotes required to understand the underlying concepts better.

Data tidying 

“Structuring datasets to facilitate manipulation, visualisation and modelling.”

“Tidy datasets provide a standardised way of linking the structure of a dataset (its physical layout) with its semantics (its meaning).”

Data structure

“Most statistical datasets are rectangular tables composed of rows and columns. The columns are almost always labelled, and the rows are sometimes labelled.”

Data semantics

“A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to both a variable and an observation. A variable contains all values that measure the same underlying attribute (such as height, temperature or duration) across units. An observation contains all values measured on the same unit (for example, a person, a day or a race) across attributes.”

“In a given analysis, there may be multiple levels of observation. For example, in a trial of a new allergy medication, we might have three types of observations:

  • Demographic data collected from each person (age, sex, race),
  • Medical data collected from each person on each day (number of sneezes, redness of eyes), and
  • Meteorological data collected on each day (temperature, pollen count).”

Tidy data

“Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is considered messy or tidy depending on how its rows, columns and tables correspond to observations, variables and types. In tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.”

Common problems with messy datasets

Column headers might be values rather than variable names.

  • Messy example: A table where column headers are years (2019, 2020, 2021) instead of a “Year” column.
  • Tidy version: A table with a “Year” column and each row representing an observation for a given year.

Multiple variables might be stored in one column.

  • Messy example: A column named “Age_Gender” containing values like 28_Female
  • Tidy version: Separate columns for “Age” and “Gender”

Variables might be stored in both rows and columns.

  • Messy example: A dataset tracking student test scores where subjects (Math, Science, English) are stored as both column headers and repeated in rows instead of using a single “Subject” column.
  • Tidy version: A table with columns for “Student ID,” “Subject,” and “Score,” where each row represents one student’s score for one subject.

Multiple types of observational units might be stored in the same table.

  • Messy example: A sales dataset that contains both customer information and store inventory in the same table.
  • Tidy version: Separate tables for “Customers” and “Inventory.”

A single observational unit might be stored in multiple tables.

  • Messy example: A patient’s medical records are split across multiple tables (Diagnosis Table, Medication Table) without a common patient ID linking them.
  • Tidy version: A single table or properly linked tables using a unique “Patient ID.”
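To make the first of these problems concrete, here is a small pandas sketch (with made-up column names and values) that melts year columns into a proper “year” variable:

```python
import pandas as pd

# Messy: the column headers 2019-2021 are values of a "year" variable.
messy = pd.DataFrame({
    "country": ["France", "Spain"],
    "2019": [10, 7],
    "2020": [12, 9],
    "2021": [15, 11],
})

# Tidy: one "year" column, one row per (country, year) observation.
tidy = messy.melt(id_vars="country", var_name="year", value_name="value")
tidy["year"] = tidy["year"].astype(int)
print(tidy)
```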

Now that we have a better understanding of what tidy data is, let’s see how to transform a messy dataset into a tidy one.

Thinking about the how

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham (cf. Leo Tolstoy)

Although these guidelines sound clear in theory, they remain difficult to generalise easily in practice for any kind of dataset. In other words, starting with the messy data, no simple or deterministic process or algorithm exists to reshape the data. This is mainly explained by the singularity of each dataset. Indeed, it is surprisingly hard to precisely define variables and observations in general and then transform data automatically without losing content. That is why, despite massive improvements in data processing over the last decade, data cleaning and formatting are still done “manually” most of the time.

Thus, when complex and hardly maintainable rules-based systems are not suitable (i.e. to precisely deal with all contexts by describing decisions in advance), machine learning models may offer some benefits. This grants the system more freedom to adapt to any data by generalising what it has learned during training. Many large language models (LLMs) have been exposed to numerous data processing examples, making them capable of analysing input data and performing tasks such as spreadsheet structure analysis, table schema estimation, and code generation.

Then, let’s describe a workflow made of code and LLM-based modules, alongside business logic, to reshape any spreadsheet.

Diagram of a workflow made of code and LLM-based modules alongside business logic to reshape a spreadsheet

Spreadsheet encoder 

This module is designed to serialise into text the main information needed from the spreadsheet data. Only the necessary subset of cells contributing to the table layout is retained, removing non-essential or overly repetitive formatting information. By retaining only the necessary information, this step minimises token usage, reduces costs, and enhances model performance. The current version is a deterministic algorithm inspired by the paper SpreadsheetLLM: Encoding Spreadsheets for Large Language Models, which relies on heuristics. More details about it will be the topic of a future article.

Table structure analysis 

Before moving forward, asking an LLM to extract the spreadsheet structure is a crucial step in building the next actions. Here are examples of questions addressed:

  • How many tables are present, and what are their locations (regions) in the spreadsheet?
  • What defines the boundaries of each table (e.g., empty rows/columns, specific markers)?
  • Which rows/columns serve as headers, and do any tables have multi-level headers?
  • Are there metadata sections, aggregated statistics, or notes that need to be filtered out or processed separately?
  • Are there any merged cells, and if so, how should they be handled?

Table schema estimation

Once the analysis of the spreadsheet structure has been completed, it is now time to start thinking about the ideal target table schema. This involves letting the LLM process iteratively by:

  • Identifying all potential columns (multi-row headers, metadata, etc.)
  • Comparing columns for domain similarities based on column names and data semantics
  • Grouping related columns  

The module outputs a final schema with names and a short description for each retained column.

Code generation to format the spreadsheet

Considering the previous structure analysis and the table schema, this last LLM-based module should draft code that transforms the spreadsheet into a proper data frame compliant with the table schema. Moreover, no useful content must be omitted (e.g. aggregated or computed values may still be derived from other variables).

As generating code that works well from scratch at the first iteration is challenging, two internal iterative processes are added to revise the code if needed:

  • Code checking: Whenever code cannot be compiled or executed, the trace error is provided to the model to update its code.
  • Data frame validation: The metadata of the created data frame—such as column names, first and last rows, and statistics about each column—is checked to validate whether the table conforms to expectations. Otherwise, the code is revised accordingly.
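A minimal sketch of these two internal loops is shown below. It assumes hypothetical helpers llm_generate_code, llm_revise_code, and validate_dataframe, which stand in for the LLM calls and the metadata checks described above.

```python
import traceback

def build_dataframe(spreadsheet, schema, llm_generate_code, llm_revise_code,
                    validate_dataframe, max_rounds=3):
    """Generate, execute, and iteratively revise the reshaping code."""
    code = llm_generate_code(spreadsheet, schema)
    for _ in range(max_rounds):
        try:
            scope = {"spreadsheet": spreadsheet}
            exec(code, scope)            # the generated code is expected to define `df`
            df = scope["df"]
        except Exception:
            # Code checking: feed the trace error back to the model.
            code = llm_revise_code(code, error=traceback.format_exc())
            continue
        ok, report = validate_dataframe(df, schema)
        if ok:
            return df
        # Data frame validation: revise the code based on the validation report.
        code = llm_revise_code(code, error=report)
    raise RuntimeError("Could not produce a schema-compliant data frame.")
```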

Convert the data frame into an Excel file

Finally, if all data fits properly into a single table, a worksheet is created from this data frame to respect the tabular format. The final asset returned is an Excel file whose active sheet contains the tidy spreadsheet data.

Et voilà! The sky’s the limit for making the most of your newly tidy dataset.

Feel free to test it with your own dataset using the CleanMyExcel.io service, which is free and requires no registration.

Final note on the workflow

Why is a workflow proposed instead of an agent for that purpose?  

At the time of writing, we consider that a workflow based on LLMs for precise sub-tasks is more robust, stable, iterable, and maintainable than a more autonomous agent. An agent may offer advantages, such as more freedom and autonomy in the actions it takes to perform tasks. Nonetheless, agents may still be hard to deal with in practice; for example, they may diverge quickly if the objective is not clear enough. I believe this applies to our case, but that does not mean this model would not be applicable in the future, in the same way that SWE-agent, for example, performs coding tasks.

Next articles in the series

In upcoming articles, we plan to explore related topics, including:

  • A detailed description of the spreadsheet encoder mentioned earlier.
  • Data validity: ensuring each column meets the expectations.
  • Data uniqueness: preventing duplicate entities within the dataset.
  • Data completeness: handling missing values effectively.
  • Evaluating data reshaping, validity, and other key aspects of data quality.

Stay tuned!

Thank you to Marc Hobballah for reviewing this article and providing feedback.

All images, unless otherwise noted, are by the author.

The post Effortless Spreadsheet Normalisation With LLM appeared first on Towards Data Science.

]]>
Are You Still Using LoRA to Fine-Tune Your LLM? https://towardsdatascience.com/are-you-still-using-lora-to-fine-tune-your-llm/ Thu, 13 Mar 2025 20:25:32 +0000 https://towardsdatascience.com/?p=599574 A look at this year’s crop of LoRA alternatives

The post Are You Still Using LoRA to Fine-Tune Your LLM? appeared first on Towards Data Science.

]]>
LoRA (Low Rank Adaptation – arxiv.org/abs/2106.09685) is a popular technique for fine-tuning Large Language Models (LLMs) on the cheap. But 2024 has seen an explosion of new parameter-efficient fine-tuning techniques, an alphabet soup of LoRA alternatives: SVF, SVFT, MiLoRA, PiSSA, LoRA-XS 🤯… And most are based on a matrix technique I like a lot: the SVD (Singular Value Decomposition). Let’s dive in.

LoRA

The original LoRA insight is that fine-tuning all the weights of a model is overkill. Instead, LoRA freezes the model and only trains a small pair of low-rank “adapter” matrices. See the illustrations below (where W is any matrix of weights in a transformer LLM).

This saves memory and compute cycles since far fewer gradients have to be computed and stored. For example, here is a Gemma 8B model fine-tuned to speak like a pirate using LoRA: only 22M parameters are trainable, 8.5B parameters remain frozen.

LoRA is very popular. It has even made it as a single-line API into mainstream ML frameworks like Keras:

gemma.backbone.enable_lora(rank=8)

But is LoRA the best? Researchers have been trying hard to improve on the formula. Indeed, there are many ways of selecting smaller “adapter” matrices. And since most of them make clever use of the singular value decomposition (SVD) of a matrix, let’s pause for a bit of Math.

SVD: the simple math

The SVD is a great tool for understanding the structure of matrices. The technique splits a matrix into three: W = U S Vᵀ, where U and V are orthogonal (i.e., basis changes) and S is the diagonal matrix of sorted singular values. This decomposition always exists.

In “textbook” SVD, U and V are square, while S is a rectangle with singular values on the diagonal and a tail of zeros. In practice, you can work with a square S and a rectangular U or V – see the picture – the chopped-off pieces are just multiplications by zero. This “economy-sized” SVD is what is used in common libraries, for example, numpy.linalg.svd.
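For instance, here is a quick numpy check (my own sketch, not from the original article) of the economy-sized SVD and of the exact reconstruction W = U S Vᵀ:

```python
import numpy as np

W = np.random.randn(6, 4)                           # a small "weight" matrix
U, S, Vt = np.linalg.svd(W, full_matrices=False)    # economy-sized SVD

print(U.shape, S.shape, Vt.shape)                   # (6, 4) (4,) (4, 4)
print(np.allclose(W, U @ np.diag(S) @ Vt))          # True: W = U S V^T
print(S)                                            # singular values, sorted in decreasing order
```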

So how can we use this to more efficiently select the weights to train? Let’s quickly go through five recent SVD-based low-rank fine-tuning techniques, with commented illustrations.

SVF

The simplest alternative to LoRA is to use the SVD on the model’s weight matrices and then fine-tune the singular values directly. Oddly, this is the most recent technique, called SVF, published in the Transformers² paper (arxiv.org/abs/2501.06252v2).

SVF is much more economical in parameters than LoRA. And as a bonus, it makes tuned models composable. For more info on that, see my Transformers² explainer here, but composing two SVF fine-tuned models is just an addition:

SVFT

Should you need more trainable parameters, the SVFT paper (arxiv.org/abs/2405.19597) explores multiple ways of doing that, starting by adding more trainable weights on the diagonal.

It also evaluates multiple alternatives like spreading them randomly through the “M” matrix.

More importantly, the SVFT paper confirms that having more trainable values than just the diagonal is useful. See their fine-tuning results below.

Next come several techniques that split singular values into two sets, “large” and “small”. But before we proceed, let’s pause for a bit more SVD math.

More SVD math

The SVD is usually seen as a decomposition into three matrices, W = U S Vᵀ, but it can also be thought of as a weighted sum of many rank-1 matrices, weighted by the singular values:

W = Σᵢ sᵢ uᵢ vᵢᵀ

Should you want to prove it, express the individual matrix elements W_jk using the U S Vᵀ form and the formula for matrix multiplication on one hand, and the Σᵢ sᵢ uᵢ vᵢᵀ form on the other; simplify using the fact that S is diagonal, and notice that it’s the same thing.

In this representation, it’s easy to see that you can split the sum in two. And as you can always sort the singular values, you can make this a split between “large” and “small” singular values.

Going back to the three-matrix form W = U S Vᵀ, this is what the split looks like:
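The original illustration of this split is an image; as a stand-in, here is a small numpy sketch showing that keeping the r largest singular values in one part and the rest in the other sums back exactly to W:

```python
import numpy as np

W = np.random.randn(6, 4)
U, S, Vt = np.linalg.svd(W, full_matrices=False)    # singular values come out sorted

r = 2                                               # "large" part: the r biggest singular values
W_large = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]
W_small = U[:, r:] @ np.diag(S[r:]) @ Vt[r:, :]

print(np.allclose(W, W_large + W_small))            # True: the two parts sum back to W
```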

Based on this formula, two papers have explored what happens if you tune only the large singular values or only the small ones, PiSSA and MiLoRA.

PiSSA

PiSSA (Principal Singular values and Singular Vectors Adaptation, arxiv.org/abs/2404.02948) claims that you should only tune the large principal values. The mechanism is illustrated below:

From the paper: “PiSSA is designed to approximate full finetuning by adapting the principal singular components, which are believed to capture the essence of the weight matrices. In contrast, MiLoRA aims to adapt to new tasks while maximally retaining the base model’s knowledge.”

The PiSSA paper also has an interesting finding: full fine-tuning is prone to over-fitting. You might get better results in absolute terms with a low-rank fine-tuning technique.

MiLoRA

MiLoRA (Minor singular component LoRA arxiv.org/abs/2406.09044), on the other hand, claims that you should only tune the small principal values. It uses a similar mechanism to PiSSA:

Surprisingly, MiLoRA seems to have the upper hand, at least when tuning on math datasets which are probably fairly aligned with the original pre-training. Arguably, PiSSA should be better for bending the behavior of the LLM further from its pre-training.

LoRA-XS

Finally, I’d like to mention LoRA-XS (arxiv.org/abs/2405.17604). It is very similar to PiSSA but uses a slightly different mechanism. It also shows good results with significantly fewer params than LoRA.

The paper offers a mathematical explanation of why this setup is “ideal” under two conditions:

  • that truncating the bottom principal values from the SVD still offers a good approximation of the weights matrices
  • that the fine-tuning data distribution is close to the pre-training one

Both are questionable IMHO, so I won’t detail the math. Some results:

The underlying assumption seems to be that singular values come in “large” and “small” varieties but is it true? I made a quick Colab to check this on Gemma2 9B. Bottom line: 99% of the singular values are in the 0.1 – 1.1 range.  I’m not sure partitioning them into “large” and “small” makes that much sense.

Conclusion

There are many more parameter-efficient fine-tuning techniques worth mentioning.

My conclusion: to go beyond the LoRA standard with 10x fewer params, I like the simplicity of Transformers²’s SVF. And if you need more trainable weights, SVFT is an easy extension. Both use all singular values (full rank, no singular value pruning) and are still cheap 😁. Happy tuning!

Note: All illustrations are either created by the author or extracted from arxiv.org papers for comment and discussion purposes.

The post Are You Still Using LoRA to Fine-Tune Your LLM? appeared first on Towards Data Science.

]]>
How to Make Your LLM More Accurate with RAG & Fine-Tuning https://towardsdatascience.com/how-to-make-your-llm-more-accurate-with-rag-fine-tuning/ Tue, 11 Mar 2025 18:05:00 +0000 https://towardsdatascience.com/?p=599477 And when to use which one

The post How to Make Your LLM More Accurate with RAG & Fine-Tuning appeared first on Towards Data Science.

]]>
Imagine studying a module at university for a semester. At the end, after an intensive learning phase, you take an exam – and you can recall the most important concepts without looking them up.

Now imagine the second situation: You are asked a question about a new topic. You don’t know the answer straight away, so you pick up a book or browse a wiki to find the right information for the answer.

These two analogies represent two of the most important methods for improving an LLM’s base model or adapting it to specific tasks and areas: Retrieval Augmented Generation (RAG) and Fine-Tuning.

But which example belongs to which method?

That’s exactly what I’ll explain in this article: After that, you’ll know what RAG and fine-tuning are, the most important differences and which method is suitable for which application.

Let’s dive in!


1. Basics: What is RAG? What is fine-tuning?

Large Language Models (LLMs) such as ChatGPT from OpenAI, Gemini from Google, Claude from Anthropic, or DeepSeek are incredibly powerful and have established themselves in everyday work within an extremely short time.

One of their biggest limitations is that their knowledge is limited to training. A model that was trained in 2024 does not know events from 2025. If we ask the 4o model from ChatGPT who the current US President is and give the clear instruction that the Internet should not be used, we see that it cannot answer this question with certainty:

Screenshot taken by the author

In addition, the models cannot easily access company-specific information, such as internal guidelines or current technical documentation. 

This is exactly where RAG and fine-tuning come into play.

Both methods make it possible to adapt an LLM to specific requirements:

RAG — The model remains the same, the input is improved

An LLM with Retrieval Augmented Generation (RAG) remains unchanged.

However, it gains access to an external knowledge source and can therefore retrieve information that is not stored in its model parameters. RAG extends the model in the inference phase by using external data sources to provide the latest or specific information. The inference phase is the moment when the model generates an answer. 

This allows the model to stay up to date without retraining.

How does it work?

  1. A user question is asked.
  2. The query is converted into a vector representation.
  3. A retriever searches for relevant text sections or data records in an external data source. The documents or FAQs are often stored in a vector database.
  4. The content found is transferred to the model as additional context.
  5. The LLM generates its answer on the basis of the retrieved and current information.

The key point is that the LLM itself remains unchanged and the internal weights of the LLM remain the same. 

Let’s assume a company uses an internal AI-powered support chatbot.

The chatbot helps employees answer questions about company policies, IT processes or HR topics. If you asked ChatGPT a question about your company (e.g. How many vacation days do I have left?), the model would logically not give you a meaningful answer. A classic LLM without RAG would know nothing about the company – it has never been trained with this data.

This changes with RAG: The chatbot can search an external database of current company policies for the most relevant documents (e.g. PDF files, wiki pages or internal FAQs) and provide specific answers.

RAG works similarly to how we humans look up specific information in a library or a Google search – but in real time.

A student who is asked about the meaning of CRUD quickly looks up the Wikipedia article and answers Create, Read, Update and Delete – just like a RAG model retrieves relevant documents. This process allows both humans and AI to provide informed answers without memorizing everything.

And this makes RAG a powerful tool for keeping responses accurate and current.

Own visualization by the author

Fine-tuning — The model is trained and stores knowledge permanently

Instead of looking up external information, an LLM can also be directly updated with new knowledge through fine-tuning.

Fine-tuning is used during the training phase to provide the model with additional domain-specific knowledge. An existing base model is further trained with specific new data. As a result, it “learns” specific content and internalizes technical terms, style or certain content, but retains its general understanding of language.

This makes fine-tuning an effective tool for customizing LLMs to specific needs, data or tasks.

How does this work?

  1. The LLM is trained with a specialized data set. This data set contains specific knowledge about a domain or a task.
  2. The model weights are adjusted so that the model stores the new knowledge directly in its parameters.
  3. After training, the model can generate answers without the need for external sources.

Let’s now assume we want to use an LLM that provides us with expert answers to legal questions.

To do this, this LLM is trained with legal texts so that it can provide precise answers after fine-tuning. For example, it learns complex terms such as “intentional tort” and can name the appropriate legal basis in the context of the relevant country. Instead of just giving a general definition, it can cite relevant laws and precedents.

This means that you no longer just have a general LLM like GPT-4o at your disposal, but a useful tool for legal decision-making.

If we look again at the analogy with humans, fine-tuning is comparable to having internalized knowledge after an intensive learning phase.

After this learning phase, a computer science student knows that the term CRUD stands for Create, Read, Update, Delete. He or she can explain the concept without needing to look it up. The general vocabulary has been expanded.

This internalization allows for faster, more confident responses—just like a fine-tuned LLM.

2. Differences between RAG and fine-tuning

Both methods improve the performance of an LLM for specific tasks.

Both methods require well-prepared data to work effectively.

And both methods help to reduce hallucinations – the generation of false or fabricated information.

But if we look at the table below, we can see the differences between these two methods:

RAG is particularly flexible because the model can always access up-to-date data without having to be retrained. It requires less computational effort in advance, but needs more resources while answering a question (inference). The latency can also be higher.

Fine-tuning, on the other hand, offers faster inference times because the knowledge is stored directly in the model weights and no external search is necessary. The major disadvantage is that training is time-consuming and expensive and requires large amounts of high-quality training data.

RAG provides the model with tools to look up knowledge when needed without changing the model itself, whereas fine-tuning stores the additional knowledge in the model with adjusted parameters and weights.

Own visualization by the author

3. Ways to build a RAG model

A popular framework for building a Retrieval Augmented Generation (RAG) pipeline is LangChain. This framework facilitates the linking of LLM calls with a retrieval system and makes it possible to retrieve information from external sources in a targeted manner.

How does RAG work technically?

1. Query embedding

In the first step, the user request is converted into a vector using an embedding model. This is done, for example, with text-embedding-ada-002 from OpenAI or all-MiniLM-L6-v2 from Hugging Face.

This is necessary because vector databases do not search through conventional texts, but instead calculate semantic similarities between numerical representations (embeddings). By converting the user query into a vector, the system can not only search for exactly matching terms, but also recognize concepts that are similar in content.

2. Search in the vector database

The resulting query vector is then compared with a vector database. The aim is to find the most relevant information to answer the question.

This similarity search is carried out using Approximate Nearest Neighbors (ANN) algorithms. Well-known open source tools for this task are, for example, FAISS from Meta for high-performance similarity searches in large data sets or ChromaDB for small to medium-sized retrieval tasks.

3. Insertion into the LLM context

In the third step, the retrieved documents or text sections are integrated into the prompt so that the LLM generates its response based on this information.

4. Generation of the response

The LLM now combines the information received with its general language vocabulary and generates a context-specific response.
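To tie the four steps together, here is a minimal sketch using two of the open-source tools mentioned above, sentence-transformers (all-MiniLM-L6-v2) and FAISS. The documents and the final LLM call are placeholder assumptions, not part of the original article.

```python
import faiss
from sentence_transformers import SentenceTransformer

documents = [
    "Employees receive 30 vacation days per year.",
    "IT support tickets are answered within 24 hours.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Build the vector database once by indexing the knowledge base.
doc_vectors = embedder.encode(documents)
index = faiss.IndexFlatL2(doc_vectors.shape[1])
index.add(doc_vectors)

# 1. Query embedding and 2. similarity search in the vector database.
query = "How many vacation days do I have?"
query_vector = embedder.encode([query])
_, ids = index.search(query_vector, 1)
retrieved = documents[ids[0][0]]

# 3. Insertion into the LLM context and 4. generation of the response.
prompt = f"Answer the question using only this context:\n{retrieved}\n\nQuestion: {query}"
# answer = llm.generate(prompt)   # hypothetical call to the LLM of your choice
```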

An alternative to LangChain is the Hugging Face Transformers library, which provides specially developed RAG classes:

  • ‘RagTokenizer’ tokenizes the input and the retrieval result. The class processes the text entered by the user and the retrieved documents.
  • The ‘RagRetriever’ class performs the semantic search and retrieval of relevant documents from the predefined knowledge base.
  • The ‘RagSequenceForGeneration’ class takes the documents provided, integrates them into the context and transfers them to the actual language model for answer generation.
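A usage sketch of these three classes, following the pattern from the Hugging Face documentation (the dummy dataset keeps the example self-contained; a real setup would point the retriever at your own index):

```python
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq", retriever=retriever
)

inputs = tokenizer("How many vacation days do employees get?", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```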

4. Options for fine-tuning a model

While an LLM with RAG uses external information for the query, with fine-tuning we change the model weights so that the model permanently stores the new knowledge.

How does fine-tuning work technically?

1. Preparation of the training data

Fine-tuning requires a high-quality collection of data. This collection consists of inputs and the desired model responses. For a chatbot, for example, these can be question-answer pairs. For medical models, this could be clinical reports or diagnostic data. For a legal AI, these could be legal texts and judgments.

Let’s take a look at an example: If we look at the documentation of OpenAI, we see that these models use a standardized chat format with roles (system, user, assistant) during fine-tuning. The data format of these question-answer pairs is JSONL and looks like this, for example:

{"messages": [{"role": "system", "content": "Du bist ein medizinischer Assistent."}, {"role": "user", "content": "Was sind Symptome einer Grippe?"}, {"role": "assistant", "content": "Die häufigsten Symptome einer Grippe sind Fieber, Husten, Muskel- und Gelenkschmerzen."}]}  

Other models use other data formats such as CSV, JSON or PyTorch datasets.

2. Selection of the base model

We can use a pre-trained LLM as a starting point. These can be closed-source models such as GPT-3.5 or GPT-4 via the OpenAI API, or open-source models such as DeepSeek, LLaMA, Mistral, or Falcon, or T5/FLAN-T5 for NLP tasks.

3. Training of the model

Fine-tuning requires a lot of computing power, as the model is trained with new data to update its weights. Especially large models such as GPT-4 or LLaMA 65B require powerful GPUs or TPUs.

To reduce the computational effort, there are optimized methods such as LoRA (Low-Rank Adaptation), where only a small number of additional parameters are trained, or QLoRA (Quantized LoRA), where quantized model weights (e.g. 4-bit) are used.
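As an illustration, here is a hedged sketch of a LoRA setup with the Hugging Face peft library; the base model name and the target modules are assumptions made for the example, not recommendations from the article.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # illustrative base model

lora_config = LoraConfig(
    r=8,                                    # rank of the adapter matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],    # which weight matrices receive adapters (assumption)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()          # only a small fraction of the weights is trainable
```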

4. Model deployment & use

Once the model has been trained, we can deploy it locally or on a cloud platform such as Hugging Face Model Hub, AWS or Azure.

5. When is RAG recommended? When is fine-tuning recommended?

RAG and fine-tuning have different advantages and disadvantages and are therefore suitable for different use cases:

RAG is particularly suitable when content is updated dynamically or frequently.

For example, in FAQ chatbots where information needs to be retrieved from a knowledge database that is constantly expanding. Technical documentation that is regularly updated can also be efficiently integrated using RAG – without the model having to be constantly retrained.

Another point is resources: If limited computing power or a smaller budget is available, RAG makes more sense as no complex training processes are required.

Fine-tuning, on the other hand, is suitable when a model needs to be tailored to a specific company or industry.

The response quality and style can be improved through targeted training. For example, the LLM can then generate medical reports with precise terminology.

The basic rule is: RAG is used when the knowledge is too extensive or too dynamic to be fully integrated into the model, while fine-tuning is the better choice when consistent, task-specific behavior is required.

And then there’s RAFT — the magic of combination

What if we combine the two?

That’s exactly what happens with Retrieval Augmented Fine-Tuning (RAFT).

The model is first enriched with domain-specific knowledge through fine-tuning so that it understands the correct terminology and structure. The model is then extended with RAG so that it can integrate specific and up-to-date information from external data sources. This combination ensures both deep expertise and real-time adaptability.

This way, companies can benefit from the advantages of both methods.

Final thoughts

Both methods—RAG and fine-tuning—extend the capabilities of a basic LLM in different ways.

Fine-tuning specializes the model for a specific domain, while RAG equips it with external knowledge. The two methods are not mutually exclusive and can be combined in hybrid approaches. Looking at computational costs, fine-tuning is resource-intensive upfront but efficient during operation, whereas RAG requires fewer initial resources but consumes more during use.

RAG is ideal when knowledge is too vast or dynamic to be integrated directly into the model. Fine-tuning is the better choice when stability and consistent optimization for a specific task are required. Both approaches serve distinct but complementary purposes, making them valuable tools in AI applications.

On my Substack, I regularly write summaries about the published articles in the fields of Tech, Python, Data Science, Machine Learning and AI. If you’re interested, take a look or subscribe.

Where can you continue learning?

The post How to Make Your LLM More Accurate with RAG & Fine-Tuning appeared first on Towards Data Science.

]]>
LLM + RAG: Creating an AI-Powered File Reader Assistant https://towardsdatascience.com/llm-rag-creating-an-ai-powered-file-reader-assistant/ Mon, 03 Mar 2025 21:02:28 +0000 https://towardsdatascience.com/?p=598621 How to create a chatbot to answer questions about a file’s content

The post LLM + RAG: Creating an AI-Powered File Reader Assistant appeared first on Towards Data Science.

]]>
Introduction

AI is everywhere. 

It is hard not to interact at least once a day with a Large Language Model (LLM). Chatbots are here to stay. They’re in your apps, they help you write better, they compose emails, they read emails…well, they do a lot.

And I don’t think that that is bad. In fact, my opinion is the other way – at least so far. I defend and advocate for the use of AI in our daily lives because, let’s agree, it makes everything much easier.

I don’t have to spend time double-reading a document to find punctuation problems or typos. AI does that for me. I don’t waste time writing that follow-up email every single Monday. AI does that for me. I don’t need to read a huge and boring contract when I have an AI to summarize the main takeaways and action points for me!

These are only some of AI’s great uses. If you’d like to know more use cases of LLMs to make our lives easier, I wrote a whole book about them.

Now, thinking as a data scientist and looking at the technical side, not everything is that bright and shiny. 

LLMs are great for several general use cases that apply to anyone or any company. For example, coding, summarizing, or answering questions about general content created until the training cutoff date. However, when it comes to specific business applications, a single purpose, or something new that didn’t make the cutoff date, out-of-the-box models won’t be that useful – meaning, they will not know the answer. Thus, they will need adjustments.

Training an LLM can take months and millions of dollars. What is even worse is that if we don’t adjust and tune the model to our purpose, we will get unsatisfactory results or hallucinations (when the model’s response doesn’t make sense given our query).

So what is the solution, then? Spending a lot of money retraining the model to include our data?

Not really. That’s when Retrieval-Augmented Generation (RAG) becomes useful.

RAG is a framework that combines getting information from an external knowledge base with large language models (LLMs). It helps AI models produce more accurate and relevant responses.

Let’s learn more about RAG next.

What is RAG?

Let me tell you a story to illustrate the concept.

I love movies. There was a time when I knew which movies were competing for Best Picture at the Oscars, and who the best actor and actress nominees were. And I would certainly know which ones got the statue that year. But now I am all rusty on that subject. If you asked me who was competing, I would not know. And even if I tried to answer you, I would give you a weak response.

So, to provide you with a quality response, I will do what everybody else does: search for the information online, obtain it, and then give it to you. What I just did is the same idea as the RAG: I obtained data from an external database to give you an answer.

When we enhance the LLM with a content store where it can go and retrieve data to augment (increase) its knowledge base, that is the RAG framework in action.

RAG is like creating a content store where the model can enhance its knowledge and respond more accurately.

Diagram: User prompts and content using LLM + RAG
User prompt about Content C. LLM retrieves external content to aggregate to the answer. Image by the author.

Summarizing:

  1. Uses search algorithms to query external data sources, such as databases, knowledge bases, and web pages.
  2. Pre-processes the retrieved information.
  3. Incorporates the pre-processed information into the LLM.

Why use RAG?

Now that we know what the RAG framework is, let’s understand why we should be using it.

Here are some of the benefits:

  • Enhances factual accuracy by referencing real data.
  • Helps LLMs process and consolidate knowledge to create more relevant answers.
  • Helps LLMs access additional knowledge bases, such as internal organizational data.
  • Helps LLMs create more accurate domain-specific content.
  • Helps reduce knowledge gaps and AI hallucinations.

As previously explained, I like to say that with the RAG framework, we are giving the model an internal search engine for the content we want to add to its knowledge base.

Well. All of that is very interesting. But let’s see an application of RAG. We will learn how to create an AI-powered PDF Reader Assistant.

Project

This is an application that allows users to upload a PDF document and ask questions about its content using AI-powered natural language processing (NLP) tools. 

  • The app uses Streamlit as the front end.
  • Langchain, OpenAI’s GPT-4 model, and FAISS (Facebook AI Similarity Search) for document retrieval and question answering in the backend.

Let’s break down the steps for better understanding:

  1. Load a PDF file and split it into chunks of text.
    • This makes the data optimized for retrieval.
  2. Present the chunks to an embedding tool.
    • Embeddings are numerical vector representations of data used to capture relationships, similarities, and meanings in a way that machines can understand. They are widely used in Natural Language Processing (NLP), recommender systems, and search engines.
  3. Put those chunks of text and their embeddings in the same database for retrieval.
  4. Finally, make the database available to the LLM.

Data preparation

Preparing a content store for the LLM will take some steps, as we just saw. So, let’s start by creating a function that can load a file and split it into text chunks for efficient retrieval.

# Imports
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_document(pdf):
    # Load a PDF
    """
    Load a PDF and split it into chunks for efficient retrieval.

    :param pdf: PDF file to load
    :return: List of chunks of text
    """

    loader = PyPDFLoader(pdf)
    docs = loader.load()

    # Instantiate the text splitter with a chunk size of 500 characters and an overlap of 100 characters so that context is not lost
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    # Split into chunks for efficient retrieval
    chunks = text_splitter.split_documents(docs)

    # Return
    return chunks

Next, we will start building our Streamlit app, and we’ll use that function in the next script.

Web application

We will begin importing the necessary modules in Python. Most of those will come from the langchain packages.

FAISS is used for document retrieval; OpenAIEmbeddings transforms the text chunks into numerical vectors so their similarity can be computed; ChatOpenAI is what enables us to interact with the OpenAI API; create_retrieval_chain is what actually performs the RAG step, retrieving documents and augmenting the LLM with that data; create_stuff_documents_chain glues together the model and the ChatPromptTemplate.

Note: You will need to generate an OpenAI key to be able to run this script. If it’s the first time you’re creating your account, you get some free credits. But if you have had it for some time, you may have to add 5 dollars in credits to be able to access OpenAI’s API. An alternative is to use Hugging Face embeddings.

# Imports
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.chains import create_retrieval_chain
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from scripts.secret import OPENAI_KEY
from scripts.document_loader import load_document
import streamlit as st

This first code snippet will create the App title, create a box for file upload, and prepare the file to be added to the load_document() function.

# Create a Streamlit app
st.title("AI-Powered Document Q&A")

# Load document to streamlit
uploaded_file = st.file_uploader("Upload a PDF file", type="pdf")

# If a file is uploaded, create the TextSplitter and vector database
if uploaded_file :

    # Code to work around document loader from Streamlit and make it readable by langchain
    temp_file = "./temp.pdf"
    with open(temp_file, "wb") as file:
        file.write(uploaded_file.getvalue())
        file_name = uploaded_file.name

    # Load document and split it into chunks for efficient retrieval.
    chunks = load_document(temp_file)

    # Message user that document is being processed with time emoji
    st.write("Processing document... :watch:")

Machines understand numbers better than text, so in the end, we will have to provide the model with a database of numbers that it can compare and check for similarity when performing a query. That’s where the embeddings will be useful to create the vector_db, in this next piece of code.

# Generate embeddings
    # Embeddings are numerical vector representations of data, typically used to capture relationships, similarities,
    # and meanings in a way that machines can understand. They are widely used in Natural Language Processing (NLP),
    # recommender systems, and search engines.
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_KEY,
                                  model="text-embedding-ada-002")

    # Can also use HuggingFaceEmbeddings
    # from langchain_huggingface.embeddings import HuggingFaceEmbeddings
    # embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # Create vector database containing chunks and embeddings
    vector_db = FAISS.from_documents(chunks, embeddings)

Next, we create a retriever object to navigate in the vector_db.

# Create a document retriever
    retriever = vector_db.as_retriever()
    llm = ChatOpenAI(model_name="gpt-4o-mini", openai_api_key=OPENAI_KEY)

Then, we will create the system_prompt, which is a set of instructions to the LLM on how to answer, and we will create a prompt template, preparing it to be added to the model once we get the input from the user.

# Create a system prompt
    # It sets the overall context for the model.
    # It influences tone, style, and focus before user interaction starts.
    # Unlike user inputs, a system prompt is not visible to the end user.

    system_prompt = (
        "You are a helpful assistant. Use the given context to answer the question. "
        "If you don't know the answer, say you don't know. "
        "{context}"
    )

    # Create a prompt Template
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            ("human", "{input}"),
        ]
    )

    # Create a chain
    # It creates a StuffDocumentsChain, which takes multiple documents (text data) and "stuffs" them together before passing them to the LLM for processing.

    question_answer_chain = create_stuff_documents_chain(llm, prompt)

Moving on, we create the core of the RAG framework, pasting together the retriever object and the prompt. This object adds relevant documents from a data source (e.g., a vector database) and makes it ready to be processed using an LLM to generate a response.

# Creates the RAG
    chain = create_retrieval_chain(retriever, question_answer_chain)

Finally, we create the variable question for the user input. If this question box is filled with a query, we pass it to the chain, which calls the LLM to process and return the response, which will be printed on the app’s screen.

# Streamlit input for question
    question = st.text_input("Ask a question about the document:")
    if question:
        # Answer
        response = chain.invoke({"input": question})['answer']
        st.write(response)

Here is a screenshot of the result.

Screenshot of the AI-Powered Document Q&A
Screenshot of the final app. Image by the author.

And this is a GIF for you to see the File Reader AI Assistant in action!

GIF of the File Reader AI Assistant in action
File Reader AI Assistant in action. Image by the author.

Before you go

In this project, we learned what the RAG framework is and how it helps the LLM perform better, especially on domain-specific knowledge.

AI can be powered with knowledge from an instruction manual, company databases, finance files, or contracts, and then respond accurately to domain-specific queries because its knowledge base is augmented with a content store.

To recap, this is how the framework works:

1️⃣ User Query → Input text is received.

2️⃣ Retrieve Relevant Documents → Searches a knowledge base (e.g., a database, vector store).

3️⃣ Augment Context → Retrieved documents are added to the input.

4️⃣ Generate Response → An LLM processes the combined input and produces an answer.

GitHub repository

https://github.com/gurezende/Basic-Rag

About me

If you liked this content and want to learn more about my work, here is my website, where you can also find all my contacts.

https://gustavorsantos.me

References

https://cloud.google.com/use-cases/retrieval-augmented-generation

https://www.ibm.com/think/topics/retrieval-augmented-generation

https://youtu.be/T-D1OfcDW1M?si=G0UWfH5-wZnMu0nw

https://python.langchain.com/docs/introduction

https://www.geeksforgeeks.org/how-to-get-your-own-openai-api-key

The post LLM + RAG: Creating an AI-Powered File Reader Assistant appeared first on Towards Data Science.

]]>
LLaDA: The Diffusion Model That Could Redefine Language Generation https://towardsdatascience.com/llada-the-diffusion-model-that-could-redefine-language-generation/ Wed, 26 Feb 2025 18:18:22 +0000 https://towardsdatascience.com/?p=598455 How LLaDA works, why it matters, and how it could shape the next generation of LLMs

The post LLaDA: The Diffusion Model That Could Redefine Language Generation appeared first on Towards Data Science.

]]>
Introduction

What if we could make language models think more like humans? Instead of writing one word at a time, what if they could sketch out their thoughts first, and gradually refine them?

This is exactly what Large Language Diffusion Models (LLaDA) introduces: a different approach to current text generation used in Large Language Models (LLMs). Unlike traditional autoregressive models (ARMs), which predict text sequentially, left to right, LLaDA leverages a diffusion-like process to generate text. Instead of generating tokens sequentially, it progressively refines masked text until it forms a coherent response.

In this article, we will dive into how LLaDA works, why it matters, and how it could shape the next generation of LLMs.

I hope you enjoy the article!

The current state of LLMs

To appreciate the innovation that LLaDA represents, we first need to understand how current large language models (LLMs) operate. Modern LLMs follow a two-step training process that has become an industry standard:

  1. Pre-training: The model learns general language patterns and knowledge by predicting the next token in massive text datasets through self-supervised learning.
  2. Supervised Fine-Tuning (SFT): The model is refined on carefully curated data to improve its ability to follow instructions and generate useful outputs.

Note that current LLMs often use RLHF as well to further refine the weights of the model, but this is not used by LLaDA, so we will skip this step here.

These models, primarily based on the Transformer architecture, generate text one token at a time using next-token prediction.

Simplified Transformer architecture for text generation (Image by the author)

Here is a simplified illustration of how data passes through such a model. Each token is embedded into a vector and is transformed through successive transformer layers. In current LLMs (LLaMA, ChatGPT, DeepSeek, etc), a classification head is used only on the last token embedding to predict the next token in the sequence.

This works thanks to the concept of masked self-attention: each token attends to all the tokens that come before it. We will see later how LLaDA can get rid of the mask in its attention layers.
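As a quick illustration (a minimal sketch, not the implementation of any specific model), the causal mask used by masked self-attention can be built as a lower-triangular matrix:

import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Lower-triangular boolean mask: position i may attend to positions 0..i only,
    # which is the "masked self-attention" used by autoregressive models
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4))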

Attention process: input embeddings are multiplied by Query, Key, and Value matrices to generate new embeddings (Image by the author, inspired by [3])

If you want to learn more about Transformers, check out my article here.

While this approach has led to impressive results, it also comes with significant limitations, some of which have motivated the development of LLaDA.

Current limitations of LLMs

Current LLMs face several critical challenges:

Computational Inefficiency

Imagine having to write a novel where you can only think about one word at a time, and for each word, you need to reread everything you’ve written so far. This is essentially how current LLMs operate — they predict one token at a time, requiring a complete processing of the previous sequence for each new token. Even with optimization techniques like KV caching, this process is quite computationally expensive and time-consuming.

Limited Bidirectional Reasoning

Traditional autoregressive models (ARMs) are like writers who can never look ahead or revise what they’ve written so far. They can only predict future tokens based on past ones, which limits their ability to reason about relationships between different parts of the text. As humans, we often have a general idea of what we want to say before writing it down; current LLMs lack this capability in some sense.

Amount of data

Existing models require enormous amounts of training data to achieve good performance, making them resource-intensive to develop and potentially limiting their applicability in specialized domains with limited data availability.

What is LLaDA

LLaDA introduces a fundamentally different approach to Language Generation by replacing traditional autoregression with a “diffusion-based” process (we will dive later into why this is called “diffusion”).

Let’s understand how this works, step by step, starting with pre-training.

LLaDA pre-training

Remember that we don’t need any “labeled” data during the pre-training phase. The objective is to feed a very large amount of raw text data into the model. For each text sequence, we do the following:

  1. We fix a maximum length (similar to ARMs). Typically, this could be 4096 tokens. 1% of the time, the lengths of sequences are randomly sampled between 1 and 4096 and padded so that the model is also exposed to shorter sequences.
  2. We randomly choose a “masking rate”. For example, one could pick 40%.
  3. We mask each token with a probability of 0.4. What does “masking” mean exactly? Well, we simply replace the token with a special token <MASK>. As with any other token, this token is associated with a particular index and embedding vector that the model can process and interpret during training.
  4. We then feed our entire sequence into our transformer-based model. This process transforms all the input embedding vectors into new embeddings. We apply the classification head to each of the masked tokens to get a prediction for each. Mathematically, our loss function averages cross-entropy losses over all the masked tokens in the sequence, as below:
Loss function used for LLaDA (Image by the author)

5. And… we repeat this procedure for billions or trillions of text sequences.
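For reference, here is a minimal LaTeX sketch of the loss pictured above, written to match the description in step 4 (an average of cross-entropy terms over the masked positions; the paper’s exact weighting may differ):

\mathcal{L}(\theta) \;=\; -\,\frac{1}{|M|} \sum_{i \in M} \log p_\theta\!\left(x_i \mid x_{\text{masked}}\right)

where M is the set of masked positions, x_i is the original token at position i, and x_masked is the partially masked sequence fed to the model.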

Note that, unlike ARMs, LLaDA can fully utilize bidirectional dependencies in the text: it doesn’t require masking in attention layers anymore. However, this can come at an increased computational cost.

Hopefully, you can see how the training phase itself (the flow of the data into the model) is very similar to any other LLMs. We simply predict randomly masked tokens instead of predicting what comes next.
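As a small illustration of the corruption step described above (a sketch only; the mask token id and tensor shapes are assumptions), the masking could look like this:

import torch

def mask_tokens(tokens: torch.Tensor, mask_id: int, rate=None):
    # Sample a masking rate, then replace each token with the <MASK> id with that probability
    if rate is None:
        rate = torch.rand(1).item()
    mask = torch.rand(len(tokens)) < rate
    noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    return noisy, mask  # the model is trained to predict the original tokens where mask is True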

LLaDA SFT

For auto-regressive models, SFT is very similar to pre-training, except that we have pairs of (prompt, response) and want to generate the response when giving the prompt as input.

This is exactly the same concept for LLaDA! Mimicking the pre-training process: we simply pass the prompt and the response, mask random tokens from the response only, and feed the full sequence into the model, which will predict missing tokens from the response.

The innovation in inference

Innovation is where LLaDA gets more interesting, and truly utilizes the “diffusion” paradigm.

Until now, we always randomly masked some text as input and asked the model to predict these tokens. But during inference, we only have access to the prompt and we need to generate the entire response. You might think (and it’s not wrong) that the model has seen examples where the masking rate was very high (potentially 1) during SFT, and it had to learn, somehow, how to generate a full response from a prompt.

However, generating the full response at once during inference will likely produce very poor results because the model lacks information. Instead, we need a method to progressively refine predictions, and that’s where the key idea of ‘remasking’ comes in.

Here is how it works, at each step of the text generation process:

  • Feed the current input to the model (this is the prompt, followed by <MASK> tokens)
  • The model generates one embedding for each input token. We get predictions for the <MASK> tokens only. And here is the important step: we remask a portion of them. In particular: we only keep the “best” tokens i.e. the ones with the best predictions, with the highest confidence.
  • We can use this partially unmasked sequence as input in the next generation step and repeat until all tokens are unmasked.

You can see that, interestingly, we have much more control over the generation process compared to ARMs: we could choose to remask 0 tokens (only one generation step), or we could decide to keep only the best token every time (as many steps as tokens in the response). Obviously, there is a trade-off here between the quality of the predictions and inference time.

Let’s illustrate that with a simple example (in this case, I choose to keep the best 2 tokens at every step).

LLaDA generation process example (Image by the author)

Note, in practice, the remasking step would work as follows. Instead of remasking a fixed number of tokens, we would remask a proportion of s/t tokens over time, from t=1 down to 0, where s is in [0, t]. In particular, this means we remask fewer and fewer tokens as the number of generation steps increases.

Example: if we want N sampling steps (so N discrete steps from t=1 down to t=1/N with steps of 1/N), taking s = (t-1/N) is a good choice, and ensures that s=0 at the end of the process.
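To make the remasking loop more concrete, here is a minimal PyTorch sketch of the procedure described above. The mask_predictor callable, the mask token id, and the tensor shapes are illustrative assumptions, not the paper’s actual implementation; remasking here keeps the highest-confidence predictions, as in the example.

import torch

def llada_generate(mask_predictor, prompt_ids, answer_len=64, steps=8, mask_id=0):
    # prompt_ids is assumed to be a 1-D LongTensor of token ids
    # Start from the prompt followed by a fully masked answer
    x = torch.cat([prompt_ids, torch.full((answer_len,), mask_id, dtype=prompt_ids.dtype)])
    prompt_len = len(prompt_ids)

    for step in range(steps):
        t = 1.0 - step / steps               # current "time", from 1 down to 1/steps
        s = t - 1.0 / steps                  # next time step; s = 0 on the last step
        logits = mask_predictor(x)           # assumed shape: (seq_len, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)

        masked = x == mask_id
        x = torch.where(masked, pred, x)     # tentatively fill every masked position

        # Remask the least-confident predictions so that a fraction s/t of the
        # currently masked answer tokens stays masked for the next step
        n_remask = int(masked[prompt_len:].sum().item() * s / t)
        if n_remask > 0:
            conf = conf.masked_fill(~masked, float("inf"))  # never remask already fixed tokens
            remask_idx = conf.topk(n_remask, largest=False).indices
            x[remask_idx] = mask_id
    return x[prompt_len:]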

The image below summarizes the 3 steps described above. “Mask predictor” simply denotes the LLM (LLaDA), predicting masked tokens.

Pre-training (a.), SFT (b.) and inference (c.) using LLaDA. (source: [1])

Can autoregression and diffusion be combined?

Another clever idea developed in LLaDA is to combine diffusion with traditional autoregressive generation to use the best of both worlds! This is called semi-autoregressive diffusion.

  • Divide the generation process into blocks (for instance, 32 tokens in each block).
  • The objective is to generate one block at a time (like we would generate one token at a time in ARMs).
  • For each block, we apply the diffusion logic by progressively unmasking tokens to reveal the entire block. Then move on to predicting the next block.
Semi-autoregressive process (source: [1])

This is a hybrid approach: we probably lose some of the “backward” generation and parallelization capabilities of the model, but we better “guide” the model towards the final output.
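Building on the llada_generate sketch above (and reusing its torch import), a block-wise variant might look like this; the block count and block length are illustrative hyperparameters.

def semi_autoregressive_generate(mask_predictor, prompt_ids, n_blocks=4, block_len=32, steps=8, mask_id=0):
    # Generate one block at a time; inside each block, run the diffusion loop above
    out = prompt_ids
    for _ in range(n_blocks):
        block = llada_generate(mask_predictor, out, answer_len=block_len, steps=steps, mask_id=mask_id)
        out = torch.cat([out, block])
    return out[len(prompt_ids):]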

I think this is a very interesting idea because it depends a lot on a hyperparameter (the number of blocks) that can be tuned. I imagine different tasks might benefit more from the backward generation process, while others might benefit more from the more “guided” generation from left to right (more on that in the last paragraph).

Why “Diffusion”?

I think it’s important to briefly explain where this term actually comes from. It reflects a similarity with image diffusion models (like DALL-E), which have been very popular for image generation tasks.

In image diffusion, a model first adds noise to an image until it’s unrecognizable, then learns to reconstruct it step by step. LLaDA applies this idea to text by masking tokens instead of adding noise, and then progressively unmasking them to generate coherent language. In the context of image generation, the masking step is often called “noise scheduling”, and the reverse (remasking) is the “denoising” step.

How do Diffusion Models work? (source: [2])

You can also see LLaDA as some type of discrete (non-continuous) diffusion model: we don’t add noise to tokens, but we “deactivate” some tokens by masking them, and the model learns how to unmask a portion of them.

Results

Let’s go through a few of the interesting results of LLaDA.

You can find all the results in the paper. I chose to focus on what I find the most interesting here.

  • Training efficiency: LLaDA shows similar performance to ARMs with the same number of parameters, but uses far fewer tokens during training (and no RLHF)! For example, the 8B version uses around 2.3T tokens, compared to 15T for LLaMA 3.
  • Using different block and answer lengths for different tasks: for example, the block length is particularly large for the Math dataset, and the model demonstrates strong performance for this domain. This could suggest that mathematical reasoning may benefit more from the diffusion-based and backward process.
Source: [1]
  • Interestingly, LLaDA does better on the “Reversal poem completion task”. This task requires the model to complete a poem in reverse order, starting from the last lines and working backward. As expected, ARMs struggle due to their strict left-to-right generation process.
Source: [1]

LLaDA is not just an experimental alternative to ARMs: it shows real advantages in efficiency, structured reasoning, and bidirectional text generation.

Conclusion

I think LLaDA is a promising approach to language generation. Its ability to generate multiple tokens in parallel while maintaining global coherence could definitely lead to more efficient training, better reasoning, and improved context understanding with fewer computational resources.

Beyond efficiency, I think LLaDA also brings a lot of flexibility. By adjusting parameters like the number of blocks generated, and the number of generation steps, it can better adapt to different tasks and constraints, making it a versatile tool for various language modeling needs, and allowing more human control. Diffusion models could also play an important role in pro-active AI and agentic systems by being able to reason more holistically.

As research into diffusion-based language models advances, LLaDA could become a useful step toward more natural and efficient language models. While it’s still early, I believe this shift from sequential to parallel generation is an interesting direction for AI development.

Thanks for reading!


Check out my previous articles:



References:

The post LLaDA: The Diffusion Model That Could Redefine Language Generation appeared first on Towards Data Science.

]]>
AI Agents from Scratch: Single Agents https://towardsdatascience.com/ai-agents-from-zero-to-hero-part-1/ Thu, 20 Feb 2025 17:04:03 +0000 https://towardsdatascience.com/?p=598190 From Zero to Hero using only Python & Ollama (no GPU, no API key)

The post AI Agents from Scratch: Single Agents appeared first on Towards Data Science.

]]>
Intro

AI Agents are autonomous programs that perform tasks, make decisions, and communicate with others. Normally, they use a set of tools to help complete tasks. In GenAI applications, these Agents perform sequential reasoning and can use external tools (like web searches or database queries) when the LLM knowledge isn’t enough. Unlike a basic chatbot, which may produce a made-up answer when uncertain, an AI Agent activates tools to provide more accurate, specific responses.

We are moving closer and closer to the concept of Agentic AI: systems that exhibit a higher level of autonomy and decision-making ability, without direct human intervention. While today’s AI Agents respond reactively to human inputs, tomorrow’s Agentic AI will proactively engage in problem-solving and adjust its behavior based on the situation.

Today, building Agents from scratch is becoming as easy as training a logistic regression model 10 years ago. Back then, Scikit-Learn provided a straightforward library to quickly train Machine Learning models with just a few lines of code, abstracting away much of the underlying complexity.

In this tutorial, I’m going to show how to build from scratch different types of AI Agents, from simple to more advanced systems. I will present some useful Python code that can be easily applied in other similar cases (just copy, paste, run) and walk through every line of code with comments so that you can replicate this example.

Setup

As I said, anyone can have a custom Agent running locally for free without GPUs or API keys. The only necessary library is Ollama (pip install ollama==0.4.7), as it allows users to run LLMs locally, without needing cloud-based services, giving more control over data privacy and performance.

First of all, you need to download Ollama from the website. 

Then, in a terminal on your laptop, run the pull command shown below to download the selected LLM. I’m going with Alibaba’s Qwen, as it’s both smart and lightweight.
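For example (the exact model tag may vary depending on the version you pick):

ollama pull qwen2.5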

After the download is completed, you can move on to Python and start writing code.

import ollama
llm = "qwen2.5"

Let’s test the LLM:

stream = ollama.generate(model=llm, prompt='''what time is it?''', stream=True)
for chunk in stream:
    print(chunk['response'], end='', flush=True)

Obviously, the LLM per se is very limited and it can’t do much besides chatting. Therefore, we need to provide it the possibility to take action, or in other words, to activate Tools.

One of the most common tools is the ability to search the Internet. In Python, the easiest way to do it is with the well-known privacy-focused search engine DuckDuckGo (pip install duckduckgo-search==6.3.5). You can directly use the original library or import the LangChain wrapper (pip install langchain-community==0.3.17).

With Ollama, in order to use a Tool, the function must be described in a dictionary.

from langchain_community.tools import DuckDuckGoSearchResults
def search_web(query: str) -> str:
  return DuckDuckGoSearchResults(backend="news").run(query)

tool_search_web = {'type':'function', 'function':{
  'name': 'search_web',
  'description': 'Search the web',
  'parameters': {'type': 'object',
                'required': ['query'],
                'properties': {
                    'query': {'type':'str', 'description':'the topic or subject to search on the web'},
}}}}
## test
search_web(query="nvidia")

Internet searches could be very broad, and I want to give the Agent the option to be more precise. Let’s say, I’m planning to use this Agent to learn about financial updates, so I can give it a specific tool for that topic, like searching only a finance website instead of the whole web.

def search_yf(query: str) -> str:
  engine = DuckDuckGoSearchResults(backend="news")
  return engine.run(f"site:finance.yahoo.com {query}")

tool_search_yf = {'type':'function', 'function':{
  'name': 'search_yf',
  'description': 'Search for specific financial news',
  'parameters': {'type': 'object',
                'required': ['query'],
                'properties': {
                    'query': {'type':'str', 'description':'the financial topic or subject to search'},
}}}}

## test
search_yf(query="nvidia")

Simple Agent (WebSearch)

In my opinion, the most basic Agent should at least be able to choose between one or two Tools and re-elaborate the output of the action to give the user a proper and concise answer. 

First, you need to write a prompt to describe the Agent’s purpose, the more detailed the better (mine is very generic), and that will be the first message in the chat history with the LLM. 

prompt = '''You are an assistant with access to tools, you must decide when to use tools to answer user message.''' 
messages = [{"role":"system", "content":prompt}]

In order to keep the chat with the AI alive, I will use a loop that starts with user’s input and then the Agent is invoked to respond (which can be a text from the LLM or the activation of a Tool).

while True:
    ## user input
    try:
        q = input('🙂 >')
    except EOFError:
        break
    if q == "quit":
        break
    if q.strip() == "":
        continue
    messages.append( {"role":"user", "content":q} )
   
    ## model
    agent_res = ollama.chat(
        model=llm,
        tools=[tool_search_web, tool_search_yf],
        messages=messages)

Up to this point, the chat history could look something like this:

If the model wants to use a Tool, the appropriate function needs to be run with the input parameters suggested by the LLM in its response object:

So our code needs to get that information and run the Tool function.

## response
    dic_tools = {'search_web':search_web, 'search_yf':search_yf}

    if "tool_calls" in agent_res["message"].keys():
        for tool in agent_res["message"]["tool_calls"]:
            t_name, t_inputs = tool["function"]["name"], tool["function"]["arguments"]
            if f := dic_tools.get(t_name):
                ### calling tool
                print('🔧 >', f"\x1b[1;31m{t_name} -> Inputs: {t_inputs}\x1b[0m")
                messages.append( {"role":"user", "content":"use tool '"+t_name+"' with inputs: "+str(t_inputs)} )
                ### tool output
                t_output = f(**tool["function"]["arguments"])
                print(t_output)
                ### final res
                p = f'''Summarize this to answer user question, be as concise as possible: {t_output}'''
                res = ollama.generate(model=llm, prompt=q+". "+p)["response"]
            else:
                print('🤬 >', f"\x1b[1;31m{t_name} -> NotFound\x1b[0m")
 
    if agent_res['message']['content'] != '':
        res = agent_res["message"]["content"]
     
    print("👽 >", f"\x1b[1;30m{res}\x1b[0m")
    messages.append( {"role":"assistant", "content":res} )

Now, if we run the full code, we can chat with our Agent.

Advanced Agent (Coding)

LLMs know how to code by being exposed to a large corpus of both code and natural language text, where they learn patterns, syntax, and semantics of programming languages. The model learns the relationships between different parts of the code by predicting the next token in a sequence. In short, LLMs can generate Python code but can’t execute it; Agents can.

I shall prepare a Tool allowing the Agent to execute code. In Python, you can easily create a shell to run code as a string with the native command exec().

import io
import contextlib

def code_exec(code: str) -> str:
    output = io.StringIO()
    with contextlib.redirect_stdout(output):
        try:
            exec(code)
        except Exception as e:
            print(f"Error: {e}")
    return output.getvalue()

tool_code_exec = {'type':'function', 'function':{
  'name': 'code_exec',
  'description': 'execute python code',
  'parameters': {'type': 'object',
                'required': ['code'],
                'properties': {
                    'code': {'type':'str', 'description':'code to execute'},
}}}}

## test
code_exec("a=1+1; print(a)")

Just like before, I will write a prompt, but this time, at the beginning of the chat-loop, I will ask the user to provide a file path.

prompt = '''You are an expert data scientist, and you have tools to execute python code.
First of all, execute the following code exactly as it is: 'df=pd.read_csv(path); print(df.head())'
If you create a plot, ALWAYS add 'plt.show()' at the end.
'''
messages = [{"role":"system", "content":prompt}]
start = True

while True:
    ## user input
    try:
        if start is True:
            path = input('📁 Provide a CSV path >')
            q = "path = "+path
        else:
            q = input('🙂 >')
    except EOFError:
        break
    if q == "quit":
        break
    if q.strip() == "":
        continue
   
    messages.append( {"role":"user", "content":q} )

Since coding tasks can be a little trickier for LLMs, I am also going to add memory reinforcement. By default, during one session, there isn’t a true long-term memory. LLMs have access to the chat history, so they can remember information temporarily and track the context and instructions you’ve given earlier in the conversation. However, memory doesn’t always work as expected, especially if the LLM is small. Therefore, a good practice is to reinforce the model’s memory by adding periodic reminders in the chat history.

prompt = '''You are an expert data scientist, and you have tools to execute python code.
First of all, execute the following code exactly as it is: 'df=pd.read_csv(path); print(df.head())'
If you create a plot, ALWAYS add 'plt.show()' at the end.
'''
messages = [{"role":"system", "content":prompt}]
memory = '''Use the dataframe 'df'.'''
start = True

while True:
    ## user input
    try:
        if start is True:
            path = input('📁 Provide a CSV path >')
            q = "path = "+path
        else:
            q = input('🙂 >')
    except EOFError:
        break
    if q == "quit":
        break
    if q.strip() == "":
        continue
   
    ## memory
    if start is False:
        q = memory+"\n"+q
    messages.append( {"role":"user", "content":q} )

Please note that the default context length in Ollama is 2048 tokens. If your machine can handle it, you can increase it by changing the number when the LLM is invoked:

    ## model
    agent_res = ollama.chat(
        model=llm,
        tools=[tool_code_exec],
        options={"num_ctx":2048},
        messages=messages)

In this use case, the output of the Agent is mostly code and data, so I don’t want the LLM to re-elaborate the responses.

    ## response
    dic_tools = {'code_exec':code_exec}
   
    if "tool_calls" in agent_res["message"].keys():
        for tool in agent_res["message"]["tool_calls"]:
            t_name, t_inputs = tool["function"]["name"], tool["function"]["arguments"]
            if f := dic_tools.get(t_name):
                ### calling tool
                print('🔧 >', f"\x1b[1;31m{t_name} -> Inputs: {t_inputs}\x1b[0m")
                messages.append( {"role":"user", "content":"use tool '"+t_name+"' with inputs: "+str(t_inputs)} )
                ### tool output
                t_output = f(**tool["function"]["arguments"])
                ### final res
                res = t_output
            else:
                print('🤬 >', f"\x1b[1;31m{t_name} -> NotFound\x1b[0m")
 
    if agent_res['message']['content'] != '':
        res = agent_res["message"]["content"]
     
    print("👽 >", f"\x1b[1;30m{res}\x1b[0m")
    messages.append( {"role":"assistant", "content":res} )
    start = False

Now, if we run the full code, we can chat with our Agent.

Conclusion

This article has covered the foundational steps of creating Agents from scratch using only Ollama. With these building blocks in place, you are already equipped to start developing your own Agents for different use cases. 

Stay tuned for Part 2, where we will dive deeper into more advanced examples.

Full code for this article: GitHub

I hope you enjoyed it! Feel free to contact me for questions and feedback or just to share your interesting projects.

👉 Let’s Connect 👈

The post AI Agents from Scratch: Single Agents appeared first on Towards Data Science.

]]>