Data Labeling | Towards Data Science
https://towardsdatascience.com/tag/data-labeling/
The world’s leading publication for data science, AI, and ML professionals.

How to Measure Real Model Accuracy When Labels Are Noisy
https://towardsdatascience.com/how-to-measure-real-model-accuracy-when-labels-are-noisy/
The math behind “true” accuracy and error correlation

Ground truth is never perfect. From scientific measurements to human annotations used to train deep learning models, ground truth always contains some amount of error. ImageNet, arguably the most well-curated image dataset, has 0.3% errors in its human annotations. How, then, can we evaluate predictive models using such erroneous labels?

In this article, we explore how to account for errors in test data labels and estimate a model’s “true” accuracy.

Example: image classification

Let’s say there are 100 images, each containing either a cat or a dog. The images are labeled by human annotators who are known to have 96% accuracy (Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ). If we train an image classifier on some of this data and find that it has 90% accuracy on a hold-out set (Aᵐᵒᵈᵉˡ), what is the “true” accuracy of the model (Aᵗʳᵘᵉ)? A couple of observations first:

  1. Within the 90% of predictions that the model got “right,” some examples may have been incorrectly labeled, meaning both the model and the ground truth are wrong. This artificially inflates the measured accuracy.
  2. Conversely, within the 10% of “incorrect” predictions, some may actually be cases where the model is right and the ground truth label is wrong. This artificially deflates the measured accuracy.

Given these complications, how much can the true accuracy vary?

Range of true accuracy

True accuracy of model for perfectly correlated and perfectly uncorrelated errors of model and label. Figure by author.

The true accuracy of our model depends on how its errors correlate with the errors in the ground truth labels. If our model’s errors perfectly overlap with the ground truth errors (i.e., the model is wrong in exactly the same way as human labelers), its true accuracy is:

Aᵗʳᵘᵉ = 0.90 - (1 - 0.96) = 86%

Alternatively, if our model is wrong in exactly the opposite way as human labelers (perfect negative correlation), its true accuracy is:

Aᵗʳᵘᵉ = 0.90 + (1 - 0.96) = 94%

Or more generally:

Aᵗʳᵘᵉ = Aᵐᵒᵈᵉˡ ± (1 - Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)

It’s important to note that the model’s true accuracy can be both lower and higher than its reported accuracy, depending on the correlation between model errors and ground truth errors.

Probabilistic estimate of true accuracy

In some cases, inaccuracies among labels are randomly spread among the examples and not systematically biased toward certain labels or regions of the feature space. If the model’s inaccuracies are independent of the inaccuracies in the labels, we can derive a more precise estimate of its true accuracy.

When we measure Aᵐᵒᵈᵉˡ (90%), we’re counting cases where the model’s prediction matches the ground truth label. This can happen in two scenarios:

  1. Both model and ground truth are correct. This happens with probability Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ.
  2. Both model and ground truth are wrong (in the same way). This happens with probability (1 - Aᵗʳᵘᵉ) × (1 - Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ).

Under independence, we can express this as:

Aᵐᵒᵈᵉˡ = Aᵗʳᵘᵉ × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ + (1 - Aᵗʳᵘᵉ) × (1 - Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)

Rearranging the terms, we get:

Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ + Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ - 1) / (2 × Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ - 1)

In our example, that equals (0.90 + 0.96 - 1) / (2 × 0.96 - 1) ≈ 93.5%, which is within the range of 86% to 94% that we derived above.
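These formulas are easy to put into code. Here is a minimal Python sketch (the function names are mine, not from any library) that computes the worst-case/best-case bounds and the independence-based estimate for the example above:

def true_accuracy_bounds(a_model: float, a_ground_truth: float) -> tuple[float, float]:
    """Worst- and best-case true accuracy for a given measured accuracy."""
    label_error = 1 - a_ground_truth
    return a_model - label_error, a_model + label_error

def true_accuracy_independent(a_model: float, a_ground_truth: float) -> float:
    """Estimate of true accuracy assuming model and label errors are independent."""
    return (a_model + a_ground_truth - 1) / (2 * a_ground_truth - 1)

low, high = true_accuracy_bounds(0.90, 0.96)
print(f"bounds: {low:.3f} to {high:.3f}")                                     # 0.860 to 0.940
print(f"independent estimate: {true_accuracy_independent(0.90, 0.96):.3f}")  # 0.935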

The independence paradox

Plugging in Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ = 0.96 from our example, we get Aᵗʳᵘᵉ = (Aᵐᵒᵈᵉˡ - 0.04) / 0.92. Let’s plot this below.

True accuracy as a function of model’s reported accuracy when ground truth accuracy = 96%. Figure by author.

Strange, isn’t it? If we assume that the model’s errors are uncorrelated with the ground truth errors, then its true accuracy Aᵗʳᵘᵉ always lies above the 1:1 line (Aᵗʳᵘᵉ > Aᵐᵒᵈᵉˡ) whenever the reported accuracy is > 0.5. This holds true even if we vary Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ:

Model’s “true” accuracy as a function of its reported accuracy and ground truth accuracy. Figure by author.
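We can also sanity-check the independence formula with a quick Monte Carlo simulation: draw true labels, corrupt a fraction of them independently to create noisy ground truth, simulate a model with a chosen true accuracy, and compare the measured accuracy against the formula. A small NumPy sketch (illustrative only, with an arbitrary seed and sample size):

import numpy as np

rng = np.random.default_rng(0)
n, a_true, a_gt = 1_000_000, 0.935, 0.96

truth = rng.integers(0, 2, size=n)                          # the real cat/dog labels
labels = np.where(rng.random(n) < a_gt, truth, 1 - truth)   # noisy ground truth (independent errors)
preds = np.where(rng.random(n) < a_true, truth, 1 - truth)  # model predictions (independent errors)

measured = (preds == labels).mean()
expected = a_true * a_gt + (1 - a_true) * (1 - a_gt)
print(f"measured accuracy vs. labels: {measured:.3f}")  # ~0.900
print(f"formula prediction:           {expected:.3f}")  # 0.900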

Error correlation: why models often struggle where humans do

The independence assumption is crucial but often doesn’t hold in practice. If some images of cats are very blurry, or some small dogs look like cats, then both the ground truth and model errors are likely to be correlated. This causes Aᵗʳᵘᵉ to be closer to the lower bound (Aᵐᵒᵈᵉˡ - (1 - Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)) than the upper bound.

More generally, model errors tend to be correlated with ground truth errors when:

  1. Both humans and models struggle with the same “difficult” examples (e.g., ambiguous images, edge cases)
  2. The model has learned the same biases present in the human labeling process
  3. Certain classes or examples are inherently ambiguous or challenging for any classifier, human or machine
  4. The labels themselves are generated from another model
  5. There are too many classes (and thus too many different ways of being wrong)

Best practices

The true accuracy of a model can differ significantly from its measured accuracy. Understanding this difference is crucial for proper model evaluation, especially in domains where obtaining perfect ground truth is impossible or prohibitively expensive.

When evaluating model performance with imperfect ground truth:

  1. Conduct targeted error analysis: Examine examples where the model disagrees with ground truth to identify potential ground truth errors.
  2. Consider the correlation between errors: If you suspect correlation between model and ground truth errors, the true accuracy is likely closer to the lower bound (Aᵐᵒᵈᵉˡ - (1 - Aᵍʳᵒᵘⁿᵈᵗʳᵘᵗʰ)).
  3. Obtain multiple independent annotations: Having multiple annotators can help estimate ground truth accuracy more reliably.

Conclusion

In summary, we learned that:

  1. The range of possible true accuracy depends on the error rate in the ground truth
  2. When errors are independent, the true accuracy is higher than the measured accuracy for any model whose reported accuracy is better than random chance
  3. In real-world scenarios, errors are rarely independent, and the true accuracy is likely closer to the lower bound

ML Metamorphosis: Chaining ML Models for Optimized Results
https://towardsdatascience.com/ml-metamorphosis-chaining-ml-models-for-optimized-results-d89d952627a9/
The universal principle of knowledge distillation, model compression, and rule extraction

Figure 1. This and other images were created by the author with the help of recraft.ai

Machine learning (ML) model training typically follows a familiar pipeline: start with data collection, clean and prepare it, then move on to model fitting. But what if we could take this process further? Just as some insects undergo dramatic transformations before reaching maturity, ML models can evolve in a similar way (see Hinton et al. [1]) – what I will call the ML metamorphosis. This process involves chaining different models together, resulting in a final model that achieves significantly better quality than if it had been trained directly from the start.

Here’s how it works:

  • Start with some initial knowledge, Data 1.
  • Train an ML model, Model A (say, a neural network), on this data.
  • Generate new data, Data 2, using Model A.
  • Finally, use Data 2 to fit your target model, Model B.
Figure 2. An illustration of the ML metamorphosis

You may already be familiar with this concept from knowledge distillation, where a smaller neural network replaces a larger one. But ML metamorphosis goes beyond this, and neither the initial model (Model A) nor the final one (Model B) need be neural networks at all.

Example: ML metamorphosis on the MNIST Dataset

Imagine you’re tasked with training a multi-class decision tree on the MNIST dataset of handwritten digit images, but only 1,000 images are labelled. You could train the tree directly on this limited data, but the accuracy would be capped at around 0.67. Not great, right? Alternatively, you could use ML metamorphosis to improve your results.

But before we dive into the solution, let’s take a quick look at the techniques and research behind this approach.

1. Knowledge distillation (2015)

Even if you haven’t used knowledge distillation, you’ve probably seen it in action. For example, Meta suggests distilling its Llama 3.2 model to adapt it to specific tasks [2]. Or take DistilBERT – a distilled version of BERT [3] – or the DMD framework, which distills Stable Diffusion to speed up image generation by a factor of 30 [4].

At its core, knowledge distillation transfers knowledge from a large, complex model (the teacher) to a smaller, more efficient model (the student). The process involves creating a transfer set that includes both the original training data and additional data (either original or synthesized) pseudo-labeled by the teacher model. The pseudo-labels are known as soft labels – derived from the probabilities predicted by the teacher across multiple classes. These soft labels provide richer information than hard labels (simple class indicators) because they reflect the teacher’s confidence and capture subtle similarities between classes. For instance, they might show that a particular "1" is more similar to a "7" than to a "5."

By training on this enriched transfer set, the student model can effectively mimic the teacher’s performance while being much lighter, faster, and easier to use.

The student model obtained in this way is more accurate than it would have been if it had been trained solely on the original training set.
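To make the soft-label idea concrete, here is a generic PyTorch-style sketch of the standard distillation loss: temperature-scaled teacher probabilities combined with the usual hard-label term. The temperature and weighting values are illustrative defaults of mine, not taken from any of the cited papers.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=4.0, alpha=0.5):
    """Blend the soft-label (teacher) loss with the usual hard-label loss."""
    # Soft targets at temperature T expose the teacher's class similarities
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        soft_targets,
        reduction="batchmean",
    ) * (T * T)  # rescale so the soft term keeps a comparable magnitude as T varies
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss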

2. Model compression (2007)

Model compression [5] is often seen as a precursor to knowledge distillation, but there are important differences. Unlike knowledge distillation, model compression doesn’t seem to use soft labels, despite some claims in the literature [1,6]. I haven’t found any evidence that soft labels are part of the process. In fact, the method in the original paper doesn’t even rely on artificial neural networks (ANNs) as Model A. Instead, it uses an ensemble of models – such as SVMs, decision trees, random forests, and others.

Model compression works by approximating the feature distribution p(x) to create a transfer set. This set is then labelled by Model A, which provides the conditional distribution p(y|x). The key innovation in the original work is a technique called MUNGE to approximate p(x). As with knowledge distillation, the goal is to train a smaller, more efficient Model B that retains the performance of the larger Model A.

As in knowledge distillation, the compressed model trained in this way can often outperform a similar model trained directly on the original data, thanks to the rich information embedded in the transfer set [5].

Often, "model compression" is used more broadly to refer to any technique that reduces the size of Model A [7,8]. This includes methods like knowledge distillation but also techniques that don’t rely on a transfer set, such as pruning, quantization, or low-rank approximation for neural networks.

3. Rule extraction (1995)

When the problem isn’t computational complexity or memory, but the opacity of a model’s decision-making, pedagogical rule extraction offers a solution [9]. In this approach, a simpler, more interpretable model (Model B) is trained to replicate the behavior of the opaque teacher model (Model A), with the goal of deriving a set of human-readable rules. The process typically starts by feeding unlabelled examples – often randomly generated – into Model A, which labels them to create a transfer set. This transfer set is then used to train the transparent student model. For example, in a classification task, the student model might be a decision tree that outputs rules such as: "If feature X1 is above threshold T1 and feature X2 is below threshold T2, then classify as positive".

The main goal of pedagogical rule extraction is to closely mimic the teacher model’s behavior, with fidelity – the accuracy of the student model relative to the teacher model – serving as the primary quality measure.

Interestingly, research has shown that transparent models created through this method can sometimes reach higher accuracy than similar models trained directly on the original data used to build Model A [10,11].

Pedagogical rule extraction belongs to a broader family of techniques known as "global" model explanation methods, which also include decompositional and eclectic rule extraction. See [12] for more details.
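To make this concrete, here is a small scikit-learn sketch of the pedagogical approach: an opaque Model A labels randomly generated inputs to form a transfer set, a shallow decision tree (Model B) is fit on it, and fidelity to the teacher is measured. The data and model choices are toy placeholders rather than the setup used in the cited studies.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Model A: an opaque teacher trained on the original data
X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
model_a = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Transfer set: randomly generated inputs labeled by Model A
X_transfer = np.random.default_rng(0).uniform(X.min(0), X.max(0), size=(20_000, X.shape[1]))
y_transfer = model_a.predict(X_transfer)

# Model B: a transparent student fit on the transfer set
model_b = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_transfer, y_transfer)

# Fidelity: how closely Model B mimics Model A on fresh inputs
X_test = np.random.default_rng(1).uniform(X.min(0), X.max(0), size=(5_000, X.shape[1]))
fidelity = (model_b.predict(X_test) == model_a.predict(X_test)).mean()
print(f"fidelity: {fidelity:.2f}")
print(export_text(model_b))  # the human-readable rules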

4. Simulations as Model A

Model A doesn’t have to be an ML model – it could just as easily be a computer simulation of an economic or physical process, such as the simulation of airflow around an airplane wing. In this case, Data 1 consists of the differential or difference equations that define the process. For any given input, the simulation makes predictions by solving these equations numerically. However, when these simulations become computationally expensive, a faster alternative is needed: a surrogate model (Model B), which can accelerate tasks like optimization [13]. When the goal is to identify important regions in the input space, such as zones of system stability, an interpretable Model B is developed through a process known as scenario discovery [14]. To generate the transfer set (Data 2) for both surrogate modelling and scenario discovery, Model A is run on a diverse set of inputs.

Back to our MNIST example

In an insightful article on TDS [15], Niklas von Moers shows how semi-supervised learning can improve the performance of a convolutional neural network (CNN) on the same input data. This result fits into the first stage of the ML metamorphosis pipeline, where Model A is a trained CNN classifier. The transfer set, Data 2, then contains the originally labelled 1,000 training examples plus about 55,000 examples pseudo-labelled by Model A with high-confidence predictions. I now train our target Model B, a decision tree classifier, on Data 2 and achieve an accuracy of 0.86 – much higher than the 0.67 obtained when training on the labelled part of Data 1 alone. This means that chaining the decision tree to the CNN solution reduces the error rate of the decision tree from 0.33 to 0.14. Quite an improvement, wouldn’t you say?
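The chaining step itself is only a few lines. The sketch below assumes a trained CNN classifier (Model A) exposing a `predict` method that returns class probabilities, plus NumPy arrays for the labeled and unlabeled images; the names and the 0.99 confidence threshold are placeholders of mine rather than the exact settings of the experiment.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def chain_cnn_to_tree(cnn, x_labeled, y_labeled, x_unlabeled, threshold=0.99):
    """Build Data 2 from Model A's confident pseudo-labels and fit Model B on it."""
    probs = cnn.predict(x_unlabeled)                # Model A's class probabilities
    confident = probs.max(axis=1) >= threshold      # keep high-confidence predictions only
    pseudo_labels = probs.argmax(axis=1)[confident]

    # Data 2 = originally labeled examples + confidently pseudo-labeled ones
    x_data2 = np.concatenate([x_labeled, x_unlabeled[confident]])
    y_data2 = np.concatenate([y_labeled, pseudo_labels])

    # Model B: the target decision tree, trained on the enriched transfer set
    tree = DecisionTreeClassifier(random_state=0)
    return tree.fit(x_data2.reshape(len(x_data2), -1), y_data2)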

For the full experimental code, check out the GitHub repository.

Conclusion

In summary, ML metamorphosis isn’t always necessary – especially if accuracy is your only concern and there’s no need for interpretability, faster inference, or reduced storage requirements. But in other cases, chaining models may yield significantly better results than training the target model directly on the original data.

Figure 2: For easy reference, here’s the illustration again

For a classification task, the process involves:

  • Data 1: The original, fully or partially labeled data.
  • Model A: A model trained on Data 1.
  • Data 2: A transfer set that includes pseudo-labeled data.
  • Model B: The final model, designed to meet additional requirements, such as interpretability or efficiency.

So why don’t we always use ML metamorphosis? The challenge often lies in finding the right transfer set, Data 2 [9]. But that’s a topic for another story.

References

[1] Hinton, Geoffrey. "Distilling the Knowledge in a Neural Network." arXiv preprint arXiv:1503.02531 (2015).

[2] Introducing Llama 3.2

[3] Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. " arXiv preprint arXiv:1910.01108 (2019).

[4] Yin, Tianwei, et al. "One-step diffusion with distribution matching distillation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[5] Buciluǎ, Cristian, Rich Caruana, and Alexandru Niculescu-Mizil. "Model compression." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 2006.

[6] Knowledge distillation, Wikipedia

[7] An Overview of Model Compression Techniques for Deep Learning in Space, on Medium

[8] Distilling BERT Using an Unlabeled Question-Answering Dataset, on Towards Data Science

[9] Arzamasov, Vadim, Benjamin Jochum, and Klemens Böhm. "Pedagogical Rule Extraction to Learn Interpretable Models – an Empirical Study." arXiv preprint arXiv:2112.13285 (2021).

[10] Domingos, Pedro. "Knowledge acquisition from examples via multiple models." Proceedings of the International Conference on Machine Learning (ICML). Morgan Kaufmann, 1997.

[11] De Fortuny, Enric Junque, and David Martens. "Active learning-based pedagogical rule extraction." IEEE transactions on neural networks and learning systems 26.11 (2015): 2664–2677.

[12] Guidotti, Riccardo, et al. "A survey of methods for explaining black box models." ACM computing surveys (CSUR) 51.5 (2018): 1–42.

[13] Surrogate model, Wikipedia

[14] Scenario discovery in Python, blog post on Water Programming

[15] Teaching Your Model to Learn from Itself, on Towards Data Science

Teaching Your Model to Learn from Itself
https://towardsdatascience.com/teaching-your-model-to-learn-from-itself-8b5ef13eb173/

A case study on iterative, confidence-based pseudo-labeling for classification

In Machine Learning, more data leads to better results. But labeling data can be expensive and time-consuming. What if we could use the huge amounts of unlabeled data that’s usually easy to get? This is where pseudo-labeling comes in handy.

TL;DR: I conducted a case study on the MNIST dataset and boosted my model’s accuracy from 90 % to 95 % by applying iterative, confidence-based pseudo-labeling. This article covers the details of what pseudo-labeling is, along with practical tips and insights from my experiments.

How Does it Work?

Pseudo-labeling is a type of semi-supervised learning. It bridges the gap between supervised learning (where all data is labeled) and unsupervised learning (where no data is labeled).

Process diagram illustrating the procedure on the MNIST dataset. Derived from Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. Licensed under CC BY-SA 3.0.

The exact procedure I followed goes as follows:

  • We start with a small amount of labeled data and train our model on it.
  • The model makes predictions on the unlabeled data.
  • We pick the predictions the model is most confident about (e.g., above 95 % confidence) and treat them as if they were actual labels, hoping that they are reliable enough.
  • We add this "pseudo-labeled" data to our training set and retrain the model.
  • We can repeat this process several times, letting the model learn from the growing pool of pseudo-labeled data.

While this approach may introduce some incorrect labels, the benefit comes from the significantly increased amount of training data.
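In code, the core loop is short. Here is a scikit-learn-style sketch of the procedure described above; the `model` object, array names, and stopping conditions are placeholders, not the exact code used in the experiments (that is linked at the end of the article).

import numpy as np

def pseudo_label_loop(model, x_labeled, y_labeled, x_unlabeled, threshold=0.95, iterations=10):
    """Iteratively grow the training set with the model's own confident predictions."""
    x_train, y_train = x_labeled, y_labeled
    for _ in range(iterations):
        model.fit(x_train, y_train)
        if len(x_unlabeled) == 0:
            break
        probs = model.predict_proba(x_unlabeled)
        confident = probs.max(axis=1) >= threshold   # confidence-based selection
        if not confident.any():
            break
        # Treat confident predictions as labels and move them into the training set
        x_train = np.concatenate([x_train, x_unlabeled[confident]])
        y_train = np.concatenate([y_train, probs.argmax(axis=1)[confident]])
        x_unlabeled = x_unlabeled[~confident]
    return model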

The Echo Chamber Effect: Can Pseudo-Labeling Even Work?

The idea of a model learning from its own predictions might raise some eyebrows. After all, aren’t we trying to create something from nothing, relying on an "echo chamber" where the model simply reinforces its own initial biases and errors?

This concern is valid. It may remind you of the legendary Baron Münchhausen, who famously claimed to have pulled himself and his horse out of a swamp by his own hair – a physical impossibility. Similarly, if a model solely relies on its own potentially flawed predictions, it risks getting stuck in a loop of self-reinforcement, much like people trapped in echo chambers who only hear their own beliefs reflected back at them.

So, can pseudo-labeling truly be effective without falling into this trap?

The answer is yes. While this story of Baron Münchhausen is obviously a fairytale, you may imagine a blacksmith progressing through the ages. He starts with basic stone tools (the initial labeled data). Using these, he forges crude copper tools (pseudo-labels) from raw ore (unlabeled data). These copper tools, while still rudimentary, allow him to work on previously unfeasible tasks, eventually leading to the creation of tools that are made of bronze, iron, and so on. This iterative process is crucial: You cannot forge steel swords using a stone hammer.

Just like the blacksmith, in machine learning, we can achieve a similar progression by:

  • Rigorous thresholds: The model’s out-of-sample accuracy is bounded by the share of correct training labels. If 10 % of the labels are wrong, the model’s accuracy won’t significantly exceed 90 %. Therefore, it is important to allow as few wrong labels as possible.
  • Measurable feedback: Constantly evaluating the model’s performance on a separate test set acts as a reality check, ensuring we’re making actual progress, not just reinforcing existing errors.
  • Human-in-the-loop: Incorporating human feedback in the form of manual review of pseudo-labels or manual labeling of low-confidence data can provide valuable course correction.

Pseudo-labeling, when done right, can be a powerful tool to make the most of small labeled datasets, as we will see in the following case study.

Case Study: MNIST Dataset

I conducted my experiments on the MNIST dataset, a classic collection of 28 by 28 pixel images of handwritten digits, widely used for benchmarking machine learning models. It consists of 60,000 training images and 10,000 test images. The goal is to, based on the 28 by 28 pixels, predict what digit is written.

I trained a simple CNN on an initial set of 1,000 labeled images, leaving 59,000 unlabeled. I then used the trained model to predict the labels for the unlabeled images. Predictions with confidence above a certain threshold (e.g., 95 %) were added to the training set, along with their predicted labels. The model was then retrained on this expanded dataset. This process was repeated iteratively, up to ten times or until there was no more unlabeled data.

This experiment was repeated with different numbers of initially labeled images and confidence thresholds.

Results

The following table summarizes the results of my experiments, comparing the performance of pseudo-labeling to training on the full labeled dataset.

Even with a small initial labeled dataset, pseudo-labeling may produce remarkable results, increasing the accuracy by 4.87 %pt. for 1,000 initial labeled samples. When using only 100 initial samples, this effect is even stronger. However, it would’ve been wise to manually label more than 100 samples.

Interestingly, the final test accuracy of the experiment with 100 initial training samples exceeded the share of correct training labels.

Accuracy improvement (y-axis) compared to the first iteration per iteration (color) by threshold (x-axis). There is a clear trend of better improvements for higher thresholds and more iterations. Image by the author.
Share of correct training labels and number of total training data points per iteration by threshold. Higher thresholds lead to more robust but slower labeling. Image by the author.
Accuracies for high and low confidence predictions per iteration by threshold. Higher thresholds lead to better accuracies, but the accuracy decreases with time for every choice of threshold. Image by the author.
Accuracy improvement per iteration compared to the first iteration by threshold for 100 and 10,000 initially labeled training samples (left and right respectively). Note the different scales. Image by the author.

Looking at the above graphs, it becomes apparent that, in general, higher thresholds lead to better results – as long as at least some predictions exceed the threshold. In future experiments, one might try to vary the threshold with each iteration.

Furthermore, the accuracy improves even in the later iterations, indicating that the iterative nature provides a true benefit.

Key Findings and Lessons Learned

  • Pseudo-labeling is best applied when unlabeled data is plentiful but labeling is expensive.
  • Monitor the test accuracy: It’s important to keep an eye on the model’s performance on a separate test dataset throughout the iterations.
  • Manual labeling can still be helpful: If you have the resources, focus on manually labeling the low confidence data. However, humans aren’t perfect either and labeling of high confidence data may be delegated to the model in good conscience.
  • Keep track of what labels are AI-generated. If more manually labeled data becomes available later on, you’ll likely want to discard the pseudo-labels and repeat this procedure, increasing the pseudo-label accuracy.
  • Be careful when interpreting the results: When I first did this experiment a few years ago, I focused on the accuracy on the remaining unlabeled training data. This accuracy falls with more iterations! However, this is likely because the remaining data is harder to predict – the model was never confident about it in previous iterations. I should have focused on the test accuracy, which actually improves with more iterations.

Links

The repository containing the experiment’s code can be found here.

Related paper: Iterative Pseudo-Labeling with Deep Feature Annotation and Confidence-Based Sampling

Stop Wasting LLM Tokens
https://towardsdatascience.com/stop-wasting-llm-tokens-a5b581fb3e6e/
Batching your inputs together can lead to substantial savings without compromising on performance

If you use LLMs to annotate or process larger datasets, chances are that you don’t even realize you are wasting a lot of input tokens. As you repeatedly call an LLM to process text snippets or entire documents, your task instructions and static few-shot examples are repeated for every input example. Just like neatly stacking dishes saves space, batching inputs together can result in substantial savings.

Assume you want to tag a small document corpus of 1,000 single-page documents with instructions and few-shot examples that are about half a page long. Annotating each document separately would cost you about 1M input tokens. However, if you annotated ten documents in the same call, you’d save about 300K input tokens (or 30%) because the instructions don’t have to be repeated! As we’ll show in the example below, this can often happen with minimal performance loss (or even performance gain), especially when you optimize your prompt alongside.
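A quick back-of-the-envelope helper makes this arithmetic explicit. The sketch below assumes each call repeats the same instruction block once and that documents are processed B at a time; the token counts (roughly 650 tokens per page) are assumptions chosen to reproduce the example above.

def input_token_savings(n_docs, doc_tokens, instruction_tokens, batch_size):
    """Input tokens with and without minibatching, assuming one shared instruction block per call."""
    unbatched = n_docs * (instruction_tokens + doc_tokens)
    n_calls = -(-n_docs // batch_size)  # ceiling division
    batched = n_calls * instruction_tokens + n_docs * doc_tokens
    return unbatched, batched, 1 - batched / unbatched

# 1,000 one-page documents (~650 tokens each) with half-page instructions (~325 tokens)
unbatched, batched, saved = input_token_savings(1000, 650, 325, batch_size=10)
print(f"{unbatched:,} vs. {batched:,} input tokens -> {saved:.0%} saved")  # 975,000 vs. 682,500 -> 30% saved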

Saving tokens with minibatching

Below I have plotted the savings assuming that our average document length is D tokens and our instructions and few-shot examples have r*D tokens. The example scenario from the previous paragraph where the instructions are half the length of the document (r = 0.5) appears in blue below. For longer shared instructions, our savings can be even higher:

The main takeaways are:

  • Even with relatively short instructions (blue line), there is value in minibatching
  • It’s not necessary to use really large minibatch sizes. Most savings can be obtained with even moderate minibatch sizes (B ≤ 10).

Minibatching in practice

Let’s turn practical with a task where we want to categorize pieces of text for further analysis. We’ll use a fun task from the Natural-Instructions benchmark where we need to annotate sentences in debates with one of four categories (value, fact, testimony or policy).

Looking at an example, we see that we get the current topic for context and then need to categorize the sentence in question.

{
  "input": {
    "topic": "the fight for justice,equality,peaceand love is futile",
    "sentence": "What matters is what I am personally doing to ensure that I am filling the cup!"
  },
  "output": "Value"
}

One question we haven’t answered yet:

How do we pick the right minibatch size?

Previous work has shown that the best minibatch size depends on the task as well as the model. We essentially have two options:

  1. We pick a reasonable minibatch size, let’s say 5, and hope that we don’t see any performance drops.
  2. We optimize the minibatch size along with other choices, e.g., the number of few-shot examples.

As you might have guessed, we’ll pursue option 2 here. To run our experiments, we’ll use SAMMO, an open-source framework for LLM calling and prompt optimization.

Prompts are coded up in SAMMO as prompt programs (which are simply nested Python classes that’ll be called with input data). We’ll structure our task into three sections and format our minibatches in JSON format.

def prompt_program(fewshot_data, n_fewshot_examples=5, minibatch_size=1):
    # `task["Definition"]` holds the task instructions loaded from the Natural-Instructions benchmark;
    # `fewshot_data` holds the labeled examples available for few-shot prompting.
    return Output(
        MetaPrompt(
            [
                Section("Instructions", task["Definition"]),
                Section(
                    "Examples",
                    FewshotExamples(
                        fewshot_data, n_fewshot_examples
                    ),
                ),
                Section("Output in same format as above", InputData()),
            ],
            data_formatter=JSONDataFormatter(),
            render_as="markdown",
        ).with_extractor(on_error="empty_result"),
        minibatch_size=minibatch_size,
        on_error="empty_result",
    )

Running this without minibatching and using five few-shot examples, we get an accuracy of 0.76 and have to pay 58255 input tokens.

Let’s now explore how minibatching affects costs and performance. Since minibatching reduces the total input costs, we can now use some of those savings to add more few-shot examples! We can study those trade-offs by setting up a search space in SAMMO:

def search_space(fewshot_data):
    minibatch_size = search_op.one_of([1, 5, 10], name="minibatch_size")
    n_fewshot_examples = search_op.one_of([5, 20], name="n_fewshot")

    return prompt_program(fewshot_data, n_fewshot_examples, minibatch_size)

Running this shows us the full gamut of trade-offs:

  setting                                  objective    costs                              parse_errors
  ---------------------------------------  -----------  ---------------------------------  --------------
* {'minibatch_size': 1, 'n_fewshot': 5}    0.76         {'input': 58255, 'output': 5817}   0.0
  {'minibatch_size': 1, 'n_fewshot': 20}   0.76         {'input': 133355, 'output': 6234}  0.0
  {'minibatch_size': 5, 'n_fewshot': 5}    0.75         {'input': 15297, 'output': 5695}   0.0
  {'minibatch_size': 5, 'n_fewshot': 20}   0.77         {'input': 30317, 'output': 5524}   0.0
  {'minibatch_size': 10, 'n_fewshot': 5}   0.73         {'input': 9928, 'output': 5633}    0.0
* {'minibatch_size': 10, 'n_fewshot': 20}  0.77         {'input': 17438, 'output': 5432}   0.0

So, even with 20 few-shot examples, we save nearly 70 % of the input costs ([58255 - 17438] / 58255), all while maintaining overall accuracy! As an exercise, you can implement your own objective to automatically factor in costs or include different ways of formatting the data in the search space.

Caveats

Implicit in all of this is that (i) we have enough input examples that use the shared instructions and (ii) we have some flexibility regarding latency. The first assumption is met in many annotation scenarios, but obviously doesn’t hold in one-off queries. In annotation or other offline processing tasks, latency is also not super critical as throughput matters most. However, if your task is to provide a user with the answer as quickly as possible, it might make more sense to issue B parallel calls than one call with B input examples.


Conclusions

As illustrated in this quick and practical example, prompting LLMs with multiple inputs at the same time can greatly reduce costs while maintaining comparable or even better accuracy. The good news is also that even with moderate minibatch sizes (e.g., 5 or 10), savings can be substantial. With SAMMO, you can automatically see how performance behaves under different settings and make an optimal choice.

An open research question is how to integrate this with Retrieval Augmented Generation (RAG) – one can form the union over all retrieved examples or rank them in some fashion. SAMMO lets you explore some of these strategies along with a lot of other choices during prompt construction, for example how to format your input data. Please leave a comment if you would like to see more on this topic or anything else.

Disclaimer: I am the author of SAMMO, an open-source, MIT-licensed framework for prompt optimization.

Resources

How to automate entity extraction from PDF using LLMs
https://towardsdatascience.com/how-to-automate-entity-extraction-from-pdf-using-llms-ea9c1351f531/
Leveraging zero-shot labeling

Photo by Google DeepMind on Unsplash

The need for high-quality labeled data cannot be overstated in modern machine learning applications. From improving our models’ performance to ensuring fairness, the power of labeled data is immense. Unfortunately, the time and effort required to create such datasets are equally significant. But what if we could reduce the time spent on this task from days to mere hours while maintaining or even enhancing the labeling quality? A utopian dream? Not anymore.

Emerging paradigms in machine learning – Zero-Shot Learning, Few-Shot Learning, and Model-Assisted Labeling – present a transformative approach to this crucial process. These techniques harness the power of advanced algorithms, reducing the need for extensive labeled datasets, and enabling faster, more efficient, and highly effective data annotation.

In this tutorial, we are going to present a method to auto-label unstructured and semi-structured documents using Large Language Models’ (LLMs) in-context learning capabilities.

Information extraction from SDS

Unlike traditional supervised models that require extensive labeled data to be trained on a specific task, LLMs can generalize and extrapolate information from a few examples by tapping into their large knowledge base. This emergent capability, known as in-context learning, makes LLMs a versatile choice for many tasks, including not only text generation but also data extraction such as named entity recognition.

For this tutorial, we are going to label Safety Data Sheets (SDS) from various companies using the zero-shot and few-shot labeling capabilities of GPT-3.5, also known as ChatGPT. SDSs offer comprehensive information regarding specific substances or mixtures, designed to assist workplaces in effectively managing chemicals. These documents play a vital role in providing detailed insights into hazards, encompassing environmental risks, and offering invaluable guidance on safety precautions. SDSs act as an indispensable source of knowledge, enabling employees to make informed decisions regarding the safe handling and utilization of chemicals in the workplace. SDSs usually come as PDFs in various layouts but generally contain the same information. In this tutorial, we are interested in training an AI model that automatically identifies the following entities:

  • Product number
  • CAS number
  • Use cases
  • Classification
  • GHS label
  • Formula
  • Molecular weight
  • Synonym
  • Emergency phone number
  • First aid measures
  • Component
  • Brand

Extracting this relevant information and storing it in a searchable database is very valuable for many companies since it allows hazardous components to be searched and retrieved very quickly. Here is an example of an SDS:

Publicly available SDS. Image by Author

Zero-shot Labeling

Unlike text generation, information extraction is a much more challenging task for LLMs. LLMs have been trained for text completion and usually tend to hallucinate or generate additional comments or text when prompted to extract relevant information.

In order to correctly parse the result of the LLM, we need a consistent output format such as JSON, which requires some prompt engineering to get right. In addition, once the results are parsed, we need to map them back to the original tokens in the input text.
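For illustration, a do-it-yourself version of these steps might look like the sketch below: a prompt that pins the output to JSON, a chat-completion call, and a simple string search to map each extracted value back to its character offsets. The entity list and prompt wording are my own illustrative choices, not UBIAI’s internal implementation.

import json
import openai

openai.api_key = "<your_api_key>"

# Hypothetical subset of the SDS entities listed above
ENTITIES = ["PRODUCT_NUMBER", "CAS_NUMBER", "FORMULA", "MOLECULAR_WEIGHT", "BRAND"]

def zero_shot_extract(text):
    prompt = (
        "Extract the following entities from the text without changing the original words. "
        f"Return only a JSON list of objects like {{\"entity\": ..., \"value\": ...}}.\n"
        f"Entities: {', '.join(ENTITIES)}\nText: {text}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    items = json.loads(response.choices[0].message["content"])
    # Map each extracted value back to its character offsets in the source text
    for item in items:
        start = text.find(item["value"])  # -1 if the model paraphrased despite the instruction
        item["start"], item["end"] = start, start + len(item["value"])
    return items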

Fortunately, all these steps have been done and abstracted away by the UBIAI annotation tool. Under the hood, UBIAI does the prompting, chunks the data so it stays below the context-length limit, and sends it to OpenAI’s GPT-3.5 Turbo API for inference. Once the output comes back, it is parsed, processed, and applied to your documents for auto-labeling.

To get started, simply upload your documents, whether they are native PDFs, images, or simple DOCX files, then go to the annotation page and select the Few-shot tab in the annotation interface:

UBIAI Few-shot dashboard. Image by Author

For more details, checkout the documentation here: https://ubiai.gitbook.io/ubiai-documentation/zero-shot-and-few-shot-labeling

UBIAI enables you to configure the number of examples that you would like the model to learn from to auto-label the next documents. The app will automatically choose the most informative documents from your already labeled dataset and concatenate them in the prompt. This approach is called few-shot labeling, where "few" ranges from 0 to n. To configure the number of examples, simply click on the configuration button and input the number of examples, as shown below.

UBIAI Few-shot configuration window. Image by Author

For this tutorial, we are going to provide zero examples to the LLM to learn from and ask it to label the data based purely on the description of the entity itself. Surprisingly, the LLM is able to understand our document quite well and does most of the labeling correctly!

Below is the result of zero-shot labeling on the SDS PDF without any examples, quite impressive!

Zero-shot labeling using UBIAI. Image by Author

Conclusion

Automating entity extraction from PDFs using Large Language Models (LLMs) has become a reality with the advent of LLMs’ in-context learning capabilities such as Zero-Shot Learning and Few-Shot Learning. These techniques harness the power of LLMs’ latent knowledge to reduce the reliance on extensive labeled datasets and enable faster, more efficient, and highly effective data annotation.

The tutorial presented a method to auto-label semi-structured documents, specifically focusing on Safety Data Sheets (SDS), though it also works for unstructured text. By leveraging the in-context learning capabilities of LLMs, particularly GPT-3.5 (ChatGPT), the tutorial demonstrated the ability to automatically identify important entities within SDSs, such as product number, CAS number, use cases, classification, GHS label, and more.

The extracted information, if stored in a searchable database, provides significant value to companies as it allows for quick search and retrieval of hazardous components. The tutorial highlighted the potential of zero-shot labeling, where the LLM can understand and extract information from SDSs without any explicit examples. This showcases the versatility and generalization abilities of LLMs, going beyond text generation tasks.

If you are interested in creating your own training dataset using LLMs’ zero-shot capabilities, schedule a demo with us here.

Follow us on Twitter @UBIAI5 !

Bootstrapping Labels with GPT-4
https://towardsdatascience.com/bootstrapping-labels-with-gpt-4-8dc85ab5026d/
A cost-effective approach to data labeling

(Source: Image generated by author with DALL-E, modified by author.)

Data labeling is a critical component of machine learning projects. Its importance rests on the old adage, "garbage in, garbage out." Labeling involves creating annotated datasets for training and evaluation, but this process can be time-consuming and expensive, especially for projects with lots of data. So what if we could use the advances in LLMs to reduce the cost and effort involved in data labeling tasks?

GPT-4 is a state-of-the-art language model developed by OpenAI. It has a remarkable ability to understand and generate human-like text and has been a game changer in the natural language processing (NLP) community and beyond. In this blog post, we’ll explore how you can use GPT-4 to bootstrap labels for various tasks. This can significantly reduce the time and cost involved in the labeling process. We’ll focus on sentiment classification to demonstrate how prompt engineering can enable you to create accurate and reliable labels using GPT-4 and how this technique can be used for much more powerful things as well.

Leveraging GPT-4’s Predictions for Data Pre-labeling

As in writing, editing is often less strenuous than composing the original work. That’s why starting with pre-labeled data is more attractive than starting with a blank slate. The idea of using GPT-4 as a prediction engine to pre-label data stems from its ability to understand context and generate human-like text, so it makes sense to leverage GPT-4 to reduce the manual effort required for data labeling. This could result in cost savings and make the labeling process less mundane.

So how do we do this? If you’ve used GPT models, you’re probably familiar with prompts. Prompts set the context for the model before it begins generating output and can be tweaked and engineered (i.e. prompt engineering) to help the model deliver highly specific results. This means we can create prompts that GPT-4 can use to generate text that looks like model predictions. For our use case, we will craft our prompts in a way that guides the model toward producing the desired output format as well.

Let’s take a straightforward example of sentiment analysis. If we are trying to classify the sentiment of a given string of text as positive, negative, or neutral we could provide a prompt like:

"Classify the sentiment of the following text as 'positive', 'negative', or 'neutral': <input_text>"

Once we have a well-structured prompt, we can use the OpenAI API to generate predictions from GPT-4. Here’s an example using Python:

import openai
import re

openai.api_key = "<your_api_key>"

def get_sentiment(input_text):
    prompt = f"Respond in the json format: {{'response': sentiment_classification}}\nText: {input_text}\nSentiment (positive, neutral, negative):"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_tokens=40,
        n=1,
        stop=None,
        temperature=0.5,
    )
    response_text =  response.choices[0].message['content'].strip()
    sentiment = re.search("negative|neutral|positive", response_text).group(0)
    # Add input_text back in for the result
    return {"text": input_text, "response": sentiment}

We can run this with a single example to inspect the output we’re receiving from the API.

# Test single example
sample_text = "I had a terrible time at the party last night!"
sentiment = get_sentiment(sample_text)
print("Result\n", f"{sentiment}")
Result:
{'text': 'I had a terrible time at the party last night!', 'response': 'negative'}

Once we’re satisfied with our prompt and the results we’re getting, we can scale this up to our entire dataset. Here, we’ll assume a text file with one example per line.

import json

input_file_path = "input_texts.txt"
output_file_path = "output_responses.json"

# convert_ls_format (defined in the linked example notebook) wraps each prediction in Label Studio's task format
with open(input_file_path, "r") as input_file, open(output_file_path, "w") as output_file:
    examples = []
    for line in input_file:
        text = line.strip()
        if text:
            examples.append(convert_ls_format(get_sentiment(text)))
    output_file.write(json.dumps(examples))

We can import the data with pre-labeled predictions into Label Studio and have reviewers verify or correct the labels. This approach significantly reduces the manual work required for data labeling, as human reviewers only need to validate or correct the model-generated labels rather than annotate the entire dataset from scratch. See our full example notebook here.
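The `convert_ls_format` helper referenced above wraps each prediction in Label Studio’s task format. A minimal version might look like the following, assuming the labeling configuration shown in Step 2 below (a text field named `my_text` bound to `$reviewText` and choices named `sentiment`):

def convert_ls_format(prediction):
    """Wrap a sentiment prediction as a Label Studio task with a pre-annotation."""
    return {
        "data": {"reviewText": prediction["text"]},
        "predictions": [
            {
                "result": [
                    {
                        "from_name": "sentiment",
                        "to_name": "my_text",
                        "type": "choices",
                        "value": {"choices": [prediction["response"].capitalize()]},
                    }
                ]
            }
        ],
    }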

Note that in most situations, OpenAI is allowed to use any information sent to their APIs to train their models further. So it’s important to not send protected or private data to these APIs for labeling if we don’t want to expose the information more broadly.

Reviewing Pre-labeled Data in Label Studio

Once we have our pre-labeled data ready, we will import it into a data labeling tool, such as Label Studio, for review. This section will guide you through setting up a Label Studio project, importing the pre-labeled data, and reviewing the annotations.

Figure 1: Reviewing Sentiment Classification in Label Studio. (Image by author, screenshot with Label Studio)

Step 1: Install and Launch Label Studio

First, you need to have Label Studio installed on your machine. You can install it using pip:

pip install label-studio

After installing Label Studio, launch it by running the following command:

label-studio

This will open Label Studio in your default web browser.

Step 2: Create a New Project

Click on "Create Project" and enter a project name, such as "Review Bootstrapped Labels." Next, you need to define the labeling configuration. For sentiment analysis, we can use the Sentiment Analysis Text Classification template.

These templates are configurable, so if we want to change any of the properties, it’s really straightforward. The default labeling configuration is shown below.

<View>
  <Header value="Choose text sentiment:"/>
  <Text name="my_text" value="$reviewText"/>
  <Choices name="sentiment" toName="my_text" choice="single" showInline="true">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
    <Choice value="Neutral"/>
  </Choices>
</View>

Click "Create" to finish setting up the project.

Step 3: Import Pre-labeled Data

To import the pre-labeled data, click the "Import" button. Choose the JSON format and select the pre-labeled data file generated earlier (e.g., "output_responses.json"). The data will be imported along with the pre-populated predictions.

Step 4: Review and Update Labels

After importing the data, you can review the model-generated labels. The annotation interface will display the pre-labeled sentiment for each text sample, and reviewers can either accept or correct the suggested label.

You can improve quality further by having multiple annotators review each example.

By utilizing GPT-4-generated labels as a starting point, the review process becomes much more efficient, and reviewers can focus on validating or correcting the annotations rather than creating them from scratch.

Step 5: Export Labeled Data

Once the review process is complete, you can export the labeled data by clicking the "Export" button in the "Data Manager" tab. Choose the desired output format (e.g., JSON, CSV, or TSV), and save the labeled dataset for further use in your machine learning project.


Cost Analysis

One question rolling around in my mind was: "How much did this cost me at the end of the day?"

Note: Prices shown below reflect current data for the author at the time of publication. Pricing may differ in the future or based on geographic location.

For language models, OpenAI charges based on the number of tokens in your request. Tokens roughly correspond to words, but special characters and emojis can sometimes count as individual tokens. OpenAI’s pricing page states, "You can think of tokens as pieces of words, where 1,000 tokens is about 750 words." For more information on how tokens are counted, see this page.

The cost per token differs according to the model used. For example, the GPT-4 8K-context model costs $0.03/1K prompt tokens and $0.06/1K completion tokens, while the GPT-3.5-turbo model costs $0.002/1K tokens.

Summary of token prices for OpenAI. (Source: OpenAI forum, image by author)

To estimate the cost of pre-labeling a dataset, we can use a simple formula that considers the number of examples in the dataset, the price per token for prompts and completions, and the average number of tokens per example:

Cost = number of examples × [(prompt tokens + average example tokens) × prompt price per 1K / 1,000 + average result tokens × completion price per 1K / 1,000]

Where:

  • prompt tokens is the length of the instruction prompt
  • average example tokens is the average length of one input example
  • average result tokens is the average length of the generated label
  • prompt price per 1K and completion price per 1K are the model’s prices per 1,000 tokens

Additionally, we can calculate the total number of tokens in the dataset as follows:

Total tokens = number of examples × (prompt tokens + average example tokens + average result tokens)

Using this formula, we can estimate the cost of pre-labeling a dataset by multiplying the number of examples by the sum of the prompt cost and the completion cost, adjusted for the average number of tokens per example.

For instance, suppose we have a dataset with 1,000 examples that we want to pre-label for sentiment analysis with GPT-4, with a prompt price of $0.03 per 1K tokens, a completion price of $0.06 per 1K tokens, a prompt length of 20 tokens, an average example length of 80 tokens, and an average result length of 3 tokens. The total cost of pre-labeling would be:

Cost = 1,000 × [(20 + 80) × $0.03 / 1,000 + 3 × $0.06 / 1,000] = 1,000 × ($0.003 + $0.00018) = $3.18

In this example, pre-labeling the dataset using GPT-4 would cost $3.18. Note: the same dataset with GPT-3.5-turbo would cost ~$0.21.
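The same estimate is easy to script. The small helper below reproduces the two figures above; prices are per 1K tokens and, as noted earlier, may change over time.

def prelabel_cost(n_examples, prompt_tokens, example_tokens, result_tokens,
                  prompt_price_per_1k, completion_price_per_1k):
    """Estimated API cost of pre-labeling a dataset (prices are per 1K tokens)."""
    per_example = ((prompt_tokens + example_tokens) * prompt_price_per_1k
                   + result_tokens * completion_price_per_1k) / 1000
    return n_examples * per_example

print(round(prelabel_cost(1000, 20, 80, 3, 0.03, 0.06), 2))    # GPT-4 (8K context): 3.18
print(round(prelabel_cost(1000, 20, 80, 3, 0.002, 0.002), 2))  # GPT-3.5-turbo: 0.21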

If our pre-labeling task requires less specialized knowledge, we may want to use a less robust model to save cost. It’s usually worth manually reviewing a handful of examples with varying levels of complexity to get a sense of how accurate one model is compared to another. For information on the models, see the OpenAI Pricing page.


Beyond Sentiment Analysis: Label Any NLP Task

Screenshot of Named Entity Recognition in Label Studio. (Image by author, screenshot with Label Studio)

The great thing about this approach is that it’s not limited to just sentiment analysis. We can pre-label data for various NLP tasks using GPT-4 by using prompt engineering and guiding it to produce the correct output. Here are a few examples of prompts for different NLP tasks, all of which can be reviewed and labeled in Label Studio (examples given were generated using GPT-4). Remember that more complex tasks may require longer prompts, which will count towards your token count and subsequently, cost.

Summarization

Prompt:

Respond in the json format with a summary for the following text: {‘summary’: summary}

Text: I love going to the park on a sunny day. The customer service was terrible; they were rude and unhelpful. I am neither happy nor sad about the new policy changes. The cake was delicious and the presentation was fantastic. I had a really bad experience with the product; it broke after two days.

Summary:

Response:

{'summary': 'Artificial intelligence is a rapidly developing technology that can learn from data and make decisions without being explicitly programmed. It has the potential to help solve some of the world's most pressing problems, but there are concerns about risks and ethical considerations such as bias and job displacement. It is important to consider AI's impact and ensure responsible and beneficial use.'}

Taxonomy

Prompt:

Respond in the JSON format with a diagnosis for the following text: {‘diagnosis’: diagnosis}, where the possible diagnoses are Pneumonia, Acute bronchitis, Asthma exacerbation, Congestive heart failure, Chronic obstructive pulmonary disease (COPD), Pulmonary embolism, Pleural effusion, Interstitial lung disease, or Lung cancer.

Text: The patient presented with shortness of breath and a persistent cough. On physical examination, crackles were heard in the lower lung fields. Chest X-ray revealed bilateral infiltrates consistent with pneumonia. The patient was started on antibiotics and showed improvement in symptoms within 48 hours.

Diagnosis:

Response:

{'diagnosis': 'Pneumonia'}

Named Entity Recognition (NER)

Prompt:

You are an expert NER labeler. Only return JSON. Do not return explanations. Return the CoNLL format for the following sentence in JSON format.

Sentence: Harry Potter was a student at Hogwarts

Response:

{
  "data": [
    {
      "word": "Harry",
      "entity": "B-PER"
    },
    {
      "word": "Potter",
      "entity": "I-PER"
    },
    {
      "word": "was",
      "entity": "O"
    },
    {
      "word": "a",
      "entity": "O"
    },
    {
      "word": "student",
      "entity": "O"
    },
    {
      "word": "at",
      "entity": "O"
    },
    {
      "word": "Hogwarts",
      "entity": "B-LOC"
    }
  ]
}

In my opinion, there’s no substitute for human evaluation, but using the GPT models to give us a starting point can be a huge advantage.

Conclusion

Labeling data is always difficult, and sometimes, even getting that starting point is a huge advantage. In this blog, we showed how you can use the GPT models by OpenAI to generate predictions for data to serve as the starting point for your data labeling workflows. This process can significantly reduce the amount of human effort involved, and focus labelers’ attention on providing more value for their efforts. Check out the resources for more info on the topics presented in this blog.

Resources

Full example notebook – notebook with all code ready to run in Colab

Label Studio – Open Source data labeling tool

OpenAI Pricing page – details for the pricing estimate in this post

How Few-Shot Learning is Automating Document Labeling
https://towardsdatascience.com/how-few-shot-learning-is-automating-document-labeling-43f9868c0f74/
Leveraging GPT Model

Photo by DeepMind on Unsplash

Manual document labeling is a time-consuming and tedious process that often requires significant resources and can be prone to errors. However, recent advancements in machine learning, particularly the technique known as few-shot learning, are making it easier to automate the labeling process. Large Language Models (LLMs) in particular are excellent few-shot learners thanks to their emergent in-context learning capability.

In this article, we’ll take a closer look at how few-shot learning is transforming document labeling, specifically for Named Entity Recognition (NER), one of the most important tasks in document processing. We will show how UBIAI‘s platform is making it easier than ever to automate this critical task using few-shot labeling techniques.

What is Few-Shot Learning?

Few-shot learning is a machine learning technique that enables models to learn a given task with only a few labeled examples. Without modifying its weights, the model can be tuned to perform a specific task by including concatenated training examples of these tasks in its input and asking the model to predict the output of a target text. Here is an example of few shot learning for the task of Named Entity Recognition (NER) using 3 examples:

###Prompt
Extract entities from the following sentences without changing original words.

###
Sentence: " and storage components. 5+ years of experience deliver
ing scalable and resilient services at large enterprise scale, including experience in data platforms including large-scale analytics on relational, structured and unstructured data. 3+ years of experien
ce as a SWE/Dev/Technical lead in an agile environment including 1+ years of experience operating in a DevOps model. 2+ years of experience designing secure, scalable and cost-efficient PaaS services on
the Microsoft Azure (or similar) platform. Expert understanding of"
DIPLOMA: none
DIPLOMA_MAJOR: none
EXPERIENCE: 3+ years, 5+ years, 5+ years, 5+ years, 3+ years, 1+ years, 2+ years
SKILLS: designing, delivering scalable and resilient services, data platforms, large-scale analytics on relational, structured and unstructured data, SWE/Dev/Technical, DevOps, designing, PaaS services, Microsoft Azure
###

Sentence: "8+ years demonstrated experience in designing and developing enterprise-level scale services/solutions. 3+ years of leadership and people management experience. 5+ years of Agile Experie
nce Bachelors degree in Computer Science or Engineering, or a related field, or equivalent alternative education, skills, and/or practical experience Other 5+ years of full-stack software development exp
erience to include C# (or similar) experience with the ability to contribute to technical architecture across web, mobile, middle tier, data pipeline"
DIPLOMA: BachelorsnDIPLOMA_MAJOR: Computer Science
EXPERIENCE: 8+ years, 3+ years, 5+ years, 5+ years, 5+ years, 3+ years
SKILLS: designing, developing enterprise-level scale services/solutions, leadership and people management experience, Agile Experience, full-stack software development, C#, designing
###

Sentence: "5+ years of experience in software development. 3+ years of experience in designing and developing enterprise-level scale services/solutions. 3+ years of experience in leading and managing
 teams. 5+ years of experience in Agile Experience. Bachelors degree in Computer Science or Engineering, or a related field, or equivalent alternative education, skills, and/or practical experience."

The prompt typically begins by instructing the model to perform a specific task, such as "Extract entities from the following sentences without changing original words." Notice that we’ve added the instruction "without changing original words" to prevent the LLM from hallucinating random text, which it is notoriously known for. This has proven critical in obtaining consistent responses from the model.
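As a rough illustration of how such a prompt can be assembled and sent to a completion-style model, here is a minimal sketch; the pre-1.0 openai Completion interface, the model name, and the helper names are assumptions rather than the exact code behind any specific platform:

import openai

INSTRUCTION = "Extract entities from the following sentences without changing original words.\n\n###\n"

def build_prompt(labeled_examples, target_sentence):
    # Each labeled example already contains the Sentence plus its DIPLOMA,
    # DIPLOMA_MAJOR, EXPERIENCE, and SKILLS lines, separated by "###".
    shots = "\n###\n\n".join(labeled_examples)
    return f'{INSTRUCTION}{shots}\n###\n\nSentence: "{target_sentence}"\n'

def predict_entities(labeled_examples, target_sentence):
    response = openai.Completion.create(
        model="text-davinci-003",  # assumption: any GPT-3 completion model
        prompt=build_prompt(labeled_examples, target_sentence),
        max_tokens=256,
        temperature=0,
    )
    # The completion is the DIPLOMA / EXPERIENCE / SKILLS block for the target sentence.
    return response["choices"][0]["text"]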

The few-shot learning phenomenon has been extensively studied in this article, which I highly recommend. Essentially, the paper demonstrates that, under mild assumptions, the pretraining distribution of the model is a mixture of latent tasks that can be efficiently learned through in-context learning. In this case, in-context learning is more about identifying the task than about learning it by adjusting the model weights.

Few-shot Labeling

Few-shot learning has an excellent practical application in the data labeling space, often referred to as few-shot labeling. In this case, we provide the model with a few labeled examples and ask it to predict the labels of subsequent documents. However, integrating this capability into a functional data labeling platform is easier said than done; here are a few challenges:

  • LLMs are inherently text generators and tend to generate variable output. Prompt engineering is critical to make them create predictable output that can be later used to auto-label the data.
  • Token limitation: LLMs such as OpenAI’s GPT-3 are limited to 4000 tokens per request, which limits the length of documents that can be sent at once. Chunking and splitting the data before sending the request becomes essential.
  • Span offset calculation: After receiving the output from the model, we need to search for its occurrence in the document and label it correctly (a rough sketch of these last two steps follows this list).
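Here is a rough sketch of those two steps, splitting a long document into token-bounded chunks and recovering character offsets for a predicted entity; the tiktoken tokenizer choice and the chunk budget are illustrative assumptions:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: pick the tokenizer matching your model

def chunk_text(text, max_tokens=3000):
    # Leave headroom below the 4000-token request limit for the prompt and the answer.
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

def find_spans(document, entity):
    # Locate every occurrence of a predicted entity to build span annotations.
    spans, start = [], 0
    while (idx := document.find(entity, start)) != -1:
        spans.append((idx, idx + len(entity)))
        start = idx + len(entity)
    return spans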

Few Shot Labeling with UBIAI

We’ve recently added few-shot labeling capability by integrating OpenAI’s GPT-3 Davinci with the UBIAI annotation tool. The tool currently supports the few-shot NER task for unstructured and semi-structured documents such as PDFs and scanned images.

To get started:

  1. Simply label 1–5 examples
  2. Enable few-shot GPT model
  3. Run prediction on a new unlabeled document

Here is an example of few-shot NER on a job description with 5 examples provided:

Image by Author: Few Shot NER on unstructured text

The GPT model accurately predicts most entities with just five in-context examples. Because LLMs are trained on vast amounts of data, this few-shot learning approach can be applied to various domains, such as legal, healthcare, HR, insurance documents, etc., making it an extremely powerful tool.

However, the most surprising aspect of few-shot learning is its adaptability to semi-structured documents with limited context. In the example below, I provided GPT with only one labeled OCR’d invoice example and asked it to label the next. The model surprisingly predicted many entities accurately. With even more examples, the model does an exceptional job of generalizing to semi-structured documents as well.

Image by Author: Few Shot NER on PDF

For an in-depth tutorial of the few-shot labeling feature, check out the video below:

Conclusion:

Few-shot learning is revolutionizing the document labeling process. By integrating few-shot labeling capabilities into functional data labeling platforms, such as UBIAI’s annotation tool, it is now possible to automate critical tasks like Named Entity Recognition (NER) in unstructured and semi-structured documents. This does not imply that LLMs will replace human labelers anytime soon. Instead, they augment their capabilities by making them more efficient. With the power of few-shot learning, LLMs can label vast amounts of data and apply to multiple domains, such as legal, healthcare, HR, and insurance documents, to train smaller and more accurate specialized models that can be efficiently deployed.

We’re currently adding support for few-shot relation extraction and document classification, stay tuned!

Follow us on Twitter @UBIAI5 or subscribe here!

The post How Few-Shot Learning is Automating Document Labeling appeared first on Towards Data Science.

]]>
Top 6 Data Labeling Tools To Use In 2023 https://towardsdatascience.com/top-5-data-labeling-tools-to-use-in-2023-52bbc905ebe3/ Mon, 03 Apr 2023 14:27:43 +0000 https://towardsdatascience.com/top-5-data-labeling-tools-to-use-in-2023-52bbc905ebe3/ Speed up your data labeling and have better results with these tools.

The post Top 6 Data Labeling Tools To Use In 2023 appeared first on Towards Data Science.

]]>
Data labeling is the process of adding metadata or tags to a dataset to make it more useful for machine learning applications. The goal is to provide the machine learning algorithm with accurate and relevant information that it can use to learn from and make predictions. Data labeling is essential because it allows machine learning algorithms to understand and make sense of the data they receive.

Different types of data labeling depend on the type of data being labeled. For example, text data can be labeled for sentiment analysis or named entity recognition, while image data can be labeled for object detection or semantic segmentation.

Moreover, the quality of labeled data directly affects the machine learning algorithm’s performance. If the labeled data is inaccurate, incomplete, or inconsistent, the model cannot learn from it effectively, resulting in poor performance.

Data labeling can be time-consuming and expensive, depending on the size and complexity of the dataset. Therefore, it is essential to use high-quality data labeling tools and processes to ensure accurate and efficient labeling.

In this article, we will review 6 data labeling tools to help you make your data labeling task faster, more efficient, and more accurate.

№1: Labelbox

Labelbox is a popular data labeling platform that allows teams to manage, annotate, and collaborate on data. It offers a user-friendly interface for labeling images, text, and video data. Labelbox supports multiple annotation types, including bounding boxes, polygons, and classifications. It also provides built-in quality control tools and integrates well with popular machine learning frameworks such as TensorFlow and PyTorch.

Labelbox offers a range of annotation tools for images, including object detection, semantic segmentation, and image classification. Its text annotation tools include named entity recognition, sentiment analysis, and text classification. The platform also supports video annotation, including object tracking and action recognition.

One of the key features of Labelbox is its collaborative labeling capabilities. Multiple users can work on the same dataset simultaneously, and changes are synced in real-time. The platform also supports project management features such as task assignments, deadlines, and progress tracking.

Labelbox offers both a cloud-based and on-premise solution, making it a flexible choice for businesses of all sizes. In addition, the platform provides a free version with limited features, making it accessible to individual users and small teams.

№2: Amazon SageMaker Ground Truth

Amazon SageMaker Ground Truth is a fully managed data labeling service that Amazon Web Services (AWS) provides. It offers a scalable and secure platform for labeling data using human annotators or machine learning models. SageMaker Ground Truth supports various data types, including images, text, and audio. It also integrates well with other AWS services.

SageMaker Ground Truth offers a range of annotation tools for images, including bounding boxes, polygons, and semantic segmentation. Its text annotation tools include named entity recognition, sentiment analysis, and text classification. The platform also supports video annotation, including object tracking and action recognition.

One of the key features of SageMaker Ground Truth is its ability to integrate with AWS machine learning services such as SageMaker and Rekognition. This allows users to quickly and easily train machine learning models using their labeled data. The platform also provides automatic quality control and validation tools to ensure accurate labeling.

SageMaker Ground Truth is a flexible choice for businesses of all sizes, offering pay-as-you-go pricing and a free tier for small projects.

№3: SuperAnnotate

SuperAnnotate is an AI-powered data labeling platform that offers a range of annotation tools for images and videos. The platform supports multiple annotation types, including bounding boxes, polygons, and instance segmentation. In addition, it provides a user-friendly interface with advanced features like automatic labeling and quality control tools. SuperAnnotate also supports collaborative labeling and project management.

SuperAnnotate offers a range of annotation tools for images, including object detection, semantic segmentation, and image classification. Its video annotation tools include object tracking and action recognition.

One of the key features of SuperAnnotate is its AI-powered automatic labeling feature. The platform uses machine learning models to suggest annotations, reducing the time and effort required for manual labeling. SuperAnnotate also provides advanced quality control and validation tools to ensure accurate labeling.

SuperAnnotate offers a cloud-based solution, making it a flexible choice for businesses of all sizes. The platform provides a free version with limited features, making it accessible to individual users and small teams.

№4: Prodigy

Prodigy is a powerful data labeling tool designed for machine learning workflows. In addition, it offers a range of annotation tools for text, images, and audio data, making it a versatile choice for various projects.

One of the unique features of Prodigy is its active learning capabilities. It uses machine learning models to suggest annotations, allowing users to focus on challenging cases that require human input. This approach can help speed up the annotation process and improve the annotated data quality.

Prodigy supports multiple annotation types, including entity recognition, classification, and image segmentation. It also offers a user-friendly interface with customizable workflows and keyboard shortcuts. The platform also supports collaborative labeling, allowing multiple users to simultaneously work on the same dataset.

Prodigy offers both a cloud-based and on-premise solution, making it a flexible choice for businesses of all sizes. It also integrates with popular machine learning frameworks such as spaCy and PyTorch. It is an excellent choice for businesses looking for a data labeling tool with advanced active learning capabilities. In addition, its range of annotation tools, customizable workflows, and collaborative capabilities can help streamline the data labeling process and improve the accuracy of annotated data.

№5: Hasty.ai

Hasty.ai is a popular data labeling tool that offers a range of annotation tools for image and video data. It is designed to make the data labeling process more efficient and accurate, with a user-friendly interface and advanced features.

Hasty.ai supports multiple annotation types, including bounding boxes, polygons, and semantic segmentation. It also provides automatic labeling features powered by machine learning, which can reduce the time and effort required for manual labeling. Additionally, Hasty.ai offers quality control tools to ensure accurate labeling.

The platform’s user interface is designed to be intuitive and easy to use, with drag-and-drop functionality and keyboard shortcuts. It also supports collaborative labeling, allowing multiple users to work on the same dataset simultaneously.

Hasty.ai offers both a cloud-based and on-premise solution, making it a flexible choice for businesses of all sizes. The platform also integrates with popular machine learning frameworks such as TensorFlow and PyTorch. Hasty.ai is an excellent choice for businesses seeking a user-friendly and efficient data labeling tool. Its range of annotation tools, automatic labeling features, and collaborative capabilities can help streamline the data labeling process and improve the accuracy of annotated data.

№6: UBIAI

UBIAI is a comprehensive platform with the goal of making easy-to-use NLP tools to help developers and companies try out machine learning ideas quickly and apply them to real-world problems without wasting time coding.

UBIAI offers document annotation with various features, such as: support for over 20 languages for NER, relation extraction, and document classification tasks; the ability to create and train models for automated annotation using spaCy and transformer models; and support for OCR annotation and object detection to label native PDFs, scanned images, and pictures.

Moreover, it allows you to perform few-shot labeling using the GPT model. It also has great API and inter-annotator agreement support.

Finally, UBIAI offers a comprehensive collaboration feature to manage and track team progress and performance, including collaboration task assignment, task validation, automatic task assignment, inter-annotator agreement evaluation, role assignment for team members, and viewing active time per document and average document annotation time per collaborator.

Final Thoughts

Data labeling is a critical step in the machine learning workflow that directly affects the accuracy and performance of the final model. Accurate and consistent labeling can improve the performance of machine learning algorithms and make them more useful for real-world applications.

Data labeling tools are essential in the machine learning workflow by simplifying and streamlining the annotation process. The tools in this article are popular data labeling tools offering various features to help teams manage and annotate data efficiently. Their user-friendly interfaces, collaborative capabilities, and advanced annotation tools make them ideal for businesses of all sizes looking to streamline their data labeling processes.

So, if you just started a new Data Science project, try one or more of the data labeling tools in this article!

The post Top 6 Data Labeling Tools To Use In 2023 appeared first on Towards Data Science.

]]>
Looking to seamlessly integrate your time series annotation in your ML workflow? Look no further https://towardsdatascience.com/looking-to-seamlessly-integrate-your-time-series-annotation-in-your-ml-workflow-look-no-further-d7721c8f59e0/ Wed, 22 Jun 2022 13:52:42 +0000 https://towardsdatascience.com/looking-to-seamlessly-integrate-your-time-series-annotation-in-your-ml-workflow-look-no-further-d7721c8f59e0/ Running Label Studio on Amazon SageMaker to seamlessly integrate labeling into your machine learning workflow

The post Looking to seamlessly integrate your time series annotation in your ML workflow? Look no further appeared first on Towards Data Science.

]]>
Photo by Viktor Forgacs on Unsplash

I’m working a lot on time series Anomaly Detection for industrial use cases, and most of the time I rely on unsupervised approaches. Yet, semi-supervised approaches can add significant incremental value. In other situations, you might also want to confirm unsupervised model outputs, and having a labeling tool that integrates easily into your workflow becomes a must.

This is where Label Studio comes in!

Some time ago, a colleague of mine (Sofian, who you can follow here) wrote the following article to explain how to deploy Label Studio on Amazon SageMaker:

Labeling data with Label Studio on SageMaker

I’ve been toying away with this open source package to label time series data in the past: I thought this was the perfect time to expose how I integrate this labeling tool in my machine learning workflow.

In this article, I will show you the notebook I run to automatically deploy a Label Studio instance in my SageMaker environment. I will then expose how I configure my annotation environment automatically to deal with the structure of the time series data I would like to annotate.

I encourage you to follow along with this blog post by browsing to GitHub to grab this series of companion Jupyter notebooks. As the objective is to deploy Label Studio on Amazon SageMaker, you will need an AWS account. Then you can create a SageMaker notebook instance (use a t3.medium type to benefit from the free tier if you have a new account). From there, clone this GitHub repository:

git clone https://github.com/aws-samples/amazon-lookout-for-equipment.git

Navigate into the apps/annotation-label-studio/ folder and open the 1-initialize-label-studio.ipynb notebook.

Before we jump in the step by step process to configure your own environment from scratch, let’s have an overview of what we are going to assemble…

Technology overview

In this article, you are going to deploy a Docker image of Label Studio in a SageMaker notebook instance. You will then connect Label Studio to the Amazon S3 bucket where your time series data will be stored.

Label Studio is a flexible data annotation tool that can be used to label every data type: text, images, tabular data or time series data. In this article, we are going to programmatically configure a custom user interface to label time series data for anomaly detection purpose.

Amazon SageMaker is a managed machine learning service that helps you build, train, and deploy machine learning models for any use case with fully managed infrastructure, tools, and workflows. In this article, we are going to use the managed JupyterLab experience offered by SageMaker Notebook Instances.

Installing Label Studio

First, we are going to download the Label Studio docker image and deploy it in our notebook environment. To do this, we need to configure some parameters:

The first notebook generates a shell script which will run a dockerized version of Label Studio. This instance will be configured with an initial user that you can customize by changing the username, password, and token. Note that the username must follow an email address format; otherwise, the user won’t be created when the Label Studio instance is launched. If you don’t create a user at this stage, you will have the opportunity to create one when you sign in to the application.
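As a rough sketch, the script-generating cell could look like the following; the Docker image name, port mapping, and environment variable names are assumptions based on Label Studio’s Docker distribution, and the user values are placeholders:

username = "labeler@example.com"  # placeholder initial user (must look like an email)
password = "change-me"            # placeholder password
token = "2edfe403f2f326e810b9553f8f5423bf04437341"  # see the token generation below

script = f"""#!/bin/bash
docker run -it -p 8080:8080 \\
    -e LABEL_STUDIO_USERNAME={username} \\
    -e LABEL_STUDIO_PASSWORD={password} \\
    -e LABEL_STUDIO_USER_TOKEN={token} \\
    -v $(pwd)/label-studio-data:/label-studio/data \\
    heartexlabs/label-studio:latest
"""

with open("label-studio.sh", "w") as f:
    f.write(script)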

The token can of course be generated randomly. For instance, you could use the following code for this:
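A one-liner based on Python’s standard secrets module is enough (a sketch, not necessarily the exact snippet from the notebook):

import secrets

token = secrets.token_hex(20)  # 20 random bytes -> a 40-character hexadecimal token
print(token)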

This will generate a token that looks like this one: 2edfe403f2f326e810b9553f8f5423bf04437341.

A get_notebook_name() helper is also used to generate the URL of your Label Studio instance.
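On a SageMaker notebook instance, one way to implement it is to read the instance name from the local resource metadata file; the file path and key name are assumptions about the notebook environment:

import json

import boto3

def get_notebook_name():
    # Assumption: SageMaker notebook instances expose their name in this metadata file.
    with open("/opt/ml/metadata/resource-metadata.json") as f:
        return json.load(f)["ResourceName"]

region = boto3.Session().region_name
label_studio_url = f"https://{get_notebook_name()}.notebook.{region}.sagemaker.aws/proxy/8080/"
print(label_studio_url)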

Once you have your shell script generated, you can run it from a cell in Jupyter by running !source ./label-studio.sh. The first time, it will download the docker image for Label Studio. Then it will run it with the parameters you defined above. After a few seconds, you should see the following message:

Django version 3.1.14, using settings 'core.settings.label_studio'
Starting development server at http://0.0.0.0:8080/
Quit the server with CONTROL-C.

This means your Label Studio instance is up and running!

Time to go and configure it to suit the time series dataset you want to label…

Preparing an example dataset

If you’re following this article with the companion GitHub repo, you can now open the second Jupyter notebook (2-configure-label-studio.ipynb) while leaving the other notebook running. You should see an hourglass icon next to the JupyterLab tab name in your browser. That’s your cue that a process is actually running (in this case, your Label Studio instance).

I put a synthetic time series dataset in the repo: if you’re interested in how this dataset was created, feel free to check out this article:

Unhappy about your time series anomalies? Synthesize them!

You can of course use your own time series dataset if you have one ready! You can store your dataset locally in your instance and let Label Studio access it from there.

However, if you’re an AWS user, most of the time you may already have your data stored in an Amazon S3 Bucket. Label Studio must then access your data from there, which, by default is not authorized. To enable this access, you need to enable cross-origin resource sharing (CORS) for your S3 Bucket. CORS defines a way for client web applications that are loaded in one domain (in our case, our Label Studio instance running in a SageMaker notebook instance) to interact with resources in a different domain (your dataset stored in Amazon S3). To do this, you can check out the CORS documentation and use the following JSON document to configure your access. You will need to update the AllowedOrigins parameter below:

[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET"
        ],
        "AllowedOrigins": [
            "https://<<notebook_name>>.notebook.<<current_region>>.sagemaker.aws"
        ],
        "ExposeHeaders": [
            "x-amz-server-side-encryption",
            "x-amz-request-id",
            "x-amz-id-2"
        ],
        "MaxAgeSeconds": 3000
    }
]

Time to configure your annotation template to match the data you want to label…

Configuring your Label Studio instance

Let’s assume you now have a dataset ready and loaded into a pandas dataframe. The next step is to configure an annotation template. Label Studio comes with several existing templates. Your template will, however, depend on how many time series (or channels) you have in your file. Building a customized template adapted to your dataset is a two-step process. First, we build a list of channels, one for each field in your multivariate time series dataset:

A given channel will take the following format:

<Channel
    column="signal_00" 
    legend="signal_00" 
    strokeColor="#1f77b4" 
    displayFormat=",.1f" 
/>

You can customize the name of the channel and the color that will be used to plot the time series.

Then, you use this channel_fields to generate the annotation template:
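As a sketch of both steps, the channel list can be derived from the dataframe columns and then wrapped into a time series labeling configuration; the df variable, the timestamp column name, the csv_url task key, and the exact TimeSeries/TimeSeriesLabels structure are assumptions based on Label Studio’s time series template:

plot_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]

# One <Channel> entry per signal column of the dataframe (df is the loaded dataset).
channel_fields = "\n".join(
    f'<Channel column="{col}" legend="{col}" '
    f'strokeColor="{plot_colors[i % len(plot_colors)]}" displayFormat=",.1f" />'
    for i, col in enumerate(df.columns) if col != "timestamp"
)

label_config = f"""
<View>
    <TimeSeries name="ts" valueType="url" value="$csv_url" sep="," timeColumn="timestamp">
        {channel_fields}
    </TimeSeries>
    <TimeSeriesLabels name="label" toName="ts">
        <Label value="Anomaly" background="red"/>
    </TimeSeriesLabels>
</View>
"""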

Our template is ready, we will now:

  • Create a new annotation project
  • Configure the storage configuration
  • Log into our Label Studio instance and create some labels
  • Collect the results so that they can be further used in your machine learning pipeline

Creating a new annotation project

We will use the Label Studio API to interact with our Label Studio instance. Creating a project requires you to use the create project API:
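A minimal sketch of that call with the requests library follows; the endpoint path, payload fields, and token-based authorization header are assumptions based on the Label Studio API:

import requests

LABEL_STUDIO_URL = "http://localhost:8080"  # the dockerized instance listens on port 8080
headers = {"Authorization": f"Token {token}"}

response = requests.post(
    f"{LABEL_STUDIO_URL}/api/projects",
    headers=headers,
    json={
        "title": "Time series anomaly labeling",
        "label_config": label_config,  # the annotation template built above
    },
)
project_id = response.json()["id"]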

After running this code, a new project will be created for your user (identified by the token variable above) in your Label Studio environment. We can now connect Label Studio to your data storage.

Configure storage to use Amazon S3 as a data source

Using the S3 configuration API from Label Studio, you can tell it where to find the time series data to label:

To configure this data location, you need to know the bucket and prefix where your time series data will be located on Amazon S3. However, you will also need AWS credentials to be passed to Label Studio. These credentials can be collected from the current SageMaker environment with the following piece of code:
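A sketch of both pieces, fetching temporary credentials from the notebook’s execution role and attaching the bucket as a source storage, could look like this; the bucket name and prefix are placeholders, and the storage endpoint and payload fields are assumptions based on the Label Studio API:

import boto3
import requests

credentials = boto3.Session().get_credentials().get_frozen_credentials()

storage_payload = {
    "project": project_id,
    "title": "Time series data",
    "bucket": "my-timeseries-bucket",   # placeholder bucket name
    "prefix": "label-studio/input/",    # placeholder prefix
    "regex_filter": ".*csv",
    "use_blob_urls": True,
    "aws_access_key_id": credentials.access_key,
    "aws_secret_access_key": credentials.secret_key,
    "aws_session_token": credentials.token,
}

response = requests.post(
    f"{LABEL_STUDIO_URL}/api/storages/s3",
    headers=headers,
    json=storage_payload,
)
storage_id = response.json()["id"]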

Once your project is created, you can sync it: when synchronizing, Label Studio searches for valid files (CSV in our case) in the configured data source and adds them to your project so that you can start your labeling work. Triggering a sync is simple enough and just requires the identifier returned by the previous API call:
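A sketch of the sync call, assuming the identifier is the one returned when the S3 storage was attached and that the sync endpoint follows the /api/storages/s3/{id}/sync pattern:

# Trigger a scan of the bucket: each valid CSV file becomes a labeling task.
requests.post(
    f"{LABEL_STUDIO_URL}/api/storages/s3/{storage_id}/sync",
    headers=headers,
)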

Our project is now ready and some annotation tasks should have been added after synchronization…

Creating some labels

We will now access our Label Studio instance by opening this link in a new tab of our browser:

https://{notebook_name}.notebook.{region}.sagemaker.aws/proxy/8080/

You will have to replace the variables in curly braces in this URL with your own parameters. You should see the login page, and you can use the credentials you configured in the first notebook to log in:

Label Studio login page (image by author)

Once logged in, you should see a project already populated and synced (we can see a 0/1 tasks under the project title, which means there’s one outstanding annotation task):

A Label Studio project is ready (image by author)

Click on this project tile to bring up the time series to annotate. Each time series dataset will appear as an individual task to label:

Time series tasks to annotate (image by author)

Scroll down to the bottom of the time series view on the right and reduce the time period using the overview slider until the time series plot appears. You can then start labeling your data (check out the Label Studio website if you want to know more about the actual labeling process):

Annotation process in progress (image by author)

Once you have a few labels done, scroll up and click on the Submit button. The annotations are saved in the local database from Label Studio (you can also configure a target location on Amazon S3). You can now collect your results!

Collecting the annotations

Use the following API call to get the labels from your previous labeling job and save them in a CSV format ready to be used by an anomaly detection machine learning service such as Amazon Lookout for Equipment:
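A sketch of the export step follows; the export endpoint and the JSON layout of the annotations (tasks, annotations, result, value with start/end fields) are assumptions based on the Label Studio export format:

import pandas as pd
import requests

response = requests.get(
    f"{LABEL_STUDIO_URL}/api/projects/{project_id}/export?exportType=JSON",
    headers=headers,
)
tasks = response.json()

# Flatten the labeled ranges into a simple start/end table.
ranges = []
for task in tasks:
    for annotation in task.get("annotations", []):
        for result in annotation.get("result", []):
            value = result.get("value", {})
            if "start" in value and "end" in value:
                ranges.append({"start": value["start"], "end": value["end"]})

pd.DataFrame(ranges).to_csv("labels.csv", index=False)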

And this is it! Congratulations on reading this far: you now know how to seamlessly integrate a labeling workflow into your time series anomaly detection process!

Conclusion

In this article, you learned how to…

  • deploy a Label Studio instance with Docker on a local SageMaker Notebook instance.
  • create a user, a project and configure it to access your time series data from an Amazon S3 bucket.
  • authorize CORS from your Amazon S3 Bucket to allow Label Studio to directly access your time series data from there without the need to copy it locally in your instance.
  • collect your annotation results once a labeling job is done.

I hope you found this article insightful: feel free to leave me a comment here and don’t hesitate to subscribe to my Medium email feed if you don’t want to miss my upcoming posts! Want to support me and future work? Join Medium with my referral link:

Join Medium with my referral link – Michaël HOARAU

The post Looking to seamlessly integrate your time series annotation in your ML workflow? Look no further appeared first on Towards Data Science.

]]>
Weak Supervision: Labeling Your Data Without Actually Labeling It 🤔 https://towardsdatascience.com/weak-supervision-with-snorkel-for-multilabel-classification-tasks-c7af4990ea45/ Mon, 23 May 2022 20:08:28 +0000 https://towardsdatascience.com/weak-supervision-with-snorkel-for-multilabel-classification-tasks-c7af4990ea45/ Label your data programmatically!

The post Weak Supervision: Labeling Your Data Without Actually Labeling It 🤔 appeared first on Towards Data Science.

]]>
Hands-on Tutorials
Photo by Swanson Chan on Unsplash
Table of Contents:
· Exploratory Data Analysis
· Keyword Labeling Functions
· Heuristic Labeling Functions
· Labeling Functions with spaCy
· Combining Labeling Function Outputs
· Training a Classifier
· Wrapping Up

There was a radical idea to entirely eliminate hand-labeling any training data in machine learning projects. It birthed snorkel, a powerful library to programmatically build training data.

There are three programmatic operations in snorkel:

  1. Labeling functions, e.g., using heuristic rules to label data
  2. Transformation functions, e.g., performing data augmentation
  3. Slicing functions, e.g., slicing data into subsets for targeted improvement

In this story, we will focus on labeling functions. The key idea is that labeling functions don’t need to be perfectly accurate. Snorkel will combine these output labels from many noisy heuristic labeling strategies to produce reliable training labels.

This process is widely known as Weak Supervision.

We use a dataset named problems_preprocessed.json that has three keys: a text key containing math problems in LaTeX format, a tags key containing one or two labels among algebra, combinatorics, geometry, or number theory, and a token key containing the preprocessed text. The raw dataset is the same dataset we used in my previous story.

Active Learning: A Practical Approach to Improve Your Data Labeling Experience

To recall, the text is preprocessed (only now without stemming) through several steps to obtain a clean token. We don’t do stemming since we need the original words to build keyword labeling functions. Please kindly visit the previous story for the preprocessing detail.

Define a "problem" as an observation/data point in text. Note that a problem can be categorized into more than one label. For example, this problem below is categorized as algebra and combinatorics.

There are 181 distinct problems in total. Let’s add another column named wordcount, which is the number of words in token.

As you may notice, several problems have tags but many others don’t. The problems that have tags are test problems for the final evaluation of our classifier; these tags are hand-labeled to ensure correctness. The problems that don’t have tags are train problems, to be labeled using weak supervision. We see that there are 135 distinct train problems and 46 distinct test problems.

Train data shape: (135, 4)
Test data shape: (46, 4)

Next, transform the tags in test data into 4 binary columns representing algebra, combinatorics, geometry, and number theory in that order, so we can proceed to modeling, then concatenate the result back to the test data.

Exploratory Data Analysis

To create labeling functions, you need at least some idea of the dataset. Hence, EDA is very important. First off, you can visualize the tokens in a word cloud for each tag.

Image by author

Some words are strongly associated with a tag. For example, if a problem contains the phrase "real numbers", then it’s most likely an algebra problem, while geometry problems contain words like "triangle" or "circle".

Some tokens could be cleaned further, such as "let" and "prove", which don’t point to any particular tag since almost every problem contains these command words. However, since we only do heuristic labeling here, we can simply ignore these words when creating labeling functions instead of doing more extensive cleaning.

Remember wordcount? We can also use this information to form labeling functions. Look at the distribution plot below.

Image by author

It’s apparent that combinatorics problems are longer: they have many words in them! This makes sense since combinatorics problems sometimes convey some sort of story, such as this one below.

m boys and n girls (m>n) sat across a round table, supervised by a teacher, and they did a game, which went like this. At first, the teacher pointed a boy to start the game. The chosen boy put a coin on the table. Then, consecutively in a clockwise order, everyone did his turn. If the next person is a boy, he will put a coin to the existing pile of coins. If the next person is a girl, she will take a coin from the existing pile of coins. If there is no coin on the table, the game ends. Notice that depending on the chosen boy, the game could end early, or it could go for a full turn. If the teacher wants the game to go for at least a full turn, how many possible boys could be chosen?

We can safely say that a problem that has more than 60 words is a combinatorics problem.

Next, let’s define some variables for easy code readability.

Keyword Labeling Functions

There are several techniques to create labeling functions. The easiest one is using keywords. From EDA, we can pick dominant keywords in each tag. For example, if a problem contains the words "prime" or "integer", we label it as number theory.

We build 3 keyword labeling functions for each tag, yielding 12 labeling functions in total. Note that some labeling functions have more than one keyword. If a problem contains none of the keywords, the labeling function abstains.

One way to make labeling functions is by using the LabelingFunction class, which accepts a Python function that implements the core labeling logic.
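A condensed sketch of these keyword labeling functions is shown below; the label constants follow the tag order used throughout the article, and the keyword choices shown are only a subset for illustration:

from snorkel.labeling import LabelingFunction

ABSTAIN, ALGEBRA, COMBINATORICS, GEOMETRY, NUMBER_THEORY = -1, 0, 1, 2, 3

def keyword_lookup(x, keywords, label):
    # Label the problem if any keyword appears in its cleaned token string.
    return label if any(word in x.token for word in keywords) else ABSTAIN

def make_keyword_lf(keywords, label):
    return LabelingFunction(
        name=f"keyword_{keywords[0].replace(' ', '_')}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )

lf_real_numbers = make_keyword_lf(["real numbers"], ALGEBRA)
lf_triangle_circle = make_keyword_lf(["triangle", "circle"], GEOMETRY)
lf_prime_integer = make_keyword_lf(["prime", "integer"], NUMBER_THEORY)
# ...and so on, three keyword labeling functions per tag (12 in total).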

If the true labels of training data are not available such as in our case now, there are 4 summary statistics in snorkel to evaluate labeling functions:

  1. Polarity: the set of unique labels each labeling function outputs, excluding abstains
  2. Coverage: the fraction of the dataset each labeling function labels
  3. Overlaps: the fraction of the dataset where each labeling function and at least one other labeling function emit a label
  4. Conflicts: the fraction of the dataset where each labeling function and at least one other labeling function emit a label, and they disagree

Since adding false positives will increase coverage, having high coverage is not always good. Labeling functions can be applied to training data using PandasLFApplier class.
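A sketch of applying the labeling functions and printing their summary statistics:

from snorkel.labeling import LFAnalysis, PandasLFApplier

lfs = [lf_real_numbers, lf_triangle_circle, lf_prime_integer]  # plus the other labeling functions

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)  # (n_problems x n_labeling_functions) label matrix

# Polarity, coverage, overlaps, and conflicts for each labeling function.
print(LFAnalysis(L=L_train, lfs=lfs).lf_summary())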

100%|███████████████████████████| 135/135 [00:00<00:00, 2327.52it/s]

Heuristic Labeling Functions

Next, from the EDA, we agreed that a problem with more than 60 words is to be labeled as a combinatorics problem. So, let’s make a labeling function to do just that and leave problems as abstains if their word counts are less than or equal to 60.

Here, we use the @labeling_function decorator, which can be applied to any Python function that returns a label for a single observation, to make the labeling function as follows.
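A sketch of this rule-based labeling function:

from snorkel.labeling import labeling_function

@labeling_function()
def lf_long_problem(x):
    # Problems longer than 60 words are labeled as combinatorics.
    return COMBINATORICS if x.wordcount > 60 else ABSTAIN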

100%|███████████████████████████| 135/135 [00:00<00:00, 4784.38it/s]

Labeling Functions with spaCy

Now, for a little bit more advanced implementation, we don’t use raw data as before to derive labeling functions. Instead, we take advantage of the spaCy library, which is made easy by the @nlp_labeling_function decorator in snorkel.

We use spaCy to recognize entities labeled as "PERSON" in the problems. A problem is then tagged as combinatorics if it contains such an entity; this is useful since combinatorics problems, as explained before, sometimes convey some sort of story about a person. Otherwise, the labeling function abstains.
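A sketch of this spaCy-backed labeling function; the decorator runs the text field through a spaCy pipeline and exposes the parsed document as x.doc:

from snorkel.labeling.lf.nlp import nlp_labeling_function

@nlp_labeling_function()
def lf_has_person(x):
    # Problems that mention a person are labeled as combinatorics.
    if any(ent.label_ == "PERSON" for ent in x.doc.ents):
        return COMBINATORICS
    return ABSTAIN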

100%|█████████████████████████████| 135/135 [00:02<00:00, 61.12it/s]

Combining Labeling Function Outputs

We now have 14 labeling functions, which are expected to overlap or conflict with each other. Snorkel has the ability to combine and denoise their outputs.

But first, let’s create the calc_score function to calculate the weighted precision, recall, and f1 score of the test data between the true and predicted labels.
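A minimal sketch built on scikit-learn, matching the score dictionaries printed later in the article:

from sklearn.metrics import precision_recall_fscore_support

def calc_score(y_true, y_pred):
    # Weighted scores over the 4 binary label columns.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {"precision": f"{precision:.2f}", "recall": f"{recall:.2f}", "f1": f"{f1:.2f}"}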

Don’t forget to also apply the labeling functions to test data as follows, since we can only evaluate the performance of labeling functions on test data.
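This reuses the same applier (a sketch, assuming the applier object defined above):

L_test = applier.apply(df=df_test)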

100%|███████████████████████████████| 46/46 [00:00<00:00, 63.67it/s]

Now, how exactly do we combine the outputs from many labeling functions into one or more labels for each observation? A simple way is using what we call MajorityLabelVoter, where the chosen label is the one voted for by the majority of labeling functions. To understand how it works, let’s take a look at the first five observations from the test data. We have these labels:

array([[-1,  0, -1, -1, -1, -1, -1, -1, -1,  3, -1, -1, -1,  1],
       [-1, -1, -1, -1, -1, -1, -1, -1, -1,  3, -1, -1, -1, -1],
       [-1, -1, -1, -1, -1, -1,  2,  2, -1, -1, -1, -1, -1,  1],
       [-1, -1, -1, -1, -1, -1,  2, -1, -1, -1, -1, -1, -1, -1],
       [-1, -1, -1, -1,  1, -1,  2, -1, -1, -1, -1, -1, -1,  1]])

It’s a 5 × 14 matrix since we have 14 labeling functions. Each element represents a label where -1 means abstain. Let’s remove the abstains to see what labels have been chosen.

[0, 3, 1]
[3]
[2, 2, 1]
[2]
[1, 2, 1]

Now it becomes clearer. For example, we can understand that for the third observation (labels displayed as [2, 2, 1] above), 2 out of 14 labeling functions output GEOMETRY and 1 labeling function outputs COMBINATORICS. Let’s call MajorityLabelVoter with cardinality 4 (since there are 4 tags) and see what happens.
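A sketch of the call:

from snorkel.labeling.model import MajorityLabelVoter

majority_model = MajorityLabelVoter(cardinality=4)
preds_test = majority_model.predict(L=L_test)
print(preds_test[:5])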

array([-1,  3,  2,  2,  1])

We observe that MajorityLabelVoter has three conditions:

  1. If there’s only one labeling function that votes, it outputs the corresponding label.
  2. If more than one labeling function votes and one label receives the most votes, it outputs that dominant label.
  3. If more than one labeling function votes and two (or more) labels are tied for the most votes, it outputs abstain.

All in all, MajorityLabelVoter outputs a single label for each observation, which is not what we really want since we are working with a multilabel classification task. In fact, snorkel doesn’t natively support multilabel classification.

To solve this problem, we need a workaround. We will use the predict_proba method from MajorityLabelVoter.
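A sketch of the call:

probs_test = majority_model.predict_proba(L=L_test)
print(probs_test[:5])  # rows where every labeling function abstained come back uniform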

array([[0.33333333, 0.33333333, 0.        , 0.33333333],
       [0.        , 0.        , 0.        , 1.        ],
       [0.        , 0.        , 1.        , 0.        ],
       [0.        , 0.        , 1.        , 0.        ],
       [0.        , 1.        , 0.        , 0.        ]])

As expected, it gives a nonzero value to all (dominant) labels equally for each observation. Now, we interpret these nonzero values as labels that are selected by MajorityLabelVoter. In other words, the final label y_pred is a boolean matrix with element 1 if and only if the corresponding element of probs_test is nonzero. Hence the final label prediction is as follows.
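A sketch of that conversion:

import numpy as np

y_pred = np.where(probs_test > 0, 1, 0)
print(y_pred[:5])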

array([[1, 1, 0, 1],
       [0, 0, 0, 1],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 1, 0, 0]])

We see that the labels serve their purpose for the multilabel classification task, i.e. there are some observations with multiple 1s as labels. Calculate the weighted precision, recall, and f1 score using calc_score function.

{'precision': '0.70', 'recall': '0.90', 'f1': '0.77'}

We obtain 0.70 precision, 0.77 f1 score, with suspiciously high recall. However, this is predictable since our method above labels abstains as [1, 1, 1, 1] hence giving many false positives and indirectly leaving a small margin for false negatives.

Training a Classifier

The output of MajorityLabelVoter is just a set of labels that can be used with the most popular libraries for performing supervised learning. In this story, we use the Logistic Regression from the scikit-learn library. To be precise, we will first extract text features into a matrix of TF-IDF, then employ logistic regression with balanced class weight (to address class imbalance). The model will be trained in a multiclass setting as One-Vs-The-Rest.

Our training data would be df_train with its label y_train, where y_train is a boolean matrix with element 1 if and only if the corresponding element of probs_train is nonzero. However, we need to be careful. There may exist some observations that are not covered in our labeling functions. Hence, we need to filter out these observations using filter_unlabeled_dataframe.

Lastly, train the model, predict on df_test, and calculate the score.
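A sketch of the full training step follows; the y_test variable is assumed to be the 4 binary tag columns built from the hand-labeled test tags earlier:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from snorkel.labeling import filter_unlabeled_dataframe

# Keep only train problems covered by at least one labeling function.
probs_train = majority_model.predict_proba(L=L_train)
df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
    X=df_train, y=probs_train, L=L_train
)
y_train = np.where(probs_train_filtered > 0, 1, 0)

# TF-IDF features + one-vs-rest logistic regression with balanced class weights.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(df_train_filtered.token)
X_test = vectorizer.transform(df_test.token)

clf = OneVsRestClassifier(LogisticRegression(class_weight="balanced", max_iter=1000))
clf.fit(X_train, y_train)

print(calc_score(y_test, clf.predict(X_test)))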

{'precision': '0.83', 'recall': '0.80', 'f1': '0.79'}

We observe an overall boost in scores over the MajorityLabelVoter, with no suspiciously high recall! This is in part because the discriminative model generalizes beyond the labeling functions’ labels and makes good predictions on all data points, not just the ones covered by labeling functions; it learns to generalize beyond the noisy labeling heuristics.

Wrapping Up

We’ve been introduced to Weak Supervision, a data labeling method that doesn’t require manually labeling the data. Even though it may sound too good to be true, weak supervision cuts labeling time from months to days, even hours, with reliable results. We use snorkel to do so, and successfully devise a way to handle a multilabel classification task using the predict_proba method from MajorityLabelVoter.

The idea of weak supervision is to combine the outputs of many labeling functions which are used to programmatically label data. In our case, we found that this method gives high recall. Then, we showed that a classifier trained on a weakly supervised dataset can outperform an approach based on the labeling functions alone as it learns to generalize beyond the noisy heuristics we provide.


🔥 Hi there! If you enjoy this story and want to support me as a writer, consider becoming a member. For only $5 a month, you’ll get unlimited access to all stories on Medium. If you sign up using my link, I’ll earn a small commission.

🔖 Want to know more about how classical Machine Learning models work and how they optimize their parameters? Or an example of MLOps megaprojects? What about cherry-picked top-notch articles of all time? Continue reading:

Machine Learning from Scratch

Advanced Optimization Methods

MLOps Megaproject

My Best Stories

Data Science in R


[1] Snorkel’s API documentation, https://snorkel.readthedocs.io/en/v0.9.7/

[2] Snorkel Intro Tutorial: Data Labeling, https://www.snorkel.org/use-cases/01-spam-tutorial

The post Weak Supervision: Labeling Your Data Without Actually Labeling It 🤔 appeared first on Towards Data Science.

]]>