Learnings from a Machine Learning Engineer — Part 6: The Human Side

Practical advice for the humans involved with machine learning

In my previous articles, I have spent a lot of time talking about the technical aspects of an Image Classification problem, from data collection, model evaluation, and performance optimization to a detailed look at model training.

These elements require a certain degree of in-depth expertise, and they (usually) have well-defined metrics and established processes that are within our control.

Now it’s time to consider…

The human aspects of machine learning

Yes, this may seem like an oxymoron! But it is the interaction with people — the ones you work with and the ones who use your application — that helps bring the technology to life and provides a sense of fulfillment to your work.

These human interactions include:

  • Communicating technical concepts to a non-technical audience.
  • Understanding how your end-users engage with your application.
  • Providing clear expectations on what the model can and cannot do.

I also want to touch on the impact to people’s jobs, both positive and negative, as AI becomes a part of our everyday lives.

Overview

As in my previous articles, I will gear this discussion around an image classification application. With that in mind, these are the groups of people involved with your project:

  • AI/ML Engineer (that’s you) — bringing life to the Machine Learning application.
  • MLOps team — your peers who will deploy, monitor, and enhance your application.
  • Subject matter experts — the ones who will provide the care and feeding of labeled data.
  • Stakeholders — the ones who are looking for a solution to a real world problem.
  • End-users — the ones who will be using your application. These could be internal and external customers.
  • Marketing — the ones who will be promoting usage of your application.
  • Leadership — the ones who are paying the bill and need to see business value.

Let’s dive right in…

AI/ML Engineer

You may be a part of a team or a lone wolf. You may be an individual contributor or a team leader.


Whatever your role, it is important to see the whole picture — not only the coding, the data science, and the technology behind AI/ML — but the value that it brings to your organization.

Understand the business needs

Your company faces many challenges: reducing expenses, improving customer satisfaction, and remaining profitable. Position yourself as someone who can create an application that helps achieve those goals.

  • What are the pain points in a business process?
  • What is the value of using your application (time savings, cost savings)?
  • What are the risks of a poor implementation?
  • What is the roadmap for future enhancements and use-cases?
  • What other areas of the business could benefit from the application, and what design choices will help future-proof your work?

Communication

Deep technical discussions with your peers are probably your comfort zone. However, to be a more successful AI/ML Engineer, you should be able to clearly explain your work to different audiences.

With practice, you can explain these topics in ways that your non-technical business users can follow along with, and understand how your technology will benefit them.

To help you get comfortable with this, try creating a PowerPoint with 2–3 slides that you can cover in 5–10 minutes. For example, explain how a neural network can take an image of a cat or a dog and determine which one it is.

Practice giving this presentation in your mind, to a friend — even your pet dog or cat! This will get you more comfortable with the transitions, tighten up the content, and ensure you cover all the important points as clearly as possible.

  • Be sure to include visuals — pure text is boring, graphics are memorable.
  • Keep an eye on time — respect your audience’s busy schedule and stick to the 5–10 minutes you are given.
  • Put yourself in their shoes — your audience is interested in how the technology will benefit them, not in how smart you are.

Creating a technical presentation is a lot like the Feynman Technique — explaining a complex subject to your audience by breaking it into easily digestible pieces, with the added benefit of helping you understand it more completely yourself.

MLOps team

These are the people that deploy your application, manage data pipelines, and monitor infrastructure that keeps things running.

Without them, your model lives in a Jupyter notebook and helps nobody!


These are your technical peers, so you should be able to connect with their skill set more naturally. Together you speak a jargon that sounds like a foreign language to most people. Even so, it is extremely helpful to create documentation that sets expectations around:

  • Process and data flows.
  • Data quality standards.
  • Service level agreements for model performance and availability.
  • Infrastructure requirements for compute and storage.
  • Roles and responsibilities.

It is easy to have a more informal relationship with your MLOps team, but remember that everyone is trying to juggle many projects at the same time.

Email and chat messages are fine for quick-hit issues. But for larger tasks, you will want a system to track things like user stories, enhancement requests, and break-fix issues. This way you can prioritize the work and ensure you don’t forget something. Plus, you can show progress to your supervisor.

Some great tools exist, such as:

  • Jira, GitHub, Azure DevOps Boards, Asana, Monday, etc.

We are all professionals, so having a more formal system to avoid miscommunication and mistrust is good business.

Subject matter experts

These are the team members that have the most experience working with the data that you will be using in your AI/ML project.


SMEs are very skilled at dealing with messy data — they are human, after all! They can handle one-off situations by considering knowledge outside of their area of expertise. For example, a doctor may recognize metal inserts in a patient’s X-ray that indicate prior surgery. They may also notice a faulty X-ray image due to equipment malfunction or technician error.

However, your machine learning model only knows what it knows, which comes from the data it was trained on. So, those one-off cases may not be appropriate for the model you are training. Your SMEs need to understand that clear, high quality training material is what you are looking for.

Think like a computer

In the case of an image classification application, the output from the model communicates to you how well it was trained on the data set. This comes in the form of error rates, which is very much like when a student takes an exam and you can tell how well they studied by seeing how many questions — and which ones — they get wrong.

In order to reduce error rates, your image data set needs to be objectively “good” training material. To do this, put yourself in an analytical mindset and ask yourself:

  • What images will the computer get the most useful information out of? Make sure all the relevant features are visible.
  • What is it about an image that confused the model? When it makes an error, try to understand why — objectively — by looking at the entire picture.
  • Is this image a “one-off” or a typical example of what the end-users will send? Consider creating a new subclass of exceptions to the norm.

Be sure to communicate to your SMEs that model performance is directly tied to data quality and give them clear guidance:

  • Provide visual examples of what works.
  • Provide counter-examples of what does not work.
  • Ask for a wide variety of data points. In the X-ray example, be sure to get patients with different ages, genders, and races.
  • Provide options to create subclasses of your data for further refinement. Use that X-ray from a patient with prior surgery as a subclass, and eventually as you can get more examples over time, the model can handle them.

This also means that you should become familiar with the data they are working with — perhaps not expert level, but certainly above a novice level.

Lastly, when working with SMEs, be cognizant of the impression they may have that the work you are doing is somehow going to replace their job. It can feel threatening when someone asks you how to do your job, so be mindful.

Ideally, you are building a tool with honest intentions and it will enable your SMEs to augment their day-to-day work. If they can use the tool as a second opinion to validate their conclusions in less time, or perhaps even avoid mistakes, then this is a win for everyone. Ultimately, the goal is to allow them to focus on more challenging situations and achieve better outcomes.

I have more to say on this in my closing remarks.

Stakeholders

These are the people you will have the closest relationship with.

Stakeholders are the ones who created the business case to have you build the machine learning model in the first place.


They have a vested interest in having a model that performs well. Here are some key points when working with your stakeholders:

  • Be sure to listen to their needs and requirements.
  • Anticipate their questions and be prepared to respond.
  • Be on the lookout for opportunities to improve your model performance. Your stakeholders may not be as close to the technical details as you are and may not think there is any room for improvement.
  • Bring issues and problems to their attention. They may not want to hear bad news, but they will appreciate honesty over evasion.
  • Schedule regular updates with usage and performance reports.
  • Explain technical details in terms that are easy to understand.
  • Set expectations on regular training and deployment cycles and timelines.

Your role as an AI/ML Engineer is to bring to life the vision of your stakeholders. Your application is making their lives easier, which justifies and validates the work you are doing. It’s a two-way street, so be sure to share the road.

End-users

These are the people who are using your application. They may also be your harshest critics, but you may never even hear their feedback.


Think like a human

Recall above when I suggested to “think like a computer” when analyzing the data for your training set. Now it’s time to put yourself in the shoes of a non-technical user of your application.

End-users of an image classification model communicate their understanding of what’s expected of them by way of poor images. These are like the students who didn’t study for the exam or, worse, didn’t read the questions, so their answers don’t make sense.

Your model may be really good, but if end-users misuse the application or are not satisfied with the output, you should be asking:

  • Are the instructions confusing or misleading? Did the user focus the camera on the subject being classified, or is it more of a wide-angle image? You can’t blame the user if they follow bad instructions.
  • What are their expectations? When the results are presented to the user, are they satisfied or are they frustrated? You may notice repeated images from frustrated users.
  • Are the usage patterns changing? Are they trying to use the application in unexpected ways? This may be an opportunity to improve the model.

Inform your stakeholders of your observations. There may be simple fixes to improve end-user satisfaction, or there may be more complex work ahead.

If you are lucky, you may discover an unexpected way to leverage the application that leads to expanded usage or exciting benefits to your business.

Explainability

Most AI/ML models are considered “black boxes” that perform millions of calculations on extremely high dimensional data and produce a rather simplistic result without any reasoning behind it.

The Answer to the Ultimate Question of Life, the Universe, and Everything is 42.
— The Hitchhiker’s Guide to the Galaxy

Depending on the situation, your end-users may require more explanation of the results, such as with medical imaging. Where possible, you should consider incorporating model explainability techniques such as LIME, SHAP, and others. These responses can help put a human touch to cold calculations.
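
As a rough illustration, here is a minimal sketch of generating a LIME explanation for a single image. It assumes a Keras classifier named model and a preprocessed image array named image (both hypothetical), and requires the lime and scikit-image packages; treat it as a starting point rather than a drop-in solution.

#####   sample LIME explanation sketch   #####
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

def predict_fn(images):
    # LIME passes batches of perturbed copies of the image; return the softmax scores
    return model.predict(np.array(images), verbose=0)

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image,             # a single H x W x 3 numpy array (hypothetical variable)
    predict_fn,
    top_labels=3,
    hide_color=0,
    num_samples=1000,  # more samples is slower but more stable
)

# Overlay the regions that contributed most to the top predicted class
temp, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=False)
overlay = mark_boundaries(temp / 255.0, mask)

Showing an overlay like this next to the predicted label can go a long way toward building trust, especially in sensitive domains like medical imaging.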

Now it’s time to switch gears and consider higher-ups in your organization.

Marketing team

These are the people who promote the use of your hard work. If your end-users are completely unaware of your application, or don’t know where to find it, your efforts will go to waste.

The marketing team controls where users can find your app on your website and how it is linked to through social media channels. They also see the technology through a different lens.

Gartner hype cycle. Image from Wikipedia – https://en.wikipedia.org/wiki/Gartner_hype_cycle

The above hype cycle is a good representation of how technical advancements tend to flow. At the beginning, there can be an unrealistic expectation of what your new AI/ML tool can do — it’s the greatest thing since sliced bread!

Then the “new” wears off and excitement wanes. You may face a lack of interest in your application as the marketing team (as well as your end-users) moves on to the next thing. In reality, the value of your efforts is somewhere in the middle.

Understand that the marketing team’s interest is in promoting the use of the tool because of how it will benefit the organization. They may not need to know the technical inner workings. But they should understand what the tool can do, and be aware of what it cannot do.

Honest and clear communication up-front will help smooth out the hype cycle and keep everyone interested longer. This way the crash from peak expectations to the trough of disillusionment is not so severe that the application is abandoned altogether.

Leadership team

These are the people that authorize spending and have the vision for how the application fits into the overall company strategy. They are driven by factors that you have no control over and you may not even be aware of. Be sure to provide them with the key information about your project so they can make informed decisions.


Depending on your role, you may or may not have direct interaction with executive leadership in your company. Your job is to summarize the costs and benefits associated with your project, even if that is just with your immediate supervisor who will pass this along.

Your costs will likely include:

  • Compute and storage — training and serving a model.
  • Image data collection — both real-world and synthetic or staged.
  • Hours per week — SME, MLOps, AI/ML engineering time.

Highlight the savings and/or value added:

  • Provide measures on speed and accuracy.
  • Translate efficiencies into FTE hours saved and customer satisfaction.
  • Bonus points if you can find a way to produce revenue.

Business leaders, much like the marketing team, may follow the hype cycle:

  • Be realistic about model performance. Don’t try to oversell it, but be honest about the opportunities for improvement.
  • Consider creating a human benchmark test to measure accuracy and speed for an SME. It is easy to say human accuracy is 95%, but it’s another thing to measure it.
  • Highlight short-term wins and how they can become long-term success.

Conclusion

I hope you can see that, beyond the technical challenges of creating an AI/ML application, there are many humans involved in a successful project. Being able to interact with these individuals, and meet them where they are in terms of their expectations from the technology, is vital to advancing the adoption of your application.


Key takeaways:

  • Understand how your application fits into the business needs.
  • Practice communicating to a non-technical audience.
  • Collect measures of model performance and report these regularly to your stakeholders.
  • Expect that the hype cycle could help and hurt your cause, and that setting consistent and realistic expectations will ensure steady adoption.
  • Be aware that factors outside of your control, such as budgets and business strategy, could affect your project.

And most importantly…

Don’t let machines have all the fun learning!

Human nature gives us the curiosity we need to understand our world. Take every opportunity to grow and expand your skills, and remember that human interaction is at the heart of machine learning.

Closing remarks

Advancements in AI/ML have the potential (assuming they are properly developed) to do many tasks as well as humans. It would be a stretch to say “better than” humans, because a model can only be as good as the training data that humans provide. However, it is safe to say AI/ML can be faster than humans.

The next logical question would be, “Well, does that mean we can replace human workers?”

This is a delicate topic, and I want to be clear that I am not an advocate of eliminating jobs.

I see my role as an AI/ML Engineer as one of creating tools that aid someone else’s job or enhance their ability to complete their work successfully. When used properly, the tools can validate difficult decisions and speed through repetitive tasks, allowing your experts to spend more time on the one-off situations that require more attention.

There may also be new career opportunities, from the care-and-feeding of data, quality assessment, user experience, and even to new roles that leverage the technology in exciting and unexpected ways.

Unfortunately, business leaders may make decisions that impact people’s jobs, and this is completely out of your control. But all is not lost — even for us AI/ML Engineers…

There are things we can do

  • Be kind to the fellow human beings that we call “coworkers”.
  • Be aware of the fear and uncertainty that comes with technological advancements.
  • Be on the lookout for ways to help people leverage AI/ML in their careers and to make their lives better.

This is all part of being human.

Learnings from a Machine Learning Engineer — Part 5: The Training

In this fifth part of my series, I will outline the steps for creating a Docker container for training your image classification model, evaluating performance, and preparing for deployment.

AI/ML engineers would prefer to focus on model training and data engineering, but the reality is that we also need to understand the infrastructure and mechanics behind the scenes.

I hope to share some tips, not only for getting your training run up and running, but also for streamlining the process in a cost-efficient manner on cloud resources such as Kubernetes.

I will reference elements from my previous articles for getting the best model performance, so be sure to check out Part 1 and Part 2 on the data sets, as well as Part 3 and Part 4 on model evaluation.

Here are the learnings that I will share with you:

  • Infrastructure overview
  • Building your Docker container
  • Executing your training run
  • Deploying your model

Infrastructure overview

First, let me provide a brief description of the setup that I created, specifically around Kubernetes. Your setup may be entirely different, and that is just fine. I simply want to set the stage on the infrastructure so that the rest of the discussion makes sense.

Image management system

This is a server you deploy that provides a user interface for your subject matter experts to label and evaluate images for the image classification application. The server can run as a pod on your Kubernetes cluster, but you may find that a dedicated server with faster disk works better.

Image files are stored in a directory structure like the following, which is self-documenting and easily modified.

Image_Library/
  - cats/
    - image1001.png
  - dogs/
    - image2001.png

Ideally, these files would reside on local server storage (instead of cloud or cluster storage) for better performance. The reason for this will become clear as we see what happens as the image library grows.

Cloud storage

Cloud Storage allows for a virtually limitless and convenient way to share files between systems. In this case, the image library on your management system could access the same files as your Kubernetes cluster or Docker engine.

However, the downside of cloud storage is the latency to open a file. Your image library will have thousands and thousands of images, and the latency to read each file will have a significant impact on your training run time. Longer training runs mean more cost for the expensive GPU processors!

The way I found to speed things up is to create a tar file of your image library on your management system and copy it to cloud storage. Even better is to create multiple tar files in parallel, each containing 10,000 to 20,000 images.

This way you only have network latency on a handful of files (which contain thousands, once extracted) and you start your training run much sooner.
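
Here is a minimal sketch of that sharding idea; the paths and shard size are hypothetical, so adjust them to your own library layout.

#####   sample parallel tar shard creation (hypothetical paths)   #####
import math
import os
import tarfile
from concurrent.futures import ProcessPoolExecutor

IMAGE_LIBRARY = "/data/Image_Library"
OUTPUT_DIR = "/data/tar_shards"
IMAGES_PER_SHARD = 20000

def build_shard(args):
    shard_id, file_paths = args
    shard_path = os.path.join(OUTPUT_DIR, f"image_library_{shard_id:03d}.tar")
    with tarfile.open(shard_path, "w") as tar:
        for path in file_paths:
            # Keep paths relative to the library root so extraction recreates the class folders
            tar.add(path, arcname=os.path.relpath(path, IMAGE_LIBRARY))
    return shard_path

if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    all_files = [os.path.join(root, name)
                 for root, _, names in os.walk(IMAGE_LIBRARY) for name in names]
    shards = [(i, all_files[i * IMAGES_PER_SHARD:(i + 1) * IMAGES_PER_SHARD])
              for i in range(math.ceil(len(all_files) / IMAGES_PER_SHARD))]
    with ProcessPoolExecutor() as pool:
        for shard_path in pool.map(build_shard, shards):
            print(f"Created {shard_path}")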

Kubernetes or Docker engine

A Kubernetes cluster, with proper configuration, will allow you to dynamically scale up/down nodes, so you can perform your model training on GPU hardware as needed. Kubernetes is a rather heavy setup, and there are other container engines that will work.

The technology options change constantly!

The main idea is that you want to spin up the resources you need — for only as long as you need them — then scale down to reduce your time (and therefore cost) of running expensive GPU resources.

Once your GPU node is started and your Docker container is running, you can extract the tar files above to local storage, such as an emptyDir, on your node. The node typically has high-speed SSD disk, ideal for this type of workload. There is one caveat — the storage capacity on your node must be able to handle your image library.

Assuming we are good, let’s talk about building your Docker container so that you can train your model on your image library.

Building your Docker container

Being able to execute a training run in a consistent manner lends itself perfectly to building a Docker container. You can “pin” the version of libraries so you know exactly how your scripts will run every time. You can version control your containers as well, and revert to a known good image in a pinch. What is really nice about Docker is you can run the container pretty much anywhere.

The tradeoff when running in a container, especially with an Image Classification model, is the speed of file storage. You can attach any number of volumes to your container, but they are usually network attached, so there is latency on each file read. This may not be a problem if you have a small number of files. But when dealing with hundreds of thousands of files like image data, that latency adds up!

This is why using the tar file method outlined above can be beneficial.

Also, keep in mind that Docker containers could be terminated unexpectedly, so you should make sure to store important information outside the container, on cloud storage or a database. I’ll show you how below.

Dockerfile

Knowing that you will need to run on GPU hardware (here I will assume Nvidia), be sure to select the right base image for your Dockerfile, such as nvidia/cuda with the “devel” flavor, which contains the right drivers.

Next, you will add the script files to your container, along with a “batch” script to coordinate the execution. Here is an example Dockerfile, and then I’ll describe what each of the scripts will be doing.

#####   Dockerfile   #####
FROM nvidia/cuda:12.8.0-devel-ubuntu24.04

# Install system software
RUN apt-get -y update && apt-get -y upgrade
RUN apt-get install -y python3-pip python3-dev

# Setup python
WORKDIR /app
COPY requirements.txt .
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install -r requirements.txt

# Python and batch scripts
COPY ExtractImageLibrary.py .
COPY Training.py .
COPY Evaluation.py .
COPY ScorePerformance.py .
COPY ExportModel.py .
COPY BulkIdentification.py .
COPY BatchControl.sh .

# Allow for interactive shell
CMD tail -f /dev/null

Dockerfiles are declarative, almost like a cookbook for building a small server — you know what you’ll get every time. Python libraries benefit, too, from this declarative approach. Here is a sample requirements.txt file that loads the TensorFlow libraries with CUDA support for GPU acceleration.

#####   requirements.txt   #####
numpy==1.26.3
pandas==2.1.4
scipy==1.11.4
keras==2.15.0
tensorflow[and-cuda]

Extract Image Library script

In Kubernetes, the Docker container can access local, high speed storage on the physical node. This can be achieved via the emptyDir volume type. As mentioned before, this will only work if the local storage on your node can handle the size of your library.

#####   sample 25GB emptyDir volume in Kubernetes   #####
containers:
  - name: training-container
    volumeMounts:
      - name: image-library
        mountPath: /mnt/image-library
volumes:
  - name: image-library
    emptyDir:
      sizeLimit: 25Gi

You would want to have another volumeMount to your cloud storage where you have the tar files. What this looks like will depend on your provider, or if you are using a persistent volume claim, so I won’t go into detail here.

Now you can extract the tar files — ideally in parallel for an added performance boost — to the local mount point.
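
A minimal sketch of that extraction step, assuming the tar files are visible under a /cloud_storage mount and the emptyDir is mounted at /mnt/image-library (both hypothetical paths):

#####   sample parallel tar extraction (hypothetical paths)   #####
import glob
import tarfile
from concurrent.futures import ProcessPoolExecutor

def extract(tar_path):
    # Each worker extracts one tar file onto the fast local emptyDir volume
    with tarfile.open(tar_path, "r") as tar:
        tar.extractall(path="/mnt/image-library")
    return tar_path

if __name__ == "__main__":
    tar_files = glob.glob("/cloud_storage/image_library_*.tar")
    with ProcessPoolExecutor() as pool:
        for done in pool.map(extract, tar_files):
            print(f"Extracted {done}")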

Training script

As AI/ML engineers, the model training is where we want to spend most of our time.

This is where the magic happens!

With your image library now extracted, we can create our train-validation-test sets, load a pre-trained model or build a new one, fit the model, and save the results.

One key technique that has served me well is to load the most recently trained model as my base. As I discuss in more detail in Part 4 under “Fine tuning”, this results in faster training time and significantly improved model performance.

Be sure to take advantage of the local storage to checkpoint your model during training since the models are quite large and you are paying for the GPU even while it sits idle writing to disk.
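
To make the flow concrete, here is a minimal sketch of what a Training.py along these lines might contain. It assumes a Keras workflow, the extracted library at /mnt/image-library, and hypothetical paths for the checkpoint and the previously trained model; your data pipeline and architecture will certainly differ.

#####   sample Training.py sketch (hypothetical paths)   #####
import os
import tensorflow as tf

IMAGE_DIR = "/mnt/image-library"                    # fast local emptyDir mount
CHECKPOINT_PATH = "/mnt/scratch/checkpoint.keras"   # checkpoint to local disk, not cloud storage
PREVIOUS_MODEL = "/cloud_storage/models/latest.keras"
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", 32))  # passed in from the Kubernetes Job
NUM_EPOCHS = int(os.environ.get("NUM_EPOCHS", 10))

# Build train/validation splits from the class-per-folder layout
train_ds = tf.keras.utils.image_dataset_from_directory(
    IMAGE_DIR, validation_split=0.2, subset="training", seed=42,
    image_size=(224, 224), batch_size=BATCH_SIZE)
val_ds = tf.keras.utils.image_dataset_from_directory(
    IMAGE_DIR, validation_split=0.2, subset="validation", seed=42,
    image_size=(224, 224), batch_size=BATCH_SIZE)

if os.path.exists(PREVIOUS_MODEL):
    # Fine tune from the most recently trained model
    model = tf.keras.models.load_model(PREVIOUS_MODEL)
else:
    # Otherwise start from a pre-trained backbone with a new classifier head
    base = tf.keras.applications.EfficientNetB0(include_top=False, pooling="avg")
    outputs = tf.keras.layers.Dense(
        len(train_ds.class_names), activation="softmax")(base.output)
    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Checkpoint to local disk so a crash does not lose the whole run
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    CHECKPOINT_PATH, monitor="val_loss", save_best_only=True)

model.fit(train_ds, validation_data=val_ds,
          epochs=NUM_EPOCHS, callbacks=[checkpoint])
model.save("/mnt/scratch/final_model.keras")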

This of course raises a concern about what happens if the Docker container dies part-way through the training. The risk is (hopefully) low with a cloud provider, and you may not want an incomplete training anyway. But if that does happen, you will at least want to understand why, and this is where saving the main log file to cloud storage (described below) or to a package like MLflow comes in handy.

Evaluation script

After your training run has completed and you have taken proper precaution on saving your work, it is time to see how well it performed.

Normally this evaluation script will pick up the model that just finished training. But you may decide to point it at a previous model version through an interactive session. This is why I keep the script stand-alone.

Being a separate script, it will need to read the completed model from disk — ideally local disk for speed. I like having two separate scripts (training and evaluation), but you might find it better to combine them to avoid reloading the model.

Now that the model is loaded, the evaluation script should generate predictions on every image in the training, validation, test, and benchmark sets. I save the results as a huge matrix with the softmax confidence score for each class label. So, if there are 1,000 classes and 100,000 images, that’s a table with 100 million scores!

I save these results in pickle files that are then used in the score generation next.
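
As a rough sketch of that step (hypothetical paths, shown for a single dataset; the same loop would run for train, validation, test, and each benchmark set):

#####   sample evaluation scoring sketch (hypothetical paths)   #####
import pickle
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("/mnt/scratch/final_model.keras")  # read from fast local disk

ds = tf.keras.utils.image_dataset_from_directory(
    "/mnt/image-library", image_size=(224, 224), batch_size=64, shuffle=False)

scores = model.predict(ds)                           # one row of softmax scores per image
labels = np.concatenate([y.numpy() for _, y in ds])  # ground truth class indices

with open("/mnt/scratch/scores_full_library.pkl", "wb") as f:
    pickle.dump({"labels": labels, "scores": scores,
                 "class_names": ds.class_names}, f)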

Score Performance script

Taking the matrix of scores produced by the evaluation script above, we can now create various metrics of model performance. Again, this process could be combined with the evaluation script above, but my preference is for independent scripts. For example, I might want to regenerate scores on previous training runs. See what works for you.

Here are some of the sklearn functions that produce useful insights like F1, log loss, AUC-ROC, and the Matthews correlation coefficient.

from sklearn.metrics import average_precision_score, classification_report
from sklearn.metrics import log_loss, matthews_corrcoef, roc_auc_score
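
A sketch of how these might be applied to the pickled score matrix from the evaluation step (the file name is hypothetical):

#####   sample score calculations (hypothetical file name)   #####
import pickle
import numpy as np
from sklearn.metrics import classification_report, log_loss, matthews_corrcoef, roc_auc_score

with open("/mnt/scratch/scores_full_library.pkl", "rb") as f:
    results = pickle.load(f)

y_true = results["labels"]      # ground truth class indices
y_scores = results["scores"]    # softmax matrix, one row per image
y_pred = np.argmax(y_scores, axis=1)
class_ids = list(range(y_scores.shape[1]))

print(classification_report(y_true, y_pred, labels=class_ids,
                            target_names=results["class_names"]))
print("Log loss:", log_loss(y_true, y_scores, labels=class_ids))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("AUC-ROC (one-vs-rest):",
      roc_auc_score(y_true, y_scores, multi_class="ovr", labels=class_ids))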

Aside from these basic statistical analyses for each dataset (train, validation, test, and benchmark), it is also useful to identify:

  • Which ground truth labels get the most errors?
  • Which predicted labels get the most incorrect guesses?
  • How many ground-truth-to-predicted label pairs are there? In other words, which classes are easily confused?
  • What is the accuracy when applying a minimum softmax confidence score threshold?
  • What is the error rate above that softmax threshold?
  • For the “difficult” benchmark sets, do you get a sufficiently high score?
  • For the “out-of-scope” benchmark sets, do you get a sufficiently low score?

As you can see, there are multiple calculations and it’s not easy to come up with a single evaluation to decide if the trained model is good enough to be moved to production.

In fact, for an image classification model, it is helpful to manually review the images that the model got wrong, as well as the ones that got a low softmax confidence score. By actually looking at what the model got wrong, you can get a gut feel for how well it will perform in the real world.

Check out Part 3 for more in-depth discussion on evaluation and scoring.

Export Model script

All of the heavy lifting is done by this point. Since your Docker container will be shut down soon, now is the time to copy the model artifacts to cloud storage and prepare them to be put to use.

The example Python code snippet below is more geared to Keras and TensorFlow. This will take the trained model and export it as a saved_model. Later, I will show how this is used by TensorFlow Serving in the Deploy section below.

# Increment current version of model and create new directory
next_version_dir, version_number = create_new_version_folder()

# Copy model artifacts to the new directory
copy_model_artifacts(next_version_dir)

# Create the directory to save the model export
saved_model_dir = os.path.join(next_version_dir, str(version_number))

# Save the model export for use with TensorFlow Serving
tf.keras.backend.set_learning_phase(0)
model = tf.keras.models.load_model(keras_model_file)
tf.saved_model.save(model, export_dir=saved_model_dir)

This script also copies the other training run artifacts such as the model evaluation results, score summaries, and log files generated from model training. Don’t forget about your label map so you can give human readable names to your classes!

Bulk Identification script

Your training run is complete, your model has been scored, and a new version is exported and ready to be served. Now is the time to use this latest model to assist you in identifying unlabeled images.

As I described in Part 4, you may have a collection of “unknowns” — really good pictures, but no idea what they are. Let your new model provide a best guess on these and record the results to a file or a database. Now you can create filters based on closest match and by high/low scores. This allows your subject matter experts to leverage these filters to find new image classes, add to existing classes, or to remove images that have very low scores and are no good.

By the way, I put this step inside the GPU container since you may have thousands of “unknown” images to process and the accelerated hardware will make light work of it. However, if you are not in a hurry, you could perform this step on a separate CPU node, and shut down your GPU node sooner to save cost. This would especially make sense if your “unknowns” folder is on slower cloud storage.
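
Here is a minimal sketch of what that bulk identification step might look like, with hypothetical paths and a simple CSV output that the filters could be built on:

#####   sample bulk identification sketch (hypothetical paths)   #####
import csv
import glob
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("/mnt/scratch/final_model.keras")
files = sorted(glob.glob("/mnt/image-library/unknowns/*.png"))

def load(path):
    img = tf.keras.utils.load_img(path, target_size=(224, 224))
    return tf.keras.utils.img_to_array(img)

# For a very large "unknowns" folder, process this in chunks instead of one big batch
batch = np.stack([load(p) for p in files])
scores = model.predict(batch, batch_size=64)
best = np.argmax(scores, axis=1)

with open("/cloud_storage/bulk_identification.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "best_guess_class_index", "confidence"])
    for path, cls, row in zip(files, best, scores):
        writer.writerow([path, int(cls), float(row[cls])])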

Batch Control script

All of the scripts described above perform a specific task — from extracting your image library and executing model training, to performing evaluation and scoring, exporting the model artifacts for deployment, and perhaps even bulk identification.

One script to rule them all

To coordinate the entire show, this batch control script gives you the entry point for your container and an easy way to trigger everything. Be sure to produce a log file in case you need to analyze any failures along the way. Also, be sure to write the log to your cloud storage in case the container dies unexpectedly.

#!/bin/bash
# Main batch control script

# Redirect standard output and standard error to a log file
exec > /cloud_storage/batch-logfile.txt 2>&1

python3 /app/ExtractImageLibrary.py
python3 /app/Training.py
python3 /app/Evaluation.py
python3 /app/ScorePerformance.py
python3 /app/ExportModel.py
python3 /app/BulkIdentification.py

Executing your training run

So, now it’s time to put everything in motion…

Start your engines!

Let’s go through the steps to prepare your image library, fire up your Docker container to train your model, and then examine the results.

Image library ‘tar’ files

Your image management system should now create a tar file backup of your data. Since tar is a single-threaded function, you will get a significant speed improvement by creating multiple tar files in parallel, each with a portion of your data.

Now these files can be copied to your shared cloud storage for the next step.

Start your Docker container

All the hard work you put into creating your container (described above) will be put to the test. If you are running Kubernetes, you can create a Job that will execute the BatchControl.sh script.

Inside the Kubernetes Job definition, you can pass environment variables to adjust the execution of your script. For example, the batch size and number of epochs are set here and then pulled into your Python scripts, so you can alter the behavior without changing your code.

#####   sample Job in Kubernetes   #####
containers:
  - name: training-job
    env:
      - name: BATCH_SIZE
        value: "50"
      - name: NUM_EPOCHS
        value: "30"
    command: ["/app/BatchControl.sh"]

Once the Job is completed, be sure to verify that the GPU node properly scales back down to zero according to your scaling configuration in Kubernetes — you don’t want to be saddled with a huge bill over a simple configuration error.

Manually review results

With the training run complete, you should now have model artifacts saved and can examine the performance. Look through the metrics, such as F1 and log loss, and benchmark accuracy for high softmax confidence scores.

As mentioned earlier, the reports only tell part of the story. It is worth the time and effort to manually review the images that the model got wrong or where it produced a low confidence score.

Don’t forget about the bulk identification. Be sure to leverage these to locate new images to fill out your data set, or to find new classes.

Deploying your model

Once you have reviewed your model performance and are satisfied with the results, it is time to modify your TensorFlow Serving container to put the new model into production.

TensorFlow Serving is available as a Docker container and provides a very quick and convenient way to serve your model. This container can listen and respond to API calls for your model.

Let’s say your new model is version 7, and your Export script (see above) has saved the model in your cloud share as /image_application/models/007. You can start the TensorFlow Serving container with that volume mount. In this example, the shareName points to the folder for version 007.

#####   sample TensorFlow pod in Kubernetes   #####
containers:
  - name: tensorflow-serving
    image: bitnami/tensorflow-serving:2.18.0
    ports:
      - containerPort: 8501
    env:
      - name: TENSORFLOW_SERVING_MODEL_NAME
        value: "image_application"
    volumeMounts:
      - name: models-subfolder
        mountPath: "/bitnami/model-data"

volumes:
  - name: models-subfolder
    azureFile:
      shareName: "image_application/models/007"

A subtle note here — the export model script should create a sub-folder under models/007, also named 007, that contains the saved model export. This may seem a little redundant, but TensorFlow Serving mounts this share folder as /bitnami/model-data and detects the numbered sub-folder inside it for the version to serve. This lets you query the API for the model version (good for logging) as well as for identification.
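
For reference, here is a minimal sketch of calling that API from Python. The host name is hypothetical, and it assumes the standard TensorFlow Serving REST endpoint on port 8501:

#####   sample TensorFlow Serving REST calls (hypothetical host)   #####
import numpy as np
import requests

BASE_URL = "http://tensorflow-serving:8501/v1/models/image_application"

# Check which model version is being served (useful for logging)
status = requests.get(BASE_URL).json()
print(status["model_version_status"])

# Request an identification for a single preprocessed image
image = np.zeros((224, 224, 3), dtype=np.float32)  # placeholder for a real preprocessed image
response = requests.post(f"{BASE_URL}:predict",
                         json={"instances": [image.tolist()]})
scores = response.json()["predictions"][0]          # softmax scores for each class
print(int(np.argmax(scores)), float(np.max(scores)))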

Conclusion

As I mentioned at the start of this article, this setup has worked for my situation. This is certainly not the only way to approach this challenge, and I invite you to customize your own solution.

I wanted to share my hard-fought learnings as I embraced cloud services in Kubernetes, with the desire to keep costs under control. Of course, doing all this while maintaining a high level of model performance is an added challenge, but one that you can achieve.

I know the information above is mostly an outline of the process, but I hope it provides enough for you to fill in the rest, specific to your own application.

Happy learnings!

Learnings from a Machine Learning Engineer — Part 3: The Evaluation

In this third part of my series, I will explore the evaluation process which is a critical piece that will lead to a cleaner data set and elevate your model performance. We will see the difference between evaluation of a trained model (one not yet in production), and evaluation of a deployed model (one making real-world predictions).

In Part 1, I discussed the process of labelling your image data that you use in your Image Classification project. I showed how to define “good” images and create sub-classes. In Part 2, I went over various data sets, beyond the usual train-validation-test sets, such as benchmark sets, plus how to handle synthetic data and duplicate images.

Evaluation of the trained model

As machine learning engineers we look at accuracy, F1, log loss, and other metrics to decide if a model is ready to move to production. These are all important measures, but in my experience, these scores can be deceiving, especially as the number of classes grows.

Although it can be time consuming, I find it very important to manually review the images that the model gets wrong, as well as the images that the model gives a low softmax “confidence” score to. This means adding a step immediately after your training run completes to calculate scores for all images — training, validation, test, and the benchmark sets. You only need to bring up for manual review the ones that the model had problems with, which should be a small percentage of the total number of images. See the Double-check process below.

What you do during the manual evaluation is put yourself in a “training mindset” to ensure that the labelling standards you set up in Part 1 are being followed. Ask yourself:

  • “Is this a good image?” Is the subject front and center, and can you clearly see all the features?
  • “Is this the correct label?” Don’t be surprised if you find wrong labels.

You can either remove the bad images or fix the labels if they are wrong. Otherwise you can keep them in the data set and force the model to do better next time. Other questions I ask are:

  • “Why did the model get this wrong?”
  • “Why did this image get a low score?”
  • “What is it about the image that caused confusion?”

Sometimes the answer has nothing to do with that specific image. Frequently, it has to do with the other images, either in the ground truth class or in the predicted class. It is worth the effort to Double-check all images in both sets if you see a consistently bad guess. Again, don’t be surprised if you find poor images or wrong labels.

Weighted evaluation

When doing the evaluation of the trained model (above), we apply a lot of subjective analysis — “Why did the model get this wrong?” and “Is this a good image?” From these, you may only get a gut feeling.

Frequently, I will decide to hold off moving a model forward to production based on that gut feel. But how can you justify to your manager that you want to hit the brakes? This is where a more objective analysis comes in: creating a weighted average of the softmax “confidence” scores.

In order to apply a weighted evaluation, we need to identify sets of classes that deserve adjustments to the score. Here is where I create a list of “commonly confused” classes.

Commonly confused classes

Certain animals at our zoo can easily be mistaken for one another. For example, African elephants and Asian elephants have different ear shapes. If your model gets these two mixed up, that is not as bad as guessing a giraffe! So perhaps you give partial credit here. You and your subject matter experts (SMEs) can come up with a list of these pairs and a weighted adjustment for each.


This weight can be factored into a modified cross-entropy loss function in the equation below. The back half of this equation will reduce the impact of being wrong for specific pairs of ground truth and prediction by using the “weight” function as a lookup. By default, the weighted adjustment would be 1 for all pairings, and the commonly confused classes would get something like 0.5.

In other words, it’s better to be unsure (have a lower confidence score) when you are wrong, compared to being super confident and wrong.

Modified cross-entropy loss function, image by author
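
Since the original equation appears only as an image, here is a rough reconstruction in LaTeX based on the description above, where p_{i,y_i} is the softmax score assigned to the ground truth class y_i of image i, ŷ_i is the predicted class, and w is the lookup of weighted adjustments over a batch of N images:

\mathcal{L}_{\text{weighted}} = -\frac{1}{N} \sum_{i=1}^{N} w\left(y_i, \hat{y}_i\right) \, \log\left(p_{i,\,y_i}\right)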

Once this weighted log loss is calculated, I can compare to previous training runs to see if the new model is ready for production.

Confidence threshold report

Another valuable measure that incorporates the confidence threshold (in my example, 95) is to report on accuracy and false positive rates. Recall that when we apply the confidence threshold before presenting results, we help keep false positives from being shown to the end user.

In this table, we look at the breakdown of “true positive above 95” for each data set. We get a sense that when a “good” picture comes through (like the ones from our train-validation-test set) it is very likely to surpass the threshold, thus the user is “happy” with the outcome. Conversely, the “false positive above 95” is extremely low for good pictures, thus only a small number of our users will be “sad” about the results.

Example Confidence Threshold Report, image by author

We expect the train-validation-test set results to be exceptional since our data is curated. So, as long as people take “good” pictures, the model should do very well. But to get a sense of how it does on extreme situations, let’s take a look at our benchmarks.

The “difficult” benchmark has more modest true positive and false positive rates, which reflects the fact that the images are more challenging. These values are much easier to compare across training runs, so that lets me set a min/max target. So for example, if I target a minimum of 80% for true positive, and maximum of 5% for false positive on this benchmark, then I can feel confident moving this to production.

The “out-of-scope” benchmark has no true positive rate because none of the images belong to any class the model can identify. Remember, we picked things like a bag of popcorn, etc., that are not zoo animals, so there cannot be any true positives. But we do get a false positive rate, which means the model gave a confident score to that bag of popcorn as some animal. And if we set a target maximum of 10% for this benchmark, then we may not want to move it to production.


Right now, you may be thinking, “Well, what animal did it pick for the bag of popcorn?” Excellent question! Now you understand the importance of doing a manual review of the images that get bad results.

Evaluation of the deployed model

The evaluation that I described above applies to a model immediately after training. Now, you want to evaluate how your model is doing in the real world. The process is similar, but requires you to shift to a “production mindset” and ask yourself, “Did the model get this correct?”, “Should it have gotten this correct?”, and “Did we tell the user the right thing?”

So, imagine that you are logging in for the morning — after sipping on your cold brew coffee, of course — and are presented with 500 images that your zoo guests took yesterday of different animals. Your job is to determine how satisfied the guests were using your model to identify the zoo animals.

Using the softmax “confidence” score for each image, we apply a threshold before presenting results. Above the threshold, we tell the guest what the model predicted. I’ll call this the “happy path”. And below the threshold is the “sad path” where we ask them to try again.

Your review interface will first show you all the “happy path” images one at a time. This is where you ask yourself, “Did we get this right?” Hopefully, yes!

But if not, this is where things get tricky. So now you have to ask, “Why not?” Here are some things that it could be:

  • “Bad” picture — Poor lighting, bad angle, zoomed out, etc — refer to your labelling standards.
  • Out-of-scope — It’s a zoo animal, but unfortunately one that isn’t found in this zoo. Maybe it belongs to another zoo (your guest likes to travel and try out your app). Consider adding these to your data set.
  • Out-of-scope — It’s not a zoo animal. It could be an animal found at your zoo, but not one kept in an exhibit, like a neighborhood sparrow or mallard duck. This might be a candidate to add.
  • Out-of-scope — It’s something found in the zoo. A zoo usually has interesting trees and shrubs, so people might try to identify those. Another candidate to add.
  • Prankster — Completely out-of-scope. Because people like to play with technology, there’s the possibility you have a prankster that took a picture of a bag of popcorn, or a soft drink cup, or even a selfie. These are hard to prevent, but hopefully get a low enough score (below the threshold) so the model did not identify it as a zoo animal. If you see enough pattern in these, consider creating a class with special handling on the front-end.

After reviewing the “happy path” images, you move on to the “sad path” images — the ones that got a low confidence score and the app gave a “sorry, try again” message. This time you ask yourself, “Should the model have given this image a higher score?” which would have put it in the “happy path”. If so, then you want to ensure these images are added to the training set so next time it will do better. But most of the time, the low score reflects many of the “bad” or out-of-scope situations mentioned above.

Perhaps your model performance is suffering and it has nothing to do with your model. Maybe it is the way your users are interacting with the app. Keep an eye out for non-technical problems and share your observations with the rest of your team. For example:

  • Are your users using the application in the ways you expected?
  • Are they not following the instructions?
  • Do the instructions need to be stated more clearly?
  • Is there anything you can do to improve the experience?

Collect statistics and new images

Both of the manual evaluations above open a gold mine of data. So, be sure to collect these statistics and feed them into a dashboard — your manager and your future self will thank you!


Keep track of these stats and generate reports that you and your team can reference:

  • How often is the model being called?
  • What times of day and days of the week is it used?
  • Are your system resources able to handle the peak load?
  • What classes are the most common?
  • After evaluation, what is the accuracy for each class?
  • What is the breakdown for confidence scores?
  • How many scores are above and below the confidence threshold?

The single best thing you get from a deployed model is the additional real-world images! You can add these new images to improve coverage of your existing zoo animals. But more importantly, they provide you insight on other classes to add. For example, let’s say people enjoy taking a picture of the large walrus statue at the gate. Some of these may make sense to incorporate into your data set to provide a better user experience.

Creating a new class, like the walrus statue, is not a huge effort, and it avoids the false positive responses. It would be more embarrassing to identify a walrus statue as an elephant! As for the prankster and the bag of popcorn, you can configure your front-end to quietly handle these. You might even get creative and have fun with it like, “Thank you for visiting the food court.”

Double-check process

It is a good idea to double-check your image set when you suspect there may be problems with your data. I’m not suggesting a top-to-bottom check, because that would be a monumental effort! Rather, check specific classes that you suspect contain bad data that is degrading your model performance.

Immediately after my training run completes, I have a script that will use this new model to generate predictions for my entire data set. When this is complete, it will take the list of incorrect identifications, as well as the low scoring predictions, and automatically feed that list into the Double-check interface.

This interface will show, one at a time, the image in question, alongside an example image of the ground truth and an example image of what the model predicted. I can visually compare the three, side-by-side. The first thing I do is ensure the original image is a “good” picture, following my labelling standards. Then I check if the ground-truth label is indeed correct, or if there is something that made the model think it was the predicted label.

At this point I can:

  • Remove the original image if the image quality is poor.
  • Relabel the image if it belongs in a different class.

During this manual evaluation, you might notice dozens of the same wrong prediction. Ask yourself why the model made this mistake when the images seem perfectly fine. The answer may be some incorrect labels on images in the ground truth, or even in the predicted class!

Don’t hesitate to add those classes and sub-classes back into the Double-check interface and step through them all. You may have 100–200 pictures to review, but there is a good chance that one or two of the images will stand out as being the culprit.

Up next…

With a different mindset for a trained model versus a deployed model, we can now evaluate performance to decide which models are ready for production, and how well a production model is going to serve the public. This relies on a solid Double-check process and a critical eye on your data. And beyond the “gut feel” of your model, we can rely on the benchmark scores to support us.

In Part 4, we kick off the training run, but there are some subtle techniques to get the most out of the process and even ways to leverage throw-away models to expand your image library.

Learnings from a Machine Learning Engineer — Part 1: The Data

It is said that in order for a machine learning model to be successful, you need to have good data. While this is true (and pretty much obvious), it is extremely difficult to define, build, and sustain good data. Let me share with you the unique processes that I have learned over several years building an ever-growing image classification system and how you can apply these techniques to your own application.

With persistence and diligence, you can avoid the classic “garbage in, garbage out”, maximize your model accuracy, and demonstrate real business value.

In this series of articles, I will dive into the care and feeding of a multi-class, single-label image classification app and what it takes to reach the highest level of performance. I won’t get into any coding or specific user interfaces, just the main concepts that you can incorporate to suit your needs with the tools at your disposal.

Here is a brief description of the articles. You will notice that the model is last on the list since we need to focus on curating the data first and foremost:

Background

Over the past six years, I have been primarily focused on building and maintaining an image classification application for a manufacturing company. Back when I started, most of the software did not exist or was too expensive, so I created these tools from scratch. In this time, I have deployed two identifier applications, the largest of which handles 1,500 classes and achieves 97–98% accuracy.

It was about eight years ago that I started online studies for Data Science and machine learning. So, when the exciting opportunity to create an AI application presented itself, I was prepared to build the tools I needed to leverage the latest advancements. I jumped in with both feet!

I quickly found that building and deploying a model is probably the easiest part of the job. Feeding high quality data into the model is the best way to improve performance, and that requires focus and patience. Attention to detail is what I do best, so this was a perfect fit.

It all starts with the data

I feel that so much attention is given to the model selection (deciding which neural network is best) and that the data is just an afterthought. I have found the hard way that even one or two pieces of bad data can significantly impact model performance, so that is where we need to focus.

For example, let’s say you train the classic cat versus dog image classifier. You have 50 pictures of cats and 50 pictures of dogs; however, one of the “cats” is clearly (objectively) a picture of a dog. The computer doesn’t have the luxury of ignoring the mislabelled image, and instead adjusts the model weights to make it fit. Square peg meets round hole.

Another example would be a picture of a cat that has climbed up into a tree. But when you take a holistic view of it, you would describe it as a picture of a tree (first) with a cat (second). Again, the computer doesn’t know to ignore the big tree and focus on the cat — it will start to identify trees as cats, even when the tree contains a dog. You can think of these pictures as outliers that should be removed.

It doesn’t matter if you have the best neural network in the world, you can count on the model making poor predictions when it is trained on “bad” data. I’ve learned that any time I see the model make mistakes, it’s time to review the data.

Example Application — Zoo animals

For the rest of this write-up, I will use an example of identifying zoo animals. Let’s assume your goal is to create a mobile app where guests at the zoo can take pictures of the animals they see and have the app identify them. Specifically, this is a multi-class, single-label application.

Here is your challenge:

  • Variety — There are a lot of different animals at the zoo and many of them look very similar.
  • Quality — Guests using the app don’t always take good pictures (zoomed out, blurry, too dark), so we don’t want to provide an answer if the image is poor.
  • Growth — The zoo keeps expanding and adding new species all the time.
  • Out-of-scope — Occasionally you might find that people take pictures of the sparrows near the food court grabbing some dropped popcorn.
  • Pranksters — Just for fun, guests may take a picture of the bag of popcorn just to see what it comes back with.

These are all real challenges — being able to tell the subtle differences between animals, handling out-of-scope cases, and just plain poor images.

Before we get there, let’s start from the beginning.

Collecting and Labelling

There are a lot of tools these days to help you with this part of the process, but the challenge remains the same — collecting, labelling, and curating the data.

Having data to collect is challenge #1. Without images, you have nothing to train. You may need to get creative on sourcing the data, or even creating synthetic data. More on that later.

A quick note about image pre-processing. I convert all my images to the input size of my neural network and save them as PNG. Inside this square PNG, I preserve the aspect ratio of the original picture and fill the background black. I don’t stretch the image nor crop any features out. This also helps center the subject.
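
For illustration, here is a minimal sketch of that pre-processing step using Pillow, assuming a 299-pixel square input; the function name, paths, and default size are placeholders rather than my exact production code:

```python
from PIL import Image

def to_square_png(src_path, dst_path, size=299):
    """Fit an image inside a size x size square, preserving aspect ratio,
    and fill the unused background with black before saving as PNG."""
    img = Image.open(src_path).convert("RGB")
    scale = size / max(img.width, img.height)           # shrink or grow the longer side to fit
    new_w, new_h = round(img.width * scale), round(img.height * scale)
    img = img.resize((new_w, new_h), Image.LANCZOS)
    canvas = Image.new("RGB", (size, size), (0, 0, 0))  # black square background
    canvas.paste(img, ((size - new_w) // 2, (size - new_h) // 2))  # keep the subject centered
    canvas.save(dst_path, format="PNG")

# Example: to_square_png("raw/zebra_001.jpg", "library/zebra/zebra_001.png")
```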

Challenge #2 is to establish standards for data quality…and ensure that these standards are followed! These standards will guide you toward that “good” data. And this assumes, of course, correct labels. Having both is much easier said than done!

I hope to show how “good” and “correct” actually go hand-in-hand, and how important it is to apply these standards to every image.

Good Data

First, I want to point out that the image data discussed here is for the training set. What qualifies as a good image for training is a bit different than what qualifies as a good image for evaluation. More on that in Part 3.

So, what is “good” data when talking about images? “A picture is worth a thousand words”, and if the first words you use to describe the picture do not include the subject you are trying to label, then it is not good and you need to remove it from your training set.

For example, let’s say you are shown a picture of a zebra and (removing bias toward your application) you describe it as an “open field with a zebra in the distance”. In other words, if “open field” is the first thing you notice, then you likely do not want to use that image. The opposite is also true — if the picture is way too close, you would describe it as “zebra pattern”.

Photo by Meg von Haartman on Unsplash
Photo by Jason Dent on Unsplash
Photo by Martin Olsen on Unsplash

What you want is a description like, “a zebra, front and center”. This would have your subject taking up about 80–90% of the total frame. Sometimes I will take the time to crop the original image so the subject is framed properly.

Keep in mind the use of image augmentation at the time of training. Having a buffer around the edges will allow “zoom in” augmentation, and “zoom out” augmentation will simulate smaller subjects, so don’t start with your subject at less than 50% of the total frame, since you will lose detail.

Another aspect of a “good” image relates to the label. If you can only see the back side of your zoo animal, can you really tell, for example, that it is a cheetah versus a leopard? The key identifying features need to be visible. If a human struggles to identify it, you can’t expect the computer to learn anything.

Photo by Jan Harder on Unsplash

What does a “bad” image look like? Here is what I frequently watch out for:

  • Wide angle lens stretching
  • Back-lit or silhouette
  • High contrast or dark shadows
  • Blurry or hazy
  • Obscured features
  • Multiple subjects
  • “Doctored” images, drawn lines and arrows
  • “Unusual” angles or situations
  • Picture of a mobile device that has a picture of your subject

Correct Labels

If you have a team of subject matter experts (SMEs) on hand to label the images, you are in a good starting position. Animal trainers at the zoo know the various species, and can spot the differences between, for example, a chimpanzee and a bonobo.

Photo by Adèle on Unsplash
Photo by Andrius Ordojan on Unsplash

As a Machine Learning Engineer, it is easy to assume all labels from your SMEs are correct and move right on to training the model. However, even experts make mistakes, so if you can get a second opinion on the labels, your error rate should go down.

In reality, it can be prohibitively expensive to get one, let alone two, subject matter experts to review image labels. The SME usually has years of experience that make them more valuable to the business in other areas of work. My experience is that the machine learning engineer (that’s you and me) becomes the second opinion, and often the first opinion as well.

Over time, you can become pretty adept at labelling, but certainly not an SME. If you do have the luxury of access to an expert, explain to them the labelling standards and how these are required for the application to be successful. Emphasize “quality over quantity”.

It goes without saying that having correct labels is important. However, all it takes is one or two mislabelled images to degrade performance. These can easily slip into your data set with careless or hasty labelling. So, take the time to get it right.

Ultimately, we as the ML engineer are responsible for model performance. So, if we take the approach of only working on model training and deployment, we will find ourselves wondering why performance is falling short.

Unknown Labels

A lot of times, you will come across a really good picture of a very interesting subject, but have no idea what it is! It would be a shame to simply dispose of it. What you can do is assign it a generic label, like “Unknown Bird” or “Random Plant”, that is not included in your training set. In Part 4, you’ll see how to come back to these images when you have a better idea of what they are, and you’ll be glad you saved them.

Model Assistance

If you have done any image labelling, then you know how time consuming and difficult it can be. But this is where having a model, even a less-than-perfect model, can help you.

Typically, you have a large collection of unlabelled images and you need to go through them one at a time to assign labels. Simply having the model offer a best guess and display the top 3 results lets you step through each image in a matter of seconds!

Even if the top 3 results are wrong, this can help you narrow down your search. Over time, newer models will get better, and the labelling process can even be somewhat fun!
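
Here is a rough sketch of what that looks like, assuming a Keras model and simple resizing (not my exact tooling; the model path, class list file, and pixel scaling are placeholders, and in practice you would reuse the same letterboxing pre-processing described earlier):

```python
import numpy as np
from PIL import Image
import tensorflow as tf

model = tf.keras.models.load_model("current_model.keras")        # placeholder: any trained checkpoint
class_names = open("class_names.txt").read().splitlines()        # placeholder: labels in output-layer order

def top3_guesses(image_path):
    """Return the model's three most confident (label, softmax score) guesses for one image."""
    img = Image.open(image_path).convert("RGB").resize((299, 299))   # simplified; letterbox in practice
    batch = np.expand_dims(np.asarray(img, dtype="float32") / 255.0, axis=0)
    scores = model.predict(batch, verbose=0)[0]                      # softmax over all classes
    top = np.argsort(scores)[::-1][:3]
    return [(class_names[i], float(scores[i])) for i in top]

# Show suggestions while stepping through a folder of unlabelled images
print(top3_guesses("unlabelled/IMG_0001.jpg"))
```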

In Part 4, I will show how you can bulk identify images and take this to the next level for faster labelling.

Classes and Sub-Classes

I mentioned the example above of two species that look very similar, the chimpanzee and the bonobo. When you start out building your data set, you may have very sparse coverage of one or both of these species. In machine learning terms, we call these “classes”. One option is to roll with what you have and hope that the model picks up on the differences with only a handful of example images.

The option that I have used is to merge two or more classes into one, at least temporarily. So, in this case I would create a class called “chimp-bonobo”, composed of the limited example pictures from both the chimpanzee and bonobo classes. Combined, these may give me enough to train the model on “chimp-bonobo”, with the trade-off that it’s a more generic identification.

Sub-classes can even be normal variations. For example, juvenile pink flamingos are grey instead of pink, and male and female orangutans have distinct facial features. You want to have a fairly balanced number of images for these normal variations, and keeping sub-classes will allow you to accomplish this.

Photo by David Valentine on Unsplash
Photo by Hongbin on Unsplash

Don’t be concerned that you are merging completely different looking classes — the neural network does a nice job of applying the “OR” operator. This works both ways — it can help you identify male or female variations as one species, but it can hurt you when “bad” outlier images sneak in like the example “open field with a zebra in the distance.”

Over time, you will (hopefully) be able to collect more images of the sub-classes and then be able to successfully split them apart (if necessary) and train the model to identify them separately. This process has worked very well for me. Just be sure to double-check all the images when you split them to ensure the labels didn’t get accidentally mixed up — it will be time well spent.

All of this certainly depends on your user requirements. You can handle it in different ways: either by creating a unique class label like “chimp-bonobo”, or at the front-end presentation layer, where you notify the user that you have intentionally merged these classes and provide guidance on further refining the results. Even after you decide to split the two classes, you may want to caution the user that the model could be wrong, since the two classes are so similar.

Up next…

I realize this was a long write-up for something that on the surface seems intuitive, but these are all areas that have tripped me up in the past because I didn’t give them enough attention. Once you have a solid understanding of these principles, you can go on to build a successful application.

In Part 2, we will take the curated data we collected here to create the classic data sets, with a custom benchmark set that will further enhance your data. Then we will see how best to evaluate our trained model using a specific “training mindset”, and switch to a “production mindset” when evaluating a deployed model.

Learnings from a Machine Learning Engineer — Part 4: The Model
https://towardsdatascience.com/learnings-from-a-machine-learning-engineer-part-4-the-model/
Thu, 13 Feb 2025 20:53:42 +0000

In this latest part of my series, I will share what I have learned on selecting a model for Image Classification and how to fine-tune that model. I will also show how you can leverage the model to accelerate your labelling process, and finally how to justify your efforts by generating usage and performance statistics.

In Part 1, I discussed the process of labelling the image data that you use in your image classification project. I showed how to define “good” images and create sub-classes. In Part 2, I went over the various data sets, beyond the usual train-validation-test sets, with benchmark sets, plus how to handle synthetic data and duplicate images. In Part 3, I explained how to apply different evaluation criteria to a trained model versus a deployed model, and how to use benchmarks to determine when to deploy a model.

Model selection

So far I have focused a lot of time on labelling and curating the set of images, and also evaluating model performance, which is like putting the cart before the horse. I’m not trying to minimize what it takes to design a massive neural network — this is a very important part of the application you are building. In my case, I spent a few weeks experimenting with different available models before settling on one that fit the bill.

Once you pick a model structure, you usually don’t make any major changes to it. For me, six years into deployment, I’m still using the same one. Specifically, I chose Inception V4 because it has a large input image size and an adequate number of layers to pick up on subtle image features. It also performs inference fast enough on CPU, so I don’t need to run expensive hardware to serve the model.

Your mileage may vary. But again, the main takeaway is that focusing on your data will pay dividends versus searching for the best model.

Fine tuning

I will share a process that I found to work extremely well. Once I decided on the model to use, I randomly initialized the weights and let the model train for about 120 epochs before improvements plateaued at a fairly modest accuracy, around 93%. At this point, I performed the evaluation of the trained model (see Part 3) to clean up the data set. I also incorporated new images as part of the data pipeline (see Part 1) and prepared the data sets for the next training run.

Before starting the next training run, I simply take the last trained model, pop the output layer, and add it back in with random weights. Since the number of output classes is constantly increasing in my case, I have to pop that layer anyway to account for the new number of classes. Importantly, I leave the rest of the trained weights as they were and allow them to continue updating for the new classes.

This allows the model to train much faster before improvements stall. After repeating this process dozens of times, the training reaches a plateau after about 20 epochs, and the test accuracy can reach 99%! The model builds upon the low-level features it established in the previous runs while re-learning the output weights, which helps prevent overfitting.
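
Here is a minimal Keras sketch of that “pop and re-attach” step, assuming the previous checkpoint ends in a single softmax Dense layer; the paths, folder layout, and class count are placeholders, not my exact pipeline:

```python
import tensorflow as tf

num_classes = 1500   # placeholder: the new, larger number of classes
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(299, 299), label_mode="categorical")
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data/validation", image_size=(299, 299), label_mode="categorical")

previous = tf.keras.models.load_model("previous_model.keras")   # last trained checkpoint

# "Pop" the old output layer and attach a fresh, randomly initialized head
# sized for the new class count; every other layer keeps its trained weights.
features = previous.layers[-2].output
new_head = tf.keras.layers.Dense(num_classes, activation="softmax", name="new_head")(features)
model = tf.keras.Model(inputs=previous.input, outputs=new_head)

# All layers remain trainable, so the carried-over weights continue updating for the new classes.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=20)   # improvements tend to plateau around 20 epochs
```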

It took me a while to trust this process, and for a few years I would train from scratch every time. But after I attempted this and saw the training time (not to mention the cost of cloud GPU) go down while the accuracy continued to go up, I started to embrace the process. More importantly, I continue to see the evaluation metrics of the deployed model return solid performances.

Augmentation

During training, you can apply transformations to your images (called “augmentation”) to get more diversity from your data set. With our zoo animals, it is fairly safe to apply a left-right flip, slight rotations clockwise and counterclockwise, and a slight resize that will zoom in and out.

With these transformations in mind, make sure your images are still able to act as good training images. In other words, an image where the subject is already small will be even smaller with a zoom out, so you probably want to discard the original. Also, some of your original pictures may need to be re-oriented by 90 degrees to be upright since a further rotation would make them look unusual.
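
In Keras, for example, these conservative transformations can be expressed as preprocessing layers; the exact ranges below are illustrative values, not my tuned settings:

```python
import tensorflow as tf

# Mild, label-preserving augmentation suited to zoo-animal photos:
# horizontal flip, a few degrees of rotation, and a slight zoom in or out.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.02),   # fraction of a full turn, roughly +/- 7 degrees
    tf.keras.layers.RandomZoom(0.1),        # up to +/- 10% zoom
])

# Applied on the fly during training, for example over a tf.data pipeline:
# train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))
```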

Bulk identification

As I mentioned in Part 1, you can use the trained model to assist you in labelling images one at a time. But the way to take this even further is to have your newly trained model identify hundreds at a time while building a list of the results that you can then filter.

Typically, we have large collections of unlabelled images that have come in either through regular usage of the application or some other means. Recall from Part 1 that we assigned “unknown” labels to interesting pictures when we had no clue what they were. By using the bulk identification method, we can sift through these collections quickly and target the labelling once we know what they are.
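
A rough sketch of this bulk pass, reusing a helper like the top3_guesses function sketched in Part 1 and pandas for the filterable results (the folder names and filters are placeholders):

```python
import glob
import pandas as pd

rows = []
for path in glob.glob("unlabelled/**/*.png", recursive=True):   # placeholder folder of unlabelled images
    label, score = top3_guesses(path)[0]                        # best guess from the current model
    rows.append({"image": path, "top_label": label, "score": round(score * 100, 1)})

results = pd.DataFrame(rows)

# Filter the list instead of eyeballing thousands of files, for example:
antelope_like = results[(results.top_label == "antelope") & (results.score < 95)]   # look-alike hunting
junk = results[results.score < 20]                                                  # mass-removal candidates
```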

By combining your current image counts with the bulk identification results, you can target classes that need expanded coverage. Here are a few ways you can leverage bulk identification:

  • Increase low image counts — Some of your classes may have just barely made the cutoff to be included in the training set, which means you need more examples to improve coverage. Filter for images that have low counts.
  • Replace staged or synthetic images — Some classes may be built entirely using non-real-world images. These pictures may be good enough to get started with, but may cause performance issues down the road because they look different than what typically comes through. Filter for classes that depend on staged images.
  • Find look-alike classes — A class in your data set may look like another one. For example, let’s say your model can identify an antelope, and that looks like a gazelle which your model cannot identify yet. Setting a filter for antelope and a lower confidence score may reveal gazelle images that you can label.
  • Unknown labels — You may not have known how to identify the dozens of cute wallaby pictures, so you saved them under “Unknown” because it was a good image. Now that you know what it is, you can filter for its look-alike kangaroo and quickly add a new class.
  • Mass removal of low scores — As a way to clean out your large collection of unlabelled images that have nothing worth labelling, set a filter for lowest scores.

Throw-away training run

Recall the decision I made in Part 2 to set image cutoffs, which allows us to ensure an adequate number of example images for a class before we train and serve a model to the public. The problem is that you may have a number of classes that are just below your cutoff (in my case, 40) and don’t make it into the model.

The way I approach this is with a “throw-away” training run that I do not intend to move to production. I will decrease the lower cutoff from 40 to perhaps 35, build my train-validation-test sets, then train and evaluate like I normally do. The most important part of this is the bulk identification at the end!

There is a chance that somewhere in the large collection of unlabelled images I will find the few that I need. Doing the bulk identification with this throw-away model helps find them.

Performance Reporting

One very important aspect of any machine learning application is being able to show usage and performance reports. Your manager will likely want to see how many times the application is being used to justify the expense, and you as the ML engineer will want to see how the latest model is performing compared to the previous one.

You should build logging into your model serving to record every transaction going through the system. Also, the manual evaluations from Part 3 should be recorded so you can report on performance for such things as accuracy over time, by model version, by confidence scores, by class, etc. You will be able to detect trends and make adjustments to improve the overall solution.
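
As a bare-bones example, each serving request could append a record along these lines; the field names and file format are placeholders, and you would use whatever your reporting stack expects:

```python
import json
import time
import uuid

def log_prediction(label, confidence, model_version, log_path="predictions.jsonl"):
    """Append one prediction record per line so the dashboards can aggregate them later."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "model_version": model_version,
        "label": label,
        "confidence": confidence,   # the 0-100 score shown to (or withheld from) the user
        "manual_review": None,      # filled in later during the evaluations from Part 3
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```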

There are a lot of reporting tools, so I won’t recommend one over the other. Just make sure you are collecting as much information as you can to build these dashboards. This will justify the time, effort, and cost associated with maintaining the application.

Conclusion

We covered a lot of ground across this four-part series on building an image classification project and deploying it in the real world. It all starts with the data, and by investing the time and effort into maintaining the highest quality image library, you can reach impressive levels of model performance that will gain the trust and confidence of your business partners.

As a Machine Learning Engineer, you are primarily responsible for building and deploying your model. But it doesn’t stop there — dive into the data. The more familiar you are with the data, the better you will understand the strengths and weaknesses of your model. Take a close look at the evaluations and use them as an opportunity to adjust the data set.

I hope these articles have helped you find new ways to improve your own machine learning project. And by the way, don’t let the machine do all the learning — as humans, our job is to continue our own learning, so don’t ever stop!

Thank you for taking this deep dive with me into a data-driven approach to model optimization. I look forward to your feedback and how you can apply this to your own application.

Learnings from a Machine Learning Engineer — Part 2: The Data Sets
https://towardsdatascience.com/learnings-from-a-machine-learning-engineer-part-2-the-data-sets/
Thu, 13 Feb 2025 20:29:39 +0000

In Part 1, we discussed the importance of collecting good image data and assigning proper labels for your Image Classification project to be successful. We also talked about classes and sub-classes of your data. These may seem like pretty straightforward concepts, but it’s important to have a solid understanding going forward. So, if you haven’t read it yet, please check it out.

Now we will discuss how to build the various data sets and the techniques that have worked well for my application. Then in the next part, we will dive into the evaluation of your models, beyond simple accuracy.

I will again use the example zoo animals image classification app.

Data Sets

As machine learning engineers, we are all familiar with the train-validation-test sets. But when we include the concept of sub-classes discussed in Part 1, incorporate the concepts discussed below to set a minimum and maximum image count per class, and add staged and synthetic data to the mix, the process gets a bit more complicated. I had to create a custom script to handle these options.

I will walk you through these concepts before we split the data for training:

  • Image cutoffs — Too few images and your model performance will suffer. Too many and you spend more time training than it’s worth.
  • Confidence thresholds — Your model indicates how confident it is in the predictions. Let’s use that to decide when to present results to the user.
  • Benchmark sets — Real-world data is messy and the benchmark sets should reflect that. These need to stretch the model to the limit and help us decide when it is ready for production.
  • Staged and synthetic data — Real-world data is king, but sometimes you need to produce your own or even generate data to get off the ground. Be careful it doesn’t hurt performance.
  • Duplicate images — Repeat data can skew your results and give you a false sense of performance. Make sure your data is diverse.
  • Building the data sets — Combine sub-classes, apply cutoffs, and create your train-validation-test sets. Now we are ready to get the show started.

Image cutoffs

In my experience, using a minimum of 40 images per class provides decent performance. Since I like to use 10% each for the test set and validation set, that means at least 4 images per class end up in each of those sets, which feels just barely adequate. Using fewer than 40 images per class, I notice my model evaluation tends to suffer.

On the other end, I set a maximum of about 125 images per class. I have found that the performance gains tend to plateau beyond this, so having more data will slow down the training run with little to show for it. Having more than the maximum is fine, and this “overflow” can be added to the test set, so it doesn’t go to waste.

There are times when I will drop the minimum cutoff to, say, 35, with no intention of moving the trained model to production. Instead, the purpose is to leverage this throw-away model to find more images from my unlabelled set. This is a technique that I will go into in more detail in Part 4.

Confidence threshold

You are likely familiar with the softmax score. As a reminder, softmax is the probability assigned to each label. I like to think of it as a confidence score, and we are interested in the class that receives the highest confidence. Softmax is a value between zero and one, but I find it easier to interpret confidence scores between zero and 100, like a percentage.

In order to decide if the model is confident enough with its prediction, I have chosen a threshold of 95. I use this threshold when determining if I want to present results to the user.

Scores above the threshold have a better chance of being right, so I can confidently provide the results. Scores below the threshold may not be right — in fact, the image could be “out-of-scope”, meaning it’s something the model doesn’t know how to identify. So, instead of taking the risk of presenting incorrect results, I prompt the user to try again and offer suggestions on how to take a “good” picture.
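
Here is a minimal sketch of how that gate might look in the serving code; the threshold of 95 matches the discussion above, while the function name and messages are placeholders:

```python
CONFIDENCE_THRESHOLD = 95   # on a 0-100 scale, i.e. softmax * 100

def present_result(scores, class_names, threshold=CONFIDENCE_THRESHOLD):
    """Return an answer only when the model is confident enough; otherwise ask for a better picture."""
    best = int(scores.argmax())
    confidence = float(scores[best]) * 100          # convert softmax (0-1) to a 0-100 confidence score
    if confidence >= threshold:
        return {"label": class_names[best], "confidence": round(confidence, 1)}
    # Below the threshold: likely a poor photo or something out-of-scope.
    return {"label": None,
            "message": "We couldn't identify this one. Try getting closer and "
                       "keeping the animal centered in the frame."}
```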

Admittedly, this is a somewhat arbitrary cutoff, and you should decide what is appropriate for your use-case. In fact, this score could probably be adjusted for each trained model, but that would make it harder to compare performance across models.

I will refer to this confidence score frequently in the evaluations section in Part 3.

Benchmark sets

Let me introduce what I call the benchmark sets, which you can think of as extended test sets. These are hand-picked images designed to stretch the limits of your model, and provide a measure for specific classes of your data. Use these benchmarks to justify moving your model to production, and for an objective measure to show to your manager.

  • Difficult Benchmark — These are the “extra credit” images, like the bonus questions a professor would add to the quiz to see which students are paying attention. You need a keen eye to spot the difference between the ground truth and a similar looking class. For example, a cheetah sleeping in the shade that could pass as a leopard if you don’t look closely.
  • Out-of-scope Benchmark — These are the “trick question” images. Our model is trained on zoo animals, but people are known for not following the rules. For example, a zoo guest takes a picture of their child wearing cheetah face paint.
  • Most-Common Benchmark — These are your “bread and butter” classes that need to get near perfect scores and zero errors. This would be a make-or-break benchmark for moving to production.
  • Least-Common Benchmark — These are your “rare but exceptional” classes that again need to be correct, but reach a minimum score like the confidence threshold.

When looking for images to add to the benchmarks, you can likely find them in real-world images from your deployed model. See the evaluation in Part 3.

For each benchmark, calculate the min, max, median, and mean scores, and also how many images get scores above and below the confidence threshold. Now you can compare these measures against what is currently in production, and against your minimum requirements, to help decide if the new model is production worthy.
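
These summary measures are simple to compute once the benchmark scores are collected, for example:

```python
import statistics

def benchmark_summary(scores, threshold=95):
    """Summarize a benchmark set's confidence scores (0-100 scale)."""
    return {
        "min": min(scores),
        "max": max(scores),
        "median": statistics.median(scores),
        "mean": round(statistics.mean(scores), 1),
        "above_threshold": sum(s >= threshold for s in scores),
        "below_threshold": sum(s < threshold for s in scores),
    }

# Compare, say, benchmark_summary(difficult_scores) against the same summary from the production model.
```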

Staged or Synthetic data

Perhaps the biggest hurdle to any supervised machine learning application is having data to train the model. Clearly, “real-world” data that comes from actual users of the application is ideal. However, you can’t really collect that data until the model is deployed. It’s a chicken-and-egg problem.

One way to get started is to have volunteers collect “staged” images for you, trying to act like real users. So, let’s have our zoo staff go around taking pictures of the animals. This is a good start, but there will be a certain level of bias introduced in these images. For example, the staff may take the photos over a few days, so you may not capture the year-round weather conditions.

Another way to get pictures is to use computer-generated “synthetic” images. I would avoid these at all costs, to be honest. Based on my experience, the model struggles with these because they look…different. The lighting is not natural, the subject may be superimposed on a background so the edges look too sharp, etc. Granted, some of the AI generated images look very realistic, but if you look closely you may spot something unusual. The neural network in your model will notice these, so be careful.

Image generated using Dall-E

The way that I handle these staged or synthetic images is as a sub-class that gets merged into the training set, but only after giving preference to the real-world images. I cap the number of staged images at 60, so if I have 10 real-world images, I now only need 50 staged. Eventually, these staged and synthetic images are phased out completely, and I rely entirely on real-world data.

Duplicate images

One problem that can creep into your image set is duplicate images. These can be exact copies of pictures, or they can be extremely similar. You may think this is harmless, but imagine having 100 pictures of an elephant that are exactly the same — your model will not know what to do with a different angle of the elephant.

Now, let’s say you have only two pictures that are nearly the same. Not so bad, right? Well, here is what can happen to them:

  • Both pictures go in the training set — The model doesn’t learn anything from the repeated image and it wastes time processing them.
  • One goes into the training set, the other goes into the test set — Your test score will be higher, but it is not an accurate evaluation.
  • Both are in the test set — Your test score will be skewed either higher or lower than it should be.

None of these will help your model.

There are a few ways to find duplicates. The approach I have taken is to calculate a hamming distance between all the pictures and identify the ones that are very close. I have an interface that displays the duplicates so I can decide which one I like best and remove the other.
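
One way to implement this is with a perceptual hash, where subtracting two hashes gives the hamming distance; this sketch assumes the imagehash library, and the distance cutoff of 5 is an illustrative value:

```python
import glob
import itertools
import imagehash
from PIL import Image

paths = glob.glob("library/zebra/**/*.png", recursive=True)        # placeholder class folder
hashes = {p: imagehash.phash(Image.open(p)) for p in paths}        # perceptual hash per image

# Pairs with a small hamming distance are near-duplicates worth reviewing.
near_duplicates = [
    (a, b, hashes[a] - hashes[b])
    for a, b in itertools.combinations(paths, 2)
    if hashes[a] - hashes[b] <= 5
]
for a, b, dist in near_duplicates:
    print(f"{dist:2d}  {a}  <->  {b}")
```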

Another way (I haven’t tried this yet) is to create a vector representation of your images. Store these in a vector database, and you can do a similarity search to find nearly identical images.

Whatever method you use, it is important to clean up the duplicates.

Building the data sets

Now we are ready to build the traditional training, validation, and test sets. This is no longer a straightforward task, since I want to:

  1. Merge sub-classes into a main class.
  2. Prioritize real-world images over staged or synthetic images.
  3. Apply a minimum number of images per class.
  4. Apply a maximum number of images per class, sending the “overflow” to the test set.

This process is somewhat complicated and depends on how you manage your image library. First, I would recommend keeping your images in a folder structure that has sub-class folders. You can get image counts by using a script to simply read the folders. Second, keep a configuration of how the sub-classes are merged. To really set yourself up for success, put these image counts and merge rules in a database for faster lookups.

My train-validation-test set splits are usually 90–10–0. I originally started out using 80–10–10, but with diligence on keeping the entire data set clean, I noticed validation and test scores became pretty even. This allowed me to increase the training set size, and use “overflow” to become the test set, as well as using the benchmark sets.
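
Here is a simplified sketch of that splitting logic, assuming one folder per sub-class and a merge map from sub-class to main class; the cutoffs of 40 and 125 and the 90-10 split mirror the numbers above, while the folder layout and naming convention for staged images are placeholders:

```python
import glob
import os
import random
from collections import defaultdict

MIN_IMAGES, MAX_IMAGES, VAL_FRACTION = 40, 125, 0.10
merge_map = {"chimpanzee": "chimp-bonobo", "bonobo": "chimp-bonobo"}   # placeholder merge rules

# Gather images per main class, merging sub-classes and listing real-world images first.
classes = defaultdict(list)
for folder in sorted(glob.glob("library/*/")):
    sub_class = os.path.basename(os.path.dirname(folder))
    main_class = merge_map.get(sub_class, sub_class)
    images = sorted(glob.glob(os.path.join(folder, "*.png")))
    real = [p for p in images if "staged" not in p and "synthetic" not in p]   # placeholder convention
    classes[main_class].extend(real + [p for p in images if p not in real])    # real-world first

train, validation, test = [], [], []
for name, images in classes.items():
    if len(images) < MIN_IMAGES:
        continue                                    # not enough coverage yet; wait for more data
    keep, overflow = images[:MAX_IMAGES], images[MAX_IMAGES:]
    random.shuffle(keep)
    n_val = max(1, int(len(keep) * VAL_FRACTION))
    validation += [(name, p) for p in keep[:n_val]]
    train      += [(name, p) for p in keep[n_val:]]
    test       += [(name, p) for p in overflow]     # the "overflow" becomes extra test data
```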

Up next…

In this part, we built our data sets by merging sub-classes and applying the image count cutoffs. We also handled staged and synthetic data, cleaned up duplicate images, created benchmark sets, and defined confidence thresholds, which help us decide when to move a model to production.

In Part 3, we will discuss how we are going to evaluate the different model performances. And then finally we will get to the actual model training and the techniques to enhance accuracy.
