A Data Scientist’s Guide to Docker Containers https://towardsdatascience.com/a-data-scientists-guide-to-docker-containers/ Tue, 08 Apr 2025 20:02:45 +0000 https://towardsdatascience.com/?p=605692 How to enable your ML model to run anywhere

For an ML model to be useful, it needs to run somewhere. This somewhere is most likely not your local machine. A not-so-good model that runs in a production environment is better than a perfect model that never leaves your local machine.

However, the production machine is usually different from the one you developed the model on. So, you ship the model to the production machine, but somehow the model doesn’t work anymore. That’s weird, right? You tested everything on your local machine and it worked fine. You even wrote unit tests.

What happened? Most likely the production machine differs from your local machine. Perhaps it does not have all the needed dependencies installed to run your model. Perhaps installed dependencies are on a different version. There can be many reasons for this.

How can you solve this problem? One approach could be to exactly replicate the production machine. But that is very inflexible as for each new production machine you would need to build a local replica.

A much nicer approach is to use Docker containers.

Docker is a tool that helps us to create, manage, and run code and applications in containers. A container is a small isolated computing environment in which we can package an application with all its dependencies. In our case our ML model with all the libraries it needs to run. With this, we do not need to rely on what is installed on the host machine. A Docker Container enables us to separate applications from the underlying infrastructure.

For example, we package our ML model locally and push it to the cloud. With this, Docker helps us to ensure that our model can run anywhere and anytime. Using Docker has several advantages for us. It helps us to deliver new models faster, improve reproducibility, and make collaboration easier. All because we have exactly the same dependencies no matter where we run the container.

As Docker is widely used in the industry, Data Scientists need to be able to build and run containers using Docker. Hence, in this article, I will go through the basic concepts of containers. I will show you all you need to know about Docker to get started. After we have covered the theory, I will show you how you can build and run your own Docker container.


What is a container?

A container is a small, isolated environment in which everything is self-contained. The environment packages up all code and dependencies.

A container has five main features.

  1. self-contained: A container isolates the application/software from its environment/infrastructure. Due to this isolation, we do not need to rely on any pre-installed dependencies on the host machine. Everything we need is part of the container. This ensures that the application can always run regardless of the infrastructure.
  2. isolated: The container has a minimal influence on the host and other containers and vice versa.
  3. independent: We can manage containers independently. Deleting a container does not affect other containers.
  4. portable: As a container isolates the software from the hardware, we can run it seamlessly on any machine. With this, we can move it between machines without a problem.
  5. lightweight: Containers are lightweight as they share the host machine’s OS. As they do not require their own OS, we do not need to partition the hardware resources of the host machine.

This might sound similar to virtual machines. But there is one big difference. The difference is in how they use their host computer’s resources. Virtual machines are an abstraction of the physical hardware. They partition one server into multiple. Thus, a VM includes a full copy of the OS which takes up more space.

In contrast, containers are an abstraction at the application layer. All containers share the host’s OS but run in isolated processes. Because containers do not contain an OS, they are more efficient in using the underlying system and resources by reducing overhead.

Containers vs. Virtual Machines (Image by the author based on docker.com)

Now we know what containers are. Let’s get some high-level understanding of how Docker works. I will briefly introduce the technical terms that are used often.


What is Docker?

To understand how Docker works, let’s have a brief look at its architecture.

Docker uses a client-server architecture containing three main parts: A Docker client, a Docker daemon (server), and a Docker registry.

The Docker client is the primary way to interact with Docker through commands. We use the client to communicate through a REST API with as many Docker daemons as we want. Often used commands are docker run, docker build, docker pull, and docker push. I will explain later what they do.

The Docker daemon manages Docker objects, such as images and containers. The daemon listens for Docker API requests. Depending on the request the daemon builds, runs, and distributes Docker containers. The Docker daemon and client can run on the same or different systems.

The Docker registry is a centralized location that stores and manages Docker images. We can use registries to share images and make them accessible to others.

Sounds a bit abstract? No worries, once we get started it will be more intuitive. But before that, let’s run through the needed steps to create a Docker container.

Docker Architecture (Image by author based on docker.com)

What do we need to create a Docker container?

It is simple. We only need to do three steps:

  1. create a Dockerfile
  2. build a Docker Image from the Dockerfile
  3. run the Docker Image to create a Docker container

Let’s go step-by-step.

A Dockerfile is a text file that contains instructions on how to build a Docker Image. In the Dockerfile we define what the application looks like and its dependencies. We also state what process should run when launching the Docker container. The Dockerfile is composed of layers, representing a portion of the image’s file system. Each layer either adds, removes, or modifies the layer below it.

Based on the Dockerfile we create a Docker Image. The image is a read-only template with instructions to run a Docker container. Images are immutable. Once we create a Docker Image we cannot modify it anymore. If we want to make changes, we can only add changes on top of existing images or create a new image. When we rebuild an image, Docker is clever enough to rebuild only layers that have changed, reducing the build time.

A Docker Container is a runnable instance of a Docker Image. The container is defined by the image and any configuration options that we provide when creating or starting the container. When we remove a container all changes to its internal states are also removed if they are not stored in a persistent storage.


Using Docker: An example

With all the theory, let’s get our hands dirty and put everything together.

As an example, we will package a simple ML model with Flask in a Docker container. We can then run requests against the container and receive predictions in return. We will train a model locally and only load the artifacts of the trained model in the Docker Container.

I will go through the general workflow needed to create and run a Docker container with your ML model. I will guide you through the following steps:

  1. build model
  2. create requirements.txt file containing all dependencies
  3. create Dockerfile
  4. build docker image
  5. run container

Before we get started, we need to install Docker Desktop. We will use it to view and run our Docker containers later on. 

1. Build a model

First, we will train a simple RandomForestClassifier on scikit-learn’s Iris dataset and then store the trained model.
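A minimal sketch of what this could look like (the file name train.py, the artifact name model.onnx, and the use of skl2onnx for exporting are my assumptions, since the original training script is not reproduced here):

```python
# train.py - train a RandomForestClassifier on Iris and store it as an ONNX artifact
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import to_onnx

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Export to ONNX so the container only needs onnxruntime, not scikit-learn
onnx_model = to_onnx(model, X[:1].astype(np.float32))
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```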

Second, we build a script making our model available through a Rest API, using Flask. The script is also simple and contains three main steps:

  1. extract and convert the data we want to pass into the model from the payload JSON
  2. load the model artifacts, create an ONNX session, and run the model
  3. return the model’s predictions as JSON

I took most of the code from here and here and made only minor changes.
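Since those sources are not reproduced here, here is a rough sketch of what the Flask script (example.py) could contain; the payload key "data" is an assumption on my part, while the /invocations endpoint and port 8080 follow from the rest of the article:

```python
# example.py - serve the ONNX model through a small Flask REST API
import numpy as np
import onnxruntime as ort
from flask import Flask, jsonify, request

app = Flask(__name__)
session = ort.InferenceSession("model.onnx")   # load the stored model artifact
input_name = session.get_inputs()[0].name

@app.route("/invocations", methods=["POST"])
def invocations():
    payload = request.get_json()
    # 1. extract and convert the input data from the payload JSON
    data = np.array(payload["data"], dtype=np.float32)
    # 2. run the ONNX session
    predictions = session.run(None, {input_name: data})[0]
    # 3. return the predictions as JSON
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```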

2. Create requirements

Once we have created the Python file we want to execute when the Docker container is running, we must create a requirements.txt file containing all dependencies. In our case, it looks like this:
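The original file is not shown here, but for the setup sketched above it would roughly contain the following (the versions are illustrative, pin the ones you actually develop against):

```
flask==3.0.*
numpy==1.26.*
onnxruntime==1.17.*
```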

3. Create Dockerfile

The last thing we need to prepare before being able to build a Docker Image and run a Docker container is to write a Dockerfile.

The Dockerfile contains all the instructions needed to build the Docker Image. The most common instructions are

  • FROM <image> — this specifies the base image that the build will extend.
  • WORKDIR <path> — this instruction specifies the “working directory” or the path in the image where files will be copied and commands will be executed.
  • COPY <host-path> <image-path> — this instruction tells the builder to copy files from the host and put them into the container image.
  • RUN <command> — this instruction tells the builder to run the specified command.
  • ENV <name> <value> — this instruction sets an environment variable that a running container will use.
  • EXPOSE <port-number> — this instruction sets the configuration on the image that indicates a port the image would like to expose.
  • USER <user-or-uid> — this instruction sets the default user for all subsequent instructions.
  • CMD ["<command>", "<arg1>"] — this instruction sets the default command a container using this image will run.

With these, we can create the Dockerfile for our example. We need to take the following steps:

  1. Determine the base image
  2. Install application dependencies
  3. Copy in any relevant source code and/or binaries
  4. Configure the final image

Let’s go through them step by step. Each of these steps results in a layer in the Docker Image.

First, we specify the base image that we then build upon. As we have written the example in Python, we will use a Python base image.

Second, we set the working directory into which we will copy all the files we need to be able to run our ML model.

Third, we refresh the package index files to ensure that we have the latest available information about packages and their versions.

Fourth, we copy in and install the application dependencies.

Fifth, we copy in the source code and all other files we need. Here, we also expose port 8080, which we will use for interacting with the ML model.

Sixth, we set a user so that the container does not run as the root user.

Seventh, we define that the example.py file will be executed when we run the Docker container. With this, we create the Flask server to run our requests against.
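Putting these seven steps together, a Dockerfile along the following lines would do the job; the base image tag, the user, and the copied file names are my assumptions:

```dockerfile
# 1. Base image
FROM python:3.11-slim

# 2. Working directory inside the image
WORKDIR /app

# 3. Refresh the package index files (and clean up to keep the layer small)
RUN apt-get update && rm -rf /var/lib/apt/lists/*

# 4. Copy in and install the application dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# 5. Copy in the source code and model artifact, and expose port 8080
COPY example.py model.onnx ./
EXPOSE 8080

# 6. Do not run the container as root
USER nobody

# 7. Start the Flask server when the container launches
CMD ["python", "example.py"]
```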

Besides creating the Dockerfile, we can also create a .dockerignore file to improve the build speed. Similar to a .gitignore file, we can exclude directories from the build context.
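For this example, a minimal .dockerignore might look like this (the entries are just typical candidates):

```
.git/
.venv/
__pycache__/
*.ipynb
data/
```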

If you want to know more, please go to docker.com.

4. Create Docker Image

After creating all the files we need, we can now build the Docker Image.

To build the image we first need to open Docker Desktop. You can check if Docker Desktop is running by running docker ps in the command line. This command shows you all running containers.

To build a Docker Image, we need to be at the same level as our Dockerfile and requirements.txt file. We can then run docker build -t our_first_image . Here, the -t flag sets the name of the image, i.e., our_first_image, and the . tells Docker to build from the current directory.

Once we built the image we can do several things. We can

  • view the image by running docker image ls
  • view the history or how the image was created by running docker image history <image_name>
  • push the image to a registry by running docker push <image_name>

5. Run Docker Container

Once we have built the Docker Image, we can run our ML model in a container.

For this, we only need to execute docker run -p 8080:8080 <image_name> in the command line. With -p 8080:8080 we connect the local port (8080) with the port in the container (8080).

If the Docker Image doesn’t expose a port, we could simply run docker run <image_name>. Instead of using the image_name, we can also use the image_id.

Okay, once the container is running, let’s run a request against it. For this, we will send a payload to the endpoint by running curl -X POST http://localhost:8080/invocations -H "Content-Type: application/json" -d @path/to/sample_payload.json


Conclusion

In this article, I showed you the basics of Docker Containers, what they are, and how to build them yourself. Although I only scratched the surface, it should be enough to get you started and able to package your next model. With this knowledge, you should be able to avoid the “it works on my machine” problems.

I hope that you find this article useful and that it will help you become a better Data Scientist.

See you in my next article and/or leave a comment.

The Impact of GenAI and Its Implications for Data Scientists https://towardsdatascience.com/the-impact-of-genai-and-its-implications-for-data-scientists/ Fri, 14 Mar 2025 22:56:09 +0000 https://towardsdatascience.com/?p=599603 What we can learn from Anthropic’s analysis of millions of Claude.ai chats

GenAI systems affect how we work. This general notion is well known. However, we are still unaware of the exact impact of GenAI. For example, how much do these tools affect our work? Do they have a larger impact on certain tasks? What does this mean for us in our daily work?

To answer these questions, Anthropic released a study based on millions of anonymized conversations on Claude.ai. The study provides data on how GenAI is incorporated into real-world tasks and reveals actual GenAI usage patterns.

In this article, I will go through the four main findings of the study. Based on the findings I will derive how GenAI changes our work and what skills we need in the future.

Main findings

GenAI is mostly used for software development and technical writing tasks, which together account for almost 50 % of all tasks. This is likely because LLMs are mostly text-based and thus less useful for other kinds of tasks.

GenAI has a stronger impact on some groups of occupations than others. More than one-third of occupations use GenAI in at least a quarter of their tasks. In contrast, only 4 % of occupations use it for more than three-quarters of their tasks. We can see that only very few occupations use GenAI across most of their tasks. This suggests that no job is being entirely automated.

GenAI is used for augmentation rather than automation, i.e., 57 % vs. 43 % of the tasks. But most occupations use both augmentation and automation across tasks. Here, augmentation means the user collaborates with the GenAI to enhance their capabilities. Automation, in contrast, refers to tasks in which the GenAI directly performs the task. However, the authors suspect that the share of augmentation is even higher, as users might adjust GenAI answers outside of the chat window. Hence, what seems to be automation is actually augmentation. The results suggest that GenAI serves as an efficiency tool and a collaborative partner, resulting in improved productivity. These results align very well with my own experience. I mostly use GenAI tools to augment my work instead of automating tasks. In the article below you can see how GenAI tools have increased my productivity and what I use them for daily.

GenAI is mostly used for tasks associated with mid-to-high-wage occupations, such as data scientists. In contrast, the lowest and highest-paid roles show a much lower usage of GenAI. The authors conclude that this is due to the current limits of GenAI capabilities and practical barriers when it comes to using GenAI.

Overall, the study suggests that occupations will rather evolve than disappear. This is because of two reasons. First, GenAI integration remains selective rather than comprehensive within most occupations. Although many jobs use GenAI, the tools are only used selectively for certain tasks. Second, the study saw a clear preference for augmentation over automation. Hence, GenAI serves as an efficiency tool and a collaborative partner.

Limitations

Before we can derive the implications of GenAI, we should look at the limitations of the study:

  • It is unknown how the users used the responses. Are they copy-pasting code snippets uncritically or editing them in their IDE? Hence, some conversations that look like automation might have been augmentation instead.
  • The authors only used conversations from Claude.ai’s chat but not from API or Enterprise users. Hence, the dataset used in the analysis shows only a fraction of actual GenAI usage.
  • Automating the classification might have led to the wrong classification of some conversations. However, due to the large number of conversations used, the impact should be rather small.
  • Claude being only text-based restricts the tasks and thus might exclude certain jobs.
  • Claude is advertised as a state-of-the-art coding model, thus mostly attracting users working on coding tasks.

Overall, the authors conclude that their dataset is not a representative sample of GenAI use in general. Thus, we should handle and interpret the results with care. Despite the study’s limitations, we can see some implications from the impact of GenAI on our work, particularly as Data Scientists.

Implications

The study shows that GenAI has the potential to reshape jobs and we can already see its impact on our work. Moreover, GenAI is rapidly evolving and still in the early stages of workplace integration.

Thus, we should be open to these changes and adapt to them.

Most importantly, we must stay curious, adaptive, and willing to learn. In the field of Data Science changes happen regularly. With GenAI tools change will happen even more frequently. Hence, we must stay up-to-date and use the tools to support us in this journey.

Currently, GenAI has the potential to enhance our capabilities instead of automating them.

Hence, we should focus on developing skills that complement GenAI. We need skills to augment workflows effectively in our work and analytical tasks. These skills lie in areas with low penetration of GenAI. This includes human interaction, strategic thinking, and nuanced decision-making. This is where we can stand out.

Moreover, skills such as critical thinking, complex problem-solving, and judgment will remain highly valuable. We must be able to ask the right questions, interpret the output of LLMs, and take action based on the answers.

Moreover, GenAI will not replace our collaboration with colleagues in projects. Hence, improving our emotional intelligence will help us to work together effectively.

Conclusion

GenAI is rapidly evolving and still in the early stages of workplace integration. However, we can already see some implications from the impact of GenAI on our work.

In this article, I showed you the main findings of a recent study from Anthropic on the use of their LLMs. Based on the results, I showed you the implications for Data Scientists and what skills might become more important.

I hope that you find this article useful and that it will help you become a better Data Scientist.

See you in my next article.

How GenAI Tools Have Changed My Work as a Data Scientist https://towardsdatascience.com/how-genai-tools-have-changed-my-work-as-a-data-scientist-0476d3724d42/ Tue, 28 Jan 2025 18:01:57 +0000 https://towardsdatascience.com/how-genai-tools-have-changed-my-work-as-a-data-scientist-0476d3724d42/ An overview of the 4 use cases and 6 GenAI tools I use

(Image by the author)

It has been slightly more than two years since ChatGPT came out and started the hype around GenAI. Since then many things have happened. GenAI tools came and went. The ones that stayed became better and better, extending to more use cases.

Over the last year, I began to integrate some GenAI tools more and more into my daily work. I tried different GenAI tools and different use cases.

In this article, I will share how GenAI tools help me in my daily work. I will share what GenAI tools I use, and how they have changed my work as a Data Scientist. I will talk about how they have increased my productivity and also some words of warning.

Generally, I use GenAI tools for many things as they allow me to

  • increase my productivity in some areas
  • focus on the important and hard stuff that adds value
  • increase the quality of my work
  • learn faster and easier

I am more productive as I can iterate faster and try more things as I do not need to write all the code myself. As GenAI tools take over some of the easy but time-consuming parts of my work, I have more time to spend on strategic thinking, complex problem-solving, and being creative. Moreover, they help me to increase the quality of my work in many areas, not only when it comes to writing code. Finally, GenAI tools help me learn new things faster as I can create a learning path that suits me best.

But how does this look in detail?

Let’s jump right into it.


My (current) use cases for GenAI tools

I have four use cases for which I use GenAI tools daily:

  • learning
  • coding
  • documentation
  • writing

Learning

When it comes to learning, GenAI tools help me in different ways

  • brainstorm ideas, refine ideas, and get new input
  • get an overview of a new topic
  • explain concepts and research papers
  • learn new programming languages

The main benefit of using GenAI tools for learning is that I can ask many questions. This allows me to approach a topic that is new to me from different angles. I do not need to search hundreds of websites to find an explanation that suits me. The GenAI tools can generate many examples on the spot.

This has greatly sped up my learning and improved my learning experience. For all these tasks, the GenAI tools helped me get a good overview. However, it is often not enough when it comes to details or complex concepts described in research papers. Hence, I use the GenAI tools as a first go-to source but then dive into the details the old-school way. Read the research paper or go through a few websites. But since I already have some good basis thanks to the GenAI tools, understanding research papers has become a lot easier and faster as well.

For my learning tasks, I mostly use ChatGPT, Claude, and Perplexity AI, depending on the exact task.

Perplexity AI has partly replaced my Google search. It is much faster than clicking through many links on Google, trying to find a good answer. Moreover, it gives a better-organized answer. However, the biggest benefit for me compared to other tools is that it adds sources to the statements. This makes it easy to verify the answer and tell how trustworthy the information is. Moreover, I already have a starting point to dig deeper if needed.

The quality of the results strongly depends on the prompt. The better your prompt, the more helpful the result. I usually try to be as detailed as possible and give some context and keywords the answer should focus on. Often it also helps when specifying a role, for example, "act as an expert in …" or let the AI tool give you alternatives. However, I am not a great prompt writer and there is room for improvement.

Coding

GenAI tools have greatly improved my coding experience and skills. In general, GenAI tools have changed what I spend most time on when coding. The tools help me be quicker in solving small and easy tasks as well as building prototypes. Thus, I have more time to focus on complex problems or try more approaches.

I use the tools mainly for

  • code generation
  • code improvements/refactoring
  • error investigation
  • understanding third-party libraries and their documentation

Using AI tools has become a natural part of my coding. This is mostly because of GitHub Copilot’s auto-complete / suggestion function. I can easily accept or decline suggestions and do not have to think about "asking" for help.

When it comes to code generation I use GenAI tools in two ways.

If I know how to implement an idea and already have the steps in mind, I go step by step and use GitHub Copilot’s auto-complete function. This is usually a good enough start. However, as the output does not always work or do what I want, I pay special attention to the desired behavior.

If I know what the result should look like but have no idea how to get there, I use ChatGPT or Claude to show me different approaches. This has greatly improved my work. Before, I had to tediously search the internet to get ideas for my specific problem. This search usually took a lot of time before finding the right thing (if I found it at all). Now I write a few prompts until I have what I want. I use ChatGPT or Claude as they give different suggestions and it is hard to know which one works better for a specific problem.

For example, I often use ChatGPT or Claude to generate plots for data analysis as it is much faster. Without the tools, it took me a lot of time to get a matplotlib plot to look like I wanted. Especially when I wanted to deviate from the easy and standard plots.

In some use cases, I also use GenAI tools to help me generate unit tests. I am not using the tools to create entire unit tests but to give me a starting point and a basis to build on. Building this basis is usually tedious and takes a lot of time. For this, I use GitHub Copilot as the tool is aware of the code base and can give me better suggestions. However, as the unit tests ensure the robustness of the code, I rely as little as possible on the GenAI tools.

Once I have a first version of the code I often use GitHub Copilot to refactor the code, improving code readability and conciseness. This is because almost always the first version is a mess. It’s inefficient, looks messy, and is too convoluted. It reflects my thinking when developing and testing ideas. It is far away from being production-ready. Here, GitHub Copilot’s suggestions are often a good first step. They also help me to learn different ways to solve a problem.

When investigating errors, AI tools have helped me a lot. Sometimes the error message is unclear, particularly when learning a new programming language. Using GenAI tools helps me avoid googling for hours and reading through lots of Stack Overflow posts. I mostly use GitHub Copilot as it knows the context without me needing to copy the code. This works well for easy errors and bugs. Although it usually does not work well for more complex errors, the answers indicate what to look for. I rarely use ChatGPT or Claude to explain error messages as they miss context.

My last use case when it comes to coding, is probably the most interesting. I use GenAI tools to find and explain the functionalities of third-party libraries. The documentation of these libraries is sometimes not very user-friendly. It is hard to find what I am looking for and/or understand. Using ChatGPT to search documentation and act as a translator helped me many times. With this, the hassle of finding and using the functionality of third-party libraries decreased.

Documentation

Documenting my code has always been tedious and unwanted work for me. Writing docstrings and ReadMes always took a lot of time, which I wanted to spend elsewhere.

Letting GenAI tools support me in writing documentation has been one of my first use cases. My go-to tool is GitHub Copilot as it is nicely integrated with my IDE. Hence, I do not need to copy/paste code blocks between my IDE and a GenAI tool.

GitHub Copilot takes over most of the work now for writing docstrings and ReadMes. I only refine the suggestions and do smaller re-writes.

Writing

Using AI tools has greatly improved and sped up my writing process. As I am not a native English speaker, I mostly use Grammarly to check my grammar and spelling. With this, I can focus on writing and editing. I do not need to spend hours fixing grammar and spelling mistakes. Moreover, I use Grammarly and Hemingway to suggest where to improve my writing. Here, I mostly focus on sentences that are hard to read and understand.

However, I only use these tools to make suggestions. I do not use any GenAI tools to write entire paragraphs or re-write my articles. This has two reasons. First, by rewriting entire paragraphs, the text would lose my personality and change my writing style. The articles would become boring and less interesting to read. Second, I see writing as a big part of my learning process. Writing helps me to uncover knowledge gaps, bring order into my ideas, and connect the dots. Letting GenAI tools do the writing would take away the benefit of writing for my learning process.


Words of Warnings

Although GenAI tools can help us with many things, we should also be cautious when using them. They are a double-edged sword as they have some caveats.

Most importantly, we should not rely too much on GenAI tools. We should not blindly trust the answers but always critically reflect on them. This is because these tools hallucinate: a GenAI tool can give you a seemingly correct answer that sounds reasonable, but is wrong.

When it comes to learning, GenAI tools can prevent us from learning as much as they can help us learn. A big part of learning (at least for me) is to think critically about a topic, understand the fundamentals, and make connections to other subjects I am already familiar with. This takes time and can be hard work. With GenAI tools, I can ask anything and get a well-prepared answer. This is easy in the short term. However, as I do not do the hard work I need to learn a new subject, I quickly forget what I "learned". Hence, we should make sure that GenAI tools enhance how we learn and that we continue learning new skills and extending our knowledge.

When it comes to coding, GenAI tools can slow us down as much as they can speed us up. Often the code contains bugs or does not exactly do what it is supposed to do. I would say it is fine when there is an obvious mistake, but it is a problem when these errors become very subtle. It can be hard and time-consuming to validate that the code is correct and does not introduce subtle bugs. Hence, in the end, you could have been faster without using a GenAI tool. Thus, I only let the GenAI tools generate small code pieces. This makes it easy to ensure that the code does exactly what I want. However, these simpler tasks took up most of my time before I used GenAI tools.

Lastly, we should think about what we share with such tools. I try not to share any private data because I do not know what will happen with it once I send my request. This is particularly important when it comes to company data. I usually create simplified use cases and code examples that represent my problem before asking ChatGPT or Claude. Moreover, I do not use any Analytics functionality on complete data sets.


Conclusion

Looking at my use cases for GenAI tools and how they have changed my work as a Data Scientist you might wonder if I could work without them. Yes, I could.

But do I want to? No, as they have improved my work in certain areas.

There are caveats but they are outweighed by the overall benefit. Moreover, GenAI tools are getting better and better. The development over the last two years has been amazing and jaw-dropping. New tools are coming up and existing ones improve quickly. We can use GenAI tools for more and more use cases.

There are probably many more use cases and AI tools to help us in our daily work as Data Scientists. Hence, let me know in the comments what tools you use and what use cases you have. Otherwise, see you in my next article.

Uncertainty Quantification in Time Series Forecasting https://towardsdatascience.com/uncertainty-quantification-in-time-series-forecasting-c9599d15b08b/ Mon, 09 Dec 2024 20:21:49 +0000 https://towardsdatascience.com/uncertainty-quantification-in-time-series-forecasting-c9599d15b08b/ A deep dive into EnbPI, a Conformal Prediction approach for time series forecasting

Image by the author

Most of my recent articles revolved around knowing how sure a model is about its predictions. If we know the uncertainty of predictions, we can make well-informed decisions. I showed you how we can use Conformal Prediction to quantify a model’s uncertainty. I wrote about Conformal Prediction approaches for classification and regression problems.

Calibrating Classification Probabilities the Right Way

Increase Trust in Your Regression Model The Easy Way

For these approaches, we assume that the order of observation does not matter, i.e., that our data is exchangeable. This is reasonable for classification and regression problems. However, the assumption does not hold for time series problems. Here, the order of observations often contains important information, such as trends or seasonality. Hence, the order must stay intact when we want to quantify the uncertainty of our prediction.

Does this mean that we cannot apply Conformal Prediction to time series?

Luckily no. We only need to use a different algorithm.

In 2021 the first Conformal Prediction algorithm for time series, "ENsemble Batch Prediction Interval" (EnbPI), was published. The approach does not require data exchangeability. Hence, EnbPI can handle non-stationary and spatio-temporal data dependencies. The results of different forecasting tasks were promising. EnbPI’s prediction intervals had approximately valid marginal coverage. Also, they maintained coverage where other methods failed.


So, how does EnbPI work?

The idea of EnbPI is to train one underlying model on different subsets of the data and then derive the prediction interval by ensembling these models. EnbPI creates the different subsets by applying a bootstrapping approach in which we randomly sample subsets of the time series with replacement.

As EnbPI extends Conformal Prediction to time series, the approach has similar advantages:

  • constructs distribution-free prediction intervals that reach approximate marginal coverage
  • can use any estimator
  • is computationally efficient during inference as we do not require retraining models
  • performs well on small datasets as we do not need a separate calibration set

However, we should note that training the ensemble takes more time as we train several models.

Now that we covered the high-level idea and advantages, let’s look at how EnbPI works under the hood.


The EnbPI Recipe

We can split the algorithm into a training and prediction phase. Each phase consists of two steps.

Before we start training our EnbPI ensemble we must choose an ensemble estimator. This estimator can be any model, e.g., a boosted tree, a neural network, a linear regression, or a statistical model.

The steps when using EnbPI. The yellow boxes show the steps taken during training. The blue boxes show the steps taken when making a prediction. The step in the green box is optional and can be seen as an online update. (Image by the author)

Training Phase

The training phase consists of two steps: bootstrapping non-overlapping subsets from the training data and fitting a model to each of these subsets.

Step 1: Sample Bootstrap Subsets

Before we create the subsets we must decide how many models our ensemble should contain. For each model, we need one bootstrap sample.

One important requirement for these subsets is that they are not overlapping. This means that each subset is unique and independent of all other subsets. Each subset contains different data from the original data set. As we train each model on different data we introduce variability across the trained models in the ensemble. This increases the diversity of our ensemble and reduces overfitting.

Choosing a good ensemble size depends on the data set size. The smaller the data set, the fewer non-overlapping subsets we can create as we need to ensure that each subset has enough data points to train a model. Yet, a small ensemble size results in less diversity in our ensemble and thus a reduced performance of the EnbPI approach. This in turn will lead to wider prediction intervals.

The more models we train, the better our performance, i.e., the narrower the prediction interval. This is because we have a more diverse ensemble that can capture more variability in the data. Also, the ensemble becomes more robust as we aggregate more forecasts. However, we must train more models, resulting in higher computational costs.

Usually, an ensemble size of 20 to 50 is enough, balancing efficiency and accuracy.

Once we have decided on the ensemble size, we must create one subset of the data for each model in the ensemble. We get a subset by drawing with replacement from the original dataset.

Note that we sample blocks instead of single values to account for the time dependency between observations. As we sample with replacement, some blocks may appear multiple times, while others are absent.

The process of bootstrapping subsets from a time series. The time series is split into different blocks. To build each subset we sample blocks randomly with replacement. Hence, blocks can be repeated within a subset. (Image by the author)

Once we have our bootstrap samples, we can train our ensemble models.

Step 2: Fit Bootstrap Ensemble Models

For each bootstrapped subset of the training data, we train one model, creating a diverse ensemble of models. As we train each model on different data, we will receive different predictions on the same input data. This diversity is key to robust estimations of the prediction interval.

Prediction Phase

Step 3: Leave-One-Out (LOO) Estimation

Once we have the trained ensemble we determine the ensemble’s variance when forecasting on unseen data. We use the variance to calibrate our ensemble and decide the width of our prediction interval. For this, we follow the standard Conformal Prediction recipe. If you want to read more about the recipe in detail, I recommend the article below.

All You Need Is Conformal Prediction

The first step is to determine the non-conformity scores. In the case of EnbPI, the non-conformity score is the difference between the true and predicted value.

But on what data set do we calibrate the ensemble?

We do not have a calibration set. Remember that we trained each estimator in the ensemble on a different part of the training set. Thus, no estimator has seen all the data in the training set. Hence, we can use our training set for calibration.

For each observation in the training set, we make a prediction using the ensemble. However, we only use the ensemble estimators that have NOT seen the observation during training. Then we aggregate the predictions from these models using an aggregation function, e.g., mean or median.

The aggregation function affects the robustness of the predictions. The mean is generally sensitive to outliers but reduces the overall error. The median is robust against outliers and thus suitable for noisy data. Hence, we should choose the aggregation based on the data set and our use case.

Finally, we use the aggregated prediction to determine the non-conformity score for that particular observation. With this EnbPI uses out-of-sample errors as the non-conformity score. We will use these non-conformity scores to calibrate a forecast on unseen data in the next step.

Our EnbPI ensemble contains 4 models. To calibrate the ensemble we will make a prediction for each observation in the training set. For the first observation, we only use models 3 and 4 because the observation was in the training subsets of models 1 and 2. We then take the mean of the predictions (violet dot). We repeat this process for all observations in the training set. For the second observation, we can use models 1, 3, and 4 as only model 2’s training set contained the observation. And so on. Then we calculate the error between our ensemble mean and the true value, i.e., our non-conformity score. Based on the distribution and a chosen significance level alpha we determine the cut-off value q. (Image by the author)

Step 4: Construct Prediction Intervals

For predictions on unseen data, we use each of the trained ensemble estimators. We aggregate the single predictions using the same aggregation function as in Step 3. The resulting value will be the center of our prediction interval.

To construct the prediction interval around the center, we use the distribution of residuals from step 3. We determine a cut-off value using a pre-defined significance level. The cut-off value is then added/subtracted from the predicted center to create our prediction interval.

This step should be familiar to you as most conformal prediction methods do it. If you are unfamiliar with it, I recommend the above article in which I describe the procedure in more detail.

(Optional) Step 5: Updating the non-conformity scores

The above-described approach uses the non-conformity scores we computed based on our training data.

We do not update them once we receive new data. Hence, the width of our prediction interval will not change over time as we do not add new information. This can be problematic if our underlying data or the model’s performance changes.

To account for such changes, we can update the non-conformity scores as soon as new observations become available. For this, we do not need to retrain the ensemble estimators. We only need to compute the non-conformity scores. With this, they reflect the most recent data and model dynamics, resulting in an updated interval width.

Forecasting example using EnbPI

Now that we know how EnbPI works, let’s apply the model to a forecasting task.

We will predict the wholesale electricity prices in Germany for the next day. The data is available from Ember under a CC-BY-4.0 license.

Luckily different packages, like MAPIE, sktime, and Amazon Fortuna have implemented EnbPI. Hence, it is straightforward for us to use the method. Here, I will use the EnbPI implementation from the mapie library.

Please note that I am not trying to get a forecast as accurate as possible but rather show how we can apply EnbPI.

Alright. Let’s get started.

Let’s import all the libraries and the datasets we need.

I also changed the data to mimic a data shift by 100 €/MWh in the last two weeks of the data set. This is to see how adaptive the prediction intervals are to sudden changes.
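The data loading is not reproduced in full here; a sketch could look like the following, where the file name and the column names (datetime, price) are my assumptions:

```python
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor
from mapie.regression import MapieTimeSeriesRegressor
from mapie.subsample import BlockBootstrap

# Hourly German wholesale electricity prices (CSV exported from Ember; column names assumed)
df = pd.read_csv("german_electricity_prices.csv", parse_dates=["datetime"], index_col="datetime")

# Mimic a data shift: add 100 €/MWh to the last two weeks of the series
shift_start = df.index.max() - pd.Timedelta(days=14)
df.loc[df.index >= shift_start, "price"] += 100
```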

We will skip a detailed data exploration step here. However, we can see two seasonal components:

  • Daily: Prices are higher in the morning and evening hours as the electricity consumption is usually higher during these hours.
  • Weekly: Prices are higher on weekdays than on weekends as electricity consumption is usually higher on weekdays.

Based on that I derive the following datetime and lag features:

  • datetime features: day, month, year, and if the day is a weekend
  • lag features: same hour values of the past 7 days and the moving average over the past 24 hours lagged by 1 day
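A sketch of this feature engineering, assuming the target column is called price and the index is an hourly DatetimeIndex:

```python
# Datetime features
df["day"] = df.index.day
df["month"] = df.index.month
df["year"] = df.index.year
df["is_weekend"] = (df.index.dayofweek >= 5).astype(int)

# Lag features: same hour of the past 7 days
for lag_days in range(1, 8):
    df[f"lag_{lag_days}d"] = df["price"].shift(24 * lag_days)

# Moving average over the past 24 hours, lagged by 1 day
df["rolling_mean_24h"] = df["price"].rolling(24).mean().shift(24)

df = df.dropna()
```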

The next step is splitting our dataset into a training and test set. Although I only want to forecast the next 24 hours, I will use the last 30 days as my test set. This will give us a better idea of how the prediction intervals change over time.

Finally, I created a function to plot the results.

Ensemble model

Before we can apply EnbPI, we need to decide on an underlying model for the ensemble. I will use LightGBM but we could use any other model.

I will skip the hyperparameter optimization as we are not interested in the performance of the underlying model. However, the more accurate your model is, the better your prediction interval will be.
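With default hyperparameters, the underlying estimator is then simply:

```python
model = LGBMRegressor(random_state=42)  # no tuning, default hyperparameters
```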

Training the EnbPI ensemble

Let’s wrap EnbPI around the model. The implementation is straightforward.

First, we must create our bootstrap samples, using Mapie’s BlockBootstrap class.

Here, we choose the number of blocks, i.e., how many models should be in the ensemble and the length of each block. I choose 20 blocks with a length equal to our forecast horizon of 24 hours. Moreover, I state that the blocks are not overlapping.
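Using MAPIE’s 0.x API, this could look as follows; mapping the ensemble size to n_resamplings and the block length to length is my reading of the library’s documentation, so treat it as an assumption:

```python
cv_enbpi = BlockBootstrap(
    n_resamplings=20,   # number of bootstrap samples, i.e., models in the ensemble
    length=24,          # block length equal to the 24-hour forecast horizon
    overlapping=False,  # blocks are not overlapping
    random_state=42,
)
```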

Second, we initialize the EnbPI model using Mapie’s MapieTimeSeriesRegressor class. We pass in our model and define the aggregation function. Once we have initialized our model, we can fit it with the fit() method.

Once we have fitted the ensemble, we can run predictions using the predict() method. Besides passing in the features of our test set, we also pass in the significance level alpha. I use 0.05, which means the prediction interval should contain the true value with a probability of 95 %.
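A sketch of these two steps, assuming X_train, y_train, and X_test are the splits created earlier:

```python
mapie_enbpi = MapieTimeSeriesRegressor(
    model,                 # the LightGBM estimator from above
    method="enbpi",
    cv=cv_enbpi,           # the BlockBootstrap object
    agg_function="mean",   # how the ensemble predictions are aggregated
    n_jobs=-1,
)
mapie_enbpi.fit(X_train, y_train)

# Predict with a 95 % target coverage (alpha = 0.05)
y_pred, y_pis = mapie_enbpi.predict(X_test, alpha=0.05, ensemble=True)
# y_pis has shape (n_samples, 2, 1): lower and upper bound of the prediction interval
```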

Let’s look at the result.

Forecast results using EnbPI (Image by the author).

This looks good. The coverage is 91 %, which is below our target, and the width of the interval is 142.62 €/MWh. The coverage is probably below the target of 95 % because of the shift of the target in the middle of the test period. Moreover, we can see that the width of the interval does not change.

Updating the Non-Conformity Scores

We can easily update the non-conformity scores after new data becomes available. For this we can use Mapie’s partial_fit() method.

We will need to update the code slightly. The only difference is that we now simulate that only the next 24 hours of data from the test set become available at a time.
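A sketch of this online update, assuming the test data arrives in 24-hour batches:

```python
horizon = 24
y_pred_pfit = np.zeros(len(X_test))
y_pis_pfit = np.zeros((len(X_test), 2, 1))

# First day: predict without any update
y_pred_pfit[:horizon], y_pis_pfit[:horizon] = mapie_enbpi.predict(
    X_test.iloc[:horizon], alpha=0.05, ensemble=True
)

for step in range(horizon, len(X_test), horizon):
    # Update the non-conformity scores with the 24 hours that just became available
    mapie_enbpi.partial_fit(
        X_test.iloc[step - horizon:step], y_test.iloc[step - horizon:step]
    )
    # Forecast the next 24 hours
    y_pred_pfit[step:step + horizon], y_pis_pfit[step:step + horizon] = mapie_enbpi.predict(
        X_test.iloc[step:step + horizon], alpha=0.05, ensemble=True
    )
```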

Let’s look at the result.

Forecast results using EnbPI with an online update of the residuals (Image by the author).

The results look the same as above. The coverage is 91 %, which is below our target, and the width of the interval is 142.62 €/MWh. Unfortunately, the width of the interval also stays the same.


Conclusion

The article has been very long. Longer than I intended. But there was a lot to cover. If you stayed until here, you now should

  • have a good understanding of how the EnbPI method works, and
  • be able to use EnbPI in practice.

If you want to dive deeper into the EnbPI method, check out this paper (https://arxiv.org/pdf/2010.09107) and this paper. Otherwise, comment and/or see you in my next article.

Confidence Interval vs. Prediction Interval https://towardsdatascience.com/confidence-interval-vs-prediction-interval-a6b0c4816a92/ Sun, 24 Nov 2024 14:02:06 +0000 https://towardsdatascience.com/confidence-interval-vs-prediction-interval-a6b0c4816a92/ A small but important difference that you should know

In many Data Science-related tasks, we want to know how certain we are about the result. Knowing how much we can trust a result helps us to make better decisions.

Once we have quantified the level of uncertainty that comes with a result we can use it for:

  • scenario planning to evaluate a best-case and worst-case scenario
  • risk assessment to evaluate the impact on decisions
  • model evaluation to compare different models and model performance
  • communication with decision-makers about how much they should trust the results

Uncertainty Quantification and Why You Should Care

Where does the uncertainty come from?

Let’s look at a simple example. We want to estimate the mean price of a 300-square-meter house in Germany. Collecting the data for all 300-square-meter houses is not viable. Instead, we will calculate the mean price based on a representative subset.

And that’s where the uncertainty comes from: the sampling process. We only have information about a subset or sample of a population. Unfortunately, a sample is never a perfect representation of the entire population. Thus, the true population parameter will differ from our sample estimate. This is also known as the sampling error. Moreover, depending on how we sample, the results will be different. Comparing two samples, we will get a different mean price for a 300-square-meter house.

If we want to predict the mean price, we have the same problem. We cannot collect all the population data that we would need. Instead, we must build our model based on a population subset. This results in a sampling uncertainty as we do not know the exact relationship between the mean price, i.e., the dependent variable, and the square meter, i.e., the independent variable.

Hence, we always have some uncertainty due to the sampling process. And this uncertainty we should quantify. We can do this by giving a range in which we expect the true value to lie. The narrower the range or interval, the more certain we are. (Assuming that the interval guarantees coverage.)

To quantify uncertainty two concepts are often used interchangeably: Confidence Interval and Prediction Interval.

You will hear them often as they are essential concepts in Statistics and thus, in the field of data science. On a high level, both provide a probabilistic upper and lower bound around an estimate of a target variable. These bounds create an interval that quantifies the uncertainty.

However, from a more detailed point of view, they refer to two different things. So, we should not use them interchangeably. Interpreting a Confidence Interval as a Prediction Interval gives a wrong sense of the uncertainty. As a result, we could make wrong decisions.

This article will help you avoid this trap. I will show you what a Confidence Interval and a Prediction Interval measure. Based on that I will show you their differences and when to use which interval.

So, let’s get started with the more famous/more often used one.


Confidence Interval

A Confidence Interval quantifies the sampling uncertainty when estimating population parameters, such as the mean, from a sample set. Hence, the Confidence Interval shows the uncertainty in the mean response of our sampled parameter.

But what does it mean?

Let’s take the house price example. We want to estimate the mean price of a 300-square-meter house in Germany. Our population is all houses in this category. However, we cannot gather all the data about all houses. Instead, we collect data for a few houses, i.e., our sample.

Then, we determine the Confidence Interval of our choice for the sample mean by

CI = x̄ ± z · s / √n

in which x̄ is the sample mean, z the number of standard deviations from the mean (i.e., indicating the confidence level, 1.96 for 95 % and 2.576 for 99 %), s the sampled standard deviation, and n the sample size.
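As a quick illustration of this formula in code (the sampled prices are made up):

```python
import numpy as np

prices = np.array([510_000, 480_000, 650_000, 720_000, 590_000,
                   430_000, 810_000, 560_000, 670_000, 620_000])  # sampled prices in €

x_bar = prices.mean()
s = prices.std(ddof=1)   # sample standard deviation
n = len(prices)
z = 1.96                 # 95 % confidence level

margin = z * s / np.sqrt(n)
print(f"95 % CI: ({x_bar - margin:,.0f} €, {x_bar + margin:,.0f} €)")
```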

We can repeat this process for different samples of the population.

Okay, but how do we interpret the Confidence Interval?

A confidence level of 95 % means that if we repeat the sampling process many times, 95% of the intervals would contain the true population parameter. The confidence level refers to the long-run performance of the interval generation process. The confidence level does not apply to a specific interval. It does not mean there is a 95% chance that the true value lies in the interval of a single sample. This is also known as the frequentist approach.

Drawing different samples from a normal distribution and determining the 90 % Confidence Interval for the mean. Some Confidence Intervals do not contain the population mean (red columns). (Image by the author)

It is a very subtle but important difference. The 95% confidence level applies to the process of interval generation, not a specific interval.

Let’s assume we have a 95% confidence interval of 400,000 € to 1,000,000 € for a 300-square-meter house in Germany.

We can expect that 95% of the samples we draw will contain the true mean value in their Confidence Interval. This statement emphasizes the long-run probability of capturing the true mean if you repeat the sampling and interval calculation process many times.

Yet, you often hear "We are 95% confident that the true population mean lies between 400,000 € and 1,000,000 €." This is technically incorrect and implies more certainty about a specific interval. But it gives us a general intuition as it is easier to interpret. The statement reflects that 95% of similarly calculated intervals would capture the true parameter.

What factors influence the width of the Confidence Interval?

Looking at the equation above, we can identify two factors: The population variance and the sample size.

The higher the population variance, the more our samples will vary. Hence, the sample standard deviation is larger, resulting in wider Confidence Intervals. This makes sense. Due to the higher variation, we can be less certain that the sampled parameter is close to the population parameter.

A larger sample size can balance the effect of a few outliers while the samples are more similar. Hence, we can be more certain and thus have a narrower Confidence Interval. This is also reflected in the above equation. With an increasing sample size, the denominator becomes larger resulting in a narrower interval. In contrast, a small sample size results in wider Confidence Intervals. Fewer draws provide less information and will vary more as we increase the likelihood of a sampling error.


Prediction Interval

A Prediction Interval quantifies the uncertainty of a future individual observation from specific values of independent variables and previous data. Hence, the Prediction Interval must account for the uncertainty of estimating the expected value and the random variation of individual values.

For example, we have a 95% Prediction Interval stating a price range of 400,000 € to 1,000,000 € for a 300-square-meter house in Germany. This means the price of any 300-square-meter house will fall in this range with a 95% chance.

What factors influence the width of the Prediction Interval?

Two factors influence the width of a Prediction Interval: the variance of the model’s estimation and the variance of the target.

Similarly to the Confidence Interval, the Prediction Interval must account for the variability in the model. The greater the variance of the estimation, the higher the uncertainty and the wider the interval.

Moreover, the Prediction Interval also depends on the variance of the target variable. The greater the variance of the target variable, the wider the Prediction Interval will be.

After we have covered the fundamentals, let’s move on to the differences.


Differences between Confidence Interval and Prediction Interval

Confidence Interval

  • shows the uncertainty of a population parameter, such as the mean or a regression coefficient ("We are 95% confident that the population mean falls within this range", although, as described above, this phrasing is technically not correct)
  • focuses on past or current events

Prediction Interval

  • shows the uncertainty of a specific value. ("We are 95% confident that the next observation will fall within this range.")
  • focuses on future events

To make things a bit clearer, let's take a simple linear regression problem:

y = beta_0 + beta_1 * x + epsilon, with E[y|x] = beta_0 + beta_1 * x

Here, y is the target value, E[y|x] the expected mean response, x the feature value, beta_0 the intercept coefficient, beta_1 the slope coefficient, and epsilon a noise term.

The Confidence Interval shows the sampling uncertainty associated with estimating the expected value E[y|x]. In contrast, the Prediction Interval shows the uncertainty in the whole range of y. Not only the expectation.

The difference between a Confidence Interval and a Prediction Interval. The Confidence Interval shows the uncertainty for the mean of y given x, i.e., the expectation E[y|x]. The Prediction Interval shows the uncertainty for an individual y given x. (Image by the author)

Let’s assume we have a linear regression model predicting house prices based on square meters. A 95% Confidence Interval for a 300-square-meter house might be (250,000 €, 270,000 €). A 95% Prediction Interval for the same house might be (220,000 €, 300,000 €).

We can see that the Confidence Interval is narrower than the Prediction Interval. This is natural. The Prediction Interval must account for the additional uncertainty of a single observation compared to the mean. The Prediction Interval shows the uncertainty of an individual 300-square-meter house’s price. In contrast, the Confidence Interval shows the uncertainty of the average price for a 300-square-meter house.
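
Both intervals can be read directly off a fitted linear regression. The sketch below uses statsmodels on synthetic house-price data; the numbers are invented and only the mechanics matter.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
sqm = rng.uniform(50, 400, size=200)
price = 2_000 * sqm + rng.normal(0, 60_000, size=200)  # synthetic prices

X = sm.add_constant(sqm)  # intercept + square meters
model = sm.OLS(price, X).fit()

X_new = np.array([[1.0, 300.0]])  # a 300-square-meter house
pred = model.get_prediction(X_new).summary_frame(alpha=0.05)

# Confidence Interval for the mean price E[y|x]
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])
# Prediction Interval for the price of a single house
print(pred[["obs_ci_lower", "obs_ci_upper"]])
```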

Hence, using a Confidence Interval to show the uncertainty of single, future observations might lead to a wrong sense of forecast accuracy.


Conclusion

In this article, I have shown you two basic but very important concepts that are used to quantify uncertainty. Although they are often used interchangeably, they should not be.

If you stayed until here, you now should…

  • know what a Confidence Interval and Prediction Interval is and what they measure
  • most importantly, know the difference between them and when to use which interval

If you want to dive deeper and know more about the underlying mathematics, check out this post. Otherwise, comment and/or see you in my next article.

The post Confidence Interval vs. Prediction Interval appeared first on Towards Data Science.

]]>
Increase Trust in Your Regression Model The Easy Way https://towardsdatascience.com/increase-trust-in-your-regression-model-the-easy-way-3349ee5f194c/ Wed, 13 Nov 2024 01:33:39 +0000 https://towardsdatascience.com/increase-trust-in-your-regression-model-the-easy-way-3349ee5f194c/ How to use Conformalized Quantile Regression

The post Increase Trust in Your Regression Model The Easy Way appeared first on Towards Data Science.

]]>
We must know how sure our model is about its predictions to make well-informed decisions. Hence, returning only a point prediction is not enough. It does not tell us anything about whether we can trust our model or not. If you want to know why, check out my article below.

Uncertainty Quantification and Why You Should Care

In the article, I use a classification problem as an example. However, many real-world problems are regression problems. For example, we want to know how certain a model is when predicting tomorrow’s temperature.

As the temperature is a continuous variable, we would like to know in which interval the true temperature will lie.

The wider the interval, the more uncertain the model. Hence, we should trust it less when making decisions.


But how do we get such a prediction interval?

Two approaches come to mind. Either we use a set of models that predict the interval or we turn a point prediction into a prediction interval.

Let’s start with the first approach also known as quantile regression.

We fit two models on the data, one low-quantile regressor and one high-quantile regressor. Each regressor estimates a conditional quantile of the target variable. Combining these two gives us our prediction interval.

The main advantage is that we can use any model architecture for quantile regression by using the pinball loss function. But the main disadvantage is that the prediction interval is not calibrated. There is no guarantee that the true value will lie in the interval with a predefined probability. As the interval is thus not reliable, we should not put too much trust in it, particularly for critical downstream decisions.

Let’s see if the second approach is better.

In previous articles, I described how Conformal Prediction turns point predictions into prediction sets and guarantees coverage for classification problems.

All You Need Is Conformal Prediction

Calibrating Classification Probabilities the Right Way

Luckily Conformal Prediction doesn’t stop there. Conformal Prediction is a framework that can be wrapped around any prediction model. Hence, we can apply Conformal Prediction and the same steps as we do for classification problems. The only difference is the non-conformity score. Hence, if you have read my other articles you should be familiar with the process.

Process turning a point prediction into a prediction interval using Conformal Prediction (Image by the author).

First, we choose a significance level alpha and a non-conformity score. As the non-conformity score, we use the absolute forecast error, i.e., |y_true – y_pred|. Second, we split the dataset into a training, calibration, and test subset. Third, we train the model on the training subset. Fourth, we calibrate the model on the calibration subset. For this, we calculate the non-conformity score, i.e., the absolute prediction error, for every calibration sample. Based on the distribution of these scores, we determine the threshold that covers 1 – alpha of the values. To form the prediction interval for unseen data, we add and subtract this threshold from the predicted value.
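
Here is a minimal sketch of these steps with scikit-learn. The data, model, and alpha are placeholders; the (n + 1)/n factor in the quantile is the usual finite-sample correction used in split conformal prediction.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

alpha = 0.1  # target coverage of 1 - alpha = 90 %

X, y = make_regression(n_samples=2000, n_features=10, noise=20.0, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Non-conformity scores on the calibration set: absolute prediction errors
scores = np.abs(y_cal - model.predict(X_cal))

# Threshold that covers 1 - alpha of the calibration scores
n = len(scores)
q_hat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Prediction interval for unseen data: point prediction +/- threshold
y_pred = model.predict(X_test)
lower, upper = y_pred - q_hat, y_pred + q_hat
print(f"Empirical coverage: {np.mean((y_test >= lower) & (y_test <= upper)):.2f}")
```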

That’s it. We have turned a point prediction into a calibrated prediction interval.

Although the approach is straightforward, it has one big disadvantage. The prediction interval is not adaptive. It always has the same width and does not adapt to different regions of the feature space. Hence, it does not tell us which data points are harder to predict.


So, what now?

We have two approaches. One is adaptive but not calibrated (quantile regression). The other is calibrated but not adaptive (Conformal Prediction). Can we combine them to receive adaptive prediction intervals with guaranteed coverage?

This is exactly what Conformalized Quantile Regression does. The approach was first published in 2019.

How does Conformalized Quantile Regression work?

It is quite easy. We wrap Conformal Prediction around a quantile regression, adjusting the interval. With this, we calibrate (or conformalize) the prediction interval of the quantile regression. To calibrate the quantile regression model, we determine a factor by which we extend or shrink the interval.

For this, we apply the same steps as earlier. Again, the only difference is the non-conformity score we choose. We now deal with an interval instead of a point prediction. Hence, we define the non-conformity score as the distance of the true value to its nearest predicted quantile, i.e., max(lb – y, y – ub), where lb and ub are the predicted lower and upper quantiles.

If the true value lies between the predicted quantiles, the non-conformity score is negative. If the true value falls outside the predicted interval, the non-conformity score is positive.

Conformalized Quantile Regression implementation (Image by the author).

We then build the distribution of the non-conformity scores and determine the threshold that covers 1 – alpha of the values. If the threshold is positive, we grow the predicted interval; if it is negative, we shrink it. We do this by adding the value to the upper quantile and subtracting it from the lower quantile.
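
The following sketch puts these steps together, using scikit-learn's quantile gradient boosting as the two quantile regressors. Any model trained with the pinball loss would work, and the data here is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

alpha = 0.1
X, y = make_regression(n_samples=2000, n_features=10, noise=20.0, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Two quantile regressors (placeholders for any model trained with the pinball loss)
lower = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2, random_state=0).fit(X_train, y_train)
upper = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2, random_state=0).fit(X_train, y_train)

# Non-conformity score: distance of the true value to its nearest predicted quantile
scores = np.maximum(lower.predict(X_cal) - y_cal, y_cal - upper.predict(X_cal))
n = len(scores)
q_hat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Grow (or shrink, if q_hat is negative) the quantile interval by q_hat
lb = lower.predict(X_test) - q_hat
ub = upper.predict(X_test) + q_hat
print(f"Empirical coverage: {np.mean((y_test >= lb) & (y_test <= ub)):.2f}")
```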

That’s how easy it is. We now have an adaptive prediction interval that guarantees coverage for regression problems.


Conclusion

In this article, I have shown you an approach to quantifying the uncertainty in regression problems.

If you stayed until here, you now should…

  • have an understanding of how to use Conformal Prediction for regression problems and
  • be able to use Conformalized Quantile Regression in practice.

If you want to dive deeper into Conformalized Quantile Regression, check out the paper. Otherwise, comment and/or see you in my next article.

Obviously, there are many more Conformal Prediction approaches for regression tasks and time series forecasting in particular, such as EnbPI or Adaptive Conformal Inference (ACI). So, stay tuned for my next articles.

The post Increase Trust in Your Regression Model The Easy Way appeared first on Towards Data Science.

]]>
Calibrating Classification Probabilities the Right Way https://towardsdatascience.com/calibrating-classification-probabilities-the-right-way-da935caee18d/ Wed, 18 Sep 2024 04:58:19 +0000 https://towardsdatascience.com/calibrating-classification-probabilities-the-right-way-da935caee18d/ Or why you should not trust predict_proba methods

The post Calibrating Classification Probabilities the Right Way appeared first on Towards Data Science.

]]>
Or why you should not trust predict_proba methods
Venn-Abers predictors and its output for a Binary classifier (Image by the author).

In previous articles, I pointed out the importance of knowing how sure a model is about its predictions.

Uncertainty Quantification and Why You Should Care

For Classification problems, it is not helpful to only know the final class. We need more information to make well-informed decisions in downstream processes. A classification model that only outputs the final class conceals important information. We do not know how sure the model is and how much we can trust its prediction.

How can we achieve more trust in the model?

Two approaches can give us more insight into classification problems.

We could turn our point prediction into a prediction set. The goal of the prediction set is to guarantee that it contains the true class with a given probability. The size of the prediction set then tells us how sure our model is about its prediction. The fewer classes the prediction set contains, the surer the model is.

But that does not tell us anything about the probability of a specific class being the true class. For example, we would like something such as "This picture shows a cat with 20% probability". These class probabilities give us more insight than a prediction set. We can use the probabilities to get a better view of the benefits and costs of the prediction.

Now, you could say, "Lucky us. Most ML models have a predict_proba method that gives us exactly what we want." But that is not true. The predict_proba method returns misleading values, at least if we care about the actual probabilities. We should not trust them because these "probabilities" are not calibrated. Calibrated means that the predicted probabilities reflect the real underlying probabilities. For example, we predict that a picture shows a cat with an 80% probability. If our model predicts a cat with 80% probability for ten pictures, we expect roughly eight of them to actually show a cat.

But if predict_proba is not the correct method, what is it then?

Let me introduce you to Venn-ABERS predictors.

Venn-ABERS predictors are part of the Conformal Prediction family and thus can operate on top of any ML classifier. We can use them to turn the output of a scoring classifier into a calibrated probability.


How does it work?

Venn-ABERS predictors return an interval of the class probability instead of a single probability. For example, instead of saying "This picture shows a cat with 80% probability", we would say "This picture shows a cat with a probability between 75 and 85%."

For this, Venn-ABERS predictors fit two isotonic regressions, g_0 and g_1, to a calibration set. We fit one isotonic regression, g_0, on the subset of the data where the actual outcome is class 0, and the other, g_1, on the subset where the actual outcome is class 1.

By using isotonic regression, we do not assume any function of probabilities. This gives us more freedom and a higher quality of the calibrated probabilities. For example, Platt scaling assumes a sigmoid function.

If you are not familiar with calibration sets, please check out my introduction article on Conformal Prediction.

Each regression maps a predicted classification score to a probability, giving us p_0 and p_1. Both are estimates of how likely it is that the sample belongs to class 1: p_0 is derived from g_0 and p_1 from g_1. Together, p_0 and p_1 form an interval that is guaranteed to contain the correct class probability.

Venn-ABERS predictors inherit the validity guarantee of Venn predictors. However, they are computationally more efficient and easier to use, as we only need to fit one underlying model instead of many to calibrate the class probabilities. Despite fitting only one model, Venn-ABERS predictors remain accurate and reliable.


But why does Venn-ABERS fit two isotonic regressions?

Using two regressions separates the calibration process for positive and negative outcomes. Because each regression focuses on only one class, it can fit the data more accurately. Compared to a single isotonic regression, we reduce the chance of overfitting.

Fitting a regression to the positive and negative class ensures that we cover the true probability. No matter if the sample belongs to class 0 or class 1. The width of the interval shows us how sure the model is for an individual sample. The larger the difference between p_0 and p_1, the higher our uncertainty.

Usually, the interval is smaller for large data sets and larger for small or more challenging data sets. This is because the quality of p_0 and p_1 improves the more samples we have in our calibration set. With more samples, we can create more granular groups and derive a better representation of the true probability. However, there is a trade-off between the number of groups and the samples in each group. The fewer samples in one group, the less accurate our probability estimate will be because we are more prone to outliers in that group.

However, we might want a single class probability instead of a range, for example, to compare it with other approaches. We can transform p_0 and p_1 into a single probability via p = p_1 / (1 – p_0 + p_1), which minimizes the log loss.

But why do we get calibrated probabilities by fitting two isotonic regressions?

Understanding how we fit an isotonic regression line to our data is essential to understanding Venn-ABERS predictors.


How does isotonic regression work?

In general, isotonic regression fits a monotonically non-decreasing function. The fitted line can stay at the same level or go up, but never down. This behavior is fundamental.

As the function is non-decreasing it respects our original sorting of the predicted class scores. This ensures that our predicted scores and the true probabilities match up correctly.

The line maps any predicted score to the true probability. Here, the probability is the mean of the true labels for that given predicted score in our calibration set.

For this, we must understand how we fit the line to the data.

Let's assume we have a trained binary classifier and a calibration set. Our classifier did not see the calibration set during training. We collect the scores the classifier outputs for each sample in the calibration set. We then sort these scores from lowest to highest. If different samples have the same score, we group them. For each group, we determine how often class 1 occurred. For this, we divide the number of samples with class 1 by the total number of samples with the same predicted score.

Predicted score of a binary classifier compared to the probability of the sample being the true class (Image by the author).

If our trained classifier is good then our predicted scores should align well with the probabilities of the true labels.

But there will likely be cases in which the observed probability of the true class goes down even though the predicted score increases. In this case, we merge the samples of the two adjacent scores into one group and determine the true probability of that group. In other words, we take the average of the true probabilities of both predicted scores.

First step in fitting an isotonic regression line on the data points. Every time a data point with a higher predicted score has a lower probability the two groups are merged. The probability for the new group is determined by the average of the two groups. (Image by the author).

We continue doing this until the fitted line only stays at the same level or goes up. Essentially we smooth out the predictions to ensure they align better with the actual outcomes.

Merging groups continues until the fitted line only stays the same or increases (Image by the author).

The isotonic regression assigns a probability to each group of scores, reflecting the average of the true labels in that group. This ensures that if we map two predicted scores to the true probabilities, a higher score will lead to the same or a higher true probability. But never a lower one. Hence, the isotonic regression respects the sorting of the probabilities and scores.
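
scikit-learn's IsotonicRegression implements exactly this pooling. A tiny sketch with made-up scores and labels:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Predicted scores of a binary classifier on a calibration set and the true labels
scores = np.array([0.10, 0.20, 0.30, 0.35, 0.50, 0.55, 0.70, 0.80, 0.90, 0.95])
labels = np.array([0,    0,    1,    0,    0,    1,    1,    1,    0,    1])

# Isotonic regression fits a non-decreasing step function from scores to label frequencies
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores, labels)

# Map new predicted scores to (monotone) calibrated probabilities
print(iso.predict(np.array([0.25, 0.6, 0.85])))
```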

Now that we understand the high-level theory behind Venn-ABERS predictors, let’s use it.


How can we use it?

Luckily, we do not need to code the logic ourselves. Instead, we can get everything from the [venn-abers](https://github.com/ip200/venn-abers) library. The library contains an implementation of Venn-ABERS for binary and multiclass classification problems.

The library provides us with a VennAbersCalibrator class. The class contains all the needed methods to calibrate our class probabilities. However, we can use the class in two ways. We can wrap it around a scikit-learn algorithm or treat our classification algorithm and the Venn-Abers calibrator separately.

To show you both approaches, I will extend the example given in the library.

Let’s begin with the first approach: Wrapping the Venn-ABERS predictor around a scikit algorithm.
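
A rough sketch of what this looks like is shown below. The parameter names (estimator, inductive, cal_size) follow the description that comes next and the library's README, but they may differ between versions of the venn-abers package, so treat them as assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from venn_abers import VennAbersCalibrator  # assumed import path, see the library's docs

# Toy data set and train/test split
X, y = make_classification(n_samples=1000, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Inductive Venn-ABERS (IVAP): hold out 20 % of the training data as a calibration set
clf = RandomForestClassifier(random_state=0)
va = VennAbersCalibrator(estimator=clf, inductive=True, cal_size=0.2, random_state=0)
va.fit(X_train, y_train)

p_calibrated = va.predict_proba(X_test)  # calibrated class probabilities
print(p_calibrated[:5])
```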

Let’s go through it in a bit more detail. First, we create a toy data set and split that into a training and test set. Then, we define the Venn-ABERS predictor. We pass in a scikit algorithm and choose the type of Venn-ABERS we want to use: inductive Venn-ABERS (IVAP) or cross Venn-ABERS (CVAP). The difference between IVAP and CVAP lies in how they run the calibration. IVAP uses a calibration set and fits the Venn-ABERS predictor once. CVAP, in contrast, uses cross-validation, fitting the predictor multiple times. The results of the validation sets are combined to create the final predictor.

In the example, I chose the IVAP with a calibration set of 20 % of the training set. To use CVAP, we can set the inductive parameter to False and add the n_splits parameter, which defines the number of splits for the cross-validation.

After defining the Venn-ABERS predictor, we can fit the predictor with the training set. Then, we derive the calibrated class probabilities on the test set using the predict_proba method.

The second approach is very similar. Instead of passing a scikit classifier when defining the Venn-ABERS predictor, we can handle both separately. For this, we create a calibration set and fit the classifier separately from the Venn-ABERS predictor.

Then, we use the predict_proba method from the VennAbersCalibrator class to calibrate the class probabilities. For this, we pass in the predicted scores of the classifier on the calibration set p_cal, the true values of the calibration set y_cal, and the predicted scores on the test set p_test.

If we are interested in the probability range of p_0 and p_1, we can set the p0_p1_output parameter to True.
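
And a sketch of the second approach; again, the keyword names (p_cal, y_cal, p_test, p0_p1_output) follow the description above and may differ in the library version you install.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from venn_abers import VennAbersCalibrator  # assumed import path, see the library's docs

X, y = make_classification(n_samples=1000, n_classes=2, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit the classifier separately and collect its scores
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
p_cal = clf.predict_proba(X_cal)    # scores on the calibration set
p_test = clf.predict_proba(X_test)  # scores on the test set

va = VennAbersCalibrator()
# Calibrated probabilities plus the [p_0, p_1] interval
p_prime, p0_p1 = va.predict_proba(p_cal=p_cal, y_cal=y_cal, p_test=p_test,
                                  p0_p1_output=True)
print(p_prime[:3])
print(p0_p1[:3])
```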

Conclusion

In this article, I have shown you another approach to quantifying the uncertainty in classification problems. Instead of using prediction sets, we can calibrate the class probabilities.

If you stayed until here, you now should

  • have an understanding of the high-level theory behind Venn-ABERS predictors and
  • be able to use Venn-ABERS predictors in practice.

If you want to dive deeper into Venn-ABERS predictors, check out the Venn-ABERS paper. Otherwise, comment and/or see you in my next article.

The post Calibrating Classification Probabilities the Right Way appeared first on Towards Data Science.

]]>
Why You (Currently) Do Not Need Deep Learning for Time Series Forecasting https://towardsdatascience.com/why-you-currently-do-not-need-deep-learning-for-time-series-forecasting-0de57f2bc0ed/ Thu, 20 Jun 2024 17:10:08 +0000 https://towardsdatascience.com/why-you-currently-do-not-need-deep-learning-for-time-series-forecasting-0de57f2bc0ed/ What you need instead: Learnings from the Makridakis M5 competitions and the 2023 Kaggle AI report

The post Why You (Currently) Do Not Need Deep Learning for Time Series Forecasting appeared first on Towards Data Science.

]]>
Deep Learning for Time Series Forecasting receives a lot of attention. Many articles and scientific papers write about the latest Deep Learning model and how it is much better than any ML or statistical model. This gives the impression that Deep Learning will solve all our problems for time series forecasting. In particular for new people in the field.

But from my experience, Deep Learning is not what you need. Other things are more important and work better for time series forecasting.

Hence, in this article, I want to show you what works. I will show you things that have proven themselves in many ways. I will use the findings of the Makridakis M5 competitions and the Kaggle AI Report 2023 and compare them against my experience.


The Makridakis competitions compare forecasting methods on real-world data sets. They show what works in practice. Since the start of the competitions almost 40 years ago, their findings have changed. In the first three competitions, M1 to M3, statistical models dominated the field. In the M4 competition, ML models began to show their potential in the form of hybrid approaches, which combine ML models and statistical methods.

In the M5 competition, participants had to predict the hierarchical unit sales of Walmart. The competition was split into an Accuracy and Uncertainty competition. The goal of the Accuracy competition was to find the best point forecast for each time series. The Uncertainty competition focused on probabilistic forecasts. For this, participants had to predict nine different quantiles describing the complete distribution of future sales.

The Kaggle AI Report, in contrast, contains a collection of essays written by the Kaggle community as part of a Kaggle competition. Usually, they collect key learnings and trends from recent high-performing solutions.

As a side note, Kaggle hosted the M5 competition. Thus, the competition attracted many people who participated in other Kaggle competitions. This in turn might be the reason why the results of the Kaggle AI Report and the M5 competitions are very similar.


But let’s see what the findings are.

ML models show superior performance

Over the past years, ML approaches have taken over the field of Time Series Forecasting. In the M4 competition in 2018, they started to become important as part of hybrid approaches. But two years later, in the M5 competitions, ML dominated the field. All top-performing solutions were pure ML approaches, beating all statistical benchmarks.

Particularly Gradient Boosting Machines (GBMs) dominate both the M5 and Kaggle competitions. The most successful ones are LightGBM, XGBoost, and CatBoost. They handle many features effectively, need little to no data pre-processing and transformation, train fast, and can quantify feature importance. This makes them some of the best off-the-shelf models of recent years.

They are very convenient for experimenting and allow a fast iteration, which is important in identifying the best features. Moreover, we only need to optimize a few hyperparameters. Often the default settings already result in good performance. Hence, they have become the go-to algorithm for many time series problems.

Moreover, these approaches beat deep learning models in performance and training time. Hence, Deep Learning models are less popular.

GBMs are my first choice as well due to the advantages stated above. Over time, I moved from XGBoost over CatBoost to LightGBM. In many problems, LightGBM had better performance and shorter training time. Also, LightGBM handled smaller data sets better. For example, for one problem I trained a CatBoost model and a LightGBM model using the same features and loss function. The CatBoost model took roughly 50% more time to train and resulted in an MAE about 2% higher than the LightGBM model's.

Statistical methods are still valuable

Although ML methods often outperform statistical models, we should not forget about them. 92.5% of the teams in the M5 Accuracy competition could not beat the simple benchmark, an off-the-shelf forecasting method. Using an ML model does not guarantee the best performance. Moreover, ML models take more time to develop.

Thus, we should always start with a simple model before starting on any ML model. We can use them as baseline models to support our decision-making. For example, does the ML model add enough value to balance the added complexity?

If you want to read more about baseline models and why you should start with them, check out my article.

Why You Should Always Start With a Baseline Model

I usually start with the simplest possible model as my baseline. From my experience, a simple statistical baseline model is often hard to beat. Moreover, it only takes a small fraction of the time needed compared to developing an ML model.

Ensembles improve performance

The M5 and Kaggle competitions show that combining different models often results in better forecast accuracy. Ensembling is particularly successful when the individual models make uncorrelated errors. For example, some teams in the M5 competitions combined models trained on subsets of the training data and with different loss functions.

We can build ensembles through simple averaging (blending), complex blending, or stacking approaches. However, often an equal weighting is enough.

Although ensembling improves model performance, it has drawbacks in real-world applications. Ensembles add complexity, which reduces explainability and makes them harder to maintain in a production environment. Hence, very complex ensembles are often not used. A simpler alternative usually provides good enough results while being easier to maintain.

I have seen the benefit of using ensembles in real-world applications. For me, two approaches usually worked best. One approach is to combine models of the same type, like LightGBM, trained on different loss functions. The other approach is to use models trained with different features. I usually go with these two approaches as they are easy to implement and fast to test. I only need to change the loss function or the model input but can leave the rest of my pipeline exactly the same.

Moreover, I usually focus on simple ensembles to keep the complexity as small as possible. Usually, I use a simple averaging. However, the benefit of the ensemble must be large enough to make it into production. And usually, the ensemble does not improve the performance enough.

Scientific literature had a small effect on applied time series forecasting

We see a gap between the scientific literature and applied ML forecasting for time series. The scientific literature mostly focuses on Deep Learning models. These, however, are not used in practice.

But how can it be that there is such a large gap? Papers show how their models beat ML and statistical models. Why is nobody using them then? Because Deep Learning approaches are often not successful in real-world applications. Moreover, they can be very costly to train.

But why is the scientific literature focused so much on Deep Learning if it is not practical?

Well, I can only guess, but Deep Learning catches more attention than other ML models. I have seen this with both my articles about N-BEATS and N-HiTS. Everybody wants to work on Deep Learning since the great success of LLMs for NLP.

N-BEATS – The First Interpretable Deep Learning Model That Worked for Time Series Forecasting

N-HiTS – Making Deep Learning for Time Series Forecasting More Efficient

Deep Learning approaches like Transformers work well for NLP but not for time series. Although both tasks might look similar as both are sequences of values, they are not. In NLP the context matters, while for time series the order matters. This is a small difference that has a big impact on the approaches that work well.

Hence, you might ask yourself, if it is worth training a Deep Learning model, which probably will perform worse than an ML model. I have tried some approaches like N-BEATS and N-HiTS. But these models were always performing worse and took a lot longer to train than my ML models. Hence, I have never successfully applied them in a real-world forecasting task.


Now I have talked a lot about what models work and which do not. But focusing only on the models is not enough. Other things are even more important than choosing the right model. The Kaggle AI Report and the M5 competition conclude that good feature engineering is more important than the model. It is probably the most important aspect when developing a new model.

Feature engineering is more important than models

Real-world data is messy and needs a lot of cleaning to be of use. Hence, we need a lot of time to clean and understand the data to find good features. This is where I spent most of my time when developing a model.

In Kaggle competitions, feature engineering has repeatedly proven to be crucial. Often, the quality of the features makes the difference between solutions. The teams that extracted the most information from the data and created the greatest separation between features often had the better-performing models. Hence, spending time creating, selecting, and transforming features is crucial.

But what is a good approach to feature engineering?

From my experience, there is no single best approach. It is rather a trial-and-error process of trying different features and feature combinations. What has helped me find good features is being creative and flexible and using a data-driven approach with a good cross-validation strategy.

Sometimes it helps to take a broad approach, generating many features. For example, if I have many exogenous variables and external data sources available, I usually start by using them without any transformation. Here, the choice of variables depends on the results of my exploratory data analysis and what I find there.

Sometimes, it is better to concentrate on a single feature and expand it in several ways, for example, when only a limited amount of data is available. How I expand a single feature depends on the problem I want to solve. Sometimes using window features such as the mean, standard deviation, or min and max results in a performance boost. Sometimes it is enough to use different lag features.
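
As a small, generic sketch of such lag and window features with pandas (the column names and window lengths are placeholders):

```python
import numpy as np
import pandas as pd

# Hourly toy series; 'y' is the target we want to forecast
df = pd.DataFrame(
    {"y": np.random.default_rng(0).normal(size=500)},
    index=pd.date_range("2024-01-01", periods=500, freq="h"),
)

# Lag features: previous hour, previous day, previous week
for lag in [1, 24, 168]:
    df[f"lag_{lag}"] = df["y"].shift(lag)

# Window features over the previous 24 hours (shifted so no future data leaks in)
window = df["y"].shift(1).rolling(24)
df["roll_mean_24"] = window.mean()
df["roll_std_24"] = window.std()
df["roll_min_24"] = window.min()
df["roll_max_24"] = window.max()

df = df.dropna()  # drop rows without a full history
print(df.head())
```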

The M5 and Kaggle competitions show that domain knowledge is not needed to find good features. However, domain knowledge has often helped me to understand the data faster and derive better features.

Exogenous/explanatory variables can boost performance

Using external data is critical for improving forecasting performance. It can give the model's performance a strong boost. In the M5 competitions, models that used external data performed better than models that relied only on historical data. Hence, finding these explanatory variables is crucial in real-world applications.

I try to identify as many external factors as possible. Then I test these during my feature engineering process. These can be simple date-time-related features, such as holidays, or data from other sources. My choice depends on the data availability and accessibility during inference.

In practice, I saw the greatest boost of performance happen when I added exogenous variables. Because in most real-world applications, the behavior of the time series you want to forecast depends on external factors. For example, if you want to forecast electricity prices, using the electricity consumption and generation as features gives a great performance boost. Why? Because electricity prices usually increase when electricity consumption increases or the electricity generation decreases.

Iterate as fast as possible

As we usually try many features in our model development process, we must be as fast as possible. We do not want to wait long to see if adding a new feature is helpful. For example, the winning solution in the M5 accuracy competition tested 220 models.

Hence, Kagglers often use LightGBM as their go-to model as the model trains very fast. Thus, they can run many experiments in a short period.

Again, I agree with the findings. I usually get better results the more things I can test. More features, different feature combinations and different loss functions. Being fast helps me to test many hypotheses in a shorter time.


However, to decide which features improve performance, we must repeatedly evaluate how good our model is. We need an approach that we can trust and that helps us identify the best features and models.

Effective cross-validation strategies are crucial

An effective cross-validation strategy is critical to choose the best model objectively. Building a local validation process using cross-validation helps us…

  • understand if the model performance is reliably improved
  • uncover areas in which the model is making mistakes or unreliable predictions
  • guide our feature engineering
  • simulate post-sample accuracy
  • avoid overfitting on test data
  • mitigate uncertainty
  • tune the hyperparameters of the model

The M5 and Kaggle competitions show this importance. Many teams in the M5 competition failed to choose the best model for their final submission. In Kaggle competitions, the best models in the public leaderboard are often not the competition winners.

However, choosing a good cross-validation strategy can be difficult, as there are many options.

  • What period do we select?
  • What is the size of the validation windows?
  • How do we update the windows?
  • What criteria/metrics do we use to summarize the forecasting performance?

Hence, the strategy should always be problem-dependent.
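
A common starting point for answering these questions is a rolling- or expanding-origin split. The sketch below uses scikit-learn's TimeSeriesSplit; the number of folds and the validation window size are placeholders that must be adapted to the problem.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for a time series and its features
X = np.arange(1000).reshape(-1, 1)

# Expanding-window split: 5 folds, each validating on the next 168 observations
# (e.g. one week of hourly data)
tscv = TimeSeriesSplit(n_splits=5, test_size=168)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold}: train up to index {train_idx[-1]}, "
          f"validate on {val_idx[0]} to {val_idx[-1]}")
```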

Every problem needs a unique approach

The results of the Kaggle and M5 competitions show that we need a unique approach for every data set and problem. There is no off-the-shelf solution.

To decide which approach works, we need our knowledge and experience.

We must adjust the models based on the intricacies of the forecasting task. This includes all the points I have discussed above. From the feature engineering to the cross-validation to the choice of model.

For example, in the M5 accuracy competition, combining the best models for each time series beat the winning solution by roughly two percent.

This is why deep learning models currently do not work in real-world applications. They try to be a one-size-fits-all solution to a problem that needs a custom solution. Deep learning models promise to make our lives easier by taking away the need for detailed feature engineering. For example, N-BEATS and N-HiTS promise to automatically find seasonality and trend components in the time series. However, they do not find the unique intricacies of a time series. In contrast, with our knowledge we can find these intricacies and encode them in features that ML models can use and thus beat Deep Learning models.


Conclusion

The M5 competitions and the 2023 Kaggle AI report agree on what works and is important for time series forecasting. I can only support their conclusion from my experience working on applied time series forecasting tasks.

The most important factors are:

  • ML models show superior performance
  • Statistical methods are still valuable
  • Ensembles improve performance
  • Scientific literature had a small effect on applied time series forecasting
  • Feature engineering is more important than models
  • Exogenous/explanatory variables can boost performance
  • Iterate as fast as possible
  • Effective cross-validation strategies are crucial
  • Every problem needs a unique approach

As you can see Deep Learning is not part of the list.

Deep Learning currently is not good enough. These models have not successfully moved from the literature to real-world problems. However, this might change in the future. We saw a shift from statistical models, which dominated time series forecasting for a long time, to ML models. A similar shift toward Deep Learning models might happen at some point. Hence, it is good to stay up-to-date on the development of Deep Learning. But currently, these models are not useful in real-world applications.

Hence, there is still a tremendous potential for further research and improvement, not only on the Deep Learning side but also on the ML side.

I hope you found a lot of useful insights in this article and that they will help you improve the performance of your next time series forecasting models.

See you in my next article and/or leave a comment.

The post Why You (Currently) Do Not Need Deep Learning for Time Series Forecasting appeared first on Towards Data Science.

]]>
N-HiTS – Making Deep Learning for Time Series Forecasting More Efficient https://towardsdatascience.com/n-hits-making-deep-learning-for-time-series-forecasting-more-efficient-d00956fc3e93/ Thu, 30 May 2024 18:46:33 +0000 https://towardsdatascience.com/n-hits-making-deep-learning-for-time-series-forecasting-more-efficient-d00956fc3e93/ A deep dive into how N-HiTS works and how you can use it

The post N-HiTS – Making Deep Learning for Time Series Forecasting More Efficient appeared first on Towards Data Science.

]]>
Architecture of N-HiTS (Image taken from Challu and Olivares et al.).

In 2020, N-BEATS was the first deep-learning model to outperform statistical and hybrid models in time series forecasting.

Two years later, in 2022, a new model threw N-BEATS off its throne. Challu and Olivares et al. published the Deep Learning model N-HiTS. They addressed two shortcomings of N-BEATS for longer forecast horizons:

  • decreasing accuracy and
  • increasing computation.

N-HiTS stands for Neural Hierarchical Interpolation for Time Series Forecasting.

The model builds on N-BEATS and its idea of neural basis expansion. The neural basis expansion takes place in several blocks across layered stacks.

In this article, I will go through the architecture behind N-HiTS, particularly the differences to N-BEATS. But do not be afraid, the deep dive will be easy-to-understand. However, it is not enough to only understand how N-HiTS works. Thus, I will show you how we can easily implement a N-HiTS model in Python and also tune its hyperparameters.


If the core idea is the same, what is the difference between N-BEATS and N-HiTS?

The difference lies in how each model treats the input and output of each stack. The main idea of N-HiTS is to combine forecasts of different time scales.

For this, N-HiTS applies

  • a multi-rate data sampling of the input and
  • a hierarchical interpolation of the output.

With this N-HiTS achieves better accuracy for longer horizons and lower computational cost.

The multi-rate data sampling forces stacks to specialize in short-term or long-term effects. Hence, it becomes easier for these stacks to learn the respective components. The focus on long-term behavior results in improved long-horizon forecasting compared to N-BEATS.

The hierarchical interpolation allows each block to forecast on a different time scale. The model then interpolates the forecast to match the time scale of each block to the final prediction. The resampling and interpolation reduce the number of learnable parameters. This results in a lighter model with shorter training times.

Now that we know what N-HiTS does differently, let’s see how the architecture includes these changes.


How does N-HiTS work in detail?

The N-HiTS model has the following architecture:

Architecture of N-HiTS (Image taken from Challu and Olivares et al.).

As we can see, there are many similarities compared to N-BEATS.

First, N-HiTS splits the time series into a lookback and a forecast period. Second, the model consists of multi-layered stacks and blocks, generating a backcast and a forecast. In each block, a multi-layer perceptron produces basis expansion coefficients for the backcast and forecast. The backcast shows what part of the time series the block captured. Before we pass the time series into a block, we remove the backcast of the previous block. With this, each block learns a different pattern as we only pass residuals from block to block. The model generates the final prediction as the sum of all blocks' forecasts.

As for the similarities, I will keep it at this level of detail. For more information, I point you to my N-BEATS article.

N-BEATS – The First Interpretable Deep Learning Model That Worked for Time Series Forecasting

But let’s dive deeper into the differences: multi-rate data sampling and hierarchical interpolation.

Multi-rate signal sampling of the input

N-HiTS does the multi-rate sampling at block level through a MaxPool layer.

The MaxPool layer smooths the input by taking the largest value within a chosen kernel size. Hence, the kernel size determines the rate of sampling. The larger the kernel size, the more aggressive the smoothing will be.

The larger the kernel size of the MaxPool layer, the stronger the smoothing of the input signal. Hence, a large kernel size emphasizes long-term effects. (Image by the author)

We define the kernel size of the MaxPool layer at the stack level. Hence, each block within the same stack has the same kernel size.

For resampling, N-HiTS uses a top-down approach. The first stacks focus on long-term effects through a larger kernel size. Later stacks focus on short-term effects through a smaller kernel size.
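
To get a feeling for this, here is a tiny PyTorch sketch showing how MaxPool layers with larger kernels smooth and subsample a signal more aggressively. The kernel sizes are illustrative, and the exact pooling setup inside N-HiTS may differ.

```python
import torch

# A noisy input signal: batch of 1, 1 channel, 48 time steps
t = torch.linspace(0, 4 * torch.pi, 48)
signal = (torch.sin(t) + 0.3 * torch.randn(48)).reshape(1, 1, -1)

# Larger kernel -> stronger smoothing and fewer remaining values
for kernel_size in [2, 4, 8]:
    pool = torch.nn.MaxPool1d(kernel_size=kernel_size, stride=kernel_size, ceil_mode=True)
    pooled = pool(signal)
    print(kernel_size, pooled.shape[-1])  # 24, 12, and 6 values remain
```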

Hierarchical Interpolation of the output

N-HiTS uses hierarchical interpolation to reduce the number of predictions of each stack, i.e., the cardinality. The smaller cardinality results in reduced computing requirements for long-horizon forecasts.

What does this mean?

Assume we want to predict the next 24 hours of a time series. We expect our model to output 24 predictions (one for each hour). If we want to predict the next two weeks of hourly data, we need 336 predictions (14 * 24). That makes sense, right?

But this is where it becomes problematic. Let’s take the N-BEATS model. The final forecast is a combination of the partial forecasts of each stack. Hence, each stack must predict 336 values, which is computationally expensive. N-BEATS is not the only model suffering the same problem for longer forecast horizons. Other deep learning approaches, such as Transformers or Recurrent Neural Networks, face the same problem.

N-HiTS overcomes this challenge by letting each stack make predictions at different time scales. N-HiTS then matches the time scales of each stack to the final output using interpolation.

For this, N-HiTS uses the concept of expressiveness ratio. The ratio determines the number of predictions in the forecast horizon. A small expressiveness ratio means that the stack makes fewer predictions. Hence, the stack has a small cardinality. For example, we choose an expressiveness ratio of 1/2. This results in the stack predicting every second value we want in our final forecast.

The expressiveness ratio relates the output to the resampling of the input. Combined with the resampling of the input, each stack thus works at a different frequency. Hence, each stack can specialize in treating the time series at different rates.

The authors of N-HiTS suggest that stacks close to the input should focus on the long-term effects. Hence, these stacks should have smaller expressiveness ratios. For example, we could have three stacks. The first stack specializes in weekly behavior, the second in daily behavior, and the third in hourly behavior.

Each stack makes predictions at a different time scale. Thus, each stack has a different cardinality. Here, stack 1 has a cardinality of 3, stack 2 a cardinality of 5, and stack 3 a cardinality of 10. The final forecast is then the sum of each stack's forecast. In comparison, in the N-BEATS model, each stack would have the same cardinality, i.e., a cardinality of 10. (Image by the author)

But what is a reasonable choice for the expressiveness ratio?

It depends on the time series. The authors recommend two options.

  • use exponentially increasing expressiveness ratios between stacks to reduce the number of parameters while handling a wide range of frequencies
  • use known cycles of the time series, such as daily, weekly, etc.

Forecasting example using N-HiTS

Now that we know how N-HiTS works, let’s apply the model to a forecasting task.

As in my N-BEATS article, we will predict the next two weeks of wholesale electricity prices in Germany. We take the "European Wholesale Electricity Price" data, which Ember provides with a CC-BY-4.0 license. We will use the N-HiTS implementation from Nixtla’s neuralforecast library.

Electricity wholesale prices in Germany (Image by the author, data by Ember).

Without doing a detailed data exploration, we can see two seasonal components:

  • daily: Prices are higher in the morning and evening hours as the electricity consumption is usually higher during these hours.
  • weekly: Prices are higher on weekdays than on weekends as electricity consumption is usually higher during weekdays.

As I used the same dataset in my N-BEATS article we can re-use all code for the data preparation, the train-test split, the plotting of results and the baseline model. Hence, I am not going to show those code snippets here.

Before we jump into the code, please note that I am not trying to get a forecast as accurate as possible but rather show how we can apply N-HiTS.

Baseline Model

But let’s start with a simple model as our baseline.

Why You Should Always Start With a Baseline Model

I will use the same seasonal naïve model which I used in my N-BEATS article. Hence, I will not go into much detail and only show the results.

Using the last week of data in the training set as our forecast results in an MAE of 17.84, which is quite good already.

Forecast of the Seasonal Naive baseline model (Image by the author).

Training the N-HiTS model

Let's train our first N-HiTS model. Because we use Nixtla's neuralforecast library, the implementation is straightforward. We initialize our N-HiTS model, defining our forecast horizon and lookback period. In this case, I use a lookback period of one week.

Then, we have some customization options. We can customize

  • the model by choosing the number of stacks and blocks, size of the MLP layers, activation function, kernel size for the MaxPooling, pooling type, etc.
  • the training by choosing the loss function, learning rate, batch size, etc.
  • the scaling of our input data.

See Nixtla's documentation for a full description.

In contrast to the N-BEATS model, we can see some differences. We have more parameters to customize our model. We can customize the multi-rate data sampling by choosing the kernel size and pooling type, and the hierarchical interpolation through the interpolation type and the expressiveness ratio. In the code snippet below, I have already played around with some of the hyperparameters.

After we have initialized our model, we wrap it with the neuralforecast class and fit the model. If you have read my N-BEATS article, you should be familiar with these steps.
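
A sketch of these steps is shown below. The hyperparameter values are only illustrative, the toy DataFrame stands in for the prepared electricity-price data, and the argument names follow the neuralforecast version I had in mind, so double-check them against Nixtla's documentation.

```python
import numpy as np
import pandas as pd
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS
from neuralforecast.losses.pytorch import MAE

# Toy stand-in for the prepared training data in neuralforecast's long format
ds = pd.date_range("2024-01-01", periods=6 * 7 * 24, freq="h")
train_df = pd.DataFrame({
    "unique_id": "DE",
    "ds": ds,
    "y": np.sin(np.arange(len(ds)) * 2 * np.pi / 24)
         + np.random.default_rng(0).normal(0, 0.1, len(ds)),
})

horizon = 14 * 24   # forecast the next two weeks of hourly prices
lookback = 7 * 24   # use one week of history as input

model = NHITS(
    h=horizon,
    input_size=lookback,
    loss=MAE(),
    n_blocks=[1, 1, 1],              # one block per stack
    n_pool_kernel_size=[8, 4, 1],    # multi-rate sampling: earlier stacks smooth more
    n_freq_downsample=[24, 12, 1],   # hierarchical interpolation per stack
    pooling_mode="MaxPool1d",
    interpolation_mode="linear",
    learning_rate=1e-3,
    max_steps=1000,
    scaler_type="standard",
)

nf = NeuralForecast(models=[model], freq="h")
nf.fit(df=train_df)
forecast_df = nf.predict()
```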

The results look better than our baseline. The MAE goes down to 17.01 compared to 17.84 of our baseline.

Forecast results with the N-HiTS model (Image by the author).

Tuning the hyperparameters of the N-HiTS model

Instead of playing around to find good hyperparameters, we can run an optimization.

It is not complicated. We do not need to add many lines of code. We only need to replace the NHITS model with Nixtla's AutoNHITS model. The AutoNHITS model does the hyperparameter tuning for us. We only need to choose the backend (ray or optuna) and define the search space of our hyperparameters.

These two choices are the only difference compared to running the NHITS model. All other steps stay the same.

Let’s start with choosing Optuna and using a custom config file.
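
Below is a sketch of what that can look like with an Optuna-style config function. The search space is arbitrary, train_df is the same prepared DataFrame as in the earlier sketch, and the exact AutoNHITS arguments may differ between neuralforecast versions.

```python
from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoNHITS
from neuralforecast.losses.pytorch import MAE

horizon = 14 * 24

def config_nhits(trial):
    """Custom Optuna search space; the values are placeholders."""
    return {
        "input_size": trial.suggest_categorical("input_size", [3 * 24, 7 * 24, 14 * 24]),
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True),
        "scaler_type": trial.suggest_categorical("scaler_type", ["standard", "robust"]),
        "max_steps": 500,
    }

model = AutoNHITS(h=horizon, loss=MAE(), config=config_nhits,
                  backend="optuna", num_samples=20)

nf = NeuralForecast(models=[model], freq="h")
nf.fit(df=train_df)   # train_df: the prepared training DataFrame from above
forecast_df = nf.predict()
```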

We see that the "optimized" N-HiTS has worse accuracy (an MAE of 22.63) than both the baseline and the hand-tuned N-HiTS model.

Forecast results of the AutoNHITS model. The results show the best model of the hyperparameter tuning experiment. (Image by the author)

Perhaps my choice of the search space was not good. Hence, we could try running more trials and using a different search space to get better results. Or we could use the default config of AutoNHITS. We can either use it directly by not passing a config to the model or by making small changes to the default config.

N-HiTS with exogenous variables

Before I finish this article, I want to show you one last thing. We can also use exogenous variables in the N-HiTS model. For this, we only need to pass exogenous variables into the NHITS model as futr_exog_list during initialization. For example, we could pass the day of the week to the model as there is a weekly seasonality.
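
A sketch of this, reusing the toy train_df from the earlier sketch and assuming the day of the week is available both historically and for the forecast horizon:

```python
import pandas as pd
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS
from neuralforecast.losses.pytorch import MAE

# Add the exogenous feature to the training data
train_df = train_df.copy()
train_df["dayofweek"] = train_df["ds"].dt.dayofweek

model = NHITS(h=14 * 24, input_size=7 * 24, loss=MAE(),
              futr_exog_list=["dayofweek"],  # exogenous variables known in the future
              scaler_type="standard", max_steps=1000)

nf = NeuralForecast(models=[model], freq="h")
nf.fit(df=train_df)

# Future values of the exogenous feature must be supplied for the forecast horizon
futr_ds = pd.date_range(train_df["ds"].max() + pd.Timedelta(hours=1),
                        periods=14 * 24, freq="h")
futr_df = pd.DataFrame({"unique_id": "DE", "ds": futr_ds})
futr_df["dayofweek"] = futr_df["ds"].dt.dayofweek
forecast_df = nf.predict(futr_df=futr_df)
```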

Adding the day of the week as an exogenous variable resulted in an MAE of 21.62. Trying different exogenous variables or different hyperparameters could improve the accuracy.

Forecast using the N-HiTS model with exogenous variables (Image by the author).

A final note on N-HiTS

N-HiTS has shown very good performance on a wide range of data sets on paper. However, that does not mean N-HiTS will work best for all problems and data sets. Particularly yours.

We saw in the examples that N-HiTS could barely beat the simple seasonal naive baseline model. And it took effort to get there. First, I spent far more time setting up the model and finding a good set of hyperparameters than I did on the baseline. Second, training took over 30 times as long as for the baseline model.

So, if this had been a company project, I would choose the baseline model. Although N-HiTS adds a small accuracy gain, the added complexity is not worth the trouble.

Hence, although N-HiTS is easy to use and seems like a promising model, do not start a project with the model. Start with a simple baseline model. Based on your baseline, you can decide if N-HiTS is a good choice for your problem. For example, if N-HiTS adds enough value compared to the added complexity.


Conclusion

The article has been very long. But there was a lot to cover. If you stayed until here, you now should

  • have a very good understanding of how the N-HiTS model works,
  • know what N-HiTS does differently compared to N-BEATS,
  • be able to use the N-HiTS model in practice, and
  • be able to change the model’s inner workings during your hyperparameter tuning.

If you want to dive deeper into the N-HiTS model, check out the N-HiTS paper. Otherwise, leave a comment and/or see you in my next article.

The post N-HiTS – Making Deep Learning for Time Series Forecasting More Efficient appeared first on Towards Data Science.

]]>
N-BEATS – The First Interpretable Deep Learning Model That Worked for Time Series Forecasting https://towardsdatascience.com/n-beats-the-first-interpretable-deep-learning-model-that-worked-for-time-series-forecasting-06920daadac2/ Sat, 11 May 2024 16:51:47 +0000 https://towardsdatascience.com/n-beats-the-first-interpretable-deep-learning-model-that-worked-for-time-series-forecasting-06920daadac2/ An easy-to-understand deep dive into how N-BEATS works and how you can use it.

The post N-BEATS – The First Interpretable Deep Learning Model That Worked for Time Series Forecasting appeared first on Towards Data Science.

]]>
Architecture of N-BEATS (Image taken from Oreshkin et al.).

Time series forecasting has been the only area in which Deep Learning and Transformers did not outperform other models.

Looking at the Makridakis M-competitions, the winning solutions have always relied on statistical models. Up to and including the M4 competition, winning solutions were either purely statistical or hybrids of ML and statistical models. Pure ML approaches barely surpassed the competition baseline.

This changed with a paper published by Oreshkin, et al. in 2020. The authors published N-BEATS, a promising pure Deep Learning approach. The model beat the winning solution of the M4 competition. It was the first pure Deep Learning approach that outperformed well-established statistical approaches.

N-BEATS stands for Neural Basis Expansion Analysis for Interpretable Time Series.

In this article, I will go through the architecture behind N-BEATS. But do not be afraid, the deep dive will be easy-to-understand. I also show you how we can make the deep learning approach interpretable. However, it is not enough to only understand how N-BEATS works. Thus, I will show you how we can easily implement a N-BEATS model in Python and also tune its hyperparameters.


Let’s understand the core idea of N-BEATS before looking at its architecture.

The core functionality of N-BEATS lies in neural basis expansion.

Basis expansion is a method to augment data. We expand our feature set to be able to model non-linear relationships. Sounds abstract, right?

How does basis expansion work? For example, we have a problem in which our target value y stands in some relationship to a feature x. We want to represent the relationship between y and x using a linear model. In the 1d space, this will result in a linear relationship (left plot in the figure below).

However, the feature and target might not show a linear relationship, resulting in a useless model. Is there anything we can do?

Yes, we can expand our feature set. Let’s add the quadratic value of the original feature to the feature set, resulting in [x, x²]. With this, we moved from a 1d space into a 2d space since we now have two features instead of one. We can now fit a linear model in the 2d space, resulting in a second-degree polynomial model (right plot in the figure below).

Linear regression in 1D space (left) compared to a linear regression in 2D space (right). Through Basis Expansion, we can better fit our model on the data (Image by the author).

And this is all there is to basis expansion. We extend our feature set by adding new features based on the original features. In this case, we used a polynomial expansion of degree 2. We added the quadratic value of each original feature to the feature set.
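To make this concrete, here is a minimal sketch of a degree-2 polynomial basis expansion followed by a linear fit, using scikit-learn (which is not part of the N-BEATS tooling); the toy data and parameter choices are purely illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy data with a quadratic relationship between x and y.
rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = x[:, 0] ** 2 + rng.normal(scale=0.5, size=100)

# Basis expansion: [x] -> [x, x^2]. A linear model in the expanded
# 2D feature space now captures the non-linear relationship.
x_expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(x_expanded, y)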

The most common basis expansion method is polynomial basis expansion. Yet, there are many other approaches, such as binning, piecewise-linear splines, natural cubic splines, logarithms, or squares.

The N-BEATS model decides which basis expansion to use. During training, the model tries to find the best basis expansion method to fit the data. We let the model do the work. That is why it is called "neural basis expansion."


How does N-BEATS work in detail?

The N-BEATS model has the following architecture:

Architecture of N-BEATS (Image taken from Oreshkin et al.).

A lot is going on here, and the picture might be overwhelming. But, the idea behind N-BEATS is straightforward.

We can observe two things.

First, the model splits the time series into a lookback and a forecast period. The model uses the lookback period to make a forecast. The lookback period is a multiple of the forecast period; an optimal multiple usually lies between two and six.

Second, the N-BEATS model (yellow rectangle on the right) consists of layered stacks. Each stack, in turn, consists of layered blocks.

Each block has a fork-like structure (blue rectangle on the left). One branch in the block produces a backcast, and the other branch a forecast from some block input data. The forecast contains the prediction of unseen values. The backcast, in contrast, shows us the model’s fit on the input data.

How do we receive the backcast and forecast from the input? First, the model passes the input through a fully connected neural network (MLP) with four layers. The MLP produces the expansion coefficients, theta, for the backcast and forecast. These expansion coefficients flow into two branches, one for backcasting and one for forecasting. In each branch, we perform the basis expansion. This is where the actual "neural basis expansion" happens.

As we can see in the picture above, N-BEATS connects various blocks in a stack. Because each block returns a backcast and forecast, two things happen. First, the model adds the partial forecast of each block to produce the stack’s forecast. Second, the model removes the backcast of a block from the block’s input. Hence, each block only receives the residual of the previous block. With this, the model only passes information that is not captured by the previous block to the next. Hence, each block tries to approximate only a part of the input signal, focusing on local patterns.

The N-BEATS model then layers various stacks. Like the blocks in a stack, each stack, except the first one, is trained on the residuals of the previous stack. With this, each stack learns a global pattern that was not captured before. The final forecast is the sum of the stacks’ forecasts, providing a hierarchical decomposition.

As we can see, N-BEATS applies a double residual stacking approach. The backcast and forecast result in backward and forward residuals. The layered architecture of blocks and stacks leads to the stacking of these residuals. Through the double residual stacks, N-BEATS can recreate the mechanisms of statistical models.
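To illustrate the mechanics, below is a simplified PyTorch sketch of a generic block and the doubly residual stacking inside one stack. It is not the reference implementation; the class and variable names are my own, and real implementations add further details such as fixed basis layers and multiple stacks.

import torch
import torch.nn as nn

class GenericBlock(nn.Module):
    def __init__(self, lookback, horizon, hidden=512):
        super().__init__()
        # Four fully connected layers produce the expansion coefficients theta.
        self.mlp = nn.Sequential(
            nn.Linear(lookback, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # In the generic block, the basis expansion is a learned linear mapping
        # from theta to the backcast and the forecast.
        self.backcast_head = nn.Linear(hidden, lookback)
        self.forecast_head = nn.Linear(hidden, horizon)

    def forward(self, x):
        theta = self.mlp(x)
        return self.backcast_head(theta), self.forecast_head(theta)

class NBeatsStack(nn.Module):
    def __init__(self, lookback, horizon, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [GenericBlock(lookback, horizon) for _ in range(n_blocks)]
        )

    def forward(self, x):
        residual, forecast = x, 0.0
        for block in self.blocks:
            backcast, block_forecast = block(residual)
            residual = residual - backcast        # pass only what is not yet explained
            forecast = forecast + block_forecast  # sum the partial forecasts
        return residual, forecast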


The advantages of N-BEATS

Compared to other deep learning approaches, N-BEATS enables us to build very deep NNs with interpretable results. Moreover, training is faster, as the model does not contain any recurrent or self-attention layers. Its double residual stacking also facilitates a more fluid gradient backpropagation.

Compared to classical Time Series Forecasting approaches, we do not need to do any feature engineering. We do not need to identify time series-specific characteristics, like seasonality and trend. N-BEATS does this for us. This makes the model easy to use, and we can get started quickly.

Moreover, the model is capable of Zero-Shot Transfer Learning.


How can the deep learning architecture be interpretable?

Well, the generic version of N-BEATS I described above is not interpretable. There are no constraints on what basis functions the model can learn or on the depth of the network. We do not know what the model learns and whether these are time series-specific components, such as trend.

How do we gain interpretability?

Here, we apply a trick. We restrict the depth of the model and which basis expansion functions the model can learn.

For example, we often use trend and seasonality in time series forecasting.

We can force the model to learn only these two characteristics. First, we restrict the depth of the model by only using two stacks. The first stack learns the trend, and the second stack learns the seasonality.

We can then interpret the model’s results by extracting each stack’s partial forecasts.

Second, we must force the model to learn the trend and seasonality only. We must introduce a problem-specific inductive bias. We achieve this by setting the basis expansion functions to specific functional forms. For this, we replace the last layer in each block with a function. We use a polynomial basis to determine the trend and a Fourier basis for seasonality.

General Architecture of the generic version of N-BEATS (left) and the interpretable version of N-BEATS (right) (Image by the author).
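In Nixtla’s neuralforecast implementation (used in the example below), this interpretable configuration can be sketched roughly as follows; the stack_types, n_blocks, n_polynomials, and n_harmonics arguments and their values are my assumptions about the library’s interface, so check the documentation of your installed version.

from neuralforecast.models import NBEATS

horizon = 24  # illustrative forecast length

interpretable_model = NBEATS(
    h=horizon,
    input_size=4 * horizon,
    stack_types=["trend", "seasonality"],  # restrict the model to the two stacks
    n_blocks=[3, 3],                       # number of blocks per stack
    n_polynomials=2,                       # degree of the polynomial trend basis
    n_harmonics=2,                         # Fourier harmonics for the seasonality basis
)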

Forecasting example using N-BEATS

Now that we know how N-BEATS works, let’s apply the model to a forecasting task.

We will predict the next two weeks of wholesale electricity prices in Germany. The data is provided by Ember in the "European Wholesale Electricity Price" dataset under a CC-BY-4.0 license. We will use the N-BEATS implementation from Nixtla’s neuralforecast library. The library makes it very easy for us to apply N-BEATS.

Electricity wholesale prices in Germany (Image by the author, data by Ember).

Please note that I am not trying to get a forecast as accurate as possible but rather show how we can apply N-BEATS.

Alright. Let’s get started.

Let’s import all the libraries we need and the dataset.

Nixtla has a time series format that every model expects: a DataFrame with a column ds containing the timestamps, a column y containing the target value, and a column unique_id containing a unique identifier for each time series. The unique_id allows us to train a model on several time series simultaneously.
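As a rough sketch, bringing the Ember data into this format could look like the snippet below. The file name and column names ("Datetime (UTC)", "Price (EUR/MWhe)", "Country") are assumptions about the downloaded CSV; adjust them to your file.

import pandas as pd

raw = pd.read_csv("european_wholesale_electricity_price_data_hourly.csv")
raw = raw[raw["Country"] == "Germany"]

df = pd.DataFrame({
    "unique_id": "DE",                            # one identifier per time series
    "ds": pd.to_datetime(raw["Datetime (UTC)"]),  # timestamps
    "y": raw["Price (EUR/MWhe)"],                 # target values
})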

We will skip a detailed data exploration step here. However, we can see two seasonal components:

  • daily: Prices are higher in the morning and evening hours as the electricity consumption is usually higher during these hours.
  • weekly: Prices are higher on weekdays than on weekends as electricity consumption is usually higher during weekdays.

The next step is splitting our dataset into a training and a test set. As I want to forecast the next two weeks of prices, I use the last two weeks of the dataset as my test set.

I had some trouble getting the model to work when using pandas’ datetime type. Hence, I converted the datetimes into timestamps.

Finally, I created a function to plot the results.
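A minimal sketch of these steps, assuming the DataFrame df from above, could look like this; the plotting helper is my own illustration rather than the exact function used for the figures, and the timestamp conversion is omitted for brevity.

import matplotlib.pyplot as plt

horizon = 14 * 24  # two weeks of hourly prices

Y_train_df = df.iloc[:-horizon]
Y_test_df = df.iloc[-horizon:]

def plot_forecast(train_df, test_df, forecast_df=None, model_col=None):
    """Plot the last week of training data, the test data, and an optional forecast."""
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.plot(train_df["ds"].tail(7 * 24), train_df["y"].tail(7 * 24), label="train")
    ax.plot(test_df["ds"], test_df["y"], label="actual")
    if forecast_df is not None and model_col is not None:
        ax.plot(forecast_df["ds"], forecast_df[model_col], label=model_col)
    ax.set_ylabel("EUR/MWh")
    ax.legend()
    plt.show()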

Baseline Model

For completeness, we will start with a simple model as our baseline. I have written about baseline models and why you should start with them in another article.

We will use a seasonal naïve model as our baseline, as the prices show distinct seasonal patterns. We will use the SeasonalNaive model provided in Nixtla’s statsforecast library. Here, we take the last week of data in the training set as our forecast.
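A sketch of the baseline, assuming the Y_train_df, Y_test_df, and horizon defined above; a season length of 7 * 24 = 168 hours encodes the weekly pattern, and the rough MAE calculation assumes both frames cover the same hours in the same order.

import numpy as np
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

sf = StatsForecast(models=[SeasonalNaive(season_length=7 * 24)], freq="H")
baseline_forecast = sf.forecast(df=Y_train_df, h=horizon)

mae_baseline = np.mean(np.abs(Y_test_df["y"].values - baseline_forecast["SeasonalNaive"].values))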

The results look quite good. The baseline gives us an MAE of 17.84. The accuracy of the baseline is very good in the first week of the test set. However, the further we move away from the last known value, the lower the accuracy.

Forecast of the Seasonal Naive baseline model (Image by the author).

We could probably do better if we spent more time on the baseline model. However, the baseline should be good enough for our use case.

Training the N-BEATS model

Let’s train our first N-BEATS model. The implementation is straightforward. We initialize our N-BEATS model, defining our forecast and lookback period. In this case, I use a lookback period of two weeks.

Then, we have some customization options. We can customize

  • the model by choosing stack types, number of blocks, size of the MLP layers, activation function, etc.
  • the training by choosing the loss function, learning rate, batch size, etc.
  • the scaling of our input data.

See Nixtla’s documentation for a full description. In the code snippet, I have already played around with the hyperparameters.

Once we have initialized our model, we must wrap it with the NeuralForecast class. The class provides us with the methods we need to train the model and later make predictions. It also allows us to train several models at once by passing a list of models to its models argument.

Then, we can finally fit the model using the fit() method and run a prediction for the next two weeks.
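Put together, the training code could look roughly like the sketch below, assuming the Y_train_df and horizon defined earlier. The hyperparameter values are illustrative, not the exact ones behind the reported results; see Nixtla’s documentation for the full list of arguments.

from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATS
from neuralforecast.losses.pytorch import MAE

model = NBEATS(
    h=horizon,            # two-week forecast period
    input_size=horizon,   # two-week lookback period, as described above
    loss=MAE(),
    max_steps=1000,
    scaler_type="standard",  # scale the input data
)

fcst = NeuralForecast(models=[model], freq="H")  # wrap the model(s)
fcst.fit(df=Y_train_df)                          # train
nbeats_forecast = fcst.predict()                 # forecast the next `horizon` hours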

The results look slightly better than our baseline. The MAE goes down to 17.5 compared to 17.84 of our baseline.

Forecast results with the N-BEATS model (Image by the author).

Tuning the hyperparameters of the N-BEATS model

Instead of playing around to find good hyperparameters, let’s run a hyperparameter optimization.

It is not complicated. Nixtla provides us with an AutoNBEATS model that does the hyperparameter tuning for us. We can choose between ray and optuna as the backend and either use a default config for the hyperparameters or create a custom config. We define our choices when initializing the AutoNBEATS model. That is the only difference compared to running the NBEATS model. All other steps stay the same.

In this case, I will use Optuna and a custom config.
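A sketch of the setup, assuming the optuna backend accepts a config function that receives an optuna trial (as described in Nixtla’s tuning documentation) and re-using horizon and Y_train_df from above; the search space below is a made-up example, not a recommended one.

from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoNBEATS
from neuralforecast.losses.pytorch import MAE

def config_nbeats(trial):
    # Hypothetical search space; adapt the ranges to your problem.
    return {
        "input_size": trial.suggest_categorical("input_size", [horizon, 2 * horizon]),
        "max_steps": trial.suggest_categorical("max_steps", [500, 1000]),
        "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True),
        "scaler_type": trial.suggest_categorical("scaler_type", ["standard", "robust"]),
    }

auto_model = AutoNBEATS(
    h=horizon,
    loss=MAE(),
    config=config_nbeats,  # a function of an optuna trial
    backend="optuna",
    num_samples=20,        # number of tuning trials
)

fcst = NeuralForecast(models=[auto_model], freq="H")
fcst.fit(df=Y_train_df)
auto_forecast = fcst.predict()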

Running the predict() method after the hyperparameter tuning gives us predictions using the best model.

We see that the “optimized” N-BEATS has a worse accuracy (MAE of 22.64) than the baseline and the N-BEATS model. However, running more trials during the hyperparameter tuning with a different search space might lead to different results.

Forecast results of the AutoNBEATS model. The results show the best model of the hyperparameter tuning experiment. (Image by the author.)

If we do not pass a config to the model, AutoNBEATS uses the default config. If we only want to change a few parameters from the default config, we could run this snippet.
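One possible pattern, assuming your neuralforecast version exposes AutoNBEATS.get_default_config for the optuna backend (check the documentation of your installed version); the overridden values are illustrative.

config = AutoNBEATS.get_default_config(h=horizon, backend="optuna")

def config_nbeats(trial):
    cfg = config(trial)          # start from the default search space
    cfg["input_size"] = horizon  # fix the lookback period
    cfg["max_steps"] = 500       # shorten training per trial
    return cfg

auto_model = AutoNBEATS(h=horizon, config=config_nbeats, backend="optuna", num_samples=20)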

If you want to see the results of the hyperparameter tuning, you can get them from the results attribute of the model.

# show results of hyperparameter tuning runs
results = fcst.models[0].results.trials_dataframe()

N-BEATS with exogenous variables

Before I finish this article, I want to show you one last thing. We can also use exogenous variables in the N-BEATS model. For this, we only need to use Nixtla’s [NBEATSx](https://nixtlaverse.nixtla.io/neuralforecast/models.nbeatsx.html) and define what exogenous variables we want to use. For example, we could pass the day of the week to the model as there is a weekly seasonality.
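A sketch of how this could look with NBEATSx, assuming ds is still a datetime column so the weekday can be derived from it (compute the feature before any timestamp conversion); the futr_exog_list argument tells the model which features will also be known for the forecast period.

from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATSx

# Add the day of the week to the training and test data.
Y_train_df["day_of_week"] = Y_train_df["ds"].dt.dayofweek
Y_test_df["day_of_week"] = Y_test_df["ds"].dt.dayofweek

model = NBEATSx(
    h=horizon,
    input_size=horizon,
    futr_exog_list=["day_of_week"],  # exogenous variables known for the future
    max_steps=1000,
)

fcst = NeuralForecast(models=[model], freq="H")
fcst.fit(df=Y_train_df)
# Future exogenous values must be provided for the forecast period.
nbeatsx_forecast = fcst.predict(futr_df=Y_test_df[["unique_id", "ds", "day_of_week"]])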

Adding the day of the week as an exogenous variable resulted in an MAE of 20.01. However, trying different exogenous variables, such as electricity consumption or a generation forecast, or different hyperparameters could improve the accuracy.

Forecast using the N-BEATS model with exogenous variables (Image by the author).

A final note on the examples

In all the examples, I showed you a very static approach. We only make a forecast once on the test set. But let’s say we work for a company that owns some power generation units. The company wants to know how they should plan the operation of their units to maximize revenue. For this, they want an updated forecast every morning. Hence, we must run the model every day.

But how do we manage that? Above, we only predicted the weeks directly after the training set.

We always need to re-train our model when we want to make a new prediction because the N-BEATS model relies on the lookback period. Hence, we must update our dataset with the latest available data, in our example the wholesale prices of the last day, and train the model again.

Should we run our hyperparameter tuning as well? Probably not. The tuning can take quite a long time and be computationally expensive. Instead, we run the hyperparameter tuning on the train set once. We use the same hyperparameters when re-training the model every day. If we see that the model degrades too much over time, we can re-run the hyperparameter tuning.
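A minimal sketch of such a daily loop, re-using the setup from above; new_prices_df is a hypothetical DataFrame with yesterday’s observed prices in the same (unique_id, ds, y) format, and the hyperparameters stand in for whatever the one-off tuning produced.

import pandas as pd
from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATS

def daily_forecast(history_df, new_prices_df, horizon):
    """Append yesterday's prices, re-train with fixed hyperparameters, and forecast."""
    history_df = pd.concat([history_df, new_prices_df], ignore_index=True)

    # Re-use the hyperparameters found once during tuning on the training set.
    model = NBEATS(h=horizon, input_size=horizon, max_steps=1000)
    fcst = NeuralForecast(models=[model], freq="H")
    fcst.fit(df=history_df)

    return history_df, fcst.predict()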


Conclusion

The article has been very long. Longer than I intended. But there was a lot to cover. If you stayed until here, you now should

  • have a very good understanding of how the N-BEATS model works,
  • be able to use the N-BEATS model in practice, and
  • be able to change the model’s inner workings during your hyperparameter tuning.

If you want to dive deeper into the N-BEATS model, check out the N-BEATS paper. Otherwise, leave a comment and/or see you in my next article.

The post N-BEATS – The First Interpretable Deep Learning Model That Worked for Time Series Forecasting appeared first on Towards Data Science.

]]>