From Resume to Cover Letter Using AI and LLM, with Python and Streamlit


DISCLAIMER: The idea of writing a cover letter or even a resume with AI obviously does not start with me. A lot of people have done this before (very successfully) and have built websites and even companies from the idea. This is just a tutorial on how to build your own Cover Letter AI Generator App using Python and a few lines of code. All the code you will see in this blog post can be found in my public GitHub folder. Enjoy. 🙂

Pep Guardiola is a (very successful) Manchester City football coach. During Barcelona’s Leo Messi years, he invented a way of playing football known as “Tiki-Taka”. This means that as soon as you receive the ball, you pass the ball, immediately, without even controlling it. You can pass the ball 30–40 times before scoring a goal.

More than a decade later, we can see that the way of playing football that made Guardiola and his Barcelona famous is gone. If you look at a Manchester City match, they take the ball and immediately look for the striker or the winger. You only need a few vertical passes, immediately looking for the opportunity. It is more predictable, but you do it so many times that you will eventually find the space to hit the target.

I think that the job market has somehow gone in the same direction. 

Before, you had the opportunity to go to the company, hand in your resume, talk to them, be around them, schedule an interview, and actively talk to people. You would spend weeks preparing for that trip, polishing your resume, and reviewing questions and answers.

For many, this old-fashioned strategy still works, and I believe it. If you have a good networking opportunity, or the right time and place, handing in your resume works very well. We love the human connection, and actually knowing someone is very effective.

It is important to consider that there is a whole other approach as well. Companies like LinkedIn and Indeed, and the internet in general, completely changed the game. You can send so many resumes to so many companies that you can find a job out of sheer statistics. AI is pushing this game a little bit further. There are a lot of AI tools to tailor your resume to a specific company, make your resume more impressive, or build a job-specific cover letter. There are indeed many companies that sell this kind of service to people who are looking for jobs.

Now, believe me, I have nothing against these companies at all, but the AI they are using is not really “their AI”. What I mean by that is that if you use ChatGPT, Gemini, or the brand-new DeepSeek to do the exact same task, you will very likely not get a worse response than from the (paid) tool you are using on their website. You are really paying for the “convenience” of having a backend API that does what we would otherwise have to do through ChatGPT. And that’s fair.

Nonetheless, I want to show you that it is indeed very simple and cheap to make your own “resume assistant” using Large Language Models. In particular, I want to focus on cover letters. You give me your resume and the job description, and I give you a cover letter you can copy and paste into LinkedIn, Indeed, or your email.

In one image, it will look like this:

Now, Large Language Models (LLMs) are specific AI models that produce text. More specifically, they are HUGE Machine Learning models (even the small ones are very big). 

This means that building your own LLM or training one from scratch is very, very expensive. We won’t do anything like that. We will use a perfectly working LLM and we will smartly instruct it to perform our task. More specifically, we will do that in Python, using an API. Formally, it is a paid API; nonetheless, since I started the whole project (with all the trial and error), I have spent less than 30 cents. You will likely spend 4 or 5 cents on it.

Additionally, we will make a working web app that will allow you to have your cover letter in a few clicks. It will be an effort of a couple hundred lines of code (with spaces 🙂).

To motivate you, here are screenshots of the final app:

Pretty cool right? It took me less than 5 hours to build the whole thing from scratch. Believe me: it’s that simple. In this blog post, we will describe, in order:

  1. The LLM API Strategy. This part will help the reader understand what LLM Agents we are using and how we are connecting them.
  2. The LLM Object. This is the implementation of the LLM API strategy above using Python
  3. The Web App and results. The LLM Object is then transferred into a web app using Streamlit. I’ll show you how to access it and some results. 

I’ll try to be as specific as possible so that you have everything you need to make it yourself, but if this stuff gets too technical, feel free to skip to part 3 and just enjoy the sunset 🙃.

Let’s get started!

1. LLM API Strategy

This is the Machine Learning System Design part of this project, which I kept extremely light, because I wanted to maximize the readability of the whole approach (and because it honestly didn’t need to be more complicated than that).

We will use two APIs:

  1. A Document Parser LLM API will read the Resume and extract all the meaningful information. This information will be put in a .json file so that, in production, we will have the resume already processed and stored somewhere in our memory.
  2. A cover letter LLM API. This API will read the parsed resume (the output of the previous API) and the job description and it will output the Cover Letter.

Two main points:

  1. What is the best LLM for this task? For text extraction and summarization, Llama and Gemma are known to be reasonably cheap and efficient LLMs. As we are going to use Llama for the summarization task, we can adopt it for the other API as well to keep things consistent. If you want to use another model, feel free to do so.
  2. How do we connect the APIs? There are a variety of ways you can do that. I decided to give the Llama API a try. The documentation is not exactly extensive, but it works well and it allows you to play with many models. You will need to log in, buy some credit ($1 is more than sufficient for this task), and save your API key. Feel free to switch to another solution (like Hugging Face or LangChain) if you feel like it.

Ok, now that we know what to do, we just need to actually implement it in Python. 

2. LLM Object

The first thing that we need is the actual LLM prompts. In the API, the prompts are usually passed using a dictionary. As they can be pretty long, and their structure is always similar, it makes sense to store them in .json files. We will read the JSON files and use them as inputs for the API call. 

2.1 LLM Prompts

In this .json file, you will have the model (you can use whatever model you like) and the content, which is the instruction for the LLM. Of course, the content key has a static part, which is the “instruction”, and a “dynamic” part, which is the specific input of the API call. For example, this is the .json file for the first API; I called it resume_parser_api.json:
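Since the file itself isn't reproduced here, this is a minimal sketch of what resume_parser_api.json could look like (the model name is just a placeholder; pick whatever model you prefer):

{
    "model": "llama3.1-70b",
    "content": "You are a resume parser. You will extract information from this resume and put them in a .json file. The keys of your dictionary will be first_name, last_name, location, work_experience, school_experience, skills. In selecting the information, keep track of the most insightful."
}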

As you can see from the “content” there is the static call:

“You are a resume parser. You will extract information from this resume and put them in a .json file. The keys of your dictionary will be first_name, last_name, location, work_experience, school_experience, skills. In selecting the information, keep track of the most insightful.”

The keys I want to extract into my .json file are:

[first_name, last_name, location, work_experience, school_experience, skills]

Feel free to add any other information that you want to be “extracted” from your resume, but remember that this should only be material that matters for your cover letter. The specific resume will be appended after this text to form the full call/instruction. More on that later.

The other instruction is in cover_letter_api.json:

Now the instruction is this one:

“You are an expert in job hunting and a cover letter writer. Given a resume json file, the job description, and the date, write a cover letter for this candidate. Be persuasive and professional. Resume JSON: {resume_json} ; Job Description: {job_description}, Date: {date}”

As you can see, there are three placeholders: “resume_json”, “job_description” and “date”. As before, these placeholders will then be replaced with the correct information to form the full prompt.

2.2 constants.py

I made a very small constants.py file with the paths of the two .json prompt files and the API key that you have to generate from the Llama API (or whatever API you are using). Modify this if you want to run the file locally.
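A minimal sketch of what constants.py could contain (the variable names and file locations are my own choices; adapt them to wherever you keep your prompts and key):

# constants.py -- paths and credentials used across the project
RESUME_PARSER_PROMPT_PATH = "prompts/resume_parser_api.json"
COVER_LETTER_PROMPT_PATH = "prompts/cover_letter_api.json"
LLAMA_API_KEY = "insert-your-llama-api-key-here"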

2.3 file_loader.py

This file is a collection of “loaders” for your resume. Boring stuff but important. 
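A sketch of what such loaders could look like, assuming the pypdf and python-docx packages are installed (the article's own helpers may differ):

# file_loader.py -- minimal resume loaders for PDF and Word files
from pypdf import PdfReader
from docx import Document

def load_pdf(path):
    # Concatenate the text of every page of the PDF
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def load_docx(path):
    # Concatenate the text of every paragraph of the Word document
    doc = Document(path)
    return "\n".join(p.text for p in doc.paragraphs)

def load_resume(path):
    # Dispatch on the file extension
    if path.lower().endswith(".pdf"):
        return load_pdf(path)
    if path.lower().endswith(".docx"):
        return load_docx(path)
    raise ValueError("Unsupported resume format: use .pdf or .docx")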

2.4 cover_letter.py

The whole implementation of the LLM Strategy can be found in this object that I called CoverLetterAI. There it is:
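The full, commented version lives in the GitHub folder; a condensed sketch of its skeleton could look like the following. The llamaapi calls and, in particular, the response parsing reflect my understanding of the SDK, so treat them as assumptions and double-check them against the library's documentation.

# cover_letter.py -- condensed sketch of the CoverLetterAI object
import json
from datetime import date
from llamaapi import LlamaAPI
from constants import RESUME_PARSER_PROMPT_PATH, COVER_LETTER_PROMPT_PATH, LLAMA_API_KEY
from file_loader import load_resume

class CoverLetterAI:
    def __init__(self):
        self.llama = LlamaAPI(LLAMA_API_KEY)
        self.resume_text = None
        self.resume_json = None
        self.job_description = None

    def read_candidate_data(self, resume_path):
        # Load the resume (PDF or Word) as plain text
        self.resume_text = load_resume(resume_path)

    def _run(self, model, content):
        # Single LLM call; the response format below is an assumption,
        # print response.json() once to check it for your model
        request = {"model": model, "messages": [{"role": "user", "content": content}]}
        response = self.llama.run(request)
        return response.json()["choices"][0]["message"]["content"]

    def profile_candidate(self):
        # First LLM: static instruction + the raw resume text appended after it
        with open(RESUME_PARSER_PROMPT_PATH) as f:
            prompt = json.load(f)
        self.resume_json = self._run(prompt["model"], prompt["content"] + "\n" + self.resume_text)

    def add_job_description(self, job_description):
        self.job_description = job_description

    def write_cover_letter(self):
        # Second LLM: resume json + job description + date -> cover letter
        with open(COVER_LETTER_PROMPT_PATH) as f:
            prompt = json.load(f)
        content = prompt["content"].format(
            resume_json=self.resume_json,
            job_description=self.job_description,
            date=date.today().strftime("%B %d, %Y"),
        )
        return self._run(prompt["model"], content)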

I spent quite some time trying to make everything modular and easy to read. I also added comments to all the functions so you can see exactly what does what. How do we use this beast?

So the whole code runs in 5 simple lines. Like this:

from cover_letter import CoverLetterAI

cover_letter_AI = CoverLetterAI()
cover_letter_AI.read_candidate_data('path_to_your_resume_file')
cover_letter_AI.profile_candidate()
cover_letter_AI.add_job_description('Insert job description')
cover_letter_AI.write_cover_letter()

So in order:

  1. You call the CoverLetterAI object. It will be the star of the show.
  2. You give me the path to your resume. It can be a PDF or Word file, and I read your information and store it in a variable.
  3. You call profile_candidate(), and I run my first LLM. This processes the candidate's info and creates the .json file we will use for the second LLM.
  4. You give me the job_description and you add it to the system. Stored.
  5. You call write_cover_letter() and I run my second LLM, which generates the cover letter given the job description and the resume .json file.

3. Web App and Results

So that is really it. You saw all the technical details of this blog post in the previous paragraphs.

Just to be extra fancy and to show you that it works, I also turned it into a web app, where you can just upload your resume, add your job description, and click to generate the cover letter. This is the link and this is the code.

Now, the cover letters that are generated are scary good.

This is a random one:

February 1, 2025

Hiring Manager,
[Company I am intentionally blurring]

I am thrilled to apply for the Distinguished AI Engineer position at [Company I am intentionally blurring], where I can leverage my passion for building responsible and scalable AI systems to revolutionize the banking industry. As a seasoned machine learning engineer and researcher with a strong background in physics and engineering, I am confident that my skills and experience align with the requirements of this role.

With a Ph.D. in Aerospace Engineering and Engineering Mechanics from the University of Cincinnati and a Master’s degree in Physics of Complex Systems and Big Data from the University of Rome Tor Vergata, I possess a unique blend of theoretical and practical knowledge. My experience in developing and deploying AI models, designing and implementing machine learning algorithms, and working with large datasets has equipped me with the skills to drive innovation in AI engineering.

As a Research and Teaching Assistant at the University of Cincinnati, I applied surrogate models to detect and classify cracks in pipes, achieving a 14% improvement in damage detection experiments. I also developed surrogate models using deep learning algorithms to accelerate Finite Element Methods (FEM) simulations, resulting in a 1M-fold reduction in computational time. My experience in teaching and creating courses in signal processing and image processing for teens interested in AI has honed my ability to communicate complex concepts effectively.

In my previous roles as a Machine Learning Engineer at Gen Nine, Inc., Apex Microdevices, and Accenture, I have successfully designed, developed, and deployed AI-powered solutions, including configuring mmWave radar and Jetson devices for data collection, implementing state-of-the-art point cloud algorithms, and leading the FastMRI project to accelerate MRI scan times. My expertise in programming languages such as Python, TensorFlow, PyTorch, and MATLAB, as well as my experience with cloud platforms like AWS, Docker, and Kubernetes, has enabled me to develop and deploy scalable AI solutions.

I am particularly drawn to [Company I am intentionally blurring] commitment to creating responsible and reliable AI systems that prioritize customer experience and simplicity. My passion for staying abreast of the latest AI research and my ability to judiciously apply novel techniques in production align with the company’s vision. I am excited about the opportunity to work with a cross-functional team of engineers, research scientists, and product managers to deliver AI-powered products that transform how [Company I am intentionally blurring] serves its customers.

In addition to my technical skills and experience, I possess excellent communication and presentation skills, which have been demonstrated through my technical writing experience at Towards Data Science, where I have written comprehensive articles on machine learning and data science, reaching a broad audience of 50k+ monthly viewers.

Thank you for considering my application. I am eager to discuss how my skills and experience can contribute to the success of the [Company I am intentionally blurring] and [Company I am intentionally blurring]’s mission to bring humanity and simplicity to banking through AI. I am confident that my passion for AI, my technical expertise, and my ability to work collaboratively will make me a valuable asset to your team.

Sincerely,

Piero Paialunga

They look just like I would write them for a specific job description. That being said, in 2025, you need to be careful because hiring managers do know that you might be using AI to write them, and the “computer tone” is pretty easy to spot (e.g. words like “eager” are very ChatGPT-ish lol). For this reason, I'd suggest using these tools wisely. Sure, you can build your “template” with them, but be sure to add your personal touch, otherwise your cover letter will be exactly like the thousands of other cover letters that the other applicants are sending in.

This is the code to build the web app:
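Since the full file is linked rather than printed here, a minimal Streamlit sketch could look like this (the widget labels and the temporary-file handling are my assumptions):

# app.py -- minimal Streamlit front end for CoverLetterAI
import tempfile
import streamlit as st
from cover_letter import CoverLetterAI

st.title("From Resume to Cover Letter")
resume_file = st.file_uploader("Upload your resume", type=["pdf", "docx"])
job_description = st.text_area("Paste the job description here")

if st.button("Generate cover letter") and resume_file and job_description:
    # Streamlit gives an in-memory file; persist it so the loaders can read it
    suffix = "." + resume_file.name.split(".")[-1]
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(resume_file.read())
        resume_path = tmp.name

    cover_letter_ai = CoverLetterAI()
    cover_letter_ai.read_candidate_data(resume_path)
    cover_letter_ai.profile_candidate()
    cover_letter_ai.add_job_description(job_description)
    st.write(cover_letter_ai.write_cover_letter())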

4. Conclusions 

In this blog article, we discovered how to use an LLM to convert your resume and a job description into a specific cover letter. These are the points we touched:

  1. The use of AI in job hunting. In the first chapter we discussed how job hunting has been completely revolutionized by AI.
  2. Large Language Models idea. It is important to design the LLM APIs wisely. We did that in the second paragraph.
  3. LLM API implementation. We used Python to implement the LLM APIs organically and efficiently.
  4. The Web App. We used Streamlit to build a web app to display the power of this approach.
  5. Limits of this approach. I think that AI-generated cover letters are indeed very good. They are on point, professional, and well crafted. Nonetheless, if everyone starts using AI to build cover letters, they all really look the same, or at least they all have the same tone, which is not great. Something to think about.

5. References and other brilliant implementations

I feel it is only fair to mention the many brilliant people who had this idea before me and made it public and available for everyone. These are only a few of the ones I found online.

Cover Letter Craft by Balaji Kesavan is a Streamlit app that implements a very similar idea of crafting the cover letter using AI. What we do differently from that app is that we extract the resume directly from the Word or PDF file, while his app requires copy-pasting. That being said, I think he is incredibly talented and very creative, and I recommend taking a look at his portfolio.

Randy Pettus has a similar idea as well. The difference between his approach and the one proposed in this tutorial is that he is very specific about the information he asks for, with fields like the current hiring manager and the temperature of the model. It's very interesting (and smart) that you can clearly see how he thinks about cover letters and guides the AI to build them the way he likes. Highly recommended.

Juan Esteban Cepeda does a very good job in his app as well. You can also tell that he was working on making it bigger than a simple Streamlit app, because he added the link to his company and a bunch of reviews by users. Great job and great hustle. 🙂

6. About me!

Thank you again for your time. It means a lot ❤

My name is Piero Paialunga and I’m this guy here:

I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department and a Machine Learning Engineer for Gen Nine. I talk about AI and Machine Learning in my blog posts and on LinkedIn. If you liked the article and want to know more about machine learning and follow my studies, you can:

A. Follow me on Linkedin, where I publish all my stories
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me to receive all the corrections or doubts you may have.
C. Become a referred member, so you won’t have any “maximum number of stories for the month” and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available.
D. Want to work with me? Check my rates and projects on Upwork!

If you want to ask me questions or start a collaboration, leave a message here or on Linkedin:

piero.paialunga@hotmail.com

Hands-On Delivery Routes Optimization (TSP) with AI, Using LKH and Python

Here's how to optimize delivery routes, from theory to code.

Photo by Mudit Agarwal on Unsplash

The code for this article can be found in this GitHub folder.

One of my favorite professors throughout my studies told me this:

"Just because your algorithm is inefficient, it doesn’t mean that the problem is hard"

This means that for whatever problem you want to solve (easy or hard), there will always be an approach that is naive enough to be extremely inefficient. For example, let's say you have to get to a new workplace. Instead of using Google Maps, you start from your house's alley and try all the possible combinations of streets (north, south, west, and east). By the time you arrive at work, your company might have filed for bankruptcy or fired you.

Let's try to be a little more formal. Let's say that in whatever business or engineering environment, you have to find the minimum or maximum of a function. For example, your company has to maximize revenue from the sales of a given department. We call this function f. The "strings" you pull, meaning the decisions that you can take to maximize the revenue, form a vector x. You obviously can't increase sales by hiring 10⁶ people a week or making people work 18 hours a day, so you have some constraints. We say that you can make these decisions in a decision space X. So the task is this:

Let’s assume the f function is linear.

If you did this naively, you would literally stand no chance, because the space is continuous, meaning you would have to explore a closed infinite space: if x₁ is between 0 and 1, you have infinitely many numbers between the boundaries, and the same applies to all the other variables, combinatorially. So if you have 10 variables, it is like having infinity¹⁰ to explore. Nonetheless, this problem can be solved with a method called simplex. This algorithm only focuses on the vertices of the X space and solves the problem in 2^n steps (in the worst case), where n is the number of variables of your problem. Most of the time, the runtime is actually closer to n³, so way quicker than 2^n.
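As a quick aside in code, here is a toy linear program solved with scipy (the numbers are made up for illustration; scipy's "highs" solver includes a dual simplex implementation):

# Maximize revenue 3*x1 + 5*x2 subject to two resource constraints.
# linprog minimizes, so we pass the negated objective.
from scipy.optimize import linprog

c = [-3, -5]                      # minimize -(3*x1 + 5*x2)
A_ub = [[1, 2], [3, 1]]           # resource usage per unit
b_ub = [10, 15]                   # available resources
bounds = [(0, None), (0, None)]   # x1, x2 >= 0

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(result.x, -result.fun)      # optimal decisions and maximal revenue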

So, going back to our question: is the problem of optimizing your sales with a linear function "hard"? Not really, but if you are naive enough to try to enumerate all the possible cases, then the problem doesn't just become hard, it becomes impossible.

Now let's say we are truck drivers. We need to run through multiple cities around the USA. For example, let's say we need to tour multiple cities on the East Coast starting from Cincinnati: we have to visit Cleveland, Chicago, Minneapolis, New York, Nashville, and Detroit:

Screenshot made by author using Google Maps.

Now, what is the route that minimizes the distance? What path should I follow so that I can visit all the cities while spending the least time on the road? This problem is known as the Traveling Salesman Problem (TSP).

Now, there is a huge literature on this, and this problem is known to be NP-hard, meaning that there is no known method to solve this problem in n^k computational time, where n is the number of cities and k is a constant. In short, this problem is hard because it is indeed intrinsically hard, not because we are "bad at it".

In this optimization case, what we usually do is find a compromise. We know that it would take a crazy long amount of time to find the guaranteed best solution. So we solve this problem in a way that we cannot actually guarantee to get the best of the best of the solutions, but we can get pretty close to the best solution in a much shorter amount of time. These methods are known as heuristic algorithms. The heuristic algorithm I want to use in this blog post is the Lin–Kernighan–Helsgaun (LKH) algorithm.

1. LKH’s Theory

Optimized route for a large number of cities using LKH. Image from LKH Documentation.

So why LKH?

This method is one of the most elegant and efficient methods to solve the Travelling Salesman Problem. The coolest thing about the LKH algorithm is the fact that it scales very, very well. LKH has been tested on up to almost 2M cities and it is still able to find solutions in a feasible amount of time. Another desirable aspect of this approach is that, even if LKH technically does not guarantee optimality, in practice it is usually very likely to converge to optimality within a few runs.

The idea of this method is very simple: you start with a random tour, which will have some initial distance, and you iteratively improve it by changing some of the connections and seeing if this decreases the distance. More specifically, at every iteration you perform so-called k-opt moves that replace k connections of the current tour with k new ones. Once you have done that, you compute a quantity called "gain", which is the difference between the total distance of the removed edges and the total distance of the added edges. If the gain is positive, it means the move has reduced the total tour length, and the new tour is accepted. Otherwise, the algorithm discards the change and continues exploring other possible moves. If you run out of options, your algorithm stops. *

  • For more information about this method, you can read the original paper here

This method is fairly simple, but its application is extremely satisfying. Let’s take a look.

2. LKH’s Code

Let’s start by setting up our Python environment and defining the libraries.

I would like to solve this problem in the real world, meaning that I want to take boundaries that come from real areas, not just random points in a 2D space or whatever. For this reason, we will need a library called Folium to display our points on a 2D map. The LKH algorithm is implemented in the library called elkai. The other libraries are just normal ones like numpy and scipy.
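These are the imports used throughout this section (the exact list in the repo may differ slightly; anything missing can be installed with pip install folium elkai numpy scipy):

import json
import random

import folium
import elkai
import numpy as np
from scipy.spatial.distance import cdist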

We are going to add random pins in the world, and we are going to find the shortest distance route to travel around all these pins.

For this task, I created a script TravelSalesman.py with a class named TaxiDriverAI():
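The full, commented class is in the GitHub folder; a condensed sketch of it, using the imports above, could look like the following. The elkai call uses the library's solve_float_matrix interface as I understand it (newer releases expose a DistanceMatrix object instead), and the Euclidean distance on raw latitude/longitude is only a rough proxy for real road distances.

# TravelSalesman.py -- condensed sketch of the TaxiDriverAI class
class TaxiDriverAI:
    def __init__(self, route_box_file="route_box.json", map_folder="map_folder"):
        with open(route_box_file) as f:
            self.box = json.load(f)          # min/max latitude and longitude
        self.map_folder = map_folder
        self.points = None                   # (n, 2) array of [lat, lon]

    def generate_random_points(self, num_points=100):
        # Drop num_points random pins inside the bounding box
        lats = np.random.uniform(self.box["min_lat"], self.box["max_lat"], num_points)
        lons = np.random.uniform(self.box["min_lon"], self.box["max_lon"], num_points)
        self.points = np.column_stack([lats, lons])

    def _base_map(self):
        center = self.points.mean(axis=0).tolist()
        return folium.Map(location=center, zoom_start=5)

    def static_plot(self, name="random_points_map.html"):
        # Save a map with just the pins
        m = self._base_map()
        for lat, lon in self.points:
            folium.Marker([lat, lon]).add_to(m)
        m.save(f"{self.map_folder}/{name}")

    def plot_with_connections(self, name="random_route_map.html"):
        # Save a map with the pins joined in a random order
        order = list(range(len(self.points)))
        random.shuffle(order)
        order.append(order[0])               # close the tour
        m = self._base_map()
        folium.PolyLine(self.points[order].tolist()).add_to(m)
        m.save(f"{self.map_folder}/{name}")

    def find_shortest_path(self):
        # Run elkai's LKH on the pairwise distance matrix and return the tour
        distances = cdist(self.points, self.points)
        tour = elkai.solve_float_matrix(distances.tolist())
        if tour[0] != tour[-1]:
            tour.append(tour[0])             # go back to the starting point
        print("Elkai Optimized Tour:", tour)
        return tour

    def display_optimized_route(self, tour, name="optimized_route_map.html"):
        # Save a map of the LKH-optimized tour
        m = self._base_map()
        folium.PolyLine(self.points[tour].tolist()).add_to(m)
        m.save(f"{self.map_folder}/{name}")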



Let's walk through it step by step.

This class takes two inputs, both with default arguments. The first input is the .json file that gives you the bounding box of the random points. For example, I chose to focus on the East Coast of the United States. The boundaries are these:

{
    "min_lat": 30.0,
    "max_lat": 39.0,
    "min_lon": -95.0,
    "max_lon": -81.0
}

Write a file like that and put it wherever you want, then pass the path into route_box_file.

The other input involves where we are going to store the output. The output of this code will be the optimized route and it will be saved as a map. We can also store random routes as maps for reference.

The maps will be saved in interactive .html files. The only thing you need to provide is the folder where you want them saved. By default, the folder will be map_folder.

The whole code can be found in this GitHub folder.

3. LKH’s applications

So the whole article can be summarized, coding-wise, in one line:

from TravelSalesman import * 

We can then use it like this:

driver_ai = TaxiDriverAI()

Ok, let's have fun with it. The first thing we do is use the .generate_random_points(n) function to populate the map with n points. For example, let's say 100 points.

driver_ai.generate_random_points(num_points = 100)

… and let’s display them (all steps together):

from TravelSalesman import * 

driver_ai = TaxiDriverAI()
driver_ai.generate_random_points(num_points = 100)
driver_ai.static_plot()
Output made by author using code above

Now, given the way we set up our system, you will find this map in a .html file named "random_points_map.html" in whatever folder you chose.

So the point is: we have to visit all these n=100 cities. If we examine all the possible paths to visit them we get:

100! = 9.332622e+157 paths

Even if computing a single path takes a nanosecond, you will not be able to visit all the paths. And that is "just" for 100 cities. In short, yes, the problem is bad, but our approach is terrible as well.

So we could say: "Ok, let’s try a random route", using the plot_with_connections() function. This function randomly connects the points of our route.

driver_ai.plot_with_connections()
Image made by author using code above

This is one route, but it is obviously so far from ideal. We go back and forth so many times, waste a lot of gas, and travel a lot of unnecessary distance.

So we can’t visit all the routes, but it’s also very risky to just trust a random path.

In these desperate times, our LKH hero comes into play. If we use the LKH approach proposed above, we can get an optimized route almost immediately. Is it going to be optimal optimal (meaning the best of all the possible routes)? Maybe, maybe not (we will never know, because we can never possibly visit them all), but it will be very close.

All this LKH talk can be condensed into one line, using the function find_shortest_path(), which applies elkai's LKH implementation to find the shortest path:

optimal_tour = driver_ai.find_shortest_path()

The output for this particular example is:

Elkai Optimized Tour: [0, 4, 90, 78, 3, 8, 98, 77, 55, 10, 35, 70, 52, 82, 49, 58, 42, 54, 60, 2, 81, 36, 37, 74, 7, 85, 25, 57, 43, 71, 67, 47, 50, 26, 87, 79, 28, 56, 23, 66, 34, 75, 64, 88, 39, 63, 80, 29, 45, 38, 9, 73, 20, 46, 32, 22, 44, 33, 18, 13, 91, 97, 21, 83, 61, 11, 40, 24, 19, 14, 99, 93, 89, 72, 53, 65, 68, 48, 51, 1, 59, 31, 95, 96, 76, 41, 69, 92, 6, 17, 16, 94, 86, 15, 84, 27, 12, 5, 62, 30, 0]

And as it should be, we go back to the starting point (otherwise it wouldn't be a "tour"). If we use the .display_optimized_route(tour) function, we can display the optimized route:

driver_ai.display_optimized_route(optimal_tour)
Image made by author using code above

Ok, now we are talking. Our algorithm is defining an optimized route that doesn't make us waste months of our lives re-traveling the same roads. This plot alone shows the beauty of LKH and why I think this algorithm is so elegant and brilliant.

Just to reiterate: do we know that this is the best out of the 10¹⁵⁷ possible routes? Absolutely not. Is it a pretty good solution? Absolutely yes, and we can see, by eye, that we are indeed being pretty smart at finding a good route through so many cities.

4. LKH’s limitations

LKH is known to be a pretty good Traveling Salesman Problem (TSP) solution. Now, optimization experts may disagree, and they might have a point. For example, Dynamic Programming (DP) guarantees an optimal solution, but it's way more expensive. Methods like Genetic Algorithms (GA) are intriguing, because they iterate from a route and find a better and better solution with a "survival of the fittest" approach. People are also improving LKH using reinforcement learning, awarding a higher score to proposed routes with shorter distances. There are people dedicating their lives to the TSP and they are doing an amazing job at it.

On another note, the main issue with LKH is probably not (or not only) its complexity, but rather its intrinsic static nature.

In real life we don't travel in a perfect world where the time to go from A to B depends only on the distance between A and B: we travel on roads. This means that traffic, time of day, gas stations (and their prices), road conditions, and delivery urgency are all factors that should be included (and they are not). One could think: "Ok then, let's not optimize on distance, but on convenience, where convenience is a sum of all the other factors." The problem is that this approach still treats those factors as fixed. Methods like Dijkstra's algorithm are more suitable in some of these cases, while other approaches use Machine Learning methods to predict the time to travel from A to B (if we do that, then we can apply LKH on the predicted times).

5. Conclusions

Thank you for "travelling" with me throughout this blog post 🙃

In this blogpost, we explored the optimization problem known as "Travelling Salesman Problem". In particular, we did it following these steps:

  1. We discussed "hard" optimization problems. An optimization problem is not necessarily "hard" because our algorithm takes a lot of time to solve it. Most of the time, the brute-force approach is not feasible, but that doesn't mean that the task itself is "hard".
  2. We described the Traveling Salesman Problem (TSP), which is the problem of optimizing a "tour" of multiple points, minimizing the distance.
  3. We talked about the LKH algorithm. First we described its theory, then we implemented it in a class, using Python and the elkai library.
  4. We applied the LKH algorithm to a real-world traveling experiment, finding the shortest route that allows you to visit all the selected cities on the East Coast.
  5. We discussed LKH’s static nature and its limitations.

6. About me!

Thank you again for your time. It means a lot ❤

My name is Piero Paialunga and I’m this guy here:

Image made by author

I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department and a Machine Learning Engineer for Gen Nine. I talk about AI and Machine Learning in my blog posts and on LinkedIn. If you liked the article and want to know more about machine learning and follow my studies, you can:

A. Follow me on LinkedIn, where I publish all my stories
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me to receive all the corrections or doubts you may have.
C. Become a referred member, so you won't have any "maximum number of stories for the month" and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available.
D. Want to work with me? Check my rates and projects on Upwork!

If you want to ask me questions or start a collaboration, leave a message here or on Linkedin:

piero.paialunga@hotmail.com

Hands-On Optimization Using Genetic Algorithms, with Python

Here's a full guide on genetic algorithms, what they are, and how to use them.

Have you ever heard of this sales tactic?

"Are you tired of wasting hours during X? Why don’t you try Y!"

I'm sure you have. For example: "Don't spend hours writing your own code, use our software instead!". Or "Stop wasting hours on your ads, use AI instead!" (my YouTube recommendation algorithm loves that one, and it's probably my fault because I'm always talking about AI).

This sales tactic is based on the fact that the time that you spend on a problem that you want to solve can be optimized.

"Nerdly" speaking, when you read this sales tactic, you usually have two options: exploration or exploitation. What do I mean by that? I’ll explain in a moment. For now, remember the sales tactic, and put it in the back of your head. We’ll be back on it.

Let’s say you have a tool, a knob you can tune. By moving this knob, you get a waiting time. Imagine you are in line for a coffee at Starbucks and the knob has two states (0 and 1): 0 could mean "stay in the same line" and 1 could be "change the line". Which one means "waiting the least amount of time"? Should you change the line or be patient on the same one? I know you have had this problem at least once. I’m a big line changer, and my wife is a big line stayer.

Now, let’s consider a knob that can move continuously, meaning you have infinite states between 0 and 1 (imagine a whole lot of "lines" if you will). Let’s call this state x. Mathematically speaking, our problem has x (state) and y (waiting time):

Image made by author

Let’s say we have states 0.2, 0.4, 0.6 and 0.8. For those states, we found the corresponding waiting times.

Image made by author

Now, given these 4 points, the lowest waiting time is, let's say, at state x = 0.4. But are we sure that it is really the smallest waiting time? We don't know, but as we only have 4 points, it is highly unlikely that we found the smallest one, especially considering that the area between 0.0 and 0.2 and the area between 0.8 and 1 are completely unexplored. What can we do?

  1. We explore the other areas (between 0 and 0.2 and 0.8 and 1).
  2. We exploit the areas where we think our minimum is going to be (between 0.2 and 0.6)

Do you remember the example of the sales tactic? This is equivalent to doing this:

  1. We trust the sales guy, and we explore the unexplored area: we modify our previous behavior in order to see what happens in the unexplored part. For example, we buy the software they are trying to sell and give up with our coding.
  2. We do not trust the sales guy and we exploit what we know: we dig in and refine our estimate. We do not buy their software and we improve our own coding skills.
Image made by author

Now, it would be nice to do both, right? It would be nice not to trust the sales guy 100% and trust ourselves 0%, but rather to explore and exploit at the same time. For example, you pay 20% of what he’s telling you (for the hardest part that you don’t want to deal with), and you do the dirty work you want to deal with by yourself, saving money.

Sometimes the situation is messy, and the story is more complicated than a guy trying to sell software: this is where Genetic Algorithms (GAs) come in. GA is an optimization method that is usually very good at balancing exploration and exploitation. The roots of Genetic Algorithms are in biology, and I find this method extremely fascinating and powerful.

In this blogpost, we will do the following:

  1. We will very briefly define the problem that Genetic Algorithms try to solve (optimization and global optimum)
  2. We will describe the Genetic Algorithm from a theoretical point of view.
  3. We will apply the Genetic Algorithm to a multiminima case.

Ok, a lot to cover. Ready? Let’s get it started.

0. Optimization Idea

We did cover a little bit of the optimization idea in the introduction and there is no need to be super fancy with the math here. I’m just going to formalize the problem a little bit so that we will understand what we are talking about throughout the article.

Everything starts with your parameters. Going back to the knob example, let's say you don't have only one knob, but multiple ones. Let's say you have k of them. Those are your parameters. Your parameters form a vector (or a list, if you like computers), which is indicated with x:

Image made by author

For every x that you give me, I give you the corresponding loss (or cost). The loss is indicated with the letter L. The objective of the Genetic Algorithm is to minimize this loss, making it as small as possible.

Image made by author

Where X is your "space of solutions" represented by the whole boundaries for parameter 1, parameter 2,…, parameter k.

Good. That is as far as I’m going to get with the small talk. Let’s talk about GA specifically now.

1. Genetic Algorithm Theory

1.1 Defining a population

Now, do you remember that I promised you that we were going to explore and exploit? The first thing to do to explore is to populate X with m elements. This makes sure that we start with an equal opportunity to "see" (or explore) everything without missing possible optimum "spots".

The idea of Genetic Algorithms comes from biology, so you will see a lot of biological terms. The original random sample is known as the "Population". The Population will evolve over time (t), so we will define it as P(t). The Population has m elements, which are called chromosomes.

So this is our first population:

Image made by author

As we said, the elements of P(0) are randomly sampled from the k dimensional parameter space.

If you noticed that our parameter vector x became a function, it is because the vector will evolve with time as we look for the optimal solution.

1.2 Selection process

So this is our first list of candidates. From this list, we can already see which ones are better and which ones are worse, right? For example, if m = 50, we might have L(x(0)_30) < L(x(0)_20), meaning that the 30th chromosome has a lower loss than the 20th chromosome.

This is our start. How do we improve? The first thing is to select the best ones (i.e. the ones that are more likely to be the minimum). The "best ones" in math mean "the ones that maximize the fitness function". In the case of minimization, the fitness function is defined as f(x) = -L(x).

So again, at time 0, the fitness function will give us the following values:

Image made by author

Now, how do we select "the best ones" using this fitness function? We could just pick "the ones with the largest fitness values", which is very intuitive. Nonetheless, one of the best ways to do it is "roulette selection", which means extracting chromosomes with a probability proportional to their fitness. This means that the probability of extracting the ith chromosome at time t = 0 is:

Image made by author

This method is robust: if you only have 1 chromosome, the probability is 1 because you don’t have any other choice.

So, based on the probability defined above, we start selecting k < m chromosomes, which we will call parents for reasons you will understand in a second. These parents are randomly picked, but they are most likely elements with good fitness values because of how we defined the probability.
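As a small sketch of this selection step in numpy (the shift to make the weights positive is my own assumption, since f(x) = -L(x) can be negative):

import numpy as np

def roulette_selection(population, losses, n_parents, rng=np.random.default_rng()):
    # population: (m, k) array of chromosomes, losses: (m,) array of L(x)
    fitness = -np.asarray(losses)                # f(x) = -L(x)
    fitness = fitness - fitness.min() + 1e-9     # make every weight positive
    probs = fitness / fitness.sum()              # selection probabilities
    idx = rng.choice(len(population), size=n_parents, replace=False, p=probs)
    return population[idx]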

1.3 Parents And Children

Now, Genetic Algorithms are part of the evolutionary algorithms family, meaning that step t is the evolution (improvement, if you will) of step t-1. So how do we evolve? Unfortunately, it's not like Pokémon: you actually need at least two "parents" to generate a "child", also known as the "offspring". The "parent" candidates are the k candidates that we selected, for example, using the roulette selection described above. You need to generate the same number of offspring as the original population (m).

You can select the pairs of parents in multiple ways. You could take them all randomly, or start with the first parent and "pair" it with all the other parents, then do the same with the second parent and the remaining parents, and keep going until you reach m offspring.

But how do we combine two parents? We do so by applying a technique called crossover. We select the index k, then we mix the two starting from that index, just like you see in this scheme below:

Image made by author
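In code, a single-point crossover on two parent vectors could look like this (a minimal sketch; PyGAD implements this and other crossover strategies for you later on):

import numpy as np

def single_point_crossover(p, q, rng=np.random.default_rng()):
    cut = rng.integers(1, len(p))                # the crossover index
    return np.concatenate([p[:cut], q[cut:]])    # head of p + tail of q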

This idea has two principles:

  • Trying to get the best out of both parents: we aim to take the qualities that make parent P a good-fitness chromosome and the qualities that make parent Q a good-fitness chromosome and mix them.
  • Exploring new regions: geometrically speaking, by doing so you are actively exploring new areas.

This is the core of exploration vs exploitation, and that is why evolutionary methods are historically known to have a good tradeoff between these two desired effects.

By doing so m times we will have m offspring. These offspring are the new population, and we'll call this new set of offspring at time t:

Image made by author

So, let’s recap what we did so far:

  1. We selected our sample randomly
  2. We sampled a list of parent candidates using the fitness function as a probability indicator
  3. We selected the parents from the list of candidates and paired them
  4. We generated an offspring from each pair, thus enhancing the exploration and exploitation of the minima.

Beautiful. We are almost there, I promise.

1.4 Mutation

Now the next step is to mutate our offspring. What I mean by that is that we want to add some sort of randomness, meaning that we don't necessarily want x(t) to be just a mix of x(t-1); we want it to be different. If you think about it, in life you are not just the result of a genetic mix-up, but you are also the result of randomness: how the things that happen around you change you.

Philosophy aside, this is how you add mutation:

Image made by author
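Continuing the sketch above, a random mutation could be as simple as this (the Gaussian noise and the 10% rate are illustrative choices, not the article's exact formula):

def mutate(child, rate=0.1, scale=0.05, rng=np.random.default_rng()):
    mask = rng.random(len(child)) < rate                   # which genes mutate
    return child + mask * rng.normal(0, scale, len(child))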

This allows you to explore different realizations of x but, at the same time, you are also exploiting the surroundings of your best maximum (or minimum). Remember that this will give us m elements, because X(t) is a matrix with m rows, so your population is ready for step t+1.

1.5 Stopping Criterion

When do we stop this process? There are multiple ways to do so. The first method is just to set the number of iterations. For example, tmax = 10: after 10 refinements, you stop (at P(10)). Or you can use the fitness as a reference: when the fitness doesn't increase enough from time t to time t+1, or when it reaches a certain threshold, you stop optimizing. Whatever your criterion is:

  • If the criterion is not met, you redo step 1.2–1.4
  • If the criterion is met, you stop.

2. Genetic Algorithm implementation

Now, this process is fairly simple in practice: select the population randomly, improve it by crossover and mutation, stop if your stopping criterion is met.

Surprisingly, it is even simpler to implement. Let's see it step by step.

2.1 The library

PyGAD is a library that gives you full control of the parameters (population size, number of parents, stopping criterion and much more), but it is also extremely easy to use. Let me be fair: the math of Genetic Algorithms is not complicated and you could develop it from scratch using object-oriented programming. Nonetheless, for the sake of simplicity and efficiency we’ll use the PyGAD implementation.

Let’s install it using:

pip install pygad

This library offers very cool integrations too; for example, it can train Neural Networks using GA, which is very interesting.

2.2 The parameters

Tuning Genetic Algorithms to your needs (the so-called hyperparameter tuning process) is an art in itself. There are so many things you can change to find a minimum that is lower than the one you just found and improve your results. The list of all the possible parameters can be found here. Let me just give you an idea:

  1. fitness_func and gene_space: these are super important, as they are the fitness function (what you define as the "best solution") and the boundaries of the space you are searching.
  2. keep_parents: a cool one; it gives you the option to keep the parents in population P(t+1), so you don't only have the offspring, but you also keep the parents for the next iteration.
  3. parent_selection_type: there are multiple ways to select the parents out of a population. Above, we described "roulette wheel selection", but there are methods like "steady-state selection" or "stochastic universal selection" too. This parameter allows you to change the method.
  4. mutation_type: there are multiple ways to "mutate" your population. We discussed "random", but there is also "swap" or "inversion".
  5. crossover_type: There are also multiple ways to do the "crossover" operation. We described the "single point" (we called it k) but you can also apply "two-point", or "uniform" crossover strategies for combining genes.

Again, this is far from being the full list as there are so many parameters you can tune. But this is a good thing! Tuning the parameters is how you tune your GA to find the best optimum and converge to the global optimum.

2.3 The code

This whole story can be summarized in a single block of code. I’m going to give you the code and then we are going to discuss it.
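Since the original block isn't printed here, a sketch of what it could look like follows. The loss is a stand-in 2D multiminima function (the article's exact function differs), and the fitness signature follows PyGAD 3.x (older versions drop the first ga_instance argument).

import numpy as np
import pygad

def loss(x):
    # A toy function with several local minima; global minimum at the origin
    return np.sum(x**2) + 10 * np.sum(1 - np.cos(2 * np.pi * x))

def fitness_func(ga_instance, solution, solution_idx):
    return -loss(np.array(solution))              # fitness = minus the loss

ga_instance = pygad.GA(
    num_generations=100,                          # how many refinement loops
    num_parents_mating=10,                        # parents selected per generation
    sol_per_pop=50,                               # population size "m"
    num_genes=2,                                  # dimensionality of the input
    gene_space={"low": -5.0, "high": 5.0},        # boundaries of the search space
    fitness_func=fitness_func,
    parent_selection_type="rws",                  # roulette wheel selection
    crossover_type="single_point",
    mutation_type="random",
    keep_parents=2,
    save_solutions=True,                          # keep every visited solution
)

ga_instance.run()
best_solution, best_fitness, _ = ga_instance.best_solution()
print(best_solution, best_fitness)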

This is what we did:

  • Defined our fitness function, that is, "minus the function we are trying to optimize". If the function you are trying to optimize is a black box (maybe it comes from a complex model or something), the fitness is still minus that function, by definition.
  • Set the parameters, like "m" (population size), "num_generations" (how many refinement loops you do), or "num_genes", which is the dimensionality of your input.
  • We gave all these parameters as input to the GA instance.
  • We ran the optimization by calling ga_instance.run()

2.4 The application

I don't want to focus on the results too much, but just for the sake of showing you that this works, I'm going to show you what they look like. If you run the previous code successfully, ga_instance.solutions will be a matrix of size k x N, where:

  • k is the dimensionality of the input
  • N is the number of iterations

We can plot this matrix to keep track of how our Genetic Algorithm is optimizing our function:
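A sketch of that plot (assuming the GA above was run with save_solutions=True and num_genes=2):

import matplotlib.pyplot as plt

solutions = np.array(ga_instance.solutions)       # every visited (x, y) pair
plt.scatter(solutions[:, 0], solutions[:, 1],
            c=np.arange(len(solutions)), cmap="viridis", s=10)
plt.colorbar(label="Visited order")
plt.xlabel("x")
plt.ylabel("y")
plt.show()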

In this example, we were trying to minimize a bizarre function of x and y in 2D and we can see how the GA is correctly guessing the minimum area (even with multiple local minima).

But again, the point of this blog post is to describe how to use GA and how to use PyGAD for your own optimization task, so absolutely feel free to change the fitness/loss function and all the other parameters.

3. Conclusions

Thank you for being with me, I hope this article was a good optimization of your time 🙂

In this article, we explored how Genetic algorithms work, from theory to practice. These are the highlights:

  1. We talked about local vs global optimum and the exploration vs exploitation problem
  2. We described the Genetic Algorithm in theory, step by step.
  3. We talked about the PyGAD implementation of the Genetic Algorithm, highlighting how important it is to set the vast range of parameters that this library offers (as there are MULTIPLE genetic algorithm variations)
  4. We showed the results on a toy multiminima example, highlighting how this method correctly migrates towards the area of the lowest minimum (the global minimum)

4. About me!

Thank you again for your time. It means a lot ❤

My name is Piero Paialunga and I’m this guy here:

Image made by author

I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department and a Machine Learning Engineer for Gen Nine. I talk about AI and Machine Learning in my blog posts and on LinkedIn. If you liked the article and want to know more about machine learning and follow my studies, you can:

A. Follow me on LinkedIn, where I publish all my stories
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me to receive all the corrections or doubts you may have.
C. Become a referred member, so you won't have any "maximum number of stories for the month" and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available.
D. Want to work with me? Check my rates and projects on Upwork!

If you want to ask me questions or start a collaboration, leave a message here or on Linkedin:

piero.paialunga@hotmail.com

Hands-On Numerical Derivative with Python, from Zero to Hero

Here's everything you need to know (beyond the standard definition) to master the numerical derivative world.

There is a legendary statement that you can find in at least one lab at every university and it goes like this:

Theory is when you know everything but nothing works. Practice is when everything works but no one knows why. In this lab, we combine theory and practice: nothing works and nobody knows why

I find this sentence so relatable in the Data Science world. I say this because data science starts as a mathematical problem (theory): you need to minimize a loss function. Nonetheless, when you get to real life (experiment/lab) things start to get very messy and your perfect theoretical world assumptions might not work anymore (they never do), and you don’t know why.

For example, take the concept of derivative. Everybody who deals with complex concepts of data science knows (or, even better, MUST know) what a derivative is. But then how do you apply the elegant and theoretical concept of derivative in real life, on a noisy signal, where you don’t have the analytic equation?

In this blog post, I would like to show how a simple concept like the derivative can hide a lot of complexity when dealing with real life. We will do this in the following order:

  1. A very brief introduction to the concept of derivative, symbolic derivative, and numeric derivative
  2. Implementation of the numerical derivative concept and its limitations
  3. Refinements of the numerical derivative to overcome the simple implementation limitations.

Let’s get into this!

1. A theoretical introduction of Derivatives

I am sure that the concept of derivative is familiar to many of you and I don’t want to waste your time: I’ll be very quick on the theoretical idea and then we’ll focus on the numerical part.

So, the derivative is the slope of the tangent line of a curve. For example, the derivative is the slope of the orange line at point x = 2 for the curve y = x² .

Image made by author

Now, a large positive derivative means that the function is rapidly increasing at that point: if you increase x, then you are going to increase y. It is all summarized in this powerful definition:

f'(x) = lim (h → 0) [f(x + h) − f(x)] / h

That’s it. That’s the last theoretical bit of this article, I promise. Let’s do the real thing now.

2. Introduction to the numerical implementation

When we have the symbolic representation of an equation, everything is very simple. For example, we do know (from theory) that:

d(x²)/dx = 2x

But let’s say we do not have the symbolic equation. How do we compute the derivative?

Well, in real life (with a numeric signal) you don’t have the luxury to take "h that tends to 0". The only thing you have is:

  • A signal: that is a list of values.
  • The "time" axis: another list of values

Surely neither of these two (the signal or the time axis) is continuous. So your only hope is to do something like this:

def dydx(y, x, i):
    # Forward difference between consecutive samples
    return (y[i+1] - y[i]) / (x[i+1] - x[i])

Meaning that you can’t really do "less than 1 step" in the discrete space. So you might think: "ok then, let’s do that! Let’s consider h as 1 and move 1 step at a time"

The problem is that when the signal is noisy, the derivative might be incredibly large, not because the signal is really increasing at that point, but just because it happens to increase because of the noise.

Think of an up-and-down signal like this:

Image made by author

The derivative is extremely unstable but the signal is not increasing or decreasing anywhere. It is changing locally, but it’s just a random variation of the noise that we are not interested in.

So what do we do in those cases? Let’s see it together:

2.1 The simple implementation and its limitations:

Let's explore a real case, and let's do it in Python. For this blog post you don't need a lot: I used sklearn, scipy, numpy, and Matplotlib. They might all be in the Anaconda package (it's hard to keep track of it), but if not, a simple pip install would do the trick.

Anyways, let’s import them:
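These are the imports used in this section (if anything is missing, pip install scikit-learn scipy numpy matplotlib covers it):

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
from sklearn.linear_model import LinearRegression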

Now let’s consider a very very simple quadratic signal.

The numerical derivative can be done using numpy and:

np.diff(y)/step

Where "step" would be "h" and, in this case, is the distance between the second and the first x. So if we do "theoretical vs numerical" we get something like this:

So it works well, let’s wrap up the article (I’m kidding)

The problem comes when the signal is not a fairy-tale signal but one that comes from the real world. When the data come from the real world, they will necessarily have noise. For example, let's consider the same signal as before, but let's make it a tiny bit noisier.

This is the slightly noisy signal:
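A sketch of the noisy version: the same quadratic signal plus a small amount of Gaussian noise (the noise level is an illustrative choice):

rng = np.random.default_rng(0)
y_noisy = x**2 + rng.normal(0, 0.05, len(x))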

It’s noisy but it’s not THAT noisy, right? HOWEVER, the derivative is completely messed up by this very small noise.

So how are we going to fix this? Let’s see some of the solutions.

3. Derivative Refinement

3.1 Cleaning the signal

The first thing we could do (the easiest) is to smooth the original signal: if we know our signal is noisy, we should try to reduce the noise before doing anything else.

My favorite way to reduce noise is the Savitzky-Golay filter (I talk about it in another article; you can find it here). Let's smooth the signal and see what it looks like:
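A sketch of that smoothing step, and of the derivative applied to the smoothed signal as described just below (the window length, which must be odd, and the polynomial order are illustrative choices):

y_smooth = savgol_filter(y_noisy, window_length=51, polyorder=3)
derivative_smooth = np.diff(y_smooth) / step     # same one-step derivative, now on the smoothed signal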

Now we can do the derivative on the smoothed signal:

MUCH better. Nonetheless, this is not really a derivative refinement. This is more like applying the same concept to a smoothed signal. A little bit of cheating, but it’s a first step. Let’s do some more refinement.

3.2 Custom step size

From a theoretical point of view, the smaller h is, the better. Nonetheless, if the signal is noisy, we might consider h as a parameter. So we can tune this parameter and see how the derivative looks.

For example, if the x axis is between -2 (seconds for example) and 2 seconds with step = 0.004 seconds, then we might think that we can compute the derivative after 0.4 seconds (100 points) rather than after 0.004 seconds (1 point).

In order to do that we have to define our own numerical derivative function. Just like this*:
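The original function isn't reproduced here, but a sketch consistent with the description (including the end-of-signal check explained in the note below) could be:

def windowed_derivative(y, x, window=100):
    """Derivative computed over 'window' points instead of a single step."""
    x_out, dydx = [], []
    for i in range(len(y) - 1):
        j = min(i + window, len(y) - 1)
        # stop when fewer than half the points of a full window are left
        if j - i < window // 2:
            break
        dydx.append((y[j] - y[i]) / (x[j] - x[i]))
        x_out.append(x[i])
    return np.array(x_out), np.array(dydx)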

*We need to be careful because, at the end of the signal, we no longer have 100 points ahead of us but maybe 20 or 30. So I added a little check: if fewer than half of the points of a full window are left, we stop computing the derivative.

Let’s test this method again:

And let’s modify the definition of the derivative.

So our "windowed" version of the derivative is way better than the derivative on the smoothed signal (first refinement).

3.3 Custom step size + Linear Regression

Now, modifying the step size helps, but it might not be enough. Let’s consider this case.

Image made by author

If we want to compute the derivative of f(x) at the point marked by the blue cross, we have, so far, two alternatives:

  1. Green line: we just trust the next point and draw the tangent line
  2. Black line: we use the windowed approach we proposed

As we can see, the windowed approach is similar to the orange (real) ground truth, but it is not quite there. This is because, if the noise persists, we will still have a little bit of noise even after k steps. So how can we fix that?

Well, instead of considering the difference between blue and red crosses only, and then normalizing it with the x difference, we can run the Linear Regression algorithm between ALL the points between blue and red cross.

By applying the linear regression we are actively considering the tangent line (line that passes between blue and red point) but we are also incorporating all the other points in the middle. This means that we don’t just blindly trust the red cross, but we consider everything that is in between, thus averaging on the noise. Of course, once we do the linear regression, the corresponding "derivative" is the coefficient (or slope) of the fitted line. If you think about it, at the extreme limit, if you only have two points the slope of the fitted linear regression is exactly the definition of derivative.

So in a few words: we keep doing our derivative by "windowing" BUT we apply the linear regression method as well.

The effect of this (spoiler) is to efficiently smooth the derivative of the signal, as we expected.

I really like this estimation of the derivative, so let's test it with a hardcore case, i.e. a very noisy signal:

So we are going to test the two methods:

  1. Windowed Derivative on the Filtered Signal
  2. Windowed Linear Regression Derivative on the Filtered Signal

First let’s implement number 2:

These are the implemented methods:

As we can see, the orange and red estimates of the derivative are similar, but the red estimate is smoother, making the derivative less noisy. So we can consider this refinement better than the previous one.

4. Conclusions

There we are, many derivatives versions later 🙂

In this blogpost, we talked about derivatives. In particular, we did the following:

  1. We described the theoretical definition of the derivatives, with a very brief example on a quadratic signal.
  2. We defined the standard numerical definition of the derivative on a simple quadratic signal and on a noisy one. We saw that the estimate of the derivative on a noisy signal is extremely poor as it tries to "model the noise" rather than the ground truth of the signal.
  3. We improved the standard numerical derivative in three ways: by smoothing the signal (first method), by smoothing the signal and "windowing" the derivative (second method), by smoothing the signal, "windowing" the derivative, and applying a linear regression algorithm (third method).
  4. We showed how each refinement improves the previous ones, eventually getting to an efficient definition of derivatives even on noisy signals.

I really hope you appreciated this blog post and you found this interesting. ❤

5. About me!

Thank you again for your time. It means a lot.

My name is Piero Paialunga and I’m this guy here:

Image made by author

I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department and a Machine Learning Engineer for Gen Nine. I talk about AI, and Machine Learning in my blog posts and on Linkedin. If you liked the article and want to know more about machine learning and follow my studies you can:

A. Follow me on Linkedin, where I publish all my stories B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me to receive all the corrections or doubts you may have. C. Become a referred member, so you won’t have any "maximum number of stories for the month" and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available. D. Want to work with me? Check my rates and projects on Upwork!

If you want to ask me questions or start a collaboration, leave a message here or on Linkedin:

piero.paialunga@hotmail.com

The post Hands-On Numerical Derivative with Python, from Zero to Hero appeared first on Towards Data Science.

]]>
Applications of Rolling Windows for Time Series, with Python https://towardsdatascience.com/applications-of-rolling-windows-for-time-series-with-python-1a4bbe44901d/ Sun, 15 Sep 2024 14:02:02 +0000 https://towardsdatascience.com/applications-of-rolling-windows-for-time-series-with-python-1a4bbe44901d/ Here's some powerful applications of Rolling Windows and Time Series

The post Applications of Rolling Windows for Time Series, with Python appeared first on Towards Data Science.

]]>
Last night I was doing laundry with my wife. We have this non-verbal agreement (it becomes pretty verbal when I break it though) about laundry: she is the one who puts the laundry in the washer and drier and I am the one who folds it.

The way we do this is usually like this:

Image made by author using DALLE

Now, I don’t really fold all the clothes and put them away. Otherwise, I would be swimming in clothes. What I do is an approach that reminds me of the rolling window method:

Image made by author using DALLE

Why do I say that it reminds me of a rolling window? Let’s see the analogy.

Image made by author using DALLE

The idea of rolling windows is exactly the one that I apply when folding laundry. I have a task to do, but I don't do it all at once, because it would be impractical and not useful. Instead, I do it on a small portion of "data", store the "result", and then move to the next section of "data".

This idea looks very simple, but there are SO MANY things you can do with Rolling Windows, as this very simple approach is also incredibly powerful.

In this blog post, I want to describe very briefly what rolling windows are on a technical level, then show a few powerful applications of rolling windows on some specific tasks that are often required when dealing with a signal.

We will do this in the following order:

  1. A technical introduction about the rolling windows idea
  2. We’ll use rolling windows as feature extractors (first use case)
  3. We’ll use rolling windows as a smoother/noise reducer (second use case)
  4. We’ll use rolling windows to extract peaks and valleys of a signal (third use case)
  5. We’ll use rolling windows to perform a Fourier Transform (fourth use case)

A lot to cover, better get started! 🙂

1. About Rolling Windows

First off, our Avatar World for this example is signals.

Let’s define our signal: the signal is a discrete object y[t] with t that is our index and can go from 0 to len(signal)-1 (t≥0 and t≤len(y)-1). For every position t, the rolling window covers the segment y[t],y[t+1],…,y[t+n−1]. For example, the first window is:

Image made by author

Beautiful, we defined the window. Now we got to do something with it.

The result of the application of the rolling window is:

Image made by author

Now, who is O and who is M? That depends on what you are actually doing with your rolling window. Are you doing a rolling mean? If that is the case, M_0 is a single point and O(w_0) is just the average of all the points from 0 to n-1. If you are doing a Short Time Fourier Transform then, for a given window, you will have the whole vector of frequencies. So the dimension of M_0 is much larger than 1 and O is the Short Time Fourier Transform operation (more about this later).

Whatever your case study might be, M_0 will be your first entry for your rolling window result.

What is the next step? We identify the next window by moving s steps (stride) and we perform the same operation. For example, if stride s = 1 (maximum overlap), we get:

Image made by author

If stride = n (maximum stride) then we have:

Image made by author

And M_1 = O(w_1).

I know it might sound very confusing, but that’s ok: we will cover everything in the next chapters. The only thing I want you to keep in mind is this idea that we select a window of our signal, process it, extract the result, and move to the next window. Just like this:

Image made by author

We’ll learn by doing. Let’s start with the first application!

2. Rolling Windows as Feature Extractors

Imagine that you don’t have one signal, but you have multiple ones. Every signal can come from source 1,2,3,…,k.

Signals coming from the same source have similar averages, standard deviations, or peak-to-peak distances. Signals coming from two different sources have bits of data that are completely different.

For example, a signal coming from source 1 and a signal coming from source k have completely different averages in the first bit of the signal and similar averages in the last bit of the signal.

In this case, it is very useful to implement a simple Machine Learning classification algorithm (e.g. Logistic Regression, Random Forest, SVC,…). Nonetheless, it is a waste to apply those algorithms to the whole signal. It is much simpler to do so once you extract the features using a rolling window. What I mean by that is that you take your signal, you run it through a rolling window that extracts statistical features (e.g. mean, std, peak to peak distance) for every given window, and you use those features in place of the long signal. This can drastically reduce the number of points you need to consider to perform your ML task while preserving all the information you need to maintain a high level of performance.

For example, let’s consider the case where we have a quadratic signal (y=x²) with a middle random disturbance and a high frequency disturbance:

This is a signal of 500 points. We can drastically reduce the information by applying this rolling window:

Here I’m computing the mean (average value), standard deviation (dispersion around the mean), peak to peak (max-min) and energy (complexity of the window), but it doesn’t have to be like this! You can consider the mean only, for example, or actually add more features. You can also change the sride and the window size. This is up to you. Consider that, as a smart person once said "the model needs to be as complex as needed but not more complex than that", so don’t add too many features if you don’t actually need them. A small window size considers more granular information, a larger window size considers more gross features. Consider adjusting that as well based on your needs.

If we want to apply this to multiple signals, the code is extremely easy and it is the following:
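Something along these lines, assuming the rolling_features sketch above and a hypothetical collection of signals called raw_signals:

signal_list = [np.array(s) for s in raw_signals]
all_features = [rolling_features(sig, window=50, stride=25) for sig in signal_list]
print(all_features[0][0]["mean"])   # mean of the first window of the first signal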

So, for example, we have that, for the first signal, the mean for the first window is 7.26, the mean for the second window is 15.19, and so on and so forth. This method is extremely helpful when we want the gross features of a signal, without getting lost in its continuous behavior.

3. Smoothing Rolling Window

The method I’m going to show here is known as Savitzky-Golay filter. It’s a method I really love because it is simple, extremely elegant, and very helpful for cases where a smoothing of a signal is required.

This method uses a rolling window and fits a polynomial within every window. For example, if a window covers the points from 0 to 100, you do a polynomial fit of that chunk, replace the values with the ones given by the fit on that specific window, and then you go to the next window.

The math can be a little tricky (not too much, I’d say an evening of pen and paper) but the implementation is very easy, so let’s dive in.

Let’s consider our little quadratic signal with noise added in:

Now our rolling window is literally one line of code. This:

savgol_filter(combined_signal, 5, 2)

So that’s the application:

As we can see, the rolling window has almost completely zeroed the noise. The blue line is almost identical to the quadratic dependency we put in, and the Noise is almost identical to signal-filtered_signal.

I would say that when you have a signal to clean and you can anticipate that the problem is in the higher frequency, your first shot (and probably last one too, as it works very well) should most likely be the Savitzky-Golay filter.

4. Rolling windows for Peak Detection

Now imagine you want to detect peaks as a human traveling along the signal (I know, it's weird, you have to trust me on this a little). You start "walking" and see the first value of the signal: y[0], and you put it in your backpack. Then you see the second value: y[1], and you put it in your backpack. You keep going like this up to the window size: let's say 100. At this point, in your backpack you have a lot of history. You have 100 points, and you can use them to compute the mean and std. Is point 101 reasonable, given that mean and std? If yes, then it's not a peak. If not, then it is a peak!

Then you go to point 102, you consider the mean and std from point 1 (not 0!!!) to 101 as reference, and classify that point.

Now, if I find a peak, do I consider that in the mean and std? We can set that as a parameter, called influence. You can ignore the peak completely (influence=0) or include all of it (influence=1).

I implemented this method from scratch (so we have more control over it). I wrapped everything in an object called RealTimePeakDetector().
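The full class isn't reported here, but a minimal sketch consistent with the description above could look like this (the lag and influence argument names are my assumptions; the signal argument is kept only to match the constructor call shown below and is not used in this sketch):

import numpy as np

class RealTimePeakDetector:
    def __init__(self, signal, threshold=2.5, lag=100, influence=0.0):
        self.threshold = threshold
        self.lag = lag                 # how many past points go in the "backpack"
        self.influence = influence     # how much a peak contaminates the statistics
        self.history = []

    def detect_peak(self, value):
        # not enough history yet: just store the value, no decision
        if len(self.history) < self.lag:
            self.history.append(value)
            return 0
        window = self.history[-self.lag:]
        mean, std = np.mean(window), np.std(window)
        if value > mean + self.threshold * std:
            label = 1                  # positive peak
        elif value < mean - self.threshold * std:
            label = -1                 # negative peak
        else:
            label = 0
        if label != 0:
            # blend the peak with the previous value according to 'influence'
            value = self.influence * value + (1 - self.influence) * self.history[-1]
        self.history.append(value)
        return label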

The signal we consider is the sin(2pi t) + noise. Let’s add 4 peaks in random positions. Once we define it we have:

detector = RealTimePeakDetector(signal,threshold=2.5)

With threshold = 2.5, which is the number of standard deviations we use to classify a new point as a peak. If the point is:

point > mean + threshold*std -> positive peak
point < mean - threshold*std -> negative peak

We run throughout the signal and detect the peaks using this:

signals = []
for i in range(len(signal)):
    signals.append(detector.detect_peak(signal[i]))

We can run everything in a single block of code:
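A sketch of the full pipeline (signal definition, peak injection and detection), assuming the RealTimePeakDetector sketch above; the peak amplitude and positions are placeholders:

t = np.linspace(0, 4, 1000)
signal = np.sin(2 * np.pi * t) + np.random.normal(0, 0.1, len(t))
peak_positions = np.random.choice(len(t), size=4, replace=False)
signal[peak_positions] += 3          # inject 4 artificial peaks

detector = RealTimePeakDetector(signal, threshold=2.5)
signals = [detector.detect_peak(v) for v in signal]   # 1, -1 or 0 for every point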

There is a whole literature about peak detection, so this method is obviously not going to be an evergreen that can be used every time. Nonetheless, using a rolling window for peak detection can be a simple and effective solution in many real-world applications. I would try this before doing a super complicated Encoder Decoder, as a valuable and simple baseline.

5. Rolling window for Fourier Transform

This is probably the most delicate application and the hardest one to explain in a few lines, so be gentle with me ❤️‍🩹.

The idea of the Fourier Transform (FFT) is to decompose the signal into its sine and cosine components. This method transforms a time-domain signal into the frequency domain. The FFT is used for multiple tasks like understanding the seasonal components of a signal, cleaning it from noise, or compressing audio or image data.

Nonetheless, this method has two notorious problems:

  1. This method is meant to be applied to perfectly periodic signals. So if a signal is periodic, the FFT is perfectly able to distinguish its components. In real life, most signals are not perfectly periodic and this method suffers from some leakage.
  2. The FFT assumes that all frequency components exist at all times (throughout the entire signal). But think about it: maybe sometimes you work out 3 times a week (before summer for example, we are all guilty), sometimes you work out once a week. Frequency, in real life, changes with time.

The Short Time Fourier Transform (STFT) is very helpful because it uses a rolling window to consider bits of the signal, imposes periodicity using a Hann window (or similar), and allows you to compute the Fourier Transform at different time steps.

The result of an STFT is an image that gives you, for every time and frequency, the amplitude of the corresponding component.

Let’s do an example; let’s consider a signal that has a frequency from step = 0 to step = 1000 and a much higher frequency from step = 1000 to step = 2000.

Kind of like this:

If we run the Short Time Fourier transform we can clearly see the time when the frequency changes.
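A sketch of this example using scipy's stft (the two frequencies, the sampling rate and the window length are my choices):

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import stft

fs = 1000                                   # assumed sampling frequency
t = np.arange(2000) / fs
y = np.where(t < 1.0,
             np.sin(2 * np.pi * 5 * t),     # low frequency in the first half
             np.sin(2 * np.pi * 50 * t))    # higher frequency in the second half

f, tau, Zxx = stft(y, fs=fs, nperseg=256)   # Hann window by default
plt.pcolormesh(tau, f, np.abs(Zxx), shading="gouraud")
plt.ylabel("Frequency [Hz]")
plt.xlabel("Time [s]")
plt.show()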

When you consider a signal with multiple components that can appear at different moments of your long signal, it is a good idea to see when the frequency content changes with time.

6. Conclusions

Thank you very much for rolling with me throughout the article 🙂 it means a lot to me ❤

In this blogpost we did the following:

  1. We introduced the rolling window idea with me doing laundry. We talked about the idea of isolating a part of your signal, processing it, and storing the result.
  2. We used rolling windows to extract features. This can be used to prepare the signal before your Machine Learning algorithm.
  3. We showed how to use rolling windows to smooth a signal: you can combine a rolling window with a polynomial fit approach. This is called the Savitzky-Golay filter.
  4. We used rolling windows to detect the peaks of a signal. We ran through the signal and used the mean and standard deviation to detect positive and negative peaks. We did this with our own custom object.
  5. We applied rolling windows to detect when a certain frequency appears in a signal. This is a refinement of the Fourier Transform and can be used when the frequencies vary over time.

7. About me!

Thank you again for your time. It means a lot ❤

My name is Piero Paialunga and I’m this guy here:

Image made by author

I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department and a Machine Learning Engineer for Gen Nine. I talk about AI, and Machine Learning in my blog posts and on Linkedin. If you liked the article and want to know more about machine learning and follow my studies you can:

A. Follow me on Linkedin, where I publish all my stories B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me to receive all the corrections or doubts you may have. C. Become a referred member, so you won’t have any "maximum number of stories for the month" and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available. D. Want to work with me? Check my rates and projects on Upwork!

If you want to ask me questions or start a collaboration, leave a message here or on Linkedin:

piero.paialunga@hotmail.com

The post Applications of Rolling Windows for Time Series, with Python appeared first on Towards Data Science.

]]>
From Theory to Practice with Particle Swarm Optimization, Using Python https://towardsdatascience.com/from-theory-to-practice-with-particle-swarm-optimization-using-python-5414bbe8feb6/ Sat, 07 Sep 2024 01:22:20 +0000 https://towardsdatascience.com/from-theory-to-practice-with-particle-swarm-optimization-using-python-5414bbe8feb6/ Here's a tutorial on what PSO is and how to use it

The post From Theory to Practice with Particle Swarm Optimization, Using Python appeared first on Towards Data Science.

]]>
Photo by James Wainscoat on Unsplash

There is a joke that cracks me up:

"Did you know that, before the clock was invented, people had to actively roam around and ask people the time?"

There is obviously no need to explain the joke, but if we were to overthink it a little bit (like good mathematicians do) we can say that the joke is about the fact that the information of a particle of a group can be used to inform all the other particles. This concept is actually way deeper than the joke I just said and can be exploited further.

Let’s consider a self-organized system, such as bird flocking or fish schooling. We can define this system as one made of particles (e.g. a particle is a bird). We can also assume with a good degree of approximation that these particles move around in space adjusting their positions based on two factors:

  • The best position that the specific particle knows: what the bird thinks is best for themselves.
  • The global best position that is given by all the particles "communicating" with each other: what the bird is instructed to do by the "main bird"

Now, what is "best" in Nature? What is the best thing to do for the bird? What is the best for the "swarm"? I’m absolutely not the right person to ask this, as I really have no idea. What I DO know is that by observing this kind of behavior in Nature, we are able to formalize a very fascinating optimization algorithm. In other words, if we do define what is best, then we can use this evolutionary approach to optimize the function that we selected.

This algorithm is known as Particle Swarm Optimization (PSO). I know, this is a pretty big leap. What is "optimization"? Why are we talking about math all of a sudden? What is that we are optimizing? Throughout this article, I will try to cover all these steps and more importantly, we will use object-based programming in Python to create our own ParticleSwarmOptimizer() class. In other words, we will cover the PSO world from theory to practice.

Let’s get started! 🐦🐦🐦

0. Introducing "Optimization"

I have the feeling that if you are reading about "Particle Swarm Optimization" maybe you already know a little bit about "optimization" and you don’t know about the "particle swarm" thing, so I don’t want to spend too much time on this and bore the readers.

What I do want to know is to give a little bit of a formalization to the problem. As we said earlier, mathematically, what is the "best thing to do for a swarm"?

Well, mathematically, the "birds of a swarm" (or "particles") are points in a K-dimensional space. If it is the real space, our "domain" can be the 3D space we live in, but it can indeed be much larger than that. Actually, this method gets interesting when we forget about the "bird" example and we consider the fact that we can use this method, in general, to find the minimum or maximum of a given function.

For example, let’s say that you have that the price of a house is automatically generated exclusively from an algorithm (veeery dangerous not to keep a person in the loop but anyway):

Image made by author

Now this doesn’t much make sense, right? What is the "House" Icon? We will have a list of features, for example, locations, size, etc…

Image made by author

Now this makes sense. The house is translated to the feature space. So each house is a k dimensional point. The "process" that converts the Feature space into a house price is a "black box" function:

The real game here is this:

"What is the x that I need to have to get the minimum cost?"

This is our optimization problem.

Image made by author

The end game of PSO will be the coordinates of x_min. Let’s see how we deal with it.

1. Introduction to Particle Swarm Optimization

Before we start, I want to say that it’s extremely helpful to look at the original paper. I would say that it’s one of the least-technical technical papers I have ever read. It is a very easy read and, at the same time, it is very informative and fun.

Alright, so, what is the idea behind this? As we said, the idea is that we are creating a set of particles that collaborate to find the global minimum by exchanging the information that they have. If you want, it is very similar to what happens to shows like "Stranger Things" where there is an enemy to beat and all the characters work together and talk with each other using all the means that they have.

More technically, we start with num_particles particles. These particles have random locations, and for each one of these locations the cost function L will have its value. For example, particle 1 will have location x_1 and cost L(x_1). This is iteration 0, and "after" this first iteration we have built our system. Now the system needs to evolve. How do we do this? With a quantity called "velocity". We say, that for the particle x, for each dimension (i) and each iteration (j), we have that:

Image made by author

Or in the more elegant vector form:

Image made by author

Now how do we define v? That is where the beauty of this method comes in. The vector v has basically three components:

Image made by author

Let’s explore them:

  • v_inertial is basically the memory of the previous velocity. The term comes from physics, where an inertial system is a frame of reference in which a body remains at rest or moves at a constant velocity.
Image made by author

where k_inertial is a constant.

  • v_cognitive is the velocity suggested by the particle itself (hence "cognitive"). It suggests that we should move in the direction of the best position that the particle has found so far. We also add a little bit of randomness r_1 (a random number):
Image made by author

k_cognitive is again a constant. x_best is the best position (lowest cost function) found so far by the x particle.

  • v_social gathers the information from all the other particles. We move in the direction that is best according to the whole swarm.
Image made by author

k_social is a constant, r_2 is a random value, and x_global_best is the best position (lowest cost function) found by all the particles.

Now, that is, I think, extremely elegant. Let’s wrap it up:

  • We start by selecting a certain number of random particles (i.e. particles in random locations)
  • We move these particles based on the velocity parameters
  • These velocities are given by three factors: an inertia, which keeps the memory of the previous velocities, a cognitive velocity, which keeps us moving in the direction that is the best for the particle, and a social velocity, which keeps us moving in the direction that is suggested by the whole group of particles.
  • We keep changing the location of these particles for every iteration until we get to num_iteration. Then we select our best option.

Ok, enough words for today. Let’s run some Python 🙂

2. PSO Implementation

We will use two classes to build our PSO. The first one is the class of a single particle, and the other class builds, given the single particle class as a builder, our swarm.

The first thing I want you to build is a constants.py file. This one keeps all the default info, like the boundaries of the particle, number of iterations, and all the k values. How do you change this? You need to create a .json file and save it as "config/pso_config.json". If you don't have one, the script I will provide in the classes will build it for you.

Now, everything we said in our code is inside a .py file that I called particle_swarm_optimizer.py

Let me describe it to you in a few words:

  • Particle() builds a single particle. Starts with random location and then has the update_velocity() and update_position() features to update the velocities and positions based on what we said above (you can check to make sure I’m not lying).
  • ParticleSwarmOptimizer() takes two objects: the Particle() class, used to build the swarm (you can make the algorithm more complex if you want by adding features to the particle or modifying the update methods), and objective_function, which is the black-box function that we want to minimize.

Now, the star of the show is ParticleSwarmOptimizer() of course. With optimize() we are going to run the optimization process. We can select if we want to print the results (iteration per iteration) and if we want the solution array (again, iteration per iteration). plot_pso_convergence() will help us in the plotting stage. If you set animated=True you will have the .GIF of the convergence, meaning that you will see the global minimum move, iteration per iteration (very cool stuff, you'll see it in a second).
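The full implementation lives in the repo; a condensed sketch of the two classes (with simplified constructor arguments and without the config/plotting machinery) could look like this:

import numpy as np

class Particle:
    def __init__(self, bounds):
        self.position = np.array([np.random.uniform(lo, hi) for lo, hi in bounds])
        self.velocity = np.zeros(len(bounds))
        self.best_position = self.position.copy()
        self.best_cost = np.inf

    def update_velocity(self, global_best, k_inertial=0.6, k_cognitive=1.5, k_social=1.5):
        r1, r2 = np.random.rand(2)
        self.velocity = (k_inertial * self.velocity
                         + k_cognitive * r1 * (self.best_position - self.position)
                         + k_social * r2 * (global_best - self.position))

    def update_position(self, bounds):
        self.position = self.position + self.velocity
        self.position = np.clip(self.position,
                                [b[0] for b in bounds], [b[1] for b in bounds])


class ParticleSwarmOptimizer:
    def __init__(self, particle_cls, objective_function, bounds, num_particles=30):
        self.objective_function = objective_function
        self.bounds = bounds
        self.particles = [particle_cls(bounds) for _ in range(num_particles)]
        self.global_best = self.particles[0].position.copy()
        self.global_best_cost = np.inf

    def optimize(self, num_iterations=100):
        for _ in range(num_iterations):
            # evaluate every particle and update personal/global bests
            for p in self.particles:
                cost = self.objective_function(p.position)
                if cost < p.best_cost:
                    p.best_cost, p.best_position = cost, p.position.copy()
                if cost < self.global_best_cost:
                    self.global_best_cost, self.global_best = cost, p.position.copy()
            # then move the swarm
            for p in self.particles:
                p.update_velocity(self.global_best)
                p.update_position(self.bounds)
        return self.global_best, self.global_best_cost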

3. Hands On Examples

We did all the dirty work before, now we can run everything in very few lines. If we select a very simple quadratic function:
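As a usage sketch (based on the simplified classes above; the objective here is a plain 2D quadratic, matching in spirit the formula shown right after):

def quadratic(x):
    return np.sum(x**2)   # global minimum at the origin

pso = ParticleSwarmOptimizer(Particle, quadratic, bounds=[(-5, 5), (-5, 5)])
best_position, best_cost = pso.optimize(num_iterations=100)
print(best_position, best_cost)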

Image made by author

This is the .GIF we generate:

Image made by author using code above

Now PSO does a very good job even with fairly complex objective functions like:

Image made by author

Running the same code above we get:

This is the .GIF:

Image made by author using code above

So we can see how our little PSO not only gets to the minimum in cases where there is one evident minimum, but also gets to the global minimum (we believe, as we can't tell analytically) even for cases that are much more complex and have multiple local minima.

4. Conclusions

Thank you very much for spending time with me in this optimization process. I hope I maximized your knowledge given the parameters of this being a Towards Data Science blog post. 🙂

In this article, we did the following:

  1. We described the idea of working on a problem by analyzing the information of multiple particles at once
  2. We described the "optimization" task with the very simple use cases of the house price minimization.
  3. We talked about Particle Swarm Optimization (PSO) in general terms and talked about the fascinating idea behind it
  4. We went into the equation world, and we saw how the algorithm evolves an initially random state for num_particles particles and gets to the best estimate of the global minimum
  5. We used object-based programming to implement PSO from scratch.
  6. We tested it on a very simple quadratic problem and on a much more complex multi-minima problem. In both cases, we saw very promising results!

5. About me!

Thank you again for your time. It means a lot ❤

My name is Piero Paialunga and I’m this guy here:

Image made by author

I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department and a Machine Learning Engineer for Gen Nine. I talk about AI, and Machine Learning in my blog posts and on Linkedin. If you liked the article and want to know more about machine learning and follow my studies you can:

A. Follow me on Linkedin, where I publish all my stories B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me to receive all the corrections or doubts you may have. C. Become a referred member, so you won’t have any "maximum number of stories for the month" and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available. D. Want to work with me? Check my rates and projects on Upwork!

If you want to ask me questions or start a collaboration, leave a message here or on Linkedin:

piero.paialunga@hotmail.com

The post From Theory to Practice with Particle Swarm Optimization, Using Python appeared first on Towards Data Science.

]]>
Hands-On Global Optimization Methods, with Python https://towardsdatascience.com/hands-on-global-optimization-methods-with-python-07bff0e584a9/ Wed, 04 Sep 2024 07:04:59 +0000 https://towardsdatascience.com/hands-on-global-optimization-methods-with-python-07bff0e584a9/ Four methods to find the maximum (or minimum) of your black box objective function

The post Hands-On Global Optimization Methods, with Python appeared first on Towards Data Science.

]]>
Imagine going out with your best friend. You decide to go to this place that you have never been to, but your best friend has.

Also, imagine that you are in a city with a bit of a traffic problem, like Rome (I would know as I’m from there and it’s painful), and consider that you and your best friend will meet there but leave the same house with two different cars.

You and your best friend leave the house at the same time, and as you have never been to the place you use the GPS to arrive there.

When you arrive there, you find your friend who is already ordering the drinks and tells you that he has been waiting for you for the last 10 minutes. Assuming that neither you nor your friend went above the speed limits, this story tells us two things:

  1. You should be mad at your best friend, as he could have told you the faster route instead of letting you use the GPS. Why didn't he call you anyway? 😡
  2. You just encountered a local minimum problem

During this article, we will focus on problem number 2, and I will let you deal with your best friend on your own.

Why are local minima relevant in this example?

Let’s take a step back.

Services like Google Maps (or Maps in general) already rely on very well-known optimization algorithms. You have to go from point A to point B and you want to minimize the time required to do so. This is exactly what we mean by "optimization": the process of optimizing an objective function given a set of parameters.

Everybody relies on Google Maps to go to places where we have never been, and we do so because we are fairly confident that Google Maps won't waste a lot of our time and will take us to the desired destination without major delays or problems. Nonetheless, when we do know the location, most of the time we are confident that we will get to the destination quicker if we take another route. That is because maybe we know that there is scheduled road work at a specific time of the day, because traffic will start popping up after we have left the house (and it won't give Google Maps time to recalculate), because there are a lot of speed traps over there and cars tend to stomp on their brakes and slow the whole car flow, and so on and so forth.

In other words, we think we have more data than Google Maps (and sometimes we do). As we have more data, we end up getting to a minimum that is lower than Google Maps'. But wait, didn't we say that Google Maps runs an optimization too? Yes, indeed! That's a case where we can say that their minimum is local. Is our minimum local too? Maybe, maybe not, possibly. Maybe it is another local minimum that is lower than Google Maps' local minimum. Or maybe that is indeed the fastest way to go ever (global minimum). What is certain is that the task of finding the global optimum, meaning the "lowest of all the minima" or the "largest of all the maxima", is a very fascinating exercise.

Hopefully, with this intro I gave you, you have enough interest to bear with me in this global optimization article. We will start by giving a formalization of the global optimization problem, and then we will find multiple ways (or algorithms) to reach the global optimum. In particular, we will list these methods from the simplest ones to the most complex ones.

Let’s get this started! 🚀

0. Formalizing the problem

So, global optimum. What do we mean by that?

Let’s start with something that may be slightly trivial but pretty crucial to say. All this stuff that we are saying (and have said) works and makes sense with black box functions. When I say that I mean that I give the function the input, there is a numerical process that happens in the background, and the function "spits" out the number. In other words, we have to assume that we don’t have any analytical estimate of what the function looks like.

Why is that? Well, because if I knew the analytical function I could get away with taking the derivative, seeing where it goes to 0, checking whether the second derivative has the sign "-" or "+" (max or min), and finding the optimum value.

For example, let’s consider the argmax (optimum = maximum) situation. We have this equation:

Image made by author

So, again, in a case where we know the analytical form of f(x) we get something like:

Image made by author

But in reality, we don’t have f'(x) because you don’t have the analytical expression of f(x) in the first place, so you can’t really get X_opt analytically.

But there is much more that adds to the problem. In reality, X is a continuous set*, so we can’t really "check them all", meaning that we can’t really see the value of f(x) for all possible Xs.

That’s an overview of the problems that we face when we do global optimization, and that is also why we use numerical methods to solve them. This means that we give up on the analytical definition of the derivatives and we try to approach the problem with algorithms. Of course, we are going to use the analytical expression to test it (invent the black box) but then we don’t use any anaytical operation for our optimization.

Do you buy it? I hope you buy it. Let’s start!

*Notice that X is a k-dimensional space. Multiple parameters can be considered in your problem.

1. Brute force method(s)

1.1 The idea

When you want to find the minimum of a dataset, the most reasonable thing to do, the thing that even a non technical person would suggest is this:

Let’s explore all the possibilities!

Now this is possible only for very limited cases. For example, we can do it if we only have four different choices for one parameter. Or 2 choices for 3 parameters. Or 3 choices for 2 parameters. I think you get the gist.

If you have only a few choices it is worth to try and explore them all. But if your parameter space is continuous, you are already out of business (you can’t explore all the possibilities of a closed infinite space). You can still use this kind of logic, though. A very simple thing to do is to sample your parameter space (X) by dividing it into multiple steps (grid approach). Nonetheless, the literature clearly shows that when you want to do a few simulations, it is smarter to add some randomness to this. Latin Hypercube Sampling is a very good alternative.

I have a fear that it’s getting too technical too quickly. I’d like to smooth it out with some coding, so we have the opportunity to explain.

1.2 The code

Let’s consider lines with 1D inputs: y = mx+c Let’s say that we fix m and c to be m = 5 and c = 3. Let’s say we want to identify m and c. Of course you can just do a fit of the line and call it a day, but we can also treat this (for didactic purposes) as an optimization problem.

Ok enough talking, let’s import some libraries. We’ll use the super known libraries: numpy and scipy.

Now, the target objective is defined with this function:

We also added the compute_error function, which computes the difference between the target line and a new line with equation y = mx + c.
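The original snippet isn't reproduced here, but a minimal version consistent with the description could be:

import numpy as np

x = np.linspace(-10, 10, 200)
M_TRUE, C_TRUE = 5, 3
target_line = M_TRUE * x + C_TRUE          # the "black box" target

def compute_error(m, c):
    """Mean squared difference between the target line and y = m*x + c."""
    candidate = m * x + c
    return np.mean((target_line - candidate) ** 2)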

These are some random lines and the target ones:

Now we start with 1 random red line, we compute the error, and it’s probably going to be very high. Then, we keep adding simulations. The number of simulations will be larger and larger. By having more and more lines to compare, we will end up having a more and more accurate estimate.
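A sketch of this brute-force loop using scipy's Latin Hypercube sampler, assuming compute_error from the snippet above (the bounds on m and c are my choices):

from scipy.stats import qmc

def brute_force_search(num):
    sampler = qmc.LatinHypercube(d=2)
    samples = qmc.scale(sampler.random(n=num), l_bounds=[-10, -10], u_bounds=[10, 10])
    errors = [compute_error(m, c) for m, c in samples]
    best = int(np.argmin(errors))
    return samples[best], errors[best]

for num in [10, 100, 1000, 10_000, 100_000]:
    (m_best, c_best), err = brute_force_search(num)
    print(num, m_best, c_best, err)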

We can clearly see it from this:

So that means that we are converging, with 100,000 samples, to a very good estimate of our m and c. Here’s if we plot it:

In this specific case:

  • The cost function is the difference between the target line and the new (candidate) line
  • We are filling the 2d parameter space (m,c) with "num" simulations from a Latin hypercube
  • We are computing the cost function between the target line and the "num" lines and extracting the best one

Now this is the equivalent of "enumerating the possibilities" but they are not quite ALL the possibilities, because we are in the continuous space. It’s a very simple approach and it is prone to the presence of local minima. On the other hand, when the number of simulations increases, the probability of being in a local minimum gets lower (you probably explored all the minima and got to the lowest of them all).

This is called a brute force method because it relies solely on the number of simulations and the error computation. No sophistication at all, and very computationally intensive. Brute force methods differ in the sampling method you apply. As said before, we applied Latin Hypercube Sampling.

2. Gradient Descent

2.1 The idea

The method that I’m going to show is very well-known in the Machine Learning community because it’s the method that gets used to train Neural Networks.

We are starting to get into more complex algorithms. This idea is the following:

  • We start with a random parameter vector x_start. Anywhere in the k-dimensional space of the problem of interest.
  • We compute the loss for that random parameter x_start.
  • We compute the gradient of the loss for that specific x_start and we move in the opposite direction*

*Notice that this method works exactly the same if we want the maximum, but you follow the direction of the gradient (instead of the opposite).

As you can tell, the idea itself is very simple. Let’s look at the code.

The whole idea is very simple and can be implemented in a few lines:
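A minimal numerical implementation (the finite-difference step, learning rate and number of iterations below are my choices):

import numpy as np

def numerical_gradient(loss, x, eps=1e-6):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (loss(x + step) - loss(x - step)) / (2 * eps)
    return grad

def gradient_descent(loss, x_start, lr=0.01, num_iterations=1000):
    x = np.array(x_start, dtype=float)
    for _ in range(num_iterations):
        x = x - lr * numerical_gradient(loss, x)   # move against the gradient
    return x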

Now, the problem of local minimum is still present in this problem as you can tell:

As you can tell from the example above, if we start with a random parameter in the wrong area, we get stuck in a local minimum. Now, there are multiple ways to make this Gradient Descent approach more efficient and avoid local minima: stochastic gradient descent (SGD) (which is used a lot in Keras as an "optimizer"), adding momentum, and doing random restarts. An article that summarizes all these possibilities very well is given by Mohit Mishra.

These methods are not perfect, as they are just numerical ways to try to escape local minima: you still cannot be sure that the minimum is global, as always. Nonetheless, techniques like SGD and momentum are famous for being good methods for escaping local minima in Gradient Descent and are commonly used when training Neural Networks.

3. Bayesian Optimization

3.1 The idea

Now we are getting into the business of surrogate models, which also happens to be my PhD thesis topic. Bayesian Optimization is used when the black box function is actually computationally intensive, and you might have to wait a long time (tens of seconds) before getting your output f(x) given your input x.

The strategy is the following:

  1. You take a small fraction of points and you compute your loss function on that small fraction of points.
  2. The couples (x,f(x)) for those fraction of points are used to train a Gaussian Process Regression (GPR)
  3. The trained GPR is used to intensively sample the parameter space.
  4. The Expected Improvement (EI) tells you where the GPR is uncertain. In other words, using the EI we can select the areas where the global minimum or maximum might be
  5. You select the point with the largest EI. Compute (x_new, f(x_new)) and retrain the GPR. Then rerun 3,4 and 5.

If you want to get into the math of all of this, you can look at my article "Hands On Optimization with Expected Improvement and Gaussian Process Regression, in Python", where I steal 10 minutes of your time to talk about it in depth.

3.2 The code

Alright so let’s consider this function here*:

Image made by author

As I said, let’s take a small portion of the data as training set:

We train our GPR:

Then we select the areas with largest EI and retrain the GPR. We do this iteratively. This is the result:
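A compact sketch of this loop, using scikit-learn's GaussianProcessRegressor and a standard Expected Improvement formula (the black-box function, kernel, candidate grid and number of iterations are my choices for illustration, not the ones used for the figures):

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def black_box(x):                       # placeholder objective (to minimize)
    return np.sin(3 * x) + 0.5 * x**2

X_candidates = np.linspace(-3, 3, 500).reshape(-1, 1)
X_train = np.random.uniform(-3, 3, 5).reshape(-1, 1)   # small initial set
y_train = black_box(X_train).ravel()

for _ in range(15):
    gpr = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X_train, y_train)
    mu, sigma = gpr.predict(X_candidates, return_std=True)
    best = y_train.min()
    # Expected Improvement for minimization
    imp = best - mu
    z = imp / np.maximum(sigma, 1e-9)
    ei = imp * norm.cdf(z) + sigma * norm.pdf(z)
    x_new = X_candidates[np.argmax(ei)]
    X_train = np.vstack([X_train, x_new.reshape(1, -1)])
    y_train = np.append(y_train, black_box(x_new)[0])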

I have a love/hate relationship with this method. I find it extremely fascinating, elegant, and helpful, but it actually takes a long time to train a GPR on large datasets, especially with large dimensions (k>>0). In a few words, I use GPR every chance I have, and I would like to have more chances to use it. 🙁

4. Genetic Algorithm

4.1 The idea

The Genetic Algorithm is another extremely fascinating method to find the global optimum of a black box function. The logic is to select a random set of candidates, choose the best ones based on your cost function, mate them together, and evolve them into new candidates. More precisely, these are the steps:

  1. You start with a randomly generated population of possible solutions.
  2. You evaluate each candidate’s fitness using a predefined cost (or fitness) function. The fitness function tells you how "good" a solution is.
  3. You combine pairs of selected candidates (parents) to produce new candidates (offspring). This process is done by exchanging parts of the parent solutions to create diversity in the offspring, and it is usually referred to as mating.
  4. You can (and usually do) introduce random changes to some of the offspring’s genes (individual parts of the solution) to maintain diversity within the population. This is just to introduce some randomness (and it’s actually helpful to avoid local minima)
  5. You replace the old population with the new set of candidates and repeat the process over multiple generations.

The stopping (termination) is obtained with a predefined criterion (e.g. max iteration = 1000)

4.2 The code

The amazing people from PyGAD made the usage of Genetic Algorithms (GA) extremely easy. Install it doing:

pip install pygad

Now let’s consider this loss function to **maximize***.

*Note: PyGAD is used to minimize -f(x), which is the same as maximizing f(x)

Let’s run the GA with x in [0,10] and y in [0,10]. This is how it looks*:

*Note: There are a lot of parameters to consider. Thankfully, PyGAD explains everything very clearly. Refer to the documentation to know more!
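A sketch of the PyGAD setup (in recent PyGAD versions the fitness function takes three arguments; check your version's documentation. The objective below is a placeholder with multiple local maxima, not the exact function shown in the figure):

import numpy as np
import pygad

def fitness_func(ga_instance, solution, solution_idx):
    x, y = solution
    # placeholder multi-peak objective to maximize
    return np.sin(x) * np.cos(y) + 0.1 * (x + y)

ga = pygad.GA(num_generations=200,
              num_parents_mating=10,
              sol_per_pop=50,
              num_genes=2,
              gene_space=[{"low": 0, "high": 10}, {"low": 0, "high": 10}],
              fitness_func=fitness_func)
ga.run()
solution, solution_fitness, _ = ga.best_solution()
print(solution, solution_fitness)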

And as we can see, we converge to the global optimum:

GAs are extremely good at detecting global optima, but are a little too time- and resource-consuming when dealing with large dimensions (k>>0)

5. Conclusions

Thank you so much for spending time with me in this global optima journey. Let’s see what we did:

  1. We introduced the problem of global optimum using the example of the GPS and the shortest time to arrive at a destination
  2. We formalized the problem of global optimum and explained why we need the numerical methods to solve them
  3. We talked about the brute force method: sampling the parameter space and computing the loss. Simple but not efficient.
  4. We described the gradient descent method: notorious in Machine Learning, and used for high dimensional problems. You can try to prevent getting stuck in local minima using SGD or Momentum
  5. We talked about surrogate modelling (Bayesian Optimization): you can use a surrogate model to model the loss function and use Expected Improvement to guide you to the global minimum. Extremely elegant, but computationally heavy.
  6. We described Genetic Algorithms: an elegant class of algorithms based on modifying a set of possible random solutions. We tried it on a 2D problem with success. This method is not indicated for large dimensions but it’s a very good method against local minima/maxima.

6. About me!

Thank you again for your time. It means a lot ❤

My name is Piero Paialunga and I’m this guy here:

Image made by author

I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department and a Machine Learning Engineer for Gen Nine. I talk about AI, and Machine Learning in my blog posts and on Linkedin. If you liked the article and want to know more about machine learning and follow my studies you can:

A. Follow me on Linkedin, where I publish all my stories B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me to receive all the corrections or doubts you may have. C. Become a referred member, so you won’t have any "maximum number of stories for the month" and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available. D. Want to work with me? Check my rates and projects on Upwork!

If you want to ask me questions or start a collaboration, leave a message here or on Linkedin:

piero.paialunga@hotmail.com

The post Hands-On Global Optimization Methods, with Python appeared first on Towards Data Science.

]]>
Hands On Neural Networks and Time Series, with Python https://towardsdatascience.com/hands-on-neural-networks-and-time-series-with-python-a61d7d75f3d9/ Fri, 30 Aug 2024 03:58:59 +0000 https://towardsdatascience.com/hands-on-neural-networks-and-time-series-with-python-a61d7d75f3d9/ From the very simple Feed Forward Neural Networks to the majestic transformers: everything you need to know

The post Hands On Neural Networks and Time Series, with Python appeared first on Towards Data Science.

]]>
During my Bachelor’s Degree, my favorite professor told me this:

Once something works well enough, nobody calls it "AI" anymore

This concept goes in the same direction as Larry Tesler, who said "AI is whatever hasn't been done yet." The first example of Artificial Intelligence was the calculator, which was (and is) able to do very complex mathematical computations in a fraction of a second, while it would take minutes or hours for a human being. Nonetheless, when we talk about AI today we don't think about a calculator. We don't think of it because it simply works incredibly well, and we take it for granted. The Google Search algorithm, which is in many ways much more complex than a calculator, is a form of AI that we use in our everyday lives but we don't even think about it.

So what is really "AI"? When do we stop defining something as AI? The question is pretty complex as, if we really think about it, AI has multiple layers and domains.

It surely has multiple layers of complexity. For example, ChatGPT is more complex than the "digit recognition" 2D CNN proposed by LeCun, both conceptually and computationally, and a simple regression algorithm is far less complex than the digit recognition 2D CNN proposed by LeCun (more on this later).

It surely also has multiple domains. Every time we see a CAPTCHA, we are actively creating the input for a Neural Network to process an image. Every time we interact with a GPT we are building the text input for an NLP algorithm. Every time we say "Alexa turn on the light in the kitchen" we are feeding the audio input for a Neural Network. And while it is true that at the end of the day, everything is nothing but a 0/1 signal for a computer, it is also true that, practically, the Neural Network that processes an image has a completely different philosophy, implementation, and complexity than the one that processes an audio or a text.

This is why companies are looking more and more for specialized Machine Learning Engineers who know how to treat a special kind of data more than others. For example, in my professional career, I have worked more in the time series domain than with anything else. This blog post wants to give the reader an idea of how to use Neural Networks when we are in the time series domain. We will also do it at different levels of complexity. We will start from the most simple Neural Network that we have, which is known as Feed Forward Neural Network, to the most fancy, modern, and complex structure of Transformers.

I don’t want to bore any of the readers, and I also know that a lot of you wouldn’t find this article useful without any coding, so we are going to translate everything from English to Python.

Let’s get started! 🚀

1. Feed Forward Neural Network

1.1 Explanation

I want to start this section with a brief formalization of the problem. A time series is a sequence of observations recorded over time. Each observation in the sequence corresponds to a specific point in time. It kind of works like this:

Image made by author

For every time step t_i there is a corresponding y_i. Now it is obvious that y_i is correlated with y_i+1. The data are, as smart people say, "sequential". Nonetheless, a very simple approach is to ignore the time axis completely and to consider the sequence y_1,…, y_T as the units of the input layer of our Feed Forward Neural Network:

Image made by author

Once we have our input layer, we process it from left to right in a Feed Forward Neural Network. For example, if we want to do a binary classification or regression task, we can build the following FFNN:

Image made by author using NN SVG

Now, each arrow (or line) that you see in this network is a weight, that is, a real number that gets multiplied by its input and then added together (+ bias) in the units (the white circles).

You might already see the limitation of this method and why we define it as simple: all the units are mixed together, so the "sequentiality" is completely lost. Sure, if the model is trained well it might be able to partially recover it by adjusting the weights accordingly, but we are not enforcing it, which is not the best. The negative thing about the model is the simplicity, and the positive thing is also the simplicity: for very simple tasks FFNN works just fine, at a very limited computational cost.

1.2 Code

Now, let’s consider a sine wave. A very simple one, like this one:

Now we’ll make this task very simple. First off, we’ll do a simple regression, real number, easy and sweet. The input will be a piece of this sine wave and the output is the next point. For example, you give me the sine from t= – 2.9s and t= – 2.7s and I give you what happens at t= – 2.6s. It’s easier done than said, this is how it looks:

Then we train our Neural Network to go from a sequence of N points to 1 point, which is basically the next point. I made it extremely simple; you can add layers by adding Dense, change the number of units, change the activation function, change the optimizer, and so on and so forth.
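A minimal sketch of this setup with Keras (window size, architecture and training settings are my choices, not necessarily the ones used for the figures):

import numpy as np
from tensorflow import keras

t = np.arange(-3, 3, 0.01)
y = np.sin(2 * np.pi * t)

window = 20
X = np.array([y[i:i + window] for i in range(len(y) - window)])
Y = np.array([y[i + window] for i in range(len(y) - window)])

model = keras.Sequential([
    keras.layers.Input(shape=(window,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1)                      # regression: the next point
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, Y, epochs=20, batch_size=32, verbose=0)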

And we can test the results with this code:

Pretty good, as we expected: we created a very simple task and we asked a very simple Neural Network to do the job.

I don’t want to leave you with just the made-up example, so I will say this: consider using the FFNN when:

  • The "sequentiality" of the time series is not that relevant. For example, it might be that you have a signal vs another quantity that is also time dependent. In that case, maybe your time series is not really a "time" series and FFNN could do a good job.
  • You want to keep things simple. This is far more common. Models that consider the sequential inputs are usually more computationally intensive. FFNNs are a very good alternative, as they allow you to consider even very very simple structures.

Arguably, unless you have a lot of instances (i.e. a big dataset), it’s a good practice to start with FFNN, because it is the simplest way to start.

2. (1D) Convolutional Neural Networks

2.1 Explanation

So the first thing to make this thing more complex is to actually consider the sequentiality of the input. A way to do that is by using a model that takes care of this sequentiality by "running" with a kernel (small set of weights) along the signal like shown in the picture below:

Image made by author

This operation is called convolution and the corresponding network is known as a Convolutional Neural Network (CNN). When someone talks about CNNs they probably refer to the structure built by the brilliant mind of Yann LeCun, who developed this network to classify handwritten digits into 0–9.

What I showed you above came a little later, and it was developed by Serkan Kiranyaz et al. in 2015, in a paper known as Convolutional Neural Networks for patient-specific ECG classification. Now, the original paper is performing a classification task, and it is actually a great way to use 1DCNN (and CNN in general). For this reason, we will build a very simple 1DCNN classification algorithm to distinguish square waves and sine waves.

2.2 Code

This is how we build num × 2 time series, where num is an integer. By default it is set to 1000, so we will generate 2000 time series: 1000 square curves and 1000 sine curves. This is how:
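A minimal sketch of the generator (the frequency and amplitude ranges here are illustrative assumptions):

```python
import numpy as np

def generate_waves(num=1000, length=200):
    """Generate `num` random sine waves and `num` random square waves."""
    t = np.linspace(0, 1, length)
    sines, squares = [], []
    for _ in range(num):
        amp = np.random.uniform(0.5, 2)
        freq = np.random.uniform(1, 5)
        phase = np.random.uniform(0, 2 * np.pi)
        sines.append(amp * np.sin(2 * np.pi * freq * t + phase))
        # A square wave is the sign of another random sine wave
        freq_sq = np.random.uniform(1, 5)
        phase_sq = np.random.uniform(0, 2 * np.pi)
        squares.append(np.sign(np.sin(2 * np.pi * freq_sq * t + phase_sq)))
    X = np.vstack([sines, squares])
    y = np.array([0] * num + [1] * num)  # 0 = sine, 1 = square
    return t, X, y

t, X, y = generate_waves(num=1000)
```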

These sine waves have random frequencies, phases, and amplitudes. The square waves are the result of the sign operation applied to random sine waves. Let’s take a look to make sure they look OK:

Yep. They look very fine. 🙂

Now we build a 1D CNN with 32 filters, a kernel of size 3, and ReLU activation. We also do some max pooling and then add 2 fully connected layers right after that. We compile it and train the model on the training set. A lot of words, but it’s this simple:
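A sketch of that model (the exact layer sizes and training settings are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

# Add the channel dimension and split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X[..., None], y, test_size=0.2, random_state=42)

cnn = keras.Sequential([
    keras.Input(shape=(X.shape[1], 1)),
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid")  # sine (0) vs square (1)
])
cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
cnn.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2, verbose=0)
```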

Beautiful. And it performs very well too, as we can see from this:
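For example, with a quick evaluation on the held-out set:

```python
loss, acc = cnn.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {acc:.3f}")
```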

A lot of times, 1D CNNs are a good choice, especially for classification. This is because they are simple Neural Networks (at the end of the day, it is "just" a convolution operation) that are complex enough to perform complicated classification tasks. They are a step up in complexity with respect to FFNNs, so you might need to be mindful of that, but, again, they are not monster-big Neural Networks.

3. Long Short Term Memory/Recurrent Neural Networks

3.1 Explanation

Do you remember Chapter 1, where we had to predict the next value of a sine wave, given the previous sequence of values?

The most common way to do that is not with FFNN but with Recurrent Neural Networks:

Image made by author

Now the input layer y_1, …, y_11 is processed through a hidden state sequentially, meaning that the input to the hidden state h_i is not only y_i but also the previous hidden state h_{i-1}. In a few words, the information of the previous units is preserved throughout the layer.

Now, the way this information is preserved depends on the specific RNN. In this specific example, we will use Long Short Term Memory cells. In particular, we will do a multistep forecasting task. It is basically the same thing as Chapter 1, but instead of only predicting the next step, we will predict the next k steps. We are generating a signal that has a quadratic dependency on time plus a small, high-frequency, fixed-amplitude sine wave.

3.2 Code

When you want to use an LSTM you usually have a little bit of boring preprocessing to do to split your input time series into chunks. I did it in one function and put the whole pipeline in one snippet:
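A sketch of that pipeline (the window length, horizon, and signal parameters below are illustrative choices):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Quadratic trend plus a small, fixed-amplitude, high-frequency sine wave
t = np.linspace(0, 10, 1000)
signal = 0.05 * t**2 + 0.1 * np.sin(2 * np.pi * 5 * t)

def make_multistep_windows(sig, n_in=50, n_out=10):
    """Split the series into (past n_in points -> next n_out points) chunks."""
    X, Y = [], []
    for i in range(len(sig) - n_in - n_out):
        X.append(sig[i:i + n_in])
        Y.append(sig[i + n_in:i + n_in + n_out])
    return np.array(X)[..., None], np.array(Y)

X, Y = make_multistep_windows(signal)

lstm = keras.Sequential([
    keras.Input(shape=(50, 1)),
    layers.LSTM(64),
    layers.Dense(10)  # the next 10 steps, all at once
])
lstm.compile(optimizer="adam", loss="mse")
lstm.fit(X, Y, epochs=20, batch_size=32, validation_split=0.2, verbose=0)
```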

Let me zoom in on the predictions:

Image made by author using the code above

It’s pretty impressive that it’s guessing both the quadratic dependency (even if it looks like there is a small bias) and the wiggly behavior of sine waves.

When you use LSTM you are getting in the serious business of complexity. They are a pretty big beast. They tend to be hard to train, and they should be used with caution. Nonetheless, they are notorious for being the state of the art for forecasting in numerous study cases, so they do work very well if you have enough data and computational power. It’s almost indispensable to have a GPU during training time if you don’t want to wait ages.

4. Transformers

4.1 Explanation

In 2017 a paper came out that, as of today (August 2024), has been cited 130,441 times. This paper is called "Attention Is All You Need" and it explains how sequence transduction models can be replaced with a Transformer model, which drops RNNs and CNNs in favor of attention mechanisms.

The attention mechanism allows the model to focus (to put its attention, as the name suggests) on the relevant parts of the input sequence. This is because, when you translate from sequence A to sequence B, it is not necessarily true that the first unit of sequence A "corresponds" (or translates) to the first unit of sequence B. Think of translating from one language to another. Just because the sentence in English starts with "I", it doesn’t mean that in Hindi the first word is the translation of "I".

The attention mechanism is a very fascinating one and I strongly suggest reading the original paper, especially because it would take a long time to explain it from scratch.

4.2 Code

Now. I’ll be honest with you. Training a transformer is hard. Like, HARD hard. Imagine that the T in GPT stands for Transformer. That’s how hard.

For this reason, I will do a very simple case: training a Neural Network that takes a sine wave as input and converts it to a cosine. This approach is called sequence to sequence. If you read "seq2seq", you heard it from me first :).

We have used TensorFlow so far, but now we are betraying it and using PyTorch. This is because if you want to train a Transformer using TensorFlow you have to build your own class. The comfortable thing about TensorFlow is that you don’t have to build your own class and you can just call model.fit(), where model is defined layer by layer (just like we did above). If we have to do the class business anyway, in my opinion, PyTorch is superior because it is more intuitive.

PyTorch also has the nn.Transformer module, which is extremely cool as it is very much in line with the paper.
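A deliberately simplified sketch of the sine-to-cosine setup around nn.Transformer (the hyperparameters are illustrative, and for brevity there is no causal mask or proper decoder shifting, so take it as an API demonstration rather than a faithful training recipe):

```python
import math
import torch
import torch.nn as nn

# Toy seq2seq data: a sine sequence in, the corresponding cosine sequence out
seq_len, d_model, n_samples = 50, 16, 512
t = torch.linspace(0, 2 * math.pi, seq_len)
phases = torch.rand(n_samples, 1) * 2 * math.pi
src = torch.sin(t + phases).unsqueeze(-1)  # (batch, seq_len, 1)
tgt = torch.cos(t + phases).unsqueeze(-1)

class SineToCosine(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(1, d_model)   # project each scalar to d_model
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            dim_feedforward=64, batch_first=True)
        self.head = nn.Linear(d_model, 1)    # back to one value per time step

    def forward(self, src, tgt):
        return self.head(self.transformer(self.embed(src), self.embed(tgt)))

model = SineToCosine()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(src, tgt), tgt)  # naive teacher forcing, kept simple on purpose
    loss.backward()
    optimizer.step()
```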

Now. This very simple task already takes a long time to train, even with a GPU. Imagine multiple signals. It gets messy. What is the good thing? The good thing is that it’s very rare that you need to train your Transformer from scratch. The only case I can think of is when:

  1. You have a very large dataset
  2. For some reason, nobody has ever heard of this dataset and fine tuning an already existing transformer is not an option

As you can tell, very rare. Most of the time, you don’t need such a complex technology. Almost every time you do need it, a simple fine-tune of an existing transformer is the way to go.

5. Conclusions

Thank you for spending your time with me. It means a lot. Let’s go through the article together and summarize it:

  1. We talked about Neural Networks and we discussed how they have multiple domains (audio vs image vs text) and complexity (FFNN vs CNN vs Transformer)
  2. We applied a Feed Forward Neural Network (FFNN) with a 1 step forecasting (regression) task of a sine wave. We showed how FFNN is a good option for very simple cases with limited computational power and dataset size.
  3. We applied a 1D Convolutional Neural Network (1DCNN) to a classification task. In particular, we distinguished sine waves from square waves. We showed a very good accuracy and we have demonstrated how 1DCNN can be used for classification tasks where a little bit more computational power is allowed.
  4. We talked about Long Short Term Memory (LSTM/RNN) Networks. We applied LSTM for multiple-step forecasting. We talked about the complexity of this neural network and how it is important to be cautious as it requires large computational power.
  5. We described the Transformers using the "Attention is all you need" paper to explain it. We did a very simple seq2seq sine-to-cosine algorithm translation. We noticed how computationally expensive they are and we stated that training one from scratch is something very rare as most of the time a simple fine-tuning is the way to go.

6. About me!

Thank you again for your time. It means a lot ❤

My name is Piero Paialunga and I’m this guy here:

Image made by author

I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department and a Machine Learning Engineer for Gen Nine. I talk about AI, and Machine Learning in my blog posts and on Linkedin. If you liked the article and want to know more about machine learning and follow my studies you can:

A. Follow me on Linkedin, where I publish all my stories B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me to receive all the corrections or doubts you may have. C. Become a referred member, so you won’t have any "maximum number of stories for the month" and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available. D. Want to work with me? Check my rates and projects on Upwork!

If you want to ask me questions or start a collaboration, leave a message here or on Linkedin:

piero.paialunga@hotmail.com

The post Hands On Neural Networks and Time Series, with Python appeared first on Towards Data Science.

]]>
Feature Extraction for Time Series, from Theory to Practice, with Python https://towardsdatascience.com/feature-extraction-for-time-series-from-theory-to-practice-with-python-25631c6d8fcb/ Sat, 24 Aug 2024 00:22:14 +0000 https://towardsdatascience.com/feature-extraction-for-time-series-from-theory-to-practice-with-python-25631c6d8fcb/ Here's everything you need to know when extracting features for Time Series analysis

The post Feature Extraction for Time Series, from Theory to Practice, with Python appeared first on Towards Data Science.

]]>
Time series are a special animal.

When I started my Machine Learning career I did it because I loved Physics (weird reason to start Machine Learning) and from Physics I understood that I also loved coding and Data Science a lot. I didn’t really care about the type of data. All I wanted was to be in front of a computer writing 10k lines of code per day.

The truth is that even when you don’t care (I still really don’t) your career will drift you to some kinds of data rather than others.

If you work at SpaceX, you probably won’t do a lot of NLP but you will do a lot of signal processing. If you work at Netflix, you might end up working with a lot of NLP and recommendation systems. If you work at Tesla you will most definitely be a Computer Vision expert and work with images.

When I started as a Physicist, and then I kept going with my PhD in Engineering, I was immediately thrown into the world of signals. This is just the natural world of engineering: every time you have a setup and extract the information from it, at the end of the day, you treat a signal. Don’t get me wrong, engineering is not the only world where signals are the celebrities of the movie. Another very famous example is the one of finance and stock prices time series. That is another example of signals (time vs price). But if for whatever reason you are dealing with signals you should remember the first sentence of this blog post:

Time series are a special animal.

This means that a lot of transformation/operation/processing techniques that you would do with tabular data or images have another meaning (if they even have a meaning) for time series. Let’s take feature extraction, for example.

The idea of "feature extraction" is to "work" on the data that we have and make sure that we extract all the meaningful features that we can so that the next step (typically the Machine Learning application) can benefit from them. In other words, it is a way of "helping" the machine learning step by feeding important features and filtering out all the less important ones.

This is the full feature extraction process:

Image made by author

Now, when we consider feature extractors for, let’s say, tabular data and signals we are playing two completely different sports.

For example, the concept of peak and valley, the idea of Fourier Transform or Wavelet Transform, and the concept of Independent Component Analysis (ICA) only really make sense when dealing with signals. I’m doing all this talking and showing just to convince you that there is a set of feature extraction techniques that only belong to signals.

Now there are two macro classes of methods to do feature extractions:

  • Data driven based methods: Those methods aim to extract features by just looking at the signals. We ignore the Machine Learning step and its goal (e.g. classification, forecasting, or regression) and we only look at the signal, work on it, and extract information from it.
  • Model based methods: Those methods look at the whole pipeline and aim to find the features for that specific problem to solve.

The pros of the data-driven methods are that they are usually computationally simple and don’t require the corresponding target output. The con is that the features are not specific to your problem. For example, doing a Fourier Transform of a signal and using that as a feature might be suboptimal compared to using specific features trained in an end-to-end model.

For the sake of this blog post, we’ll talk about data driven methods only. In particular we’ll talk about domain specific based methods, frequency based methods, time based methods and statistical based methods. Let’s get started!

1. Domain Specific Feature Extraction

The first one I’m going to describe is a little bit intentionally vague. The reality is that the best way to extract features is to consider the specific problem that you are facing. For example, let’s say you are dealing with a signal from an engineering experiment and you really care about the amplitude after t = 6s. Those are cases where the feature extraction doesn’t really make sense in general (for a random case t=6s might not be more special than t =10s) but it’s actually extremely relevant for your case. That is what we mean by domain-specific feature extraction. I know this is not a lot of math and coding, but this is what is meant to be as it is extremely dependent on your specific situation.

2. Frequency based Feature Extraction

2.1 Explanation

This method is related to the spectral analysis of our time series/signal. What do we mean by that? If we look at our signal we have a natural domain. The natural domain is the simplest way to look at a signal, and it is the time domain, meaning that we consider the signal as a value (or vector) at a given time.

For example, let’s consider this signal, in its natural domain:

If we plot it we get this:

Image made by author

This is the natural (time) domain, and it is the simplest domain for our dataset. We can convert this into the frequency domain. As we saw in the symbolic expression, our signal has three periodic components. The idea of the frequency domain is to decompose the signal into the frequencies, amplitudes, and phases of its periodic components.

The Fourier Transform Y(k) of the signal y(t) is the following:

Image made by author
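For a text-only reference, the discrete version (the one numpy/scipy actually compute) can be written as:

```latex
Y(k) = \sum_{n=0}^{N-1} y(n)\, e^{-2\pi i\, k n / N}, \qquad k = 0, \dots, N-1
```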

This describes the amplitude and phase of the component with frequency k. In terms of extracting the meaningful features, we can extract the amplitudes, phases, and frequency values for the 10 main components (the ones with the highest amplitudes). These will be 10 × 3 features (amplitude, frequency, and phase × 10) that describe your time series based on its spectral information.

Now, this method can be expanded. For example, we can decompose our signal not based on sine/cosine functions but based on wavelets, which are another family of wave-like basis functions. That kind of decomposition is called Wavelet Decomposition.

I understand this is a lot to digest, so let’s start with the coding part to show you what I mean…

2.2 Code

Now, let’s build it in real life. Let’s start with the very simple Fourier Transform.

First we need to invite some friends to the party:
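Something like this (numpy, matplotlib, and scipy are all we need here):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import fft, fftfreq
```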

Now let’s take this signal as an example:
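A sketch of the signal described below, built directly from its three components:

```python
t = np.linspace(0, 10, 1000)
y = (1.0 * np.sin(2 * np.pi * 1.0 * t)
     + 0.4 * np.sin(2 * np.pi * 2.0 * t)
     + 2.0 * np.sin(2 * np.pi * 3.2 * t))
```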

This signal has three major components. One with amplitude = 1 and frequency = 1, one with amplitude = 0.4 and frequency = 2 and one with amplitude = 2 and frequency = 3.2. We can recover them by running the Fourier Transform:
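A minimal way to do that with scipy (the normalization of the amplitudes assumes a real-valued signal):

```python
dt = t[1] - t[0]
Y = fft(y)
freqs = fftfreq(len(y), dt)

# Keep the positive frequencies and normalize so the peaks read as amplitudes
mask = freqs > 0
amplitudes = 2 * np.abs(Y[mask]) / len(y)

plt.plot(freqs[mask], amplitudes)
plt.xlim(0, 5)
plt.xlabel("Frequency [Hz]")
plt.ylabel("Amplitude")
plt.show()
```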

We can clearly see three peaks with corresponding amplitudes and frequency.

Now, we don’t really need any fancy plotting (that was just to show that this method works), but we can just do everything with a very simple function, which would be this one:
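A sketch of such a function (the signature below is an assumption based on the description that follows):

```python
from scipy.fft import fft, fftfreq
import numpy as np

def fourier_features(y, x=None, num_peaks=10, max_freq=None):
    """Frequencies, amplitudes, and phases of the `num_peaks` largest spectral components."""
    if x is None:
        x = np.arange(len(y))
    Y = fft(y)
    freqs = fftfreq(len(y), x[1] - x[0])
    mask = freqs > 0
    if max_freq is not None:
        mask &= freqs < max_freq
    Y_pos, freqs_pos = Y[mask], freqs[mask]
    amps = 2 * np.abs(Y_pos) / len(y)
    order = np.argsort(amps)[::-1][:num_peaks]  # largest peaks first
    return freqs_pos[order], amps[order], np.angle(Y_pos[order])

f_feat, a_feat, p_feat = fourier_features(y, x=t, num_peaks=10, max_freq=5)
```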

So you give me the signal y and (optionally):

  • the x or time array
  • the number of features (or peaks) to consider
  • the largest frequency that you are willing to explore

This is the output:

If we want to extract features using wavelets (not sines/cosines)*, we can do the wavelet transform. We would need to install this guy:

pip install PyWavelets

And then run this:
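A minimal sketch with PyWavelets (the Daubechies-4 wavelet and the energy-per-level features are illustrative choices):

```python
import pywt
import numpy as np

# Multilevel discrete wavelet decomposition of the signal y from above
coeffs = pywt.wavedec(y, wavelet="db4", level=4)

# One simple way to turn the decomposition into features: the energy of each level
wavelet_features = np.array([np.sum(c ** 2) for c in coeffs])
print(wavelet_features)
```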

*I talk about wavelets in detail in this article here. Take a look to learn more about those majestic beasts 🙂

3. Statistical based Feature Extraction

3.1 Explanation

Another approach to feature extraction is to rely on the good old statistics. Given a signal, there are multiple things one can do to extract some statistical information out of it. In order of simple to complex, this is a list of information we can extract:

  • The mean is nothing but the sum of the signal divided by the number of time steps of the signal. Super simple:
  • The variance, that is, how much the signal varies from the mean value:
  • Skewness and Kurtosis. These are metrics that test how "not Gaussian" the distribution of your time series is. Skewness describes how asymmetric it is, and kurtosis describes how "tailed" it is.
  • Quantiles: These are the values that divide the time series into intervals with given probability ranges. For example, a 0.25 quantile with value = 10 means that 25% of the values in your time series are below 10 and the remaining 75% are larger than 10.
  • Autocorrelation. This basically tells you how much your time series is "patterned", meaning the intensity of patterns in your time series. More formally, this metric indicates how much the time series values are correlated with their own past values.
  • Entropy: Represents the complexity or unpredictability of the time series. I did a whole blog post about it here.

In 2024 each one of these properties can be implemented with one line of code (what a time to be alive). This is how:
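A sketch of those one-liners (the number of histogram bins for the entropy is an arbitrary choice):

```python
import numpy as np
from scipy.stats import skew, kurtosis, entropy

mean      = np.mean(y)
variance  = np.var(y)
skewness  = skew(y)
kurt      = kurtosis(y)
quantiles = np.quantile(y, [0.25, 0.5, 0.75])
autocorr  = np.corrcoef(y[:-1], y[1:])[0, 1]        # lag-1 autocorrelation
hist, _   = np.histogram(y, bins=30, density=True)  # entropy of the value distribution
signal_entropy = entropy(hist + 1e-12)
```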

4. Time based Feature Extractor

4.1 Explanation

In this specific section, we will focus on how to extract the information of a Time Series by just extracting the time feature. In particular, we will extract the information of the peaks and valleys. To do so, we will use the find_peaks function from SciPy.

There are multiple parameters that it is best to know when extracting the peaks of a signal, such as the expected width, the expected threshold, or the plateau size. If you have any of this information (for example, you only want to consider peaks with amplitude > 2 for some reason) you can tweak the parameters; otherwise, you can just leave everything at its default.

We also get to decide how many peaks/features we want. For example, we might want N = 10: only the largest 10 peaks+valleys. Just keep in mind that if we do so, we actually have N×2 = 20 features (10 locations and 10 amplitudes).

4.2 Code

As always, the code to do so is fairly easy:
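A sketch of how it could look with find_peaks (the zero-padding mirrors the note right below):

```python
import numpy as np
from scipy.signal import find_peaks

def peak_features(y, num_peaks=10):
    """Locations and amplitudes of the `num_peaks` largest peaks and valleys."""
    peaks, _ = find_peaks(y)      # peaks of the signal
    valleys, _ = find_peaks(-y)   # valleys are peaks of the flipped signal
    idx = np.concatenate([peaks, valleys])
    # Keep the points with the largest absolute amplitude
    idx = idx[np.argsort(np.abs(y[idx]))[::-1][:num_peaks]]
    locs = np.zeros(num_peaks)
    amps = np.zeros(num_peaks)
    locs[:len(idx)] = idx
    amps[:len(idx)] = y[idx]
    return locs, amps

locs, amps = peak_features(y, num_peaks=10)
```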

Keep in mind that, if we select N = 10 peaks but you only really have M = 4 peaks in your signal, the remaining 6 locations and peak amplitudes will be 0.

5. Which method to use?

Ok, so we have seen 4 different classes of methods.

Which one should we use?

I’m not going to hit you with the diplomatic "It depends on the problem," because, of course, it always depends on the problem.

The truth of the matter is that if you have a domain-based feature extraction that is always your best bet: if the physics of the experiment or the prior knowledge of the problem is clear, you should rely on that and consider those features as the most important ones, maybe even consider them as the only ones. Sometimes (a lot of times) you don’t have the domain based features and that’s ok.

As far as the frequency, statistical, and time based features are concerned, you should, in my opinion, use them all together. You should add those features to your dataset and then see if some of them help, don’t help, or actually confuse your machine learning model.

6. Conclusions

Thank you very much for spending time with me. I would like to take this space to summarize everything that we have done.

  1. We introduced the idea of feature extraction. We explained why it’s important and why it’s important to know the specific techniques for time series
  2. We explained the difference between model based and data driven feature extraction techniques. Model based techniques are feature extraction techniques that are trained end to end. In this blogpost we focused on data driven techniques that are performed independently from the given task
  3. We discussed the domain based feature extraction techniques, which are specific techniques that stem from the specific problem of interest
  4. We discussed the spectral techniques, which are techniques that involve the Fourier/frequency spectrum of the signal
  5. We discussed the statistical techniques, which extract values like mean, std, entropy, and autocorrelation from the signals
  6. We discussed the time based techniques, which extract the peak information from the signal
  7. We briefly gave an idea of which technique to adopt for your specific case

7. About me!

Thank you again for your time. It means a lot ❤

My name is Piero Paialunga and I’m this guy here:

Image made by author

I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department and a Machine Learning Engineer for Gen Nine. I talk about AI, and Machine Learning in my blog posts and on Linkedin. If you liked the article and want to know more about machine learning and follow my studies you can:

A. Follow me on Linkedin, where I publish all my stories B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me to receive all the corrections or doubts you may have. C. Become a referred member, so you won’t have any "maximum number of stories for the month" and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available. D. Want to work with me? Check my rates and projects on Upwork!

If you want to ask me questions or start a collaboration, leave a message here or on Linkedin:

piero.paialunga@hotmail.com

The post Feature Extraction for Time Series, from Theory to Practice, with Python appeared first on Towards Data Science.

]]>
Hands-on Time Series Anomaly Detection using Autoencoders, with Python https://towardsdatascience.com/hands-on-time-series-anomaly-detection-using-autoencoders-with-python-7cd893bbc122/ Wed, 21 Aug 2024 00:03:49 +0000 https://towardsdatascience.com/hands-on-time-series-anomaly-detection-using-autoencoders-with-python-7cd893bbc122/ Here's how to use Autoencoders to detect signals with anomalies in a few lines of codes

The post Hands-on Time Series Anomaly Detection using Autoencoders, with Python appeared first on Towards Data Science.

]]>
Anomalous time series are a very serious business.

If you think about earthquakes, anomalies are irregular seismic signals (sudden spikes or drops in the data) that hint that something bad is going on.

In financial data, everyone remembers the Wall Street Crash of 1929, and that was a clear example of a signal with an anomaly in the financial domain. In engineering, signals with spikes can represent the reflection of an ultrasound off a wall or a person.

All these stories stem from a very well-defined problem:

If I have a bank of normal signals, and a new signal comes in, how can I detect if that signal is anomalous or not?

Note that this problem is slightly different than the problem of detecting the anomaly in a given signal (which is also a well-known problem to solve). In this case, we assume that we get a whole new signal and we want to know if the signal is sufficiently different than the ones that are considered "normal" in our datasets.

So how would you approach a problem like that?

Neural Networks give us a powerful solution to this problem, and this solution has been around since 2016. Implementing a Neural Network per se is now a fairly easy game, but understanding how to use NNs for Anomaly Detection can get a little tricky.

The scope of this blog post is to guide the reader towards the idea of anomaly detection with Neural Networks, combining the two subjects in one unique piece of code, from A to Z. We will also do our own anomaly detection case study for Time Series on a synthetic dataset.

Hopefully the intro was interesting enough 🤞. Now let’s get started!

0. The idea

Generally speaking, a good idea to approach a machine learning model is to "pretend to be the computer". The question is "What would you do if you (a human) were to do this task?". Well, we do have our bank of signals right? So it is fair to say that we need to process the signals to find some sort of relevant features and use them as reference(s).

Image made by author

And that is the first step. We now have a set of reference values. These reference values will be the ones that will determine if a new signal is anomalous or not by a simple comparison.

Image made by author

So it works like this:

  • A new signal comes in,
  • We extract the feature of this new signal
  • We compare the features with the features we previously extracted from the bank of data and we check whether that’s an anomaly or not. As simple as that.

Now, the real deal is the following:

"How do we process the bank of data (and the new signal) to extract the features?"

That is really what makes an anomaly detection algorithm different than another one. With not a lot of imagination, the model that extracts the features, in Machine Learning, is usually known as feature extractor.

1. The feature extractor

There are countless feature extractors out there. The one I’d like to talk about today involves Deep Learning.

Deep Learning is (and has been) the hot thing in tech: everybody talks about it. And for a reason, I’d say. All the magnificent things that we see in our daily news, the new Meta AI LLaMA-powered chatbot, the Circle to Search feature by Google, Midjourney’s outstanding "imagine" feature, are all examples of (very successful) Deep Learning architectures.

But Deep Learning can be much simpler than those examples. Even a very simple cat/dog image classifier can be a Deep Learning algorithm. In general, we talk about a Deep Learning algorithm when the algorithm learns in a "layered" fashion (that’s why it’s deep) and extracts features by itself, bypassing the manual feature engineering step. For example, if we distinguish cat and dog images, maybe the first layer gets very simple features (like the main color of the image), and as we get deeper into the rabbit hole, the last layer gets very fine details (like the paws in the image).

I don’t want to talk philosophy too much, so let’s talk about our specific Deep Learning algorithm, which is known as encoder-decoder or more specifically, autoencoder.

1. 1 Autoencoders

Our Deep Learning algorithm is a very special one as it aims to reconstruct the same object that you put as an input. Just like this:

Image made by author, doggo’s photo by Caleb Fisher on Unsplash

Now, this looks a little bit silly right? What is the need for a Machine Learning algorithm that is only trained to replicate what you see? Imagine ChatGPT and imagine if the model just replicates exactly what you write, how is that even helpful?

Well, the trick of Autoencoders is in their architecture. The input layer of the autoencoder has as many units (which are nothing but values, real numbers) as the input size. For example, if the doggo image is 10×10 = 100 pixels, the input will be 100 units long. The output has the same number of units as the input. Nonetheless, in the middle, the number of units gets smaller. This smaller set of units is what is called the "latent space", and it’s the space of the meaningful features. You only need k < 100 units (those of the latent space) to reconstruct the full image, so they must be the key ones.

Image made by author using the amazing NN-SVG tool

In other words, the "hidden layer" or "latent space" contains the key information that we can use to replicate the signal (or image or text) that we get as an input.

I can hear you from the computer asking me "Why do we need this at all for anomaly detection?". Well, this is the logic:

If I train a Machine Learning model to replicate a cat given an image of a cat, that means that it will be successful in replicating cat images.

But the contrary is also true: if I give you another image, and it’s not able to replicate it successfully, then that image is probably not the image of a cat -> it’s an anomaly!

More technically, if the reconstruction error of the non-anomalous signals is, on average, let’s say 0.01, then when we find an anomalous signal we will hopefully get an error of 0.10, and as it is 10 times larger than what it should be, we can confidently call it an anomaly.

I know it can be confusing, but hopefully it will be much more clear right now as we will do a hands-on example in Python. Let’s start with it!

2. Our hands-on example

In this section we will:

  • Define the Python libraries that we need
  • Build our "normal signals"
  • Build our "anomalous signals"
  • Train our Autoencoder
  • Classify anomalous vs non anomalous signals.

Alright let’s go.

2.1 Python environment

We used numpy for data manipulation (you can also use pandas), matplotlib and seaborn to plot, tensorflow* for Deep Learning, and sklearn for error analysis.

*For this very simple exercise, we don’t need a whole lot of control over the Neural Network, as we will use a very simple 1D CNN (spoiler lol). For this reason we can just use tensorflow. If for some reason you are a torch guy, no problems at all! This approach will work just the same
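The corresponding imports could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.metrics import mean_squared_error
```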

2.2 Data Generator

Let’s first describe how we are going to set up our experiment. We will have our x range going from -8π to 8π. A "non anomalous" signal is built as follows:

Image made by author

where:

  • A_1, A_2, and A_3 are three random amplitudes that change signal by signal and can be anything from -2 to 2
  • f_1, f_2, and f_3 are three random frequencies that change signal by signal and can be anything from -2 to 2

An anomalous signal is built as follows:

Image made by author

where everything stays the same as a normal signal but:

  • A_anomaly can be anything between -5 and -2 or between 2 and 5
  • f_anomaly can be anything between -10 and -5 or between 5 and 10

In other words:

  • A normal signal is a sum of three sine components with random amplitudes and random frequencies
  • An anomalous signal is a sum of three sine components with random amplitudes and random frequencies, plus a sine component with a larger absolute amplitude and a larger absolute frequency

This is how we implement the idea in Python:
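A minimal sketch of that generator (the exact sampling of the random parameters is an assumption consistent with the ranges above):

```python
import numpy as np

x = np.linspace(-8 * np.pi, 8 * np.pi, 1000)

def random_outside(inner, outer):
    """A value whose absolute value is in [inner, outer], with random sign."""
    value = np.random.uniform(inner, outer)
    return value if np.random.rand() < 0.5 else -value

def generate_signals(num=1000, n_components=3):
    normal, anomalous = [], []
    for _ in range(num):
        base = sum(np.random.uniform(-2, 2) * np.sin(np.random.uniform(-2, 2) * x)
                   for _ in range(n_components))
        normal.append(base)
        # The anomalous signal adds one large-amplitude, high-frequency component
        anomaly = random_outside(2, 5) * np.sin(random_outside(5, 10) * x)
        anomalous.append(base + anomaly)
    return np.array(normal), np.array(anomalous)

normal_signals, anomalous_signals = generate_signals(num=1000)
```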

The code above generates "num" normal and "num" anomalous signals. We set num = 1000, but feel free to increase or decrease it. You can also change the number of sine components from 3 (the default) to however many you want.

So if we plot them we clearly see the difference in terms of amplitude and (mostly) in terms of frequency:

2.3 Autoencoders

The Autoencoder that we are using in this project is a 1D CNN. 1D CNNs are used a lot in signals, and the principle is quite simple. There is a small kernel (vector) that runs over the signal with a convolution operation. Just like in 2D CNN, multiple filters can be used to extract the meaningful features of the signals. For example, one filter could look at the maxima, one at their widths, one at the minima, and so on and so forth.

This is how we build our autoencoder:
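A sketch of a 1D convolutional autoencoder (the filter counts, kernel sizes, and strides are illustrative choices):

```python
from tensorflow import keras
from tensorflow.keras import layers

signal_length = normal_signals.shape[1]

autoencoder = keras.Sequential([
    keras.Input(shape=(signal_length, 1)),
    # Encoder: compress the signal into a smaller latent representation
    layers.Conv1D(16, 7, strides=2, padding="same", activation="relu"),
    layers.Conv1D(8, 7, strides=2, padding="same", activation="relu"),
    # Decoder: reconstruct the original signal from the latent representation
    layers.Conv1DTranspose(8, 7, strides=2, padding="same", activation="relu"),
    layers.Conv1DTranspose(16, 7, strides=2, padding="same", activation="relu"),
    layers.Conv1D(1, 7, padding="same")
])
autoencoder.compile(optimizer="adam", loss="mse")
```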

In order to train our model we need to reshape our vectors. We can do this using this:
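For example, by adding a channel dimension:

```python
# Conv1D layers expect (num_signals, signal_length, 1)
X_train = normal_signals[..., None]
```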

So now we can train our model with this:
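Note that, being an autoencoder, the input and the target are the same signals:

```python
autoencoder.fit(X_train, X_train, epochs=30, batch_size=32,
                validation_split=0.1, verbose=0)
```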

2.4 Anomaly Detector

Now that we have our model we can try and reconstruct the signals.

We will have the normal signals as input and extract some statistics of our MSE. For example, we can consider the p = 0.99 percentile value. Then we will reconstruct our anomalous signals and take a look at the MSE. If the MSE is larger than the 0.99 percentile value, we call the signal an anomaly; otherwise, we call it a normal signal.

Just like this:
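A sketch of that comparison:

```python
import numpy as np

# Reconstruction error of the normal signals defines the threshold
# (using the training signals here for brevity; a held-out normal set is cleaner)
recon_normal = autoencoder.predict(X_train, verbose=0)
mse_normal = np.mean((recon_normal - X_train) ** 2, axis=(1, 2))
threshold = np.quantile(mse_normal, 0.99)  # the p = 0.99 percentile value

# A new signal is flagged as anomalous if its reconstruction error exceeds the threshold
X_anomalous = anomalous_signals[..., None]
recon_anom = autoencoder.predict(X_anomalous, verbose=0)
mse_anom = np.mean((recon_anom - X_anomalous) ** 2, axis=(1, 2))

print("Normal signals flagged as anomalous:", np.sum(mse_normal > threshold))
print("Anomalies correctly detected:", np.sum(mse_anom > threshold))
```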

As we saw, all the anomalies have been correctly classified as anomalies, at the "cost" of only 3 normal signals (out of 1000) being classified as anomalous.

2.5 All at once!

Now you might think that the normal and anomalous signals are indeed too different, so we can, for example, make them much more similar. I created a function that allows you to regulate the range of frequencies and amplitudes of the real and anomalous signals. You just have to put them in a dictionary like this:
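Something along these lines (the key names below are illustrative, not the exact ones from my code):

```python
# Normal and anomalous ranges deliberately close to each other
setup_dict = {
    "normal_amplitude_range":  (-2, 2),
    "normal_frequency_range":  (-2, 2),
    "anomaly_amplitude_range": (2, 3),   # absolute value; the sign is random
    "anomaly_frequency_range": (2, 4),
}
```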

And the whole function to do the same analysis with a custom setup_dict boundary is this one:

And you can just run it like this:

In the default setup_dict above the anomalous frequencies and amplitudes are very similar to the normal ones.

Nonetheless, the performance is very good:

Image made by author using code above

Only 44 anomalies are not detected and only 13 normal signals are wrongly detected as anomalous.

Eventually, when the normal frequencies and the anomalous frequencies are the same or very similar, of course, it would be impossible to distinguish a normal and anomalous signal, but it is interesting to see how far we can push our Autoencoders in distinguishing normal and anomalous signals.

3. Conclusions

Thank you very much for spending time with me. In this article we did the following:

  • Talked about anomaly detection: We described the problem of anomaly detection, specifically in the case of Time Series in multiple scenarios like engineering, finance, and geology
  • Introduced autoencoders: We explained the idea of autoencoders for anomaly detection. We started with the description of a Deep Learning algorithm, we talked about autoencoders and we introduced the idea of using autoencoders for anomaly detection based on the reconstruction error
  • Generated Synthetic Data: We built our synthetic data of sine waves, both with and without anomalies.
  • Implemented the model: A 1D Convolutional Neural Network (CNN) was used to build the Autoencoder, which was trained to replicate normal signals.
  • Made it customizable: We showed that our method worked with our dataset and wrote a very simple function that allows users to modify the normal and anomalous datasets and play with them.

4. About me!

Thank you again for your time. It means a lot ❤

My name is Piero Paialunga and I’m this guy here:

Image made by author

I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department and a Machine Learning Engineer for Gen Nine. I talk about AI, and Machine Learning in my blog posts and on Linkedin. If you liked the article and want to know more about machine learning and follow my studies you can:

A. Follow me on Linkedin, where I publish all my stories B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me to receive all the corrections or doubts you may have. C. Become a referred member, so you won’t have any "maximum number of stories for the month" and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available. D. Want to work with me? Check my rates and projects on Upwork!

If you want to ask me questions or start a collaboration, leave a message here or on Linkedin:

piero.paialunga@hotmail.com

The post Hands-on Time Series Anomaly Detection using Autoencoders, with Python appeared first on Towards Data Science.

]]>