
Hands-On Global Optimization Methods, with Python

Four methods to find the maximum (or minimum) of your black box objective function

Image made by author using DALL·E

Imagine going out with your best friend. You decide to go to this place that you have never been to, but your best friend has.

Also, imagine that you are in a city with a bit of a traffic problem, like Rome (I would know, as I’m from there and it’s painful), and consider that you and your best friend will meet there, leaving from the same house in two different cars.

You and your best friend leave the house at the same time, and since you have never been to the place, you use the GPS to get there.

When you arrive, you find your friend already ordering the drinks, and he tells you he has been waiting for the last 10 minutes. Assuming that neither of you went above the speed limits, this story tells us two things:

  1. You should be mad at your best friend, as he could have told you the faster route instead of letting you use the GPS. Why didn’t he call you anyway? 😡
  2. You just encountered a local minimum problem.

During this article, we will focus on problem number 2, and I will let you deal with your best friend on your own.

Why are local minima relevant in this example?

Let’s take a step back.

Services like Google Maps (or navigation apps in general) are built on very well-known optimization algorithms. You have to go from point A to point B, and you want to minimize the time required to do so. This is exactly what we mean by "optimization": the process of minimizing (or maximizing) an objective function given a set of parameters.

Everybody relies on Google Maps to get to places they have never been, and we do so because we are fairly confident that Google Maps won’t waste a lot of our time and will take us to the desired destination without major delays or problems. Nonetheless, when we do know the location, most of the time we are confident that we will get to the destination quicker if we take another route. That might be because we know there is scheduled road work at a specific time of day, because traffic will start building up right after we leave the house (not giving Google Maps time to recalculate), or because there are a lot of speed traps over there and cars tend to stomp on their brakes and slow the whole flow, and so on and so forth.

In other words, we think we have more data than Google Maps (and sometimes we do). As we have more data, we end up getting to a minimum that is lower than Google Maps’. But wait, didn’t we say that Google Maps runs an optimization too? Yes, indeed! That’s a case where we can say that their minimum is local. Is our minimum local too? Maybe, maybe not. Maybe it is another local minimum that is lower than Google Maps’ local minimum. Or maybe it is indeed the fastest way to go, ever (the global minimum). What is certain is that the task of global optimization, meaning finding the "lowest of all the minima" or the "largest of all the maxima", is a very fascinating exercise.

Hopefully this intro gave you enough interest to bear with me through this global optimization article. We will start by formalizing the global optimization problem, and then we will go through multiple ways (or algorithms) to reach the global optimum. In particular, we will list these methods from the simplest to the most complex.

Let’s get this started! 🚀

0. Formalizing the problem

So, global optimum. What do we mean by that?

Let’s start with something that may be slightly trivial but is pretty crucial to say. Everything we are saying (and have said) works and makes sense for black box functions. By that I mean: I give the function an input, some numerical process happens in the background, and the function "spits out" a number. In other words, we have to assume that we don’t have any analytical expression of what the function looks like.

Why is that? Well, because if I knew the analytical function, I could get away with taking the derivative, finding where it equals 0, and checking the sign of the second derivative (negative for a maximum, positive for a minimum) to find the optimum value.

For example, let’s consider the argmax (optimum = maximum) situation. We have this equation:

$$x_{\text{opt}} = \arg\max_{x \in X} f(x)$$

So, again, in a case where we know the analytical form of f(x) we get something like:

$$f'(x_{\text{opt}}) = 0, \qquad f''(x_{\text{opt}}) < 0$$

But in reality we don’t have f'(x), because we don’t have the analytical expression of f(x) in the first place, so we can’t really get x_opt analytically.

But there is more that adds to the problem. In reality, X is a continuous set*, so we can’t really "check them all", meaning that we can’t evaluate f(x) for all possible values of x.

That’s an overview of the problems we face when we do global optimization, and that is also why we use numerical methods to solve them. This means that we give up on the analytical definition of the derivatives and approach the problem with algorithms. Of course, we are going to use an analytical expression to test things (to play the part of the black box), but we won’t use any analytical operation in our optimization.

Do you buy it? I hope you buy it. Let’s start!

*Notice that X is a k-dimensional space. Multiple parameters can be considered in your problem.

1. Brute force method(s)

1.1 The idea

When you want to find the minimum of a function, the most reasonable thing to do, the thing that even a non-technical person would suggest, is this:

Let’s explore all the possibilities!

Now this is possible only for very limited cases. For example, we can do it if we only have four different choices for one parameter. Or 2 choices for 3 parameters. Or 3 choices for 2 parameters. I think you get the gist.

If you have only a few choices, it is worth exploring them all. But if your parameter space is continuous, you are already out of business (you can’t explore all the possibilities of an infinite space). You can still use this kind of logic, though. A very simple thing to do is to sample your parameter space (X) by dividing it into multiple steps (grid approach). Nonetheless, the literature clearly shows that when you can only afford a few simulations, it is smarter to add some randomness to the sampling. Latin Hypercube Sampling is a very good alternative.
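
As a quick illustration, here is how you could draw Latin Hypercube samples with scipy’s `qmc` module:

```python
from scipy.stats import qmc

# Eight Latin Hypercube samples in the unit square [0, 1]^2:
# each row/column "bin" contains exactly one sample
sampler = qmc.LatinHypercube(d=2, seed=42)
unit_samples = sampler.random(n=8)

# Rescale to an arbitrary parameter box, e.g. m in [0, 10], c in [-5, 5]
samples = qmc.scale(unit_samples, [0, -5], [10, 5])
print(samples)
```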

I fear it’s getting too technical too quickly. I’d like to smooth it out with some coding, so we have the opportunity to explain things as we go.

1.2 The code

Let’s consider lines with 1D inputs: y = mx + c. Let’s fix the target line at m = 5 and c = 3, and pretend we don’t know these values: our task is to identify m and c. Of course, you could just do a fit of the line and call it a day, but we can also treat this (for didactic purposes) as an optimization problem.

Ok, enough talking, let’s import some libraries. We’ll use the well-known numpy and scipy.
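
Something like this should do (matplotlib is assumed here for the plots):

```python
import numpy as np
import matplotlib.pyplot as plt  # assumed for the plots below
from scipy.stats import qmc      # Latin Hypercube sampling lives here
```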

Now, the target objective is defined, together with a compute_error function, which computes the difference between the target line and a candidate line with equation y = mx + c.
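
Here’s a minimal sketch of these two pieces, assuming a mean squared difference as the error and an illustrative x grid:

```python
# Input grid and the target line y = 5x + 3
x = np.linspace(-10, 10, 200)
M_TRUE, C_TRUE = 5.0, 3.0

def line(m, c, x=x):
    """A line with slope m and intercept c, evaluated on the grid."""
    return m * x + c

y_target = line(M_TRUE, C_TRUE)

def compute_error(m, c):
    """Mean squared difference between a candidate line and the target line."""
    return np.mean((line(m, c) - y_target) ** 2)
```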

These are some random candidate lines and the target one:

Now we start with one random red line and compute its error, which is probably going to be very high. Then we keep adding simulations: with more and more lines to compare, we end up with a more and more accurate estimate.
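
Here’s a sketch of that loop, reusing compute_error from above and searching, say, m, c ∈ [-10, 10]:

```python
def brute_force_search(num, low=-10.0, high=10.0, seed=0):
    """Sample `num` (m, c) pairs with a Latin Hypercube and keep the best one."""
    sampler = qmc.LatinHypercube(d=2, seed=seed)
    candidates = qmc.scale(sampler.random(n=num), [low, low], [high, high])
    errors = np.array([compute_error(m, c) for m, c in candidates])
    best = np.argmin(errors)
    return candidates[best], errors[best]

for num in [10, 100, 1_000, 10_000, 100_000]:
    (m_best, c_best), err = brute_force_search(num)
    print(f"num={num:>6}: m={m_best:.3f}, c={c_best:.3f}, error={err:.5f}")
```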

We can clearly see it from this:

So that means that with 100,000 samples we are converging to a very good estimate of our m and c. Here’s what it looks like if we plot it:

In this specific case:

  • The cost function is the difference between the target line and the new (candidate) line
  • We are filling the 2D parameter space (m, c) with num simulations from a Latin Hypercube
  • We are computing the cost function between the target line and the num candidate lines and extracting the best one

Now, this is the equivalent of "enumerating the possibilities", but they are not quite ALL the possibilities, because we are in a continuous space. It’s a very simple approach, and it is prone to the presence of local minima. On the other hand, as the number of simulations increases, the probability of ending up in a local minimum gets lower (you have probably explored all the minima and found the lowest of them all).

This is called a brute force method because it relies solely on the number of simulations and the error computation. No sophistication at all, and computationally very intensive. Brute force methods differ in the sampling method you apply; as said before, we applied Latin Hypercube Sampling.

2. Gradient Descent

2.1 The idea

The method that I’m going to show is very well known in the Machine Learning community, because it’s the method used to train Neural Networks.

We are starting to get into more complex algorithms. The idea is the following:

  • We start with a random parameter vector x_start. Anywhere in the k-dimensional space of the problem of interest.
  • We compute the loss for that random parameter x_start.
  • We compute the gradient of the loss at that specific x_start and move in the opposite direction*

*Notice that this method works exactly the same if we want the maximum, but you follow the direction of the gradient (instead of the opposite).

As you can tell, the idea itself is very simple. Let’s look at the code.

The whole idea is very simple and can be implemented in a few lines.
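
Here’s a minimal sketch, assuming a 1D toy black box with several local minima and a central finite-difference estimate of the gradient (remember, we can’t differentiate a black box analytically):

```python
import numpy as np

def f(x):
    """Toy black box with several local minima (a stand-in objective)."""
    return np.sin(3 * x) + 0.1 * x ** 2

def numerical_grad(f, x, h=1e-5):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def gradient_descent(f, x_start, lr=0.01, n_steps=1_000):
    """Plain gradient descent: repeatedly step against the estimated gradient."""
    x = x_start
    for _ in range(n_steps):
        x = x - lr * numerical_grad(f, x)
    return x

x_opt = gradient_descent(f, x_start=np.random.uniform(-3, 3))
print(x_opt, f(x_opt))
```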

Now, the local minimum problem is still present with this method, as you can tell:

As you can tell from the example above, if we start with a random parameter in the wrong area, we get stuck in a local minimum. Now, there are multiple ways to make this Gradient Descent approach more efficient and help it avoid local minima: stochastic gradient descent (SGD) (which you often see in Keras as an "optimizer"), adding momentum, and doing random restarts. An article that summarizes all these possibilities very well is given by Mohit Mishra.

These methods are not perfect, as they are just numerical ways of trying to escape local minima: you still can’t be sure that the minimum you find is global, as always. Nonetheless, techniques like SGD and momentum are famously good at escaping local minima in Gradient Descent and are commonly used when training Neural Networks.
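
For instance, a momentum variant of the sketch above (reusing numerical_grad) is only a couple of lines more; beta is the usual momentum coefficient:

```python
def gradient_descent_momentum(f, x_start, lr=0.01, beta=0.9, n_steps=1_000):
    """Gradient descent with momentum: a running velocity can roll through shallow local minima."""
    x, v = x_start, 0.0
    for _ in range(n_steps):
        v = beta * v - lr * numerical_grad(f, x)  # accumulate a velocity term
        x = x + v
    return x
```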

3. Bayesian Optimization

3.1 The idea

Now we are getting into the business of surrogate models, which also happens to be my PhD thesis topic. Bayesian Optimization is used when the black box function is actually computationally intensive, and you might have to wait a long time (tens of seconds) before getting your output f(x) given your input x.

The strategy is the following:

  1. You take a small set of points and compute your loss function on them.
  2. The pairs (x, f(x)) for those points are used to train a Gaussian Process Regression (GPR) model.
  3. The trained GPR is used to intensively sample the parameter space.
  4. The Expected Improvement (EI) tells you where the GPR is promising and/or uncertain. In other words, using the EI we can select the areas where the global minimum or maximum might be (a closed-form expression is given right after this list).
  5. You select the point with the largest EI, compute (x_new, f(x_new)), and retrain the GPR. Then you rerun steps 3, 4, and 5.
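
For reference, the standard closed form of the EI under the GPR’s Gaussian posterior, written for minimization with best observed value f_min, predictive mean μ(x), and standard deviation σ(x), is:

$$\mathrm{EI}(x) = \big(f_{\min} - \mu(x)\big)\,\Phi(z) + \sigma(x)\,\phi(z), \qquad z = \frac{f_{\min} - \mu(x)}{\sigma(x)}$$

where Φ is the standard normal CDF and φ its PDF.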

If you want to get into the math of all of this, you can look at my article "Hands-On Optimization with Expected Improvement and Gaussian Process Regression, in Python", where I steal 10 minutes of your time to talk about it in depth.

3.2 The code

Alright so let’s consider this function here*:

Image made by author

As I said, let’s take a small portion of the data as the training set:
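
A sketch of this step, assuming a 1D toy black_box as a stand-in for the expensive function:

```python
import numpy as np

def black_box(x):
    """Stand-in for the expensive objective (an illustrative assumption)."""
    return np.sin(3 * x) + 0.3 * x

# A handful of expensive evaluations form the initial training set
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 6, size=8).reshape(-1, 1)
y_train = black_box(x_train).ravel()
```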

We train our GPR:
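
A sketch with scikit-learn’s GaussianProcessRegressor (one possible choice of library):

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# RBF kernel with a tunable amplitude; hyperparameters are fit by maximum likelihood
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr.fit(x_train, y_train)
```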

Then we iteratively select the point with the largest EI, evaluate the black box there, and retrain the GPR.
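
Continuing the sketch, the EI computation and the retraining loop might look like this:

```python
from scipy.stats import norm

def expected_improvement(x_grid, gpr, y_best):
    """EI for minimization, evaluated on a dense grid of candidate points."""
    mu, sigma = gpr.predict(x_grid, return_std=True)
    sigma = np.maximum(sigma, 1e-12)  # guard against division by zero
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

x_grid = np.linspace(0, 6, 1_000).reshape(-1, 1)
for _ in range(15):  # a few Bayesian Optimization iterations
    ei = expected_improvement(x_grid, gpr, y_train.min())
    x_new = x_grid[np.argmax(ei)].reshape(1, -1)  # most promising point
    y_new = black_box(x_new).ravel()              # one expensive evaluation
    x_train = np.vstack([x_train, x_new])
    y_train = np.concatenate([y_train, y_new])
    gpr.fit(x_train, y_train)                     # retrain the surrogate

print(x_train[np.argmin(y_train)], y_train.min())
```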

I have a love/hate relationship with this method. I find it extremely fascinating, elegant, and helpful, but it actually takes a long time to train a GPR on large datasets, especially in high dimensions (k ≫ 1). In a few words: I use GPR every chance I get, and I wish I had more chances to use it. 🙁

4. Genetic Algorithm

4.1 The idea

The Genetic Algorithm is another extremely fascinating method to find the global optimum of a black box function. The logic is to select a random set of candidates, choose the best ones based on your cost function, mate them together, and evolve them into new candidates. More precisely, these are the steps:

  1. You start with a randomly generated population of possible solutions.
  2. You evaluate each candidate’s fitness using a predefined cost (or fitness) function. The fitness function tells you how "good" a solution is.
  3. You combine pairs of selected candidates (parents) to produce new candidates (offspring). This is done by exchanging parts of the parent solutions to create diversity in the offspring, and it is usually referred to as mating (or crossover).
  4. You can (and usually do) introduce random changes to some of the offspring’s genes (individual parts of the solution) to maintain diversity within the population. This introduces some randomness, and it’s actually helpful for avoiding local minima.
  5. You replace the old population with the new set of candidates and repeat the process over multiple generations.

The stopping (termination) is governed by a predefined criterion (e.g., a maximum number of generations, such as 1000).

4.2 The code

The amazing people from PyGAD made using Genetic Algorithms (GAs) extremely easy. Install it with:

pip install pygad

Now let’s consider this loss function to minimize*.

*Note: PyGAD maximizes the fitness function, so we maximize -f(x), which is the same as minimizing f(x)

Let’s run the GA with x in [0,10] and y in [0,10]. This is how it looks*:

*Note: There are a lot of parameters to consider. Thankfully, PyGAD explains everything very clearly. Refer to the documentation to know more!
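
Here’s a sketch, assuming a stand-in 2D loss with several local optima (the fitness signature below is the one for PyGAD ≥ 3.0; earlier versions omit the ga_instance argument):

```python
import numpy as np
import pygad

def loss(x, y):
    """Stand-in 2D loss with several local minima (an illustrative assumption)."""
    return np.sin(x) * np.cos(y) + 0.1 * ((x - 5) ** 2 + (y - 5) ** 2)

def fitness_func(ga_instance, solution, solution_idx):
    # PyGAD maximizes fitness, so we return -loss to minimize the loss
    x, y = solution
    return -loss(x, y)

ga_instance = pygad.GA(
    num_generations=200,
    num_parents_mating=10,
    fitness_func=fitness_func,
    sol_per_pop=50,
    num_genes=2,
    gene_space=[{"low": 0, "high": 10}, {"low": 0, "high": 10}],  # x and y bounds
    mutation_percent_genes=20,
)
ga_instance.run()

solution, fitness, _ = ga_instance.best_solution()
print("best (x, y):", solution, "loss:", -fitness)
```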

And as we can see, we converge to the global minimum:

GAs are extremely good at finding global optima, but they are a little too time- and resource-consuming when dealing with high dimensions (k ≫ 1).

5. Conclusions

Thank you so much for spending time with me on this global optimization journey. Let’s recap what we did:

  1. We introduced the problem of the global optimum using the example of the GPS and the shortest time to arrive at a destination
  2. We formalized the problem of the global optimum and explained why we need numerical methods to solve it
  3. We talked about the brute force method: sampling the parameter space and computing the loss. Simple but not efficient.
  4. We described the gradient descent method: famous in Machine Learning, and used for high-dimensional problems. You can try to prevent getting stuck in local minima using SGD or momentum
  5. We talked about surrogate modelling (Bayesian Optimization): you can use a surrogate model to model the loss function and use Expected Improvement to guide you to the global minimum. Extremely elegant, but computationally heavy.
  6. We described Genetic Algorithms: an elegant class of algorithms based on evolving a set of random candidate solutions. We tried it on a 2D problem with success. This method is not ideal for high dimensions, but it’s a very good method against local minima/maxima.

6. About me!

Thank you again for your time. It means a lot ❤

My name is Piero Paialunga and I’m this guy here:

Image made by author

I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department and a Machine Learning Engineer for Gen Nine. I talk about AI and Machine Learning in my blog posts and on LinkedIn. If you liked the article and want to know more about machine learning and follow my studies, you can:

A. Follow me on LinkedIn, where I publish all my stories
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me with any corrections or doubts you may have.
C. Become a referred member, so you won’t have any "maximum number of stories for the month" and can read whatever I (and thousands of other top Machine Learning and Data Science writers) write about the newest technology available.
D. Want to work with me? Check my rates and projects on Upwork!

If you want to ask me questions or start a collaboration, leave a message here or on LinkedIn:

[email protected]

