Shubham Gandhi, Author at Towards Data Science
https://towardsdatascience.com

Why CatBoost Works So Well: The Engineering Behind the Magic
https://towardsdatascience.com/catboost-inner-workings-and-optimizations/
Thu, 10 Apr 2025 00:28:11 +0000
CatBoost stands out by directly tackling a long-standing challenge in gradient boosting: how to handle categorical variables effectively without causing target leakage. By introducing innovative techniques such as Ordered Target Statistics and Ordered Boosting, and by leveraging the structure of Oblivious Trees, CatBoost efficiently balances robustness and accuracy. These methods ensure that each prediction uses only past data, preventing leakage and resulting in a model that is both fast and reliable for real-world tasks.


Gradient boosting is a cornerstone technique for modeling tabular data due to its speed and simplicity. It delivers great results without any fuss. When you look around, you’ll see multiple options like LightGBM, XGBoost, etc. CatBoost is one such variant. In this post, we will take a detailed look at this model, explore its inner workings, and understand what makes it a great choice for real-world tasks.

Target Statistic

Target Encoding Example: the average value of the target variable for a category is used to replace each category (e.g., Car → 3.9, Bike → 1.2, Bus → 11.7, Cycle → 0.8). Image by author

One of the important contributions of the CatBoost paper is a new method of calculating the Target Statistic. What is a Target Statistic? If you have worked with categorical variables before, you’d know that the most rudimentary way to deal with them is one-hot encoding. From experience, you’d also know that this introduces a host of problems like sparsity, the curse of dimensionality, and memory issues, especially for categorical variables with high cardinality.

Greedy Target Statistic

To avoid one-hot encoding, we calculate the Target Statistic for the categorical variables instead. This means we calculate the mean of the target variable at each unique value of the categorical variable. So if a categorical variable takes the values A, B, and C, we replace each value with the average of \(\text{y}\) computed over the samples having that value.

That sounds good, right? It does, but this approach comes with its own problem, namely Target Leakage. To understand this, let’s take an extreme example. Extreme examples are often the easiest way to expose issues in an approach. Consider the dataset below:

| Categorical Column | Target Column |
|---|---|
| A | 0 |
| B | 1 |
| C | 0 |
| D | 1 |
| E | 0 |

Greedy Target Statistic: compute the mean target value for each unique category


Now let’s write the equation for calculating the Target Statistic:
\[\hat{x}^i_k = \frac{\sum_{j=1}^{n} \mathbb{1}_{\{x^i_j = x^i_k\}} \cdot y_j + a\,p}{\sum_{j=1}^{n} \mathbb{1}_{\{x^i_j = x^i_k\}} + a}\]

Here \(x^i_j\) is the value of the i-th categorical feature for the j-th sample. So for the k-th sample, we iterate over all samples of \(x^i\), select the ones having the value \(x^i_k\), and take the average value of \(y\) over those samples. Instead of taking a direct average, we take a smoothed average, which is what the \(a\) and \(p\) terms are for: \(a\) is the smoothing parameter and \(p\) is the global mean of \(y\).
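
To make the formula concrete, here is a small pandas sketch of greedy target encoding with smoothing. The column names and the value of a are illustrative choices, not anything taken from CatBoost itself:

```python
import pandas as pd

# Toy data: one categorical feature and a binary target
df = pd.DataFrame({
    "vehicle": ["Car", "Car", "Bike", "Bus", "Bike", "Bus", "Bus"],
    "y":       [1,     0,     0,      1,     0,      1,     1],
})

a = 1.0             # smoothing parameter
p = df["y"].mean()  # global mean of the target (the prior)

# For each category: (sum of y in that category + a*p) / (count + a)
stats = df.groupby("vehicle")["y"].agg(["sum", "count"])
greedy_ts = (stats["sum"] + a * p) / (stats["count"] + a)

# Replace each category with its smoothed target statistic
df["vehicle_ts"] = df["vehicle"].map(greedy_ts)
print(df)
```

Note that each row’s own label takes part in its encoding here, which is exactly the leakage problem the next example illustrates.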

If we calculate the Target Statistic using the formula above, we get:

| Categorical Column | Target Column | Target Statistic |
|---|---|---|
| A | 0 | \(\frac{ap}{1+a}\) |
| B | 1 | \(\frac{1+ap}{1+a}\) |
| C | 0 | \(\frac{ap}{1+a}\) |
| D | 1 | \(\frac{1+ap}{1+a}\) |
| E | 0 | \(\frac{ap}{1+a}\) |

Calculation of Greedy Target Statistic with smoothing


Now if I use this Target Statistic column as my training data, I will get a perfect split at \( threshold = \frac{0.5+ap}{1+a}\). Anything above this value will be classified as 1 and anything below will be classified as 0. I have a perfect classification at this point, so I get 100% accuracy on my training data.

Let’s take a look at the test data. Here, since we are assuming that the feature has all unique values, the Target Statistic becomes:
\[TS = \frac{0+ap}{0+a} = p\]
If the \(threshold\) is greater than \(p\), all test data predictions will be \(0\). Conversely, if the \(threshold\) is less than \(p\), all test data predictions will be \(1\), leading to poor performance on the test set.

Although we rarely see datasets where values of a categorical variable are all unique, we do see cases of high cardinality. This extreme example shows the pitfalls of using Greedy Target Statistic as an encoding approach.

Leave One Out Target Statistic

So the Greedy TS didn’t work out so well for us. Let’s try another method: the Leave One Out Target Statistic. At first glance, this looks promising. But, as it turns out, it too has its problems. Let’s see how, with another extreme example. This time, let’s assume that our categorical variable \(x^i\) has only one unique value, i.e., all values are the same. Consider the data below:

| Categorical Column | Target Column |
|---|---|
| A | 0 |
| A | 1 |
| A | 0 |
| A | 1 |

Example data for an extreme case where a categorical feature has just one unique value


If we calculate the leave-one-out target statistic, we get:

| Categorical Column | Target Column | Target Statistic |
|---|---|---|
| A | 0 | \(\frac{n^+ - y_k + ap}{n-1+a}\) |
| A | 1 | \(\frac{n^+ - y_k + ap}{n-1+a}\) |
| A | 0 | \(\frac{n^+ - y_k + ap}{n-1+a}\) |
| A | 1 | \(\frac{n^+ - y_k + ap}{n-1+a}\) |

Calculation of Leave One Out Target Statistic with smoothing (the current sample is excluded, so the denominator uses \(n-1\))


Here:
  • \(n\) is the total number of samples in the data (in our case, 4)
  • \(n^+\) is the number of positive samples in the data (in our case, 2)
  • \(y_k\) is the value of the target column in that row

Substituting the above, we get:

| Categorical Column | Target Column | Target Statistic |
|---|---|---|
| A | 0 | \(\frac{2 + ap}{3+a}\) |
| A | 1 | \(\frac{1 + ap}{3+a}\) |
| A | 0 | \(\frac{2 + ap}{3+a}\) |
| A | 1 | \(\frac{1 + ap}{3+a}\) |

Substituting the values of n and n⁺


Now, if I use this Target Statistic column as my training data, I will get a perfect split at \( threshold = \frac{1.5+ap}{3+a}\). Anything above this value will be classified as 0 and anything below will be classified as 1. I have a perfect classification at this point, so I again get 100% accuracy on my training data.

You see the problem, right? My categorical variable, which has only a single unique value, is producing different Target Statistic values that will perform great on the training data but will fail miserably on the test data.

Ordered Target Statistic

Illustration of ordered learning: CatBoost processes data in a randomly permuted order and predicts each sample using only the earlier samples. Image by author

CatBoost introduces a technique called Ordered Target Statistic to address the issues discussed above. This is the core principle of CatBoost’s handling of categorical variables.

This method, inspired by online learning, uses only past data to make predictions. CatBoost generates a random permutation (random ordering) of the training data, \(\sigma\). To compute the Target Statistic for a sample at row \(k\), CatBoost uses samples from row \(1\) to \(k-1\). For the test data, it uses the entire training data to compute the statistic.

Additionally, CatBoost generates a new permutation for each tree, rather than reusing the same permutation each time. This reduces the variance that can arise in the early samples.
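
As a rough illustration (a toy re-implementation, not CatBoost’s internal code), the ordered statistic can be computed by walking through a random permutation and only ever using labels that have already been “seen”:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

df = pd.DataFrame({
    "cat": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "y":   [1,   0,   1,   0,   1,   0,   1,   0],
})

a = 1.0
p = df["y"].mean()  # prior, used when a category has no history yet

perm = rng.permutation(len(df))   # the random permutation sigma
ordered_ts = np.empty(len(df))
sums, counts = {}, {}             # running per-category statistics

for idx in perm:
    c, y = df.loc[idx, "cat"], df.loc[idx, "y"]
    # Encode this row using only rows that came earlier in the permutation
    ordered_ts[idx] = (sums.get(c, 0.0) + a * p) / (counts.get(c, 0) + a)
    # Only now is this row's label revealed to the running statistics
    sums[c] = sums.get(c, 0.0) + y
    counts[c] = counts.get(c, 0) + 1

df["cat_ordered_ts"] = ordered_ts
print(df)
```

For the test set, you would apply the statistics accumulated over the entire training set, matching how CatBoost treats test-time encoding as described above.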

Ordered Boosting

This visualization shows how CatBoost computes residuals and updates the model: for sample xᵢ, the residual rᵗ(xᵢ, yᵢ) = yᵢ − Mᵢ₋₁ᵗ⁻¹(xᵢ) is computed with a model trained only on earlier data points. Source

Another important innovation introduced by the CatBoost paper is its use of Ordered Boosting. It builds on similar principles as ordered target statistics, where CatBoost randomly permutes the training data at the start of each tree and makes predictions sequentially.

In traditional boosting methods, when training tree \(t\), the model uses predictions from the previous tree \(t−1\) for all training samples, including the one it is currently predicting. This can lead to target leakage, as the model may indirectly use the label of the current sample during training.

To address this issue, CatBoost uses Ordered Boosting where, for a given sample, it only uses predictions from previous rows in the training data to calculate gradients and build trees. For each row \(i\) in the permutation, CatBoost calculates the output value of a leaf using only the samples before \(i\). The model uses this value to get the prediction for row \(i\). Thus, the model predicts each row without looking at its label.

CatBoost trains each tree using a new random permutation, which averages out the variance that early samples in any single permutation would otherwise introduce.

Let’s say we have 5 data points: A, B, C, D, E. CatBoost creates a random permutation of these points. Suppose the permutation is: σ = [C, A, E, B, D]

| Step | Data Used to Train | Data Point Being Predicted | Notes |
|---|---|---|---|
| 1 | (none) | C | No previous data → use prior |
| 2 | C | A | Model trained on C only |
| 3 | C, A | E | Model trained on C, A |
| 4 | C, A, E | B | Model trained on C, A, E |
| 5 | C, A, E, B | D | Model trained on C, A, E, B |

Table highlighting how CatBoost uses a random permutation to perform training

This avoids using the actual label of the current row to get its prediction, thus preventing leakage.
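
Here is a deliberately simplified sketch of that idea, using an ordinary regression tree as the stand-in model and refitting it on the prefix of the permutation at every step. Real CatBoost maintains these supporting models far more efficiently; this only shows where each prediction comes from:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # five samples (think A..E), three numeric features
y = rng.normal(size=5)

perm = rng.permutation(5)     # the random permutation, e.g. [C, A, E, B, D]
preds = np.zeros(5)           # prediction for each row, made without its own label
prior = 0.0                   # used for the first row, which has no history

for pos, idx in enumerate(perm):
    if pos == 0:
        preds[idx] = prior    # step 1: no previous data -> fall back to the prior
    else:
        seen = perm[:pos]     # rows that appear before idx in the permutation
        model = DecisionTreeRegressor(max_depth=2).fit(X[seen], y[seen])
        preds[idx] = model.predict(X[idx:idx + 1])[0]

residuals = y - preds         # residuals/gradients computed without leakage
print(residuals)
```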

Building a Tree

Each time CatBoost builds a tree, it creates a random permutation of the training data. It calculates the ordered target statistic for all the categorical variables with more than two unique values. For a binary categorical variable, it maps the values to zeros and ones.

CatBoost processes data as if the data is arriving sequentially. It begins with an initial prediction of zero for all instances, meaning the residuals are initially equivalent to the target values.

As training proceeds, CatBoost updates the leaf output for each sample using the residuals of the previous samples that fall into the same leaf. By not using the current sample’s label for prediction, CatBoost effectively prevents data leakage.

Split Candidates

CatBoost bins continuous features to reduce the search space for optimal splits. Each bin edge and split point represents a potential decision threshold. Image by author

At the core of a decision tree lies the task of selecting the optimal feature and threshold for splitting a node. This involves evaluating multiple feature-threshold combinations and selecting the one that gives the best reduction in loss. CatBoost does something similar: it discretizes each continuous variable into bins to simplify the search, then evaluates each feature-bin combination to determine the best split.
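
As a rough sketch of the binning step (CatBoost controls the number of bins through its border_count setting; the quantile-based borders below are just one plausible scheme, not necessarily CatBoost’s default):

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.lognormal(size=1_000)   # a skewed continuous feature

n_bins = 16
# Candidate thresholds at the quantile boundaries of the feature
borders = np.quantile(feature, np.linspace(0, 1, n_bins + 1))[1:-1]

# Every sample is mapped to a bin; the split search only considers the borders
bin_index = np.digitize(feature, borders)

print(f"{len(borders)} candidate thresholds instead of up to {len(np.unique(feature)) - 1}")
```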

CatBoost also uses Oblivious Trees, a key difference compared to other implementations: the same split is applied across all nodes at the same depth.

Oblivious Trees

Comparison between Oblivious Trees and Regular Trees: the Oblivious Tree applies the same split condition at every node of a given level, producing a symmetric structure, while the Regular Tree applies different conditions at each node. Image by author

Unlike standard decision trees, where different nodes can split on different conditions (feature-threshold pairs), Oblivious Trees use the same condition across all nodes at the same depth of a tree. At a given depth, all samples are evaluated against the same feature-threshold combination. This symmetry has several implications:

  • Speed and simplicity: since the same condition is applied across all nodes at the same depth, the trees produced are simpler and faster to train, as the sketch after this list illustrates
  • Regularization: since all nodes at a given depth are forced to apply the same condition, there is a regularization effect on the predictions
  • Parallelization: the uniformity of the split condition makes it easier to parallelize tree construction and to use GPUs to accelerate training
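
Because every level shares a single (feature, threshold) pair, a depth-d oblivious tree can be stored as just d conditions plus 2^d leaf values, and prediction reduces to assembling a d-bit index. A small sketch of that representation (illustrative, not CatBoost’s actual data structures):

```python
import numpy as np

# A depth-3 oblivious tree: one (feature index, threshold) pair per level
features    = np.array([0, 2, 1])          # which feature is tested at each depth
thresholds  = np.array([0.5, 1.2, -0.3])   # threshold applied at that depth
leaf_values = np.arange(8, dtype=float)    # 2**3 leaves (dummy values here)

def predict(x: np.ndarray) -> float:
    # Each level contributes one bit; together the bits index straight into the leaves
    bits = (x[features] > thresholds).astype(int)
    leaf = int((bits * (2 ** np.arange(len(bits)))).sum())
    return leaf_values[leaf]

print(predict(np.array([0.7, 0.0, 2.0])))   # the sample falls into one of the 8 leaves
```

This fixed layout is also part of why oblivious trees evaluate so quickly: the whole tree is a handful of vectorized comparisons followed by a table lookup.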

Conclusion

CatBoost stands out by directly tackling a long-standing challenge: how to handle categorical variables effectively without causing target leakage. Through innovations like Ordered Target Statistics, Ordered Boosting, and the use of Oblivious Trees, it efficiently balances robustness and accuracy.

If you found this deep dive helpful, you might enjoy another deep dive on the differences between the Stochastic Gradient Classifier and Logistic Regression.


An intuitive introduction to Hypothesis Testing with (almost) no maths
https://towardsdatascience.com/an-intuitive-introduction-to-hypothesis-testing-210277ddf09/
Sat, 15 May 2021 12:14:10 +0000
A super basic introduction to hypothesis testing where we look at a few examples and develop intuition. You’ll never have to relearn it!

An intuitive introduction to Hypothesis Testing with no maths (almost)
Photo by Testalize.me on Unsplash

In this post, I am going to do a basic demonstration of Hypothesis Testing. I will try to keep it as math-free as possible and focus on giving you an overall idea of the concepts involved, so you can develop some intuition for Hypothesis Testing.

Okay, so let’s get started.

An Example

I am going to present you with a hypothesis, and the hypothesis is this: the average height of women is 169 cms.

We’ll ignore the potential difference between women of different age groups and just keep it simple. Now, although we call this a hypothesis in the everyday sense, in Statistics a hypothesis has a formal meaning as something that can be tested.

So let’s try and put it to the test. Assume we take a sample of 20 women and their mean height comes to 168.6 cms.

So what does this observation mean for this hypothesis? I want you to stop for a few seconds and think: how much doubt does it cast on our hypothesis?

We started by saying that the average is 169 cms, but now we have this sample and its mean is slightly less than that. Does it really cast a lot of doubt on the hypothesis?

Probably not, right? We randomly selected a small sample but it is possible that the selected women were slightly shorter due to random variation. So it’s not totally outside the realm of possibility that our hypothesis is still correct.

Taking a different sample

Now let’s imagine another scenario where we yet again randomly sample 20 women, but this time the average of their heights is 161 cms. This time the average is a bit farther away than the last time.

So let me ask you again: how much doubt does this cast on our hypothesis?

Do you think this time the observation that the average height is 161 cms in a sample casts more doubt than the last example even though we sampled the same number of people?

The average height is quite a bit lower than 169, so you have to start thinking: how likely is it that this sample just happens to fall this far below the true mean by random chance?

Although we haven’t dealt with any formulas or math yet, you have just done your first hypothesis test.

Let’s formalize things a little bit now. To be more specific, we call the original hypothesis the null hypothesis and represent it as H₀; in this case, the null hypothesis was that the true mean is 169 cms. And then we’ve got this other thing called the alternative hypothesis, often represented as H₁. Under the alternative hypothesis, the true mean is not 169 cms.

Null and Alternative Hypothesis

So the question is: is our sample mean far enough away from 169 centimeters for us to be able to reject that original hypothesis? In the second example, you thought: maybe it is far enough away from 169, and maybe there is now quite a bit of doubt cast over that null hypothesis.

Formalizing Hypothesis Testing

Let’s consider a number line representing the possible values of the sample mean. We typically represent the sample mean with x̄ and the population mean with μ. If you took a sample and took its mean, you would expect it to come out to 169, but you also know that, due to random variation, it could easily be 168 or 170.

What hypothesis testing will do is set these critical boundaries beyond which we are going to start rejecting the null hypothesis. So when our sample gets too extreme, we start doubting our null hypothesis a lot more. In our example with the sample mean at 161 cms, it could well be too far away from 169 and be in that rejection zone.

A Second Example

Now let’s try another example, another hypothesis. The second hypothesis we have is this: the average weight of people in our town is 74 kgs.

Like before, we are gonna take a sample and put this hypothesis to the test. This time around, though, we take only five people to make up our sample, and we find that the average weight of these 5 chosen people is only 68 kgs.

And again, we’re going to ask ourselves the same question: how much doubt does this cast on our hypothesis? So stop for a few seconds and think about it.

Now, even though the difference between the weights is large, you might be thinking that it is alright, since we are considering just 5 samples. Maybe, just maybe, we selected slightly lighter people, and if we selected yet another 5 people, the overall average weight could go up.

Now let me pose a slightly modified scenario. This time around we sample 500 people instead of 5, but the sample mean is still 68 kgs. How much doubt does this cast on our hypothesis?

You’re going to start thinking: yeah, this is casting a lot of doubt now. The average weight hasn’t changed a bit, but what has changed is the number of observations in the sample. We now have 500 people in the sample, which means we’re more confident about the sample mean, and I think intuitively you can see how this works. Take Amazon reviews, for example: a product rated 4.8 stars with 5 ratings vs a product rated 4.8 stars with 1200 ratings. You would feel more confident about the second product.

If you had an even bigger sample and found the average weight was still 68, you’re gonna start thinking: well, we’ve pretty much got almost the entire town here now; the true average weight is probably not 74.

Our null hypothesis here is that the true population mean is 74 and our alternative hypothesis is that it’s not 74.

Null and Alternative Hypothesis

So, in our sample of five people, the critical values that we would draw to determine whether to reject the null hypothesis might be far away from 74. But when we had 500 people in our sample, those critical values beyond which we’re going to reject the null hypothesis are actually going to be quite close to 74 itself. So if we get a sample mean at, say, 71 or maybe 77, that may be enough evidence to reject the null hypothesis.

Larger samples widen the rejection zone

Now, so far we haven’t done any actual calculations, but there are ways for us to calculate those exact regions and determine whether we’re going to reject the null hypothesis or not.

And through those calculations, which we’ll see shortly, what we are really trying to answer is this: if the null hypothesis were true, how extreme is our sample? This really is the core question that a hypothesis test tries to answer numerically.

Although at the start of the article I mentioned we are going to keep this math-free, I’ll give you just one formula to help solidify the intuitions we’ve developed here.

\[Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}\]
The extremeness of a sample

This Z is a measure of extremeness. When Z is close to zero, our sample lines up pretty much with what we would expect if the null hypothesis were true. And if you have a look at the numerator of that formula, you can see that if the sample mean equals the hypothesized mean μ, then we get zero. The larger the gap between x̄ and μ, the larger the Z value, meaning our sample is more extreme and we’re more likely to reject the null hypothesis. We saw this happen in the first example.

Also, you might notice there is an n in the denominator, which represents the sample size. We now understand how that works too: as n increases, the value of Z also increases (for the same difference between means), meaning again that we’re more likely to reject the null. So as our sample size increases, we are more likely to reject the null unless the difference between the hypothesized population mean and the sample mean is small.
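
If you want to see the effect of n numerically, here is a quick calculation for the weight example. The population standard deviation of 15 kgs is a made-up value purely for illustration:

```python
import math

mu = 74       # hypothesized population mean (kgs)
x_bar = 68    # observed sample mean (kgs)
sigma = 15    # assumed population standard deviation (illustrative only)

for n in (5, 500):
    z = (x_bar - mu) / (sigma / math.sqrt(n))
    print(f"n = {n:3d}  ->  Z = {z:6.2f}")

# n =   5  ->  Z =  -0.89   (not very extreme)
# n = 500  ->  Z =  -8.94   (extremely unlikely if the null were true)
```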

Based on what we saw above, we know that we’re more likely to reject H₀ when:

  • the difference between the sample mean and the hypothesized mean is greater, like in the first example, and
  • the number of observations is greater, like in the second example.

Hypothesis testing, in essence, involves three steps

Hypothesis Testing Process

Type 1 and Type 2 Errors

And that brings us to our last concept for this article, which is Type 1 and Type 2 errors.

Type 1 and Type 2 Errors

When we do hypothesis testing, we make our decisions on the basis of the evidence at hand, not a 100% guaranteed proof. We merely state that there is enough evidence to act one way or the other. This is always true in statistics! Because of this, whatever the decision, there is always a chance that we made an error: there is always a chance of drawing an extreme sample that pushes us toward the wrong decision about the null hypothesis.

Earlier, in the examples, we had our rejection zones where we were going to reject the null hypothesis. A Type 1 error occurs when we reject a null hypothesis that is in fact true, and we can never remove that possibility of a Type 1 error.

Rejection Zones are arbitrary and are selected by the person doing the testing

The probability of a Type 1 error is called the level of significance, or alpha (α). And you can actually choose your level of significance: you can choose how strict you want to be with your decision to reject the null hypothesis. Often we use a level of 5%, but that’s just convention. The regions that we’ve drawn up here, where we’re going to reject the null hypothesis, are completely arbitrary; we’ve just decided on them based on this level of strictness, which we call the significance level (α).
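
For a two-sided test at the conventional 5% level, those boundaries sit roughly ±1.96 standard errors away from the hypothesized mean; a one-liner with scipy shows where that number comes from:

```python
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)   # two-sided critical value
print(round(z_crit, 2))            # 1.96: reject H0 when |Z| exceeds this
```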

A Type 2 error occurs when you do not reject a null hypothesis that is in fact false. Say the sample mean lay very close to 169 cms and we therefore chose not to reject the null hypothesis. It is, of course, still possible that the true population mean is different from 169; it could be anything, and in that case we’d be committing a Type 2 error. Again, this is a probability we can’t fully eliminate: there is always going to be some chance of committing a Type 2 error when we conduct a hypothesis test. This probability is called beta (β), and one minus beta is called the power of the hypothesis test (1-β).

And that concludes our introduction to Hypothesis Testing.

Tuning XGBoost Hyperparameters with Scikit Optimize
https://towardsdatascience.com/how-to-improve-the-performance-of-xgboost-models-1af3995df8ad/
Fri, 16 Aug 2019 12:00:28 +0000
Using automated hyperparameter tuning to improve model performance

XGBoost is no longer an exotic model that a select few could understand and use. It has become a benchmark to compare against in many scenarios. The interest in XGBoost has also dramatically increased in the three and a half years since the paper first proposing the algorithm was published. Google Trends suggests that the interest in XGBoost is currently at an all-time high.

Relative popularity of XGBoost since April 2016

What is the challenge?

Now, while XGBoost has gained popularity in the last few years, some challenges make it a little difficult to use. The most stubborn one is deciding what hyperparameter values to use: what should be the number of estimators, what is the maximum depth that should be allowed, and so on. What makes XGBoost more difficult to manage than, say, a linear/logistic regression model or a decision tree is that it has a lot more hyperparameters than many other models. Simply hand-tuning them is going to take a lot of time and patience, and you still can’t be sure if you are moving in the right direction or not.


Is there a solution?

You might think that if only we could somehow automate this boring and tiring ritual of tuning the hyperparameters, our lives would be a lot easier. And if you are wondering that, then rest assured: yes, there are ways we can improve the performance of XGBoost models without doing all of that manual labor, letting the computer figure it out for us.

This process of selecting the correct values of hyperparameters is called Hyperparameter Tuning, and there are prebuilt libraries that can do it for us. So all we need to do is write some code, give it some input values, and let the computer figure out the best values.


How do we do it?

Before we get into the code and get our hands dirty, let us pause here for a minute and ask ourselves: if we were a computer and had been given the same problem, how would we do it?

Let’s think this through. I’ll assume we have only two hyperparameters in this situation because it makes things easier to visualize. The first thing I am going to do is build a model with any random values for our two hyperparameters and see how my model performs.

The next thing I would do is increase one hyperparameter and keep the other fixed, just to see how my model’s performance responds to an increase in one of these parameters. If my model’s performance improves, that means I am moving this hyperparameter in the right direction. If not, I now know I need to decrease its value. In the next iteration, I’ll change the value of my other hyperparameter and see how my model reacts to that change. I’ll do this a few times, and once I have seen my model react to these changes enough times, I’ll start changing both simultaneously.

So what we are doing here is that we are leveraging the output of the previous model and changing the values of our hyperparameters accordingly. In the sections below, we are going to use a package developed specifically to help us do this.


The Idea

In this tutorial, we are going to use the scikit-optimize package to help us pick suitable values for our models. Now, to be able to use these algorithms, we need to define an area for them to search in, a search space. Imagine you needed to find the highest point in a region and had a friend willing to do that task for you. For your friend to be able to do that efficiently, you would have to tell him what area you want him to search. You would have to define an area/space within which you want him to spend his time searching.

The same goes for skopt and other such algorithms. The algorithm is your friend in this case, and you need it to search for the best set of hyperparameter values. For the algorithm to be successful, you need to define a search area.

These are the steps that we are going to follow

  • create a dummy dataset using sklearn’s make_classification functionality. We are going to be performing binary classification, so our labels will contain zeros and ones
  • define a search space/area over which you want the model to search for the best hyperparameter values
  • define a function that fits the model with different hyperparameter values and measures the model performance
  • define how many times you want to train your model
  • use skopt’s gp_minimize algorithm to search our space and give us the results

And that’s it. That’s all you need to do. So let’s take a look at the code now.


The Code

You’ll need the skopt package, which you can install by entering pip install scikit-optimize at the command line.

Note: links to all code snippets are provided below the code boxes. If any code looks incomplete, click on the GitHub link to find the full code

  • Imports

https://gist.github.com/vectosaurus/6ae1b455c7527bf25954e834b8b49a89

  • Creating the dataset

https://gist.github.com/vectosaurus/a53683c4d6fde396ce43b92c045a5a00

  • Defining the search space. Here we define a search space by providing the minimum and maximum value each hyperparameter can take.

https://gist.github.com/vectosaurus/dc922b0b8d9db080bd75b736c2e09b0b

  • Function to fit the model and return the performance of the model

https://gist.github.com/vectosaurus/73d257cb0be61d88ab788bb70f22b857

Here, in the return statement, I return 1 - test_score instead of the test_score. That is because skopt’s gp_minimize works to find the minimum, while we aim to find the maximum f1_score we can get. So by getting gp_minimize to minimize 1 - test_score, we are maximizing the test_score.

  • Running the algorithm

https://gist.github.com/vectosaurus/f686bfc43fdec8f378d0eeae074e5fe1
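
Since the gists may not render outside GitHub, here is a compact, self-contained sketch of the whole workflow. The search ranges, n_calls, and random seed are illustrative choices, not necessarily the exact values used in the original gists:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from skopt import gp_minimize
from skopt.space import Integer, Real
from skopt.utils import use_named_args
from xgboost import XGBClassifier

# 1. Dummy binary-classification dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 2. Search space: a minimum and a maximum for each hyperparameter (illustrative ranges)
space = [
    Integer(50, 500, name="n_estimators"),
    Integer(2, 10, name="max_depth"),
    Real(0.01, 0.3, prior="log-uniform", name="learning_rate"),
    Real(0.5, 1.0, name="subsample"),
    Real(0.5, 1.0, name="colsample_bytree"),
]

# 3. Objective: fit a model with the proposed values and return 1 - F1
@use_named_args(space)
def objective(**params):
    model = XGBClassifier(random_state=0, n_jobs=-1, **params)
    model.fit(X_train, y_train)
    test_score = f1_score(y_test, model.predict(X_test))
    return 1 - test_score   # gp_minimize minimizes, so we flip the sign

# 4 & 5. Let gp_minimize search the space for a fixed number of evaluations
result = gp_minimize(objective, space, n_calls=30, random_state=0)

print("best 1 - F1:", result.fun)
print("best hyperparameters:", result.x)
```

result.x lists the best values in the same order as the search space, so it can be zipped with the dimension names when refitting a final model.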


Results

https://gist.github.com/vectosaurus/2f6ff57ac4cb4e728d32c65a155ceb78

The XGBoost model starts with a test_score of 0.762 on the first iteration but ends up at an F1 score of 0.837, an increase of over seven percentage points.


Conclusion

While automated hyperparameter tuning helps improve model performance in many circumstances, it is still necessary to pay close attention to the data. Probing the data and engineering informative variables can, in many cases, be much more effective.

Also, it is important that the search space we select for the model be meaningful. Simply declaring a very large space will hurt the quality of the search these algorithms can perform.

Hyperparameter tuning is quite effective, but we need to make sure we provide a fair enough search space and a reasonable number of iterations. Automated hyperparameter tuning reduces the human effort, but it doesn’t reduce the complexity of the program.
