
Boosting Model Accuracy: Techniques I Learned During My Machine Learning Thesis at Spotify (+Code…

A tech data scientist's stack to improve stubborn ML models

Image by Author

This article is one of a two-part piece documenting my learnings from my Machine Learning Thesis at Spotify. Be sure to also check out the second article on how I implemented Feature Importance in this research.

Feature Importance Analysis with SHAP I Learned at Spotify (with the Help of the Avengers)

In 2021, I spent 8 months building a predictive model to measure user satisfaction as part of my Thesis at Spotify.

My goal was to understand what made users satisfied with their music experience. To do so, I built a LightGBM classifier whose output was a binary response:

  • y = 1 → the user is seemingly satisfied
  • y = 0 → not so much

Predicting human satisfaction is a challenge because humans are by definition unsatisfied. Even a machine isn’t so fit to decipher the mysteries of the human psyche. So naturally my model was as confused as one can be.

From Human Predictor to Fortune Teller

My accuracy score was around 0.5, which is as bad as it gets for a binary classifier. It means the algorithm had a 50% chance of predicting yes or no, which is as random as a human guess.

So I spent 2 months trying and combining different techniques to improve the prediction of my model. In the end, I was finally able to improve my ROC score from 0.5 to 0.73, which was a big success!

In this post, I will share with you the techniques I used to significantly enhance the accuracy of my model. This article might come in handy whenever you’re dealing with models that just won’t cooperate.

Due to the confidentiality of this research, I cannot share sensitive information, but I’ll do my very best for it not to sound confusing.


But first, make sure to subscribe to my newsletter!

Click on the link below & I’ll send you more personalized content and insider tips to help you on your journey to becoming a Data Scientist!

Join +1k readers 💌 who follow my journey as a Data Scientist in Tech + Spotify, don’t miss out! (medium.com)


0. Data Preparation

Before diving into the methods I used, I just want to make sure you get the basics right first. Some of these methods rely on encoding your variables and preparing your data accordingly in order for them to work. Some of the code snippets I’ve included also reference user-defined functions I created in the data preparation section, so be sure to check them.

Here’s what my pipeline looked like, in the order I implemented things

1. Encode Variables

Make sure your variables are encoded:

  • Ordinal features, so that the model preserves the ordinal information
  • Categorical features, so that the model can interpret nominal data

So first, let’s store our variables somewhere. Again, because the research is confidential, I cannot disclose the data I used, so let’s use these instead:

# Ordered category values for each ordinal feature
region = ['APAC', 'EU', 'NORTHAM', 'MENA', 'AFRICA']
user_type = ['free', 'premium']   # order matters: free → 0, premium → 1

# Column names (ordinal_cols) and their ordered categories (ordinal_list)
ordinal_cols = ['region', 'user_type']
ordinal_list = [region, user_type]

Then, make sure to build the function that encodes the variables:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

def var_encoding(X, cols, ordinal_list, encoding):

    #Encode ordinal variables, preserving their order
    if encoding == 'ordinal_ordered':
        encoder = OrdinalEncoder(categories=ordinal_list)
        encoder.fit(X.loc[:, cols])
        X.loc[:, cols] = encoder.transform(X.loc[:, cols])

    #Encode categorical (nominal) variables as arbitrary integers
    elif encoding == 'ordinal_unordered':
        encoder = OrdinalEncoder()
        encoder.fit(X.loc[:, cols])
        X.loc[:, cols] = encoder.transform(X.loc[:, cols])

    #One-hot encode: each category becomes its own 0/1 column
    else:
        encoder = OneHotEncoder(handle_unknown='ignore')
        encoded = pd.DataFrame(encoder.fit_transform(X.loc[:, cols]).toarray(),
                               columns=encoder.get_feature_names_out(cols),
                               index=X.index)
        X = pd.concat([X.drop(columns=cols), encoded], axis=1)

    return X

Then apply that function to your variables. This means you need to create lists holding the names of your variables as strings, i.e. one list for your ordinal variables, one for the categorical ones, and one for the numerical ones.

def encoding_vars(X, ordinal_cols, ordinal_list, categorical_cols=None, preprocessing_categoricals=False):

    #Encode ordinal variables
    X = var_encoding(X, ordinal_cols, ordinal_list, 'ordinal_ordered')

    #Encode categorical variables
    if preprocessing_categoricals:
        X = var_encoding(X, categorical_cols, None, 'ordinal_unordered')

    #Else set your categorical variables as 'category' if needed
    else:
        for cat in (categorical_cols or []):
            X[cat] = X[cat].astype('category')

    #Rename your variables if needed to keep track of the encoding
    #An encoded feature such as user_type will no longer show free or premium, but 0 or 1
    X = X.rename(columns={'user_type': 'free_0_premium_1'})
    X.reset_index(drop=True, inplace=True)

    return X

2. Split the Data

Split your dataframe to get your train, validation, and test sets:

  • Train Set – to train the model on the algorithm you pick, e.g. LightGBM
  • Validation Set – to tune your hyperparameters and optimize your prediction results
  • Test Set – to make your final predictions

🔊 Keep in mind

In my research, I split the data twice for two different purposes. The first split happens at the very beginning to create the train, validation, and test sets based on a user-level split. The other split happens further down, during cross-validation and hyperparameter tuning.

The initial split allows for a more flexible and randomized division of data, which ensures a good diversity of users in each set. The test set is set aside for final model evaluation, while the train and validation sets are used for model development and hyperparameter tuning.

In my research, I used GroupShuffleSplit as follows:

from sklearn.model_selection import GroupShuffleSplit

def split_df(df, ordinal_cols, ordinal_list, target):
    #splitting train and test
    splitter = GroupShuffleSplit(test_size=.13, n_splits=2, random_state=7)
    split = splitter.split(df, groups=df['user_id'])
    train_inds, test_inds = next(split)

    train = df.iloc[train_inds]
    test = df.iloc[test_inds]

    #splitting validation and test
    splitter2 = GroupShuffleSplit(test_size=.5, n_splits=2, random_state=7)
    split = splitter2.split(test, groups=test['user_id'])
    val_inds, test_inds = next(split)

    val = test.iloc[val_inds]
    test = test.iloc[test_inds]

    #defining X and y
    X_train = train.drop([target], axis=1)
    y_train = train[target]

    X_val = val.drop([target], axis=1)
    y_val = val[target]

    X_test = test.drop([target], axis=1)
    y_test = test[target]

    #encoding the variables in the sets based on a pre-defined encoding function
    X_train = encoding_vars(X_train, ordinal_cols, ordinal_list)
    X_val = encoding_vars(X_val, ordinal_cols, ordinal_list)
    X_test = encoding_vars(X_test, ordinal_cols, ordinal_list)

    return X_train, y_train, X_val, y_val, X_test, y_test

1. Feature Engineering

Feature engineering made a huge difference in improving the accuracy of my model.

When it comes to user listening satisfaction, I wanted to know whether it was more dependent on the user, their streaming behavior, or other factors. While the preliminary user data I had was meaningful, it lacked sufficient information gain and predictive power.

The most significant step in my optimization process then became creating new features that could better capture user satisfaction.

As the name suggests, creating new features is a creative process: you need to sit down, put your domain knowledge to work, and think through novel ways to capture important information.

The two main methods I used in this process were:

  1. Feature Interaction. The most important transformation I did was to combine existing features to create ratios. Example: Let’s say I have a feature measuring total minutes streamed, and another one tracking minutes streamed when tracks are new releases. One thing I could do here would be to divide the new-release minutes by the total minutes streamed to create a "new music streams ratio". This captures completely new information.

  2. Feature Aggregation. Another thing I did was aggregate data over time and groups to create summarized features, such as the mean or standard deviation. This means you can create the same features over different aggregates per time window. Example: averaging the number of tracks streamed per day per playlist over the last 7, 14, and 30 days. And voilà, you just unlocked new information (see the sketch after this list for both techniques).
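To make both ideas concrete, here is a minimal sketch on made-up column names (`minutes_streamed`, `minutes_new_releases`, and `tracks_streamed` are hypothetical stand-ins, not the actual research features):

import pandas as pd

# Hypothetical streaming log: one row per user per day
logs = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2021-03-01', '2021-03-02', '2021-03-03',
                            '2021-03-01', '2021-03-02']),
    'minutes_streamed': [120, 80, 60, 30, 45],
    'minutes_new_releases': [30, 10, 15, 0, 9],
    'tracks_streamed': [40, 25, 20, 12, 15],
})

# 1. Feature interaction: new-release minutes over total minutes streamed
logs['new_music_ratio'] = logs['minutes_new_releases'] / logs['minutes_streamed']

# 2. Feature aggregation: rolling mean of daily tracks per user over
#    the last 7, 14, and 30 rows (days, since there is one row per day)
logs = logs.sort_values(['user_id', 'date'])
for window in (7, 14, 30):
    logs[f'tracks_mean_{window}d'] = (
        logs.groupby('user_id')['tracks_streamed']
            .transform(lambda s: s.rolling(window, min_periods=1).mean())
    )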

🔊 Keep in mind

Feature engineering is also an iterative process. You may need to experiment with different combinations of features, transformations, and techniques to find the best set of features for your specific problem.

Always validate the performance of your model with the new features on a separate validation set to ensure that the improvements are not due to overfitting.


2. Feature Selection

So I was feeding many features to my model without really knowing which ones were relevant. We may think that the more variables we have, the better our model will learn, but if our model is learning from everything, including garbage, this ends up doing more harm than good.

Having too many features means that some of them could introduce noise into the model, which is bad because it:

  1. Hides the underlying patterns or relationships within the data.
  2. Leads to overfitting as the model learns from the noise rather than the true relationships.
  3. Increases complexity and slows down training.

To avoid all these problems, we go chasing down the culprits using methods such as Pearson’s Correlation Coefficient, Recursive Feature Elimination, or Chi2 Test, amongst many others.

In my case, I used the first two.


Pearson’s Correlation Coefficient

This coefficient measures the linear relationship between two variables.

It is the ratio between the covariance of two features and the product of their standard deviations. The final output is between -1 and 1 where 1 suggests a positive linear relationship between variables and -1 a negative one.
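In symbols, for two features X and Y:

r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}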

Pearson’s correlation coefficient serves 2 purposes in feature selection:

  1. Filter out the least important features, which tend to show a low correlation with the target variable.
  2. Limit multicollinearity between variables to avoid overfitting that may arise with data redundancy.

Why use it? It’s a computationally cheap statistical method for picking up the intrinsic properties of dependent variables.

How to use it? Correlation heatmaps point out the linear relationships between the variables:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def corr_matrix(df):
    # Select the upper triangle of the absolute correlation matrix
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

    return upper

def cols_todrop(corr_matrix, threshold):
    # Find features with a correlation greater than the threshold you pick
    to_drop = [col for col in corr_matrix.columns if any(corr_matrix[col] > threshold)]

    return to_drop

# Get a ranking of the 10 feature pairs with the highest correlation
upper = corr_matrix(data)
upper.unstack().sort_values(ascending=False)[:10]

# Plot the correlation heatmap
plt.figure(figsize=(16, 6))

heatmap = sns.heatmap(data.corr(), vmin=-1, vmax=1, annot=True, cmap='BrBG')

heatmap.set_title('Correlation Heatmap', fontdict={'fontsize': 15}, pad=12)
plt.savefig('heatmap_pre.png', dpi=300, bbox_inches='tight')

plt.show()

🚨 Be careful with non-linear relationships!

Sometimes non-linear relationships between variables might also exist, which means you might want to be careful when filtering out multicollinear features.

Detecting non-linear relationships can provide more nuanced and accurate insights into the data, which means you may want to keep them. To do so, you can use alternative methods such as Spearman’s Rank Correlation, Kendall’s Tau, Scatter Plots, etc…
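For instance, here is a quick way to compare the linear and rank-based views of the same data, sketched on the `data` DataFrame from the snippet above (pandas supports all three correlation methods out of the box):

# Pearson only captures linear relationships
pearson_corr = data.corr(method='pearson')

# Spearman and Kendall are rank-based, so they also pick up
# monotonic but non-linear relationships
spearman_corr = data.corr(method='spearman')
kendall_corr = data.corr(method='kendall')

# Pairs where Spearman is much stronger than Pearson hint at
# non-linear relationships worth inspecting with a scatter plot
gap = (spearman_corr.abs() - pearson_corr.abs()).unstack()
gap.sort_values(ascending=False)[:10]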


Recursive Feature Elimination

Recursive Feature Elimination (RFE) recursively narrows down the feature set by weighting and ranking features with an importance algorithm. Starting with all features, it fits the chosen machine learning model, ranks the features, and iterates with smaller and smaller subsets until it reaches the desired feature count (the one you initially set).

Why use it? The result is a ranking of features by importance, which allows us to kick out features with the least predictive power from the party.

🚨 Be careful with encoding!

RFE requires prior numerical encoding of categorical variables in order to work, so refer back to the initial section for encoding variables.

import lightgbm as lgb
from sklearn.feature_selection import RFE

# RFE needs an estimator exposing feature importances or coefficients;
# here a LightGBM classifier, consistent with the rest of this article
model = lgb.LGBMClassifier(objective='binary', random_state=314)

selector = RFE(model, n_features_to_select=30, step=1)
selector = selector.fit(X_train, y_train)

# Map each feature to its RFE ranking (1 = kept / most important)
rfe_vars_keys = list(X_train.columns)
rfe_vars_values = list(selector.ranking_)
rfe_vars = dict(zip(rfe_vars_keys, rfe_vars_values))

sorted(rfe_vars.items(), key=lambda x: x[1])

I combined the results of these 2 methods when filtering out the least important features:

  • Using Pearson’s Correlation Coefficient, I found no strong linearity between the dependent features and the target variable. So I kept all of them (I was also scared of removing non-linear relationships).
  • Using Recursive Feature Elimination, I removed the lowest-ranked features (because why not).

3. Hyperparameter Tuning

Hyperparameter tuning is a mandatory stop when optimizing a machine learning model. It’s the part where you search for a combination of parameters that gives your model great performance.

In my research, I used a two-step strategy combining GroupKFold cross-validation with RandomizedSearchCV for hyperparameter tuning, which was the best combination given that:

  1. The sample data was very large (300k users).
  2. The user data needed to be split appropriately (we don’t want to find K’s streaming data in all splits, no no).

Step 1: Preventing Data Leakage with GroupKFold

My data consisted of multiple records for individual users. Because data gets split for hyperparameter tuning, I needed to prevent data leakage by ensuring that information from the same user was not split between the training and validation sets.

The method I used is GroupKFold, which divides the data into training and validation folds using a different portion of the dataset at each iteration. This creates separate sets with distinct, non-overlapping users.

This is crucial for achieving a reliable performance assessment, as you want your model to be tested on entirely unseen users, not just new data from users it has seen during training.
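As a quick sanity check, here is a small sketch (reusing the `df`, `user_id`, and `target_variable` names from above) showing that GroupKFold never lets a user appear on both sides of a split:

from sklearn.model_selection import GroupKFold

X = df.drop(['target_variable'], axis=1)
y = df['target_variable']
groups = df['user_id']

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups)):
    train_users = set(groups.iloc[train_idx])
    val_users = set(groups.iloc[val_idx])
    # All records of a given user land entirely on one side of the split
    assert train_users.isdisjoint(val_users)
    print(f'fold {fold}: {len(train_users)} train users, {len(val_users)} val users')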

Step 2: Efficient Hyperparameter Tuning with RandomizedSearchCV

My sample data was around 300k users, which was the largest one I could afford without triggering a system crash, given my computational capabilities. Using RandomizedSearchCV is much more efficient when your sample is this large. It works wonders.

Instead of searching through all possible hyperparameter combinations like a traditional grid search would do, it randomly samples a subset of the hyperparameter space. Then it evaluates the performance of the selected combinations using cross-validation.

✨Results

By combining these two, I performed hyperparameter tuning on multiple data subsets with non-overlapping users. This way I was able to:

  1. Address data leakage concerns
  2. Ensure computational efficiency
  3. Implement a robust basis for hyperparameter selection

import lightgbm as lgb
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

# Search space to sample from (illustrative values, tune to your own problem)
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1],
    'num_leaves': [31, 63, 127],
    'min_child_samples': [20, 50, 100],
    'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.7, 0.8, 1.0],
}

def grid_search(X, y, groups):
    # GroupKFold keeps each user's records within a single fold
    gkf = GroupKFold(n_splits=5).split(X, y, groups)
    model = lgb.LGBMClassifier(objective='binary', verbose=-1, max_depth=-1, random_state=314, metric='None', n_estimators=5000)

    # Randomly sample 100 combinations from the search space
    grid = RandomizedSearchCV(
        model, param_grid, scoring='roc_auc', random_state=314,
        n_iter=100, cv=gkf, verbose=10, return_train_score=True, n_jobs=-1)

    return grid

grid = grid_search(X, y, groups)

%%time
grid.fit(X, y)

# Printing the best hyperparameters
best_params = grid.best_params_

After we’re done identifying the best hyperparameters through RandomizedSearchCV and GroupKFold, we use the initial train and validation sets from GroupShuffleSplit to train the final model with the selected hyperparameters.

Remember that split_df() function we created at the very beginning of this article? We’re using it in this step to get our data split.

# We split the data using our initial function
X_train, y_train, X_val, y_val, X_test, y_test = split_df(df, ordinal_cols, ordinal_list, target='target_variable')

Then we plug in the best parameters found with hyperparameter tuning.

# We train the model using the best_params that we got from HP Tuning 
clf = lgb.LGBMClassifier(objective='binary', max_depth=-1, random_state=314, metric='auc', n_estimators=5000, num_threads=16, verbose=-1,
                         **best_params)
%%time
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], eval_metric='auc')

from sklearn.metrics import roc_auc_score

# Test set evaluates the final performance of the model on unseen users
# (use predicted probabilities rather than hard labels for ROC AUC)
roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

🔊 Keep in mind

I’m mentioning this because it confused me a lot while I was working on this research. The eval_set is used for monitoring the model’s performance on a specific validation set during training. This is different from cross-validation, which evaluates the model’s ability to generalize across multiple training-validation splits.


4. Data Generation

After implementing all the previous steps, my model still needed an extra boost. Because some groups in my data were more underrepresented than others, my model struggled a wee bit to generalize across them.

So I made sure to generate a larger random sample of users for all of the underrepresented sets. This last step gave my model exactly what it needed to properly generalize all that beautiful wisdom from the data and make reliable predictions.
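I can’t share the actual sampling logic, but here is a rough sketch of the idea, assuming a hypothetical `region` column marks the groups and a larger pool of eligible users (`full_pool`) is available to draw from:

import pandas as pd

# Count how many distinct users each group currently contributes
users_per_group = df.groupby('region')['user_id'].nunique()
target_size = users_per_group.max()

extra_frames = []
for group, n_users in users_per_group.items():
    if n_users < target_size:
        # Candidate records from the larger pool, for users not already sampled
        candidates = full_pool[(full_pool['region'] == group) &
                               (~full_pool['user_id'].isin(df['user_id']))]
        available = candidates['user_id'].drop_duplicates()
        extra_users = available.sample(n=min(target_size - n_users, len(available)),
                                       random_state=7)
        extra_frames.append(candidates[candidates['user_id'].isin(extra_users)])

# Underrepresented groups now contribute roughly as many users as the largest one
df_balanced = pd.concat([df, *extra_frames], ignore_index=True)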


Last Word

Keep in mind that the process of optimizing a model is an iterative one, which means that you may have to combine and repeat some of these methods until you reach a satisfying performance.

Optimization Methods Recap

  1. Feature Engineering – Creating new features using different methods such as feature aggregation, transformation, temporal data encoding, standardization and more can introduce new information to the data.
  2. Feature Selection – After creating new features, evaluate their importance and remove irrelevant or redundant features that do not contribute to model performance. Some methods include Pearson’s Correlation Coefficient, Recursive Feature Elimination, or Chi2.
  3. Hyperparameter Tuning – Preventing data leakage with GroupKFold, then searching for the best parameters with RandomizedSearchCV in a computationally efficient way.
  4. Data Generation – Make sure groups are equally represented in the sample and if needed and possible, increase the sample size to cover a larger sample of data points.

I have GIFTS for you 🎁 !

Sign up to my newsletter K’s DataLadder and you’ll automatically get my ultimate SQL cheat sheet with all the queries I use every day in my job in big tech + another secret gift!

I share each week what it’s like to be a Data Scientist in Tech, alongside practical tips, skills, and stories all meant to help you level up – because no one really knows until they’re in it!

If you haven’t done that already

  • Subscribe to my YouTube channel. New video coming up very soon!
  • Follow me on Instagram, LinkedIn, or X, whatever works for you

See you soon!

