
A 6-Month Detailed Plan to Build Your Junior Data Science Portfolio

Step-by-step guide to creating, polishing, and deploying a portfolio that helps you land your first job


If you’ve just finished your degree or are looking for your first job, this article is for you. If you’re still working on your degree or haven’t started your Data Science journey yet, you might want to check out this article first.

As you know, the data science job market is more competitive than ever. Simply having a degree or academic projects isn’t enough to differentiate yourself from the crowd. You need practical, hands-on projects that show your skills in action.

For those who don’t know me, my journey started ten years ago with a degree in applied mathematics from an engineering school. Since then, I’ve worked across various industries, from water to energy, and spent time as a lecturer. I’ve also hired junior data scientists, and I’m here to show you how to build the perfect portfolio to help you land your first job.


On Today’s Menu 🍔

  • 🍛 How to plan your 6-month journey to create your Data Science Portfolio.
  • 🍔 The prep work to get started.
  • 🥤 The 8 projects that will skyrocket your portfolio.
  • 🍰 Deploying your portfolio effectively.

Let’s talk about planning 📅


If you’re pursuing a data science career, you probably enjoy planning and staying organized. Assuming you’re currently in a job-search phase, I’ve created this timeline around the idea that you’ll dedicate 10 hours per week to building your portfolio. Of course, if you have more availability or are a bit busier, feel free to adjust this plan accordingly.

  • This plan starts in January 2025.
  • The hours allocated for each project assume that you’ve already taken a data science course and have basic knowledge of each topic. It’s okay if you haven’t worked with image/text data yet or haven’t used the cloud or set up a database. You should at least be familiar with Python, Pandas, NumPy, some visualization libraries, basic Machine Learning algorithms, and a bit of SQL.
[Timeline chart: created by the author]

With this schedule, you’ll still have two weeks available at the end of June. I’ve also assumed that over the next six months, you might take two weeks off ☀️.

To get the most out of your Portfolio-building journey, I recommend setting up all necessary tools and accounts beforehand. This way, you can stay focused on your projects and data without interruptions. The only account I suggest creating later is for the cloud, as most providers offer a free tier that lasts about one month, and you’ll want to save that for deployment.

Your 5-Hour Prep Work to Get Everything Ready ⏱️✨

1. Install Anaconda or Miniconda

Anaconda or Miniconda is essential for managing packages and environments. Install one of them to get started.

2. Prepare Your Conda Environments

Familiarize yourself with basic conda commands (this isn’t the focus of this tutorial). Then, create the following environments to avoid issues with library installation over the next few months:

  • Machine Learning Projects Environment: Install Pandas, NumPy, Scikit-Learn, StatsModels, Seaborn, Matplotlib, and Plotly.
  • SQL Project Environment: Install the necessary packages to connect Python to your database.
  • Deep Learning Projects Environment: For image and text data, install TensorFlow and libraries needed for data preparation and feature extraction.
  • Deployment and Monitoring Projects Environment: Install ML packages along with MLflow and FastAPI. Later, add packages for your chosen cloud provider (e.g., Azure, AWS) as needed.
conda create -n ml_env 
conda create -n sql_env 
conda create -n dl_env 
conda create -n deploy_env

Once your environments are created, activate each one individually and install the necessary packages using the requirements.txt files. Feel free to change the packages if you prefer other libraries, but the ones included should cover most of your needs.
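
For example, to set up the machine learning environment (assuming each project folder contains a requirements.txt):

conda activate ml_env
pip install -r requirements.txt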

3. Install VS Code

  • VS Code is a great code editor that integrates well with Python and Jupyter Notebooks. Download and install it from https://code.visualstudio.com/.
  • Install the Jupyter plugin for VS Code to work with Notebooks.
  • Open a notebook and ensure you know how to switch between environments in VS Code: Open the Command Palette in VS Code (Cmd/Ctrl + Shift + P), then select Python: Select Interpreter. You should see the environments you’ve created listed here.

4. Set Up GitHub

  • If you don’t have a GitHub account, create one now. If you already do, go to GitHub and create a repository for each project (you will have 8 projects to work on for your portfolio; you can check out the project names below).
  • Back in your VS Code terminal, or any terminal you prefer, navigate to a main directory where you’ll store all your projects. You might name it "Portfolio." Clone each repository one by one into this directory:
git clone <your-repo-url>

5. Install SQL with the Sakila Database

  • SQL skills are essential for data science, and the Sakila database is a great resource for practicing SQL queries. To install MySQL, download the installer from MySQL’s website.
  • Then, download the Sakila database using this guide: How to Use the Sakila Database in MySQL. (You will use it in one of your projects.)
  • Now, open any test notebook and confirm that you can connect Python to MySQL.
# Create a connection to the MySQL database
# (requires a MySQL driver such as mysqlclient or PyMySQL)
from sqlalchemy import create_engine

engine = create_engine("mysql://root:root@localhost:3306/sakila", echo=True)
conn = engine.connect()
print(engine)

6. Install Tableau and Create your Cloud Accounts

  • Tableau is ideal for data visualization and creating interactive dashboards, which you’ll need for Project 2. Download Tableau Public (free) from Tableau’s website.
  • Cloud Accounts: At this point, start considering which cloud service you’d like to use. I personally recommend Azure because it’s user-friendly and easier to debug. Heroku is also a great option for API deployment, and if you have a GitHub Student account, you can use it for free!

Bravo! You’ve successfully set up your working environment, and you’re now ready to begin your 6-month journey to build your portfolio. Let’s jump into it.


Project 1: Analyze Global Education and Economic Data

Goal: Analyze global education and economic data to understand trends and identify key indicators. This data is challenging and requires advanced data preparation skills.

Data Source: World Bank Education Statistics (EdStats)

Steps:

  • Import the datasets using pandas.
  • Display summary statistics and distribution plots.
  • Quickly select the relevant columns, countries, and years to work with (the dataset is very large, so reduce it early on).
  • Merge the necessary files using different merging techniques.
  • Analyze missing data and detect outliers. Replace NaNs using simple methods such as the median.
  • Create various charts to visualize data availability by year.
  • Reshape the data into a comprehensive format using advanced pandas functions like pivot and melt to organize data by country, indicator, and year.
  • Standardize the data and compute scores to capture trends across regions.
  • Explore advanced methods for data imputation (e.g., K-Nearest Neighbors) and feature reduction (e.g., Principal Component Analysis).
  • Use advanced statistical tests (e.g., ANOVA) to understand the relationships between the data.
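
As an illustration, here is a minimal sketch of the reshaping step. The file and column names (EdStatsData.csv, "Country Name", "Indicator Name", year columns) mirror the EdStats layout, but treat them as assumptions to adapt to your own selection:

import pandas as pd

df = pd.read_csv("EdStatsData.csv")

# Wide to long: one row per (country, indicator, year)
long_df = df.melt(
    id_vars=["Country Name", "Indicator Name"],
    value_vars=[str(y) for y in range(2000, 2016)],
    var_name="Year",
    value_name="Value",
)

# Long to wide: one column per indicator, indexed by country and year
wide_df = long_df.pivot_table(
    index=["Country Name", "Year"],
    columns="Indicator Name",
    values="Value",
)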

Libraries: pandas, numpy, matplotlib, seaborn, scipy, missingno, sklearn, statsmodels, plotly.

Project 2: Python, MySQL, and Tableau with Sakila database


Goal: Analyze and visualize data from the Sakila database using SQL and Tableau.

Data Source: Sakila Database (MySQL)

Methodology: Design an ER model, execute advanced SQL queries, connect to Tableau, and build interactive dashboards.

Steps:

1. Connect Python to MySQL:

  • Use SQLAlchemy to connect Python to the MySQL Sakila database.
  • Run advanced SQL queries to extract key insights, such as top actors, revenue by category, and rental patterns.
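
As a sketch, here is one such query (revenue by category), run through the engine set up during the prep work; the joins follow the standard Sakila schema:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql://root:root@localhost:3306/sakila")

# Total revenue per film category
query = """
SELECT c.name AS category, SUM(p.amount) AS revenue
FROM payment p
JOIN rental r ON p.rental_id = r.rental_id
JOIN inventory i ON r.inventory_id = i.inventory_id
JOIN film_category fc ON i.film_id = fc.film_id
JOIN category c ON fc.category_id = c.category_id
GROUP BY c.name
ORDER BY revenue DESC;
"""
revenue_by_category = pd.read_sql(query, engine)
print(revenue_by_category.head())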

2. Connect Tableau to MySQL:

  • In Tableau, connect to the local Sakila MySQL database.
  • Import relevant tables and set up relationships as needed for analysis.

3. Build Visualizations in Tableau:

  • Revenue by Category: Create a bar chart showing revenue distribution by category.
  • Top Movies by Rentals: Display a chart of the most rented movies.
  • Store Performance: Visualize rental counts and revenue for each store.
  • Time-Series Trends: Analyze rental activity over time with a line chart.

4. Dashboard Creation:

  • Combine visualizations into a Tableau dashboard, adding filters for interactive insights.
  • Save your dashboard as a Tableau file. Consider publishing it on Tableau Public to get a free URL, or take screenshots to include in a presentation.

Libraries and tools: pandas, SQLAlchemy (for MySQL connection), Tableau.

Project 3: Predicting Energy Consumption


Goal: Predict energy consumption of buildings to aid in climate action goals.

Data Source: Seattle’s 2016 Building Energy Benchmarking

Methodology: Use machine learning to analyze and predict building energy consumption.

Steps:

  • Clean and preprocess data.
  • Perform feature engineering to create meaningful variables.
  • Normalize numerical features and encode categorical variables.
  • Split data into training and testing sets.
  • Benchmark different models (e.g., Ridge/Lasso Regressions, SVM, RandomForest, XGBoost).
  • Tune the hyperparameters to optimize the model’s performance metrics.
  • Evaluate with additional metrics: R², RMSE, and MAE.
  • Select the best model.
  • Evaluate the model on test data and check consistency with the training results (to spot overfitting).
  • Interpret model results and feature importance.
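
Here is a minimal sketch of the benchmarking and tuning step, assuming your cleaned features are in X and the target (e.g., site energy use) is in y:

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# X and y are assumed to come from your preprocessing steps
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),  # useful for Ridge/Lasso/SVM in the same benchmark
    ("model", RandomForestRegressor(random_state=42)),
])

# Cross-validated search over a few hyperparameters
grid = GridSearchCV(
    pipe,
    param_grid={"model__n_estimators": [100, 300], "model__max_depth": [None, 10]},
    scoring="r2",
    cv=5,
)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
print("R2:", r2_score(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("MAE:", mean_absolute_error(y_test, y_pred))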

Libraries: pandas, numpy, matplotlib, seaborn, sklearn, shap, plotly.

Project 4: Customer Segmentation


Goal: Segment customers using clustering techniques to identify distinct groups within Brazilian e-commerce data.

Data Source: Brazilian E-Commerce Public Dataset by Olist

Methodology: Apply unsupervised learning to discover customer segments based on purchasing behaviour.

Steps:

  • Merge and clean data.
  • Conduct feature engineering to extract useful features.
  • Experiment with clustering algorithms (KMeans, DBSCAN, AgglomerativeClustering).
  • Optimize hyperparameters (e.g., number of clusters) based on the silhouette score, Davies-Bouldin index, and distortion metrics.
  • Analyze and visualize clusters.
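
For example, a sketch of the cluster-count search for KMeans, assuming your engineered features are in a scaled array X:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Compare cluster counts with silhouette, Davies-Bouldin, and distortion (inertia)
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    print(
        f"k={k}: silhouette={silhouette_score(X, labels):.3f}, "
        f"davies_bouldin={davies_bouldin_score(X, labels):.3f}, "
        f"distortion={km.inertia_:.0f}"
    )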

Libraries: pandas, numpy, matplotlib, seaborn, sklearn, yellowbrick.

Project 5: Image Classifier


Goal: Implement a deep learning model to classify images from the STL-10 dataset.

Data Source: STL-10 Image Recognition Dataset

Methodology: Explore and apply convolutional neural networks (CNNs) and transfer learning for image classification.

Steps:

  • Explore the dataset.
  • Preprocess images (resize, normalize).
  • Build a CNN from scratch.
  • Use transfer learning with pre-trained models.
  • Train the model and adjust hyperparameters.
  • Evaluate model accuracy and make predictions.
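
A minimal transfer-learning sketch with Keras, assuming the STL-10 images are loaded as arrays of shape (96, 96, 3) with 10 classes; MobileNetV2 is just one possible backbone:

import tensorflow as tf
from tensorflow.keras import layers, models

# Pre-trained backbone, frozen: only the new classification head is trained
base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights="imagenet"
)
base.trainable = False

model = models.Sequential([
    layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects inputs in [-1, 1]
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.2),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10)  # arrays assumed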

Libraries: pandas, numpy, matplotlib, tensorflow, keras, cv2, skimage.

Project 6: Stack Overflow Questions Tags


Goal: Predict tags for Stack Overflow questions using NLP techniques.

Data Source: Stack Overflow API or dataset.

Methodology: Utilize natural language processing to classify text data into multiple tags.

Steps:

  • Clean the text data (e.g., lowercasing, removing HTML tags, punctuation, and stop words).
  • Extract features with TF-IDF.
  • Apply ML algorithms (e.g., Logistic Regression) for multi-label classification.
  • Experiment with feature extraction using advanced NLP models (BERT, Doc2Vec).
  • Evaluate model performance and adjust hyperparameters.
  • Use the same metric, such as accuracy, to compare the classical and the more advanced methods.
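
A sketch of the classical baseline, assuming the question texts are in a list texts and their tags in a list of tag lists tag_lists:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import make_pipeline

# Binarize the tags: one column per tag for multi-label classification
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tag_lists)  # e.g., [["python", "pandas"], ["sql"], ...]

# TF-IDF features + one logistic regression per tag
clf = make_pipeline(
    TfidfVectorizer(max_features=20000, stop_words="english"),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(texts, Y)

print(mlb.inverse_transform(clf.predict(["How do I merge two dataframes in pandas?"])))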

Libraries: pandas, numpy, matplotlib, sklearn, tensorflow, gensim, spacy, transformers.

Project 7: API and Dashboard


Goal: Deploy a model as an API and build a dashboard for real-time interaction.

Data Source: Use the model from Project 6.

Methodology: Serialize a machine learning model and deploy it via an API, then build a dashboard with Streamlit for user interaction.

Steps:

  • Serialize the model using joblib.
  • Create an API with FastAPI for model inference.
  • Develop a dashboard using Streamlit to interact with the API.
  • Deploy the API and dashboard on platforms like Heroku or a cloud service. (At this step, ensure you have your cloud account set up. If you want to go further, you can also install Docker and use it before deployment.)
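
A minimal sketch of the inference API, assuming the tag model and binarizer from Project 6 were serialized as model.joblib and mlb.joblib (hypothetical file names):

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # pipeline from Project 6 (hypothetical file name)
mlb = joblib.load("mlb.joblib")      # fitted MultiLabelBinarizer (hypothetical file name)

class Question(BaseModel):
    text: str

@app.post("/predict")
def predict(question: Question):
    tags = mlb.inverse_transform(model.predict([question.text]))[0]
    return {"tags": list(tags)}

# Run locally with: uvicorn main:app --reload

Your Streamlit dashboard can then simply send requests to this /predict endpoint and display the returned tags.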

Libraries and tools: fastapi, streamlit, pickle, joblib, docker, azure/heroku.

Project 8: Monitoring and MLOps


Goal: Implement MLOps practices including model monitoring, experiment tracking, and automated deployment.

Data Source: Use Project 7.

Methodology: Integrate MLflow for experiment tracking, and add automated deployment with GitHub Actions.

Steps:

  • Set up MLflow for experiment tracking and model versioning.
  • Deploy model artifacts to a cloud storage solution.
  • Configure GitHub Actions for CI/CD to automate the API deployment (if you want to go further, you can also use Docker for this step).
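
A sketch of basic experiment tracking with MLflow, assuming a fitted scikit-learn model (clf) and a test-set metric (accuracy) from the earlier projects:

import mlflow
import mlflow.sklearn

with mlflow.start_run(run_name="tfidf_logreg_baseline"):
    mlflow.log_param("max_features", 20000)   # example hyperparameter
    mlflow.log_metric("accuracy", accuracy)   # computed on your test set (assumed)
    mlflow.sklearn.log_model(clf, "model")    # versioned model artifact (clf assumed)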

Libraries and tools: MLflow, GitHub Actions, Azure, pickle, joblib, docker.


Woohoo, congrats! 🎉 You’ve reached the final stage of your portfolio. Now it’s time to refine everything. Don’t forget that you have 10 hours planned for that. Follow these steps:

1. Clean and Document

  • Add markdown explanations to your notebooks as you go.
  • Add comments in your Python files and clean or optimize the code where needed.

2. Push to GitHub

  • Upload all projects to GitHub.
  • Include a requirements.txt file for each project.

3. Create a README for Each Project, containing:

  • Project Title, Description, and Data Sources
  • Installation instructions
  • Project Structure
  • Results and Analysis
  • Limitations and Future Work

4. Dashboard Projects

  • If the dashboard URL isn’t accessible, take screenshots and create a clean slide deck. Add these to GitHub.

You can stop here, as you should now have a complete and clear portfolio on GitHub. However, if you want to stand out even more, consider deploying your portfolio on a website 🌐. I personally suggest the following options:

  • GitHub Pages: Start here for a quick, free portfolio site. It’s ideal for linking to your GitHub repositories, sharing project descriptions, and organizing your work in one place.
  • Streamlit: Use this for interactive projects to showcase your data science and machine learning skills. It’s easy to deploy apps directly from GitHub, adding a dynamic layer to your portfolio.
  • Wix or WordPress: Consider these later if you want a polished, customizable site with extra content like a bio or blog posts. They’re perfect for creating a visually engaging portfolio without coding.

Some advice📌 :

  • Before Jan 2025: Complete your prep work and set up your environment (install necessary tools, create GitHub repositories, and set up virtual environments).
  • Plan your time: Schedule 10 hours a week for your projects and try to keep a consistent weekly time slot.
  • Break down each project into smaller tasks. Map out each step in detail before starting each project.
  • Ask for code reviews at the end of each project from someone in your data network.
  • Push regularly: Don’t wait until the end to tidy up code or push to GitHub. Push regularly to avoid losing work.
  • Stay flexible: Adjust your plan based on other commitments (e.g., work, internship, job research).
  • Keep a learning log: After each project, write down what you’ve learned with a brief explanation. Note any concepts you used but didn’t fully understand so you can revisit them later. This log will be valuable for interview prep.
  • Use Notion, Word, or Google Drive: For tracking your progress, keep your log somewhere reliable so you won’t lose it.

I’ve mentored hundreds of junior data scientists and hired for various teams on behalf of my clients. If you follow this portfolio plan, it’ll make your journey much smoother.

Keep learning, stay positive, and you’ll do great! Good luck!

Thank you for reading!

Note: Some parts of this article were initially written in French and translated into English with the assistance of ChatGPT.

If you found this article informative and helpful, please don’t hesitate to 👏 and follow me on Medium | LinkedIn.

