Skills a Data Scientist Must Have (But a Software Engineer Doesn’t)

I guided a friend through her career transition to being an ML engineer.

A software engineer putting on data science glasses – Image generated with FLUX.1 [schnell] by the author.

Mentoring teaches me a lot.

I recently had the opportunity to guide a friend who worked as a software engineer (SE) for two years and wants to transition to a Data Science (DS) role. What started as a casual chat eventually became several hours of outlining plans to become a data scientist.

Her first question was, "What new things should I learn?"

Of course, I could list a dozen things in a minute, but a real answer requires much more than a list of skills and links to popular courses. FYI, I never answered this question directly, not even in this post.

Some of her existing skills are invaluable to her new endeavor, and she could fast-track her transition by building on them carefully. But what’s more important is learning to think like a data scientist.

She doesn’t have to unlearn anything. However, some of her SE skills have little use in data science.

In this post, I summarize some of our discussions. These include which area of data science suits her interests, which new skills she needs to acquire, and how to start small and grow faster.

The Most Valuable LLM Dev Skill is Easy to Learn, But Costly to Practice.

How do data scientists think differently from software engineers?

This is not to say that one’s work is easier than the other’s. However, a DS has significantly different goals and responsibilities than an SE.

SEs care about designing, developing, and maintaining software. Mostly, an SE’s work is deterministic: they know the outcome they are building for, and there’s often a finite set of techniques to achieve it.

SEs still have to make many choices in their work and sometimes have to make course corrections. But these uncertainties often have predictable solutions.

On the other hand, a data scientist’s work is filled with more unknowns. There’s no guarantee that the data they have is of good quality, and the DS will have to work with it regardless. No one can tell which ML model would work best for the problem at hand; it’s more or less trial and error. Which evaluation metric is more important than the others? And what’s an acceptable threshold for it? These are all questions a data scientist has to figure out on the go.

Due to these reasons, a DS has a more experimental mindset. As a professional SE, you’d often work in a framework like SCRUM, which expects you to finish the work in a strict timeframe called a sprint – and experimental work rarely fits neatly into sprints.

Data scientists are one kind of data science professional.

Although we use the names interchangeably, not all data science professionals are data scientists. Several other roles sit alongside data scientists.

Most people employed as data scientists are either analysts, machine learning engineers, or data engineers.

The most common role in data science is analyst. Analysts play a key role in business decision-making. They extract key insights from the data they have and educate management. Analysts don’t have to be programmers, either. Most of the analysts I know work with only Excel.

This is How I Create Dazzling Dashboards Purely in Python.

Machine learning engineers’ core responsibility is to train models that solve business problems. They work on data preprocessing, model selection, and training. As Andrew Ng points out, data scientists in industrial setups (a role that closely resembles ML engineers) don’t try to invent new models. Instead, they try to find the best data and preprocessing techniques to solve problems.

3 Ways to Deploy Machine Learning Models in Production

On the other hand, data engineers create and maintain data and infrastructure. This involves creating data pipelines and warehouses and managing databases.

Data engineers and ML engineers often work hand in hand. On some tasks, like model deployment, they both work together.

In the 8 Key MLOps Roles, Where Do You Fit In?

Those wanting to be data scientists often need to pick one. Each role has its own challenges and its own tools to solve them. For instance, while ML engineers concentrate on tools like Scikit-learn, TensorFlow, and Pandas, data engineers focus on tools like Airflow, SQL, and cloud infrastructure management.

It is now clear that the technologies used in these different data science roles are fairly different and require an investment of your time to master. For this reason, we rarely see a role that says full stack data scientist, though the full stack software developer role is very common.

How to pick the correct data science role for you

As I already mentioned, I don’t recommend that anyone try to master all the different data science roles, although my own career has been kind of a kitchen sink.

The best thing is to choose a role that suits your personality and pick it as early as possible.

Here’s my guide (opinionated, of course).

If you prefer less programming and less technical work, and you’d rather focus on aiding business decision-making while staying in a data-related role, you’d be happy as a data analyst.

Knowing some data-wrangling techniques in Python is helpful but not a must. I’ve had colleagues who continuously challenged me, claiming they could do in Excel the things I was doing in Python. Guess what: sometimes they won.

However, you should discuss with your HR how your performance is measured, because the insights you extract from data don’t have any tangible value on their own. They need to be acted upon, and even then, there’d be a delay before their success shows. Thus, traditional evaluation techniques won’t apply to you.

Are you a person who doesn’t care about numbers, insights, or models? Do you consider yourself more like a software engineer than a data scientist? But do you want to be a data scientist anyway? Consider the data engineer role.

You’d still do a lot of coding and work with databases.

Lastly, if you like to train models, evaluate them, perform data preprocessing, build pipelines, etc., an ML engineer role is a better fit than the others. In an industrial setup, ML engineers don’t create new models. Instead, they use data work and model selection to find the best solutions to the requirements.

Skills you’ll rarely use as a data scientist (If you were a software engineer before)

As a former software engineer and now a data scientist, I know a few SE skills that are of little use to me today. Here they are.

Design patterns and principles

Software design patterns distinguish skilled SEs from amateurs. These best practices allow us to develop software that is easy to maintain, scale, and reuse components. Since other developers widely recognize patterns, it helps a new person understand and collaborate with you quickly.

Design principles guide better coding and help maintain the codebase. I strictly follow SOLID design principles whenever I develop software.

Likewise, design patterns are like templates you can borrow to solve problems without wasting time.

I couldn’t say these are completely useless. If you’ve used libraries like Pandas, scikit-learn (or literally anything), they use these patterns and principles.

But as a person who mainly cares about getting insights from a dataset you just received, or about developing pipelines that execute a set of instructions one after another, you’d rarely need them.

Web development frameworks (like Django)

I was a Django developer. I loved doing it.

You might ask, "What if a data scientist wants to package their work in a web app? Don’t they need Django?"

They certainly did in the past.

However, today, that need has largely been replaced by tools like Streamlit or BI tools like Tableau and Power BI. You don’t have to program everything yourself.

6 Python GUI Frameworks to Create Desktop, Web, and Even Mobile Apps.

I recently used Django to create a workaround for Streamlit’s authentication issues. I also documented the methods in two previous posts (Authentication and authorization). Streamlit at that time didn’t have these features, but today this advice is outdated.

Agile project management (such as scrum)

During the transition, this was my biggest challenge. For software engineers who want to put features out every other week, a system like SCRUM is super helpful.

But this wasn’t easy when working with data – especially when the client owns the data.

Most of the time, the work we do is experimental. There’s no guarantee that you’d find the perfect ML model within the timespan of a sprint. Every time you see new data, you see new challenges that make time estimations tricky.

Even an analyst can’t say in advance that they will extract a particular number or insight within two weeks.

I remember what a manager said in my early days as a data analyst. "Torture the data until it starts screaming insight." And you don’t know when that’ll happen.

I then asked seasoned data scientists and SCRUM experts how to implement these principles in data science projects. They didn’t know.

New skills you may have to learn as a data scientist

Finally, we’re now ready to discuss the purpose of this post – what skills a software engineer wanting to be a data scientist must learn.

This is what my unofficial mentee and I eventually agreed on.

I’m a believer in the 80/20 rule: a small subset of skills (20%) makes the most impact (80%) in your career.

These 5 SQL Techniques Cover ~80% of Real-Life Projects

Your job is to self-assess your relevant skills and discover the 20% of new skills you need to learn to take you to your future self.

My friend, an SE and Django developer, already knew Python well. She was also good at SQL. Python and SQL alone would make her a fine data analyst.

However, she thinks machine learning is cooler than extracting insights from datasets. This raises more questions: should she use traditional ML or deep learning models? Should she focus on model fitting, computer vision, or NLP?

Here’s what I suggested.

Regardless of the path, all data scientists use Pandas to some extent. You could get started with it in a couple of hours (just like any other library), but to be a data scientist, you should have real proficiency in Pandas. Therefore, that’s the first skill.

I didn’t suggest any courses or resources for learning Pandas; she should be able to find good ones by googling for a few minutes.

Next, scikit-learn. Again, a lot of data scientists use it. Even if you’re working on deep learning projects, you’d still be using some of the modules in scikit-learn. It’s worth learning.

Scikit-learn isn’t simply a library for producing ML models; it also covers data preprocessing, model selection, and more.
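
For instance, here is a small sketch (the dataset and model choices are arbitrary, purely for illustration) of scikit-learn handling preprocessing, modeling, and model selection together:

```python
# A minimal sketch: preprocessing, a model, and cross-validated model
# selection chained together (dataset and model choices are arbitrary).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# scaling + classifier in one pipeline, scored with 5-fold cross-validation
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(pipeline, X, y, cv=5).mean())
```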

Then, a little bit of NLP. I suggested the library TextBlob. It’s a wrapper around NLTK, the holy grail of NLP libraries in Python.
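
For a quick taste of what it gives you almost for free (the example sentence is mine; noun phrase extraction may require a one-time NLTK corpora download):

```python
# A quick taste of TextBlob; the example sentence is arbitrary.
from textblob import TextBlob

blob = TextBlob("I love learning data science, but debugging can be painful.")
print(blob.sentiment)      # polarity and subjectivity scores
print(blob.noun_phrases)   # extracted noun phrases
```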

Finally, I’ve asked her to master either Plotly or Streamlit. Streamlit is easier to learn, and many other data scientists use it. However, with her experience in SE, something like Plotly would still be within easy reach.

Python, SQL, Pandas, Scikit-learn, Streamlit, and TextBlob are good candidates for the top 20 percent of skills that have an 80% impact on her career.

Anyone could argue that these aren’t sufficient for data scientists in today’s competitive market. I agree. But with this, she could figure out where to go next.

She’d choose tools like OpenCV and PyTorch if she goes into computer vision. If she advances in NLP, she’d choose libraries like SpaCy.

Final thoughts

A lot of us want a career change. Few get their dream job as their first job.

In this post, I’ve summarized what I’ve discussed with a friend who is an SE but wants to be a data scientist.

We figured out she wanted to be an ML engineer more than an analyst or a data engineer. Since she already knew Python and SQL, picking up what she needed to learn was easy.

We concluded that anyone wanting to be an ML engineer can focus on Python, SQL, Pandas, Scikit-learn, Streamlit, and TextBlob, as these are the top skills that cover most of your day-to-day life as a data scientist.


Thanks for reading, friend! Besides Medium, I’m on LinkedIn and X, too!

Roadmap to Becoming a Data Scientist, Part 3: Machine Learning

From beginner to pro: key machine learning skills for data science aspirants


Introduction

Data Science is undoubtedly one of the most fascinating fields today. Following significant breakthroughs in machine learning about a decade ago, data science has surged in popularity within the tech community. Each year, we witness increasingly powerful tools that once seemed unimaginable. Innovations such as the Transformer architecture, ChatGPT, the Retrieval-Augmented Generation (RAG) framework, and state-of-the-art computer vision models – including GANs – have had a profound impact on our world.

However, with the abundance of tools and the ongoing hype surrounding AI, it can be overwhelming – especially for beginners – to determine which skills to prioritize when aiming for a career in data science. Moreover, this field is highly demanding, requiring substantial dedication and perseverance.

The first two parts of this series focused on the essential math and software skills for becoming a data scientist. In this part, we will dive into probably the most exciting topic: the necessary machine learning skills!

This article will focus solely on the machine learning skills necessary to start a career in Data Science. Whether pursuing this path is a worthwhile choice based on your background and other factors will be discussed in a separate article.

Roadmap to Becoming a Data Scientist, Part 1: Maths

Roadmap to Becoming a Data Scientist, Part 2: Software Engineering

Maths + Engineering → Machine learning

Machine learning is a very diverse domain, but to be successful in it, it is essential to have solid skills in both math and software engineering.

  • Math knowledge reinforces a deep understanding of the logic behind algorithms, which is useful for choosing better solutions, easier debugging, and grasping more complex ideas.
  • Software engineering allows for the efficient implementation of algorithms and pipelines in code using the best development practices.
Artificial Intelligence vs Machine Learning vs Deep Learning

01. Introduction

Before diving directly into algorithms, it is necessary to understand several important fundamental blocks. First of all comes the definition of machine learning, its difference from artificial intelligence, and what makes a machine learning algorithm so distinct from a normal one.

Due to the variety of machine learning methods, it is essential to distinguish the high-level differences between the most important methods:

  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
  • Reinforcement learning
Roadmap for getting started in machine learning

After that, learners should understand the main types of problems, which include classification, regression, ranking, clustering, dimensionality reduction, recommendation, etc. In most courses, the initial focus is typically on supervised learning and how it is used to solve classification and regression problems. Other learning methods and problem types are usually considered more advanced topics and are studied later.

Apart from that, before studying concrete algorithms, learners should understand how the input data for those algorithms can be represented. In particular, this applies to the tabular format, which is frequently used. Terms like dataset, target, features, and objects should be clear from the beginning.

Finally, the last important topic in this chapter involves the evaluation of algorithms. It is necessary to study the main evaluation metrics and techniques to be comfortable later when estimating how good or bad a given algorithm is or when comparing several of them.

02. Classical machine learning

After building a solid foundation in machine learning, it is time to learn the main algorithms that work on tabular data. Not only are these algorithms widely used on tabular data, but they also play an important role in introducing smart concepts and ideas that can be reused in more complex algorithms and domains.

Classical machine learning roadmap

The first essential algorithm to study is linear regression. Under the hood, linear regression is typically trained with stochastic gradient descent (SGD), whose goal is to find the parameters that minimize a given loss function. Without SGD, it would be impossible to imagine other optimization algorithms and the entire AI domain, as a significant number of algorithms rely on SGD to find optimal weights. While studying linear regression, it is also an excellent opportunity to familiarize yourself with the most commonly used loss functions in machine learning.
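
To make the idea tangible, here is a toy, from-scratch sketch of SGD fitting a one-feature linear regression (the data, learning rate, and step count are arbitrary):

```python
# A toy from-scratch SGD loop for 1D linear regression (all values arbitrary).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 100)   # ground truth: w=3, b=1

w, b, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    i = rng.integers(len(x))          # one random sample per step
    error = (w * x[i] + b) - y[i]
    w -= lr * error * x[i]            # gradient of 0.5 * error**2 w.r.t. w
    b -= lr * error                   # gradient of 0.5 * error**2 w.r.t. b
print(f"w={w:.2f}, b={b:.2f}")        # should approach w≈3, b≈1
```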

Next comes the support vector machine (SVM). Although SVM is rarely used in practice due to its slow performance on large datasets, it still introduces the interesting concept of the kernel trick. This allows the transformation of initially linearly inseparable data into a new space where the data points can be easily separated.

Kernel Trick. The initial inseparable data in the 1D space (on the left) is transformed into a 2D space, where it gains a new dimension y = x² and becomes separable.

The next family of algorithms worth exploring are tree-based algorithms, starting with the decision tree. A decision tree can recursively split data into two subsets based on a chosen binary condition at each tree node. As a result, when a new object is given for prediction, it is run through the entire structure of the decision tree to ultimately reach the leaf node corresponding to the predicted class.

Traditionally, after decision trees, the next topic is random forests, which consist of a set of decision trees. Given that a single decision tree can make errors in its predictions, a random forest improves the overall system by constructing several trees that can "vote" for the best prediction, thus reducing the overall error probability. The concepts of voting and bagging introduced in random forests can also be applied to any base algorithm (not just decision trees) to make the entire system more resilient to errors.

Voting Example: The input vector is sent to several models, each of which individually makes predictions. The most frequent prediction is selected as the final output.

One of the most, if not the most, powerful algorithms for tabular data is gradient boosting. Like random forests, it combines several base algorithms, but it does so differently. Instead of aggregating predictions from several strong algorithms run independently, gradient boosting creates a sequential structure of weak algorithms. Each subsequent algorithm learns to reduce the accumulated error produced by the previous algorithms. The most popular variation of gradient boosting uses decision trees as the base algorithm.

Speaking of algorithms, I would ultimately recommend looking at the k nearest neighbours (kNN) algorithm as one of the best and simplest examples demonstrating the fundamental difference between parametric and non-parametric algorithms. Unlike the parametric algorithms discussed previously, kNN does not learn any parameters. Instead, it relies on certain assumptions about the data and predicts the class of a new object based on the class of the most similar objects from the training dataset.

kNN algorithm

At the same time, it is also necessary to learn several techniques for performing data analysis and processing, such as exploratory data analysis (EDA), feature engineering, one-hot encoding, and addressing issues related to class imbalance.

Finally, another important concept is hyperparameter tuning. At this stage, it is sufficient to understand what it is and to be able to implement the grid search strategy in code. Grid search is one of the simplest methods to adjust algorithms and improve their performance.
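
As a small sketch of what grid search looks like in code (the parameter grid here is an arbitrary example):

```python
# A small grid search example with scikit-learn (grid values are arbitrary).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```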

Personal advice

While learning all of these algorithms, it is also important to understand how the same algorithm can be adjusted to be used for both classification and regression tasks.

These algorithms might seem challenging at first for beginners, as they are very different from the classic algorithms used in computer science. One of the best pieces of advice for gaining a deep understanding of their workflow is to implement them manually in code without relying on any libraries.

Libraries and frameworks

In the machine learning industry, most of the time developers use pre-implemented algorithms from standard Python libraries. Given this, it is important to know how to use them in practice. Luckily, the majority of libraries provide a very easy-to-understand interface, so anyone with even basic coding skills can train and use machine learning models.

Python libraries: Scikit-learn (machine learning), Pandas (data analysis), NumPy (linear algebra)

All of the algorithms discussed in this section are implemented in the Scikit-learn package in Python. Additionally, to perform basic data manipulations and run data analysis, it is necessary to know Pandas. Finally, it might also be worth exploring NumPy, a well-known Python package used for linear algebra tasks.

03. Deep learning

Deep learning is a subset of machine learning that focuses on solving problems using neural networks. A neural network is typically represented as a combination of fully connected layers consisting of perceptrons, where the input layer receives the dataset features, which are then transformed by the intermediate layers, and the predicted result is produced in the output layer.

To be confident when working with neural networks, it is essential to fully understand their learning process. In fact, a neural network can be viewed as a very complex mathematical function with a large number of parameters. As was done with linear regression, we can apply the SGD algorithm to perform model updates and ultimately find the best neural network parameters. Easy, right?

However, the reality is quite different, and there is a whole theory dedicated to training neural networks because the simple SGD algorithm is usually not enough.

Deep learning roadmap

First, it is necessary to understand the details of the forward and backward propagation algorithms for training a neural network, along with their vectorization techniques, which can be applied to accelerate the training process. Computational graphs and the chain rule of differentiation (which should be learned earlier in calculus theory) play a crucial role in backpropagation.

Next, learners should study different types and properties of activation functions, which transform the original linear neural network into a more complex set of mathematical operations, enabling the solution of more sophisticated tasks.

Deep learning optimizers and learning rate schedulers also play a key role in modern neural networks, allowing them to converge to the optimum much faster. The most important optimizers to study are Momentum, RMSProp, AdaGrad, and Adam.

Due to the complex structures of neural networks, vanishing and exploding gradients can become significant problems that prevent the network from learning. Therefore, it is essential to know how to handle such situations. One such method involves using skip connections.

Finally, to reduce the chance of overfitting, it is necessary to apply standard regularization techniques, which generally include batch normalization, weight decay, and dropout.

Personal advice

In contrast to classic machine learning algorithms, I would generally advise beginners against implementing neural networks from scratch.

Nevertheless, it is always an excellent way to gain a deeper understanding of how neural networks function in practice. The problem is that deep learning algorithms are much harder to implement compared to the previous algorithms we discussed, and they may require a significant amount of time that could be better spent focusing on other theoretical aspects.

Regardless of the decision you ultimately make, I believe it is vital to understand the theoretical concepts of deep learning described above.

Libraries and frameworks

The top three well-known state-of-the-art Python frameworks for working with neural networks are PyTorch, TensorFlow, and Keras. Learners often ask which of these three they should choose for implementing their own networks.

The most popular deep learning frameworks: PyTorch, TensorFlow, Keras

In reality, when building a neural network, there is not much difference in terms of which of these three frameworks is used, as they all provide essentially the same functionality for basic tasks. Moreover, the code often looks almost identical in all of them.

It might play a role in the future if you work on a project that uses a particular framework or if you are an advanced researcher who needs very specific functionality that is not implemented in other frameworks. However, for beginners, any framework will be a good choice as long as the developer understands what happens behind the scenes when constructing and later training a neural network architecture.

Keras is built on top of TensorFlow and provides the simplest functionality for those taking their first steps in deep learning. However, in the long term, I would encourage learners to choose between TensorFlow and PyTorch.

Conclusion

In this article, we have covered the necessary machine learning theoretical blocks that every data scientist or machine learning engineer should know.

If you have gained solid knowledge of math, software engineering, and machine learning skills as described in the first three articles of this series, then you should be confident enough to consider yourself at least a Junior data scientist. Despite this significant achievement, it may still be challenging to find a job in today’s highly competitive data science market.

If you have reached this point, you should now be able to pick up and study more advanced machine learning topics that will expand your expertise and contribute to your professional growth.

These topics and new domains will be discussed in the fourth part of this series.

Resources

All images are by the author unless noted otherwise.

How to Stand Out in The Data Science Job Market

How to have the edge in your data science application

Photo by Nick Fewings on Unsplash

Applying for jobs right now can feel difficult: you send countless applications and get little to no response. The problem is that you are probably not differentiating yourself significantly from other candidates, so in this article, I want to go over several ways you can make yourself stand out.

What is standing out?

Standing out is actually a relatively simple concept to get your head around.

All you need to do is be different from other applicants. You must be an outlier.

Sounds simple, right?

Like everything in life, it’s simple to understand but much harder to do.

The primary way to be an outlier is to do things other people are not doing. By definition, you will become an outlier.

Notice I am not saying you need to be better, although this is sometimes a by-product of being an outlier.

You just need to be different, and it only really needs to be in one dimension.

Let’s now go over some ways you can stand out.

Online Presence

There is a saying from Brazilian poet Mario Quintana:

Don’t waste your time chasing butterflies. Mend your garden, and the butterflies will come.

It’s essentially saying it’s better to create something to attract what you want rather than expend energy chasing it. There is a trade-off, but if I think back to my career, my biggest leverage has always been my online presence.

Apart from my first job, I never actually went looking for a role; recruiters always reached out to me via LinkedIn.

A good, polished resume is still great, but how are your other profiles looking? Is your LinkedIn and personal website active and optimised? How often are you posting?

The term "personal brand" is often thrown around, but it is so important. I know people find it a bit cringy, but the leverage it creates is insane.

In every mentoring session, I always tell the person to start an online presence; honestly, not many of them do. So, if you are that person who does it, by definition, you will stand out.

  • At a minimum, I recommend having a nice-looking LinkedIn aligned with your resume / CV. They should complement each other nicely. See my LinkedIn for inspiration and my CV / Resume template here.
  • The next stage is to start posting stuff. It can literally be anything, but it should ideally be related to your work. I would also add a personal website/portfolio to really bolster your online presence. Here is my website if you want some inspiration.
  • The final stage is having something like a blog, YouTube channel, or newsletter that you post consistently. This is how you grow an "audience" and build an online presence of people who trust you and really show off your work.

As I said, applying for jobs directly is one way, but recruiters spend a lot of time looking for candidates as well, so having a strong online presence increases your chances of being approached.

I know many people reading this probably won’t do this, and that’s fine. But that’s precisely why building an online presence will make you stand out!

Unfair Advantages

Everyone has some unfair advantages they can use to increase their chances in the job market.

These are some of the most common ones you can probably use:

  • Networking – It is well known, yet I still think it’s criminally underrated. If you know anyone working in the field through family, friends, or even past jobs, use that connection. Obviously, don’t use the person, but the worst they can say is no, and you move on. Also, it is often in their best interest due to the referral bonus.
  • Educational / Job Background – If you have a background in marketing, finance, medicine or anything that’s not directly related to tech, that’s your advantage. You already have the domain knowledge in a specific area, which will complement your data skills. So, apply for jobs in that business industry, and you will stand out as you have domain knowledge.
  • Where You Live – Hear me out; everyone wants to work in a big city for a big company, and it is great to have those aspirations. However, if you are starting out, there is nothing wrong with going for a smaller company nearby. There will be less competition, and you will still learn a lot. I did this for my first job, where I had to spend half my time in London and half in Sussex. My parents live in Sussex, which made it much easier for me!

There are so many more things I can think of, like speaking a different language, playing a particular sport, being really good at one bit of software, etc. The list is truly endless.

You have to find the things you are better at than most people and use that in the job market. It can be tricky, but if done well, it can really benefit you.

Build Something Cool

I regretted having too many "easy" projects when applying for my first data science job. It was good that I had projects, but they were mostly the same, with just one algorithm swapped out for another.

In reality, what works well is having one significant quality item that can "wow" the interviewers and be a great talking point during the interview – it could even take up the whole interview!

My primary approach would be to develop a project you don’t think you can do, but do it anyway. You will soon learn that everything is figure-outable.

To be more concrete: don’t just build a model in a Jupyter Notebook; make it far more interactive. You should:

  • Deploy the model on a cloud provider like AWS.
  • Add some unit tests and make them align with software engineering best practices.
  • Figure out a way for it to make live predictions.
  • Build a monitoring dashboard that can be accessed online.
  • Store historical predictions in a database.

I am essentially describing deploying and monitoring your model end-to-end, which is still only one approach. However, you can see how this type of project is of significantly greater difficulty and quality than a static model in a notebook.
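
For example, the "live predictions" piece can start as small as a web endpoint wrapping a saved model. Here is a hedged sketch (the framework choice, file name, and feature format are my assumptions, not a prescription):

```python
# A minimal live-prediction endpoint sketch; "model.pkl" and the feature
# format are placeholder assumptions. Run with: uvicorn main:app
import pickle

import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
with open("model.pkl", "rb") as f:   # a previously trained scikit-learn model
    model = pickle.load(f)

class Features(BaseModel):
    values: list[float]

@app.post("/predict")
def predict(features: Features) -> dict:
    prediction = model.predict(np.array([features.values]))[0]
    return {"prediction": float(prediction)}
```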

So, think of a project that would take a couple of months to complete and is something you are likely to struggle with.


If you are serious about becoming a data scientist, I have just released my definitive guide on breaking into data science.

Over the past 3.5 years, I’ve worked as a data scientist in insurance, e-commerce, and the supply chain industries.

But my journey here wasn’t straightforward.

While pursuing a master’s in physics, I explored careers in finance, consultancy, and even research.

Then, I discovered data science, and everything changed. I was captivated by the ability to use machine learning and statistics to solve real-world problems.

I poured countless hours into coding, courses, and projects, applied to over 300 jobs, and eventually landed my first graduate Data Science role in 2021!

This e-book distils what I’ve learned from my experience into a practical guide for breaking into data science.

While I can’t guarantee you’ll land a job, following this advice will significantly improve your chances of success.

Get 50% off this January 2025 with code "EGOR50" – don’t miss out!

My Guide To Becoming A Data Scientist


Another Thing!

I have a free newsletter, Dishing the Data, where I share weekly tips and advice as a practising data scientist. Plus, when you subscribe, you will get my FREE data science resume and short PDF version of my AI roadmap!

Dishing The Data | Egor Howell | Substack

Connect With Me

Top 12 Skills Data Scientists Need to Succeed in 2025

It’s (not) all about LLMs and AI tools

Source: Image made by author, with help from Claude.

The AI landscape is moving faster than a rocket ship in 2025, and hanging on is getting harder and harder!

Will you keep your current position, get hired, get promoted or get sacked? That depends on YOU and how fast you can adapt to change.

This isn’t to say that if you don’t adapt, you will perish.

Many things are changing, but other things are not. Understanding which changes require your attention is the key to success.

Yes, the new AI revolution is proliferating in huge sections of the economy, with a ton of new tools to boost productivity and automate many tasks. And if you were overwhelmed last year, better buckle in for another wild ride.

So, how should you handle this always-accelerating train of AI hype and tools?

By focusing on what matters.

Although AI tools are super shiny and powerful, many of the skills that will really help you succeed in your career haven’t changed much in the past few decades – some not even in several millennia!

I’ll guide you through the top 12 essential skills you need to thrive as a data scientist, ML engineer, or applied scientist in 2025. From timeless capabilities to emerging technologies, I’ll give you practical examples and resources to help you get started and develop each skill. In other words:

This post will be your one-stop shop for leveling up in 2025.


Before We Start

What does a data scientist do in 2025?

Well, it depends.

First of all, I’m going to loosely use the term data scientist for roles that handle data, machine learning, deep learning, and generative AI models. However, other names for such roles include machine learning engineer, applied scientist, algorithm developer, and many more.

So, what should you expect in these roles? Well, I’ve reviewed the stated responsibilities of over 500 job descriptions, and here are my findings:

Top 5 Responsibilities Of Data Scientists:

  1. Data & Modelling: Process and engineer large-scale datasets. Utilize statistics and data analysis. Build, train, and evaluate ML models (including LLMs, computer vision, tabular data, time series, audio, and recommendation systems). Develop GenAI applications and workflows.
  2. Research & Development: Build internal tools. Conduct literature reviews. Stay current with the latest developments. Own projects from POC to deployment.
  3. Infrastructure Design & Development: Design cloud-based model serving infrastructure. Build data, training, and inference pipelines.
  4. Performance & Monitoring: Define and track success metrics. Build dashboards and monitoring tools. Ensure model reliability at scale.
  5. Collaboration & Communication: Present to stakeholders and clients. Share knowledge with team members. Work across different teams.

So, how much of the AI hype train made it into these job descriptions? Not as much as you would think. Take advantage of this and choose specific areas where you want to grow tall, but never forget your roots.


So What Skills Are Important In 2025?

Source: Image created by author with Dall E 3.

1. Communication Skills

Nobody knows what you are doing and what you are thinking.

Is there anything that you want? Do you have any issues? Are you stuck? Do you have any questions? Don’t keep it in. Let it out.

Why should you get what you want if you didn’t ask for it?

Communicating your problems, issues, thoughts, results, and conclusions is vital to success in Data Science and to being productive in general.

I know it’s sometimes embarrassing to ask silly, "dumb," and naive questions. However, from my experience, those who frequently ask "dumb" questions get ahead faster. This is because they learn faster, generate value faster, and gain valuable connections and trust.

I personally love asking "dumb" and "trivial" questions, like "what is an error?". It almost always makes people stop in their tracks and question their most basic assumptions, usually leading to much better understanding of the project and problem, while improving communication between team members.

Take control of your career, interrupt your friend, and ask your questions. They will respect you more for it. The more you practice communicating your needs, thoughts, and ideas, the better you will become at it – not just in your career but in life in general.

What should you practice?

  1. Explaining technical stuff: Try to translate what you are doing or what you just learned to your non-technical friend or family member. Use the [ELI5 technique](https://blog.groovehq.com/the-eli5-technique) (Explain Like I’m 5) and lead with the business impact before diving into details.
  2. Storytelling: Create a narrative for the story you want to tell. Create compelling visualizations or presentations to support your story. You can gradually build up complex concepts using the Progressive Disclosure Principle or the Pyramid Principle, which does it in reverse.
  3. Communicating with stakeholders: Stakeholders don’t have spare time. They want information fast and to the point. Practice the Bottom Line Up Front (BLUF) and Red/Amber/Green (RAG) techniques to deliver your message.
  4. Communicating in writing: Use BLUF and RAG, then summarize what’s been done, what’s next, and what’s blocking progress. When documenting your work, focus on the problem, methodology, reproducibility, and accessibility.
  5. Avoiding pitfalls: Think before you talk and don’t jabber on without any structure; avoid overwhelming people with technical details or jargon; lastly, don’t procrastinate and wait too long to communicate – it’s always better to communicate something than to do nothing at all.

Where Can You Practice?

There are many opportunities to look for!

Try to explain your work to colleagues, friends, or family members. Send frequent status updates through email or your messaging app. Volunteer to present at team meetings or conferences. Join or start a journal club to practice discussing technical papers.

You could also seek feedback on your offline communications and presentations. Create a blog or contribute to technical documentation to improve your offline communication.

The opportunities are endless, you just need to look for them!


2. Programming Skills (Python)

That’s right, Python is number 2.

After communication, I don’t need to tell you that core programming skills are super important for data scientists.

As a data scientist, you need a scripting language like Python to do what you need to do! From data scraping, extraction, manipulation, and analysis to developing custom models and their training, evaluation, inference pipelines, deploying services, and building interactive applications.

You need more than just the machine-learning-related libraries. (That’s number 7 on this list.)

Be familiar with the full breadth of what you can do in Python.

Try to familiarize yourself with as many built-in modules as you can in the Python Standard Library. It provides a ton of super useful tools that will expand the horizons of what you thought Python could do.

For example, [dataclasses](https://docs.python.org/3/library/dataclasses.html) are very useful tools for managing data passed between objects. This example shows the object detection outputs of one model (an object detector) being passed to a second model (a pose estimator).
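
A minimal sketch of such a setup (the class and field names are illustrative assumptions):

```python
# A minimal sketch: a dataclass carrying detections from an object detector
# to a pose estimator (class and field names are illustrative assumptions).
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class Detection:
    """Output of the object detector, consumed downstream by the pose estimator."""
    label: str
    confidence: float                       # in [0, 1]
    bbox: tuple                             # (x, y, width, height)
    keypoints: Optional[np.ndarray] = None  # filled in by the pose estimator


detections = [
    Detection(
        label="person",
        confidence=0.9194,
        bbox=(92, 46, 62, 171),
        keypoints=np.array([[114.488144, 96.92579],
                            [144.00072, 67.28505],
                            [138.01134, 59.30764]]),
    ),
    Detection(label="cat", confidence=0.2565, bbox=(175, 65, 31, 93)),
]

for i, det in enumerate(detections, start=1):
    print("-" * 40)
    print(f"detection {i}: {det.label}, conf: {det.confidence:.2%}")
    print(f"bbox {det.bbox}")
    if det.keypoints is not None:
        print("keypoints:")
        print(det.keypoints)
```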

Running this should produce the following output:

----------------------------------------
detection 1: person, conf: 91.94%
bbox (92, 46, 62, 171)
keypoints:
[[114.488144  96.92579 ]
 [144.00072   67.28505 ]
 [138.01134   59.30764 ]]
----------------------------------------
detection 2: cat, conf: 25.65%
bbox (175, 65, 31, 93)

3. Deep Understanding Of Data

Do you know what’s in your data?

3.1 Data Validation

What data issues do you have? Which edge cases, corruptions, or noise?

If you don’t know the answers to these questions, you might build a data pipeline or preprocessing component that produces bugs that are hard to understand and catch, or worse yet – silent errors! (Where everything runs fine, but you get behaviors that you don’t want).

How can you avoid these issues?

By explicitly testing ALL the assumptions you have about your data.

For example, is your data free of duplicates and missing values? Are all the values numeric, strings, or tensors? Are all the labels positive integers? Are all the files present? Are the paths structured as you believe they are?

Think of it like tests for your data.

Too many times have I skipped this crucial step and ended up paying the price of silent bugs that are super hard to catch without explicit testing. That’s why I never forget Bob Colwell’s famous saying: "If you didn’t test it, it doesn’t work".
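
A minimal sketch of what these "tests for your data" can look like (the column names and rules are placeholder assumptions):

```python
# A minimal sketch of "tests for your data"; column names and rules are
# placeholder assumptions for illustration.
from pathlib import Path

import pandas as pd


def validate_dataset(df: pd.DataFrame, image_dir: Path) -> None:
    assert not df.empty, "dataset is empty"
    assert not df.duplicated().any(), "duplicate rows found"
    assert not df.isna().any().any(), "missing values found"
    assert pd.api.types.is_integer_dtype(df["label"]), "labels must be integers"
    assert (df["label"] >= 0).all(), "labels must be non-negative"
    # every referenced file must exist where we think it does
    missing = [p for p in df["image_path"] if not (image_dir / p).exists()]
    assert not missing, f"{len(missing)} image files are missing"
```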


3.2 Understanding Your Data

Source: image from Deep Learning vs Data Science: Who Will Win?

Exploratory data analysis (EDA) is your friend!

Use it to understand the distributions involved, the difficulty of the learning task when training on the data, and how to partition your data to provide insights, analyze performance, improve training, and encourage the behaviors you want.

This will allow you to find the things you didn’t know that you didn’t know!

Hypothesis testing is good for finding things you know you don’t know, i.e., where you know which questions to ask but don’t know the answers.

But what about questions you didn’t think about asking?

This is where visualization comes in. This is your chance to open your mind and let it race through possibilities. It is your chance to find patterns and learn to ask new questions you’ve never thought about asking.

Get very comfortable with tools such as [matplotlib](https://matplotlib.org/), [seaborn](https://seaborn.pydata.org/), [plotly](https://plotly.com/), and related libraries. Any LLM can likely implement some starter code for visualizing your data, but you should still know how to go in and tailor it to exactly what you need.
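
For example, here is a tiny starter you might then tailor (the data and column name are arbitrary):

```python
# A tiny EDA starter to tailor to your needs (data and column are arbitrary).
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 2, 3, 3, 3, 4, 7]})

ax = df["value"].hist(bins=5)
ax.set(title="Distribution of 'value'", xlabel="value", ylabel="count")
plt.tight_layout()
plt.show()
```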

Remember, creating clear, informative visualizations that communicate insights effectively is a skill, not an art, and practice makes perfect.


3.3 Understanding The Effects

Last but not least, think about what behavior you expect your model to exhibit if it will be trained on the data.

Will your model really be able to learn what you want it to learn?

Can you use feature engineering to improve the learning signal? This is one of those places where you should try to leverage your mathematical intuition, knowledge of statistical methods, and probability theory.

This is also relevant to evaluation data. Try to think about what it means to get certain metrics on a certain partition of the data. Does it really support or disprove a hypothesis or business objective? If not, what kind of data will be better aligned with the goal?


4. Software Engineering Best Practices

Didn’t I cover this already?

No. We talked about "how" to do things with code; now let’s talk about "what" you should be doing.

Each programming language has its own upsides, downsides, perks, and quirks. That said, there are many best practices in software engineering that you should master, as these concepts will give you a competitive edge over your peers and make you a true professional in your craft.

From my experience (as someone who learned these concepts way too late in my career), these skills will likely provide much more value than any single framework or tool:

  1. Version control (Git and Github): careful commits, branches, pull requests, and documentation.
  2. Writing clean code and avoiding smelly code.
  3. Different types of testing.
  4. Object-Oriented Programming (OOP) concepts: Basic and SOLID principles, as well as common design patterns.
  5. Containerization.

Take this example of smelly code:
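
(The snippet below is a sketch reconstructed from the critique that follows; the names l, x, and y and the grading logic are inferred assumptions, not the original code.)

```python
# A guess at the smelly snippet described below: meaningless names (l, x, y),
# no type hints, no input validation, and repeated >= comparison logic.
def f(l):
    y = []
    for x in l:
        if x >= 90:
            y.append("A")
        elif x >= 80:
            y.append("B")
        elif x >= 70:
            y.append("C")
        else:
            y.append("F")
    return y
```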

What’s wrong with it?

Many things, actually.

The method and variable names are not meaningful (l, x, y), there are no type hints, there is no input validation, the logic is repeated (the >= comparisons), and it may produce undesirable behaviors: empty lists and invalid grades.

Can we do better? Of course, we can! Check out this code instead:
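
(Again a sketch, written to match the description that follows.)

```python
# A cleaned-up sketch: meaningful names, type hints, named constants,
# input validation, and no repeated comparison logic.
GRADE_BOUNDARIES = [(90, "A"), (80, "B"), (70, "C")]
FAILING_LETTER = "F"
MIN_GRADE, MAX_GRADE = 0, 100


def to_letter_grades(numeric_grades: list[float]) -> list[str]:
    """Map numeric grades to letter grades, validating the input first."""
    if not numeric_grades:
        raise ValueError("expected at least one grade")
    for grade in numeric_grades:
        if not MIN_GRADE <= grade <= MAX_GRADE:
            raise ValueError(f"grade {grade} is outside [{MIN_GRADE}, {MAX_GRADE}]")
    return [_letter_for(grade) for grade in numeric_grades]


def _letter_for(grade: float) -> str:
    # walk the boundaries once instead of repeating >= branches everywhere
    for boundary, letter in GRADE_BOUNDARIES:
        if grade >= boundary:
            return letter
    return FAILING_LETTER
```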

This clean code is much, much better!

For example, the variables and function have meaningful names and type hints that clearly define input and output types (making the code easier to read and test), constants are clearly defined, there is no repeated logic, and input validation ensures reliable behavior.

For further reading, I highly recommend checking out refactoring.guru, which has a ton of really good and precise information about these topics.


5. Interacting With Databases

As a data scientist, don’t you work with data?

And where does your data live?

Picture this: You’ve just joined a fast-growing startup as a data scientist. On your first day, you discover your data is scattered across half a dozen different systems. Customer information lives in PostgreSQL, user behavior data streams into MongoDB, session data sits in Redis, and your marketing team just dumped a treasure trove of campaign data into Snowflake.

Welcome to the modern data ecosystem.

Don’t get overwhelmed, though; it’s not as bad as you think. There are all kinds of databases out there, but the most basic families are relational and NoSQL databases.

  • Relational Databases (such as PostgreSQL and MySQL): These keep data in structured tables, which are good when your data can fit naturally into tables with rows and columns, has well-defined relationships between different types of data, has a relatively stable data schema, and requires you to run complex queries.
  • NoSQL Databases: This is a family of "all the other storage approaches," with several different ways to store data:
    a. Document-based, such as MongoDB. Good for semi-structured data.
    b. Key-value based, such as Redis. Good for simple and fast lookups.
    c. Wide-column based, such as Cassandra. Good for handling massive amounts of structured data across several servers.
    d. Graph-based, such as Neo4j. Good for situations where the relationships between entries are important.
    e. Vector-based, such as Pinecone. Good for similarity queries and storing high-dimensional data, like feature embeddings.

So, Which Query Languages Should You Learn?

It’s okay to be a bit lazy on this one.

I’d recommend starting with MySQL and MongoDB, as I think these are good places to start. If you want, you can take a look at Redis, DynamoDB, Pinecone, Neo4j, and others. However, don’t get caught up in this.

It would also be good to get some basic understanding of what data pipelines, data lakes, and data warehouses are.

As a data scientist, you won’t likely need to be a grandmaster of these frameworks. Just start with the very basics, build as you go, and always keep your eye on the real goal: turning data into insights.


6. Cloud Computing

Example of a virtual private cloud (VPC). Source: image made by author with Claude.

How can you train your model if you can’t get your data and don’t even have a computer?

As a data scientist, you will likely be working with cloud platforms (such as AWS, GCP, Azure) for data storage and processing, as well as model training, evaluation, and deployment.

Even though you will likely always have a DevOps team supporting you, some basic understanding and hands-on experience with cloud services will give you a clear edge over your peers!

It will be like learning how to swim, but in a pool of abstract concepts.

This knowledge will help you communicate your needs better, give you more independence with your computing resources, help you get more and stronger GPUs, and accelerate your development, training, and inference!

For example, in one of my previous projects, I managed to 2X the number of strong GPUs my team had by optimizing the resources my team was using within its budget.

How To Get Started?

As with all technical skills, I think the best way to learn them is to start right away with hands-on tutorials. Let’s play around a bit with Amazon’s Simple Storage Service (S3) to save and download some data.

6.1 Let’s Start With Account Setup And Installations:

If you don’t have an account, create one at aws.amazon.com, then create an IAM user with a policy that gives you access to S3.

Didn’t understand any of that? No worries.

Hopefully, your DevOps team will be able to help you with setting this up, and if not, I’d recommend checking out this great tutorial on IAM, this one on S3, and this hands-on tutorial, too.

Ready?

Let’s go!

First, let’s install the rest of the pip packages:

pip install awscli boto3 torch torchvision

Now, configure your environment.

aws configure
# Enter your:
# - AWS Access Key ID
# - AWS Secret Access Key
# - Default region (e.g., us-east-1)
# - Default output format (json)

After you set this up, your environment will be linked to your IAM user with an access policy to do things with S3, either through the AWS command line interface (AWSCLI) or through their Python library (boto3).

Got that? Good.

6.2 Setting Up Our Helper Class

Let’s now make a little class to manage uploading, downloading, and verifying data to and from S3. Here specifically, we’ll handle images and a labels.pt tensor from a local directory.
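
A minimal sketch of such a helper built on boto3 (the class and method names are my assumptions):

```python
# A minimal helper sketch for pushing a local data directory to S3 and
# verifying the round trip (class and method names are assumptions).
import hashlib
from pathlib import Path

import boto3


class S3DataManager:
    """Upload, download, and verify a local data directory against S3."""

    def __init__(self, bucket: str, prefix: str = ""):
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.prefix = prefix

    def create_bucket(self) -> None:
        # note: outside us-east-1 you must also pass CreateBucketConfiguration
        self.s3.create_bucket(Bucket=self.bucket)
        print(f"Created bucket: {self.bucket}")

    def upload_dir(self, local_dir: Path) -> None:
        print("uploading data")
        for path in local_dir.rglob("*"):
            if path.is_file():
                key = self.prefix + path.relative_to(local_dir).as_posix()
                self.s3.upload_file(str(path), self.bucket, key)
        print("done")

    def download_all(self, target_dir: Path) -> None:
        print("downloading data")
        pages = self.s3.get_paginator("list_objects_v2").paginate(
            Bucket=self.bucket, Prefix=self.prefix)
        for page in pages:
            for obj in page.get("Contents", []):
                dest = target_dir / obj["Key"]
                dest.parent.mkdir(parents=True, exist_ok=True)
                self.s3.download_file(self.bucket, obj["Key"], str(dest))
        print("done")

    def verify(self, original_dir: Path, downloaded_dir: Path) -> bool:
        # checksum every original file against its downloaded twin
        def md5(p: Path) -> str:
            return hashlib.md5(p.read_bytes()).hexdigest()

        for path in original_dir.rglob("*"):
            if path.is_file():
                twin = downloaded_dir / self.prefix / path.relative_to(original_dir)
                if not twin.exists() or md5(path) != md5(twin):
                    return False
        return True
```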

6.3 Interacting With S3

Then, we can run these operations using the following script, in which we will upload the first 10 images and labels of the FashionMNIST test dataset (MIT license).
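
A rough version of that script (the bucket naming, paths, and FashionMNIST handling are assumptions inferred from the output below):

```python
# A rough driver for the helper above; names and paths are assumptions.
from datetime import datetime
from pathlib import Path

import torch
from torchvision import datasets

local_dir = Path("data/FashionMNIST/processed/test")
images_dir = local_dir / "images"
images_dir.mkdir(parents=True, exist_ok=True)

# save the first 10 test images and their labels locally
test_set = datasets.FashionMNIST(root="data", train=False, download=True)
labels = []
for i in range(10):
    image, label = test_set[i]          # PIL image, int label
    image.save(images_dir / f"{i}.png")
    labels.append(label)
torch.save(torch.tensor(labels), local_dir / "labels.pt")
print(f"data saved locally to {local_dir}")

# push to S3, pull it back, and verify the round trip
manager = S3DataManager(f"fashionmnist-s3-demo-{datetime.now():%Y%m%d%H%M%S}")
manager.create_bucket()
manager.upload_dir(local_dir)

download_dir = Path("downloads")
manager.download_all(download_dir)
ok = manager.verify(local_dir, download_dir)
print(f"Dataset verification: {'passed' if ok else 'failed'}")
```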

Running this should produce the following output, in which we confirm that the files we downloaded (after we uploaded them) are the same files we started with.

data saved locally to data/FashionMNIST/processed/test
Created bucket: fashionmnist-s3-demo-20241230210543
uploading data
done
downloading data
done
downloading data
done
Dataset verification: passed

Lastly, we can also use the command line interface to confirm the data is still present in our S3 bucket. In this case, let’s use the AWSCLI:

aws s3 ls --recursive {BUCKET_NAME}

$ aws s3 ls --recursive fashionmnist-s3-demo-20241230210543
2024-12-30 21:05:45        394 cifar10/test/images/0.png
2024-12-30 21:05:45        582 cifar10/test/images/1.png
2024-12-30 21:05:45        391 cifar10/test/images/2.png
2024-12-30 21:05:45        408 cifar10/test/images/3.png
2024-12-30 21:05:45        635 cifar10/test/images/4.png
2024-12-30 21:05:45        444 cifar10/test/images/5.png
2024-12-30 21:05:45        542 cifar10/test/images/6.png
2024-12-30 21:05:46        700 cifar10/test/images/7.png
2024-12-30 21:05:46        224 cifar10/test/images/8.png
2024-12-30 21:05:46        381 cifar10/test/images/9.png
2024-12-30 21:05:46        808 cifar10/test/labels.pt

Important!

Always clean up the cloud resources you use to avoid excessive costs! Note that in this demo, the buckets will have different names (with timestamps) each time you run the script, so make sure to delete these buckets in S3 when you are done!

aws s3 rm s3://{BUCKET_NAME} --recursive
aws s3 rb s3://{BUCKET_NAME}

7. Mastering Machine Learning Frameworks

You know this, so I won’t dive deep with this one.

As a data scientist, you need to know how and be willing to get your hands dirty with ML frameworks, so it is important to get good at using them! These include frameworks such as PyTorch, TensorFlow, and Scikit-Learn.

For example, here is a use case where you want to train a custom head for specific tokens of a pre-trained vision transformer to localize objects in a patch of an image. It also uses different learning rates to leverage pretrained weights without "erasing" them.
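
Here is a hedged sketch of that kind of setup (the timm backbone, the head, and the learning-rate values are illustrative assumptions, not the original project’s code):

```python
# A hedged sketch: a custom head on a pretrained ViT, trained with
# per-parameter-group learning rates (model name and lr values are
# illustrative assumptions).
import torch
import torch.nn as nn
import timm  # assumption: the pretrained ViT comes from timm

backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
head = nn.Linear(backbone.num_features, 4)  # e.g., a bbox regressor per patch token

optimizer = torch.optim.AdamW([
    # tiny lr on the backbone: nudge the pretrained weights without "erasing" them
    {"params": backbone.parameters(), "lr": 1e-5},
    # larger lr on the head: it is trained from scratch
    {"params": head.parameters(), "lr": 1e-3},
])
```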

You probably won’t find this kind of functionality in high-level trainers, such as when using the [Hugging Face](https://huggingface.co/docs/transformers/en/main_classes/trainer) or [Fastai](https://fastai1.fast.ai/training.html) trainers.


8. MLOps

Ever forget where you put your best model?

MLOps is a catch-all phrase that includes all the tech needed to push an ML project through its entire lifecycle, from initial experiment tracking to production deployment. This includes containerization, monitoring, and maintaining model performance over time.

Do you need to use all this stuff even if you are working alone? On something that's experimental? On a one-week project?

Yes, yes, and yes.

From my experience, moving from "we’re just trying stuff out" to "we need to work in an organized way" happens way too slowly and sometimes never!

My rule of thumb is:

For ANY project, always use an experiment and configuration manager to track your configurations, metrics, and model checkpoints.

This is why you should master the basic features of experiment managers, such as [wandb](https://wandb.ai/site/), [mlflow](https://mlflow.org/), [clearml](https://clear.ml/), [neptune](https://docs.neptune.ai/usage/), and things like that. Don't worry; they all have the same basic features, so just focus on one.

More elaborate MLOps components can wait until the project is more mature, but learning them will give you an edge. These include data, training, evaluation, CI/CD and inference pipelines, model and data versioning, ML monitoring, and more.

Here is an example of using [wandb](https://wandb.ai/site/) to train a model and track its metrics and configuration, then save it to a model registry using their artifacts API.
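A minimal sketch of that pattern (the toy model, synthetic data, and the project and artifact names are all made up for illustration):

import torch
import torch.nn as nn
import wandb

config = {"lr": 1e-3, "epochs": 5, "hidden": 32}
run = wandb.init(project="mlops-demo", config=config)

# A toy regression model and synthetic data, just to have something to train.
model = nn.Sequential(
    nn.Linear(10, config["hidden"]), nn.ReLU(), nn.Linear(config["hidden"], 1)
)
optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
X, y = torch.randn(256, 10), torch.randn(256, 1)

for epoch in range(config["epochs"]):
    loss = nn.functional.mse_loss(model(X), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    wandb.log({"epoch": epoch, "train_loss": loss.item()})  # tracked metrics

# Save the checkpoint as an artifact (this is what lands in the model registry).
torch.save(model.state_dict(), "model.pt")
artifact = wandb.Artifact("demo-model", type="model")
artifact.add_file("model.pt")
run.log_artifact(artifact)
run.finish()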

When opening up the [wandb](https://wandb.ai/site/) web app, you should see the experiments that ran and their metrics, configuration, and associated saved artifacts.

Screenshot of artifacts being created in wandb. Source: screenshot by author.
Screenshot of metrics being tracked in wandb. Source: screenshot by author.

9. Understanding Metrics

One of the hardest things in data science is deciding what it means to succeed at your task or goal. You can probably think of a ton of metrics, such as [accuracy](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.accuracy_score.html), [F1](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.f1_score.html), [precision](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.precision_score.html), [recall](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.recall_score.html), [AUROC](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.roc_auc_score.html), [mAP](https://towardsdatascience.com/map-mean-average-precision-might-confuse-you-5956f1bfa9e2), [MSE](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.mean_squared_error.html), etc.

However, are these what you really care about? If your model is really good at one of them, does it mean your project is a success?

Typically not.

To develop systems that provide value to your clients and stakeholders, you need to understand their needs, define what success means, and find a way to quantify it.

Many metrics can be useful for improving the model, but choosing the right metric that indicates success will ensure you really provide value.

How To Get Good At This?

Try refreshing your knowledge of different loss functions and performance metrics. You can review the supported metrics in Machine Learning frameworks, such as [sklearn.metrics](https://scikit-learn.org/1.5/modules/model_evaluation.html), [torchmetrics](https://lightning.ai/docs/torchmetrics/stable/), [torch loss functions](https://neptune.ai/blog/pytorch-loss-functions), and more.

Many of these metrics have deep foundations in statistical measures and probability theory. However, I believe that trying to understand these metrics directly can be more beneficial than studying these fields.

Want to gain more intuition? Be proactive and interact with these metrics by running them with different inputs and plotting what you get. Make sure you are able to explain to someone else why the plots look the way they do.

This is how you develop an intuitive feeling for these topics.
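For example, here is a tiny sketch of that kind of exploration: sweep the decision threshold on toy scores and watch precision and recall pull against each other (all numbers here are made up):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, precision_score, recall_score

rng = np.random.default_rng(0)

# Toy binary problem: positives score higher on average, with overlap.
y_true = np.concatenate([np.ones(500, dtype=int), np.zeros(500, dtype=int)])
scores = np.concatenate(
    [rng.normal(0.65, 0.15, 500), rng.normal(0.35, 0.15, 500)]
)

thresholds = np.linspace(0.05, 0.95, 50)
precision = [precision_score(y_true, (scores >= t).astype(int)) for t in thresholds]
recall = [recall_score(y_true, (scores >= t).astype(int)) for t in thresholds]
f1 = [f1_score(y_true, (scores >= t).astype(int)) for t in thresholds]

plt.plot(thresholds, precision, label="precision")
plt.plot(thresholds, recall, label="recall")
plt.plot(thresholds, f1, label="F1")
plt.xlabel("decision threshold")
plt.legend()
plt.show()
# Can you explain why precision rises while recall falls as the threshold grows?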


10. Problem-Solving & Critical Thinking

Source: image made by author with Dall E 3.

Have you ever been in this pose? Isn’t it the best? Being immersed in thought, considering different approaches, solutions and outcomes.

Solving problems should be your bread and butter. However, unless you are Einstein, you will need a systematic way to tackle new problems. Here's the one I use:

  1. Clearly define your problem: your current situation, desired outcome, constraints, and success metrics.
  2. Before diving in, take a step back: is an algorithmic solution really necessary? Is there sufficient quality data? Does solving this problem provide value to your stakeholders, customers, or team?
  3. Make Data-Driven Decisions: Be naive, start small, and iterate. Try a super simple hypothesis and test it. Document the hypotheses you test, run small experiments, record outcomes, adjust hypotheses, add complexity, and repeat.

That said, definitely consider different problem-solving frameworks that might work better for you, such as PDCA, IDEAL, 5 Whys, and First Principles Thinking.

When looking for simple solutions and ways to test your hypothesis, statistical measures are your friend. These include measures such as Student’s T-test, Pearson correlation, and Wilcoxon signed-rank test.
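For instance, with scipy these tests are one-liners (the per-fold scores below are hypothetical, just to show the calls):

import numpy as np
from scipy import stats

# Hypothetical cross-validation scores for a baseline and a candidate model.
baseline = np.array([0.71, 0.73, 0.70, 0.74, 0.72, 0.73, 0.71, 0.75])
candidate = np.array([0.74, 0.75, 0.73, 0.77, 0.74, 0.76, 0.73, 0.78])

# Paired t-test: is the mean per-fold improvement likely real?
t_stat, p_value = stats.ttest_rel(candidate, baseline)
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.4f}")

# Wilcoxon signed-rank test: a non-parametric alternative.
w_stat, w_p = stats.wilcoxon(candidate, baseline)
print(f"wilcoxon: W={w_stat:.2f}, p={w_p:.4f}")

# Pearson correlation: do the two models struggle on the same folds?
r, r_p = stats.pearsonr(baseline, candidate)
print(f"pearson: r={r:.2f}, p={r_p:.4f}")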


11. AI-Based Tools And Workflows

Wait, didn’t you say it’s not all about the AI hype train?

Okay, you got me.

I did say that, but I also think they can provide real value in automating certain tasks you already sort of understand. This is because you should never trust the outputs of these AI models without verifying them yourself.

This will ensure you provide real value and don't look like an idiot who just copied the output of an AI model.

11.1 AI-Based Tools

There are many different AI tools out there, so I would like to focus more on how you should be using these tools rather than which tools to use.

  1. Code-related AI Tools: generate, edit, review, and test your code – whether through the API, the UI (e.g., OpenAI Canvas), or in your IDE. If you are not using these, wake up! These tools are a must in 2025 and will significantly boost your productivity and the quality of your code.
  2. Media generators (image, audio, 3D, and video): As a data scientist, you must communicate your ideas and results quite a bit. (Remember the number one skill?). These tools can help you do so in presentations.
  3. AI coworkers: brainstorm ideas and get feedback and advice regarding work and life decisions. Always take it with a grain of salt.
  4. Knowledge gateways: whether it is a reading assistant, a summarizer, a search engine, or just an LLM response. These tools can make knowledge more accessible, but be on the watch for hallucinations!
  5. Communication assistants: whether for translating to or from your native tongue, drafting an email, a letter, or a slide, these tools can likely save time and help improve your communication skills.
  6. "In-app" AI tools: AI tools such as Copilot are used for spreadsheets, slideshows, text editors, etc. These AI-centric user interfaces can save you time compared to the old graphical user interfaces.

11.2 AI-Based Workflows

Even though not every data scientist today is working to develop API-based AI applications, I think that it is vital to at least be familiar with the range of possible applications and workflows you can build with AI APIs.

I know what you might be saying:

Sounds complicated and either way, I don’t need this.

But I disagree. These applications can now work with any type of data and have the potential to automate huge chunks of your day-to-day work and let you focus on the things that really matter – doing good data science.

Best of all, it’s actually pretty simple to start building workflows! It is definitely simpler than many other machine learning systems.


Let’s take a look at a super simple example of building a workflow with an LLM-based agent that decides which (if any) tools to use.

If you are new to these things, think of "tools" as rule-based actions that the agent can decide to take based on its inputs. These can include running a method or a script, executing a terminal command, calling another API, using some resource, or even creating and/or running a different agent.

11.2.1 Installation And Environment Setup

pip install -qU langchain-anthropic langgraph langchain-core

Get your Anthropic API key here and set it as an environment variable:

export ANTHROPIC_API_KEY='your-api-key-here'

11.2.2 Imports And Helper Functions

Let’s define the tools we want the model to use and some helper functions for calling the model and routing based on its decisions.

Next, we can construct and run a workflow that uses the model to call the relevant tools based on its input. In this case, we will use a fixed input, though you can also parse it from your terminal or load it from a file.
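A sketch of constructing the graph itself, following the standard LangGraph tool-calling pattern (the node names are just conventions):

from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode

def call_model(state: MessagesState):
    # Ask the model for its next step given the conversation so far.
    return {"messages": [model.invoke(state["messages"])]}

def route(state: MessagesState):
    # If the model requested tool calls, run them; otherwise we are done.
    return "tools" if state["messages"][-1].tool_calls else END

workflow = StateGraph(MessagesState)
workflow.add_node("agent", call_model)
workflow.add_node("tools", ToolNode(tools))
workflow.add_edge(START, "agent")
workflow.add_conditional_edges("agent", route, ["tools", END])
workflow.add_edge("tools", "agent")  # after running tools, return to the model
app = workflow.compile()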

11.2.3 Running The Graph
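A sketch of streaming the compiled graph with the fixed input:

for chunk in app.stream(
    {"messages": [("human", "what's the weather in the coolest cities?")]},
    stream_mode="values",
):
    chunk["messages"][-1].pretty_print()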

This should produce the following output:

================================ Human Message =================================
what's the weather in the coolest cities?
================================== Ai Message ==================================
[{'text': "I'll help you find out the weather in the coolest cities. I'll break this down into two steps:nn1. First, I'll get the list of coolest citiesn2. Then, I'll check the weather for each of those cities", 'type': 'text'}, {'id': 'toolu_012ZCXz6Db7bFa4M4FVmibQ3', 'input': {}, 'name': 'get_coolest_cities', 'type': 'tool_use'}]
Tool Calls:
get_coolest_cities (toolu_012ZCXz6Db7bFa4M4FVmibQ3)
Call ID: toolu_012ZCXz6Db7bFa4M4FVmibQ3
Args:
================================= Tool Message =================================
Name: get_coolest_cities
nyc, sf
================================== Ai Message ==================================
[{'text': "Now, I'll check the weather for New York City (NYC) and San Francisco (SF):", 'type': 'text'}, {'id': 'toolu_01WARnjshmMJ1qc6vjrAEqL3', 'input': {'location': 'nyc'}, 'name': 'get_weather', 'type': 'tool_use'}, {'id': 'toolu_01RgkFQ6ijpm39hZvqvYJLoQ', 'input': {'location': 'sf'}, 'name': 'get_weather', 'type': 'tool_use'}]
Tool Calls:
get_weather (toolu_01WARnjshmMJ1qc6vjrAEqL3)
Call ID: toolu_01WARnjshmMJ1qc6vjrAEqL3
Args:
location: nyc
get_weather (toolu_01RgkFQ6ijpm39hZvqvYJLoQ)
Call ID: toolu_01RgkFQ6ijpm39hZvqvYJLoQ
Args:
location: sf
================================= Tool Message =================================
Name: get_weather
It's 60 degrees and foggy.
================================== Ai Message ==================================
Here's the weather in the coolest cities:
New York City (NYC): It's 90 degrees and sunny
San Francisco (SF): It's 60 degrees and foggy

Quite a contrast between the two cities! NYC is experiencing a hot, sunny day, while SF is cool and foggy. Would you like to know anything else about these cities or their weather?

12. Adaptability & Continuous Learning

Source: image created by author with Dall E 3.

This is probably the most important one.

The key to staying relevant in 2025 isn’t just learning everything new, but learning the right things at the right pace.

Remember, you already have most of these skills in some form or another. Embrace new AI tools strategically – it’s about sharpening your spear, not chasing every shiny object. Here are a few tips:

  1. Create a Learning Framework: Set aside 2–3 dedicated hours per week for learning, ideally at the same time each week to build a habit. Maintain a "skills inventory" document tracking your current expertise levels and identifying gaps; these can be new skills or existing ones.
  2. 80/20 Rule for AI tools: spend 80% of the time mastering skills you already have and 20% experimenting with new tech. Always try to apply what you learn to real problems and projects you are working on.
  3. Use the "learn-apply-teach" method: Learn something new, apply it to a real project within 1 week, then explain it to a colleague. Document your learning in a personal wiki that no one needs to see.
  4. Measure progress and stay relevant: Set quarterly learning goals with specific, measurable outcomes. Track your "wins" where you successfully applied your new skills. Most importantly, review and update your goals.

Conclusion

Being a data scientist ain’t easy.

You need a ton of soft and hard skills, which can take years to develop. But don’t worry, no one is perfect, no one is good at ALL of these skills, and no one will ever be.

In this post, we’ve reviewed the top 12 skills that will be most important to succeed in the 2025 job market:

  1. Communication skills.
  2. Programming skills (Python).
  3. Understanding and handling of data.
  4. Software engineering best practices.
  5. Interacting with databases.
  6. Cloud computing.
  7. Machine learning frameworks.
  8. MLOps.
  9. Understanding metrics.
  10. Problem-solving skills.
  11. AI tools.
  12. Continuous learning.

So what should you do with all this information?

Take action and start boosting your skills!

That said, take things one step at a time. Don’t try to take it all in; you’ll get overwhelmed, procrastinate, and inevitably stay in place.

If you want to improve, start small.

Pick one skill from this list today, and dedicate the next month to mastering it. Pick a different one each month and you’ll cover them all!

Treat this post as a roadmap to learning what matters in the job market of 2025. I hope it will help you all land your first position soon or make you uniquely valuable and excel in your current position.

Good luck learning. You rock!


Sources and Further Reading:

[1] Qu, Changle, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. "Tool Learning with Large Language Models: A Survey." (2024). arXiv preprint arXiv:2405.17935.

[2] Dosovitskiy, Alexey. "An image is worth 16×16 words: Transformers for image recognition at scale." (2020). arXiv preprint arXiv:2010.11929.

[3] Chase, H. (2022). LangChain [Computer software]. https://github.com/langchain-ai/langchain.

[4] Refactoring Guru. https://refactoring.guru/.

[5] LangGraph. https://github.com/langchain-ai/langgraph.

[6] Murphy, Forrest. "[Use This Simple Technique To Explain Complicated Concepts To Anyone](https://blog.groovehq.com/the-eli5-technique)."

[7] Thompson, Pat. "Using the progressive disclosure principle in academic writing." (2022).

[8] Angelo, Lindsay. "Strategic Storytelling: Helpful Tips to Boost your Business Communication and Influence."

[9] "Bottum Line Up Front (BLUF)". Wikipedia: The Free Encyclopedia.

[10] Henricksen, Tom. "What is the project status? Red Amber Green what does that mean?" (2023) Medium.com.

[11] Python Software Foundation. "The Python Standard Library". (2024). https://docs.python.org/3/library/index.html.

[12] Geeks For Geeks. www.geeksforgeeks.org.

[13] datacamp.com. https://www.datacamp.com/tutorial/aws-s3-efs-tutorial.

[14] AWS documentation. https://docs.aws.amazon.com/.

[15] Weights And Biases documentation. https://docs.wandb.ai/guides/.

[16] "Plan Do Check Act". Wikipedia: The Free Encyclopedia.

[17] "Five Whys". Wikipedia: The Free Encyclopedia.

[18] Tubis, Nick. "First Principles Thinking: The Blueprint For Solving Business Problems." (2023). Forbes.

[19] Tindle, Austin. "Learn, Apply, Teach, Repeat: Guidelines for Technical Learning". (2019). medium.com.

[20] Xiao, Han, Kashif Rasul, and Roland Vollgraf. "Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms." (2017). arXiv preprint arXiv:1708.07747.

Four Signs It’s Time to Leave Your Data Science Job https://towardsdatascience.com/four-signs-its-time-to-leave-your-data-science-job-7b56818a95d2/ Tue, 17 Dec 2024 01:00:51 +0000 https://towardsdatascience.com/four-signs-its-time-to-leave-your-data-science-job-7b56818a95d2/ Four tell-tale signs that you should look for another job


I see it too often: people stay in the same job far longer than necessary. Staying in the same place can cause one's skills and compensation to stagnate, which is definitely far from ideal.

In this article, I will discuss four tell-tale signs that you should probably look to move on soon.

The offer from the new company is just too good

Even if you are super happy at your current place, there is nothing wrong with speaking to recruiters and other companies that really interest you. Interviewing is a skill, and practising even if you are not planning on moving is not a bad idea.

Ryan Peterman, a staff software engineer at Meta, wrote an article on why you should interview at other places even if you are happy with your current role.

Staying Sharp

His main arguments are:

  • It can give you confidence that you are in the best place for you.
  • Looking for a job when you need one is the worst time to look because you are anxious and not in an abundance mindset.
  • You apply for fewer roles and ones that genuinely interest you.
  • You are not going to over-prepare, saving you time.
  • Keeps your skills sharp and lets you know what you are missing.

Another valuable point is that you may get a job with an offer you can’t refuse. Sure, you are happy, but statistically speaking, you could always be happier, right?

I am not saying to jump ship whenever. Make sure you stay somewhere long enough to deliver impact and actually be able to say you did excellent work. This varies by company, but it’s often at least a year, preferably two years.

However, if you are offered something that is just too good to pass up, then go for it! You will know in your gut if it’s the right choice.

You are miserable in your current place

If you despise your current role or company, then move. If you dread every working day, that’s not a good sign.

Unhappiness is caused by many things: the type of work, the hours, your colleagues, or your boss. Whatever it is, it can be changed.

If everything else is good apart from one thing, you should endeavour to fix that issue at your current job. For example, if you are struggling with a colleague, try to reach an understanding and work things out.

Issues that can be worked out should always be addressed before you plan on moving, especially if it's just one thing making you miserable. However, if it's an accumulation of problems, then moving is often your best bet, especially for things you are unlikely to solve yourself, like culture or senior management.

People will say "moving is hard" or "easier said than done" when you tell them to look for new jobs. I am going to be a bit controversial and offer some tough love: yes, you are right. Looking for new jobs is hard, and many people stay where they are despite being unhappy.

I am still very young, and maybe I’m naive, but I’d rather spend months looking for a job I really like than risk being stuck for years, perhaps even decades, in a job I hate. Sure, it’s more work in the short term, but then that investment will give you years of a job you really like. Sounds worth it to me.

You are not learning or earning

There is a famous saying that you should either be "learning" or "earning" at your job, preferably both.

  • Learning – You should ideally be developing your skills and becoming a better professional. Looking back year on year, have you grown and become more proficient? If yes, you are learning; if not, you are not.
  • Earning – Earning the right amount or more for someone with your experience, skill level, and abilities compared to the market. It’s not abnormal for people to be underpaid, especially if they have been at the same company or job for a long time.

The ideal scenario is that you have both, and in that case, as I discussed earlier, there is no reason to leave unless the offer is just too good for you to reject.

If you are not getting both, then it's a no-brainer: you should leave, even if you think you are happy. Chances are you are not, because you are missing two fundamental pillars of any job.

The trickier bit is in the middle, where you have one but not the other. At this point, it becomes a very personal discussion. It depends on the extent of how bad or good one is to the other.

If you are getting paid good money but feel like you are not learning, this is easier to fix. You start by asking your line manager to assign you certain projects or maybe even move teams within the company to increase your learning and skillset.

If you have the capacity, you can also learn in your spare time and make an effort to implement that in your day-to-day work. The main point is that companies have no problem with employees wanting to improve in their roles and are happy to accommodate this.

If you are learning but not earning, this is harder and more political. I find money an unnecessarily taboo subject, particularly in the UK. So, I recommend opening up a dialogue with your manager about this.

Be honest and do your research to show that, given your experience and skill level, you think you are getting paid below market rate. If recruiters are contacting you offering you £X, mention that and say you want to stay but feel you are underpaid.

You shouldn’t feel awkward about this; at the end of the day, this is your livelihood, and you should be firm but reasonable. In most cases you can reach sensible agreement and it’s always worth asking.

From this report, you can see that job changers on average earn more than people who stay at their job. So moving jobs is often a viable strategy if you want more money.

Job changers and stayers, understanding earnings, UK – Office for National Statistics

There is no obvious growth

The final one is where you don’t see how you will progress, or there are no clear guidelines for moving up the ranks. You ideally want to advance in your career, and the company should have a clear framework for this.

It, of course, varies between companies; an established tech firm will have more structure than a startup, for example. So it’s essential to take all things into account.

This one is also reasonably solvable most of the time. You can ask your manager, head of department, or even the CTO about this issue, and it will likely be resolved because it's also in their best interest.

What you are mainly after here is feedback on areas that you need to improve to reach the next level for someone at your position and rank within the company.

However, if this doesn’t happen, you are kind of directionless, which is dangerous for you career. Your abilities and skills may dwindle over the next few years as you could be working on the wrong things, and that’s not a fortunate position to be in.

Summary & Further Thoughts

Leaving your job can be scary and risky, but what’s riskier is staying in a job that underpays you and you don’t enjoy. Taking the leap is not as bad as you think; most of the things we are scared to do are worse in our minds than in reality.

Another Thing!

I have a free newsletter, Dishing the Data, where I share weekly tips and advice as a practising data scientist. Plus, when you subscribe, you will get my FREE Data Science resume and short PDF version of my AI roadmap!

Dishing The Data | Egor Howell | Substack

Connect With Me

How to Transition Into Data Science-and Within Data Science https://towardsdatascience.com/how-to-transition-into-data-science-and-within-data-science-9d8aac763d10/ Thu, 05 Dec 2024 14:31:01 +0000 https://towardsdatascience.com/how-to-transition-into-data-science-and-within-data-science-9d8aac763d10/ Our weekly selection of must-read Editors' Picks and original features


Feeling inspired to write your first TDS post? We’re always open to contributions from new authors.

With January just around the corner, we’re about to enter prime career-moves season: that exciting time of the year when many data and machine learning professionals assess their career growth and explore new opportunities, and newcomers to the field plan the next steps towards landing their first job. (It’s also when companies tend to ramp up their hiring after the end-of-year lull.)

All this energy often comes with nontrivial amounts of uncertainty, stress, and the occasional moment of self-doubt. To help you calmly chart your own path and avoid unnecessary second-guessing (of yourself as well as of hiring teams, colleagues, and others), we put together a special edition of the Variable focused on career transitions for both new and current practitioners.

We never miss a chance to celebrate data scientists’ diverse professional and academic backgrounds, and the lineup of articles we’re presenting here reflects that range, too. Whether you’re thinking about a switch to management, are about to jump into your first startup job, or are in the midst of transitioning to Data Science from a totally different discipline, you’ll find some concrete, experience-based insights to learn from.


  • Rewiring My Career: How I Transitioned from Electrical Engineering to Data Engineering. When your goal is to jump across discipline lines, one of the toughest challenges is learning how to translate existing skills and knowledge and make their value apparent to prospective employers. Loizos Loizou's debut TDS article offers a detailed account of the author's successful repositioning from a trained electrical engineer to a data engineer—a change that is far more substantive than the title alone suggests.
  • Why STEM Is Important for Any Data Scientist. A background in the so-called hard sciences doesn't always map directly onto data-focused job descriptions. As Radmila M. explains, however, the benefits of applying your hard-earned STEM expertise once you've moved on to data science are many – and can manifest themselves in unexpected moments when traditional problem-solving approaches fail to produce the desired outcome.
  • From Data Scientist to Data Manager: My First 3 Months Leading a Team. After nearly seven years as a data scientist, Yu Dong took on a new challenge recently and stepped into a management role for the first time. In a thoughtful new post, Yu reflects on "what has changed, what I've enjoyed, and what's been challenging."
Photo by The Nix Company on Unsplash
  • Are You Sure You Want to Become a Data Science Manager? Tackling the management-track conundrum from a different angle, Jose Parreño encourages anyone who's considering a move away from an individual contributor role to think deeply about their motivations and goals, and to make an informed decision based on a realistic understanding of what becoming a manager actually entails.
  • Roadmap to Becoming a Data Scientist, Part 1: Maths. For aspiring data professionals who are still years away from debating their fit for a manager role, one of the perennial pain points remains the level and amount of math they need to master in order to start their journey on the right foot. Vyacheslav Efimov provides concrete pointers on what you should learn – and how to get started.
  • GenAI is Reshaping Data Science Teams. Setting yourself up for success doesn't involve a fixed formula; in fields as dynamic as data science and machine learning, the very definition of your role can evolve from one month to the next. This has been especially true in the past couple of years, as generative-AI tools and LLMs have transformed core workflows across industries. Anna Via wrote a focused synthesis of the challenges and opportunities this rapid change presents, and what data teams—and individuals within them—can do to stay nimble and adapt quickly.
  • How to Hire at Early-Stage Startups. It may sound counterintuitive that arriving at a new job with advanced educational credentials can sometimes make you less effective, but that's precisely the point Claudia Ng drives home in her latest article. While she writes with hiring managers in mind, her insights are particularly valuable for data science PhDs who can adjust their mindset accordingly, and prevent potentially mismatched expectations.
  • So It's Your First Year in AI; Here's What to Expect. Congratulations: you've landed your dream role at a buzzy AI startup. Now what? Based on his own personal experiences, Michael Zakhary seeks to demystify what the job might entail and to "offer a glimpse into the daily life of an ML engineer – whether you're working in a small, agile team or part of a larger, more structured organization."

Thank you for supporting the work of our authors! As we mentioned above, we love publishing articles from new authors, so if you’ve recently written an interesting project walkthrough, tutorial, or theoretical reflection on any of our core topics, don’t hesitate to share it with us.

Until the next Variable,

TDS Team

Third-Year Work Anniversary as a Data Scientist: Growth, Reflections and Acceptance https://towardsdatascience.com/third-year-work-anniversary-as-a-data-scientist-growth-reflections-and-acceptance-a72618ab99ec/ Tue, 19 Nov 2024 17:50:56 +0000 https://towardsdatascience.com/third-year-work-anniversary-as-a-data-scientist-growth-reflections-and-acceptance-a72618ab99ec/ A letter to myself and fellow data scientists

Dear Zijing,

Yesterday, your coworker messaged you that a celebration was held onsite for several work anniversaries, including yours. How time flies! At the end of the month, you will complete three years in this role.

Is it true that as we get older, time seems to pass faster? I heard one explanation that might make sense: when you are 10 years old, one year accounts for 10% of your life. However, when you reach 30, a year only represents about 3% of your life, which is why you feel time is speeding up. I thought about all those purposeless, long summer days you spent capturing tadpoles in the ponds and rewatching cartoons, believing the afternoon would never end. Now, each day slips away like sand falling through fingers.

Photo by Cesar Ramos on Unsplash

Maybe time has always passed by at a steady speed, but it’s passing by a different you. To have youth is to afford to waste every minute and still believe there is a sunrise tomorrow. As you age, the inner clock starts ticking louder, reminding you that you have one less minute to become who you want to be and do what you want to do.

Three years of working experience indeed adds weight to one’s early career. Besides a more presentable resume, you are constantly asking, with 1095 days behind, have you become closer to the ideal person you envision for yourself? Have you had the opportunity to pursue your passions? I hope this letter will help you answer these questions.


Where should I begin? First, I want to talk about growth. Remember when you first saw the job description for this role? You felt more scared than excited. Although you were thrilled about the opportunity to sharpen your skills and make a bigger impact, you couldn’t help but worry that the journey ahead would involve challenges you were not fully prepared to tackle.

There is this book I am currently reading: "Never Split the Difference," which focuses on negotiation techniques. It highlights that uncertainty aversion and loss aversion are two major drivers of irrational decision-making. I believe I have a better understanding of the anxiety you were experiencing back then. Accepting this job entails confronting uncertainties in every aspect of work and losing the comforts and familiarity of your previous role.

When you eventually decided to take it, I guess, ultimately, you believed that we only feel challenged while walking uphill, and moving upwards, as high as possible, is so crucial at early career stages. "If I get this job, I will learn a lot," you thought.

Looking back now, I realize the uphill path is even more challenging than you expected, but the scenery makes it all worthwhile. You survived, and you have indeed learned A LOT. Skill set development is a given. On top of that, I know you are more content with the confidence you now possess when facing challenges. You embrace challenges with open arms.

I am glad. There will always be something new, something hard, something uncertain ahead. An "I am ready to try" mindset, rather than "I am not going to make it" or "I am only trying if I can do it perfectly," contributes more to what you will get in the end, even more than your skillsets. When you decide to set off, the whole world clears a path for you.

Photo by Alex Woods on Unsplash

Then, I want to talk about reflection, which has always been a must for you. I know you reflect on gains and losses to guide and calibrate future directions.

I reflect on the 11 crunch weeks you have worked so far to deliver forecasts at the beginning of each quarter. Burnout is a real thing. During many frustrating and exhausting moments, you thought you would never deliver reasonable results on time but managed to push through. Throughout these weeks, you consistently challenged your limits: to be persistent, to be efficient, to be limitless. You learned that no matter how hard a task looks, you will find ways to manage it. You also learned how to handle work stress and live a colorful life outside of the eight hours. I think that's what makes your productivity sustainable.

Photo by Garrhet Sampson on Unsplash

I reflect on your struggles to make connections at work and build trust with stakeholders as an introvert who works remotely. You could build a kick-ass model but feel shy about communicating it and convincing others that it’s great, and they should use it. You were conflict-averse and not confident in expressing your opinions. "I must think thoroughly before I speak," you thought. Now I can tell you that this stage will pass. Building communication skills is no different from training a muscle. Certain factors, such as genetics, determine how much this "muscle" can develop, but your competitor is always just yourself – specifically, your past self. With an open mindset and enough practice, you will get there.

I reflect on your first experience as an interviewer, where you told the interviewee that, to be honest, it was your first time interviewing others. After being an interviewee in the tough job market numerous times, you were so used to being evaluated and underestimated that you were afraid you were not qualified enough to score others. Fighting against impostor syndrome, you told yourself these complicated deliverables were not generated by robots. You may not be the expert in every data science field (and you don't have to be), but you definitely have a fair share to say who you want to work with.

I reflect on your stumbling steps toward more senior-level responsibilities. You thought you never had to care about leadership as you had no interest in being a manager. However, mentoring gives you happiness and satisfaction, doesn't it? Just like writing here does. I guess leadership is not just about being the boss and assigning tasks. It's also about collaboration and enabling. If you are a tree trying to grow taller, leadership helps you develop branches, with which you extend horizontally. A taller tree is easier to spot from far away, but only a wider tree provides larger shade.

Photo by Oliver Olah on Unsplash

Lastly, I want to talk about acceptance. Three years is a long time, long enough to give you the experience to know yourself – what excites you, what motivates you, and what drains you. Rather than bending yourself to be someone you are not, it is time to accept who you are and surround yourself with people who truly accept you.

It took some time, but you became more aware of the differences among tasks: those you can do, those you are good at, and those you are excited to do. Generally speaking, there are two types of projects: those that go from 0 to 1 and those that progress from 1 to infinity.

The "0 to 1" project is building things from scratch – a new set of concepts, a new methodology, a new product, etc. The "1 to infinity" project is iterating through an existing solution and making it better – improve it, deploy it, scale it, etc. You can do both and are probably good at both, but you definitely feel more excited about the "0 to 1" projects. You enjoy the satisfaction of building things from scratch.

Photo by Khoi Do on Unsplash

Build, pass down, and move on would be your ideal workflow. Therefore, you need to work closely with people with different skill sets. They will help you with tasks you are not good at or do not prefer to do.

You also prefer to avoid repetitive tasks and inefficient temporary solutions. If given the luxury, you would not sacrifice quality for deadlines, though they motivate you to achieve results and put in extra effort. AI helps you ease the "dumb" tasks, and automation increases productivity. I am glad you are constantly trying to navigate towards what excites you and communicate your boundaries and preferences to those around you.


I have rambled on, but I believe you already know that the answer to both questions is "Yes." You have always been seeking answers. Through seeking, you calibrate, and you will find yourself. Three years is not short. Time puts wrinkles on your face, gives you irreplaceable experiences, and carves precious memories.

You are no longer young, but you are far from being old.

May you always be courageous to take on challenges and pursue what you love.

Best,

Zijing


Thanks for reading this far. I thought about the best way to convey the lessons learned throughout the three years. Since I have written several articles summarizing my data scientist Career Development along the way, I feel the need to have some variety in the format. I eventually wrote this as a letter, probably inspired by a book I recently read that was beautifully written, On Earth We’re Briefly Gorgeous. I hope this letter also inspires you to navigate through the early journey in your career. Check out other articles written by me:

I Got Promoted! How?

How I became a data scientist

Don’t be a data scientist if you…

Seven Principles I Follow To Be a Better Data Scientist

How To Up-Skill In Data Science https://towardsdatascience.com/how-to-up-skill-in-data-science-3f71fafeaab7/ Thu, 14 Nov 2024 14:02:03 +0000 https://towardsdatascience.com/how-to-up-skill-in-data-science-3f71fafeaab7/ My framework for continually becoming a better data scientist


Once you become a data scientist, that’s not the end; this is just the beginning.

A career in data science means you constantly need to be looking to improve due to the pace of the field. It doesn’t mean you need to work continually, but you should have some processes that allow you to keep improving regularly or at least at a rate desired by you.

In this article, I will explain my framework for up-skilling in data science, and hopefully it will clarify things or give you some ideas on how you can approach it as well.

Where do you want to go?

The first step in anything is deciding where you want to go. Saying you want to "up-skill" is vague, so you should be clear on your direction.

What I mean by direction is kind of up to you, but in my experience, it generally means these things:

  • Is there a particular area you want to learn?
  • Is there a technical tool you want to learn?
  • Is there a specific industry you want to get into?
  • Is there a specific role of data you want to work as?

Again, these are not all the options, but they give you a sense of how you should approach this stage. You essentially want to easily explain what you are up-skilling towards.

Once you have an end goal in sight, it’s much easier to navigate your "up-skilling," and you can always tweak your direction later on if need be.

As the famous saying goes

You can’t steer a stationary ship

Oh, and one more thing: If you want to up-skill in a way that helps you in the job market and likely increases your compensation, then I recommend keeping up with trends and investing time in learning the things that are popular or will be popular in years to come.

The elephant in the room is that learning GenAI and LLMs will benefit you in the current market, as that’s where investor money is going. I don’t recommend chasing trends purely for financial gain, as some intrinsic motivation should be involved. However, to each their own!

How do you get there?

Now you have a target you want to up-skill towards; you need a way of getting there.

Networking with individuals who have already reached your desired position is the most effective approach. You can get their advice, which will be tailored specifically to you.

For example, I want to pivot to being a Machine Learning Engineer, so I contacted my friend Kartik Singhal, a Senior Machine Learning Engineer at Meta, for his advice and guidance. He provided me with many resources and taught me how to approach my learning if I wanted to achieve this transition.

He has a great newsletter, The ML Engineer Insights, that I recommend you check out if you are interested in MLE stuff!

The ML Engineer Insights | Kartik Singhal | Substack

Even though I have an online presence that helps build these connections, you certainly don’t need one.

People frequently ask me for data science advice, and I always reply, giving them the best guidance I think would work for them.

You can literally message so many people, and chances are at least one person will reply! LinkedIn is by far the best site for this, but you can use many others, so don’t limit yourself.

If you don’t want to do that, chances are there are some free online resources, roadmaps and videos explaining how to reach your target. The only downside is that they won’t be personally tailored to you, but it probably doesn’t matter so much if you are a complete beginner.

As an example, if you want to learn LLMs, then Andrej Karpathy has probably the best course on this, and it's free on YouTube!

After you have all this information, create a learning plan or roadmap to clearly define your actions. These online resources will often already have one created for you.

I find people often over-complicate this step. All you need is a plan that heads you in the right direction. It doesn’t need to be the "best", whatever that means, but as long as it covers everything you think you need, it’s fine. Don’t overthink it.

What do you do?

The question now comes to how you make sure you stick to your plan and actually do the work required to up-skill.

As the book Atomic Habits made famous, it’s all about the systems you put in place.

You do not rise to the level of your goals. You fall to the level of your systems.

The first strategy I employ is blocking out time in my calendar specifically designated for up-skilling. I recommend at least two hours a week to make decent progress, but I would argue an hour a day is preferable if you can manage it.

I firmly believe that no matter who you are, there is some time in your week you could squeeze in learning. Don’t get me wrong, I understand it’s harder for some people than others, but if it’s something you want to prioritise, then you will figure out a way.

I have a separate article (linked below) explaining how to schedule time for learning like this and the steps you can follow.

How I Make Time for Everything (Even with a Full-Time Job)

If you are working at a company, ask to get involved in projects related to what you want to learn. For example, I am looking to pivot into machine learning engineering, so I asked my line manager if I could work on more projects focusing on the deployment and software engineering side.

You will be surprised how receptive people often are; all you have to do is ask! The worst they can say is no, and that's normally quite unlikely.

If your company can’t put you on specific projects, suggest you want some learning and development time in your work week. From my experience, many tech-based companies have this as a perk, as they also want their employees to grow. Not only does this benefit the employees, but also the company as they have more up-skilled workers.

This gives you flexibility and means you don’t have to learn outside of work hours if you don’t have time. Again, from my experience, many companies and management are pretty receptive to this, and I am sure most people will be on board with the idea. Suggest it to your line manager if you have time.

Useful Habits and Practises

The following are some helpful practises and habits that really help me continuously learn:

  • You should always have something that you are learning or want to learn in mind. I have a massive list of areas I want to learn more about that I constantly update!
  • Take time when learning a topic. Careers are long, like four decades. So, you can afford to be patient and understand it deeply, which will pay off in the long run.
  • Learning by doing and physical implementation is the best way to learn. Build something, don’t just take courses.
  • Employ radical focus; this is a superpower nowadays. Remove as many distractions as possible and concentrate fully in that time block.
  • Building a study schedule is a game-changer and really a non-negotiable. It helps you stick to your learning.
  • Chain the things you learn into sequences of related topics. For example, learn neural networks, then RNNs and CNNs, and finally LLMs.

Summary & Further Thoughts

Data Science is a career filled with continual learning, which you must do to stay on top of your game. This is both a blessing and a curse because it keeps the work interesting, but you must invest time and strategies to stay current. Hopefully, this article will give you some ideas and methods for staying sharp in data science!

Another Thing!

I have a free newsletter, Dishing the Data, where I share weekly tips and advice as a practising data scientist. Plus, when you subscribe, you will get my FREE data science resume and short PDF version of my AI roadmap!

Dishing The Data | Egor Howell | Substack

Connect With Me!

Top Data Science Career Questions, Answered https://towardsdatascience.com/top-data-science-career-questions-answered-abd3e3c085cc/ Sat, 09 Nov 2024 13:02:01 +0000 https://towardsdatascience.com/top-data-science-career-questions-answered-abd3e3c085cc/ I've been a data scientist for over 3 years. This is what most people want to know about the field.

Top Data Science Career Questions, Answered

Photo by Clay Banks on Unsplash

What does a data scientist do?

Most people are both impressed and confused when I tell them I’m a data scientist.

Impressed because it’s considered such a fancy and prestigious title nowadays (though some will still call us statisticians who can code).

Confused because … what does Data Science mean, really? And what do we do?

Well, it depends.

On the domain, the company, and the team itself.

But in general, data science encompasses the following categories of work:

  • Databases and data engineering — Many data scientists work closely with databases, whether that’s loading and querying large amounts of data, building data pipelines, or cleaning and preparing data for analysis. At my last company, I used SQL regularly to access our database in order to query data needed to build ML models. I also found myself creating and altering tables in order to store results from models and other analyses.
  • Data analytics and visualization — Data visualization involves not only analyzing the data but presenting it in a way that makes the information easy to interpret. Analytics can be a variety of things: Identifying and reporting key performance indicators (KPIs — for example, the percent increase in sales from one year to the next), trends (such as a correlation line chart) or other relevant conclusions about a dataset. Tableau is a popular tool used for data visualization and analytics as it easily allows you to create dashboards with lots of information. Great visualizations can also be built in Python or JavaScript.
  • Machine Learning and predictive modeling – This is what people typically imagine a data scientist to do – using statistical and mathematical models to predict some future outcome. And while it is personally my favorite part about the job, it’s not the only thing we do and there’s certainly a lot more that goes into ML behind the scenes. Lots of data retrieval, cleaning and preprocessing also needs to happen in order for this to be successful.

Data scientists may focus heavily on one, or all, of these areas. But regardless of what their specialty is, all data scientists are at least familiar with and capable of executing tasks in all 3 categories, and will likely do so many times throughout their career.

A Day in the Life of a Data Scientist

Photo by Unseen Studio on Unsplash

How can I get into data science?

The easiest and most efficient way is to get a college degree. Undergraduate or graduate.

I went this route and got an undergraduate degree. I think it's probably one of the few remaining college degrees that's actually worth it: since salaries for data science positions are on the higher side, you'll be able to pay off your debt faster.

Obviously, this is not the only way.

Many data scientists are self-taught. You can take online courses, earn certifications, and build up a repository of personal projects. This route takes a lot of hard work, dedication, and discipline.

I liked college because it gave me structure.

There’s also another route that’s kind of in between, and it’s one that I have seen many people do successfully.

That is, to get into data science through your current career.

At your current company, you most likely have a data science team somewhere.

Get in contact with them. Express your interest in the field. Ask them questions. Ask if someone can mentor you. Ask if you can collaborate on a project.

Then, ask your manager if the company will pay for you to take courses or certifications (Most will). You just may end up switching positions within your company and transitioning into the data science team.

And once you have this experience under your belt, you can transfer into a data science role at a new company, and continue to climb the ladder from there.

Photo by Fabian Blank on Unsplash

How much do data scientists make?

It depends on a few things.

  • Your geographical location. Country, state, and city.
  • Your education level. People with Masters degrees or PhDs tend to make more on average.
  • Your level of experience. Entry level data scientists will obviously make less than those who have been in the field for 10 years.

In the United States, the average data scientist salary is $122,861.

However, this is across all states, cities, and experience levels.

For more detailed information and advice on how to negotiate your first salary, as well as how to find out how much someone in your situation (geographical, educational, etc) should expect to make, check out the following article:

How to Negotiate Your Salary as a Data Scientist

Photo by Campaign Creators on Unsplash

How can I find a data science job in this difficult market?

I understand that it’s rough out here at the moment.

There are 3 main things you should focus on:

  1. Networking

This is one of the most effective things you can master in this day and age.

Referrals go a long way. When companies are looking to hire, staring at a resume just doesn’t have the same effect as being told by a live person at the company: "Hey, I know someone who would be a good fit for this role".

When an employer stares at your resume and has a familiar face to relate it back to, you are much more likely to get an interview.

When you’re searching for jobs, speak to your friends, family, and mutual friends. Search up their companies and see if they’re hiring.

And of course, make the most of your LinkedIn. Reach out to old friends, classmates, coworkers.

How to Network as a Data Scientist

2. Strengthening your personal brand

This includes your LinkedIn profile and other relevant social media profiles. Have a clean, readable profile with a nice professional picture.

Take some time to work on the bio for your About section. Keep your work experience, certifications, and projects up to date, and add relevant details.

Make some posts (or even just repost things). This shows that you have a passion and interest in the field.

LinkedIn is how recruiters find you and if you can capture their attention you will get more interviews.

3. Building a strong portfolio

Whether it’s making your own website, tidying up and populating your Github, or even starting a Medium blog, a portfolio is hard evidence that you can do the things you say you can on your resume.

This is especially important for people who have no data science work experience under their belt.

Doing personal data science projects, Kaggle competitions, or freelance work that you can publish to Github or some other website will really help employers and hiring managers to see tangible results from you and get a good idea of your skills.

What advice do you have for beginners?

Master the fundamentals.

Statistics, linear regressions, classification, data cleaning, preprocessing, and feature engineering.

All the fancy stuff will come. LLMs, neural networks, deep learning, sentiment analysis… it will all come much more naturally to you when you understand the building blocks of ML and data science.

Be patient, take it 1 day at a time, and embrace repetition. Embrace mistakes and bugs because they will happen, but eventually you’ll be able to spot them and resolve them much quicker, and you can use your energy for grasping more complicated concepts.

The more you do something, the more you understand the process inside and out, and the more confidence you build in yourself and your abilities.

So keep going.

Your First Year as a Data Scientist: A Survival Guide

Thanks for reading

Haden Pelletier – Medium

A 6-Month Detailed Plan to Build Your Junior Data Science Portfolio https://towardsdatascience.com/a-6-month-detailed-plan-to-build-your-junior-data-science-portfolio-a470ab79ee58/ Fri, 08 Nov 2024 12:01:59 +0000 https://towardsdatascience.com/a-6-month-detailed-plan-to-build-your-junior-data-science-portfolio-a470ab79ee58/ Step-by-step guide to creating, polishing, and deploying a portfolio that helps you land your first job

The post A 6-Month Detailed Plan to Build Your Junior Data Science Portfolio appeared first on Towards Data Science.

]]>
If you’ve just finished your degree or are looking for your first job, this article is for you. If you’re still working on your degree or haven’t started your Data Science journey yet, you might want to check out this article first.

As you know, the data science job market is more competitive than ever. Simply having a degree or academic projects isn’t enough to differentiate yourself from the crowd. You need practical, hands-on projects that show your skills in action.

For those who don’t know me, my journey started ten years ago with a degree in applied mathematics from an engineering school. Since then, I’ve worked across various industries, from water to energy, and spent time as a lecturer. I’ve also hired junior data scientists, and I’m here to show you how to build the perfect portfolio to help you land your first job.


On Today’s Menu 🍔

  • 🍛 How to plan your 6-month journey to create your Data Science Portfolio.
  • 🍔 The prep work to get started.
  • 🥤The 8 projects that will skyrocket your portfolio.
  • 🍰 Deploying your portfolio effectively.

Let’s talk about planning 📅

pixabay.com

If you’re pursuing a data science career, I’m sure you’re someone who enjoys scheduling and planning to stay on top of things. I’ve built this timeline assuming you’re in an active job-search phase and can dedicate 10 hours per week to building your portfolio. Of course, if you have more availability or are a bit busier, feel free to adjust the plan accordingly.

  • This plan starts in January 2025.
  • The hours allocated for each project assume that you’ve already taken a data science course and have basic knowledge of each topic. It’s okay if you haven’t worked with image/text data yet or haven’t used the cloud or set up a database. You should at least be familiar with Python, Pandas, NumPy, some visualization libraries, basic Machine Learning algorithms, and a bit of SQL.
Created by the author

With this schedule, you’ll still have two weeks available at the end of June. I’ve also assumed that over the next six months, you might take two weeks off ☀.

To get the most out of your Portfolio-building journey, I recommend setting up all necessary tools and accounts beforehand. This way, you can stay focused on your projects and data without interruptions. The only account I suggest creating later is for the cloud, as most providers offer a free tier that lasts about one month, and you’ll want to save that for deployment.

Your 5-Hour Prep Work to Get Everything Ready ⏱✨

1. Install Anaconda or Miniconda: Conda is essential for managing packages and environments. Install either distribution to get started.

2. Prepare Your Conda Environments: Familiarize yourself with basic conda commands (this isn’t the focus of this tutorial). Then, create the following environments to avoid library-installation issues over the next few months:

  • Machine Learning Projects Environment: Install Pandas, NumPy, Scikit-Learn, StatsModels, Seaborn, Matplotlib, and Plotly.
  • SQL Project Environment: Install the necessary packages to connect Python to your database.
  • Deep Learning Projects Environment: For image and text data, install TensorFlow and libraries needed for data preparation and feature extraction.
  • Deployment and Monitoring Projects Environment: Install ML packages along with MLflow and FastAPI. Later, add packages for your chosen cloud provider (e.g., Azure, AWS) as needed.
conda create -n ml_env 
conda create -n sql_env 
conda create -n dl_env 
conda create -n deploy_env

Once your environments are created, activate each one individually and install the necessary packages using the requirements.txt files. Feel free to change the packages if you prefer other libraries, but the ones included should cover most of your needs.
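For example, to prepare the first environment (assuming a requirements.txt file in each project folder):

conda activate ml_env
pip install -r requirements.txt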

3. Install VS Code

  • VS Code is a great code editor that integrates well with Python and Jupyter Notebooks. Download and install it from https://code.visualstudio.com/.
  • Install the Jupyter plugin for VS Code to work with Notebooks.
  • Open a notebook and ensure you know how to switch between environments in VS Code: Open the Command Palette in VS Code (Cmd/Ctrl + Shift + P), then select Python: Select Interpreter. You should see the environments you’ve created listed here.

4. Set Up GitHub

  • If you don’t have a GitHub account, create one now. If you already do, go to GitHub and create a repository for each project (you will have 8 projects to work on for your portfolio; the project names are listed below).
  • Back in your VS Code terminal, or any terminal you prefer, navigate to a main directory where you’ll store all your projects. You might name it "Portfolio." Clone each repository one by one into this directory:
git clone <your-repo-url>

5. Install MySQL with the Sakila Database

  • SQL skills are essential for data science, and the Sakila database is a great resource for practicing SQL queries. To install MySQL, download the installer from MySQL’s website.
  • Then, download the Sakila database using this guide: How to Use the Sakila Database in MySQL. (You will use it in one of your projects.)
  • Now, open any test notebook and confirm that you can connect Python to MySQL.
from sqlalchemy import create_engine

# Create a connection to the MySQL database
engine = create_engine("mysql://root:root@localhost:3306/sakila", echo=True)
conn = engine.connect()
print(engine)
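To confirm the setup end to end, you can run a small test query (the film table is part of the standard Sakila schema):

import pandas as pd

# Read a few rows from the film table to verify the connection
print(pd.read_sql("SELECT title, rental_rate FROM film LIMIT 5", conn))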

6. Install Tableau and Create your Cloud Accounts

  • Tableau is ideal for data visualization and creating interactive dashboards, which you’ll need for Project 2. Download Tableau Public (free) from Tableau’s website.
  • Cloud Accounts: At this point, start considering which cloud service you’d like to use. I personally recommend Azure because it’s user-friendly and easier to debug. Heroku is also a great option for API deployment, and if you have a GitHub Student account, you can use it for free!

Bravo! You’ve successfully set up your working environment, and you’re now ready to begin your 6-month journey to build your portfolio. Let’s jump in.


Project 1: Analyze Global Education and Economic Data

Goal: Analyze global education and economic data to understand trends and identify key indicators. This data is challenging and requires advanced data preparation skills.

Data Source: World Bank Education Statistics (EdStats)

Steps:

  • Import the datasets using pandas.
  • Display summary statistics and distribution plots.
  • Quickly select the relevant columns, countries, and years to work with (the dataset can be very complex; try to reduce it at the beginning).
  • Merge the necessary files using different merging techniques.
  • Analyze missing data and detect outliers. Replace NaNs using simple methods such as the median.
  • Create various charts to visualize data availability by year.
  • Reshape the data into a tidy, analysis-ready format using advanced pandas functions like pivot and melt to organize data by country, indicator, and year (see the sketch after the library list).
  • Standardize the data and compute scores to capture trends across regions.
  • Explore advanced methods for data imputation (e.g., K-Nearest Neighbors) and feature reduction (e.g., Principal Component Analysis).
  • Use advanced statistical tests (e.g., ANOVA) to understand the relationships between the data.

Libraries: pandas, numpy, matplotlib, seaborn, scipy, missingno, sklearn, statsmodels, plotly.
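As a minimal sketch of the reshaping step, here is how melt and pivot_table work together (the column names and values below are illustrative, not the exact EdStats headers):

import pandas as pd

# Hypothetical EdStats-style extract: one column per year
df = pd.DataFrame({
    "Country Name": ["France", "Brazil"],
    "Indicator Name": ["Literacy rate", "Literacy rate"],
    "2010": [99.0, 90.2],
    "2015": [99.1, 92.6],
})

# melt: turn the year columns into a long (country, indicator, year, value) table
long_df = df.melt(id_vars=["Country Name", "Indicator Name"],
                  var_name="Year", value_name="Value")

# pivot_table: one row per country-year, one column per indicator
tidy = long_df.pivot_table(index=["Country Name", "Year"],
                           columns="Indicator Name", values="Value")
print(tidy)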

Project 2: Python, MySQL, and Tableau with Sakila database

pixabay.com

Goal: Analyze and visualize data from the Sakila database using SQL and Tableau.

Data Source: Sakila Database (MySQL)

Methodology: Design an ER model, execute advanced SQL queries, connect to Tableau, and build interactive dashboards.

Steps:

1. Connect Python to MySQL:

  • Use SQLAlchemy to connect Python to the MySQL Sakila database.
  • Run advanced SQL queries to extract key insights, such as top actors, revenue by category, and rental patterns (a sample query follows the tools list).

2. Connect Tableau to MySQL:

  • In Tableau, connect to the local Sakila MySQL database.
  • Import relevant tables and set up relationships as needed for analysis.

3. Build Visualizations in Tableau:

  • Revenue by Category: Create a bar chart showing revenue distribution by category.
  • Top Movies by Rentals: Display a chart of the most rented movies.
  • Store Performance: Visualize rental counts and revenue for each store.
  • Time-Series Trends: Analyze rental activity over time with a line chart.

4. Dashboard Creation:

  • Combine visualizations into a Tableau dashboard, adding filters for interactive insights.
  • Save your dashboard as a Tableau file. Consider publishing it on Tableau Public to get a free URL, or take screenshots to include in a presentation.

Libraries and tools: pandas, SQLAlchemy (for the MySQL connection), Tableau.
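As a sample of the kind of advanced query to aim for (written against the standard Sakila schema, with the credentials from the prep-work snippet):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql://root:root@localhost:3306/sakila")

# Revenue by film category: join payments back through rentals to categories
query = """
SELECT c.name AS category, ROUND(SUM(p.amount), 2) AS revenue
FROM payment p
JOIN rental r ON p.rental_id = r.rental_id
JOIN inventory i ON r.inventory_id = i.inventory_id
JOIN film_category fc ON i.film_id = fc.film_id
JOIN category c ON fc.category_id = c.category_id
GROUP BY c.name
ORDER BY revenue DESC;
"""
print(pd.read_sql(query, engine))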

Project 3: Predicting Energy Consumption

pixabay.com

Goal: Predict energy consumption of buildings to aid in climate action goals.

Data Source: Seattle’s 2016 Building Energy Benchmarking

Methodology: Use machine learning to analyze and predict building energy consumption.

Steps:

  • Clean and preprocess data.
  • Perform feature engineering to create meaningful variables.
  • Normalize numerical features and encode categorical variables.
  • Split data into training and testing sets.
  • Benchmark different models (e.g., Ridge/Lasso regressions, SVM, RandomForest, XGBoost); a benchmarking sketch follows the library list.
  • Tune the hyperparameters to optimize the model’s performance metrics.
  • Evaluate with complementary metrics: R², RMSE, and MAE.
  • Select the best model.
  • Evaluate the model on test data and analyze the coherence with the train data.
  • Interpret model results and feature importance.

Libraries: pandas, numpy, matplotlib, seaborn, sklearn, shap, plotly.
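Here is a minimal sketch of the benchmarking and tuning steps, using synthetic data as a stand-in for the Seattle dataset:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the building-energy features
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Two candidate models, each with a small hyperparameter grid
candidates = {
    "ridge": GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5),
    "random_forest": GridSearchCV(RandomForestRegressor(random_state=42),
                                  {"n_estimators": [100, 300]}, cv=5),
}

# Fit each search, then compare R2, RMSE, and MAE on the held-out test set
for name, search in candidates.items():
    search.fit(X_train, y_train)
    pred = search.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"{name}: best={search.best_params_} "
          f"R2={r2_score(y_test, pred):.3f} RMSE={rmse:.2f} "
          f"MAE={mean_absolute_error(y_test, pred):.2f}")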

Project 4: Customer Segmentation

pixabay.com

Goal: Segment customers using clustering techniques to identify distinct groups within Brazilian e-commerce data.

Data Source: Brazilian E-Commerce Public Dataset by Olist

Methodology: Apply unsupervised learning to discover customer segments based on purchasing behaviour.

Steps:

  • Merge and clean data.
  • Conduct feature engineering to extract useful features.
  • Experiment with clustering algorithms (KMeans, DBSCAN, AgglomerativeClustering).
  • Optimize hyperparameters (e.g., the number of clusters) based on the silhouette score, Davies-Bouldin index, and distortion metrics (see the sketch after the library list).
  • Analyze and visualize clusters.

Libraries: pandas, numpy, matplotlib, seaborn, sklearn, yellowbrick.
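A minimal sketch of the cluster-count search (synthetic blobs stand in for the engineered customer features):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for engineered features (recency, frequency, monetary, ...)
X, _ = make_blobs(n_samples=1000, centers=4, n_features=5, random_state=42)
X = StandardScaler().fit_transform(X)

# Silhouette: higher is better. Davies-Bouldin: lower is better.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}  "
          f"davies_bouldin={davies_bouldin_score(X, labels):.3f}")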

Project 5: Image Classifier

pixabay.com

Goal: Implement a deep learning model to classify images from the STL-10 dataset.

Data Source: STL-10 Image Recognition Dataset

Methodology: Explore and apply convolutional neural networks (CNNs) and transfer learning for image classification.

Steps:

  • Explore the dataset.
  • Preprocess images (resize, normalize).
  • Build a CNN from scratch.
  • Apply transfer learning with pretrained models (a sketch follows the library list).
  • Train the model and adjust hyperparameters.
  • Evaluate model accuracy and make predictions.

Libraries: pandas, numpy, matplotlib, tensorflow, keras, cv2, skimage.
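A minimal transfer-learning sketch (MobileNetV2 is one possible backbone, not the only choice; STL-10 images are 96x96 RGB with 10 classes):

import tensorflow as tf
from tensorflow.keras import layers

# Frozen pretrained backbone plus a small classification head
base = tf.keras.applications.MobileNetV2(input_shape=(96, 96, 3),
                                         include_top=False, weights="imagenet")
base.trainable = False  # freeze pretrained weights for the first training phase

model = tf.keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.2),
    layers.Dense(10, activation="softmax"),  # 10 STL-10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()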

Project 6: Stack Overflow Questions Tags

pixabay.com

Goal: Predict tags for Stack Overflow questions using NLP techniques.

Data Source: Stack Overflow API or dataset.

Methodology: Utilize natural language processing to classify text data into multiple tags.

Steps:

  • Clean the text data (e.g., remove HTML tags, code snippets, and stop words).
  • Extract features with TF-IDF.
  • Apply ML algorithms (e.g., Logistic Regression) for multi-label classification (see the sketch after the library list).
  • Experiment with feature extraction using advanced NLP models (BERT, Doc2Vec).
  • Evaluate model performance and adjust hyperparameters.
  • Use the same metric (e.g., accuracy) to compare the classical and the more advanced methods.

Libraries: pandas, numpy, matplotlib, tensorflow, gensim, spacy, transformers.
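A minimal sketch of the classical baseline (the questions and tags below are toy examples; the real data comes from the Stack Overflow source above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

questions = [
    "How do I merge two dataframes in pandas?",
    "Segmentation fault when freeing a pointer in C",
    "Train a CNN with Keras on images",
]
tags = [["python", "pandas"], ["c"], ["python", "keras"]]

# Binarize the tag lists into a multi-label indicator matrix
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# TF-IDF features + one binary logistic regression per tag
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(questions, Y)

pred = clf.predict(["plot a dataframe column with pandas"])
print(mlb.inverse_transform(pred))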

Project 7: API and Dashboard

pixabay.com

Goal: Deploy a model as an API and build a dashboard for real-time interaction.

Data Source: Use the model from Project 6.

Methodology: Serialize a machine learning model and deploy it via an API, then build a dashboard with Streamlit for user interaction.

Steps:

  • Serialize the model using joblib.
  • Create an API with FastAPI for model inference (a minimal sketch follows the tools list).
  • Develop a dashboard using Streamlit to interact with the API.
  • Deploy the API and dashboard on platforms like Heroku or a cloud service. (At this step, ensure you have your cloud account set up. If you want to go further, you can also install Docker and use it before deployment.)

Libraries and tools: fastapi, streamlit, pickle, joblib, docker, azure/heroku.
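A minimal API sketch, assuming the Project 6 pipeline was serialized to a file named model.joblib (the file name and input schema are illustrative):

# api.py
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # serialized pipeline from Project 6

class Question(BaseModel):
    text: str

@app.post("/predict")
def predict(question: Question):
    tags = model.predict([question.text])
    return {"tags": tags.tolist()}

# Run locally with: uvicorn api:app --reload

The Streamlit dashboard then simply sends user input to /predict and displays the returned tags.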

Project 8: Monitoring and MLOps

pixabay.com

Goal: Implement MLOps practices including model monitoring, experiment tracking, and automated deployment.

Data Source: Use Project 7.

Methodology: Integrate MLflow for experiment tracking, and automate deployment with GitHub Actions.

Steps:

  • Set up MLflow for experiment tracking and model versioning (a minimal sketch follows the library list).
  • Deploy model artifacts to a cloud storage solution.
  • Configure GitHub Actions for CI/CD to automate the API deployment (if you want to go further, you can also use Docker for this step).

Libraries: MLflow, GitHub Actions, Azure, pickle, joblib, docker.
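A minimal tracking sketch (the experiment name, parameters, and metric value are illustrative):

import mlflow

mlflow.set_experiment("stackoverflow-tags")

with mlflow.start_run():
    # Log the configuration and the resulting metric for this run
    mlflow.log_param("vectorizer", "tfidf")
    mlflow.log_param("model", "logistic_regression")
    mlflow.log_metric("accuracy", 0.87)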


Woohoo, congrats! 🎉 You’ve reached the final stage of your portfolio. Now it’s time to refine everything. Don’t forget that you have 10 hours planned for that. Follow these steps:

1. Clean and Document

  • Add markdown explanations to your notebooks as you go.
  • Add comments in your Python files and clean or optimize the code where needed.

2. Push to GitHub

  • Upload all projects to GitHub.
  • Include a requirements.txt file for each project.

3. Create a README for Each Project that contains:

  • Project Title, Description, and Data Sources
  • Installation instructions
  • Project Structure
  • Results and Analysis
  • Limitations and Future Work

4. Dashboard Projects

  • If the dashboard URL isn’t accessible, take screenshots and create a clean slide deck. Add these to GitHub.

You can stop here, as you should now have a complete and clear portfolio on GitHub. However, if you want to stand out even more, consider deploying your portfolio on a website 🌐 . I personally suggest the following options:

  • GitHub Pages: Start here for a quick, free portfolio site. It’s ideal for linking to your GitHub repositories, sharing project descriptions, and organizing your work in one place.
  • Streamlit: Use this for interactive projects to showcase your data science and machine learning skills. It’s easy to deploy apps directly from GitHub, adding a dynamic layer to your portfolio.
  • Wix or WordPress: Consider these later if you want a polished, customizable site with extra content like a bio or blog posts. They’re perfect for creating a visually engaging portfolio without coding.

Some advice📌 :

  • Before Jan 2025: Complete your prep work and set up your environment (install necessary tools, create GitHub repositories, and set up virtual environments).
  • Plan your time: Schedule 10 hours a week for your projects and try to keep a consistent weekly time slot.
  • Break down each project into smaller tasks. Map out each step in detail before starting each project.
  • Ask for code reviews at the end of each project from someone in your data network.
  • Push regularly: Don’t wait until the end to tidy up code or push to GitHub. Committing often avoids losing work.
  • Stay flexible: Adjust your plan based on other commitments (e.g., work, internship, job research).
  • Keep a learning log: After each project, write down what you’ve learned, with a brief explanation. Note any concepts you used but didn’t fully understand so you can revisit them later. This log will be valuable for interview prep.
  • Use Notion, Word, or Google Drive: To track your progress, keep your log somewhere reliable so you won’t lose it.

I’ve mentored hundreds of junior data scientists and hired for various teams on behalf of my clients. If you follow this portfolio plan, it’ll make your journey much smoother.

Keep learning, stay positive, and you’ll do great! Good luck!

Thank you for reading!

Note: Some parts of this article were initially written in French and translated into English with the assistance of ChatGPT.

If you found this article informative and helpful, please don’t hesitate to 👏 and follow me on Medium | LinkedIn.

The post A 6-Month Detailed Plan to Build Your Junior Data Science Portfolio appeared first on Towards Data Science.

]]>