Creating an AI Agent to Write Blog Posts with CrewAI
https://towardsdatascience.com/creating-an-ai-agent-to-write-blog-posts-with-crewai/
How to assemble a crew of AI agents with CrewAI and Python

Introduction

I love writing. You may have noticed that if you follow me or my blog. For that reason, I am constantly producing new content and talking about Data Science and Artificial Intelligence.

I discovered this passion a couple of years ago when I was just starting my path in Data Science, learning and evolving my skills. At that time, I heard some more experienced professionals in the area saying that a good study technique was practicing new skills and writing about them somewhere, teaching whatever you learned.

In addition, I had just moved to the US, and nobody knew me here. So I had to start somewhere, creating my professional image in this competitive market. I remember I talked to my cousin, who’s also in the Tech industry, and he told me: write blog posts about your experiences. Tell people what you are doing. And so I did. 

And I never stopped.

Fast forward to 2025: now I have almost two hundred published articles, many of them with Towards Data Science, a published book, and a good audience.

Writing helped me so much in the Data Science area.

Most recently, one of my interests has been the amazing subjects of Natural Language Processing and Large Language Models. Learning about how these modern models work is fascinating.

That interest led me to experiment with Agentic AI as well. So, I learned about CrewAI, an open-source package that helps us build AI agents in a fun and easy way, with little code. I decided to test it by creating a crew of agents to write a blog post and then see how that would go.

In this post, we will learn how to create those agents and make them work together to produce a simple blog post.

Let’s do that.

What is a Crew?

A crew of AI agents is a combination of two or more agents, each of them performing a task towards a final goal.

In this case study, we will create a crew that will work together to produce a small blog post about a given topic that we will provide.

Crew of Agents workflow. Image by the author

The flow works like this:

  1. We choose a given topic for the agents to write about.
  2. Once the crew is started, the first agent goes to the knowledge base, reads some of my previously written articles, and tries to mimic my writing style. It then generates a set of style guidelines and passes them to the next agent.
  3. Next, the Planner agent takes over and searches the Internet looking for good content about the topic. It creates a plan of content and sends it to the next agent.
  4. The Writer agent receives the writing plan and executes it according to the context and information received.
  5. Finally, the content is passed to the last agent, the Editor, who reviews the content and returns the final document as the output.

In the following section, we will see how this can be created.

Code

CrewAI is a great Python package because it simplifies the code for us. So, let’s begin by installing the two needed packages.

pip install crewai crewai-tools

Next, if you want, you can follow the instructions on their Quickstart page and have a full project structure created for you with just a couple of commands on a terminal. Basically, it will install some dependencies and generate the folder structure suggested for CrewAI projects, along with some template .yaml and .py files.

I personally prefer to create those myself, but it is up to you. The page is listed in the References section.

Folder Structure

So, here we go.

We will create these folders:

  • knowledge
  • config

And these files:

  • In the config folder: create the files agents.yaml and tasks.yaml
  • In the knowledge folder: this is where I will add the files with my writing style.
  • In the project root: create crew.py and main.py.
Folders structure. Image by the author.
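For reference, this is roughly how the final layout looks, with the same names described above:

project_root/
├── knowledge/        # PDFs with samples of my writing style
├── config/
│   ├── agents.yaml
│   └── tasks.yaml
├── crew.py
└── main.py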

Make sure to create the folders with the names mentioned, as CrewAI looks for agents and tasks inside the config folder and for the knowledge base within a knowledge folder.

Next, let us set our agents. 

Agents

The agents are composed of:

  • Name of the agent: writer_style
  • Role: LLMs are good role players, so here you can tell them which role to play.
  • Goal: tell the model what the goal of that agent is.
  • Backstory: Describe the story behind this agent, who it is, what it does. 
writer_style:
  role: >
    Writing Style Analyst
  goal: >
    Thoroughly read the knowledge base and learn the characteristics of the crew, 
    such as tone, style, vocabulary, mood, and grammar.
  backstory: >
    You are an experienced ghost writer who can mimic any writing style.
    You know how to identify the tone and style of the original writer and mimic 
    their writing style.
    Your work is the basis for the Content Writer to write an article on this topic.

I won’t bore you with all the agents created for this crew. I believe you got the idea. It is a set of prompts explaining to each agent what they are going to do. All the agents’ instructions are stored in the agents.yaml file.

Think of it as if you were a manager hiring people to create a team. Think about what kinds of professionals you would need, and what skills are needed.

We need 4 professionals who will work towards the final goal of producing written content: (1) a Writer Stylist, (2) a Planner, (3) a Writer, and (4) an Editor.

If you want to see the setup for them, just check the full code in the GitHub repository.

Tasks

Now, back to the analogy of the manager hiring people: once we “hired” our entire crew, it is time to separate the tasks. We know that we want to produce a blog post and that we have 4 agents, but what will each of them do?

Well, that will be configured in the file tasks.yaml.

To illustrate, let me show you the code for the Writer agent. Once again, these are the parts needed for the prompt:

  • Name of the task: write
  • Description: The description is like telling the professional how you want that task to be performed, just like we would tell a new hire how to perform their new job. Give precise instructions to get the best result possible.
  • Expected output: This is how we want to see the output. Notice that I give instructions like the size of the blog post, the quantity of paragraphs, and other information that helps my agent to give me the expected output. 
  • Agent to perform it: Here, we are indicating the agent who will perform this task, using the same name set in the agents.yaml file.
  • Output file: Not always applicable, but when it is, this is the argument to use. We asked for a markdown file as output.
write:
  description: >
    1. Use the content plan to craft a compelling blog post on {topic}.
    2. Incorporate SEO keywords naturally.
    3. Sections/Subtitles are properly named in an engaging manner. Make sure 
    to add Introduction, Problem Statement, Code, Before You Go, References.
    4. Add a summarizing conclusion - This is the "Before You Go" section.
    5. Proofread for grammatical errors and alignment with the writer's style.
    6. Use analogies to make the article more engaging and complex concepts easier
    to understand.
  expected_output: >
    A well-written blog post in markdown format, ready for publication.
    The article must be within a 7 to 12 minutes read.
    Each section must have at least 3 paragraphs.
    When writing code, you will write a snippet of code and explain what it does. 
    Be careful to not add a huge snippet at a time. Break it in reasonable chunks.
    In the examples, create a sample dataset for the code.
    In the Before You Go section, you will write a conclusion that is engaging
    and factually accurate.
  agent: content_writer
  output_file: blog_post.md

After the agents and tasks are defined, it is time to create our crew flow.

Coding the Crew

Now we will create the file crew.py, where we will translate the previously presented flow to Python code.

We begin by importing the needed modules.

#Imports
import os
from crewai import Agent, Task, Process, Crew, LLM
from crewai.project import CrewBase, agent, crew, task
from crewai.knowledge.source.pdf_knowledge_source import PDFKnowledgeSource
from crewai_tools import SerperDevTool

We will use the basic Agent, Task, Crew, Process, and LLM classes to create our flow. PDFKnowledgeSource will help the first agent learn my writing style, and SerperDevTool is the tool to search the internet. For that one, make sure to get your API key at https://serper.dev/signup.

A best practice in software development is to keep your API keys and configuration settings separate from your code. We’ll use a .env file for this, providing a secure place to store these values. Here’s the command to load them into our environment.

from dotenv import load_dotenv
load_dotenv()
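For reference, the .env file would contain entries like the ones below. The OpenAI and Groq variable names match what the LLM setup further down reads with os.environ.get(); the Serper variable name is the one the tool conventionally expects, so double-check it against the crewai-tools documentation. The values are placeholders, not real keys.

# .env (placeholder values only - use your own keys)
OPENAI_API_KEY=sk-your-openai-key
GROQ_API_KEY=gsk-your-groq-key
SERPER_API_KEY=your-serper-key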

Then, we will use the PDFKnowledgeSource to show the Crew where to search for the writer’s style. By default, that tool looks at the knowledge folder of your project, thus the importance of the name being the same.

# Knowledge sources

pdfs = PDFKnowledgeSource(
    file_paths=['article1.pdf',
                'article2.pdf',
                'article3.pdf'
                ]
)

Now we can set up the LLM we want to use for the Crew. It can be any model. I tested a bunch of them, and the ones I liked the most were qwen-qwq-32b and gpt-4o. If you choose OpenAI’s, you will need an API key as well. For Qwen-QWQ, just uncomment its lines and comment out OpenAI’s; in that case, you need an API key from Groq.

# LLMs

llm = LLM(
    # model="groq/qwen-qwq-32b",
    # api_key= os.environ.get("GROQ_API_KEY"),
    model= "gpt-4o",
    api_key= os.environ.get("OPENAI_API_KEY"),
    temperature=0.4
)

Now we have to create a Crew Base, showing where CrewAI can find the agents and tasks configuration files.

# Creating the crew: base shows where the agents and tasks are defined

@CrewBase
class BlogWriter():
    """Crew to write a blog post"""
    agents_config = "config/agents.yaml"
    tasks_config = "config/tasks.yaml"

Agents Functions

And we are ready to create the code for each agent. Each one is composed of a decorator @agent to show that the following function is an agent. We then use the class Agent and indicate the name of the agent in the config file and the level of verbosity (1 is low, 2 is high; a Boolean value also works).

Lastly, we specify if the agent uses any tool, and what model it will use.

# Configuring the agents
    @agent
    def writer_style(self) -> Agent:
        return Agent(
                config=self.agents_config['writer_style'],
                verbose=1,
                knowledge_sources=[pdfs]
                )

    @agent
    def planner(self) -> Agent:
        return Agent(
        config=self.agents_config['planner'],
        verbose=True,
        tools=[SerperDevTool()],
        llm=llm
        )

    @agent
    def content_writer(self) -> Agent:
        return Agent(
        config=self.agents_config['content_writer'],
        verbose=1
        )

    @agent
    def editor(self) -> Agent:
        return Agent(
        config=self.agents_config['editor'],
        verbose=1
        )

Tasks Functions

The next step is creating the tasks. Similarly to the agents, we will create a function and decorate it with @task. We use the class Task to inherit CrewAI’s functionality and then point to the corresponding entry in our tasks.yaml file for each task created. If an output file is expected, use the output_file argument.

# Configuring the tasks    

    @task
    def style(self) -> Task:
        return Task(
        config=self.tasks_config['mystyle'],
        )

    @task
    def plan(self) -> Task:
        return Task(
        config=self.tasks_config['plan'],
        )

    @task
    def write(self) -> Task:
        return Task(
        config=self.tasks_config['write'],
        output_file='output/blog_post.md' # This is the file that will contain the final blog post.
        )

    @task
    def edit(self) -> Task:
        return Task(
        config=self.tasks_config['edit']
        )

Crew

To glue everything together, we now create a function and decorate it with the @crew decorator. That function will line up the agents and the tasks in the order to be performed, since the process chosen here is the simplest: sequential. In other words, everything runs in sequence, from start to finish.

    @crew
    def crew(self) -> Crew:
        """Creates the Blog Post crew"""

        return Crew(
            agents= [self.writer_style(), self.planner(), self.content_writer(), self.editor()],
            tasks= [self.style(), self.plan(), self.write(), self.edit()],
            process=Process.sequential,
            verbose=True
        )

Running the Crew

Running the crew is very simple. We create the main.py file and import the Crew Base BlogWriter we created. Then we just call crew().kickoff(inputs) to run it, passing a dictionary with the inputs used to generate the blog post.

# Script to run the blog writer project

# Warning control
import warnings
warnings.filterwarnings('ignore')
from crew import BlogWriter


def write_blog_post(topic: str):
    # Instantiate the crew
    my_writer = BlogWriter()
    # Run
    result = (my_writer
              .crew()
              .kickoff(inputs = {
                  'topic': topic
                  })
    )

    return result

if __name__ == "__main__":

    write_blog_post("Price Optimization with Python")
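With everything in place, running the whole flow is just a matter of executing the main script from the project root (assuming the folder layout shown earlier):

python main.py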

There it is. The result is a nice blog post created by the LLM. See below.

Resulting blog post. GIF by the author.

That is so nice!

Before You Go

Before you go, know that this blog post was 100% written by me. The crew I created was an experiment I wanted to run to learn more about how to create AI agents and make them work together. And, like I said, I love writing, so the output is something I am able to read and assess for quality.

My opinion is that this crew still didn’t do a very good job. The agents completed the tasks successfully, but they gave me a very shallow post and shallow code. I would not publish it as is, but at least it could be a starting point.

From here, I encourage you to learn more about CrewAI. I took their free course where João de Moura, the creator of the package, shows us how to create different kinds of crews. It is really interesting.

GitHub Repository

https://github.com/gurezende/Crew_Writer

About Me

If you want to learn more about my work, or follow my blog (it is really me!), here are my contacts and portfolio.

https://gustavorsantos.me

References

[Quickstart CrewAI](https://docs.crewai.com/quickstart)

[CrewAI Documentation](https://docs.crewai.com/introduction)

[GROQ](https://groq.com/)

[OpenAI](https://openai.com)

[CrewAI Free Course](https://learn.crewai.com/)

LLM + RAG: Creating an AI-Powered File Reader Assistant
https://towardsdatascience.com/llm-rag-creating-an-ai-powered-file-reader-assistant/
How to create a chatbot to answer questions about a file’s content

Introduction

AI is everywhere. 

It is hard not to interact at least once a day with a Large Language Model (LLM). The chatbots are here to stay. They’re in your apps, they help you write better, they compose emails, they read emails…well, they do a lot.

And I don’t think that that is bad. In fact, my opinion is the other way – at least so far. I defend and advocate for the use of AI in our daily lives because, let’s agree, it makes everything much easier.

I don’t have to spend time double-reading a document to find punctuation problems or typos. AI does that for me. I don’t waste time writing that follow-up email every single Monday. AI does that for me. I don’t need to read a huge and boring contract when I have an AI to summarize the main takeaways and action points for me!

These are only some of AI’s great uses. If you’d like to know more use cases of LLMs to make our lives easier, I wrote a whole book about them.

Now, thinking as a data scientist and looking at the technical side, not everything is that bright and shiny. 

LLMs are great for several general use cases that apply to anyone or any company. For example, coding, summarizing, or answering questions about general content created until the training cutoff date. However, when it comes to specific business applications, a single purpose, or something new that didn’t make the cutoff date, the models won’t be that useful out of the box – meaning, they will not know the answer. Thus, they will need adjustments.

Training an LLM model can take months and millions of dollars. What is even worse is that if we don’t adjust and tune the model to our purpose, there will be unsatisfactory results or hallucinations (when the model’s response doesn’t make sense given our query).

So what is the solution, then? Spending a lot of money retraining the model to include our data?

Not really. That’s when the Retrieval-Augmented Generation (RAG) becomes useful.

RAG is a framework that combines getting information from an external knowledge base with large language models (LLMs). It helps AI models produce more accurate and relevant responses.

Let’s learn more about RAG next.

What is RAG?

Let me tell you a story to illustrate the concept.

I love movies. For some time in the past, I knew which movies were competing for the best movie category at the Oscars or the best actors and actresses. And I would certainly know which ones got the statue for that year. But now I am all rusty on that subject. If you asked me who was competing, I would not know. And even if I tried to answer you, I would give you a weak response. 

So, to provide you with a quality response, I will do what everybody else does: search for the information online, obtain it, and then give it to you. What I just did is the same idea as the RAG: I obtained data from an external database to give you an answer.

When we enhance the LLM with a content store where it can go and retrieve data to augment (increase) its knowledge base, that is the RAG framework in action.

RAG is like creating a content store where the model can enhance its knowledge and respond more accurately.

User prompt about Content C. LLM retrieves external content to aggregate to the answer. Image by the author.

Summarizing:

  1. Uses search algorithms to query external data sources, such as databases, knowledge bases, and web pages.
  2. Pre-processes the retrieved information.
  3. Incorporates the pre-processed information into the LLM.
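To make those three steps concrete, here is a purely illustrative sketch of the loop. The retriever and llm objects are hypothetical placeholders, not a specific library’s API; the real implementation with LangChain comes later in the Project section.

# Illustrative pseudo-implementation of the three steps above.
# `retriever` and `llm` are hypothetical placeholder objects.
def rag_answer(query, retriever, llm):
    docs = retriever.search(query)                     # 1. query an external data source
    context = "\n\n".join(doc.text for doc in docs)    # 2. pre-process the retrieved information
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm.generate(prompt)                        # 3. incorporate it into the LLM call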

Why use RAG?

Now that we know what the RAG framework is, let’s understand why we should be using it.

Here are some of the benefits:

  • Enhances factual accuracy by referencing real data.
  • RAG can help LLMs process and consolidate knowledge to create more relevant answers.
  • RAG can help LLMs access additional knowledge bases, such as internal organizational data.
  • RAG can help LLMs create more accurate domain-specific content.
  • RAG can help reduce knowledge gaps and AI hallucination.

As previously explained, I like to say that with the RAG framework, we are giving the LLM an internal search engine for the content we want to add to its knowledge base.

Well. All of that is very interesting. But let’s see an application of RAG. We will learn how to create an AI-powered PDF Reader Assistant.

Project

This is an application that allows users to upload a PDF document and ask questions about its content using AI-powered natural language processing (NLP) tools. 

  • The app uses Streamlit as the front end.
  • In the backend, it uses LangChain, an OpenAI GPT model (gpt-4o-mini in the code), and FAISS (Facebook AI Similarity Search) for document retrieval and question answering.

Let’s break down the steps for better understanding:

  1. Loading a PDF file and splitting it into chunks of text.
    1. This makes the data optimized for retrieval
  2. Present the chunks to an embedding tool.
    1. Embeddings are numerical vector representations of data used to capture relationships, similarities, and meanings in a way that machines can understand. They are widely used in Natural Language Processing (NLP), recommender systems, and search engines. A tiny numerical illustration follows this list.
  3. Next, we put those chunks of text and embeddings in the same DB for retrieval.
  4. Finally, we make it available to the LLM.
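To make the embedding step (item 2 above) more tangible, here is a tiny numerical illustration of the idea that similar meanings produce similar vectors. The three-dimensional vectors are made up for this example; real embeddings have hundreds or thousands of dimensions.

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means identical direction (very similar meaning); values near 0 mean unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([0.90, 0.10, 0.05])      # made-up embedding for "cat"
kitten = np.array([0.85, 0.20, 0.05])   # made-up embedding for "kitten"
car = np.array([0.10, 0.05, 0.95])      # made-up embedding for "car"

print(cosine_similarity(cat, kitten))   # close to 1: similar meaning
print(cosine_similarity(cat, car))      # much lower: unrelated meaning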

Data preparation

Preparing a content store for the LLM will take some steps, as we just saw. So, let’s start by creating a function that can load a file and split it into text chunks for efficient retrieval.

# Imports
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_document(pdf):
    # Load a PDF
    """
    Load a PDF and split it into chunks for efficient retrieval.

    :param pdf: PDF file to load
    :return: List of chunks of text
    """

    loader = PyPDFLoader(pdf)
    docs = loader.load()

    # Instantiate Text Splitter with chunk size of 500 characters and overlap of 100 characters so that context is not lost
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    # Split into chunks for efficient retrieval
    chunks = text_splitter.split_documents(docs)

    # Return
    return chunks

Next, we will start building our Streamlit app, and we’ll use that function in the next script.

Web application

We will begin importing the necessary modules in Python. Most of those will come from the langchain packages.

FAISS is used for document retrieval; OpenAIEmbeddings transforms the text chunks into numerical vectors so similarity can be calculated; ChatOpenAI enables us to interact with the OpenAI API; create_retrieval_chain is what actually performs the RAG step, retrieving data and augmenting the LLM with it; create_stuff_documents_chain glues the model and the ChatPromptTemplate together.

Note: You will need to generate an OpenAI key to be able to run this script. If you have just created your account, you get some free credits. But if you have had it for some time, you may have to add 5 dollars in credits to be able to access OpenAI’s API. An alternative is using Hugging Face’s embeddings.

# Imports
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.chains import create_retrieval_chain
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from scripts.secret import OPENAI_KEY
from scripts.document_loader import load_document
import streamlit as st

This first code snippet will create the App title, create a box for file upload, and prepare the file to be added to the load_document() function.

# Create a Streamlit app
st.title("AI-Powered Document Q&A")

# Load document to streamlit
uploaded_file = st.file_uploader("Upload a PDF file", type="pdf")

# If a file is uploaded, create the TextSplitter and vector database
if uploaded_file :

    # Code to work around document loader from Streamlit and make it readable by langchain
    temp_file = "./temp.pdf"
    with open(temp_file, "wb") as file:
        file.write(uploaded_file.getvalue())
        file_name = uploaded_file.name

    # Load document and split it into chunks for efficient retrieval.
    chunks = load_document(temp_file)

    # Message user that document is being processed with time emoji
    st.write("Processing document... :watch:")

Machines understand numbers better than text, so in the end, we will have to provide the model with a database of numbers that it can compare and check for similarity when performing a query. That’s where the embeddings will be useful to create the vector_db, in this next piece of code.

# Generate embeddings
    # Embeddings are numerical vector representations of data, typically used to capture relationships, similarities,
    # and meanings in a way that machines can understand. They are widely used in Natural Language Processing (NLP),
    # recommender systems, and search engines.
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_KEY,
                                  model="text-embedding-ada-002")

    # Can also use HuggingFaceEmbeddings
    # from langchain_huggingface.embeddings import HuggingFaceEmbeddings
    # embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # Create vector database containing chunks and embeddings
    vector_db = FAISS.from_documents(chunks, embeddings)

Next, we create a retriever object to navigate in the vector_db.

# Create a document retriever
    retriever = vector_db.as_retriever()
    llm = ChatOpenAI(model_name="gpt-4o-mini", openai_api_key=OPENAI_KEY)

Then, we will create the system_prompt, which is a set of instructions to the LLM on how to answer, and we will create a prompt template, preparing it to be added to the model once we get the input from the user.

# Create a system prompt
    # It sets the overall context for the model.
    # It influences tone, style, and focus before user interaction starts.
    # Unlike user inputs, a system prompt is not visible to the end user.

    system_prompt = (
        "You are a helpful assistant. Use the given context to answer the question."
        "If you don't know the answer, say you don't know. "
        "{context}"
    )

    # Create a prompt Template
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            ("human", "{input}"),
        ]
    )

    # Create a chain
    # It creates a StuffDocumentsChain, which takes multiple documents (text data) and "stuffs" them together before passing them to the LLM for processing.

    question_answer_chain = create_stuff_documents_chain(llm, prompt)

Moving on, we create the core of the RAG framework, pasting together the retriever object and the question-answer chain. This object fetches relevant documents from a data source (e.g., a vector database) and makes them ready to be processed by the LLM to generate a response.

# Creates the RAG
    chain = create_retrieval_chain(retriever, question_answer_chain)

Finally, we create the variable question for the user input. If this question box is filled with a query, we pass it to the chain, which calls the LLM to process and return the response, which will be printed on the app’s screen.

# Streamlit input for question
    question = st.text_input("Ask a question about the document:")
    if question:
        # Answer
        response = chain.invoke({"input": question})['answer']
        st.write(response)

Here is a screenshot of the result.

Screenshot of the final app. Image by the author.

And this is a GIF for you to see the File Reader Ai Assistant in action!

File Reader AI Assistant in action. Image by the author.

Before you go

In this project, we learned what the RAG framework is and how it helps the LLM perform better, especially on questions that require specific knowledge.

An LLM can be powered with knowledge from an instruction manual, a company’s databases, finance files, or contracts, and then respond accurately to domain-specific content queries. The knowledge base is augmented with a content store.

To recap, this is how the framework works:

1️⃣ User Query → Input text is received.

2️⃣ Retrieve Relevant Documents → Searches a knowledge base (e.g., a database, vector store).

3️⃣ Augment Context → Retrieved documents are added to the input.

4️⃣ Generate Response → An LLM processes the combined input and produces an answer.

GitHub repository

https://github.com/gurezende/Basic-Rag

About me

If you liked this content and want to learn more about my work, here is my website, where you can also find all my contacts.

https://gustavorsantos.me

References

https://cloud.google.com/use-cases/retrieval-augmented-generation

https://www.ibm.com/think/topics/retrieval-augmented-generation

https://youtu.be/T-D1OfcDW1M?si=G0UWfH5-wZnMu0nw

https://python.langchain.com/docs/introduction

https://www.geeksforgeeks.org/how-to-get-your-own-openai-api-key

LightGBM: The Fastest Option of Gradient Boosting
https://towardsdatascience.com/lightgbm-the-fastest-option-of-gradient-boosting-1fb0c40948a3/
Learn how to implement a fast and effective Gradient Boosting model using Python

LightGBM is a faster option | Image generated by AI. Meta Llama, 2025. https://meta.ai

Introduction

When we talk about Gradient Boosting Models [GBM], we often also hear about Kaggle. This algorithm is very powerful, offering many tuning arguments and leading to very high accuracy metrics, helping people win competitions on that platform.

However, we are here to talk about real life. Or at least an implementation that we can apply to problems faced by companies.

Gradient Boosting is an algorithm that creates many models in sequence, always modeling on top of the error of the previous iteration and following a learning rate determined by the data scientist, until it reaches a plateau, becoming unable to improve the evaluation metric anymore.

Gradient Boosting algorithm creates sequential models trying to decrease the previous iteration’s error.
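To make that sequential idea concrete, here is a conceptual sketch of the core loop for a regression problem. This is not LightGBM’s internal code, just an illustration of fitting each new model on the residuals of the previous round.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boosting(X, y, n_rounds=100, learning_rate=0.1):
    prediction = np.full(len(y), y.mean())    # start from a constant model
    trees = []
    for _ in range(n_rounds):                 # each round waits for the previous one to finish
        residuals = y - prediction            # the error of the previous iteration
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees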

The downside of GBMs is also what makes them so effective. The sequential construction.

If each new iteration is in sequence, the algorithm must wait for the completion of one iteration before it can start another, increasing the training time of the model. Furthermore, as the data increases in size, so does the time cost, becoming a problem when dealing with larger datasets.

LightGBM comes to solve this problem. The package offers a lighter implementation of the algorithm that focuses on:

  • Faster training speed and higher efficiency.
  • Lower memory usage.
  • Better accuracy.
  • Support of parallel, distributed, and GPU learning.
  • Capable of handling large-scale data.

Let’s see how to train a model using LightGBM in Python.

Implementation

LightGBM was first released in 2016. Currently, it has packages for R and Python. In Python, specifically, it also offers a scikit-learn-compatible API.

Here we will use the lightgbm Python package.

We are also using these libraries.

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from ucimlrepo import fetch_ucirepo
import pandas as pd

Dataset

The dataset to be used in this exercise is from the UCI Machine Learning Repository: PhiUSIIL Phishing URL (Website). The dataset is open under the Creative Commons license.

This data is very recent, donated to the UCI repository in 2024, and it shows many variables with information about websites. Some features are extracted from the source code of the webpage and URL. The label classifies the website as legit (1) or a phishing website that is not legit (0).

Credits:

Prasad, A. & Chandra, S. (2024). PhiUSIIL Phishing URL (Website). UCI Machine Learning Repository. https://doi.org/10.1016/j.cose.2023.103545.

# fetch dataset
phishing_url = fetch_ucirepo(id=967)

# data (as pandas dataframes)
X = phishing_url.data.features
y = phishing_url.data.targets

# Pandas Dataframe
df = pd.concat([X, y], axis=1)

The data is mostly numerical, as integer variables. The exceptions are URL, Domain, TLD, and Title.

Code

In this tutorial, for the sake of time and scope of the article, we will focus on the implementation of a simple LightGBM model.

So, we are not interested in exploring the data and getting insights from it, although I would encourage you to do so if this subject interests you. The dataset is very rich in information.

First, let’s check for shape, missing data, and data types.

# info check for missing and data types
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235795 entries, 0 to 235794
Data columns (total 55 columns):
 #   Column                      Non-Null Count   Dtype   
---  ------                      --------------   -----   
 0   URL                         235795 non-null  object  
 1   URLLength                   235795 non-null  int64   
 2   Domain                      235795 non-null  object  
 3   DomainLength                235795 non-null  int64   
 4   IsDomainIP                  235795 non-null  int64   
 5   TLD                         235795 non-null  object
 6   URLSimilarityIndex          235795 non-null  float64 
 7   CharContinuationRate        235795 non-null  float64 
 8   TLDLegitimateProb           235795 non-null  float64 
 9   URLCharProb                 235795 non-null  float64 
 10  TLDLength                   235795 non-null  int64   
 11  NoOfSubDomain               235795 non-null  int64   
 12  HasObfuscation              235795 non-null  int64   
 13  NoOfObfuscatedChar          235795 non-null  int64   
 14  ObfuscationRatio            235795 non-null  float64 
 15  NoOfLettersInURL            235795 non-null  int64   
 16  LetterRatioInURL            235795 non-null  float64 
 17  NoOfDegitsInURL             235795 non-null  int64   
 18  DegitRatioInURL             235795 non-null  float64 
 19  NoOfEqualsInURL             235795 non-null  int64   
 20  NoOfQMarkInURL              235795 non-null  int64   
 21  NoOfAmpersandInURL          235795 non-null  int64   
 22  NoOfOtherSpecialCharsInURL  235795 non-null  int64   
 23  SpacialCharRatioInURL       235795 non-null  float64 
 24  IsHTTPS                     235795 non-null  int64   
 25  LineOfCode                  235795 non-null  int64   
 26  LargestLineLength           235795 non-null  int64   
 27  HasTitle                    235795 non-null  int64   
 28  Title                       235795 non-null  object  
 29  DomainTitleMatchScore       235795 non-null  float64 
 30  URLTitleMatchScore          235795 non-null  float64 
 31  HasFavicon                  235795 non-null  int64   
 32  Robots                      235795 non-null  int64   
 33  IsResponsive                235795 non-null  int64   
 34  NoOfURLRedirect             235795 non-null  int64   
 35  NoOfSelfRedirect            235795 non-null  int64   
 36  HasDescription              235795 non-null  int64   
 37  NoOfPopup                   235795 non-null  int64   
 38  NoOfiFrame                  235795 non-null  int64   
 39  HasExternalFormSubmit       235795 non-null  int64   
 40  HasSocialNet                235795 non-null  int64   
 41  HasSubmitButton             235795 non-null  int64   
 42  HasHiddenFields             235795 non-null  int64   
 43  HasPasswordField            235795 non-null  int64   
 44  Bank                        235795 non-null  int64   
 45  Pay                         235795 non-null  int64   
 46  Crypto                      235795 non-null  int64   
 47  HasCopyrightInfo            235795 non-null  int64   
 48  NoOfImage                   235795 non-null  int64   
 49  NoOfCSS                     235795 non-null  int64   
 50  NoOfJS                      235795 non-null  int64   
 51  NoOfSelfRef                 235795 non-null  int64   
 52  NoOfEmptyRef                235795 non-null  int64   
 53  NoOfExternalRef             235795 non-null  int64   
 54  label                       235795 non-null  int64   
dtypes: category(1), float64(10), int64(41), object(3)
memory usage: 97.6+ MB

Ok, great. No missing values. And like we said, only those 4 categorical variables. As we know, LightGBM can deal with categories natively, but you should transform the data type from object to category. This can be easily done with the next code snippet.

# Variable TLD to category
df['TLD'] = df.TLD.astype('category')

Let us also check how balanced is this data.

df['label'].value_counts(normalize=True)

label  proportion
  1     0.571895
  0     0.428105

It is fairly well-balanced.

Now, this algorithm is very powerful, so for this dataset, if we choose all the variables (or even just a few of the best ones), the model will easily overfit. So I just picked a handful of random variables, including the categorical TLD, to train the model.

Next, we choose some variables and separate train and test sets.

# Selected columns
cols = ['TLD','LineOfCode','Pay', 'Robots', 'Bank', 'IsDomainIP']

# X & Y
X = df[cols]
y = df['label']

# Split Train and Validation
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42)

We will use the following parameters.

  • 'force_col_wise': True: use this when the number of columns is large, or the total number of bins is large, or to reduce memory cost.
  • 'categorical_feature': 'TLD': indicate which column is categorical for built-in encoding.
  • 'objective': 'binary': as our label has only two classes. Use multiclass for more classes.
  • 'metric': 'auc': the metric for model evaluation.
  • 'learning_rate': 1: rate for learning at each iteration.
  • 'is_unbalance': False: if the class is unbalanced, use True for auto-balancing.

And finally train the model.

# Train LightGBM with imbalance handling
train_data = lgb.Dataset(X_train, label=y_train)
params = {
    'force_col_wise': True,
    'categorical_feature': 'TLD',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 1,
    'is_unbalance': False
}

# Fit model
model = lgb.train(params, train_data, num_boost_round=100)

# Predictions and evaluation
y_pred = (model.predict(X_test) > 0.5).astype(int)
print(classification_report(y_test, y_pred))

The resulting classification report is as follows.

               precision    recall  f1-score   support

           0       0.97      0.96      0.97     20124
           1       0.97      0.98      0.97     27035

    accuracy                           0.97     47159
   macro avg       0.97      0.97      0.97     47159
weighted avg       0.97      0.97      0.97     47159

Wow. With just a couple of variables, the model performed tremendously well on the validation set.

Now let’s compare the processing times of LightGBM implementation against a regular GBM from Scikit-Learn.

Comparison

Comparing models | Image generated by AI. Meta Llama, 2025. https://meta.ai

LightGBM

First, LightGBM is trained on a generated classification dataset with 1,000,000 observations.

# Generate a dataset
X, y = make_classification(n_samples=1_000_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train LightGBM with imbalance handling
train_data = lgb.Dataset(X_train, label=y_train)
params = {
    'force_col_wise': True,
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'learning_rate': 0.05,
    'is_unbalance': True  # Handle class imbalance
}
model = lgb.train(params, train_data, num_boost_round=100)

# Predictions and evaluation
y_pred = (model.predict(X_test) > 0.5).astype(int)
print(classification_report(y_test, y_pred))

----------------------------OUT-------------------------------
              precision    recall  f1-score   support

           0       0.98      0.98      0.98     99942
           1       0.98      0.98      0.98    100058

    accuracy                           0.98    200000
   macro avg       0.98      0.98      0.98    200000
weighted avg       0.98      0.98      0.98    200000

The result is 9.73 seconds for the train and prediction to run.
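The article does not show the timing code, so here is one simple way to reproduce the measurement yourself (an assumed approach, wrapping the same training and prediction calls from the snippet above):

import time

start = time.perf_counter()
model = lgb.train(params, train_data, num_boost_round=100)
y_pred = (model.predict(X_test) > 0.5).astype(int)
print(f"LightGBM train + predict: {time.perf_counter() - start:.2f} seconds")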

Gradient Boosting Classifier from Scikit-Learn

Now, let’s train the same model with the GBM implementation from sklearn.

#import gradient boosting from sklearn
from sklearn.ensemble import GradientBoostingClassifier

# Generate a dataset
X, y = make_classification(n_samples=1_000_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model2 = GradientBoostingClassifier(n_estimators=100, learning_rate=0.05, random_state=42)
model2.fit(X_train, y_train)

# Predictions and evaluation
y_pred = model2.predict(X_test)
print(classification_report(y_test, y_pred))

----------------------------OUT-------------------------------
              precision    recall  f1-score   support

           0       0.97      0.99      0.98     99942
           1       0.99      0.97      0.98    100058

    accuracy                           0.98    200000
   macro avg       0.98      0.98      0.98    200000
weighted avg       0.98      0.98      0.98    200000

It took 15 minutes for fitting and predicting with this algorithm. And the data is not even that big.

Before You Go

This was a quick and simple tutorial on how to train a model using the LightGBM package in Python.

We also learned that this implementation of the algorithm is much faster than others, becoming a great option to create powerful classification or regression models trained on large datasets with simple code.

The API documentation is well organized and complete, helping data scientists to quickly find arguments to fine-tune their models.

Follow Me

If you liked this content, follow me for more.

Gustavo R Santos – Medium

Gustavo R Santos

GitHub

Here is the repository with the entire code of this exercise.

Studying/Python/LightGBM at master · gurezende/Studying

References

Welcome to LightGBM’s documentation! – LightGBM 4.5.0.99 documentation

UCI Machine Learning Repository

GradientBoostingClassifier

How Bias and Variance Affect Your Model
https://towardsdatascience.com/how-bias-and-variance-affect-your-model-a03b1c3dd6d1/
Learn the concepts and the practice. How a model behaves in each case.


Introduction

Ever since I started migrating to Data Science, I have heard about the famous Bias versus Variance tradeoff.

But I learned it just enough to move on with my studies and never looked back too much. I always knew that a highly biased model underfits the data, while a high-variance model overfits it, and that neither of those is good when training an ML model.

I also know that we should look for a balance between both states, so we’ll have a good fit: a model that generalizes the pattern well to new data.

But I might say I never went farther than that. I never went looking for or built highly biased or high-variance models just to see what they actually do to the data and what their predictions look like.

That is until today, of course, because this is exactly what we’re doing in this post. Let’s proceed with some definitions.

High Bias

Oversimplifying = Hit anything with the hammer | Google Gemini, 2024. https://gemini.google.com

Bias is the difference between the prediction of our model and the true value that we are trying to predict. A highly biased model underfits the data, failing to capture underlying patterns.

Models with High Bias perform poorly on both training and test data.

  • Model Behavior: The model oversimplifies and misses the correct relationship.
  • Analogy: Think about this problem like a student who oversimplifies topics, skips important details, and gets consistently low scores everywhere.
  • Example: Misses key information, like predicting house prices using only the number of bedrooms and ignoring location.
  • Example of Model Prediction: If the true pattern is to add 1 for every unit increase, a high bias model might always add 0.5, or worse, not add anything at all, completely ignoring the input.

Let’s see an example with code.

# Imports
import numpy as np
import matplotlib.pyplot as plt

# Generating high bias data
x = np.linspace(0, 10, 50)
y = 2 * np.sin(x)

# Plot
plt.scatter(x, y, label="True Nonlinear Data")
plt.plot(x, 0.5 * x, color='g', label="High Bias Linear Model")
plt.legend()
plt.show()

This is what a highly biased model looks like. We are trying to fit a linear regression to non-linear data. We’re oversimplifying the pattern, trying to make a line fit to a curve.

High Bias model: trying to fit a curve with a line

If we go one step ahead, create a Linear Regression model, and then make some predictions, we will see what happens.

from sklearn.linear_model import LinearRegression
import pandas as pd

# Fit
lm = LinearRegression().fit(x.reshape(-1, 1), y)
# Predict
preds = lm.predict(x.reshape(-1, 1))

# Performance
(pd
 .DataFrame({'Actual': y, 
             'Predicted': preds, 
             'Difference %': (y-preds)/y})
 .head(10)
 .T
 )

Look how the predictions are way off. The model can’t recognize the pattern in this data.

Predictions of a highly biased model.

Now it is time to understand models with high variance.

High Variance

Unpredictable outcomes based on minor fluctuations in the data | Google Gemini, 2024.

Variance is the variability of the model’s predictions for different training datasets. It measures how much the predictions change if the training data changes. A high-variance model overfits the training data, capturing noise instead of general patterns.

A model with high variance performs well on training data but poorly on unseen data.

  • Model Behavior: The model is too sensitive to the training data and overfits. High variance indicates the model is too complex and overfits the data.
  • Analogy: Like a student who memorizes answers without understanding concepts, performing great in practice but poorly in real-world tests.
  • Example: Captures too much noise, like trying to predict the stock market but mistaking random fluctuations for trends.
  • Example of Model Prediction: If the true pattern is to add 1 for every unit increase, a high variance model might learn to add 100, or even unpredictable values based on minor fluctuations in the data.

Let’s code an example now.

# Imports
import numpy as np
import matplotlib.pyplot as plt

# Generating high variance data
x = np.linspace(0, 10, 50)
y = 2 * np.sin(x) + np.random.normal(0, 2, len(x))  # Adding noise

# Plot
plt.scatter(x, y, label="High Variance Data")
plt.plot(x, 2 * np.sin(x), color='r', label="True Pattern")
plt.legend()
plt.show()

The data generated for the high-variance scenario is very noisy. Notice how it fluctuates significantly around the true curve, making it challenging for a model to learn the correct relationship without overfitting.

High Variance model: noisy data fluctuates significantly around the true curve.

Let’s create a Decision Tree model, which is very well-known for overfitting, and assess the behavior of a high-variance model.

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Generate high variance data
np.random.seed(42)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y_true = 2 * np.sin(x).ravel()
y_high_variance = y_true + np.random.normal(0, 3, len(x))  # Adding noise

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y_high_variance, 
                                                    test_size=0.2,
                                                    random_state=42)

# Fit a decision tree regressor (high variance model due to overfitting)
tree_high_variance = DecisionTreeRegressor(max_depth=None, random_state=42)  # No depth limit increases variance
tree_high_variance.fit(x_train, y_train)

# Predict using the model
x_range = np.linspace(0, 10, 300).reshape(-1, 1)  # Fine range for smooth predictions
y_pred = tree_high_variance.predict(x_range)

# Plot the data and model predictions
plt.figure(figsize=(10, 6))
plt.scatter(x_train, y_train, color='blue', label='Training Data', alpha=0.7)
plt.scatter(x_test, y_test, color='green', label='Test Data', alpha=0.7, marker='x')
plt.plot(x_range, y_pred, color='orange', label='High Variance Model Predictions', linewidth=2)
plt.plot(x, y_true, color='red', label='True Pattern', linewidth=2, linestyle='--')
plt.title("High Variance Model with Decision Tree")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()

The outcome is displayed next.

High-variance model: overfitted to the training data; poor job on test data.

We see in the previous picture that the model is overfitted to the training data, but it does not do well on the test data. Let’s see the numbers.

# Predictions Train set
train_preds = tree_high_variance.predict(x_train)

(pd
 .DataFrame({'Actual': y_train,
             'Predicted': train_preds, 
             'Difference %': 100*(y_train-train_preds)/y_train})
 .head(10)
 .T
)

# Predictions test set
preds = tree_high_variance.predict(x_test)

(pd
 .DataFrame({'Actual': y_test,
             'Predicted': preds, 
             'Difference %': 100*(y_test-preds)/y_test})
 .head(10)
 .T
)

As expected, the results on the train set are completely overfitted. The model was 100% accurate.

But it couldn’t generalize the pattern, performing poorly on the test set.

Tradeoff

To solve those problems, the idea is to find the middle ground between a model that is too simple and another that is too complex.

Increasing model complexity reduces bias, giving a closer fit to the training data, but increases variance, making the model more sensitive to noise. On the other hand, simplifying the model reduces variance, making it less sensitive to noise, but increases bias, possibly missing true patterns.
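A quick way to see this tradeoff in action is to sweep a single complexity knob, such as tree depth, and watch the training error keep falling while the test error eventually gets worse again. The data and depth values below are my own illustrative choices, not from a real problem.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 300).reshape(-1, 1)
y = 2 * np.sin(x).ravel() + rng.normal(0, 1, len(x))
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=42)

# Shallow trees underfit (high bias); unlimited depth overfits (high variance)
for depth in [1, 3, 6, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(x_tr, y_tr)
    train_mse = mean_squared_error(y_tr, tree.predict(x_tr))
    test_mse = mean_squared_error(y_te, tree.predict(x_te))
    print(f"max_depth={depth}: train MSE {train_mse:.2f} | test MSE {test_mse:.2f}")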

Analogy

Think of bias as a rigid system (like a straight ruler) and variance as a flexible one (a bendable ruler).

  • A rigid ruler won’t capture curves (high bias).
  • A bendable ruler may bend too much for minor imperfections, overcomplicating the shape (high variance).

The ideal tool is one that balances rigidity and flexibility to represent the true shape (patterns in data).

In Practice

In practical terms, what can be done to balance this tradeoff?

High Bias Models

  • Using more complex algorithms
  • Cross Validation
  • Feature engineering
  • Add non-linear terms
  • Fine-tuning hyperparameters
  • Use Boosting ensemble methods (XGB, Gradient Boosting), because it reduces bias by correcting errors iteratively.

High Variance Models

  • Simplify the model
  • Cross Validation
  • Adding regularization
  • Use Bagging ensemble methods (Random Forest), because they reduce variance by averaging predictions. An example follows this list.
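To illustrate the last item, here is what bagging can look like in practice, reusing x_train, x_test, y_train, y_test and the overfitted tree_high_variance from the high-variance example earlier. Averaging many trees typically lowers the test error compared with the single deep tree, though the exact numbers depend on the noise.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Bagging: many deep trees, each seeing a different bootstrap sample, averaged together
forest = RandomForestRegressor(n_estimators=200, random_state=42).fit(x_train, y_train)

print(mean_squared_error(y_test, tree_high_variance.predict(x_test)))  # single deep tree
print(mean_squared_error(y_test, forest.predict(x_test)))              # averaged forest, typically lower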

Before You Go

This conceptual article is necessary for us to build a deeper understanding of this important topic for data science. Knowing what the Bias vs. Variance Tradeoff is and how each type affects modeling is valuable knowledge that enables us to take corrective actions accordingly.

Every modeling decision – like choosing algorithms, regularization, and hyperparameter tuning – implicitly aims to balance bias and variance to minimize total error. This balance is crucial for building robust and generalizable Machine Learning models.

Note: All images by the author, unless otherwise noted.

Follow Me & Contacts

If you liked this content, follow my blog or connect with me via social media. You can learn more about my work on my website.

Gustavo R Santos

Gustavo R Santos – Medium

References

Understanding the Bias-Variance Tradeoff

Bias-Variance Trade Off – Machine Learning – GeeksforGeeks

Synthetic Control Sample for Before and After A/B Test
https://towardsdatascience.com/synthetic-control-sample-for-before-and-after-a-b-test-683bac36ffc1/
Learn a simple way to use linear regression to create a synthetic control sample for your A/B test


Introduction

A/B Testing is very powerful. I like this kind of experiment because it gives us the power to compare outcomes and determine whether one thing is performing better than another.

A/B Testing has a specific type that adds the time component, which is the Before and After A/B Test. On that test, the comparison is between the situation of a given subject before and after an intervention.

Let us translate that previous sentence to a real-world example.

A company wants to know if an advertisement would drive a sales increase, so it shows that ad to a treatment group and compares the results to a control group that did not see the ad. The difference before and after the ad would indicate whether the intervention was effective or not.

Now, sometimes it is not possible to plan ahead and make that split of control and treatment groups before the intervention.

That is when the Synthetic Control sample becomes useful. Using some statistics and machine learning, it is possible to simulate what would have happened to a sample if the intervention hadn’t happened.

This is what we are going to learn in this post.


You can learn more about A/B testing in this post linked next.

My Easy Guide to Pre vs. Post Treatment Tests

Problem Description

We are working with a dataset from a retail chain of stores. To generate it, I used the Rossmann Stores dataset from Kaggle as my baseline to get a sense of sales distribution by store, so it would be closer to reality. Additionally, I selected control stores that are similar to the tested store. Therefore, the data was completely recreated and modified and has different store numbers and sales numbers.

The variables are:

  • Store : numbered from 1 to 10. The test store is the #10.
  • Date : daily dates from 2013 to 2015.
  • Sales : Amount of sales on that date in dollars.

Intervention:

  • A competitor opened a new store next to our Store #10, in March of 2014.
  • The control stores (1–9) don’t have a competitor around.

We intend to understand the impact caused by this competitor on Store #10’s sales to take any necessary actions and fight the competition, if applicable.

Plotting the sales trends from control and treatment stores, we get the next figure.

Control stores sales vs. Treatment Store sales. Image by the author.

Notice that the control stores keep following a steady pattern, even with slight growth. On the other hand, the treatment store’s sales declined considerably after the opening of the competitor.

Could this be related to the competition? Let’s keep digging.

Synthetic Control Sample

We just saw that there was a considerable decrease in sales for test Store #10 after the treatment event (the opening of a competitor in March 2014).

If we calculate the mean weekly sales before the intervention date, it was 68,199.28. After the intervention, it was 51,283.35. That is an average reduction of (68,199.28 − 51,283.35) / 68,199.28 ≈ 25%.

Now we could get the control stores and compare their performance against the test store like a regular Before and After A/B Test. After all, I pre-selected similar control and test stores when I was creating this data frame.

However, our intention is to compare the current performance of Store #10 with a hypothetical performance of Store #10 if no competition had been opened. To accomplish that, we need to simulate a performance without competition, thus we use the data from the control stores.

Let’s think for a minute:

  • The control stores don’t have competition around, so the assumption is that Store #10 would have a similar behavior after March 2014 without competition.
  • The Before versus After A/B Test is a comparison of two points in time of the same subject. Group A Before versus Group A after. Group B before versus Group B after.
  • We have Group B before and after, which is the test Store #10 as is, with competition.
  • We need a simulation of the behavior of Store #10 without competition opening.

So, here is what we are going to do:

  • Use Linear Regression to create a synthetic sales series for Store #10 without competition around.
  • That synthetic sample is fitted on the control stores – and that is the reason why the control stores must be similar to whatever we are willing to simulate.

Let’s do that.

Code

The libraries and modules used in this exercise.

# Data manipulation
import pandas as pd
import numpy as np

# DataViz
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Stats
import scipy.stats as scs
import pingouin as pg

# Modeling
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error

We start separating control and test stores.

# Select the control group
control = df2.query('Store != "10"')

# Treatment Group
treat = df2.query('Store == "10"')

Next, we must resample our control data from daily sales to weekly sales (sum every 7 days). Here, we can already get the series of sales by date for the test store – y_sales, given that it is a single store.

# Create dataset of control stores sales resampled from daily to weekly (7 days) sales
unit_control = control.pivot(index='Date', columns='Store', values='Sales').resample('7D').sum()

# Aggregating sales for Store #10 every 7 days
y_sales = treat.set_index('Date')['Sales'].resample('7D').sum()

Then we will filter the control stores' dates to the same dates available for the test store, aligning both datasets. If any value is NA, we fill it with the median, and then we take the mean across all the control stores, transforming the data into a series of sales indexed by date.

# Filter the control sample with only the dates contained in the treatment dataset
aligned_dates = unit_control.index.intersection(y_sales.index)

# Get the control data filtered with dates from the treatment series
X_sales = unit_control.loc[aligned_dates].fillna(unit_control.median())

# Transform the control data into a single series aggregated as the mean of sales by week
X_sales_mean = X_sales.mean(axis=1)

Here is a view of the X_sales_mean data.

Series of control stores sales indexed by date. Image by the author.

Let’s plot both series y_sales and X_sales_mean.

# Set Date of the intervention
intervention_date = pd.to_datetime("2014-03-01")

# Plot
plt.figure(figsize=(12, 6))
plt.plot(y_sales.index, y_sales, label="Store 10", color='green', lw=2)
plt.plot(X_sales_mean.index, X_sales_mean, label="No Comp Stores", color='gray', linestyle='--')
plt.axvline(x=intervention_date, color="red", linestyle=":", lw=2, label="Opened Competitor for Store 10 ")
plt.xlabel("D A T E")
plt.ylabel("S A L E S")
plt.title("Sales GAP Store 10 vs Stores W/o Competition")
plt.legend()
plt.show()
STORE #10 versus the mean of the control stores (no competition). Image by the author.

We can observe that the results of Store #10 are consistently higher than the average of the control stores before the competition. But once the competitor opened, sales fell consistently below the average of the control stores.

Now we should fit the regression model to create the synthetic Store #10 without competition.

  • X: The sales of the control stores will be the predictor variables
  • y : The sales of the test store will be the target.
  • X_before and y_before are the training sets.
  • X_after and y_after are the test sets.
# X &amp; y sets
X_before = X_sales[:intervention_date]
X_after = X_sales[intervention_date:]
y_before = y_sales[:intervention_date]
y_after = y_sales[intervention_date:]

# Linear Regression fit
lm = LinearRegression(fit_intercept=False)
lm.fit(X_before, y_before)

# Predictions before and after intervention
preds_before = lm.predict(X_before)
preds_after = lm.predict(X_after)

# Synthetic Control Series
synthetic_series = pd.Series(
    np.concatenate([preds_before, preds_after]),
    index= X_sales.index
)

If we plot both series again, we are comparing Store #10 with competition versus Store #10 without competition (the synthetic sample).

# Plot Synthetic (No competition) vs 10
plt.figure(figsize=(12, 6))
plt.plot(y_sales.index, y_sales, label="Store 10", color='green', lw=2)
plt.plot(synthetic_series.index, synthetic_series, label="10 W/O Competition", color='gray', linestyle='--')
plt.axvline(x=intervention_date, color="red", linestyle=":", lw=2, label="Opened Competitor for Store 10 ")
plt.xlabel("D A T E")
plt.ylabel("S  A  L  E  S")
plt.title("Comparison: Store 10 vs Synthetic 10 No Comp")
plt.legend()
plt.show()

Here’s the graphic.

STORE #10 actuals versus synthetic STORE #10 w/o competition. Image by the author.

We can see that the fit is pretty good for the initial part of the series, before the intervention date. Then the lines open a considerable gap, where the actual Store #10 underperforms its synthetic pair, leading us to conclude that the drop in performance is, in fact, due to the opening of the competitor.

To quantify this gap, let's calculate the Mean Absolute Percentage Error (MAPE) of the predictions versus the real data.

# MAPE Store 10 versus Synthetic Store 10
mape_after = mean_absolute_percentage_error(y_after, preds_after)
print(f'MAPE after intervention: {mape_after:.2f}')

--------- OUT -----------
MAPE after intervention: 0.32

The calculation points to a gap of roughly 32% between Store #10's actual sales and its synthetic (no-competition) counterpart after the opening of the competitor.

Let’s go one step further and perform a Before vs. After A/B Test using the actual values for the control and test stores, so we can check if the results are consistent.

Before and After A/B Test

A/B Test | Image generated by AI. Meta Llama, 2024. https://meta.ai

The Before and After Test compares the performance of the control stores and of the test store pre- and post-intervention, the intervention being the opening of a competitor near Store #10.

Let’s set the intervention date for the test.

# Intervention date
intervention_date = pd.to_datetime("2014-03-01")

Next, we must resample our data from daily sales to sales aggregated by 7 days.

df_sales7d = (
    df2
    .pivot(index='Date', columns='Store', values='Sales') #pivot for resampling
    .resample('7D') #aggregate sales by 7 days
    .sum() #sum sales
    .reset_index() #reset to make Date as column again
    .melt(id_vars='Date', var_name='Store', value_name='Sales') # unpivot
    .assign(after= lambda x: np.select([x.Date <= intervention_date, x.Date > intervention_date],
                                        [0, 1])) # add after Yes or No
    .assign(group= lambda x: np.select([x.Store == "10", x.Store != "10"],
                                        ["treatment", "control"])) # add group treatment or control
    )

# View
df_sales7d.head(2)

Here is the view of the data.

Sales aggregated 7 days. Image by the author.

Next, we calculate the averages and standard deviations of the samples.

# Calculating averages and standard errors
ab_means = (df_sales7d
 .groupby(['group', 'after'])
 .agg({'Sales':['mean', 'std']})
 .round(2)
 )
Avg and Std. error of the samples. Image by the author.

Moving forward, we have to define a couple of functions:

  • std_error_two_samples: This function calculates a single standard error for the two samples being compared. This will be important for the confidence interval calculation of the difference between the Before and After groups.
  • ab_test: a function to perform the A/B test and calculate the confidence interval for the difference between the groups.
  • Note: Both functions can be found in the GitHub repository, here. A simplified sketch of them is shown right after this list.
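
Here is that sketch. It assumes a two-sided z-test on the weekly sales, which is my reading of the approach, so check the repository for the exact implementations.

import numpy as np
import scipy.stats as scs

def std_error_two_samples(sample_a, sample_b):
    # Pooled standard error of the difference between two sample means
    return np.sqrt(sample_a.var(ddof=1)/len(sample_a) + sample_b.var(ddof=1)/len(sample_b))

def ab_test(data, group_col, target_col, conf=0.95):
    # Two-sided z-test for the difference in means between the two groups in group_col
    groups = sorted(data[group_col].unique())
    a = data.loc[data[group_col] == groups[0], target_col]
    b = data.loc[data[group_col] == groups[1], target_col]
    se = std_error_two_samples(a, b)
    diff = b.mean() - a.mean()
    p_value = 2 * (1 - scs.norm.cdf(abs(diff / se)))
    margin = scs.norm.ppf(1 - (1 - conf) / 2) * se
    print('The calculated standard error is', se)
    print('The difference in means Group B - A :', diff)
    print('P-Value (two tails):', p_value)
    print(f'confidence interval with {int(conf*100)}% confidence is [{diff - margin:.2f}  {diff + margin:.2f}]')

With those helpers in place, we run the test twice, once for the control group and once for the treatment group.
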
print('Control Before vs After:')
ab_test(data=  df_sales7d.query('group == "control"').rename(columns={'group':'grp','after':'group'}),
        group_col= 'group',
        target_col = 'Sales')

print('---------------------------------------------------------------------')

print('Treatment Before vs After:')
ab_test(data=  df_sales7d.query('group == "treatment"').rename(columns={'group':'grp','after':'group'}),
        group_col= 'group',
        target_col = 'Sales')

Resulting in the following output:

Control Before vs After:
The calculated standard error is 1635.206108090877
The difference in means Group B - A : 1118.971107665042
P-Value (two tails): 0.49378591381056647
confidence interval with 95% confidence is [-2085.97  4323.92]
---------------------------------------------------------------------
Treatment Before vs After:
The calculated standard error is 1695.1522766982898
The difference in means Group B - A : -16915.930695613657
P-Value (two tails): 0.0
confidence interval with 95% confidence is [-20238.37 -13593.49]

For the control samples, at a significance level of 5%, we cannot reject the null hypothesis of equal means. So the stores are statistically behaving similarly in both periods. The difference between the two sample means fluctuates between -$2,085 and $4,323 a week.

For the test Store #10, the p-value suggests that the sample means are statistically different. So that store was impacted by the competition and dropped about $17k (-25%) in sales a week, with this difference varying from -$20k (-29%) to -$13.5k (-19%), both numbers well below zero.

But, hey… I am a visual guy. I like to see the difference in a graphic.

plt.figure(figsize=(18, 5))

# Control Samples Before vs. After
# Creating Normal Distribution of both Control samples
plt.subplot(1, 2, 1)
plot_a = np.random.normal(loc= ab_means.iloc[0,0], scale= ab_means.iloc[0,1], size=10000)
plot_b = np.random.normal(loc= ab_means.iloc[1,0], scale= ab_means.iloc[1,1], size=10000)
plot = pd.DataFrame({'group': ['Control_Before']*10000 + ['Control After']*10000,
                     'mu': np.concatenate([plot_a, plot_b])})
# Vertical lines at the Before and After sample means
plt.axvline(x=ab_means.iloc[0,0], color="royalblue", linestyle="--", lw=1)
plt.axvline(x=ab_means.iloc[1,0], color="darkorange", lw=1)
sns.kdeplot(plot, x='mu', hue='group')
plt.title('Control Before vs After');

plt.subplot(1, 2, 2)
# Treatment Samples Before vs. After
# Creating Normal Distribution of both Treatment samples
plot_a = np.random.normal(loc= ab_means.iloc[2,0], scale= ab_means.iloc[2,1], size=10000)
plot_b = np.random.normal(loc= ab_means.iloc[3,0], scale= ab_means.iloc[3,1], size=10000)
plot = pd.DataFrame({'group': ['Treatment Before']*10000 + ['Treatment After']*10000,
                     'mu': np.concatenate([plot_a, plot_b])})
# Vertical lines at the Before and After sample means
plt.axvline(x=ab_means.iloc[2,0], color="royalblue", linestyle="--", lw=1)
plt.axvline(x=ab_means.iloc[3,0], color="darkorange", lw=1)
sns.kdeplot(plot, x='mu', hue='group')
plt.title('Treatment Before vs After');

And here it is.

Before vs After A/B Test | Control vs Treatment samples. Image by the author.

I believe we have eliminated any doubts about the impact on sales caused by the new competitor.

Before You Go

Causal Inference has been growing lately. It is more and more common to see new content about the topic. After all, as it is said, "correlation does not imply causation", so when we are asked by a client or the leadership at work what is the impact (or cause) of some effect, we must have tools to assess that.

Some nice Python packages, like DoWhy and CausalImpact, are out there to help us infer causality from the results.

The exercise we performed in this post is very similar to what the package causalimpact does. If we run some code on it, the result should look familiar to you now.

# !pip install causalimpact --quiet
from causalimpact import CausalImpact

# Create dataset with mean of the control stores (x) and the test store (y)
df4 = (pd.concat([X_sales_mean.round(2), y_sales], axis=1)
       .rename(columns={0: 'x', 'Sales': 'y'})
       .reindex(columns=['y', 'x'])
       )

# Run Causal Impact
impact = CausalImpact(data=df4,
                      pre_period=[pd.to_datetime("2013-01-01"), pd.to_datetime("2014-03-04")],
                      post_period=[pd.to_datetime("2014-03-11"), pd.to_datetime("2015-07-28")])

impact.run()

# Plot result
impact.plot()

# Print Summary
impact.summary()

# Print Report
impact.summary(output="report")

This is the result. Very similar to our previous comparison.

Causal Impact library output. Image by the author.

I recommend you take a look, but be aware of some conflicts with newer versions of Pandas.


That’s it. If you liked this content, follow me for more.

Follow Me | Blog & Website

Gustavo R Santos – Medium

Gustavo R Santos

Complete code GitHub

Before-and-After-Testing/Python/Causal Inference at main · gurezende/Before-and-After-Testing

References

Controle Sintético: Como Aplicar Inferência Causal na Prática

4 Python Packages to Learn Causal Analysis

Notebook on nbviewer

The post Synthetic Control Sample for Before and After A/B Test appeared first on Towards Data Science.

]]>
Documenting Python Projects with MkDocs https://towardsdatascience.com/documenting-python-projects-with-mkdocs-60e26b64380e/ Fri, 22 Nov 2024 18:58:11 +0000 https://towardsdatascience.com/documenting-python-projects-with-mkdocs-60e26b64380e/ Use Markdown to quickly create a beautiful documentation page for your projects

The post Documenting Python Projects with MkDocs appeared first on Towards Data Science.

]]>

Introduction

Project Documentation is necessary. Very necessary, I would emphasize.

At the beginning of my career, I learned the hard way that a project must be documented.

Let’s go back in time – to the 2000s – when I was working as a Customer Representative for large US companies. I was part of a team and my colleagues and I had joined the company around the same month. So, for a while, there was no need to worry because nobody was going on vacation just a few weeks or months after starting a new job.

However, after some time, it inevitably would happen. And we were all assigned to back up each other. That is when documentation started to play a major part in my career.

The day the first person took a few days off, I panicked! I got to work and I didn’t know what to do or even where to start. The tasks kept coming and piling up while I was trying to figure out how to process them.

In the end, everything turned out well. I was able to figure it out and move on. But from that day on, I knew that documentation needed to be in place for any time off or team movement, like promotions or offboardings.

In this post, we will learn how to create simple (and effective) project documentation using MkDocs in Python. The final result will look similar to the MkDocs documentation itself.

Building The Documentation

mkdocs is a module in Python that allows us to create simple web pages using Markdown language. The benefit is that it is highly customizable, and gives your documentation a professional look, besides easily integrating with GitHub.

Additionally, mkdocs leverages the Markdown notation language, which is very simple to use, being just plain text with the addition of a couple of symbols to mark titles, subtitles, bullet points, italics, bold, etc. To illustrate, Medium uses Markdown for blogging.

Markdown is a lightweight markup language for creating web formatted text using a plain-text editor.

Preparation

I believe that the best time to create the documentation is once we finish the project. At that point, we already know which modules were used, how it was deployed, and how the project can be started and maintained. So, it is time to document those steps for the users.

When documenting something, my experience tells me to:

  • Describe it as if you were describing how to run the project to a complete layperson.
  • Try to avoid highly technical terms, and acronyms used in your company.
  • Describe each step using clear and simple language.
  • If the concept is too dense, or the task is too complex, try breaking it down into bullets.

Before starting with the documentation, let’s create a sample project real quick, using the module uv for virtual environment management. Here, I am using uv and VSCode.

  1. Open a terminal and install uv with pip install uv
  2. Create a new project named "p2": uv init p2
  3. Change the directory to access the new folder: cd p2
  4. Set the Python version for the project: pyenv local 3.12.1
  5. Create a new virtual environment: uv venv --python 3.12.1
  6. Activate the environment: .venv/Scripts/activate
  7. Add some packages: uv add pandas numpy scikit-learn streamlit
Sample project created. Image by the author.

Getting Started

Having the project created, let’s add mkdocs.

# Install mkdocs
uv add mkdocs

Next, we will create a new documentation folder.

# create a new documentation folder in your project
mkdocs new .

That command will generate a docs folder and the files needed for the documentation.

  • File mkdocs.yml: It is used to configure your documentation webpage – the title, theme, and site structure, such as adding new tabs.
  • Folder docs with the file index.md: This file is where you will write the documentation itself.
Documentation folder. Image by the author.

If we want to look at our documentation, we already can. Just use the serve command.

# Open a local server to display the docs
mkdocs serve
Local version running on port 8000. Image by the author.

Now, we can just copy and paste that address into a browser (or Ctrl + click it) to see how the documentation currently looks.

MkDocs documentation "out of the box". Image by the author.
MkDocs documentation "out of the box". Image by the author.

Customization

It is time to customize our documentation.

Let's start by changing the title of the documentation page. Open the mkdocs.yml file. You will see only the site_name entry in the default file.

mkdocs.yml default file. Image by the author.

Let’s change it.

site_name: P2 Project Documentation

We can add a new tab About with the information about the project. For that to actually work, we also need to add a markdown file about.md to the folder docs.

site_name: P2 Project Documentation
nav:
  - Home: index.md
  - About: about.md

And we can change the theme if we want to. Check the available built-in themes here, or the gallery of installable themes here.

site_name: P2 Project Documentation
nav:
  - Home: index.md
  - About: about.md
theme: mkdocs

Here is the result, so far.

MkDocs is easily customizable. Image by the author.

Next, let us start writing the documentation. This should be done in a markdown file within the folder docs.

I will write the whole example documentation to the file index.md and the project meta information will go to the file about.md.

  • File index.md

We will erase the sample text that is in there and write our documentation instead.

# P2 Project

This project is an example of how we can write a professional documentation using `mkdocs` module in Python.<br>
To learn MarkDown notation, use this [Cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Here-Cheatsheet).

---

## Python Version

This project was built using **Python 3.12.1**

---

## Modules

* mkdocs >= 1.6.1
* numpy >= 2.1.3
* pandas >= 2.2.3
* scikit-learn >= 1.5.2
* seaborn >= 0.13.2
* streamlit >= 1.40.1

---

## Quick Start

To create a documentation with MkDocs, these are the main bash commands:

* Install mkdocs: `pip install mkdocs`
* Create a new documentation folder: `mkdocs new .`
* `mkdocs.yml` is the file to customize the web page, such as creating tabs, changing titles and themes.
* The files in the folder **docs** are the ones to hold the documentation text, using MarkDown notation.
  • File about.md
# About This Project
 <br>

* **Author:** Your Name
* **Purpose:** Exemplify how to create a professional looking documentation for your projects using MarkDown notation in Python.

---

### Contact

Find me on [Linkedin](https://www.linkedin.com/in/profile)

The final result is this beautiful documentation site.

Final documentation website. Image by the author.

Adding Functions and Docstrings

MkDocs also has the ability to pull functions and their respective docstrings from the code. To do that, first add the mkdocstrings-python module using pip install mkdocstrings-python. In our case, I am using uv.

uv add mkdocstrings-python

Next, adjust the mkdocs.yml file to add the plugin. Add the following lines to the end of the file and save it.

plugins:
- mkdocstrings:
    default_handler: python

Now, let’s look at our code. In this example project, we have only the file hello.py with two functions.

Functions in the python file. Image by the author.
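
The screenshot is not reproduced here, but a hello.py with two documented functions could look like the snippet below. The names main_func and calculations match the references used next; the bodies and docstrings are just illustrative.

def main_func(name: str) -> str:
    """Return a greeting for the given name.

    Args:
        name: Person to greet.

    Returns:
        A greeting string.
    """
    return f"Hello, {name}!"

def calculations(a: float, b: float) -> float:
    """Return the sum of two numbers.

    Args:
        a: First number.
        b: Second number.

    Returns:
        The sum of a and b.
    """
    return a + b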

Adding them to the documentation is pretty simple. Use three ::: followed by the path to your file. Since this file is in the main folder of the project, we simply add the file_name.function. If it is within a folder, you can use something like folder.file.function.

### Function

These are the functions used in the code `hello.py`

::: hello.main_func
::: hello.calculations

After saving the file, we can look at the result with mkdocs serve.

Functions pulled directly from the .py file. Image by the author.

Now let’s deploy the docs.

Deploying

Deploying our documentation page is simple.

First, we must create a GitHub page for the project, if we already don’t have one.

Next, go back to the IDE terminal and we will build our page with the next command. This command will create the folders and files necessary to deploy the documentation website.

mkdocs build

[OUT]:
INFO  -  Cleaning site directory
INFO  -  Building documentation to directory: C:MyDocumentstestesp2site
INFO  -  Documentation built in 0.06 seconds

Now, we need to add the GitHub repository to the mkdocs.yml file, so the module knows where to deploy the documentation.

Adding the repository name to the mkdocs.yml. Image by the author.

Then we open a Git Bash Terminal to initialize Git and commit.

# Initialize Git
git init

# Add the reference repository
git remote add origin https://github.com/gurezende/MkDocs-Example.git

# Add files from the project
git add .

# Commit the files
git commit -m "Project Code and documentation"

# Create Branch Main
git branch -M main

# Push files to GitHub
git push -u origin main

And then we can deploy the documentation with the following command in a PowerShell terminal.

mkdocs gh-deploy

## Output ##
INFO    -  Cleaning site directory
INFO    -  Building documentation to directory: C:MyDocumentstestesp2site
INFO    -  Documentation built in 0.08 seconds
WARNING -  Version check skipped: No version specified in previous deployment.
INFO    -  Copying 'C:MyDocumentstestesp2site' to 'gh-pages' branch and pushing to GitHub.
Enumerating objects: 39, done.
Counting objects: 100% (39/39), done.
Delta compression using up to 12 threads
Compressing objects: 100% (37/37), done.
Writing objects: 100% (39/39), 829.12 KiB | 11.68 MiB/s, done.
Total 39 (delta 2), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (2/2), done.
remote: 
remote: Create a pull request for 'gh-pages' on GitHub by visiting:
remote:      https://github.com/gurezende/MkDocs-Example/pull/new/gh-pages
remote: 
To https://github.com/gurezende/MkDocs-Example.git
 * [new branch]      gh-pages -> gh-pages
INFO - Your documentation should shortly be available at: https://gurezende.github.io/MkDocs-Example/

Notice that on the last line, we have the URL where the documentation was deployed. This address can be added to your GitHub readme file.

P2 Project

After deployment, we just need to push the updates again to update GitHub, using the following commands on a Git Bash terminal.

git add .
git commit -m "Online documentation added"
git push origin main

That’s it! The documentation is live!

From now on, every time we update the markdown files in our project and run mkdocs gh-deploy, the web page is rebuilt and our documentation stays up to date. As easy as that!

GitHub page of the project. Image by the author.

Before You Go

Documenting your projects is important.

After all, nobody knows what was in your head when you developed something. Therefore, documenting is like showing your line of thought, the steps used to reach an end.

Open a window in your mind to show other people how you created that product and how to use it.

MkDocs makes it so easy and looks super professional. I am sure it will help a lot in documenting your projects at work, helping fellow colleagues navigate your code, and positively impressing anyone who looks at your portfolio from now on.

If you liked this content, follow me for more.

Gustavo R Santos

GitHub Repository

Here is the GitHub Repository for this article.

GitHub – gurezende/MkDocs-Example: Creating documentation for Python Projects with mkdocs

Learn More

If you want to see this content in video, here’s a product from my Gumroad page.

Create Stunning Documentation with Python and MkDocs

References

MkDocs

Usage – mkdocstrings-python

PYTHON – Project Documentation with MkDocs and Python

Markdown Here Cheatsheet

Choosing Your Theme – MkDocs

Gallery

The post Documenting Python Projects with MkDocs appeared first on Towards Data Science.

]]>
How I Created a Data Science Project Following CRISP-DM Lifecycle https://towardsdatascience.com/how-i-created-a-data-science-project-following-a-crisp-dm-lifecycle-8c0f5f89bba1/ Wed, 13 Nov 2024 16:02:02 +0000 https://towardsdatascience.com/how-i-created-a-data-science-project-following-a-crisp-dm-lifecycle-8c0f5f89bba1/ An end-to-end project using the CRISP-DM framework

The post How I Created a Data Science Project Following CRISP-DM Lifecycle appeared first on Towards Data Science.

]]>

Introduction

CRISP-DM stands for Cross-Industry Standard Process for Data Mining, a data mining framework open to anyone who wants to use it.

Its first version was created by SPSS, Daimler-Benz, and NCR. Then, a group of companies developed and evolved it into CRISP-DM, which nowadays is one of the best-known and most widely adopted frameworks in Data Science.

The process consists of 6 phases, and it is flexible. It is more like a living organism where you can (and probably should) go back and forth between the phases, iterating and enhancing the results.

The phases are:

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment

The small arrows show a natural path from Business Understanding to Deployment—where the interactions occur directly—while the circle denotes a cyclic relationship between the phases. This means that the project does not end with Deployment but can be restarted due to new business questions triggered by the project or adjustments potentially needed.

CRISP-DM. Credits: Wikipedia

In this post, we will follow a project throughout its lifecycle using CRISP-DM steps. Our main objective is to show how using this framework is beneficial to the data scientist and to the company.

Let’s dive in.

Project

Let’s go over a project following the CRISP-DM framework.

In summary, our project is to create a classification model to estimate the probability of a customer subscribing to a term deposit at our client's institution, a bank.

Here is the GitHub Repository with the code, if you want to code along or follow it while reading the article.

GitHub – gurezende/CRISP-DM-Classification: End to End Classification project using the CRISP-DM…

Business Understanding

Understanding the business is crucial for any project, not just data science projects. We must know things like:

  • What is the business?
  • What is its product?
  • What are we selling/offering?
  • What is expected for that project?
  • What is the definition of success?
  • Metrics

In this project, we are working with a Bank, therefore we are talking about the Finance Industry. Our client sells financial solutions for people to easily receive, save, and invest their money in a secure environment.

The client reached out to us to discuss direct marketing campaigns based on phone calls aiming to sell a financial product (a term deposit). However, they feel they are wasting their managers' time and effort to get the expected results, so the client wants to increase and optimize conversions by focusing effort on customers with a higher probability of converting.

Certainly, business is a complex subject. Several factors can impact the result of the campaigns, but for the sake of simplicity, we will go straight to this solution:

  • Create a predictive model that would give the managers a probability that the customer will convert or not.

Having that in hand, managers would be equipped with a tool to make a better selection of calls with a higher probability of success versus those customers that would need more work along the way.

Ergo, the definition of success for this project is estimating the probability of conversion, and the metric for the model will be F1-score. For the business, the metric could be the conversion rate, which would be compared in a Before and After comparative study.

Next, we need to start touching the data.

Data Understanding

The data we will use is the Bank Marketing dataset, available in the UCI Machine Learning Repository. It is open source under the Creative Commons 4.0 license.

The modules installed and imported in this project can be found on the project’s GitHub page.

Python">!pip install ucimlrepo --quiet
from ucimlrepo import fetch_ucirepo

# fetch dataset
bank_marketing = fetch_ucirepo(id=222)

# data (as pandas dataframes)
df = pd.concat([bank_marketing.data.features, bank_marketing.data.targets], 
               axis=1)
df = df.rename(columns={'day_of_week':'day'})

# View
df.sample(3)
The first look at the imported dataset. Image by the author.

Before starting to work on the data, we will go ahead and split it into train and test sets, so we keep it safe from [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)).

# Split in train and test sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('y', axis=1),
                                                    df['y'],
                                                    test_size=0.2,
                                                    stratify=df['y'],
                                                    random_state=42)

# train
df_train = pd.concat([X_train, y_train], axis=1)

# test
df_test = pd.concat([X_test, y_test], axis=1)

Great. Now we are ready to move on and understand the data. This is also known as Exploratory Data Analysis (EDA).

Exploratory Data Analysis

The first step in an EDA is to describe the data statistically. This already brings up insights that help us start understanding the data, such as spotting variables with potential errors or outliers, getting a sense of the distributions and averages, as well as learning which categories are the most frequent for categorical variables.

# Statistical description
df_train.describe(include='all').T
Statistical description of the data. Image by the author.

This simple one-line command already gives us the following insights:

  • The customers' age is 40 years old on average, with the distribution skewed to the right.
  • More than 20% of the customers are blue-collar workers.
  • Most of the customers are married, have secondary-level education, and own a house loan.
  • Only ~2% had a payment default.
  • Conversion Rate ~ 11.7%
  • The data is highly imbalanced towards the negative class.
Target variable. Negative class dominates it. Image by the author.

Once we know the distribution of the target variable, it is time to understand how the predictor variables interact with the target, trying to figure out which ones could be better for modeling the target variable’s behavior.

Age versus Conversions | Customers who converted to the campaigns are slightly younger than those who did not. However, both distributions are visually similar, even though the KS Test shows they are statistically different.

#Sample 1 - Age of the converted customers
converted = df_train.query('y == "yes"')['age']

#Sample 2 - Age of the not converted customers
not_converted = df_train.query('y == "no"')['age']

# Kolmogorov-Smirnov Test
# The null hypothesis is that the two distributions are identical
from scipy.stats import ks_2samp
statistic, p = ks_2samp(converted, not_converted)

if p > 0.05:
    print("The distributions are identical.")
else:
    print("The distributions are not identical: p-value ==", round(p,10))

----------
[OUT]:
The distributions are not identical: p-value == 0.0
# Age versus Conversion
plt.figure( figsize=(10,5))
ax = sns.boxenplot(data=df_train, x='age', y='y', hue='y', alpha=0.8)
plt.suptitle('Age versus Conversion')
plt.ylabel('Converted')
plt.title('Conversions are concentrated between 30 and 50 years old, which is not that different from the not converted', size=9)

# Annotation
# Medians and Averages
median_converted = df_train.query('y == "yes"')['age'].median()
median_not_converted = df_train.query('y == "no"')['age'].median()
avg_converted = df_train.query('y == "yes"')['age'].mean()
avg_not_converted = df_train.query('y == "no"')['age'].mean()
# Annotation - Insert text with Average and Median for each category
plt.text(95, 0, f"Avg: {round(avg_not_converted,1)} nMedian: {median_not_converted}",
         ha="center", va="center", rotation=0,
         size=9, bbox=dict(boxstyle="roundtooth, pad=0.5", fc="lightblue",
         ec="r", lw=0))
plt.text(95, 1, f"Avg: {round(avg_converted,1)} nMedian: {median_converted}",
         ha="center", va="center", rotation=0,
         size=9, bbox=dict(boxstyle="roundtooth, pad=0.5", fc="orange", 
         ec="r", lw=0));

The previous code yields this visualization.

Age versus Conversions. Image by the author.

Job vs. Conversions | Customers who hold management roles are converting more, followed by technicians, blue-collar workers, admins, and retired customers.

# job versus Conversions == "YES"
converted = df_train.query('y == "yes"')
plt.figure( figsize=(10,5))
# order of the bars from highest to lowest
order = df_train.query('y == "yes"')['job'].value_counts().index
# Plot and title
ax = sns.countplot(data=converted,
                   x='job',
                   order=order,
                   palette= 5*["#4978d0"] + 6*["#7886a0"])
plt.suptitle('Job versus Converted Customers')
plt.title('Most of the customers who converted are in management jobs. \n75% of the conversions are concentrated in 5 job-categories', size=9);
# X label rotation
plt.xticks(rotation=80);
#add % on top of each bar
for pct in ax.patches:
    ax.annotate(f'{round(pct.get_height()/converted.shape[0]*100,1)}%',
                (pct.get_x() + pct.get_width() / 2, pct.get_height()),
                ha='center', va='bottom')
Job versus Conversions. Image by the author.

Well, it does not make much sense to keep repeating code here for the visualizations, so I will go ahead and present only the graphics and the analysis from now on. Again, it is all available in this GitHub repository.

Marital status vs. Conversions | Married customers convert more to the term deposit.

Marital status vs Conversions. Image by the author.

Education vs. Conversion | More educated people convert more to a financial product. However, the converted distribution follows the overall dataset distribution, so this variable will probably not differentiate conversions from non-conversions.

Education vs. Conversion. Image by the author.

Balance vs. Conversion | Customers with a higher balance on their account are converting more. We tested the statistical significance of the samples and there is a difference.

Balance vs. Conversion. Image by the author.

In the previous plot, we arbitrarily removed the data points over the 98th percentile, so the visualization was better. We can see that the converted customers have higher balances, in general, but we can’t tell if there is a statistical difference between both groups. Let’s test that. Given that the distributions are heavily skewed to the right, we will use a non-parametric test, the Kolmogorov-Smirnov Test.

#Sample 1 - Balance of the converted customers
converted = df_train.query('y == "yes"')['balance']

#Sample 2 - Balance of the not converted customers
not_converted = df_train.query('y == "no"')['balance']

# Kolmogorov-Smirnov Test
# The null hypothesis is that the two distributions are identical
from scipy.stats import ks_2samp
statistic, p = ks_2samp(converted, not_converted)

if p > 0.05:
    print("The distributions are identical.")
else:
    print("The distributions are not identical: p-value ==", round(p,4))

---------
[OUT]: 
The distributions are not identical: p-value == 0.0

Are there people with negative balance converting to a term deposit? Common sense says that, in order to be able to deposit something, you must have money available. Therefore, if the customer is negative, they should not be able to convert to a deposit. However, we will see that it happens.

neg_converted = df_train.query('y == "yes" & balance < 0').y.count()
pct = round(neg_converted/df_train.query('y == "yes"').y.count()*100,1)
print(f'There are {neg_converted} conversions from people with negative acct balance. \nThis represents {pct}% of the total count of customers converted.')

---------
[OUT]:
There are 161 conversions from people with negative acct balance. 
This represents 3.8% of the total count of customers converted.

Duration vs. Conversions | In this plot, we can visually notice the impact of the duration of the phone calls on the conversions. Customers who converted stayed on the call at least twice as long as the other customers.

Duration vs. Conversion. Image by the author.

Campaign contacts vs. Conversions | People who converted received between 2 and 4 contacts, in general. After the 5th contact, the points for converted customers start to become sparse. For the non-converted, the points are more consistent through 13 contacts or so.

Campaign contacts vs. Conversion. Image by the author.

Previous Contacts vs. Converted | It appears that more previous contacts can influence the customer to convert. We notice in the graphic that the converted customers received a couple more calls than the non-converted ones.

Previous contacts vs. Conversion. Image by the author.

Previous campaign outcome vs. Conversions | Customers who converted in the past are more inclined to convert again. Likewise, customers with past failures tend to repeat the failure.

Previous Outcome vs. Conversion. Image by the author.

Contact Method vs. Conversions | Despite there being more conversions from customers contacted via cell phone, this just shows that there are fewer landlines. The conversion proportions are similar for both types of contact.

Contact Method vs. Conversion. Image by the author.

Month vs. Conversions | There are more conversions in the mid-year months; however, ~76% of the calls were made in those months, so possibly the campaign simply ran more heavily during that period.

Month vs. Conversion. Image by the author.

Day vs. Conversions | The conversions happen more around the most probable payment days 5, 15 and 30. We can notice higher peaks around these dates.

Day vs. Conversion. Image by the author.

Days since last contact vs. Conversions | Most of the conversions happened for customers contacted within the past 100 days from a previous campaign.

pDays vs. Conversion. Image by the author.

Most conversions (64%) are made in the first contact.

# The impact of the recency of the contact over conversions
total = df_train.query('y == "yes"').y.count()
print('First contact:', round( df_train.query('y == "yes" & pdays == -1').y.count()/total*100, 0 ), '%')
print('Up to 180 days:', round( df_train.query('y == "yes" & pdays > 0 & pdays <= 180').y.count()/total*100, 0 ), '%')
print('More than 180 days:', round( df_train.query('y == "yes" & pdays > 180').y.count()/total*100, 0 ), '%')

-------
[OUT]:
First contact: 64.0 %
Up to 180 days: 18.0 %
More than 180 days: 18.0 %

However, this is not different from the majority of the data. The non-converting customers with just the first contact are even higher in proportion (84%).

# The impact of the recency of the contact over Not converted
total = df_train.query('y == "no"').y.count()
print('First contact:', round( df_train.query('y == "no" & pdays == -1').y.count()/total*100, 0 ), '%')
print('Up to 180 days:', round( df_train.query('y == "no" & pdays > 0 & pdays <= 180').y.count()/total*100, 0 ), '%')
print('More than 180 days:', round( df_train.query('y == "no" & pdays > 180').y.count()/total*100, 0 ), '%')

-------
[OUT]:
First contact: 84.0 %
Up to 180 days: 6.0 %
More than 180 days: 10.0 %

Housing vs. Conversions | There are more conversions from people without a house loan – 1.7 times more conversions.

House Loan vs. Conversion. Image by the author.

Personal Loan vs. Conversions | There are more conversions from people without personal loans. Although this follows the overall distribution, people without a loan convert at a proportionally higher rate.

Personal Loan vs. Conversion. Image by the author.

Default vs. Conversions | Conversions come almost entirely from people without payment defaults, which makes sense, as those in default probably don't have money to spare.

Default vs. Conversion. Image by the author.

People without a default are converting at twice the rate (12%) of those with a default (6%).

Next, we are ready to write the summary of findings.

Exploration Summary

After thorough exploration of the data, we can summarize it as follows:

  • The converter profile is a 38 to 41 year old person, working in a management role, married, with at least a secondary-level education, holding a positive balance on their account and no housing or personal loan, thus with less debt.
  • Most conversions happened on the first contact (64%).
  • Customers not converted on the first contact received between 2 and 4 contacts before conversion.
  • The more contacts from the current campaign, the lower the probability of the customer having converted.
  • Customers never contacted before need more contacts on average than existing customers.
  • There are more chances of conversion for people contacted up to 10 times in previous campaigns.
  • Contacts from previous campaigns can impact the conversion of the current campaign, possibly indicating that relationship over time matters.
  • Customers that converted in previous campaigns are more likely to convert again, while those who failed to convert also show a tendency not to convert again.
  • The longer the contacts, the higher the chance of converting. People who converted stayed on the call up to 4 times longer. However, we can't use the duration of the call as a predictor.

Looking at the graphics after exploration, the variables duration, job, marital, balance, previous, campaign, default, housing and loan are interesting for modeling, as they impact more directly on the target variable. However, duration cannot be used, as it is not possible to know the duration of a phone call until it ends. The variable poutcome also looks promising, but it has too many NAs, so it needs further treatment to be considered.

Data Preparation

Understanding the data is very important for a better modeling. After the initial insights, we have an idea of what could drive more separation of the classes.

The next step is to prepare this dataset for modeling, transforming variables into categories or numbers, since many data science algorithms require only numbers as input.

Let’s get to work.

Dealing with Missing Data

Missing data can ruin our model, so we must treat them by removing or inputting data for those observations.

Here is what we have of missing data points.

# Checking for missing data
df_train.isna().sum()[df_train.isna().sum() > 0]

-------
[OUT]:

job 234
education 1482
contact 10386
poutcome 29589

Starting with job, out of those 234 NAs, we see that there are 28 converted customers that would be lost (0.6%) if we drop those NAs.

# NAs in job
 (df_train #data
 .query('job != job') # only NAs
 .groupby('y') #group by target var
 ['y']
 .count() #count values
 )

-------
[OUT]:

y 
no 206
yes 28

There would be three options in this case:

  1. Drop the NAs: only 0.6% may not make a difference
  2. Use Random Forest to predict what is the job.
  3. Add the most frequent job, which is blue collar.

We will move on with dropping them at this time, as we consider the number too small to make it worth predicting the job.

# Check the impact of NAs for the job variable in the conversions
df_train.query('job != job').groupby('y')['y'].value_counts()

# Drop NAs.
df_train_clean = df_train.dropna(subset='job')

Next, looking at education missing values. There are 1482 missing entries and 196 of those are Yes, which represents 4.6% of the converted customers. In this case, it is a considerable amount of converted observations to be dropped.

In this case, we are going to use the CategoricalImputer from feature_engine to impute the most frequent category for the education of these NAs.

# Check the impact of NAs for the education variable in the conversions
df_train.query('education != education').groupby('y')['y'].value_counts()

# Simple Imputer
imputer = CategoricalImputer(
    variables=['education'],
    imputation_method="frequent"
)

# Fit and Transform
imputer.fit(df_train_clean)
df_train_clean = imputer.transform(df_train_clean)

For poutcome, we must come up with a new category. This variable shows the result of a previous marketing campaign. According to our insight from the exploration phase, customers who converted in the past are more likely to convert again, so this variable becomes interesting for the model. However, it has a lot of missing values that need to go into a separate category, so we won't bias our model by imputing the vast majority of the data. We will input "unknown" for the NAs, as described in the data documentation.

# Input "unknown" for NAs.
df_train_clean['poutcome'] = df_train_clean['poutcome'].fillna('unknown')

For contact we will add "unknown" to the NAs, just like the data documentation says.

# Fill NAs with "unknown"
df_train_clean['contact'] = df_train_clean['contact'].fillna('unknown')
Data clean of missing values. Image by the author.

Next, we need other transformations in this dataset.

Categorical Transformation

Many models don’t deal well with categorical data. Therefore, we need to transform the data into numbers using an encoding type. Here is the strategy to be used for this project:

  • education, contact, balance, marital, job, and poutcome: For these variables, One Hot Encoding can be ideal.
  • default, housing, loan, and y are binary variables that will be mapped to no: 0 and yes: 1.
# Binarizing default, housing, loan, and y
df_train_clean = df_train_clean.replace({'no': 0, 'yes': 1})

There is a previous binning to be done on balance prior to One Hot Encoding.

# Balance in 3 categories: <0 = 'negative, 0-median = 'avg', >median = 'over avg'
df_train_clean = (
    df_train_clean
    .assign(balance = lambda x: np.where(x.balance < 0,
                                          'negative',
                                          np.where(x.balance < x.balance.median(),
                                                   'avg',
                                                   'over avg')
                                          )
    )
)

# One Hot Encoding for 'marital', 'poutcome', 'education', 'contact', 'job', 'balance'
from feature_engine.encoding import OneHotEncoder

# Instance
ohe = OneHotEncoder(variables=['marital', 'poutcome', 'education', 'contact', 'job', 'balance'], drop_last=True)

# Fit
ohe.fit(df_train_clean)

# Transform
df_train_clean = ohe.transform(df_train_clean)

# Move y to the first column
df_train_clean.insert(0, 'y', df_train_clean.pop('y'))

Next, month to numerical variable.

# Month to numbers
df_train_clean['month'] = df_train_clean['month'].map({ 'jan':1, 'feb':2, 'mar':3, 'apr':4, 'may':5, 'jun':6, 'jul':7, 'aug':8, 'sep':9, 'oct':10, 'nov':11, 'dec':12})

And other numerical variables will be categorized into bins to reduce the number of unique values, which can help classification models find patterns.

# Function to replace the variable data with the new categorized bins
def variable_to_category(data, variable, k):
  return pd.cut(data[variable], bins=k).astype(str)

# Transforming variable Age into bins
# Using Sturges rule, where number of bins k = 1 + 3.3*log10(n)
k = int( 1 + 3.3*np.log10(len(df_train_clean)) )

# Categorize age, pdays and previous into bins
for var in ['age', 'pdays', 'previous']:
  df_train_clean[var] = variable_to_category(df_train_clean, var, k=k)

# CatBoost Encoding the dataset
df_train_clean = ce.CatBoostEncoder().fit_transform(df_train_clean, df_train_clean['y'])

# View of the final dataset for modeling
df_train_clean.sample(5)

Next, you can see a partial view of the final data to be used for modeling.

Data cleaned and transformed for modeling. Image by the author.

Modeling comes in sequence.

Modeling

Once the data is prepared and transformed, we can start modeling. For this modeling step, we are going to start by testing many algorithms to see which one performs best. Knowing that the data has a huge imbalance, with 88% of the observations classified as no, we will use weights for the classes.

For this initial test, let’s get a sample of 10k observations randomly selected, so it runs faster.

# X and y sample for testing models
df_sample = df_train_clean.sample(10_000)
X = df_sample.drop(['y'], axis=1)
y = df_sample['y']

The code for the test is fairly extensive, but it can be seen in the GitHub repo.
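
For reference, a condensed sketch of what test_classifiers does is shown below. The classifier list, the weighted-F1 averaging, and the split settings are my assumptions and shorter than the repository version, which also covers CatBoost, XGBoost, and the other estimators in the output.

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

def test_classifiers(X, y):
    # Train each candidate on a holdout split and also cross-validate its F1 score
    classifiers = {
        'Logistic Regression': LogisticRegression(max_iter=1000, class_weight='balanced'),
        'Random Forest': RandomForestClassifier(class_weight='balanced'),
        'Gradient Boosting': GradientBoostingClassifier(),
    }
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
    rows = []
    for name, clf in classifiers.items():
        clf.fit(X_tr, y_tr)
        f1 = f1_score(y_te, clf.predict(X_te), average='weighted')
        cv_f1 = cross_val_score(clf, X, y, cv=5, scoring='f1_weighted').mean()
        rows.append({'Classifier': name, 'F1 Score': f1, 'Cross-Validated F1 Score': cv_f1})
    return pd.DataFrame(rows).sort_values('Cross-Validated F1 Score', ascending=False).reset_index(drop=True)

Calling it on the sample gives the ranking below.
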

# Example of using the function with your dataset
results = test_classifiers(X, y)
print(results)

-------
[OUT]:
               Classifier  F1 Score  Cross-Validated F1 Score
0                Catboost  0.863289                  0.863447
1             Extra Trees  0.870542                  0.862850
2       Gradient Boosting  0.868414                  0.861208
3                 XGBoost  0.858113                  0.858268
4           Random Forest  0.857215                  0.855420
5                AdaBoost  0.858410                  0.851967
6     K-Nearest Neighbors  0.852051                  0.849515
7           Decision Tree  0.831266                  0.833809
8  Support Vector Machine  0.753743                  0.768772
9     Logistic Regression  0.747108                  0.762013

The best-performing models for this problem were the Boosting ones. CatBoost was the top estimator, so we will work with it from now on.

Let’s move on with a new split and test, now for the whole cleaned training set.

# Split X and y
X = df_train_clean.drop(['y', 'duration'], axis=1)
y = df_train_clean['y']

# Split Train and Validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

Let us begin with a base model using all the columns, and then try to tune it from that starting point.

model = CatBoostClassifier(verbose=False)
# train the model
model.fit(X_train, y_train)

prediction = model.predict(X_val)

# confusion matrix
cm = pd.DataFrame(confusion_matrix(y_val, prediction)  )
print ("Confusion Matrix : n")
display(cm)

# Evaluate the weighted model
print('Base Model:')
print(classification_report(y_val, prediction))
Result of the base model. Image by the author.

As expected, the model does really well for the negative class, since there is a huge imbalance towards it. Even if the model classified everything as "no", it would still be correct 88% of the time, which is why accuracy is not the best metric for this kind of classification. The precision of the positive class is not bad, but the recall is terrible.
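
Here is a quick, illustrative check of that point; the 88/12 split below only mimics the proportions of the dataset and is not the actual data.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# A dummy target with roughly the same class proportions as our data
y_true = np.array([0] * 880 + [1] * 120)

# A "model" that always predicts the negative class
y_all_no = np.zeros_like(y_true)

print('Accuracy:', accuracy_score(y_true, y_all_no))        # 0.88 -- looks great
print('F1 (positive class):', f1_score(y_true, y_all_no))   # 0.0  -- useless

With that in mind, let's tune the model.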

For that, I ran a GridSearchCV and tested a few values of learning_rate, depth, class_weights, border_count, and l2_leaf_reg. The hyperparameters:

  • border_count: Controls the number of binning thresholds for numeric features. Lower values (e.g., 32 or 64) can reduce overfitting, which may help the model generalize better on imbalanced data.
  • l2_leaf_reg: Adds L2 regularization to the model. Higher values (e.g., 5 or 7) can penalize the model, reducing its complexity and potentially preventing it from being overly biased toward the majority class.
  • depth: Controls how deep the decision tree should go for classification.
  • learning_rate: how large is the step of the learning for each iteration when adjusting the weights of the algorithm.
  • class_weights: good for imbalanced data; we can give a higher weight to the minority class.
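
A sketch of how that search could be set up is shown below; the parameter grids here are illustrative, not the exact values I tried.

from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grids -- the real search covered a few values per hyperparameter
param_grid = {
    'learning_rate': [0.05, 0.1],
    'depth': [4, 6],
    'class_weights': [[1, 3], [1, 5]],
    'border_count': [32, 64],
    'l2_leaf_reg': [3, 5, 7],
}

grid = GridSearchCV(
    estimator=CatBoostClassifier(verbose=False),
    param_grid=param_grid,
    scoring='f1',
    cv=3,
)
grid.fit(X_train, y_train)
print('Best Parameters:', grid.best_params_)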

The Grid Search returned this to me:

Best Parameters: {'border_count': 64, 'class_weights': [1, 3], 'depth': 4, 'l2_leaf_reg': 5, 'learning_rate': 0.1}

Here, I am considering that a False Positive (1 when the truth is 0) is worse than a False Negative (true 1 is classified as 0). That is because, thinking as a manager, if I see a customer with a higher probability of converting, I wouldn’t like to spend energy on that call if that is a false positive. On the other hand, if I call a person with a lower probability, but that person converts, I have made my sale.

So, a few other tweaks were manually made by me with that in mind, and I came up with this code snippet.

# Tuning the estimator
model2 = CatBoostClassifier(iterations=300,
                            depth=5,
                            learning_rate=0.1,
                            loss_function='Logloss',
                            eval_metric='F1',
                            class_weights={0: 1, 1: 3},
                            border_count= 64,
                            l2_leaf_reg= 13,
                            early_stopping_rounds=50,
                            verbose=1000)

# train the model
model2.fit(X_train, y_train)

prediction2 = model2.predict(X_val)

# confusion matrix
cm = pd.DataFrame(confusion_matrix(y_val, prediction2))
print("Confusion Matrix:\n")
display(cm)

# Evaluate the tuned model
print('Tuned Catboost:')
print(classification_report(y_val, prediction2))
print('F1:', f1_score(y_val, prediction2))
print('Accuracy:', accuracy_score(y_val, prediction2))
Result of the tuned model. Image by the author.

Now, we can still run a Recursive Feature Elimination to select fewer variables and try to make this model simpler.
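A minimal sketch of such an elimination step is shown below, assuming scikit-learn's RFE with CatBoost as the estimator (which exposes feature_importances_) and a hypothetical number of features to keep; the reduced training set actually used in the project follows right after.

from catboost import CatBoostClassifier
from sklearn.feature_selection import RFE

# Sketch only: recursively drop the weakest features based on CatBoost importances
rfe = RFE(
    estimator=CatBoostClassifier(verbose=False),
    n_features_to_select=24,  # hypothetical target number of variables
    step=1,
)
rfe.fit(X_train, y_train)

selected_cols = X_train.columns[rfe.support_]
print(list(selected_cols))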

df_train_selected = df_train_clean[['age',  'job_admin.', 'job_services', 'job_management', 'job_blue-collar', 'job_unemployed', 'job_student', 'job_technician',
                                    'contact_cellular', 'contact_telephone', 'job_retired', 'poutcome_failure', 'poutcome_other', 'marital_single', 'marital_divorced',
                                    'previous', 'pdays', 'campaign', 'month', 'day', 'loan', 'housing', 'default', 'poutcome_unknown', 'y']]

The results are as follows.

Result of the selected variables model. Image by the author.

Despite being a good separator for the classes, the variable duration cannot be used, as it is not possible to know the duration of a phone call until it ends. But if we could, these are the results.

Result with the variable duration. Image by the author.

Look how considerably the F1 score improves!

I have also tried some ensemble models, such as a VotingClassifier and a StackingClassifier. The results are presented next.

Voting and Stacking Classifiers. Image by the author.
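For reference, here is a minimal sketch of how such ensembles could be assembled; the estimators combined here are assumptions, and the exact mix used in the project may differ.

from sklearn.ensemble import VotingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

# Soft voting averages the predicted probabilities of the base models
voting = VotingClassifier(
    estimators=[
        ('catboost', CatBoostClassifier(verbose=False)),
        ('xgb', XGBClassifier()),
        ('logreg', LogisticRegression(max_iter=1000)),
    ],
    voting='soft',
)
voting.fit(X_train, y_train)
y_pred = voting.predict(X_val)

# Stacking feeds the base models' predictions into a final estimator
stacking = StackingClassifier(
    estimators=[
        ('catboost', CatBoostClassifier(verbose=False)),
        ('xgb', XGBClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stacking.fit(X_train, y_train)
y_pred3 = stacking.predict(X_val)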

Having trained enough models, it is time to evaluate the results and potentially iterate to adjust the best model.

Evaluation

I like to create a table to display the results of the models. It makes it easier to compare them all together.

pd.DataFrame({
    'Model':['Catboost Base', 'Catboost Tuned', 'Catboost Selected Variables', 'Voting Classifier', 'Voting Classifier + SMOTE', 'Catboost + duration', 'Stacking Classifier'],
    'F1 Score': [f1_score(y_val, prediction), f1_score(y_val, prediction2), f1_score(ys_val, prediction3), f1_score(y_val, y_pred), f1_score(y_val, y_pred2), f1_score(y_vald, prediction4), f1_score(y_val, y_pred3)],
    'Accuracy': [accuracy_score(y_val, prediction), accuracy_score(y_val, prediction2), accuracy_score(ys_val, prediction3), accuracy_score(y_val, y_pred), accuracy_score(y_val, y_pred2), accuracy_score(y_vald, prediction4), accuracy_score(y_val, y_pred3)]
}).sort_values('F1 Score', ascending=False)
Comparison of the models. Image by the author.

The CatBoost model with the variable duration was by far the best one. However, we cannot use that extra variable, since the call duration is not available to managers until the call ends, so it makes no sense to use it for prediction.

So, the next best models were the Catboost Tuned and the model with the selected variables. Let’s take the tuned model and analyze the errors it is presenting. One way I like to do that is by creating some histograms or density plots, so we can see where the errors are concentrating for each variable.
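A sketch of how such a plot can be built, assuming we flag the misclassified validation rows of the tuned model and look at a few illustrative variables:

import matplotlib.pyplot as plt
import seaborn as sns

# Flag the misclassified observations of the tuned model on the validation set
errors = X_val.copy()
errors['error'] = (y_val.values != prediction2.ravel())

# Density of each variable, split by correct vs. wrong predictions
for var in ['age', 'campaign', 'pdays', 'previous']:  # illustrative subset
    sns.kdeplot(data=errors, x=var, hue='error', common_norm=False)
    plt.title(f'Error concentration by {var}')
    plt.show()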

Distribution of the errors by variable. Image by the author.

Concluding this study, it is clear that the variables presented cannot provide a solid separation of classes.

The imbalance is heavy, but the techniques to correct it – such as class weights and SMOTE – were not sufficient to improve class separation. This causes a problem for the model to find a pattern to properly classify the minority class 1 (converting customers) and perform better.

Given that there are too many observations where the customers did not convert, the variability of combinations labeled 0 is too large, overlaying and hiding class 1 within it. Thus, the observations falling in this overlapping region have similar probabilities for both classes, and that is where the model fails. They end up wrongly classified because of the imbalance, since the negative class has more strength and creates more bias.

Predictions

To predict the results, the input data must be in the same format as the data provided during training. So, I have created a function to take care of that. Once again, it’s available on GitHub.

# Preparing data for predictions
X_test, y_test = prepare_data(df_test)

# Predict
test_prediction = model3.predict(X_test)

# confusion matrix
cm = pd.DataFrame(confusion_matrix(y_test, test_prediction))
print("Confusion Matrix:\n")
display(cm)

# Evaluate the model
print('----------- Test Set Results -----------:')

print(classification_report(y_test, test_prediction))
print('-------------------------------')
print('F1:', f1_score(y_test, test_prediction))
print('-------------------------------')
print('Accuracy:', accuracy_score(y_test, test_prediction))

The results were as expected, i.e., aligned with what we had been seeing in training. The False Positives are slightly fewer than the False Negatives, which is better for our case. This prevents managers from erroneously going after customers who will not convert.

Result for the test set. Image by the author.

Finally, I also created a function to predict a single observation at a time, already thinking about the deployment application. The code that follows predicts one observation.
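A minimal sketch of what such a helper could look like is shown first (the real implementation is in the GitHub repo, so the names and steps here are assumptions); the observation and the call come right after.

def predict_single_entry(obs):
    # Sketch only: wrap the dict in a one-row DataFrame, reuse the same
    # preparation applied to the training data, align the columns with the
    # model's inputs, and return the probabilities for [no, yes]
    single = pd.DataFrame([obs])
    X_single, _ = prepare_data(single)
    X_single = X_single.reindex(columns=X_test.columns, fill_value=0)
    return model3.predict_proba(X_single)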

obs = {'age': 37,
       'job': 'management',
       'marital': 'single',
       'education': 'tertiary',
       'default': 'no',        # influential (see below)
       'balance': 100,
       'housing': 'yes',       # influential (see below)
       'loan': 'no',           # influential (see below)
       'contact': 'cellular',  # influential (see below)
       'day': 2,               # influential (see below)
       'month': 'aug',         # influential (see below)
       'duration': np.nan,
       'campaign': 2,          # influential (see below)
       'pdays': 272,           # influential (see below)
       'previous': 10,
       'poutcome': 'success',
       'y':99}

# Prediction
predict_single_entry(obs)

----------
[OUT]:
array([[0.59514531, 0.40485469]])

As a result, there is a 59% probability that this customer will not convert. This exercise was interesting because, as I manually changed one variable at a time, it was possible to see which ones had a larger influence on the model. It turns out that the variables default, housing, loan, day, contact_cellular, contact_telephone, month, campaign, and pdays changed the probabilities most drastically when modified.

So, I decided to create an even simpler model with those variables. And here is the true value of the CRISP-DM framework. I was almost done with the modeling when I noticed something new and went back to the beginning for another iteration.
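A minimal sketch of what that simpler model could look like, keeping only the influential variables listed above (hypothetical setup; the exact final configuration lives in the project's notebook):

# Variables that shifted the probabilities the most in the manual test above
influential = ['default', 'housing', 'loan', 'day', 'contact_cellular',
               'contact_telephone', 'month', 'campaign', 'pdays']

Xf_train, Xf_val, yf_train, yf_val = train_test_split(
    df_train_clean[influential], df_train_clean['y'],
    test_size=0.2, random_state=42)

model_simple = CatBoostClassifier(verbose=False, class_weights={0: 1, 1: 3})
model_simple.fit(Xf_train, yf_train)

print(classification_report(yf_val, model_simple.predict(Xf_val)))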

This is the result.

Final model results for the test set. Image by the author.

This model is not only simpler, but it also performs better. The gain is very small, but when the results are similar, the simpler model is better, because it requires less data, computing power, and training time. It is a cheaper model overall.

Well, this is a wrap. Let’s go to the final considerations now.

Deployment

CRISP-DM has a Deployment step, but we won’t cover that in this post. It is way too long already.

The deployment will be presented in a future post, with a Streamlit application. Stay tuned to my blog.

Before You Go

In this post, the intention was to go over a whole data science project following the CRISP-DM lifecycle framework.

CRISP-DM is one of the most used lifecycle frameworks for data science, as it is intuitive and complete. The framework preaches that we should not just follow a rigid sequence of steps. In fact, we can go back and forth whenever needed, as new concerns or discoveries arise.

I loved creating this project and writing this article. I learned a lot, truly. There were many times when I was in the modeling phase and learned something that could change the results, so I went back to the exploration and understanding phases to incorporate the new knowledge into the model, until I got to the final result, which is the best model I could create with the information and variables from this dataset.

This is a framework that I recommend. It can make you a better Data Scientist and your projects more complete.

Learn More

I have created a mini-course out of this content. So, if you liked this article, here is the link with a coupon code for you to redeem and enroll in the course. Everything you read here is taught in this quick course! Enjoy!

Follow me for more, and bookmark this post for future reference.

Gustavo Santos – Medium

Find me on Linkedin.

Code Repository

GitHub – gurezende/CRISP-DM-Classification: End to End Classification project using the CRISP-DM…

References

Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.

UCI Machine Learning Repository

CatBoostClassifier |

Cross-industry standard process for data mining – Wikipedia

CRISP-DM Help Overview

Leakage (machine learning) – Wikipedia

The post How I Created a Data Science Project Following CRISP-DM Lifecycle appeared first on Towards Data Science.

]]>
Make Your Way from Pandas to PySpark https://towardsdatascience.com/make-your-way-from-pandas-to-pyspark-c50d5928f6c3/ Thu, 26 Sep 2024 03:31:51 +0000 https://towardsdatascience.com/make-your-way-from-pandas-to-pyspark-c50d5928f6c3/ Learn a few basic commands to start transitioning from Pandas to PySpark

The post Make Your Way from Pandas to PySpark appeared first on Towards Data Science.

]]>

Introduction

I am part of a few data science communities on LinkedIn and elsewhere, and one thing that I see from time to time is people asking questions about PySpark.

Let’s face it: Data Science is too vast a field for anyone to know everything about. So, when I join a course or community about statistics, for example, sometimes people ask what PySpark is, how to calculate some stats in PySpark, and many other kinds of questions.

Usually, those who already work with Pandas are especially interested in Spark. And I believe that happens for a couple of reasons:

  1. Pandas is certainly famous and widely used by data scientists, but it is also certainly not the fastest package. As the data grows in size, the speed drops noticeably.
  2. It is a natural path for those who already master Pandas to want to learn a new option to wrangle data. As data becomes more available and in higher volumes, knowing Spark is a great option for dealing with Big Data.
  3. Databricks is very famous, and PySpark is possibly the most used API on the platform, along with SQL.

In this post, we will learn and compare the code snippets for Pandas and Pyspark for the main wrangling functions:

  • Summarizing
  • Slicing
  • Filtering
  • Grouping
  • Replacing
  • Arranging

Let’s dive in.

Preparing the Environment

For the sake of simplicity, we will use Google Colab in this exercise.

Once you open a new notebook in the tool, install pyspark and py4j and you should be good to go.

!pip install pyspark py4j

These are the modules we need to import. We will import SparkSession and the functions module, so we can use methods like mean, sum, max, when, etc. On the Python side, just pandas and numpy.

# import spark and functions
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col, mean, count, when

# Imports from Python
import pandas as pd
import numpy as np

Since we are working in a Jupyter Notebook, a Spark session needs to be initialized before you can use it. Notice that, if you’re using Databricks to code along (Community edition – free), the spark session is already initiated with the cluster and you won’t need to do that. But in Colab, we need this line of code.

# Create a spark session
spark = SparkSession.builder.appName("tests").getOrCreate()

This is the dataset to be used: California Housing, a sample dataset under the Creative Commons license.
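One note before we start: the snippets below reference a pth variable pointing to the CSV file. How you define it depends on where you stored the data; a hypothetical example for Colab could be:

# Hypothetical path -- adjust it to wherever you saved the California Housing CSV
pth = '/content/california_housing.csv'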

Pandas vs PySpark

The drill now will be presenting each code snippet and commenting on the differences between both syntaxes.

Load and View Data

To load data, both syntaxes are pretty similar. While in Pandas we use the pd alias, in Spark we use the name of the Spark Session, in our case, we named it spark. I recommend you use this name as a default, since many tutorials will give you code snippets with spark as the name of the Spark session.

Pandas method names use snake_case with underscores (read_csv), while Spark chains methods with a dot (spark.read.csv).

# Load data with Pandas
dfp = pd.read_csv(pth)

# View data
dfp.head()
Dataset displayed by Pandas. Image by the author.

Additionally, in Spark the schema (variable types) inference is not always automatic, so use the argument inferSchema=True. Now, let us load and view data with PySpark. The .limit(n) method will limit the result to the n rows we want to display.

# Load data to session
df = spark.read.csv(pth, header=True, inferSchema=True)

# Visualizing the Data
df.limit(5).show()
Dataset displayed by Spark. Image by the author.

Summarizing

The next comparison is the summarization. Usually, an easy way to summarize the statistics of a dataset is using the .describe() method. In Spark, the method is the same.

This is Pandas.

# Summarizing Data in Pandas
dfp.describe()
Describe from Pandas. Image by the author.

However, notice that Spark brings less information: the percentiles are not displayed. This is possibly because it is meant to work with a lot of data, so simplifying the output means fewer calculations, which should make it faster.

# Summarizing Data with Spark
df.describe().show()
Describe from Spark. Image by the author.

The next snippet will display the percentiles of the data, in case you want to see it.

# Percentiles in Spark
(df
 .agg(*[F.percentile(c, [.25, .5, .75]) for c in df.columns])
 .show()
 )

Slicing

Slicing is cutting the dataset to see a specific part of it. It is different from filtering, though, because it does not involve conditions.

In Pandas, the code is as follows. Using .loc[row,col] we can determine the row numbers and columns to display.

# Slicing (Selecting) Data in Pandas
dfp.loc[10:20, ['households', 'housing_median_age', 'median_house_value']]
Sliced data from Pandas. Image by the author.

Be aware that in Spark you can’t directly slice a data frame by rows, since it is not indexed like Pandas.

Spark Dataframes are not row-indexed like Pandas.

As you start using Spark, you will notice that its syntax is very influenced by SQL. So, to slice the data, we will select() the columns we want to see from the data.

# Slicing in Spark
(df # dataset
 .select('households', 'housing_median_age', 'median_house_value') #select columns
 .limit(10) # limit how many rows to display
 .show() #show data
)
Sliced data from Spark. Image by the author.

The next code adds the row number column and the ability to slice by row. However, this is an intermediate-level code for PySpark, using the Window functions. I will leave it here, and you can learn more with a link from the References section.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Slicing by columns and rows in Spark
(df # dataset
 .select('households', 'housing_median_age', 'median_house_value') #select columns
 .withColumn('row_n', row_number().over(Window.orderBy('households'))) #create row number column
 .filter( col('row_n').between(10,20) ) #slice by row number
 .show() #show data
)

Filtering

The filters allow us to show only the rows that meet certain conditions.

In Pandas, I prefer to use the .query() function (https://medium.com/gustavorsantos/pandas-query-the-easiest-way-to-filter-data-39e0163ef35a). It is much more intuitive and easy to use. But, of course, you can use the good-old slicing notation as well.

# Filtering data Pandas
dfp.query('housing_median_age < 20').head(10)

The code snippet is filtering on houses with less than 20 years since construction.

Filtered data from Pandas. Image by the author.

In Spark, the method to filter can be .filter() or it can be .where() (like in SQL). Within the filter function, you can add the col('col_name') and condition, like < 20 in the example. To add other conditions, just use & for AND and | for OR.

Note: in Spark, whenever we apply a transformation or a condition to a column, we must use the col() function to reference it as a column object. For example, here we are applying a condition to the column, thus the need for col().

# Filtering in Spark
(df #dataset
 .filter(col('housing_median_age') < 20) # Filter
 .show() #display data
 )
Filtered data from Spark. Image by the author.

Grouping

Now, getting to one of the most used wrangling operations, the group by, which allows us to aggregate data.

In Pandas, we can write it using the bracket notation, or we can use the dictionary style, which is similar to the Spark syntax.

# Grouping in Pandas
(dfp
 .groupby('housing_median_age')
 ['median_house_value'].mean()
 .reset_index()
 .sort_values('housing_median_age')
 .head(10)
)

# Get different aggregation values for different variables
(dfp
 .groupby('housing_median_age')
 .agg({'median_house_value': 'mean',
       'population':'max',
       'median_income':'median'})
 .reset_index()
 .sort_values('housing_median_age')
 .head(10)
 )
Grouped data from Pandas. Image by the author.

And next is the Spark Code. Notice how the second snippet is very similar to the Pandas’ syntax with a dictionary.

Another observation is that for the mean() function we are not using F.mean, but we use the F. prefix for F.max and F.median. The reason is that all the wrangling functions live in the module pyspark.sql.functions.

At the beginning of our code, we imported mean separately (from pyspark.sql.functions import mean), and the others we just imported as from pyspark.sql import functions as F. So, we must call F. for those not imported separately. This rule is just like the regular module importing rules in Python code.

# Grouping in Spark
 (df #dataset
 .groupBy('housing_median_age') #grouping
 .agg(mean('median_house_value').alias('median_house_value')) #aggregation func
 .sort('housing_median_age') #sort
 .show()#display
)

# Grouping different variables in Spark
(df #dataset
 .groupBy('housing_median_age') #grouping
 .agg(mean('median_house_value').alias('median_house_value'), #aggregation funcs
      F.max('population').alias('population'),
      F.median('median_income').alias('median_income'))
 .sort('housing_median_age') #sort
 .show() #display
)
Grouped data from Spark. Image by the author.

Another comment to make is that aggregation functions in PySpark change the column name in the output. So, for example, the output column would be named after the aggregation expression (something like avg(median_house_value)) if we had not used the .alias() function to rename it.
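To see that in practice, a quick check like the one below can be run (the exact generated name depends on the Spark version, but it typically follows the avg(column) pattern):

# Without .alias(), Spark names the aggregated column after the expression,
# typically something like avg(median_house_value)
(df
 .groupBy('housing_median_age')
 .agg(mean('median_house_value'))
 .printSchema()
)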

Replacing

To replace values, there are different ways to do that using Pandas. In the next code, we are using a combination of .assign and .where.

# Replacing values in Pandas
(dfp #dataset
 .assign(housing_median_age= 
         dfp['housing_median_age'].where(dfp.housing_median_age > 15, 
         other="potential buy") ) #assign replaced values to variable
 )
Replaced values with Pandas. Image by the author.

Using Spark, an easy way to do that is by rewriting the column (or adding a new one) using the conditions within the .when function from Spark. The code below rewrites the housing_median_age column using a .when function that replaces houses up to 15 years of age with the phrase "potential buy"; otherwise, it just repeats the current value.

# Replace values in Spark
(df #dataset
 .withColumn('housing_median_age',
                when(col('housing_median_age') <= 15, 'potential buy')
                .otherwise(col('housing_median_age'))
                ) #new column
 .show() #display
)
Replaced values with Spark. Image by the author.

Arranging

Finally, arranging data is just ordering it. In Pandas, we can do that using sort_values().

# Arrange values in Pandas
(dfp
 .sort_values('median_house_value')
 .head(10)
)
Ordered by price in Pandas. Image by the author.

Using Spark, the only difference is the function orderBy() replacing sort_values. If we want descending order, we can wrap the column with col() and chain the .desc() method, like this: col('col_name').desc().

# Arrange values in Spark
(df #dataset
 .orderBy('median_house_value') #order data
 .show() #display
)

# Arrange values in descending order
(df
 .orderBy(col('median_house_value').desc())
 .show()
)
Ordered by price in Spark. Image by the author.

Before You Go

That’s a wrap.

In this post, we have compared the main wrangling functions using Pandas and Pyspark and commented on their syntax differences.

This is useful for those starting their journey learning PySpark to wrangle big data. Remember, Spark is not difficult. If you already work with Python, the syntax is very easy to pick up.

People who know SQL code and Python will see a lot of influence of the SQL language in the PySpark methods. Those who already like Polars will benefit the most from knowledge transfer, given that both Polars and PySpark share many similarities in syntax.

Learn More

Interested in learning more about PySpark?

Well, lucky you, because I have a whole online course in Udemy and I am applying a coupon code to offer you the best price possible through this link.

Contacts

If you liked this content, follow me for more.

Gustavo Santos – Medium

Also, let’s connect on LinkedIn.

Code

You can find the code from this exercise in this GitHub repo:

https://github.com/gurezende/Studying/blob/master/PySpark/PySpark_in_Colab.ipynb

References

How to Replace Values in Pandas

Functions – PySpark 3.5.3 documentation

DataFrame – pandas 2.2.3 documentation

PySpark Window Functions

Pandas Query: the easiest way to filter data

The post Make Your Way from Pandas to PySpark appeared first on Towards Data Science.

]]>
A Closer Look at Scipy’s Stats Module – Part 2 https://towardsdatascience.com/a-closer-look-at-scipys-stats-module-part-2-203d27a71fa3/ Thu, 19 Sep 2024 05:53:24 +0000 https://towardsdatascience.com/a-closer-look-at-scipys-stats-module-part-2-203d27a71fa3/ Let's learn the main methods from scipy.stats module in Python.

The post A Closer Look at Scipy’s Stats Module – Part 2 appeared first on Towards Data Science.

]]>

Introduction

In the last post, the Closer Look at Scipy Stats—Part 1, we learned about distributions, statistics and hypothesis tests with one sample.

Now, we will move on learning about this powerful module and also check a couple of more complex functions available in this package.

In this post, we will learn about Statistical tests comparing two samples, Bootstraping, Monte Carlo simulations and a couple of transformations using Scipy.

Let’s go.

A Closer Look at Scipy’s Stats module – Part 1

Comparing Two Samples

Comparing two samples is a common task for data scientists. In Scipy, we can use the two independent samples test when we want to check if two different samples were drawn from the same distribution and thus have statistically similar averages.

# Two samples test: Comparison of means

# Sample 1
samp1 = scs.norm.rvs(loc=2, scale=5, size=100, random_state=12)
# Sample 2
samp2 = scs.norm.rvs(loc=3, scale=3, size=100, random_state=12)

# Hypothesis test
scs.ttest_ind(samp1, samp2, equal_var=False)
TtestResult(statistic=-2.1022782237188657, pvalue=0.03679301172995361, df=198.0)

At a significance level of 0.05, the null hypothesis is rejected in favor of the alternative, meaning that the samples do not have statistically equal means.

It is important to note that the T-Test for comparison of means has an equal_var argument. When the samples don’t have similar variances, it should be set to False.

One of the ways to detect that is using the levene test.

# Levene for equal variances test
scs.levene(samp1, samp2)

The resultant p-value rejects Ho in favor of Ha, confirming that the samples have different variances.

LeveneResult(statistic=18.17151743553794, pvalue=3.1222803039659347e-05)

And we can confirm the different variation of the samples by looking at the variation function (coefficient of variation). Sample 1 varies more than three times as much as sample 2.

scs.variation(samp1), scs.variation(samp2)

[OUT]
(4.090657648137783, 1.2223438764152452)

We also have the two related samples test, which uses pretty similar code.

# Hypothesis test
scs.ttest_rel(samp1, samp2)

Similarly, there is the Kolmogorov-Smirnov comparison of two samples. The KS test will compare both samples’ empirical cumulative distributions to determine if they match closely or not, helping to test if they can be from the same population.

# KS test for two samples
scs.ks_2samp(samp1, samp2, alternative='two-sided')

Resampling

Scipy has functions for resampling, such as Bootstrap and Monte Carlo simulations.

Bootstrap

First, bootstrap. It resamples the data and calculates the chosen statistic N times, returning a confidence interval for it.

# Bootstrap
dist = scs.norm.rvs(loc=10, scale=1, size=300, random_state=12)

dist = (dist,)  # samples must be in a sequence, as per documentation
scs.bootstrap(dist, statistic=np.mean, confidence_level=0.95)
[OUT]
BootstrapResult(confidence_interval=ConfidenceInterval(low=9.717562154498236, 
high=9.955179688263437), 
bootstrap_distribution=array([9.84762955, 9.87137856, 9.82459403, ..., 
9.89246925, 9.81027895, 9.83134846]), standard_error=0.06040179036433725)

Monte Carlo Test

Now, monte_carlo_test is a little more complex to understand. It took me some time to study and understand what’s behind it. I left in the Reference section a couple of good articles for you to get more familiar with it, in case you want to deepen your knowledge.

In a nutshell, what this function does is compare the statistic calculated from the provided sample with the same statistic computed on many synthetic samples drawn under the null hypothesis (here, a normal distribution). If the observed value is more extreme than most of the simulated ones, the null hypothesis that the sample comes from that distribution is rejected.

Let’s see that in action.

Suppose we have a sample of 10 people weighing 70kg on average with a standard deviation of 10kg. Now let’s test, using Monte Carlo simulations (10k resamples), whether that sample is statistically heavier on average than the population (65kg average, 10kg standard deviation), at a 95% confidence level.

# Generate a sample from a normally distribution
x = scs.norm.rvs(loc=70, scale=10, size=10).astype(np.int32)
rvs = lambda size: scs.norm.rvs(loc=65, scale=10, size=size).astype(np.int32)

# Monte Carlo Test
scs.monte_carlo_test(x, rvs, np.mean, n_resamples=10000, alternative='greater')
[OUT]
MonteCarloTestResult(statistic=68.3, pvalue=0.11678832116788321, null_distribution=array([60.9, 63.1, 65.1, ..., 63.2, 61.9, 67.3]))

According to our simulations, the p-value is higher than the significance level of 5%, so we fail to reject the null hypothesis: there is no evidence that the sample group is heavier than the population.

We certainly could use a simple T-Test for that and get a similar result, but as we are simulating N times with the monte_carlo_test, it can be more accurate.

x = scs.norm.rvs(loc=70, scale=10, size=10).astype(np.int32)
y = scs.norm.rvs(loc=65, scale=10, size=10).astype(np.int32)

# T-Test of means
scs.ttest_ind(x, y, equal_var=True)
[OUT]
TtestResult(statistic=1.688107424484386, pvalue=0.1086404147869201, df=18.0)

Transformations

Finally, let’s talk about some transformations that can be done using Scipy.

Normalizing the target variable is also a common procedure to get more accurate predictions. One way to do that is using the boxcox transformation, available in Scipy.

# sample Exponential Distribution
exp_ = scs.exponpow.rvs(b=10, loc=3, scale=1, size=300, random_state=12)

# Box Cox
boxcox_transformed = scs.boxcox(exp_)

The Box-Cox transformation is a statistical technique that makes data closer to a normal distribution by applying an optimal power (lambda) to the data.

Ok. We have created an exponential distribution and then normalized it with Box Cox. Now we can plot the results.

# Create figure
fig = plt.figure(figsize=(9, 7))

# Subplot 1
ax1 = fig.add_subplot(211)
scs.probplot(exp_, plot=ax1)
ax1.set_xlabel('')
ax1.set_title('Probplot against normal distribution')

# Subplot 2
ax2 = fig.add_subplot(212)
scs.probplot(boxcox_transformed[0], plot=ax2);
ax2.set_title('Probplot after Box-Cox transformation');

The result of the above code snippet is this next plot. See how the Box-Cox transformed variable is much closer to the normality line.

Distribution normalized. Image by the author.

Since we didn’t use the lmbda argument in the Box-Cox transformation, the function calculates the optimal value itself and returns it as the second element of the returned tuple.

But we can also use boxcox_normmax to calculate only the optimal lambda value when needed.

# Calculating optimal lambda with Scipy
scs.boxcox_normmax(exp_)

Z Scores

Another possible transformation is calculating the Z-scores of the values in an array. The Z-score is the equivalent value of the observation when it is transformed to a standard normal distribution (mean=0, std=1). The transformation is easy.

The Z score is the equivalent value of the observation when it is transformed to a standard normal distribution (mean=0, std=1).

# Distribution
dist = [12,23,13,34,55,16,1]

# Calculating Z scores = (point_estimate - mean) / std
Z = scs.zscore(dist)
Z
[OUT]
array([-0.60825887,  0.06082589, -0.54743299,  0.72991065,  2.00725429,  -0.36495532, -1.27734364])
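As a quick sanity check of the formula, the same numbers can be computed manually with NumPy (scs.zscore uses the population standard deviation by default, ddof=0):

import numpy as np

dist = np.array([12, 23, 13, 34, 55, 16, 1])
z_manual = (dist - dist.mean()) / dist.std()  # ddof=0, matching scs.zscore's default
print(z_manual)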

From that, we can use the CDF to calculate the probability of a result greater than each point in the distribution (the upper tail, 1 - CDF).

# Use the cumulative distribution function (CDF) to find the probability
# of a result greater than each point (the upper tail: 1 - CDF)
probability = scs.norm.cdf(Z)

for i, n in zip(probability, dist):
    print(f"Probability of a number more extreme than {n}:", 1 - i)
[OUT]
Probability of a number more extreme than 12: 0.7284921038135481
Probability of a number more extreme than 23: 0.4757489366442327
Probability of a number more extreme than 13: 0.7079593510974136
Probability of a number more extreme than 34: 0.23272240120113985
Probability of a number more extreme than 55: 0.02236129701530054
Probability of a number more extreme than 16: 0.6424276224116628
Probability of a number more extreme than 1: 0.8992595228449655

As expected, the larger the number in our distribution, the lower the probability of finding a more extreme value.

Before You Go

Well, I believe we have covered a lot in these two posts.

This material is a good resource for anyone willing to learn more about Scipy and add these tools to their Data Science toolbox.

Make sure to bookmark these posts and also always look at the documentation to understand the functions from this great package.

Again, you can access the whole code in this GitHub repo:

Studying/Python/scipy at master · gurezende/Studying

If you liked the content, don’t forget to follow me.

Gustavo Santos – Medium

Also find me on LinkedIn.

References

Statistical functions (scipy.stats) – SciPy v1.14.1 Manual

Monte Carlo Hypothesis Tests versus Traditional Parametric Tests

Notebook on nbviewer

boxcox – SciPy v1.14.1 Manual

The post A Closer Look at Scipy’s Stats Module – Part 2 appeared first on Towards Data Science.

]]>
A Closer Look at Scipy’s Stats module – Part 1 https://towardsdatascience.com/a-closer-look-at-scipys-stats-module-part-1-5071858f32c1/ Thu, 19 Sep 2024 05:50:10 +0000 https://towardsdatascience.com/a-closer-look-at-scipys-stats-module-part-1-5071858f32c1/ Let's learn the main methods from scipy.stats module in Python.

The post A Closer Look at Scipy’s Stats module – Part 1 appeared first on Towards Data Science.

]]>
A Closer Look at Scipy’s Stats Module – Part 1
Photo by Алекс Арцибашев on Unsplash

Introduction

Data Scientists and programmers who have been in the field long enough and have chosen Python as their programming language have certainly heard about Scipy, or Scientific Python.

Scipy is a Python package for (as it says) scientific operations and computations. Here, in this post, we will talk more specifically about the scipy.stats module, which is the one used for statistical tests in Python.

Scipy Stats

Scipy.stats is the powerful module for statistics within Scientific Python. It is a very resourceful module that brings many methods for creating random distributions, creating statistical tests, resampling, transformations, Monte Carlo simulations, and much more.

In this first part of the post, we will explore the distributions, statistics and hypothesis tests.

Next, in a second post, we will see tools for dealing with more than one sample, resampling, and transformations.

Let’s dive into some of the main methods of the statistical module of Scipy.

Distributions

Scipy Stats has a large number of random distributions that one can replicate. After all, in statistics, when modeling, we are doing nothing more than finding the most probable number within a distribution.

This module has the most famous distributions and many others available for the researcher to choose from. Some examples are alpha, beta, normal, Poisson, chi, cosine, exponential, uniform, and gamma.

Scipy allows us to easily replicate famous distributions, such as normal, exponential, gamma, uniform…

To simulate a distribution, all you need to do is to find the one you need and use the .rvs() method. Let’s see some examples.

# normal distribution
normal = scs.norm.rvs(loc=0, scale=1, size=100)
sns.kdeplot(normal);
Normal Distribution. Image by the author.
# Poisson distribution
pois = scs.poisson.rvs(mu=2, size=500)

#plot
sns.histplot(pois);
Poisson distribution. Image by the author.
# exponential distribution
expo = scs.expon.rvs(loc=3, scale=1, size=100, random_state=12)

# plot
sns.histplot(expo);
Exponential distribution. Image by the author.

So we can see how easy it is to create a random distribution using scipy. The complete list of distributions available can be found in this link, in the documentation of the package.

We can also create the Cumulative Distribution Function (CDF) – to calculate the probability of a value falling at or below a given point (and, with 1 - CDF, of a more extreme value) – and/or the Probability Density Function (PDF) – to calculate the probability density around a number in a distribution.

scs.norm.cdf(normal)
scs.norm.pdf(normal)

The Probability Density Function (PDF) will give us the density of the probability around the number, or how concentrated the probability is around that number. So, it is expected that the density will be high near the average.

# Normal distribution
normal = scs.norm.rvs(loc=10, scale=5, size=50, random_state=12)

# Mean and Standard Deviation
mu = normal.mean()
sd = normal.std()

# Probability Density at the mean
scs.norm.pdf(10, loc=mu, scale=sd)

[OUT]:
0.07054625186263343

The actual probability of a number falling exactly on the mean point (10) is much lower, since a continuous distribution has infinitely many possible values (10.01, 10.02, 10.001, etc.).

# Use Case: what is the probability that a number falls within 0.001 of 10?
scs.norm.cdf(10.001, loc=mu, scale=sd) - scs.norm.cdf(9.999, loc=mu, scale=sd)

[OUT]:
0.00014109250299032539

Comparing the results: the probability density is about 7%, while the actual probability is close to 0.014%.

Statistics

Since we can create many distributions, we can also calculate their statistics easily in scipy. So, let us create this standard normal distribution to calculate some stats on it.

# Creating a normal distribution
d = scs.norm.rvs(loc=0, scale=1, size=100, random_state=12)

Now, just like in Pandas, for example, we have the .describe() method to calculate many stats at once.

# Describe
scs.describe(d)

[OUT]
DescribeResult(nobs=100, minmax=(-3.14741652154398, 2.8718193949889166), 
mean=-0.14430749763868853, variance=1.1050630588641528, 
skewness=0.05470178363174691, kurtosis=0.27834782892208176)

You can access the numbers individually as well if wanted:

stats_ = scs.describe(d)
stats_.mean

-0.14430749763868853

stats_.variance
stats_.skewness
stats_.kurtosis

The .describe() function is useful because it gives you many stats in a single method. Now, whenever needed, Scipy also brings a number of other summary statistics, like mode, skew, kurtosis.

tmean is a nice method to calculate the trimmed mean: the mean of only the values within given lower and upper limits. There are also trimmed versions of min, max, var, and standard deviation.

# Mean only of the values between 1 and 3 in the distribution d
scs.tmean(d, (1,3))

1.5427740357627466

We can rank data using rankdata. Observe how it even handles the ties.

d = [1,1,2,7,3,4,5,6]
scs.rankdata(d)

[OUT]:
array([1.5, 1.5, 3. , 8. , 4. , 5. , 6. , 7. ])

We can find the repetitions of an array.

scs.find_repeats([1,2,3,3,4,5,6])

RepeatedResults(values=array([3.]), counts=array([2]))

Hypothesis Tests

In Data Science, hypothesis tests are a must for different occasions. We should compare numbers and remove the uncertainty of whether the result we’re seeing is just by chance. That is the utility of a hypothesis test.

Say you heard about these jobs from a company that claims it is paying over the average of the market for the job roles announced. It pays $120k, $115k, $125k, $138k for different roles in Data Science.

Curious to know whether that statement is true, you research some websites and find out that the average salary for a Data Scientist in your state is around $130k. So, is the company's statement true or not, according to statistics?

We can easily solve that problem with a Scipy hypothesis test for one sample against the population mean.

# Company salary
company = [120, 115, 125, 138]

# Market average
mkt = 130

# Hypothesis test
# Null: the company's average pay is not above the market average
# Alternative: the company pays more than the market average
scs.ttest_1samp(company, popmean=mkt, alternative='greater')

[OUT]
TtestResult(statistic=-1.1130623746355601, pvalue=0.8265797321607529, df=3)

The result, at a 0.05 significance level, shows a p-value of 0.82; therefore, we cannot reject the null hypothesis, i.e., there is no evidence that the company pays more than the market average.

We conclude that, according to statistics, the data does not support the company's claim.

There is the binomtest, which uses a binomial (Bernoulli) model to test the probability of success in a series of trials: event or no event. Let's say you have a coffee shop and you observe that 8% of 150 customers are buying a new drink, so 12 customers. Can we test whether, in the same environment (given the same probability of success = 8%), we could see 20 customers buying the new drink? The significance level is 0.05.

# Binomial Test
successes = 20
trials = 150
p = 0.08

scs.binomtest(successes, trials, p, alternative='greater')

[OUT]
BinomTestResult(k=20, n=150, alternative='greater', 
statistic=0.13333333333333333, pvalue=0.01684360036905186)

As our p-value (0.016) is smaller than the significance level, we reject the null hypothesis and conclude that, given an 8% success rate, it is not probable that 20 customers would buy the new drink in a day.

To test for normality, there are many ways. In Scipy, the normaltest method compares a given sample to a normal distribution. The null hypothesis is that the sample is normally distributed. If you get a p-value under your significance level, you reject normality. Next, we can see the test with a normal distribution and an exponential one. It works as expected.

# Normal Test
d_norm = scs.norm.rvs(loc=0, scale=1, size=100, random_state=12)
d_expon = scs.expon.rvs(loc=3, scale=1, size=100, random_state=12)

print("Normal:",scs.normaltest(d_norm))
print("Exponential:", scs.normaltest(d_expon))

Normal: NormaltestResult(statistic=0.8190182789113444, pvalue=0.6639760898237252)
Exponential: NormaltestResult(statistic=33.69547265654829, pvalue=4.820821820360837e-08)

Besides that, we have other famous normality tests like shapiro, anderson, as well as Kolmogorov-Smirnov ks_1samp for goodness of fit against another distribution.

# Here, distr stands for the sample you want to test

# Shapiro-Wilk (for samples up to 5000 observations)
scs.shapiro(distr)

# Anderson-Darling
scs.anderson(distr)

# KS test vs. a normal distribution
scs.ks_1samp(distr, scs.norm.cdf)

Before You Go

Well, I believe we have covered a lot so far.

We talked about how to create distributions with the .rvs() function, how to use .cdf() and .pdf(), how to calculate statistics, and how to perform hypothesis tests.

In the second part of this post, we will check other functionalities of the powerful scipy.stats module.

Stay tuned.

To check the code from this post, go to this GitHub repo.

Studying/Python/scipy at master · gurezende/Studying

If you liked this content, follow me for more.

Gustavo Santos – Medium

Find me on Linkedin as well.

A Closer Look at Scipy’s Stats Module – Part 2

References

Statistical functions (scipy.stats) – SciPy v1.14.1 Manual

Monte Carlo Hypothesis Tests versus Traditional Parametric Tests

Notebook on nbviewer

boxcox – SciPy v1.14.1 Manual

The post A Closer Look at Scipy’s Stats module – Part 1 appeared first on Towards Data Science.

]]>