
A Closer Look at Scipy’s Stats module – Part 1

Let's learn the main methods of the scipy.stats module in Python.

Photo by Алекс Арцибашев on Unsplash

Introduction

Data Scientists and programmers who have been in the field long enough and have chosen Python as their programming language have certainly heard of Scipy, or Scientific Python.

Scipy is a Python package for (as the name says) scientific operations and computations. In this post, we will talk more specifically about the scipy.stats module, which is the one used for statistical tests in Python.

Scipy Stats

Scipy.stats is the powerful module for statistics within Scientific Python. It is a very resourceful module that brings many methods for creating random distributions, running statistical tests, resampling, transformations, Monte Carlo simulations, and much more.

In this first part of the post, we will explore the distributions, statistics and hypothesis tests.

Next, in a second post, we will see tools for dealing with more than one sample, resampling, and transformations.

Let’s dive into some of the main methods of the statistical module of Scipy.

Distributions

Scipy Stats has a large number of random distributions that one can replicate. After all, when modeling in statistics, we are essentially finding the most probable values within a distribution.

This module has the most famous distributions and many others available for the researcher to choose from. Some examples are alpha, beta, normal, Poisson, chi, cosine, exponential, uniform, and gamma.

Scipy allows us to easily replicate famous distributions, such as normal, exponential, gamma, uniform…

To simulate a distribution, all you need to do is to find the one you need and use the .rvs() method. Let’s see some examples.

# imports used throughout this post
import scipy.stats as scs
import seaborn as sns

# normal distribution
normal = scs.norm.rvs(loc=0, scale=1, size=100)
sns.kdeplot(normal);
Normal Distribution. Image by the author.
# Poisson distribution
pois = scs.poisson.rvs(mu=2, size=500)

#plot
sns.histplot(pois);
Poisson distribution. Image by the author.
# exponential distribution
expo = scs.expon.rvs(loc=3, scale=1, size=100, random_state=12)

# plot
sns.histplot(expo);
Exponential distribution. Image by the author.

So we can see how easy it is to create a random distribution using scipy. The complete list of distributions available can be found in this link, in the documentation of the package.

We can also use the Cumulative Distribution Function (CDF) – to calculate the probability that a value falls at or below the number you are testing – and/or the Probability Density Function (PDF) – to calculate the probability density around a number in a distribution.

# CDF and PDF of a standard normal, evaluated at each point of the sample
scs.norm.cdf(normal)
scs.norm.pdf(normal)
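For instance, still assuming a standard normal distribution, the CDF at a single point gives the probability of drawing a value at or below that point. A quick illustration:

# P(X <= 1.96) for a standard normal: about 97.5%
scs.norm.cdf(1.96)

# P(X <= 0) is exactly 0.5, by symmetry around the mean
scs.norm.cdf(0)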

The Probability Density Function (PDF) gives us the density of the probability around a number, or how concentrated the probability is around that number. So, it is expected that the density will be high near the mean.

# Normal distribution
normal = scs.norm.rvs(loc=10, scale=5, size=50, random_state=12)

# Mean and Standard Deviation
mu = normal.mean()
sd = normal.std()

# Probability Density at the mean
scs.norm.pdf(10, loc=mu, scale=sd)

[OUT]:
0.07054625186263343

The actual probability of a number falling exactly on the mean point (10) is much lower, since a continuous distribution has essentially infinite possible values (10.01, 10.02, 10.001, etc.).

# Use case: what is the probability that a number falls between 9.999 and 10.001?
scs.norm.cdf(10.001, loc=mu, scale=sd) - scs.norm.cdf(9.999, loc=mu, scale=sd)

[OUT]:
0.00014109250299032539

Comparing the results: the probability density at the mean is about 0.07, while the actual probability of falling within that narrow interval is close to 0.014%.
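As a side note, when you need the probability of a value being greater than a point, the survival function .sf(), which equals 1 - CDF, is a convenient shortcut. A minimal sketch, reusing the mu and sd calculated above:

# P(X > 10): survival function = 1 - CDF
scs.norm.sf(10, loc=mu, scale=sd)

# equivalent, but less numerically stable in the far tails
1 - scs.norm.cdf(10, loc=mu, scale=sd)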

Statistics

Since we can create many distributions, we can also easily calculate their statistics in scipy. So, let us create a standard normal distribution to calculate some stats on it.

# Creating a normal distribution
d = scs.norm.rvs(loc=0, scale=1, size=100, random_state=12)

Now, similar to .describe() in Pandas, we have the describe() function to calculate many stats at once.

# Describe
scs.describe(d)

[OUT]
DescribeResult(nobs=100, minmax=(-3.14741652154398, 2.8718193949889166), 
mean=-0.14430749763868853, variance=1.1050630588641528, 
skewness=0.05470178363174691, kurtosis=0.27834782892208176)

You can also access the numbers individually, if needed:

stats_ = scs.describe(d)
stats_.mean

-0.14430749763868853

stats_.variance
stats_.skewness
stats_.kurtosis

The describe() function is useful because it gives you many stats in a single call. Whenever needed, Scipy also brings a number of other summary statistics, like mode, skew, and kurtosis.
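As a quick sketch, here is how those individual functions look on the sample d created above (note that mode is more meaningful for discrete data):

# Individual summary statistics on the sample d
scs.mode(d)      # most frequent value and its count
scs.skew(d)      # skewness
scs.kurtosis(d)  # excess kurtosis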

tmean is a nice function to calculate the trimmed mean: the mean of only the values within given lower and upper limits. There are also trimmed versions of min, max, variance, and standard deviation (tmin, tmax, tvar, tstd).

# Mean only of the values between 1 and 3 in the distribution d
scs.tmean(d, (1,3))

1.5427740357627466
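The other trimmed statistics follow the same pattern. A minimal sketch with the same limits:

# Trimmed statistics of d within given limits
scs.tmin(d, 1)        # trimmed minimum: smallest value >= 1
scs.tmax(d, 3)        # trimmed maximum: largest value <= 3
scs.tvar(d, (1, 3))   # trimmed variance
scs.tstd(d, (1, 3))   # trimmed standard deviation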

We can rank data using rankdata. Observe how it handles the ties.

d = [1,1,2,7,3,4,5,6]
scs.rankdata(d)

[OUT]:
array([1.5, 1.5, 3. , 8. , 4. , 5. , 6. , 7. ])
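By default, tied values receive the average of their ranks, but rankdata also accepts a method argument ('min', 'max', 'dense', 'ordinal') to resolve ties differently. For example:

# 'dense': ties get the same rank and no ranks are skipped
scs.rankdata(d, method='dense')

[OUT]:
array([1, 1, 2, 7, 3, 4, 5, 6])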

We can also find the repeated values in an array.

scs.find_repeats([1,2,3,3,4,5,6])

RepeatedResults(values=array([3.]), counts=array([2]))

Hypothesis Tests

In Data Science, hypothesis tests are a must on many occasions. We often need to compare numbers and quantify the chance that the result we are seeing is just random noise. Hence the utility of a hypothesis test.

Say you heard about job openings from a company that claims to pay above the market average for the announced roles. It pays $120k, $115k, $125k, and $138k for different roles in Data Science.

Curious to know whether that statement holds, you research on websites and find out that the average salary for a Data Scientist in your state is around $130k. So, is the company's statement true, according to statistics?

We can easily solve that problem with a Scipy hypothesis test for one sample against the population mean.

Python"># Company salary
company = [120, 115, 125, 138]

# Market average
mkt = 130

# Hypothesis test
# Null: the company pays at most the market average
# Alternative: the company pays more than the market average
scs.ttest_1samp(company, popmean=mkt, alternative='greater')

[OUT]
TtestResult(statistic=-1.1130623746355601, pvalue=0.8265797321607529, df=3)

The result, at a 0.05 significance level, shows a p-value of 0.82; therefore, we cannot reject the null hypothesis that the company pays at most the market average.

We conclude that, statistically speaking, the data does not support the company's statement.
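As a complement, recent SciPy versions (1.10+) let you extract a confidence interval for the sample mean directly from the test result object. A minimal sketch:

# Confidence interval of the mean (requires SciPy 1.10+)
result = scs.ttest_1samp(company, popmean=mkt, alternative='greater')
result.confidence_interval(confidence_level=0.95)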

There is also binomtest, which uses the binomial distribution (a sequence of Bernoulli trials: event or no event) to test a probability of success. Let's say you have a coffee shop and you assess that 8% of 150 customers are buying a new drink, that is, 12 customers. Can we test whether, in the same environment (given the same probability of success of 8%), we could see 20 customers buying the new drink? The significance level is 0.05.

# Binomial Test
successes = 20
trials = 150
p = 0.08

scs.binomtest(successes, trials, p, alternative='greater')

[OUT]
BinomTestResult(k=20, n=150, alternative='greater', 
statistic=0.13333333333333333, pvalue=0.01684360036905186)

As our p-value (0.016) is smaller than the significance level, we reject the null hypothesis and conclude that, under those conditions (a true success rate of 8%), seeing 20 customers buy the new drink in a day is improbable.
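If you also want an interval estimate, the binomtest result object exposes a confidence interval for the observed proportion of successes (SciPy 1.7+). A minimal sketch:

# Confidence interval for the observed proportion (SciPy 1.7+)
result = scs.binomtest(successes, trials, p, alternative='greater')
result.proportion_ci(confidence_level=0.95)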

To test for normality, there are many options. In Scipy, the normaltest method compares a given sample against a normal distribution. The null hypothesis is that the sample is normally distributed: if you get a p-value under your significance level, then the sample is not normally distributed. Next, we can see the test with a normal distribution and an exponential one. It works as expected.

# Normal Test
d_norm = scs.norm.rvs(loc=0, scale=1, size=100, random_state=12)
d_expon = scs.expon.rvs(loc=3, scale=1, size=100, random_state=12)

print("Normal:",scs.normaltest(d_norm))
print("Exponential:", scs.normaltest(d_expon))

Normal: NormaltestResult(statistic=0.8190182789113444, pvalue=0.6639760898237252)
Exponential: NormaltestResult(statistic=33.69547265654829, pvalue=4.820821820360837e-08)

Besides that, we have other well-known normality tests such as shapiro and anderson, as well as the Kolmogorov-Smirnov test ks_1samp for goodness of fit against another distribution.

# distr can be any 1-D sample, e.g. the d_norm created above
distr = d_norm

# Shapiro-Wilk (recommended for samples up to 5000)
scs.shapiro(distr)

# Anderson-Darling
scs.anderson(distr)

# Kolmogorov-Smirnov test against a standard normal distribution
scs.ks_1samp(distr, scs.norm.cdf)
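Note that anderson works a bit differently: instead of a p-value, it returns the test statistic along with critical values at several significance levels. A sketch of how one might read the result:

# Anderson-Darling: compare the statistic against each critical value
res = scs.anderson(distr)
for crit, sig in zip(res.critical_values, res.significance_level):
    if res.statistic > crit:
        print(f"Reject normality at the {sig}% significance level")
    else:
        print(f"Cannot reject normality at the {sig}% significance level")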

Before You Go

Well, I believe we have covered a lot so far.

We talked about how to create distributions with the .rvs() method, how to use .cdf() and .pdf(), how to calculate statistics, and how to perform hypothesis tests.

In the second part of this post, we will check other functionalities of the powerful scipy.stats module.

Stay tuned.

To check the code from this post, go to this GitHub repo.

Studying/Python/scipy at master · gurezende/Studying

If you liked this content, follow me for more.

Gustavo Santos – Medium

Find me on Linkedin as well.

A Closer Look at Scipy’s Stats Module – Part 2

References

Statistical functions (scipy.stats) – SciPy v1.14.1 Manual

Monte Carlo Hypothesis Tests versus Traditional Parametric Tests

Notebook on nbviewer

boxcox – SciPy v1.14.1 Manual

