
Introduction
Data Scientists and programmers who have been in the field long enough and have chosen Python as their programming language have certainly already heard about Scipy, or Scientific Python.
Scipy is a Python package for (as it says) scientific operations and computations. Here, in this post, we will talk more specifically about the scipy.stats module, which is the one used for statistical tests in Python.
Scipy Stats
Scipy.stats is a powerful module for statistics within Scientific Python. It is a very resourceful module, bringing many methods for generating random distributions, performing statistical tests, resampling, transformations, Monte Carlo simulations, and much more.
In this first part of the post, we will explore the distributions, statistics and hypothesis tests.
Next, in a second post, we will see tools for dealing with more than one sample, resampling, and transformations.
Let’s dive into some of the main methods of the statistical module of Scipy.
Distributions
Scipy Stats has a large number of random distributions that one can replicate. After all, when modeling in statistics, we are doing nothing more than finding the most probable values within a distribution.
This module has the most famous distributions, and many others, available for the researcher to choose from. Some examples are alpha, beta, normal, Poisson, chi, cosine, exponential, uniform, and gamma.
Scipy allows us to easily replicate famous distributions, such as normal, exponential, gamma, uniform…
To simulate a distribution, all you need to do is find the one you need and use the .rvs() method. Let's see some examples.
import scipy.stats as scs
import seaborn as sns

# normal distribution
normal = scs.norm.rvs(loc=0, scale=1, size=100)
sns.kdeplot(normal);

# Poisson distribution
pois = scs.poisson.rvs(mu=2, size=500)
# plot
sns.histplot(pois);

# exponential distribution
expo = scs.expon.rvs(loc=3, scale=1, size=100, random_state=12)
# plot
sns.histplot(expo);

So we can see how easy it is to create a random distribution using scipy. The complete list of distributions available can be found in this link, in the documentation of the package.
We can also use the Cumulative Distribution Function (CDF), which gives the probability that a value falls at or below a given number, and/or the Probability Density Function (PDF), which gives the probability density around a number in a distribution.
# CDF and PDF of the standard normal, evaluated at each sampled point
scs.norm.cdf(normal)
scs.norm.pdf(normal)
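As a quick illustration of the CDF side: since the CDF accumulates probability from the left, its complement (the survival function .sf()) gives the probability of a value being more extreme (higher) than a given point. A minimal sketch, reusing the scs alias from above:
# P(Z <= 1.96) for a standard normal
scs.norm.cdf(1.96)  # ~0.975
# P(Z > 1.96), via the survival function (1 - CDF)
scs.norm.sf(1.96)   # ~0.025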
The Probability Density Function (PDF) gives us the density of the probability around a number, or how concentrated the probability is around that number. So, it is expected that the density will be high near the mean.
# Normal distribution
normal = scs.norm.rvs(loc=10, scale=5, size=50, random_state=12)
# Mean and Standard Deviation
mu = normal.mean()
sd = normal.std()
# Probability Density at the mean
scs.norm.pdf(10, loc=mu, scale=sd)
[OUT]:
0.07054625186263343
The actual probability of a number falling exactly on the mean point (10) is much lower. In fact, for a continuous distribution it is technically zero, since there are infinitely many possible values (10.01, 10.02, 10.001, etc.). What we can measure is the probability of falling within a narrow interval around it.
# Use case: what is the probability that a number falls within ±0.001 of 10?
scs.norm.cdf(10.001, loc=mu, scale=sd) - scs.norm.cdf(9.999, loc=mu, scale=sd)
[OUT]:
0.00014109250299032539
Comparing the results: the probability density at the mean is about 0.07, while the actual probability of landing in that narrow interval is close to 0.014%.
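These two numbers are consistent with each other: over a narrow interval, probability ≈ density × interval width. A quick sanity check with the values computed above:
# For a narrow interval, probability ≈ density * width
density = scs.norm.pdf(10, loc=mu, scale=sd)  # ~0.070546
width = 10.001 - 9.999                        # 0.002
density * width                               # ~0.000141, matching the CDF difference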
Statistics
Since we can create many distributions, we can also calculate their statistics easily in scipy. So, let us create a standard normal distribution to calculate some stats on it.
# Creating a normal distribution
d = scs.norm.rvs(loc=0, scale=1, size=100, random_state=12)
Now, just like in Pandas, for example, we have the .describe() method to calculate many stats at once.
# Describe
scs.describe(d)
[OUT]
DescribeResult(nobs=100, minmax=(-3.14741652154398, 2.8718193949889166),
mean=-0.14430749763868853, variance=1.1050630588641528,
skewness=0.05470178363174691, kurtosis=0.27834782892208176)
You can access the numbers individually as well if wanted:
stats_ = scs.describe(d)
stats_.mean
[OUT]:
-0.14430749763868853
stats_.variance
stats_.skewness
stats_.kurtosis
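A detail worth knowing: scs.describe() computes the sample variance (ddof=1) by default, which differs slightly from NumPy's default (ddof=0). A quick check:
import numpy as np

# describe() uses ddof=1 (sample variance) by default
np.isclose(stats_.variance, np.var(d, ddof=1))  # True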
The .describe() function is useful because it gives you many stats in a single method. Now, whenever needed, Scipy also brings a number of other summary statistics, like mode, skew, and kurtosis.
tmean is a nice method to calculate the trimmed mean: the mean of only the values within given lower and upper limits. There are trimmed versions of the min, max, variance, and standard deviation as well (tmin, tmax, tvar, tstd).
# Mean only of the values between 1 and 3 in the distribution d
scs.tmean(d, (1,3))
1.5427740357627466
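The trimmed relatives follow the same pattern. A short sketch, reusing the distribution d from above:
# Trimmed standard deviation: only values between 1 and 3 count
scs.tstd(d, (1, 3))
# Trimmed min: ignores values below the lower limit
scs.tmin(d, lowerlimit=1)
# Trimmed max: ignores values above the upper limit
scs.tmax(d, upperlimit=3)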
We can rank data using rankdata. Observe how it even handles ties: by default, tied values receive the average of their ranks.
d = [1,1,2,7,3,4,5,6]
scs.rankdata(d)
[OUT]:
array([1.5, 1.5, 3. , 8. , 4. , 5. , 6. , 7. ])
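If the average rank for ties is not what you want, rankdata also accepts a method argument ('min', 'max', 'dense', or 'ordinal'). For example:
# Ties get the lowest rank of the tied group instead of the average
scs.rankdata(d, method='min')
# ranks: [1, 1, 3, 8, 4, 5, 6, 7]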
We can find the repetitions of an array.
scs.find_repeats([1,2,3,3,4,5,6])
RepeatedResults(values=array([3.]), counts=array([2]))
Hypothesis Tests
In Data Science, hypothesis tests are a must for many occasions. We need to compare numbers and rule out the possibility that the result we're seeing is merely due to chance. Hence the utility of a hypothesis test.
Say you heard about some jobs from a company that claims it pays above the market average for the roles announced: $120k, $115k, $125k, and $138k for different positions in Data Science.
Curious to know whether that claim holds, you research salary websites and find out that the average salary for a Data Scientist in your state is around $130k. So, is the company's statement true, according to statistics?
We can easily solve that problem with a Scipy hypothesis test for one sample against the population mean.
Python"># Company salary
company = [120, 115, 125, 138]
# Market average
mkt = 130
# Hypothesis test
# Null: the company's mean salary is at most the market average
# Alternative: the company's mean salary is greater than the market average
scs.ttest_1samp(company, popmean=mkt, alternative='greater')
[OUT]
TtestResult(statistic=-1.1130623746355601, pvalue=0.8265797321607529, df=3)
The result, at a 0.05 significance level, shows a p-value of 0.82; therefore, we cannot reject the null hypothesis. There is no statistical evidence that the company pays more than the market average, so we conclude that the company's statement is not supported by the data.
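If you need the numbers programmatically, the result object exposes its fields, and in recent SciPy versions (1.10+, an assumption worth checking against your install) it can also compute a confidence interval for the sample mean:
res = scs.ttest_1samp(company, popmean=mkt, alternative='greater')
res.statistic              # -1.113...
res.pvalue                 # 0.8266...
res.confidence_interval()  # CI for the mean (SciPy >= 1.10)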
There is also binomtest, which tests the probability of success of a binomial (Bernoulli) experiment: event or non-event. Let's say you have a coffee shop and you assess that 8% of 150 customers are buying a new drink, so 12 customers. Can we test whether, in the same environment (given the same probability of success = 8%), we could see 20 customers buying the new drink? The significance level is 0.05.
# Binomial Test
successes = 20
trials = 150
p = 0.08
scs.binomtest(successes, trials, p, alternative='greater')
[OUT]
BinomTestResult(k=20, n=150, alternative='greater',
statistic=0.13333333333333333, pvalue=0.01684360036905186)
As our p-value = 0.016 is smaller than the significance level, we reject the null hypothesis and conclude that, under those conditions (a true success rate of 8%), seeing 20 out of 150 customers buy the new drink is unlikely to happen by chance.
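The binomtest result object can also produce a confidence interval for the true proportion via proportion_ci(); the sketch below assumes a SciPy version recent enough to expose it:
res = scs.binomtest(successes, trials, p, alternative='greater')
res.statistic                             # observed proportion: 20/150 ≈ 0.133
res.proportion_ci(confidence_level=0.95)  # CI for the true proportion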
To test for normality, there are many ways. In Scipy, the normaltest method compares a given sample to a normal distribution. The null hypothesis is that the sample comes from a normal distribution; if you get a p-value under your significance level, the sample is not normally distributed. Next, we can see the test applied to a normal distribution and to an exponential one. It works as expected.
# Normal Test
d_norm = scs.norm.rvs(loc=0, scale=1, size=100, random_state=12)
d_expon = scs.expon.rvs(loc=3, scale=1, size=100, random_state=12)
print("Normal:",scs.normaltest(d_norm))
print("Exponential:", scs.normaltest(d_expon))
Normal: NormaltestResult(statistic=0.8190182789113444, pvalue=0.6639760898237252)
Exponential: NormaltestResult(statistic=33.69547265654829, pvalue=4.820821820360837e-08)
Besides that, we have other famous normality tests like shapiro and anderson, as well as the Kolmogorov-Smirnov test ks_1samp for goodness of fit against another distribution.
# Shapiro (for samples up to 5000)
scs.shapiro(d_norm)
# Anderson
scs.anderson(d_norm)
# KS test vs. normal distribution
scs.ks_1samp(d_norm, scs.norm.cdf)
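One caveat: unlike shapiro and normaltest, anderson does not return a p-value. It returns the test statistic along with critical values at several significance levels, and you reject normality when the statistic exceeds the critical value. A minimal sketch of that comparison:
res = scs.anderson(d_norm)
# Reject normality at a given level if the statistic exceeds its critical value
for crit, sig in zip(res.critical_values, res.significance_level):
    print(f"{sig}% level: critical value = {crit:.3f}, reject = {res.statistic > crit}")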
Before You Go
Well, I believe we have covered a lot so far.
We talked about how to create distributions with the .rvs() method, how to use the .cdf() and .pdf() functions, how to calculate statistics, and how to perform hypothesis tests.
In the second part of this post, we will check other functionalities of the powerful scipy.stats module.
Stay tuned.
To check the code from this post, go to this GitHub repo.
If you liked this content, follow me for more.
Find me on Linkedin as well.
References
Statistical functions (scipy.stats) – SciPy v1.14.1 Manual
Monte Carlo Hypothesis Tests versus Traditional Parametric Tests