A Closer Look at Scipy’s Stats Module – Part 2

Let's learn the main methods of the scipy.stats module in Python.

Photo by Алекс Арцибашев on Unsplash

Introduction

In the last post, A Closer Look at Scipy's Stats Module – Part 1, we learned about distributions, statistics, and hypothesis tests with one sample.

Now, we will continue exploring this powerful module and check a few more complex functions available in this package.

In this post, we will learn about statistical tests comparing two samples, bootstrapping, Monte Carlo simulations, and a couple of transformations using Scipy.

Let’s go.

A Closer Look at Scipy’s Stats module – Part 1

Comparing Two Samples

Comparing two samples is a common task for data scientists. In Scipy, we can use the two independent samples t-test when we want to check whether two samples were drawn from populations with the same mean, that is, whether their averages are statistically similar.
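
The snippets in this post assume the following imports (the aliases below simply match the code used throughout the examples):

import numpy as np
import scipy.stats as scs
import matplotlib.pyplot as plt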

# Two samples test: Comparison of means

# Sample 1
samp1 = scs.norm.rvs(loc=2, scale=5, size=100, random_state=12)
# Sample 2
samp2 = scs.norm.rvs(loc=3, scale=3, size=100, random_state=12)

# Hypothesis test
scs.ttest_ind(samp1, samp2, equal_var=False)
[OUT]
TtestResult(statistic=-2.1022782237188657, pvalue=0.03679301172995361, df=198.0)

At a significance level of 0.05, this result means the null hypothesis is rejected in favor of the alternative: the samples do not have statistically equal means.

It is important to note that the t-test for comparison of means has an equal_var argument. When the samples do not have similar variances, it should be set to False.

One of the ways to detect that is the Levene test.

# Levene for equal variances test
scs.levene(samp1, samp2)

The resulting p-value rejects H0 in favor of Ha, confirming that the samples have different variances.

LeveneResult(statistic=18.17151743553794, pvalue=3.1222803039659347e-05)

We can confirm the samples' different spread by looking at the variation function, which returns the coefficient of variation. Sample 1's coefficient of variation is more than three times that of sample 2.

scs.variation(samp1), scs.variation(samp2)

[OUT]
(4.090657648137783, 1.2223438764152452)

We also have the test for two related (paired) samples, whose code is very similar.

# Hypothesis test
scs.ttest_rel(samp1, samp2)
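
Since ttest_rel expects paired measurements taken on the same subjects (before and after a treatment, for instance), here is a minimal, hypothetical sketch; the data below is made up purely for illustration:

# Hypothetical paired data: the same 30 subjects measured before and after a treatment
before = scs.norm.rvs(loc=80, scale=5, size=30, random_state=12)
after = before - scs.norm.rvs(loc=1, scale=2, size=30, random_state=42)

# ttest_rel tests whether the mean of the paired differences is zero
scs.ttest_rel(before, after)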

Similarly, there is the Kolmogorov-Smirnov comparison of two samples. The KS test compares both samples' empirical cumulative distribution functions to determine whether they match closely, helping to test if the samples could come from the same population.

# KS test for two samples
scs.ks_2samp(samp1, samp2, alternative='two-sided')

Resampling

Scipy has functions for resampling, such as Bootstrap and Monte Carlo simulations.

Bootstrap

First, bootstrap. It resamples the data with replacement, recalculates the chosen statistic many times, and returns a confidence interval for that statistic.

# Bootstrap
dist = scs.norm.rvs(loc=10, scale=1, size=300, random_state=12)

dist = (dist,)  # samples must be in a sequence, as per documentation
scs.bootstrap(dist, statistic=np.mean, confidence_level=0.95)
[OUT]
BootstrapResult(confidence_interval=ConfidenceInterval(low=9.717562154498236, 
high=9.955179688263437), 
bootstrap_distribution=array([9.84762955, 9.87137856, 9.82459403, ..., 
9.89246925, 9.81027895, 9.83134846]), standard_error=0.06040179036433725)
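
As a quick additional sketch (the parameter choices here are illustrative, not from the original example), any statistic can be bootstrapped, the median for instance, and the interval bounds can be read directly from the result object:

# Bootstrap the median and read the confidence interval bounds
res = scs.bootstrap(dist, statistic=np.median, confidence_level=0.95,
                    n_resamples=9999, method='percentile', random_state=12)
print(res.confidence_interval.low, res.confidence_interval.high)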

Monte Carlo Test

Now, monte_carlo_test is a little more complex to understand. It took me some time to study what is behind it. I left a couple of good articles in the References section in case you want to deepen your knowledge.

In a nutshell, this function compares the statistic calculated from the provided sample against the same statistic computed on many synthetic samples drawn from the distribution assumed under the null hypothesis (a normal distribution, in our example). If the observed statistic is more extreme than most of the simulated ones, the null hypothesis that the sample comes from that distribution is rejected.

Let’s see that in action.

Suppose we have a sample of 10 people weighing 70 kg on average with a standard deviation of 10 kg. Let's test, using Monte Carlo simulations (10,000 resamples), whether that sample is statistically heavier on average than the population (65 kg mean, 10 kg standard deviation) at a 95% confidence level.

# Generate a sample from a normal distribution
x = scs.norm.rvs(loc=70, scale=10, size=10).astype(np.int32)
# Generator of synthetic samples under the null hypothesis (population mean 65 kg)
rvs = lambda size: scs.norm.rvs(loc=65, scale=10, size=size).astype(np.int32)

# Monte Carlo Test
scs.monte_carlo_test(x, rvs, np.mean, n_resamples=10000, alternative='greater')
[OUT]
MonteCarloTestResult(statistic=68.3, pvalue=0.11678832116788321, null_distribution=array([60.9, 63.1, 65.1, ..., 63.2, 61.9, 67.3]))

According to our simulations, the p-value is higher than the significance level of 5%, thus we fail to reject the null hypothesis: there is not enough evidence that the sample group is heavier than the population.

We certainly could use a simple t-test for that and get a similar result, but since monte_carlo_test simulates the null distribution many times, it can be more accurate, especially for small samples.

# Sample of interest (mean 70 kg) and a sample drawn from the population (mean 65 kg)
x = scs.norm.rvs(loc=70, scale=10, size=10).astype(np.int32)
y = scs.norm.rvs(loc=65, scale=10, size=10).astype(np.int32)

# T-Test of means
scs.ttest_ind(x, y, equal_var=True)
[OUT]
TtestResult(statistic=1.688107424484386, pvalue=0.1086404147869201, df=18.0)

Transformations

Finally, let’s talk about some transformations that can be done using Scipy.

Normalizing the target variable is also a common procedure to get more accurate predictions. One way to do that is with the Box-Cox transformation, available in Scipy.

# Sample from an exponential power distribution
exp_ = scs.exponpow.rvs(b=10, loc=3, scale=1, size=300, random_state=12)

# Box Cox
boxcox_transformed = scs.boxcox(exp_)

The Box-Cox transformation is a statistical technique that applies a power transformation to the data to make it approximately normal, finding the optimal power value automatically.
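
For reference, the transformation applied is y = (x^λ - 1) / λ when λ ≠ 0 and y = ln(x) when λ = 0, where λ (lambda) is the power parameter that Scipy estimates by maximizing the log-likelihood.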

Ok. We have created an exponential distribution and then normalized it with Box Cox. Now we can plot the results.

# Create figure
fig = plt.figure(figsize=(9, 7))

# Subplot 1
ax1 = fig.add_subplot(211)
scs.probplot(exp_, plot=ax1)
ax1.set_xlabel('')
ax1.set_title('Probplot against normal distribution')

# Subplot 2
ax2 = fig.add_subplot(212)
scs.probplot(boxcox_transformed[0], plot=ax2);
ax2.set_title('Probplot after Box-Cox transformation');

The result of the above code snippet is this next plot. See how the Box-Cox transformed variable is much closer to the normality line.

Distribution normalized. Image by the author.

Since we didn't pass the lmbda argument to the Box-Cox transformation, the function computes the optimal value itself and returns it as the second item of the returned tuple.
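
As a quick sketch using the exp_ sample created above, we can unpack both items directly:

# scs.boxcox returns a tuple: (transformed data, fitted lambda)
transformed, fitted_lambda = scs.boxcox(exp_)
print(fitted_lambda)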

But we can also use boxcox_normmax to calculate only the optimal lambda value when needed.

# Calculating optimal lambda with Scipy
scs.boxcox_normmax(exp_)

Z Scores

Another transformation possible is calculating the Z-scores of the values in an array. The Z score is the equivalent value of the observation when it is transformed to a standard normal distribution (mean=0, std=1). The transformation is easy.

# Distribution
dist = [12,23,13,34,55,16,1]

# Calculating Z scores: (point_estimate - mean) / std
Z = scs.zscore(dist)
Z
[OUT]
array([-0.60825887,  0.06082589, -0.54743299,  0.72991065,  2.00725429,  -0.36495532, -1.27734364])

From that, we can use the CDF to calculate probabilities of a more extreme result than each point in the distribution.

# Use the cumulative distribution function (CDF) to find the probability
probability = scs.norm.cdf(Z)

# The CDF gives P(value <= point), so 1 - CDF is the upper-tail probability,
# i.e. the chance of drawing a number larger than each point
for p, n in zip(probability, dist):
  print(f"Probability of a number more extreme than {n}:", 1 - p)
[OUT]
Probability of a number more extreme than 12: 0.7284921038135481
Probability of a number more extreme than 23: 0.4757489366442327
Probability of a number more extreme than 13: 0.7079593510974136
Probability of a number more extreme than 34: 0.23272240120113985
Probability of a number more extreme than 55: 0.02236129701530054
Probability of a number more extreme than 16: 0.6424276224116628
Probability of a number more extreme than 1: 0.8992595228449655

As expected, the larger the number in our distribution, the lower the probability of finding an even more extreme value.
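
As a side note (this is not part of the original snippet), the same upper-tail probabilities can be obtained in a single call with the survival function, which Scipy defines as 1 - CDF:

# Survival function: probability of a value larger than each point
upper_tail = scs.norm.sf(Z)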

Before You Go

Well, I believe we have covered a lot in these two posts.

This material is a good resource for anyone willing to learn more about Scipy and add these tools to their Data Science toolbox.

Make sure to bookmark these posts and always check the documentation to understand the functions of this great package.

Again, you can access the whole code in this GitHub repo:

Studying/Python/scipy at master · gurezende/Studying

If you liked the content, don’t forget to follow me.

Gustavo Santos – Medium

Also find me on LinkedIn.

References

Statistical functions (scipy.stats) – SciPy v1.14.1 Manual

Monte Carlo Hypothesis Tests versus Traditional Parametric Tests

Notebook on nbviewer

boxcox – SciPy v1.14.1 Manual

