
An intuitive introduction to Hypothesis Testing with (almost) no maths

A super basic introduction to hypothesis testing in which we look at a few examples and develop intuition. You'll never have to relearn it!

Photo by Testalize.me on Unsplash

In this post, I am going to give a basic demonstration of hypothesis testing. I will keep it as math-free as possible and focus on giving you an overall idea of the concepts involved and on developing some intuition for hypothesis testing.

Okay, so let’s get started.

An Example

I am going to present you with a hypothesis, and the hypothesis is this: the average height of women is 169 cms.

We’ll ignore the potential differences between women of different age groups and just keep it simple. Now, although we call this a hypothesis in the everyday sense of the word, in statistics a hypothesis has a formal meaning as something that can be tested.

So let’s try and put it to the test. Assume we take a sample of 20 women and their mean height comes to 168.6 cms.

So what does this observation mean for this hypothesis? I want you to stop for a few seconds and think: how much doubt does it cast on our hypothesis?

We started by saying that the average is 169 cms, and now we have a sample mean that is slightly less than that. But does it really cast a lot of doubt on the hypothesis?

Probably not, right? We randomly selected a small sample but it is possible that the selected women were slightly shorter due to random variation. So it’s not totally outside the realm of possibility that our hypothesis is still correct.
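If you'd like to see that random variation for yourself, here is a minimal Python sketch. It assumes, purely for illustration, that heights are normally distributed around 169 cms with a standard deviation of 7 cms (a value not given in this article) and repeatedly draws samples of 20 women.

```python
import numpy as np

rng = np.random.default_rng(42)

true_mean = 169   # the mean claimed by our hypothesis (cms)
assumed_sd = 7    # assumed spread of individual heights (cms), illustrative only
n = 20            # sample size

# Draw 10 independent samples of 20 women and print each sample's mean height
for _ in range(10):
    sample = rng.normal(true_mean, assumed_sd, size=n)
    print(round(sample.mean(), 1))
```

Even though every one of these samples comes from a population whose true mean really is 169 cms, the sample means wander by a centimetre or two, so a value like 168.6 cms is entirely unremarkable.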

Taking a different sample

Now let's imagine another scenario where we yet again randomly sample 20 women, but this time the average of their heights is 161 cms. This time the average is a bit farther away from 169 than the last time.

So let me ask you again: how much doubt does this cast on our hypothesis?

Do you think the observation of an average height of 161 cms casts more doubt than the last example, even though we sampled the same number of people?

The average height is quite a bit lower than 169, so you have to start thinking: how likely is it that this sample mean is this far below the true mean just by random chance?

Although we haven’t dealt with any formulas or math yet, you have just carried out your first hypothesis test.

Let’s formalize things a little bit now. To be more specific, we call the original hypothesis the null hypothesis and represent it as H₀; in this case, the null hypothesis was that the true mean is 169 cms. Then we have this other thing called the alternative hypothesis, often represented as H₁. Under the alternative hypothesis, the true mean is not 169 cms.

Null and Alternative Hypothesis

So the question is: is our sample mean far enough away from 169 cms for us to be able to reject that original hypothesis? In the second example, you thought: maybe it is far enough away from 169, and maybe there is now quite a bit of doubt cast over that null hypothesis.

Formalizing Hypothesis Testing

Let’s consider a number line representing the possible values of the sample mean. We typically represent the sample mean with x̄ and the population mean with μ. If you took a sample and computed its mean, you would expect it to come out around 169, but you also know that due to random variation it could easily be 168 or 170.

What hypothesis testing will do is set these critical boundaries beyond which we are going to start rejecting the null hypothesis. So when our sample gets too extreme, we start doubting our null hypothesis a lot more. In our example with the sample mean at 161 cms, it could well be too far away from 169 and be in that rejection zone.

A Second Example

Now let’s try another example, another hypothesis. This time the hypothesis is that the average weight of adults in a particular town is 74 kgs.

Like before, we are gonna take a sample and put this hypothesis to the test. This time around though we take only five people to make up our sample and we find that the average weight of these 5 chosen people is only 68 kgs.

And again, we’re going to ask ourselves the same question: how much doubt does this cast on our hypothesis? So stop for a few seconds and think about it.

Now, even though the difference in weight is large, you might be thinking that this is alright since we only have 5 people in our sample. Maybe, just maybe, we selected slightly lighter people, and if we selected another 5, the sample’s average weight could go up.

Now let me pose a slightly modified scenario. This time around we sample 500 people instead of 5, but the sample mean is still 68 kgs. How much doubt does this cast on our hypothesis?

You’re going to start thinking: yeah, this is casting a lot of doubt now. The average weight hasn’t changed a bit, but what has changed is the number of observations in the sample; we now have 500 people. That means we are much more confident about the sample mean, and I think you can intuitively see how this works. Take Amazon reviews, for example: a product rated 4.8 stars from 5 ratings versus a product rated 4.8 stars from 1200 ratings. You would feel more confident about the second product.

If you had an even bigger sample and found the average weight was still 68, you’re going to start thinking: well, we’ve pretty much got the entire town in the sample now, so the true average weight is probably not 74 after all.
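Here is a small sketch of that sample-size intuition with made-up numbers: it assumes, again purely for illustration, that the town’s weights are normally distributed around 74 kgs with a standard deviation of 10 kgs, and compares how much the sample mean wanders when n = 5 versus n = 500.

```python
import numpy as np

rng = np.random.default_rng(0)

true_mean = 74    # the mean claimed by our hypothesis (kgs)
assumed_sd = 10   # assumed spread of individual weights (kgs), illustrative only

for n in (5, 500):
    # Draw 10,000 samples of size n and look at how spread out their means are
    means = rng.normal(true_mean, assumed_sd, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:3d}: sample means typically fall within "
          f"+/- {2 * means.std():.1f} kgs of {true_mean}")
```

With 5 people the sample mean routinely strays several kgs from 74; with 500 people it barely moves, which is why a mean of 68 kgs from 500 people is so much harder to explain away as random variation.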

Our null hypothesis here is that the true population mean is 74 and our alternative hypothesis is that it’s not 74.

Null and Alternative Hypothesis

So for our sample of five people, the critical values that we would draw to decide whether to reject the null hypothesis might be far away from 74. But when we have 500 people in our sample, the critical values beyond which we reject the null hypothesis are actually going to be quite close to 74 itself. So if we get a sample mean of, say, 71 or 77, maybe that is already enough evidence to reject the null hypothesis.

Larger samples widen the rejection zone

So far we haven’t done any actual calculations, but there are ways for us to calculate those exact regions and determine whether we are going to reject the null hypothesis or not.

And through those calculations, which we’ll see shortly, what we are really trying to answer is this: if the null hypothesis were true, how extreme is our sample? This is the core question that a hypothesis test tries to answer numerically.

Although at the start of the article I said we would keep this math-free, I’ll give you just one formula to help solidify the intuitions we’ve developed here.

The extremeness of a sample: Z = (x̄ - μ) / (σ / √n)

This Z is a measure of extremeness. When Z is close to zero, our sample lines up pretty much with what we would expect if the null hypothesis were true. If you look at the numerator, you can see that if the sample mean x̄ equals the hypothesized mean μ, then Z is zero. And the larger the gap between x̄ and μ, the larger the Z value, meaning our sample is more extreme and we are more likely to reject the null hypothesis. We saw this happen in the first example.

You might also notice the sample size n sitting in the denominator, inside the standard error σ/√n. We now understand how that works too: as n increases, the standard error shrinks, so the value of Z grows, meaning again that we’re more likely to reject the null. So as our sample size increases, we are more likely to reject the null unless the difference between the hypothesized mean and the sample mean is small.
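To make the formula concrete, here is a quick sketch that plugs in the weight example. The article doesn’t state the population standard deviation, so the 10 kgs below is an assumed value chosen only to show how the sample size changes Z.

```python
import math

def z_statistic(sample_mean, hypothesized_mean, population_sd, n):
    """Z = (x-bar - mu) / (sigma / sqrt(n)): how many standard errors the sample mean is from mu."""
    standard_error = population_sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error

# Same sample mean (68 kgs vs a hypothesized 74 kgs), different sample sizes
print(z_statistic(68, 74, population_sd=10, n=5))    # about -1.3: not very extreme
print(z_statistic(68, 74, population_sd=10, n=500))  # about -13.4: extremely unlikely under H0
```

The gap between the sample mean and the hypothesized mean is identical in both calls; only n differs, and that alone is enough to turn a mild Z into an extreme one.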

Based on what we saw above, we know that we’re more likely to reject H₀ when:

  1. the difference between the sample mean and the hypothesized mean is larger, as in the first example, and
  2. the number of observations is larger, as in the second example.

Hypothesis testing, in essence, involves three steps: state the null and alternative hypotheses, measure how extreme your sample is under the null, and decide whether that is extreme enough to reject the null hypothesis.

Hypothesis Testing Process
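To tie the three steps together, here is a minimal sketch of the height example written as a two-sided z-test in Python. The population standard deviation of 7 cms is, once again, an assumed value used only for illustration; the rest follows the steps above.

```python
import math
from scipy.stats import norm

# Step 1: state the hypotheses
mu_0 = 169          # H0: the true mean height is 169 cms (H1: it is not)

# Step 2: measure how extreme the sample is under H0
sample_mean = 161   # the sample mean we observed (cms)
n = 20              # sample size
sigma = 7           # assumed population standard deviation (cms), illustrative only
z = (sample_mean - mu_0) / (sigma / math.sqrt(n))

# Step 3: compare against the chosen level of significance
alpha = 0.05
p_value = 2 * norm.sf(abs(z))   # two-sided p-value: chance of a sample at least this extreme
print(f"z = {z:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

With such a small p-value, the 161 cms sample lands deep inside the rejection zone, matching the intuition from the first example.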

Type 1 and Type 2 Errors

And that brings us to our last concept for this article which is Type 1 and Type 2 errors.

Type 1 and Type 2 Errors

When we do hypothesis testing, we make our decisions on the basis of the evidence at hand, not on absolute proof. We merely state that there is enough evidence to act one way or the other. This is always true in statistics! Because of this, whatever the decision, there is always a chance that we made an error: there is always a chance of drawing an unusually extreme (or unusually typical) sample that leads us to the wrong conclusion about the null hypothesis.

Earlier, in the examples, we had rejection zones where we were going to reject the null hypothesis. A Type 1 error occurs when we reject a null hypothesis that is in fact true, and we can never remove the possibility of a Type 1 error.

Rejection Zones are arbitrary and are selected by the person doing the testing

The probability of a Type 1 error is called the level of significance, or alpha (α), and you can actually choose your level of significance: you can choose how strict you want to be with your decision to reject the null hypothesis. Often we use a level of 5%, but that is just a convention. The rejection regions we have drawn are not handed to us by nature; we decide on them based on this level of strictness, which we call the significance level (α).
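For a two-sided z-test, those boundaries come directly from the chosen α. A quick sketch:

```python
from scipy.stats import norm

# Two-sided test: split alpha evenly between the two tails
for alpha in (0.10, 0.05, 0.01):
    critical_z = norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha:.2f}  ->  reject H0 when |Z| > {critical_z:.2f}")
```

A stricter (smaller) α pushes the critical values further out, so you need a more extreme sample before rejecting H₀; that strictness is exactly what trades off against the Type 2 errors discussed next.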

A Type 2 error occurs when you do not reject a null hypothesis that is in fact false. Say the sample mean lay very close to 169 cms and we therefore chose not to reject the null hypothesis. It is of course still possible that the true population mean is different from 169, and in that case we would be committing a Type 2 error. Again, this is a possibility we cannot fully remove: there is always some probability of committing a Type 2 error when we conduct a hypothesis test. That probability is called beta (β), and one minus beta (1−β) is called the power of the hypothesis test.
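You can get a feel for β and power by simulation. The sketch below assumes the null is actually false, with a true mean of 166 cms rather than 169 cms, and an illustrative population standard deviation of 7 cms, then counts how often a sample of 20 fails to land in the rejection zone.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

mu_0, sigma, n, alpha = 169, 7, 20, 0.05   # sigma and the 166 below are illustrative assumptions
true_mean = 166                            # suppose the null hypothesis is actually false
critical_z = norm.ppf(1 - alpha / 2)

# Simulate 100,000 studies, each drawing a sample of n people from the true population
samples = rng.normal(true_mean, sigma, size=(100_000, n))
z = (samples.mean(axis=1) - mu_0) / (sigma / np.sqrt(n))

beta = np.mean(np.abs(z) <= critical_z)    # fraction of studies that fail to reject H0
print(f"beta ~= {beta:.2f}, power ~= {1 - beta:.2f}")
```

Under these made-up numbers, a sample of 20 fails to reject the false null roughly half the time; increasing n (or accepting a larger α) is how you buy more power.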

And that concludes our introduction to Hypothesis Testing.

