In scientific research, manipulating data and peeking at results have been problems for as long as the field has existed. Researchers often need a significant p-value to get published, which creates a temptation to stop data collection early or to massage the data. This practice, known as p-hacking, was the focus of my previous post. If researchers deliberately change data values or fabricate complete datasets, there is not much we can do about it. For some instances of p-hacking, however, there might be a solution available!
In this post, we dive into the topic of safe testing. Safe tests have some strong advantages over the classical way of hypothesis testing. For example, they allow you to combine results from multiple studies, and they support optional stopping: you can end the experiment at any time you like. To illustrate safe testing, we will use the R package safestats, developed by the researchers who proposed the theory. First, we will introduce e-values and explain the problem they solve. E-values are already used by companies like Netflix and Amazon because of these benefits.
I will not delve into the proofs of the theory; instead, this post takes a more practical approach, showing how you can use e-values in your own tests. For proofs and a thorough explanation of safe testing, the original paper is a good resource.
An Introduction to E-values
In hypothesis testing, which you can brush up on here, you assess whether to retain the null hypothesis or to accept the alternative. Usually, the p-value is used for this: if the p-value is smaller than the predetermined significance level alpha, you reject the null hypothesis in favor of the alternative.
E-values work differently from p-values but are related. The easiest way to interpret an e-value is as a bet against the null hypothesis: you invest $1, and your payoff is $E. If the e-value E is between 0 and 1, you have lost money, and there is no evidence against the null hypothesis. If the e-value is larger than 1, you win: the null hypothesis loses the game. A modest E of 1.1 implies limited evidence against the null, whereas a substantial E, say 1000, denotes overwhelming evidence.
Some key points about e-values to be aware of:
- An e-value can take any positive value, and you can use e-values as an alternative to p-values in hypothesis testing.
- An e-value E can be related to a traditional p-value p via p = 1/E: the quantity 1/E can be used as a (conservative) p-value. Beware: it will not give you the same result as a standard p-value, but you can interpret it like one (a small sketch of this follows the list).
- In traditional tests you have alpha, also known as the significance level, often set to 0.05. E-values work a bit differently: you can read them as a measure of evidence against the null. The higher the e-value, the more evidence against the null.
- At any point in time (!) you can stop data collection and draw a conclusion, provided you are using e-values. This works because e-values can be combined into a so-called e-process, which remains valid under optional stopping and allows you to update the statistical evidence sequentially.
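To make the relation between e-values, p-values, and the significance level concrete, here is a tiny R sketch; the e-value of 25 and the variable names are just made up for illustration:
E <- 25       # a hypothetical e-value from some experiment
alpha <- 0.05
# 1/E can be used as a (conservative) p-value
pBound <- 1 / E   # 0.04
# decision rule when using e-values like p-values:
# reject the null hypothesis if E exceeds 1/alpha
E > 1 / alpha     # TRUE: 25 > 20, so we reject the null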
Fun fact: e-values are not as ‘new’ as you might think. The first paper on them was written in 1976, although they were not called e-values at the time.

Why should I care about E-values?
That is a valid question. What is wrong with traditional p-values? Is there a need to replace them with e-values? Why learn something new if there is nothing wrong with the current way of testing?
Actually, there is something wrong with p-values. There is a ton of criticism of traditional p-values, and a group of more than 800 statisticians has even called for abandoning the concept of statistical significance altogether.
Let’s illustrate why with a classic example.
Imagine you are a junior researcher at a pharmaceutical company. You need to test the efficacy of a medicine the company developed. You recruit test candidates; half of them receive the medicine, while the other half takes a placebo. You determine in advance how many participants you need to be able to draw conclusions.
The experiment starts, and you struggle to find new participants. You are under time pressure, and your boss keeps asking, "Do you have the results for me? We want to ship this product to the market!" Because of the pressure, you decide to peek at the results and calculate the p-value, even though you haven’t reached the planned number of participants. Looking at the p-value, there are now two options:
- The p-value is not significant. This means you cannot prove that the medicine works. Obviously, you don’t share these results! You wait a bit longer, hoping the p-value will become significant…
- Yes! You find a significant p-value! But what is your next step? Do you stop the experiment? Do you continue until you reach the correct number of test candidates? Do you share the results with your boss?
After you have looked at the data once, it’s tempting to do it more often. You calculate the p-value, and sometimes it’s significant, sometimes it isn’t… It might seem innocent, but in fact you are sabotaging the process.

Why is it wrong to look at the data and the corresponding p-value a few times before the experiment has officially ended? One simple, intuitive reason: if different results would have led you to act differently (for example, stopping the experiment as soon as you find a significant p-value), you are interfering with the process.
From a theoretical perspective, you violate the Type I error guarantee. This guarantee refers to how certain you can be that you will not mistakenly reject a true null hypothesis (i.e., find a "significant" result when there is nothing to find). It’s like a promise about how often you’ll cry wolf when there’s no wolf around. The risk of this happening is at most alpha, but only for a single, pre-planned analysis! If you look at the data more often, you can no longer trust this value: the risk of a Type I error becomes higher.
This is related to the multiple comparisons problem: if you perform multiple tests of the same hypothesis, you should correct the value of alpha to keep the overall risk of a Type I error low. There are different ways of doing this, such as the Bonferroni correction, Tukey’s range test, or Scheffé’s method.
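To see how quickly repeated peeking inflates the Type I error, here is a small simulation sketch of my own (it is not part of the safestats package). It uses a plain two-sample t-test on data where the null hypothesis is true, peeks after every 20 new observations per group, and stops as soon as the p-value looks significant:
set.seed(1)
alpha <- 0.05
nSims <- 1000
falseRejections <- 0
for (i in seq_len(nSims)) {
  # both groups come from the same distribution, so the null hypothesis is true
  xa <- rnorm(200)
  xb <- rnorm(200)
  # peek after every 20 observations per group; stop at the first 'significant' p-value
  for (n in seq(20, 200, by = 20)) {
    if (t.test(xa[1:n], xb[1:n])$p.value < alpha) {
      falseRejections <- falseRejections + 1
      break
    }
  }
}
falseRejections / nSims  # clearly larger than the promised 0.05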

To summarize: p-values can be used, but it is tempting for researchers to look at the data before the planned sample size is reached. Doing so inflates the risk of a Type I error. To guarantee the quality and robustness of an experiment, e-values are the better alternative: because they remain valid under optional stopping, you don’t need to doubt these experiments (or at least doubt them less; a researcher can always decide to fabricate data 😢 ).
Benefits of using E-values
As mentioned earlier, we can use e-values in the same way as p-values. One major difference is that large e-values correspond to low p-values; recall that 1/E can be interpreted as a p-value. If you use e-values this way with a significance level of 0.05, you can reject the null hypothesis if the e-value is higher than 20 (1/0.05).
But of course, there are more use cases and benefits of e-values! If several independent experiments test the same hypothesis, we can simply multiply their e-values to obtain a new e-value that is itself valid for testing. Multiplying p-values like this is never allowed, but for e-values it works.
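As a small illustration with made-up numbers: suppose three independent studies of the same hypothesis report e-values of 3, 1.5, and 6. None of them crosses the threshold of 20 on its own, but their product does:
E1 <- 3
E2 <- 1.5
E3 <- 6
# the product of independent e-values is again a valid e-value
Ecombined <- E1 * E2 * E3   # 27
# with alpha = 0.05 we reject the null if the combined e-value exceeds 1/alpha = 20
Ecombined > 20              # TRUE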
You can also look at the data and the results during the experiment. If you want to stop the test because the results don’t look promising, that’s okay. You can equally well continue a test that does look promising.
We can also create anytime-valid confidence intervals with e-values. What does this mean? It means the confidence intervals are valid at any sample size, so during the whole experiment. They will be a bit wider than a regular confidence interval, but the good thing is that you can trust them at any time.
Usage of the safestats package
In the last part of this post, we get more practical and calculate our own e-values. For this, we use the R package safestats. To install and load it, run:
install.packages("safestats")
library(safestats)
The case we will solve is a classic one: comparing different versions of a website. If a visitor buys something, we log a success (1); if not, we log a failure (0). We show the old version of the website to 50% of the visitors (group A) and the new version to the other 50% (group B). In this use case, we will look at the different things that can happen: sometimes the null hypothesis is true (there is no difference between the versions, or the old version is better), and sometimes the alternative hypothesis is true (the new website is better).
The first step in creating a safe test is creating the design object. In this object, you specify values for alpha, beta, and delta:
designObj <- designSafeTwoProportions(
  na = 1,
  nb = 1,        # na and nb are of equal size, so a 1:1 ratio
  alpha = 0.05,  # significance level
  beta = 0.2,    # risk of a Type II error
  delta = 0.05   # minimal effect we would like to detect
)
designObj
In many cases, delta is set to a higher value. But for comparing different versions of a website with a lot of traffic, it makes sense to set it small, because it’s easy to collect many observations.
The output looks like this:
Safe Test of Two Proportions Design
na±2se, nb±2se, nBlocksPlan±2se = 1±0, 1±0, 4355±180.1204
minimal difference = 0.05
alternative = twoSided
alternative restriction = none
power: 1 - beta = 0.8
parameter: Beta hyperparameters = standard, REGRET optimal
alpha = 0.05
decision rule: e-value > 1/alpha = 20
Timestamp: 2023-11-15 10:58:37 CET
Note: Optimality of hyperparameters only verified for equal group sizes (na = nb = 1)
You can recognize the values we chose, but the package also calculated the nBlocksPlan parameter. This is the number of data blocks we need to observe, and it is based on the delta and beta parameters; with na = nb = 1, each block simply consists of one observation from group A and one from group B. Also check the decision rule, which is based on the value of alpha: if the e-value is greater than 20 (1 divided by 0.05), we reject the null hypothesis.
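Because nBlocksPlan is derived from delta and beta, a larger minimal difference leads to a much smaller planned sample. As a quick check you can rerun the design with, say, delta = 0.2 and read the planned number of blocks from the nPlan element, just like the data-generation code below does. This is only a sketch: designObjLargeDelta is an illustrative name and its output is not shown here.
designObjLargeDelta <- designSafeTwoProportions(
  na = 1,
  nb = 1,
  alpha = 0.05,
  beta = 0.2,
  delta = 0.2  # a much larger minimal difference than before
)
designObjLargeDelta[["nPlan"]]["nBlocksPlan"]  # should be far fewer blocks than the 4355 needed for delta = 0.05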
Test case: Alternative Hypothesis is True
Now, let’s generate some fake data:
set.seed(10)
successProbA <- 0.05  # success probability for A: 5%
successProbB <- 0.08  # success probability for B: 8%
nTotal <- designObj[["nPlan"]]["nBlocksPlan"]  # use the planned number of blocks as sample size
ya <- rbinom(n = nTotal, size = 1, prob = successProbA)
yb <- rbinom(n = nTotal, size = 1, prob = successProbB)

It’s time to perform our first safe test!
safe.prop.test(ya=ya, yb=yb, designObj=designObj)
With output:
Safe Test of Two Proportions
data: ya and yb. nObsA = 4355, nObsB = 4355
test: Beta hyperparameters = standard, REGRET optimal
e-value = 77658 > 1/alpha = 20 : TRUE
alternative hypothesis: true difference between proportions in group a and b is not equal to 0
design: the test was designed with alpha = 0.05
for experiments with na = 1, nb = 1, nBlocksPlan = 4355
to guarantee a power = 0.8 (beta = 0.2)
for minimal relevant difference = 0.05 (twoSided)
The e-value is equal to 77658, far above the threshold of 20, so we can reject the null hypothesis: more than enough evidence against it!
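If you want to work with the e-value programmatically instead of only reading the printed output, you can store the test result and extract its eValue element; this is the same element the simulation code later in this post uses:
result <- safe.prop.test(ya = ya, yb = yb, designObj = designObj)
result[["eValue"]]  # 77658, matching the printed output above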
A question that might arise: "Could we have stopped earlier?" That is a nice benefit of e-values: peeking at the data is allowed before the planned sample size is reached, so you can decide to stop or continue an experiment at any time. We can plot the e-values, for example cumulatively after every 50 new samples (a sketch of how to compute them follows the plots). The plot for the first 40 of these e-values:

The full plot:

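For completeness, here is a rough sketch of how such cumulative e-values can be computed: rerun the safe test on the data collected so far, adding 50 samples per group at a time. This is not necessarily the exact code used to produce the figures above, and the names checkpoints and eValues are just illustrative:
# e-value after every additional 50 samples per group
checkpoints <- seq(50, nTotal, by = 50)
eValues <- sapply(checkpoints, function(k) {
  safe.prop.test(ya = ya[1:k], yb = yb[1:k], designObj = designObj)[["eValue"]]
})
# plot the evidence as it accumulates, with the decision threshold 1/alpha = 20
plot(checkpoints, eValues, type = "l", log = "y",
     xlab = "number of samples per group", ylab = "e-value (log scale)")
abline(h = 20, lty = 2)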
Test case: Null Hypothesis is True
If we change the fake data and make the success probabilities equal to each other (0.05 for both version A and version B), we should not detect a significant e-value or p-value. The data for versions A and B come from the same distribution, and the null hypothesis is true. This is reflected in the e-values plot:

But what if we compare this with p-values? How often will we reject the null hypothesis even though, in reality, we shouldn’t? Let’s test it: we repeat the experiment 1000 times and count in how many cases the null hypothesis is rejected based on the p-value and based on the e-value.
The R code:
pValuesRejected <- c()
eValuesRejected <- c()
alpha <- 0.05
ealpha <- 1 / alpha
# repeat the experiment 1000 times, calculate the p-value and the e-value
for (i in seq(1, 1000, by = 1)) {
  # create data, use the same value of nTotal as before (4355)
  set.seed(i)
  ya <- rbinom(n = nTotal, size = 1, prob = 0.05)
  yb <- rbinom(n = nTotal, size = 1, prob = 0.05)
  # calculate the p-value, H0 rejected if it's smaller than alpha
  testresultp <- prop.test(c(sum(ya), sum(yb)), n = c(nTotal, nTotal))
  if (testresultp$p.value < alpha) {
    pValuesRejected <- c(pValuesRejected, 1)
  } else {
    pValuesRejected <- c(pValuesRejected, 0)
  }
  # calculate the e-value, H0 rejected if it's bigger than 1/alpha
  testresulte <- safe.prop.test(ya = ya, yb = yb, designObj = designObj)
  if (testresulte[["eValue"]] > ealpha) {
    eValuesRejected <- c(eValuesRejected, 1)
  } else {
    eValuesRejected <- c(eValuesRejected, 0)
  }
}
And the output if we sum pValuesRejected and eValuesRejected:
> sum(pValuesRejected)
[1] 48
> sum(eValuesRejected)
[1] 0
The p-value was significant in 48 of the 1000 cases (around 5%, which is what we would expect with an alpha of 0.05). The e-value, on the other hand, does a great job: it never rejects the null hypothesis. In case you weren’t convinced of e-values yet, I hope you are now!
If you are curious about other examples, I can recommend the vignettes of the safestats package.
Conclusion
E-values present a compelling alternative to traditional p-values, offering several advantages. They provide the flexibility to either continue or halt an experiment at any stage. Additionally, their combinability is a benefit, and the freedom to review experimental results at any point is a big plus. The comparison of p-values and e-values revealed that e-values are more reliable; p-values carry a greater risk of falsely identifying significant differences when none exist. The safestats R package is a useful tool for implementing these robust tests.
I am convinced of the merits of e-values and look forward to the development of a Python package that supports their implementation! 😄