  • Adjusting standard errors for multiple hypothesis testing

    Hi Statalisters,

    I need your help once again, probably for the last time.

    I am running a fixed effects estimation on a panel dataset from China. I run a full regression, and then separately some regressions with interaction terms, and some where I limit the sample to certain individuals. It looks somewhat like this:

    Code:
    xtreg dep control1 control2 control3 control4 control5, fe vce(cluster state)
    xtreg dep control1 control2 control3 control4 control5, fe vce(cluster state)   // on a subsample
    xtreg dep control1 control2 control3, fe vce(cluster state)
    xtreg dep control1 control2 control3 control4 control5 c.control3#c.control4, fe vce(cluster state)   // interaction via factor-variable syntax
    Now my supervisor has flagged that I should adjust the standard errors for multiple hypothesis testing. I have found this thread on the topic (https://www.statalist.org/forums/for...thesis-testing), but beyond that there is very little information available. Could anyone point me in the direction of how to implement this in Stata?

    Many thanks in advance,
    Andreas

  • #2
    I have never heard of adjusting standard errors for multiple hypothesis tests, nor does the idea make any sense to me. While standard errors are often important ingredients in test statistics, they are not themselves test statistics: they are estimates of the sampling variation in whatever statistic they are the standard error of. They are not tests, neither singly nor multiply.

    When you do multiple t-tests, say, and you adjust the p-value for multiple testing, what you are doing is ensuring that the probability of a Type I error for the entire family of tests is limited to the unadjusted p-value. If you believe in null hypothesis significance testing, this makes sense in that framework. But there is no analogous statement one could make about standard errors. The sampling error associated with a coefficient or other statistic has nothing to do with how many other statistics you are estimating in the model. There is nothing to correct for.
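
    To see the size of the problem, consider a quick illustration (assuming, for simplicity, that the tests are independent): with 15 tests each run at the 0.05 level, the chance of at least one spurious "significant" result is better than a coin flip.

    Code:
    * Family-wise probability of at least one Type I error across k
    * independent tests, each conducted at the 0.05 level
    display "k = 1:  " 1 - (1 - 0.05)^1
    display "k = 5:  " 1 - (1 - 0.05)^5
    display "k = 15: " 1 - (1 - 0.05)^15   // roughly 0.54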

    There are various ways of adjusting p-values for multiple hypothesis testing--and the link you provide shows the simplest approach to that in Stata.



    • #3
      Originally posted by Clyde Schechter
      I have never heard of adjusting standard errors for multiple hypothesis tests, nor does the idea make any sense to me. [...]
      Hi Clyde, many thanks for your answer! I reckon my supervisor meant adjusting p-values for multiple hypothesis testing, and just phrased it wrong.
      What I do now is the following. I run the first regression:

      Code:
      xtreg dep controls1-controls15, fe vce(cluster state)

      and then run the following:

      Code:
      test controls1-controls15, mtest(bonferroni)

      and then get the following output:
      
      
            |    F(df,86)     df       p
      -------+-------------------------------
        (1)  |        8.76      1     0.0597 #
        (2)  |        0.15      1     1.0000 #
        (3)  |        0.00      1     1.0000 #
        (4)  |        0.00      1     1.0000 #
        (5)  |        0.16      1     1.0000 #
        (6)  |        0.00      1     1.0000 #
        (7)  |        3.01      1     1.0000 #
        (8)  |        2.44      1     1.0000 #
        (9)  |        1.77      1     1.0000 #
        (10) |        1.30      1     1.0000 #
        (11) |        0.05      1     1.0000 #
        (12) |        0.05      1     1.0000 #
        (13) |        0.19      1     1.0000 #
        (14) |        2.68      1     1.0000 #
        (15) |        6.42      1     0.1965 #
      -------+-------------------------------
        all  |        4.91     15     0.0000
      ---------------------------------------
               # Bonferroni-adjusted p-values
      Am I right to interpret the findings above as saying that variable (1) (the one I am interested in) is significant at the 10% level using Bonferroni-adjusted p-values? (Before, it was significant at the 5% level.)

      Consequently, I would do this for all individual regressions, right?

      Many thanks again for your help Clyde, you are really helping me out!

      Best,
      Andreas
      Last edited by Andreas Baltin; 23 May 2019, 04:48.



      • #4
        Edit: Or should I simply do a Bonferroni correction as follows: 'To get the Bonferroni corrected/adjusted p-value, divide the original α-value by the number of analyses on the dependent variable.'



        • #5
          There are two ways of doing a Bonferroni correction. Properly done, they are equivalent.

          The way Stata has done it is to multiply the p-value shown in the regression output by the number of tests done (and then replace that by 1.0 if the result is larger than 1) and report that as a p-value. So, yes, the output shown in #3 would allow you to say that the association of variable (1) to your outcome is significant at the 0.10 level, but not at the 0.05 level.

          The other way to do a Bonferroni correction is to first decide on your experiment-wide alpha. Let's say we are using the conventional .05 level. In order to declare a result significant at the 0.05 experiment-wide level, we need to find, not p < 0.05 in the regression output, but p < 0.05/#_of_tests. So, without doing the -mtest()- thing, you could say that you have a statistically significant result whenever the p-value in the regression output is less than 0.003333...

          These amount to the same thing.
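
          A minimal sketch of the arithmetic (the numbers here are placeholders, not your actual output):

          Code:
          * k tests; p_raw is a hypothetical unadjusted p-value from the output
          local k = 15
          local p_raw = 0.004
          * Way 1: Bonferroni-adjust the p-value, as -mtest(bonferroni)- does,
          * and compare it to 0.05
          display "adjusted p-value   = " min(`p_raw' * `k', 1)
          * Way 2: leave the p-value alone and shrink the threshold instead
          display "adjusted threshold = " 0.05/`k'
          * p_raw*k < 0.05 exactly when p_raw < 0.05/k, so the decisions agree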



          • #6
            Originally posted by Clyde Schechter
            There are two ways of doing a Bonferroni correction. Properly done, they are equivalent. [...]
            Hi Clyde,
            many thanks for clearing things up. I have followed both approaches and I am able to confirm what you have outlined.

            What I do not fully understand: why exactly do I have to do a Bonferroni correction? I am simply running a regression with 15 controls, that is, a standard multivariate (panel) regression. Why do I have to use this? I have read countless papers and have almost never seen a Bonferroni correction mentioned.

            Are you able to shed light on this?

            Many thanks in advance,
            Andreas
            Last edited by Andreas Baltin; 23 May 2019, 10:20.



            • #7
              You probably don't really want to hear what I have to say about this, but I'll say it anyway.

              [RANT]

              Let's start from the perspective I have long held, which has recently been endorsed by the American Statistical Association (see https://www.tandfonline.com/doi/full...2019.1583913): statistical significance is a deeply flawed concept and should not be used. So from my perspective, with or without Bonferroni correction, I am not interested in testing null hypotheses. The issues raised by multiple hypothesis tests are only a few of the many reasons why.

              Putting that aside, if you still want to live in that world of null hypothesis significance testing, the problem is that if you have a data set consisting of nothing but independent random numbers and you run a large number of hypothesis tests on it, in the long run 5% of those will turn out to be "statistically significant" at the 0.05 level. That is, in fact, the definition of the significance level: it is the probability that the test statistic will exceed the critical value when the null hypothesis is true. So you can basically generate "significant results" by just doing more and more tests until something turns up. This is one of the reasons that there is a crisis of irreproducibility in scientific research: many authors do precisely this. They beat their data with test after test until they find a "significant" result and then publish that one. The Bonferroni correction is a patch on the flawed concept of statistical significance that makes it harder to cheat in this way. The more tests you do, the stricter the criterion for significance becomes. That is the rationale for it.
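
              If you want to see this in action, here is a small simulation sketch (all names and numbers are purely illustrative): the outcome and all 20 predictors are independent noise, so every null hypothesis is true, yet on average one of the 20 tests will come out "significant" at the 0.05 level.

              Code:
              * Pure noise: y and x1-x20 are independent random draws
              clear all
              set seed 2019
              set obs 500
              generate y = rnormal()
              forvalues j = 1/20 {
                  generate x`j' = rnormal()
              }
              regress y x1-x20
              * Count the coefficients that are "significant" at the 0.05 level
              local hits = 0
              forvalues j = 1/20 {
                  quietly test x`j'
                  if r(p) < 0.05 local hits = `hits' + 1
              }
              display `hits' " of 20 pure-noise predictors significant at the 0.05 level"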

              Now, there is another issue in the context of your post #1. In that post, you refer to these variables as "controls." Again, staying within the discredited framework of null hypothesis significance testing, if these variables are not the actual variables of interest in your model but are just included to adjust for ("control" is an often-used abuse of language for this) their potential confounding effects, then you shouldn't be doing significance tests on them anyway, Bonferroni corrected or not. That's because the statistical significance, or even the continuous p-value, for such a variable tells you nothing whatsoever about whether that variable's effects need to be taken into account in order to get an unbiased estimate of the effects of the actual variables of interest of the study. They are useless for that purpose, and Bonferroni correcting them doesn't make them any less useless.

              As for why you rarely see Bonferroni corrections in the literature, there are several reasons. Some authors are deliberately avoiding acknowledging that they have just mined all the noise in the data until they came up with a, probably phony, statistically significant result. They are a minority. A somewhat larger group fail to acknowledge it not out of deceit but because they aren't even aware that there is a problem. Another group may mention specifically that they have not Bonferroni corrected their p-values and give some reason why. I am sometimes in this last group. My reason for disliking the Bonferroni correction, even when I'm working in the null hypothesis significance testing framework, is that it is often quite unclear just how many tests have been done. For example, in your post you refer to 15 variables. But did you also do some other regressions and test some variables there? Maybe you did some others, but plan to publish only these results and not the others. So is the correction for 15 tests or for the larger number? What if somebody else is also working on this data and has done a bunch of tests? Should you count their tests as well? If so, what do you do if you don't actually know how many were done?

              In fairness to Bonferroni, the original context in which he developed his correction is one in which it actually is clearly defined and makes sense. It was originally used in analysis of variance where you had a multi-level categorical predictor variable. Then people would do the omnibus F test for the categorical variable as a whole. But they would also be interested in doing comparisons among the different levels of that variable. So, for example, you find an overall significant effect for "color" but you then want to specifically contrast red with blue and purple with orange and green with yellow. That's three tests. And in this setting the Bonferroni correction is quite sensible (although it is not the only sensible approach).
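
              In Stata, that classical setting looks something like this, with a hypothetical outcome y and categorical predictor color:

              Code:
              * Omnibus F test plus Bonferroni-adjusted pairwise comparisons of
              * the levels of color -- the setting the correction was designed for
              oneway y color, bonferroni
              * The same idea after fitting the model with -anova-:
              anova y color
              pwcompare color, mcompare(bonferroni) effects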

              [/RANT]



              • #8
                Hi Clyde,

                I actually enjoyed your rant; it was (even for me) easy to understand and sounded logical. Being a student, I am sadly unable to form an (informed) reply. Nonetheless, what I can say is that being on this forum has somewhat 'opened' my eyes. We get all these things taught in class as 'given', i.e. they explain these approaches and outline in which settings they should be employed. Not a word is said about any ongoing debate over these techniques. Before I came to this forum, I thought that basically all of these techniques were 'accepted and proven', as this is more or less what we get taught (not explicitly, but the style of teaching suggests it). Also, do you mind reposting your link to the American Statistical Association piece? It is not working for me.

                Back to my specific paper. I sadly have to stick to this world, as otherwise my professors would probably slaughter me (whether that is justified or not). What I do now is the following:

                1. I run all 4 regressions as they are and interpret the results.
                2. I run all 4 regressions and carry out Bonferroni adjustments for each individual regression. The variable I am interested in is still significant at the 10% level (before, at the 1% level).
                3. I then carry out Benjamini's FDR correction, which gives me the same significance levels as the Bonferroni.

                I then say that the results from 2 and 3 should be treated with caution, as they are very conservative and because of the ongoing debate. I conclude that I feel reassured in my results, as they are significant under all of 1, 2, and 3.
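                For 3, since -test-'s mtest() option has no FDR choice, I computed the Benjamini-Hochberg adjustment by hand along these lines (a rough sketch: I first collected the 15 unadjusted p-values into a variable p in a small 15-observation dataset; the variable names are my own):

                Code:
                * Benjamini-Hochberg step-up adjustment of the p-values stored in p
                sort p
                generate rank = _n
                generate p_bh = min(p * _N / rank, 1)
                * adjusted p-values must be non-decreasing in p, so take a running
                * minimum starting from the largest p-value
                gsort -rank
                replace p_bh = min(p_bh, p_bh[_n-1]) if _n > 1
                sort rank
                list rank p p_bh

                (I believe user-written commands from SSC, such as -qqvalue-, automate this, but doing it by hand let me check the mechanics.)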
                Do you think this is a sensible solution to the issue?
                Last edited by Andreas Baltin; 23 May 2019, 12:10.



                • #9
                  Not sure why the link didn't work. Here it is again: https://www.tandfonline.com/doi/full...5.2019.1583913.

                  Yes, your plan in 1, 2, and 3 seems quite reasonable to me.



                  • #10
                    Sorry to hijack an old thread, but this is very interesting!

                    Is there a way to do this with indicator variables though?

                    Taking the above example:
                    Code:
                    xtreg dep controls1-controls15, fe vce(cluster state)

                    and then running:

                    Code:
                    test controls1-controls15, mtest(bonferroni)

                    would produce

                    Code:
                    i:  operator invalid

                    if any of the controls had an i. prefix in the initial analysis.
                    Is there a way to solve this problem?
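                    The only workaround I have found so far is to spell out the individual level coefficients, which -test- does accept (a sketch, with a hypothetical three-level variable region standing in for one of the controls):

                    Code:
                    xtreg dep control1 control2 i.region, fe vce(cluster state)
                    * -test- rejects the bare i. prefix, but the level coefficients
                    * can be listed explicitly (level 1 is the omitted base here)
                    test control1 control2 2.region 3.region, mtest(bonferroni)

                    But perhaps there is something cleaner.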

                    All the best,

                    John



                    • #11
                      Hi Clyde,

                      I trust all is well.

                      Can you please tell me how to carry out Benjamini's FDR correction in Stata (based on the comment in #8)? Thank you.

                      Best,
                      Toya
