  • Trying to run a t-test or a non-parametric rank-sum test depending on the results of the Shapiro-Wilk test for normality

    Hi,

    I am trying to run a t-test or a non-parametric rank-sum test, depending on the results of the Shapiro-Wilk test for normality.

    I am going to simplify my data set somewhat. Let's say the population is split into two groups by gender (0 = male and 1 = female), and Cortisol is measured as a continuous variable.

    So far I have:
    Code:
    bys gender: swilk Cortisol
    ttest Cortisol, by(gender) unequal
    ranksum Cortisol, by(gender)
    If the p-value for either gender category is < 0.05, then the ranksum command should be run. I can use
    Code:
    return list
    to pull out the p-value, but it won't pull it out for one group or the other; the output is effectively gibberish.
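    From experimenting, it seems that r() holds results only from the most recent command, so after the bys prefix return list shows only the last group's results. A minimal illustration:

    Code:
    bys gender: swilk Cortisol
    return list                   // r(p) etc. refer only to the last group run
    swilk Cortisol if gender == 0
    display r(p)                  // running swilk per group makes r(p) unambiguous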

    Essentially I am trying to do what post #3 here describes, but for the Shapiro-Wilk test instead of Levene's test. https://www.statalist.org/forums/for...ttest-to-excel

    Thank you for any insights!

  • #2
    This conditional testing strategy is not statistically valid and will inflate overall type 1 error. Furthermore, they don’t test the same hypothesis. As cortisol is a quantitative value, you will be fine to run a two-sample t-test without assuming equal variances. Equivalently, use regress with group as a single factor variable and robust variance estimates.
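    For instance, a minimal sketch with the poster's variable names, where vce(robust) requests the robust variance estimates:

    Code:
    * two-group comparison of means as a regression with robust standard errors
    regress Cortisol i.gender, vce(robust)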



    • #3
      Formal tests for normality divide the statistical world, and many experienced people don't use them much or at all. One reason is that when a distribution of data isn't normal and that matters, you usually know that and use some other procedure anyway, one that doesn't assume normality.

      A more subtle reason is that normality tests don't and can't answer the question you want answered, which is a loose one: is this distribution far enough from normal, in a way that will undermine what I am doing? A test like Shapiro-Wilk can give misleading answers, most commonly by (1) failing to detect possibly important departures in a small sample or (2) detecting, in moderate or large samples, even slight departures from normality that are too trivial to bite. The intermediate zone seems very narrow!

      The Shapiro-Wilk test doesn't try to answer a key question either: if a normal distribution is supposedly ideal and you don't have one, what else should or might you do?

      If I am seriously interested in any t test comparing the means of two groups, I typically

      * Plot the data in any case to size up what is going on (checking not just for unwelcome departures from normality, but also for outliers, etc.)

      * Measure skewness and kurtosis (and feel free to go beyond the moment-based measures, which aren't the only possibilities, or even the most useful)

      * Carry out some sensitivity analyses: would I get similar answers doing something a bit different and as or more appropriate?

      Here is the comparison of mean mpg for domestic and foreign cars in the auto dataset examined in excruciating detail. Official command qnorm stops at single batches so I used qplot from the Stata Journal. I also used stripplot from SSC. The Tufte flavour of box plot has a marker symbol at the median and whiskers from quartiles to extremes. As there is a quantile plot alongside it is easy to see any other details you might care about. The horizontal reference lines by default show means.

      moments from SSC is a convenience wrapper for summarize. Recall that a normal (Gaussian) has skewness 0 and kurtosis 3.

      lmoments from SSC calculates L-moments and related measures. Know that t_3 is a scaled measure of skewness that is 0 for any symmetric distribution and t_4 is a scaled measure of tail weight that is near enough 0.123 for a normal.

      So each distribution is right-skewed and fatter-tailed than normal. Does that matter? The t-test yields a satisfying P-value for those who live and die by P-values.

      Two comparisons that spring to mind here are

      1. log scale -- because it's common experience that comparisons of positive and positively skewed outcomes are easier on logarithmic scale

      2. reciprocal scale -- because specifically gallons / mile or some multiple makes as much or more engineering (substantive) sense as miles / gallon (indeed using units such as litres / km is standard in many places).

      Naturally, you could transform accordingly and carry out the tests or -- and I tend to prefer this -- recast the problem as one of experimenting with different links in a generalized linear model. Here the results are even more clearcut, with my substantive preference for working on reciprocal scale.

      Code:
      sysuse auto, clear 
      
      moments mpg, by(foreign)
      
      lmoments mpg, by(foreign) short 
      
      qplot mpg, over(foreign) legend(order(2 1)) trscale(invnormal(@)) xla(-2/2) xtitle(standard normal deviate) aspect(1) name(G1, replace)
      
      stripplot mpg, over(foreign) height(0.6) vertical refline tufte cumul cumprob xla(, ang(-0.001) tlength(*2) tlc(bg)) name(G2, replace)
      
      ttest mpg, by(foreign) unequal 
      
      quietly glm mpg i.foreign, link(log)
      
      mat li r(table)
      
      quietly glm mpg i.foreign, link(power -1)
      
      mat li r(table)
      qplot

      [Image: mpg_qnorm.png]

      stripplot

      [Image: mpg_stripplot.png]

      Numeric results:

      Code:
      . sysuse auto, clear 
      (1978 automobile data)
      
      . moments mpg, by(foreign)
      
      ----------------------------------------------------------------------
          Group |          n        mean          SD    skewness    kurtosis
      ----------+-----------------------------------------------------------
       Domestic |         52      19.827       4.743       0.771       3.441
        Foreign |         22      24.773       6.611       0.657       3.107
      ----------------------------------------------------------------------
      
      . lmoments mpg, by(foreign) short 
      
      ----------------------------------------------------------------------
          Group |          n         l_1         l_2         t_3         t_4
      ----------+-----------------------------------------------------------
       Domestic |         52      19.827       2.631       0.153       0.164
        Foreign |         22      24.773       3.721       0.140       0.189
      ----------------------------------------------------------------------
      
      . ttest mpg, by(foreign) unequal 
      
      Two-sample t test with unequal variances
      ------------------------------------------------------------------------------
         Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
      ---------+--------------------------------------------------------------------
      Domestic |      52    19.82692     .657777    4.743297    18.50638    21.14747
       Foreign |      22    24.77273     1.40951    6.611187    21.84149    27.70396
      ---------+--------------------------------------------------------------------
      Combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
      ---------+--------------------------------------------------------------------
          diff |           -4.945804    1.555438               -8.120053   -1.771556
      ------------------------------------------------------------------------------
          diff = mean(Domestic) - mean(Foreign)                         t =  -3.1797
      H0: diff = 0                     Satterthwaite's degrees of freedom =  30.5463
      
          Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
       Pr(T < t) = 0.0017         Pr(|T| > |t|) = 0.0034          Pr(T > t) = 0.9983
      
      . quietly glm mpg i.foreign, link(log)
      
      . mat li r(table)
      
      r(table)[9,3]
                    mpg:       mpg:       mpg:
                     0b.         1.           
                foreign    foreign      _cons
           b          0  .22270258  2.9870408
          se          .  .05939597   .0374601
           z          .  3.7494563  79.739259
      pvalue          .  .00017722          0
          ll          .  .10628862  2.9136203
          ul          .  .33911653  3.0604612
          df          .          .          .
        crit   1.959964   1.959964   1.959964
       eform          0          0          0
      
      . 
      . quietly glm mpg i.foreign, link(power -1)
      
      . mat li r(table)
      
      r(table)[9,3]
                     mpg:        mpg:        mpg:
                      0b.          1.            
                 foreign     foreign       _cons
           b           0   -.0100695   .05043647
          se           .   .00265174   .00188936
           z           .  -3.7973187   26.695069
      pvalue           .   .00014627   5.37e-157
          ll           .  -.01526681    .0467334
          ul           .  -.00487218   .05413954
          df           .           .           .
        crit    1.959964    1.959964    1.959964
       eform           0           0           0
      Many readers will know that I am omitting yet further possibilities and raising further questions. I want to comment on a purist view that a researcher should specify a single procedure in advance and then carry it out, not try different things and select according to what makes most sense. Cue indignation or outrage at cherry-picking, P-hacking and the like (which is not to deny that these can be genuine issues).

      1. Many papers pretend that this was done. Whether it's plausible is sometimes in doubt.

      2. Congratulations if you're smart enough to know in advance of looking at the data exactly how they should be analysed. Sometimes the reality is that people are using some procedure that is standard or traditional in some field -- or they are just copying some papers they've read. Ideally a choice grows out of experience with similar data and knowing what works well.

      3. My small philosophy is that you can learn from data how they are best analysed. What's yours? In any case, subject to reviewers, examiners, etc. there should often be scope to explain at least briefly what was tried and found less than ideal.



      • #4
        Originally posted by Leonardo Guizzetti View Post
        This conditional testing strategy is not statistically valid and will inflate overall type 1 error. Furthermore, they don’t test the same hypothesis. As cortisol is a quantitative value, you will be fine to run a two-sample t-test without assuming equal variances. Equivalently, use regress with group as a single factor variable and robust variance estimates.
        I am not sure how this would inflate or lead to more type 1 error. If anything, quite the opposite. I am using the test of normality to see if this continuous variable, cortisol, within each group is normally distributed or not. If it is not normally distributed, I will use a non-normal comparison test, which if anything decreases my type 1 error.

        Regardless, do you have any insight into how I can automate this check with the suggested code?



        • #5
          Originally posted by Nick Cox View Post
          Formal tests for normality divide the statistical world, and many experienced people don't use them much or at all. [...]
          Thanks, this is a great in-depth response! I have a few questions:
          • In your example, does output from the moments command demonstrate a normal distribution? How close does the skewness need to be to zero?
          • Could you explain what the code here is doing?
            Code:
            . quietly glm mpg i.foreign, link(log)
            . mat li r(table)
            . quietly glm mpg i.foreign, link(power -1)
            . mat li r(table)
          I am not familiar with quietly or mat li. I do, however, often use r(table).



          • #6
            Although I agree with Leonardo and Nick, sometimes it is a dilemma to deal with reviewers, supervisors and the stubborn herd effect in several areas.

            Code:
            clear
            set obs 100
            generate gender = round(runiform())
            generate Cortisol = rbeta(1,1)
            
            * Shapiro-Wilk within each group; flag 1 if p < 0.05
            swilk Cortisol if gender==1
            local p1 = cond(r(p) < 0.05, 1, 0)
            swilk Cortisol if gender==0
            local p2 = cond(r(p) < 0.05, 1, 0)
            
            * if either group looks non-normal, run the rank-sum test; otherwise the t-test
            if (`p1' + `p2') > 0 {
                ranksum Cortisol, by(gender)
            }
            else {
                ttest Cortisol, by(gender) unequal
            }
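            For what it's worth, the same decision rule can be written to keep and report the actual p-values rather than 0/1 flags (a sketch along the same lines):

            Code:
            quietly swilk Cortisol if gender==1
            local p_f = r(p)
            quietly swilk Cortisol if gender==0
            local p_m = r(p)
            display "Shapiro-Wilk p-values: female = `p_f', male = `p_m'"
            
            if min(`p_f', `p_m') < 0.05 {
                ranksum Cortisol, by(gender)
            }
            else {
                ttest Cortisol, by(gender) unequal
            }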



            • #7
              Originally posted by Tiago Pereira View Post
              Although I agree with Leonardo and Nick, sometimes it is a dilemma to deal with reviewers, supervisors and the stubborn herd effect in several areas.

              [...]
              Thank you Tiago! This worked.

              Indeed, you are right with your assessment as to why I need to do this. This is the standard in my field - for better or worse.



              • #8
                Originally posted by Jay Gold View Post
                Indeed, you are right with your assessment as to why I need to do this. This is the standard in my field - for better or worse.
                I can sympathize with the peer pressure to pursue this type of tactic. However, it's not well founded. If you are interested in tests of means, do that (with or without transformations that are reasonable or customary for skewness). T-tests and robust regression are very robust to departures from normality when testing equality of means. For biomarker concentration data, natural logs or square roots are commonplace enough. However, a ranksum test is not a non-normal test of means, nor even of medians, a very common misconception present even in textbooks. It simply tells you whether values in one group tend to be higher than values in the other, with no reference to any underlying distribution. Either testing approach is fine as long as you are transparent about what it is you are most interested in testing.
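                For example, the log-scale route might look like this (a sketch; it assumes Cortisol is strictly positive, and lnCortisol is just an illustrative name):

                Code:
                generate lnCortisol = ln(Cortisol)
                ttest lnCortisol, by(gender) unequal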



                • #9
                  Originally posted by Jay Gold View Post
                  I am not sure how this would inflate or lead to more type 1 error. If anything, quite the opposite. I am using the test of normality to see if this continuous variable, cortisol, within each group is normally distributed or not. If it is not normally distributed, I will use a non-normal comparison test, which if anything decreases my type 1 error.
                  From a theoretical perspective, this is a conditional testing strategy, and as such, the p-value at the end of the chosen procedure is also conditional on the first test being significant (or not) at some threshold alpha. Said another way, it is a multiple comparison procedure. The absolute best-case scenario you could hope for is that it does not increase type 1 error. It will not lead to a smaller type 1 error, though.

                  I was reminded of this paper, which discusses some theory and also examines the basic strategy in simulation. Their conclusions show that in this case the inflation is benign. However, the strategy is wrong from a formal perspective. In any case, the key question here is not a distributional one (normal versus not), but whether the test addresses the scientific question of interest (are means of interest, or something else?).
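                  If you want to see the behaviour of the strategy under the null for yourself, a rough simulation sketch (my own construction; the sample sizes, seed, replications, and the program name condtest are all arbitrary):

                  Code:
                  capture program drop condtest
                  program define condtest, rclass
                      clear
                      set obs 60
                      generate group = _n > 30          // two groups of 30
                      generate y = rnormal()            // H0 true: same normal distribution
                      quietly swilk y if group == 0
                      local flag = (r(p) < 0.05)
                      quietly swilk y if group == 1
                      local flag = `flag' + (r(p) < 0.05)
                      if `flag' > 0 {
                          quietly ranksum y, by(group)
                          return scalar p = 2 * normal(-abs(r(z)))
                      }
                      else {
                          quietly ttest y, by(group)
                          return scalar p = r(p)
                      }
                  end
                  
                  simulate p = r(p), reps(1000) seed(123): condtest
                  count if p < 0.05                     // compare with the nominal 50 of 1000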



                  • #10
                    In #5 there were two questions:

                    In your example, does output from the moments command demonstrate a normal distribution? How close does the skewness need to be to zero?
                    That's a testing question that moments deliberately avoids. In preferring measurement to testing, I am like someone who says that a person being 1.85 m tall is more information than their being reported as tall. It takes a bit of experience to know that skewness of 0.8 and kurtosis of 3.4 are not a big deal. That experience is boosted by looking at graphs and by checking that you get similar answers about the means being different even when you ask in different ways.

                    Could you explain what the code here is doing?

                    . quietly glm mpg i.foreign, link(log)
                    . mat li r(table)
                    Take off the quietly to see that this is a generalized linear approach to comparing the means. A comparison between two group means is equivalent to fitting a regression model with a (0, 1) predictor. The log link means that fitting is done on the log scale. The z statistic associated with the predictor is analogous to the t statistic in the conventional test of that name.
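                    If it helps, glm's eform option reports the exponentiated coefficient, which under the log link is the ratio of the two group means:

                    Code:
                    glm mpg i.foreign, link(log) eform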

                    I would want to add that bootstrapping the difference between means is a good way to push the comparison forward.
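                    One way to do that, as a sketch (the number of replications and the seed are arbitrary):

                    Code:
                    sysuse auto, clear
                    * r(mu_1) and r(mu_2) are the group means stored by ttest
                    bootstrap diff = (r(mu_2) - r(mu_1)), reps(1000) seed(2803): ttest mpg, by(foreign) unequal
                    estat bootstrap, percentile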
