  • Trying to run a t-test or a non-parametric rank-sum test depending on the results of the Shapiro-Wilk test for normality

    Hi,

    I am trying to run a t-test or a non-parametric rank-sum test, depending on the results of the Shapiro-Wilk test for normality.

    I am going to simplify my data set somewhat. Let's say the population is split into two groups by gender (0 = male and 1 = female), and Cortisol is measured as a continuous variable.

    So far I have:
    Code:
    bys gender: swilk Cortisol
    ttest Cortisol, by(gender) unequal
    ranksum Cortisol, by(gender)
    If the p-value for either gender category is < 0.05, then the ranksum command should be run. I can use
    Code:
    return list
    to pull out the p-value, but it won't pull it out for one group or the other; the output is effectively gibberish.
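    From experimenting, it seems that r() holds results only from the most recent command, so after the bys prefix return list shows only the last group's results. A minimal illustration:

    Code:
    bys gender: swilk Cortisol
    return list                   // r(p) etc. refer only to the last group run
    swilk Cortisol if gender == 0
    display r(p)                  // running swilk per group makes r(p) unambiguous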

    Essentially I am trying to do what post #3 here describes, but for the Shapiro-Wilk test instead of Levene's test. https://www.statalist.org/forums/for...ttest-to-excel

    Thank you for any insights!

  • #2
    This conditional testing strategy is not statistically valid and will inflate overall type 1 error. Furthermore, they don’t test the same hypothesis. As cortisol is a quantitative value, you will be fine to run a two-sample t-test without assuming equal variances. Equivalently, use regress with group as a single factor variable and robust variance estimates.
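    For instance, a minimal sketch with the poster's variable names, where vce(robust) requests the robust variance estimates:

    Code:
    * two-group comparison of means as a regression with robust standard errors
    regress Cortisol i.gender, vce(robust)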



    • #3
      Formal tests for normality divide the statistical world, and many experienced people don't use them much or at all. One reason is that when a distribution of data isn't normal and that matters, you usually know that and use some other procedure anyway, one that doesn't assume normality.

      A more subtle reason is that normality tests don't and can't answer the question you want answered, which is a loose one: is this distribution far enough from normal, in a way that will undermine what I am doing? A test like Shapiro-Wilk can give misleading answers, most commonly by (1) failing to detect possibly important departures in a small sample or (2) detecting, in moderate or large samples, even slight departures from normality that are too trivial to bite. The intermediate zone seems very narrow!

      The Shapiro-Wilk test doesn't try to answer a key question either: if a normal distribution is supposedly ideal and you don't have one, what else should or might you do?

      If I am seriously interested in any t test comparing the means of two groups, I typically

      * Plot the data in any case to size up what is going on (checking not just for unwelcome departures from normality, but also for outliers, etc.)

      * Measure skewness and kurtosis (and feel free to go beyond the moment-based measures, which aren't the only possibilities, or even the most useful)

      * Carry out some sensitivity analyses: would I get similar answers doing something a bit different and as or more appropriate?

      Here is the comparison of mean mpg for domestic and foreign cars in the auto dataset examined in excruciating detail. Official command qnorm stops at single batches so I used qplot from the Stata Journal. I also used stripplot from SSC. The Tufte flavour of box plot has a marker symbol at the median and whiskers from quartiles to extremes. As there is a quantile plot alongside it is easy to see any other details you might care about. The horizontal reference lines by default show means.

      moments from SSC is a convenience wrapper for summarize. Recall that a normal (Gaussian) has skewness 0 and kurtosis 3.

      lmoments from SSC calculates L-moments and related measures. Know that t_3 is a scaled measure of skewness that is 0 for any symmetric distribution and t_4 is a scaled measure of tail weight that is near enough 0.123 for a normal.

      So each distribution is right-skewed and fatter-tailed than normal. Does that matter? The t-test yields a satisfying P-value for those who live and die by P-values.

      Two comparisons that spring to mind here are

      1. log scale -- because it's common experience that comparisons of positive and positively skewed outcomes are easier on logarithmic scale

      2. reciprocal scale -- because specifically gallons / mile or some multiple makes as much or more engineering (substantive) sense as miles / gallon (indeed using units such as litres / km is standard in many places).

      Naturally, you could transform accordingly and carry out the tests or -- and I tend to prefer this -- recast the problem as one of experimenting with different links in a generalized linear model. Here the results are even more clearcut, with my substantive preference for working on reciprocal scale.

      Code:
      sysuse auto, clear 
      
      moments mpg, by(foreign)
      
      lmoments mpg, by(foreign) short 
      
      qplot mpg, over(foreign) legend(order(2 1)) trscale(invnormal(@)) xla(-2/2) xtitle(standard normal deviate) aspect(1) name(G1, replace)
      
      stripplot mpg, over(foreign) height(0.6) vertical refline tufte cumul cumprob xla(, ang(-0.001) tlength(*2) tlc(bg)) name(G2, replace)
      
      ttest mpg, by(foreign) unequal 
      
      quietly glm mpg i.foreign, link(log)
      
      mat li r(table)
      
      quietly glm mpg i.foreign, link(power -1)
      
      mat li r(table)
      qplot

      [Image: mpg_qnorm.png]

      stripplot

      [Image: mpg_stripplot.png]

      Numeric results:

      Code:
      . sysuse auto, clear 
      (1978 automobile data)
      
      . moments mpg, by(foreign)
      
      ----------------------------------------------------------------------
          Group |          n        mean          SD    skewness    kurtosis
      ----------+-----------------------------------------------------------
       Domestic |         52      19.827       4.743       0.771       3.441
        Foreign |         22      24.773       6.611       0.657       3.107
      ----------------------------------------------------------------------
      
      . lmoments mpg, by(foreign) short 
      
      ----------------------------------------------------------------------
          Group |          n         l_1         l_2         t_3         t_4
      ----------+-----------------------------------------------------------
       Domestic |         52      19.827       2.631       0.153       0.164
        Foreign |         22      24.773       3.721       0.140       0.189
      ----------------------------------------------------------------------
      
      . ttest mpg, by(foreign) unequal 
      
      Two-sample t test with unequal variances
      ------------------------------------------------------------------------------
         Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
      ---------+--------------------------------------------------------------------
      Domestic |      52    19.82692     .657777    4.743297    18.50638    21.14747
       Foreign |      22    24.77273     1.40951    6.611187    21.84149    27.70396
      ---------+--------------------------------------------------------------------
      Combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
      ---------+--------------------------------------------------------------------
          diff |           -4.945804    1.555438               -8.120053   -1.771556
      ------------------------------------------------------------------------------
          diff = mean(Domestic) - mean(Foreign)                         t =  -3.1797
      H0: diff = 0                     Satterthwaite's degrees of freedom =  30.5463
      
          Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
       Pr(T < t) = 0.0017         Pr(|T| > |t|) = 0.0034          Pr(T > t) = 0.9983
      
      . quietly glm mpg i.foreign, link(log)
      
      . mat li r(table)
      
      r(table)[9,3]
                    mpg:       mpg:       mpg:
                     0b.         1.           
                foreign    foreign      _cons
           b          0  .22270258  2.9870408
          se          .  .05939597   .0374601
           z          .  3.7494563  79.739259
      pvalue          .  .00017722          0
          ll          .  .10628862  2.9136203
          ul          .  .33911653  3.0604612
          df          .          .          .
        crit   1.959964   1.959964   1.959964
       eform          0          0          0
      
      . 
      . quietly glm mpg i.foreign, link(power -1)
      
      . mat li r(table)
      
      r(table)[9,3]
                     mpg:        mpg:        mpg:
                      0b.          1.            
                 foreign     foreign       _cons
           b           0   -.0100695   .05043647
          se           .   .00265174   .00188936
           z           .  -3.7973187   26.695069
      pvalue           .   .00014627   5.37e-157
          ll           .  -.01526681    .0467334
          ul           .  -.00487218   .05413954
          df           .           .           .
        crit    1.959964    1.959964    1.959964
       eform           0           0           0
      Many readers will know that I am omitting yet further possibilities and raising further questions. I want to comment on a purist view that a researcher should specify a single procedure in advance and then carry it out, not try different things and select according to what makes most sense. Cue indignation or outrage at cherry-picking, P-hacking and the like (which is not to deny that these can be genuine issues).

      1. Many papers pretend that this was done. Whether it's plausible is sometimes in doubt.

      2. Congratulations if you're smart enough to know in advance of looking at the data exactly how they should be analysed. Sometimes the reality is that people are using some procedure that is standard or traditional in some field -- or they are just copying some papers they've read. Ideally a choice grows out of experience with similar data and knowing what works well.

      3. My small philosophy is that you can learn from data how they are best analysed. What's yours? In any case, subject to reviewers, examiners, etc. there should often be scope to explain at least briefly what was tried and found less than ideal.



      • #4
        Originally posted by Leonardo Guizzetti View Post
        This conditional testing strategy is not statistically valid and will inflate overall type 1 error. Furthermore, they don’t test the same hypothesis. As cortisol is a quantitative value, you will be fine to run a two-sample t-test without assuming equal variances. Equivalently, use regress with group as a single factor variable and robust variance estimates.
        I am not sure how this would inflate or lead to more type 1 error. If anything, quite the opposite. I am using the test of normality to see if this continuous variable, cortisol, within each group is normally distributed or not. If it is not normally distributed, I will use a non-normal comparison test, which if anything decreases my type 1 error.

        Regardless, do you have any insight into how I can automate this check with the suggested code?



        • #5
          Originally posted by Nick Cox View Post
          Formal tests for normality divide the statistical world, and many experienced people don't use them much or at all. [...]
          Thanks, this is a great in-depth response! I have a few questions:
          • In your example, does output from the moments command demonstrate a normal distribution? How close does the skewness need to be to zero?
          • Could you explain what the code here is doing?
            Code:
            . quietly glm mpg i.foreign, link(log)
            . mat li r(table)
            . quietly glm mpg i.foreign, link(power -1)
            . mat li r(table)
          I am not familiar with quietly or mat li. I do, however, often use r(table).



          • #6
            Although I agree with Leonardo and Nick, sometimes it is a dilemma to deal with reviewers, supervisors and the stubborn herd effect in several areas.

            Code:
            clear
            set obs 100
            generate gender = round(runiform())
            generate Cortisol = rbeta(1,1)
            
            * Shapiro-Wilk within each group; flag 1 if p < 0.05
            swilk Cortisol if gender==1
            local p1 = cond(r(p) < 0.05, 1, 0)
            swilk Cortisol if gender==0
            local p2 = cond(r(p) < 0.05, 1, 0)
            
            * if either group looks non-normal, run the rank-sum test; otherwise the t-test
            if (`p1' + `p2') > 0 {
                ranksum Cortisol, by(gender)
            }
            else {
                ttest Cortisol, by(gender) unequal
            }
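            For what it's worth, the same decision rule can be written to keep and report the actual p-values rather than 0/1 flags (a sketch along the same lines):

            Code:
            quietly swilk Cortisol if gender==1
            local p_f = r(p)
            quietly swilk Cortisol if gender==0
            local p_m = r(p)
            display "Shapiro-Wilk p-values: female = `p_f', male = `p_m'"
            
            if min(`p_f', `p_m') < 0.05 {
                ranksum Cortisol, by(gender)
            }
            else {
                ttest Cortisol, by(gender) unequal
            }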



            • #7
              Originally posted by Tiago Pereira View Post
              Although I agree with Leonardo and Nick, sometimes it is a dilemma to deal with reviewers, supervisors and the stubborn herd effect in several areas.

              [...]
              Thank you Tiago! This worked.

              Indeed, you are right with your assessment as to why I need to do this. This is the standard in my field - for better or worse.



              • #8
                Originally posted by Jay Gold View Post
                Indeed, you are right with your assessment as to why I need to do this. This is the standard in my field - for better or worse.
                I can sympathize with the peer pressure to pursue this type of tactic. However, it's not well founded. If you are interested in tests of means, do that (with or without transformations that are reasonable or customary for skewness). T-tests and robust regression are very robust to departures from normality when testing equality of means. For biomarker concentration data, natural logs or square roots are commonplace enough. However, a ranksum test is not a non-normal test of means, nor even of medians, a very common misconception present even in textbooks. It simply tells you whether values in one group tend to be higher than values in the other, with no reference to any underlying distribution. Either testing approach is fine as long as you are transparent about what it is you are most interested in testing.
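                For example, the log-scale route might look like this (a sketch; it assumes Cortisol is strictly positive, and lnCortisol is just an illustrative name):

                Code:
                generate lnCortisol = ln(Cortisol)
                ttest lnCortisol, by(gender) unequal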



                • #9
                  Originally posted by Jay Gold View Post
                  I am not sure how this would inflate or lead to more type 1 error. If anything, quite the opposite. I am using the test of normality to see if this continuous variable, cortisol, within each group is normally distributed or not. If it is not normally distributed, I will use a non-normal comparison test, which if anything decreases my type 1 error.
                  From a theoretical perspective, this is a conditional testing strategy, and as such, the p-value at the end of the chosen procedure is also conditional on the first test being significant (or not) at some threshold alpha. Said another way, it is a multiple comparison procedure. The absolute best-case scenario you could hope for is that it does not increase type 1 error. It will not lead to a smaller type 1 error, though.

                  I was reminded of this paper, which discusses some theory and also examines the basic strategy in simulation. Their conclusions show that in this case the inflation is benign. However, the strategy is wrong from a formal perspective. In any case, the key question here is not a distributional one (normal versus not), but whether the test addresses the scientific question of interest (are means of interest, or something else?).
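                  If you want to see the behaviour of the strategy under the null for yourself, a rough simulation sketch (my own construction; the sample sizes, seed, replications, and the program name condtest are all arbitrary):

                  Code:
                  capture program drop condtest
                  program define condtest, rclass
                      clear
                      set obs 60
                      generate group = _n > 30          // two groups of 30
                      generate y = rnormal()            // H0 true: same normal distribution
                      quietly swilk y if group == 0
                      local flag = (r(p) < 0.05)
                      quietly swilk y if group == 1
                      local flag = `flag' + (r(p) < 0.05)
                      if `flag' > 0 {
                          quietly ranksum y, by(group)
                          return scalar p = 2 * normal(-abs(r(z)))
                      }
                      else {
                          quietly ttest y, by(group)
                          return scalar p = r(p)
                      }
                  end
                  
                  simulate p = r(p), reps(1000) seed(123): condtest
                  count if p < 0.05                     // compare with the nominal 50 of 1000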



                  • #10
                    In #5 there were two questions:

                    In your example, does output from the moments command demonstrate a normal distribution? How close does the skewness need to be to zero?
                    That's a testing question that moments deliberately avoids. In preferring measurement to testing, I am like someone who says that a person being 1.85 m tall is more information than their being reported as tall. It takes a bit of experience to know that skewness of 0.8 and kurtosis of 3.4 are not a big deal. That experience is boosted by looking at graphs and by checking that you get similar answers about the means being different even when you ask in different ways.

                    Could you explain what the code here is doing?

                    . quietly glm mpg i.foreign, link(log)
                    . mat li r(table)
                    Take off the quietly to see that this is a generalized linear approach to comparing the means. A comparison between two group means is equivalent to fitting a regression model with a (0, 1) predictor. The log link means that fitting is done on the log scale. The z statistic associated with the predictor is analogous to the t statistic in the conventional test of that name.
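                    If it helps, glm's eform option reports the exponentiated coefficient, which under the log link is the ratio of the two group means:

                    Code:
                    glm mpg i.foreign, link(log) eform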

                    I would want to add that bootstrapping the difference between means is a good way to push the comparison forward.
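                    One way to do that, as a sketch (the number of replications and the seed are arbitrary):

                    Code:
                    sysuse auto, clear
                    * r(mu_1) and r(mu_2) are the group means stored by ttest
                    bootstrap diff = (r(mu_2) - r(mu_1)), reps(1000) seed(2803): ttest mpg, by(foreign) unequal
                    estat bootstrap, percentile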
