  • Best way to assess normality in Stata and how to know whether my sample size is enough

    I am doing research with a sample size that apparently exceeds 4,000.

    Trying to assess normality:
    Code:
    swilk currentsmoker cigsperday prevalentstroke totchol bmi glucose
    
                       Shapiro-Wilk W test for normal data
    
        Variable |        Obs       W           V         z       Prob>z
    -------------+------------------------------------------------------
    currentsmo~r |      4,238    0.99997      0.063    -7.221    1.00000
      cigsperday |      4,209    0.95455    105.598    12.160    0.00000
    prevalents~e |      4,238    0.92485    175.659    13.491    0.00000
         totchol |      4,188    0.96867     72.464    11.176    0.00000
             bmi |      4,219    0.95759     98.739    11.986    0.00000
         glucose |      3,850    0.56337    936.450    17.804    0.00000
    
    Note: The normal approximation to the sampling distribution of W'
          is valid for 4<=n<=2000.
    As stated in the note, the normality assessment here may not reflect true normality, because my sample size is 4,323 and the Shapiro-Wilk approximation is only valid up to n = 2,000. The same issue applies with
    Code:
    sktest currentsmoker cigsperday prevalentstroke totchol bmi glucose
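
    For what it is worth, the companion Shapiro-Francia test, -sfrancia-, is documented for larger samples (10 to 5,000 observations, if I read the limits correctly), so presumably it would still apply to this sample size; a minimal sketch with the same variables:
    Code:
    * sketch only: Shapiro-Francia test, documented for up to 5,000 observations
    sfrancia currentsmoker cigsperday prevalentstroke totchol bmi glucose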

    Secondly, I would like to ask a technical question. I am sorry; I know that Stata is software and that statistical interpretation is up to the researcher. Suppose I have:
    Variable               Coronary disease        p-value    OR
                           Yes          No
    Have diabetes           40          69          0.000    3.3
    Don't have diabetes    604        3525
    How do I know that the p-value of 0.000 truly reflects the effect of this variable, given that the proportions of participants with and without diabetes are so imbalanced?

    Thank you very much

  • #2
    Zhianni:
    1) With a sample size of about 4,000 I would not be concerned about normality. That said, what is the aim of your research? Why are you struggling with normality?
    2) The numbers in your dataset are what they are (and Stata, like any other statistical package, cannot do anything about that). The table you show can easily be reproduced with Stata's -cci- command:
    Code:
    . cci 40 69 604 3525
                                                             Proportion
                     |   Exposed   Unexposed  |      Total      exposed
    -----------------+------------------------+------------------------
               Cases |        40          69  |        109       0.3670
            Controls |       604        3525  |       4129       0.1463
    -----------------+------------------------+------------------------
               Total |       644        3594  |       4238       0.1520
                     |                        |
                     |      Point estimate    |    [95% conf. interval]
                     |------------------------+------------------------
          Odds ratio |         3.383242       |    2.209614    5.118499 (exact)
     Attr. frac. ex. |         .7044255       |    .5474323    .8046302 (exact)
     Attr. frac. pop |         .2585048       |
                     +-------------------------------------------------
                                   chi2(1) =    40.14  Pr>chi2 = 0.0000
    
    .
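
    On the imbalance worry specifically: the chi-squared test above already conditions on the observed group sizes, but if the relatively small number of diabetic cases is a concern, a sketch (using the documented exact option of -cci-) is to request Fisher's exact p-value as well:
    Code:
    * sketch: same 2x2 table, requesting Fisher's exact p-value
    cci 40 69 604 3525, exact
    With cell counts of this size the exact and chi-squared p-values will typically agree closely.
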
    About your last statement: painful as it may be for me to realize from time to time, Stata cannot replace the researcher's knowledge of statistics, but it can help enormously with the computation.
    Last edited by Carlo Lazzaro; 16 Mar 2024, 05:18.
    Kind regards,
    Carlo
    (StataNow 18.5)

    • #3
      Tests for normality are widely over-rated, because (without giving a complete list of objections)

      1. Marginal normality is rarely needed for almost anything statistical, contrary to myth.

      2. They often answer the wrong question, depending on sample size. Sometimes your sample is too small to detect non-normality: even in these times of "big data" I routinely see people trying to get good results from very small samples. Often your sample is so large that any departure from normality will lead to rejecting the null, even if it is of no scientific or statistical importance or concern. There is a zone in between, but it is hard to state in advance what it might be.
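
      A small simulated sketch of point 2, with made-up data rather than the data above: with 4,000 observations, -sktest- will typically reject normality even for a t distribution with 20 degrees of freedom, which is very close to normal for most practical purposes.
      Code:
      * sketch: t(20) is very close to normal, yet with n = 4,000
      * the skewness/kurtosis test typically rejects normality
      clear
      set obs 4000
      set seed 2024
      generate x = rt(20)
      sktest x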

      Nevertheless, it is a good idea to look at the data! For example, some variables may benefit from transformations on other grounds. Knowing something about how the data are distributed often indicates what should be done, and even if it indicates that there is not much to worry about, the exploration isn't wasted. For one thing, watching out for really wild outliers can be a check on data quality and warn you of possible problems.

      One device that helps -- even if normality is not really a major concern -- is, paradoxically or not, a normal quantile plot (aka normal probability plot, normal scores plot, normal plot, fractile diagram, probit plot).

      -qnorm- is the official command, but it is limited in what you can do.
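
      Applied to the variables in #1, a sketch only (assuming the original dataset is in memory and that glucose is strictly positive), -qnorm- makes the contrast between the raw and log scales easy to see:
      Code:
      * sketch: glucose had the lowest W in #1; compare original and log scales
      generate lnglucose = ln(glucose)
      qnorm glucose, name(g1, replace)
      qnorm lnglucose, name(g2, replace)
      graph combine g1 g2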

      -multqplot- from the Stata Journal, which in turn uses -qplot-, also from the Stata Journal, gives you a portfolio of such plots with relative ease.

      Code:
      sysuse auto, clear 
      
      multqplot price-foreign, trscale(invnormal(@)) yla(#5) xla(-2/2)
      The trscale() option calls for standard normal deviates on the x axis (and we specify sensible labels too). These variables are in different units and have different magnitudes, but yla(#5) works not too badly for all.

      Even though testing for normality is absurd or pointless for indicator or ordered categorical variables -- let alone unordered (nominal) categorical variables -- we get to see their distribution too. (And naturally if you think those graphs are useless, just leave them out.)

      Otherwise the plots show us all kinds of departures from normality, not just in terms of general skewness and tail weight, but also in terms of multimodality or granularity, gaps, and outliers.


      [Attached image: multqplot.png, the normal quantile plots produced by the command above]