  • Best way to assess normality in Stata and how to know whether my sample size is enough

    I am doing research with a sample size that apparently exceeds 4,000.

    Trying to assess normality:
    Code:
    swilk currentsmoker cigsperday prevalentstroke totchol bmi glucose
    
                       Shapiro-Wilk W test for normal data
    
        Variable |        Obs       W           V         z       Prob>z
    -------------+------------------------------------------------------
    currentsmo~r |      4,238    0.99997      0.063    -7.221    1.00000
      cigsperday |      4,209    0.95455    105.598    12.160    0.00000
    prevalents~e |      4,238    0.92485    175.659    13.491    0.00000
         totchol |      4,188    0.96867     72.464    11.176    0.00000
             bmi |      4,219    0.95759     98.739    11.986    0.00000
         glucose |      3,850    0.56337    936.450    17.804    0.00000
    
    Note: The normal approximation to the sampling distribution of W'
          is valid for 4<=n<=2000.
    As stated in the note, the normality assessment here may not reflect true normality, because my sample size is 4,323 and the Shapiro-Wilk approximation is only valid up to n = 2,000. The same issue applies with
    Code:
    sktest currentsmoker cigsperday prevalentstroke totchol bmi glucose
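
    For what it is worth, the companion Shapiro-Francia test, -sfrancia-, is documented for larger samples (10 to 5,000 observations, if I read the limits correctly), so presumably it would still apply to this sample size; a minimal sketch with the same variables:
    Code:
    * sketch only: Shapiro-Francia test, documented for up to 5,000 observations
    sfrancia currentsmoker cigsperday prevalentstroke totchol bmi glucose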

    Secondly, I would like to ask a technical question. I am sorry; I know that Stata is software and that statistical interpretation is up to the researcher. Suppose I have:
    Variable               Coronary disease        p-value    OR
                           Yes          No
    Have diabetes           40          69          0.000    3.3
    Don't have diabetes    604        3525
    How do I know that the p-value of 0.000 truly reflects the effect of this variable, given that the proportions of participants with and without diabetes are so imbalanced?

    Thank you very much

  • #2
    Zhianni:
    1) With a sample size of about 4,000 I would not be concerned about normality. That said, what is the aim of your research? Why are you struggling with normality?
    2) The numbers in your dataset are what they are (and Stata, like any other statistical package, cannot do anything about that). The table you show can easily be reproduced with Stata's -cci- command:
    Code:
    . cci 40 69 604 3525
                                                             Proportion
                     |   Exposed   Unexposed  |      Total      exposed
    -----------------+------------------------+------------------------
               Cases |        40          69  |        109       0.3670
            Controls |       604        3525  |       4129       0.1463
    -----------------+------------------------+------------------------
               Total |       644        3594  |       4238       0.1520
                     |                        |
                     |      Point estimate    |    [95% conf. interval]
                     |------------------------+------------------------
          Odds ratio |         3.383242       |    2.209614    5.118499 (exact)
     Attr. frac. ex. |         .7044255       |    .5474323    .8046302 (exact)
     Attr. frac. pop |         .2585048       |
                     +-------------------------------------------------
                                   chi2(1) =    40.14  Pr>chi2 = 0.0000
    
    .
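
    On the imbalance worry specifically: the chi-squared test above already conditions on the observed group sizes, but if the relatively small number of diabetic cases is a concern, a sketch (using the documented exact option of -cci-) is to request Fisher's exact p-value as well:
    Code:
    * sketch: same 2x2 table, requesting Fisher's exact p-value
    cci 40 69 604 3525, exact
    With cell counts of this size the exact and chi-squared p-values will typically agree closely.
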
    About your last statement: painful as it may be for me to realize from time to time, Stata cannot replace the researcher's knowledge of statistics, but it can help enormously with the computation.
    Last edited by Carlo Lazzaro; 16 Mar 2024, 05:18.
    Kind regards,
    Carlo
    (StataNow 18.5)

    • #3
      Tests for normality are widely over-rated, because (without giving a complete list of objections)

      1. Marginal normality is rarely needed for almost anything statistical, contrary to myth.

      2. They often answer the wrong question, depending on sample size. Sometimes your sample is too small to detect non-normality: even in these times of "big data" I routinely see people trying to get good results from very small samples. Often your sample is so large that any departure from normality will lead to rejecting the null, even if it is of no scientific or statistical importance or concern. There is a zone in between, but it is hard to state in advance what it might be.
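
      A small simulated sketch of point 2, with made-up data rather than the data above: with 4,000 observations, -sktest- will typically reject normality even for a t distribution with 20 degrees of freedom, which is very close to normal for most practical purposes.
      Code:
      * sketch: t(20) is very close to normal, yet with n = 4,000
      * the skewness/kurtosis test typically rejects normality
      clear
      set obs 4000
      set seed 2024
      generate x = rt(20)
      sktest x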

      Nevertheless, it is a good idea to look at the data! For example, some variables may benefit from transformations on other grounds. Knowing something about how the data are distributed often indicates what should be done, and even if it indicates that there is not much to worry about, the exploration isn't wasted. For one thing, watching out for really wild outliers can be a check on data quality and warn you of possible problems.

      One device that helps -- even if normality is not really a major concern -- is, paradoxically or not, a normal quantile plot (aka normal probability plot, normal scores plot, normal plot, fractile diagram, probit plot).

      -qnorm- is the official command, but it is limited in what you can do.
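
      Applied to the variables in #1, a sketch only (assuming the original dataset is in memory and that glucose is strictly positive), -qnorm- makes the contrast between the raw and log scales easy to see:
      Code:
      * sketch: glucose had the lowest W in #1; compare original and log scales
      generate lnglucose = ln(glucose)
      qnorm glucose, name(g1, replace)
      qnorm lnglucose, name(g2, replace)
      graph combine g1 g2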

      -multqplot- from the Stata Journal, which in turn uses -qplot-, also from the Stata Journal, gives you a portfolio of such plots with relative ease.

      Code:
      sysuse auto, clear 
      
      multqplot price-foreign, trscale(invnormal(@)) yla(#5) xla(-2/2)
      The trscale() option calls for standard normal deviates on the x axis (and we specify sensible labels too). These variables are in different units and have different magnitudes, but yla(#5) works not too badly for all.

      Even though testing for normality is absurd or pointless for indicator or ordered categorical variables -- let alone unordered (nominal) categorical variables -- we get to see their distribution too. (And naturally if you think those graphs are useless, just leave them out.)

      Otherwise the plots show us all kinds of departures from normality, not just in terms of general skewness and tail weight, but also in terms of multimodality or granularity, gaps, and outliers.


      [Attached image: multqplot.png, the normal quantile plots produced by the command above]