Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating a bell shaped curve

    Hello all, I would like to create a bell shaped curve to see how many people in my population lie >5% centile

    Using the variable: Neonates which is a ratio I created

    I'm sure you all know what bell shaped curve I'm referring too but something like this:
    https://sphweb.bumc.bu.edu/otlt/MPH-...deviations.png

    I've tried a histogram

    Code:
    histogram neonates
    Which produces this:
    Click image for larger version

Name:	Screenshot 2023-09-27 at 17.39.02.png
Views:	1
Size:	523.5 KB
ID:	1728397


    I've been told by my supervisor, that histogram doesn't really work how we normally would like it too.
    As you can see some of my bars are not visible, I've tried log - doesn't work.
    I then thought of ECDF graph but this would be my last resort.

    I wonder if anyone can help me plot a bell shaped curve. I'll add lines at the 1SD, 2SD


    Here is a copy of my data

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int(patientid ageyears) double(bodymassindex neoupper_M13 neolower_M14 tendon_M16) long(ethnicgrp gender) float(dup neolength neolength neonates)
     1 11                . 55.6 42.8 36.5 2 1 1 12.8  6.3  2.031746
     2 22                . 61.5   44   39 2 1 1 17.5    5       3.5
     3 22                . 61.6 43.3 41.3 2 1 1 18.3    2      9.15
     4 28 25.5937370619895   63   50 40.4 2 1 1   13  9.6 1.3541666
     5 29                . 59.7 47.6 33.7 2 1 1 12.1 13.9  .8705037
     6 33                .   56 46.9 33.3 2 1 1  9.1 13.6  .6691176
     7 35                . 60.3 40.3 36.5 2 1 1   20  3.8  5.263158
     8 39                . 60.6 45.8 36.5 2 1 1 14.8  9.3  1.591398
     9 42                . 53.7 42.3 29.5 2 1 1 11.4 12.8  .8906249
    10 45                . 60.9 40.9 36.3 2 1 1   20  4.6  4.347826
    11 46                . 57.1 43.5   37 2 1 1 13.6  6.5 2.0923078
    12 49                . 59.9   42   40 2 1 1 17.9    2      8.95
    13 59 21.9898923652637   59 48.7 38.8 2 1 1 10.3  9.9 1.0404041
    14 64                . 56.5   42   32 2 1 0 14.5   10      1.45
    14 65                . 60.9 48.7 39.4 2 1 0 12.2  9.3  1.311828
    15 65                . 58.9 46.3   37 2 1 0 12.6  9.3 1.3548387
    15 67                . 57.2 41.1 35.8 2 1 0 16.1  5.3  3.037736
    16 67                .   56 45.7 34.6 2 1 0 10.3 11.1  .9279279
    16 68                . 58.4 46.9 37.4 2 1 0 11.5  9.5 1.2105263
    17 68                . 57.8 47.7 35.8 2 1 0 10.1 11.9  .8487396
    end
    label values ethnicgrp ethnicgrp
    label def ethnicgrp 2 "Asian", modify
    label values gender gender
    label def gender 1 "f", modify

  • #2
    So you want to overlay a normal density curve on your histogram? See here, but I'm not sure that actually gets you the information you want to extract.

    Code:
    summarize neonates, detail
    Should give you summary information about percentiles. See here if you need to account for complex survey design.

    Comment


    • #3
      Thanks for your reply. I am aware of summarise, detail and pcentiles.

      this is a graphing issue

      i could plot a box plot to see the distribution

      but I would like to create a distribution curve just like the link.

      if you have any ideas let me know

      Comment


      • #4
        Another option occurs to me: if you want to know how many observations fall x number of standard deviations below the mean, then just convert your neonates variable to a z-score. It should be straightforward from there. Suppose we want to summarize neonates that are -1 SD below the mean:

        Code:
        zscore neonates
        sum neonates if z_neonates < -1

        Comment


        • #5
          The first link in #2 shows you how to draw a fitted normal distribution over your histogram. I'll copy the code here for reference.

          Code:
          histogram neonates, frequency normal
          If that's not what you are asking for, then I'm afraid I don't know what you mean. Obviously the link in the OP shows a theoretical unit normal distribution, not a plot of real data. Or do you need help drawing vertical bars on your histogram at each standard deviation?

          Comment


          • #6
            95% of your values lie above the 5% point. That's the definition. If instead your problem is more subtle such as how many values lie below the 5% point on a fitted normal with the same mean and SD as the data. the last is

            Code:
            summarize neonates, detail 
            count if neonates < r(mean) - invnormal(0.05)*r(sd)
            For checking or assessing graphically how far data lie from a normal distribution, in my view qnorm, which is dedicated to this task, beats messing around with a histogram and bell-shaped curve.

            Comment


            • #7
              Thanks, I'll check qnorm and see how this works. Yes this was really a question of creating pictorial representations of the data rather than looking a the text results

              Comment


              • #8
                A normal quantile plot (many other names. normal probability plot, normal scores plot, probit plot) has many advantages over histograms, including:

                1. The reference condition -- perfect normality -- is a straight line, not a curve, so deviations are easier to assess.

                2. You get to see clearly detail that might matter such as outliers, skewness, gaps and spikes.

                3. There is no need to choose, or to worry about, bin width or origin.

                See also https://www.statalist.org/forums/for...dable-from-ssc

                my 2019 presentation at the London meeting accessible via https://www.stata.com/meeting/uk19/

                Comment


                • #9
                  Originally posted by Nick Cox View Post
                  95% of your values lie above the 5% point. That's the definition. If instead your problem is more subtle such as how many values lie below the 5% point on a fitted normal with the same mean and SD as the data. the last is

                  Code:
                  summarize neonates, detail
                  count if neonates < r(mean) - invnormal(0.05)*r(sd)
                  For checking or assessing graphically how far data lie from a normal distribution, in my view qnorm, which is dedicated to this task, beats messing around with a histogram and bell-shaped curve.
                  Just thinking about this.

                  if I want to determine the number of values above the 95% point would the code be: (different to the q in post 6)
                  Code:
                   count if neonates > r(mean) + invnormal(0.95)*r(sd)
                  In terms of qnorm ...

                  Using:

                  Code:
                  qnorm neonates
                  I obtain the following graph:
                  Click image for larger version

Name:	Screenshot 2023-09-28 at 22.36.22.png
Views:	1
Size:	458.1 KB
ID:	1728521

                  As expected, this shows a normal distribution.

                  Now I would like to demarcate a Line at x axis= 1.2 (and find out what centile this is) and a separate line at the 95% centile point
                  which according to

                  Code:
                  sum neonate, detail
                  
                  // 95th centile = 1.12
                  I tried to add the above reference lines using:

                  Code:
                  qnorm neonates xline(1.2 1.12) lwidth(3) lstyle(grid)
                  Stata told me: factor-variable and time-series operators not allowed

                  Do you have a solution?
                  Also, I've just realised that 'jitter' doesn't work with qnorm - just to see the number of point on the scatter more clearly. is there any other solution?

                  I really like the graphs you produced from -transplot- Nick Cox
                  Particularly this as it has the standard deviation as the x axis with the values of the desired variable on the y axis.
                  However, -transplot- can not be used in my case as I don't want to transform my variable neonates.
                  Is there a way I can use qnorm with yaxis.- neonate variable, x axis as SD.




                  Click image for larger version

Name:	Screenshot 2023-09-28 at 22.34.58.png
Views:	1
Size:	178.2 KB
ID:	1728522
                  Last edited by Rose Matthews; 28 Sep 2023, 14:48.

                  Comment


                  • #10
                    Working more or less backwards:

                    1. Most of the work in the last graph you showed was done by qplot from the Stata Journal. The fact that transplot (SSC) called up qplot doesn't stop you using qplot directly. qnorm has less flexibility.

                    2. There is no Stata syntax rule against using jitter() with qnorm. I don't think it's going to help you, but that is a different problem.

                    3. The error message you report is admittedly not helpful. The problem is that you need a comma before the options start. (That may have been the problem with your attempt to use jitter().)

                    4. Reading off how many observations fall above or below particular thresholds is not best approached graphically. It is a job for (e.g.) the count command.

                    5. The distribution you call normal I would call normal in the middle (WInsor's principle asserts that many distributions are normal in the middle.)

                    Comment


                    • #11
                      Thanks Nick Cox
                      Regarding 2-3.
                      Managed to use Jitter option without error message - yes in fact I missed a comma.

                      For anyone, corrected code:
                      Code:
                      qnorm neonates, jitter(6) xline(1.0 1.30, lwidth(1) lstyle(grid) lcolor(gs12))

                      Question 1.
                      Regarding 4, yes agreed I would need to use count. As indicated in my first question in post 9, in order to count the values above the 95% centile - would this be the formula, which you provided in post 6:
                      Code:
                       count if neonates > r(mean) + invnormal(0.95)*r(sd) OR 
                      count if neonates >1.12 
                      //Something I just thought of now
                      What is the difference between both?- 1.12 is the value of the 95th centile as shown in post 9.
                      The count >1.12 is something I just thought of, the code with the -invnormal- is the code you gave in post 6. Just trying to understand it what the invnormal is doing.


                      Question2
                      Perhaps a silly question, but when using qnorm
                      What does inverse normal mean on the x axis? (this may help me understand q1)

                      Question 3
                      As I asked in post 9 can I change the x axis to SD just like you did in the -transplot- graph in post 9 - and if so how?

                      Comment


                      • #12
                        (Holding note: I am working on a long reply, partly because it will serve as a zeroth draft of something quite different. That's not to inhibit other replies, naturally.)

                        Comment


                        • #13
                          Looking forward !

                          Comment


                          • #14
                            Question 3 is already answered in #2 of the link https://www.statalist.org/forums/for...dable-from-ssc

                            Question 2 is linked so let's back up and take the two together. Here any extra q flags some qualifications or quirks that arise with quantile questions.

                            1. The idea of a median -- half the points lie above, half below, so 50% either way -- with small print on how to calculate it -- is quite old.

                            1q. The recipe to take the middlemost value of the ordered (ranked, sorted) values if the number of values is odd, and the mean of the two middle values if it is even, is explained to mathematicians as a convention and to non-mathematicians as a rule. The two middle values of an even number of values are naturally called comedians, as commented by both Stephen Stigler and Roger Koenker.

                            2. In 1879, Francis Galton (later Sir Francis) identified quartiles -- 3 values, meaning lower (first) quartile, median, and upper (third) quartile, that divide a set of sorted values into 4 groups of equal size, in principle -- and octiles -- 7 values defining 8 groups, similarly. Since then people have added further terms for the more general idea of a set of values equally spaced in terms of cumulative probabilities. https://stats.stackexchange.com/ques...half-a-percent contains the fullest list of such terms I know, but as its compiler I welcome any further examples. Hyndman and Fan (reference at https://www.jstor.org/stable/2684934) documented rules for calculating such measures, and even their discussion could be extended.

                            2q. The history is as usual more muddled than that summary implies. Over a long period people often summarized data, particularly measurements of what was supposedly the same quantity, in terms of an single figure, typically a mean, and a probable error, often denoted PE. The definition of PE is that half the errors are greater than the probable error in absolute value and half less. Probable errors seem to have disappeared from any current literature, but the point remains that the idea of mean +/- PE loosely matches the idea of median and quartiles, so long as a distribution is symmetric. In the context of repeated measurement in physical science, the ideal of a normal distribution of measurements was unsurprisingly common, whether explicit or implicit.

                            3. Someone had the simple but bright idea of rising above the mess of terminology and using the umbrella term quantile, a quantile being defined generally by some fraction or percent being lower and the complementary fraction or percent being higher. In numeric summaries, the term quantile is some times restricted to particular reported levels, but in graphical contexts it is now often used to mean the entire set of ordered values, or the order statistics, to use another term. That practice can certainly be seen in the classic paper by R. Gnanadesikan and M.B. Wilk in Biometrika in 1968.

                            3q. The invention of the term quantile is often attributed to M.G. Kendall (later Sir Maurice), who used the term in the title of a paper in 1940, but it is older, being used by R.A. Fisher (later Sir Ronald) and F. Yates in 1938. (References in the link given in point 2 above.)

                            4. Plotting the quantiles or ordered values on one axis of a Cartesian plot and the cumulative probability (or equivalently the rank or some one-to-one function of it) on the other axis is an old idea with many uses in the 19th century and early 20th century. Examples known to me are in the work of A. Quetelet, Galton, A.R. Wallace (better known for the idea of evolution by natural selection at about the same time as Darwin), G.U. Yule, and so on. It is now common (but not universal) to name such graphs by whatever appears on the vertical axis: following that convention would make a plot (a) a quantile plot if quantiles are on the vertical axis and (b) an (empirical (cumulative)) distribution (function) plot if the cumulative probabilities are on the vertical axis (hence abbreviations such as CDF plot or ECDF plot). Which plot people choose can depend on anything from tribal habit to a thoughtful decision on what works better. Which variable, if either, is thought of as the outcome or response to be explained can be germane.

                            4q. Plotting what we now call a survival function arose between two Huygens brothers even earlier. Current practice of plotting survival probability (the complement of the cumulative probability) on the vertical axis seems to match the idea that this is the outcome to be explained (predicted, fitted).

                            5. Linked to such plots, particularly when the question arises of how to handle tied values, is the idea of plotting position, itself discussed at excruciating length in an FAQ at https://www.stata.com/support/faqs/s...ting-positions. There is a small and sometimes angry literature about recipes for plotting positions. With unique ranks i = 1(1)n for a sample of size n, official Stata command quantile uses (i - 1/2) / n, itself the oldest recipe in the literature. For essentially all graphical purposes it does not, or should not, matter which recipe you use. Stata commands in this territory often use the wording "Fraction of the data" for quantile plots, not cumulative probability of plotting position.

                            5q. Common conventions are (a) to draw quantile plots as sets of points (which in practice can blur together especially in the middle of a large sample) and (b) to draw cumulative distribution plots as segmented lines with jumps at distinct values. As said, these are conventions, not rules.

                            6. So where the idea of inverse come in? It's just terminology: the quantile function say y = Q(p) that yields value y for given cumulative probability p is also known as the inverse cumulative distribution function -- because it is the inverse of the cumulative distribution function p = F(y), which goes the other way, and yields cumulative probability p from value y. I much prefer the name quantile function over inverse cumulative distribution function. The longer name is perhaps most linked to some defensible mathematical fussiness over whether and precisely how the function is definable and defined, something that doesn't worry practitioners unduly.

                            6q. This is in principle. In practice cumulative probabilities are usually calculated as multiples of 1/n, so range from 1/n to 1, while quantiles are usually plotted against plotting position, or some one-to-one function of it. Despite variations in recipe, almost all avoid assigning 0 or 1 to plotting positions (because then the quantile at cumulative probability 0 or 1 may not be defined if either tail stretches to negative or positive infinity.

                            7. The basic quantile plot as implemented in official Stata command quantile would produce a linear configuration of data points if and only if the distribution were uniform (rectangular, flat). That doesn't stop the plot being useful if the data are far from a uniform, or more crucially if the uniform is not a plausible reference distribution. Indeed, many features can be read off such a plot with thought and practice, such as skewness and tail weight, and outliers, gaps or spikes if they exist.

                            7q. Quantile plots of that kind can also be called quantile-quantile plots or QQ-plots or some such name, because one axis is used for the observed quantiles (the sorted data points) and the other axis is used for theoretical quantiles, in a simple quantile plot the quantiles of a uniform distribution on [0, 1], meaning more precisely the expected or mean quantiles in a sample of the same size. This is just terminology. I find the term quantile-quantile plot a bit of a mouthful, and QQ-plot a little too cryptic unless the other people in a conversation all know what it means, but these are small matters of taste.

                            8. But that takes us to the next idea: if some other distribution is of more intterest as a reference distribution, then we just need to plot against a scale that shows (a linear function of) the quantiles of that distribution. Then a good fit is shown by a linear configuration of data points. Plots for the normal or Gaussian distribution of this kind go back to M. P. Henry in 1894. Many names are in use such as normal quantile plot (my preference, and more importantly a term that I sense is rising in frequency), normal probability plot, normal scores plot and probit plot (a term that now seems rare). Some people prefer to say Gaussian, not normal.

                            8q. Other such names that readers can recall are welcome to add to the menagerie. Disciplines can be classified by the fraction of puzzlement that arises from different people using the same name for different things or different names for the same thing. In philosophy in some styles, that is perhaps over half. In statistics, the fraction is smaller but not zero.

                            9. The calculations are simple in Stata, given its implementation of invnormal() as a function. (Stata hasn't followed some other software that names its quantile functions using an abbreviation starting with q, although the names of corresponding graphics commands do begin that way.) Official command qnorm always plots against normal quantiles for a distribution with the same mean and standard deviation as the data offered.

                            10. There are other commands, both official and community-contributed, for quantile plots, but qplot from the Stata Journal is to my knowledge the most versatile. For example, through its trscale() option you can specify other scales for the horizontal axis, such as that of a standard normal distribution with mean 0 and standard deviation 1.

                            10q. qplot started out in 1999 as quantil2 from the Stata Technical Bulletin as a generalization of quantile. The spelling quantil2 arose because Stata for the then popular operating system DOS had to follow the DOS restrictions on filename length, so that all ado-files for Stata had to fit within an 8.3 pattern of filename.ext: either element could be shorter, but neither could be longer, so quantile2 would have been a character too long. All that aside, the naming was optimistic with the thought that StataCorp would want to fold the functionality back into official Stata. That never happened, so eventually I chose the name qplot instead (which has nothing to with any other program, command or function with the same name outside Stata).

                            Well, that was more than you were expecting. Let's return to your specific questions.

                            Let us use mpg from the auto data as an example. It's not normally distributed, but we will make that the main point.

                            The 5% point is reported as 14. We have various difficulties in using that further, one being that it is a repeated value.

                            Code:
                            . su mpg, detail
                            
                                                    Mileage (mpg)
                            -------------------------------------------------------------
                                  Percentiles      Smallest
                             1%           12             12
                             5%           14             12
                            10%           14             14       Obs                  74
                            25%           18             14       Sum of wgt.          74
                            
                            50%           20                      Mean            21.2973
                                                    Largest       Std. dev.      5.785503
                            75%           25             34
                            90%           29             35       Variance       33.47205
                            95%           34             35       Skewness       .9487176
                            99%           41             41       Kurtosis       3.975005
                            
                            . count if mpg == 14
                              6
                            Once we have run count -- which is an r-class command -- the results of summarize -- another r-class command disappear. So, we either save results carefully immediately after we run summarize or we run the command again. If we fit a normal distribution with the same mean and standard deviation (SD), that is below the lowest observed value, and no points lie below it.


                            Code:
                             
                            . su mpg, detail
                            
                                                    Mileage (mpg)
                            -------------------------------------------------------------
                                  Percentiles      Smallest
                             1%           12             12
                             5%           14             12
                            10%           14             14       Obs                  74
                            25%           18             14       Sum of wgt.          74
                            
                            50%           20                      Mean            21.2973
                                                    Largest       Std. dev.      5.785503
                            75%           25             34
                            90%           29             35       Variance       33.47205
                            95%           34             35       Skewness       .9487176
                            99%           41             41       Kurtosis       3.975005
                            
                            . di r(mean) + invnormal(0.05) * r(sd)
                            11.780991
                            
                            . count if mpg < r(mean) + invnormal(0.05) * r(sd)
                              0
                            You can get a normal quantile plot in terms of a standard normal (mean 0, SD 1) just by asking for it, say

                            Code:
                            qplot mpg, trscale(invnormal(@)) xtitle(Standard normal deviate) xla(-2/2) aspect(1)
                            Here is an example. I used Georgia font on a whim, but choose your own.

                            Click image for larger version

Name:	mpg_qplot.png
Views:	1
Size:	25.3 KB
ID:	1728642

                            Comment


                            • #15
                              Thanks for this.
                              Thinking of changing to qplot, however I would like to know

                              Is it possible to know the value 1.2 lies in terms of the standard deviation so I can insert a line.
                              I know the 95th centile lies on the 1.96 standard deviation...

                              Many thanks

                              Comment

                              Working...
                              X