  • how to get smooth lines from multiple CDFs

    Hi, I am struggling to plot multiple CDFs in one graph while zooming in on a certain data range. I'm getting stairsteps, but I want smooth lines. I have the distribution of A1c but want the graph to zoom in on A1c < 6, and it is giving me a very weird stairstep graph. Here is the code I'm using:

    ***generating CDF for 7 time periods

    foreach wave of numlist 1 2 3 4 5 6 7 {
        cumul a1c [aw=svyweight_yr] if wave == `wave', gen(Fa1c`wave')
    }

    ***plotting multiple CDFs if A1c<6
    scalar Target_level = 5.7
    twoway (scatter Fa1c1 a1c, sort connect(J) lw(thin) lc(pink) ms(none)) ///
    (scatter Fa1c2 a1c, sort connect(J) lw(thin) lc(blue) ms(none)) ///
    (scatter Fa1c3 a1c, sort connect(J) lw(thin) lc(orange) ms(none)) ///
    (scatter Fa1c4 a1c, sort connect(J) lw(thin) lc(green) ms(none)) ///
    (scatter Fa1c5 a1c, sort connect(J) lw(thin) lc(purple) ms(none)) ///
    (scatter Fa1c6 a1c, sort connect(J) lw(thin) lc(black) ms(none)) ///
    (scatter Fa1c7 a1c, sort connect(J) lw(thin) lc(red) ms(none)) ///
    if a1c<6, xline(`=Target_level', lwidth(1pt) lcolor(green)) ///
    xtitle("Glycohemoglobin (%)") ///
    legend(label(1 "1988-1991") label(2 "1992-1994") label(3 "1999-2002") ///
    label(4 "2003-2006") label(5 "2007-2010") label(6 "2011-2014") ///
    label(7 "2015-2018") rows(2) size(small))


    If I use "distplot" and restrict the range to A1c < 6, it does not give me the whole distribution. I need the whole distribution, but zoomed in on values up to 6.
    It would be really helpful if anyone could help me sort out this issue. Thanks in advance.

  • #2
    This is the graph I am getting.

    [Image: ss1.jpg]



    • #3
      The empirical CDF is by definition a step function, but the steps seem to be in the wrong place. Here is an example of how to fix that

      Code:
      sysuse nlsw88, clear
      bys race : cumul tenure, gen(c) equal
      separate c, by(race) veryshortlabel
      sort race c
      twoway line c? tenure if tenure < 5 , connect(J ..)
      [Image: Graph.png]
      Last edited by Maarten Buis; 19 Mar 2024, 05:11.
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------



      • #4
        You’re missing the equal option of cumul.

        See also
        distplot from the Stata Journal — which does the whole task in one.
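
        A minimal sketch of what the equal option changes, using the nlsw88 example data rather than the A1c data from the question. The variable names F_plain, F_equal, varies_plain, and varies_equal are introduced here for illustration only.

        ```stata
        * Sketch: effect of cumul's equal option on tied values
        sysuse nlsw88, clear
        cumul tenure, gen(F_plain)
        cumul tenure, gen(F_equal) equal

        * Flag tenure values whose tied observations get differing cumulative values
        bysort tenure (F_plain): gen byte varies_plain = F_plain[1] != F_plain[_N]
        bysort tenure (F_equal): gen byte varies_equal = F_equal[1] != F_equal[_N]

        * Without equal, ties are spread over several heights;
        * with equal, every tie shares one height
        tabulate varies_plain if !missing(tenure)
        tabulate varies_equal if !missing(tenure)
        ```

        Without equal, tied observations sit at different heights for the same x value, which is one way to end up with steps that look misplaced.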



        • #5
          Just to add to the discussion.
          It is possible to get a CDF that is smooth. It's based on the same principles as kdensity, but I do not know if there is a command that gets the numbers for you.
          It is a good exercise in thinking about kernel methods, though.
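
          For reference, one common form of such a smoothed CDF can be sketched as follows, assuming a Gaussian kernel. The symbols n (sample size), X_i (observations), h (bandwidth), and \Phi (standard normal CDF) are notation introduced here for illustration.

          ```latex
          \hat{F}_h(x) = \frac{1}{n} \sum_{i=1}^{n} \Phi\!\left(\frac{x - X_i}{h}\right)
          ```

          As h shrinks toward 0, this approaches the empirical CDF, (1/n) \sum_i 1\{X_i \le x\}, at every x that is not itself an observed value.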



          • #6
            Two thoughts on that:

            1. The strength of the ECDF is that it can display the data as it is. Data is by its very nature discrete: we have values for one person, then another person, then another person, but we don't have half persons or quarter persons. It is also great at showing properties of the data like ties. So smoothing is possible, but in most situations it would undermine the strength of that display.

            2. Kernel methods are great for PDFs, but they tend to get weird near the tails (for good reasons). That is usually not a big problem for PDFs, since the tails aren't that visible. However, in a CDF the tails are visible. My first intuition would be to look instead at the Harrell-Davis estimator of quantiles, implemented in Stata as hdquantile by Nick Cox (available from SSC; in Stata, type: ssc desc hdquantile).

            But I don't think too much about that, as thought 1 takes precedence: the cumulation in the ECDF does all the smoothing you need, and any additional smoothing usually makes the graph worse.
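
            A minimal sketch of the tail issue in thought 2, assuming a Gaussian kernel, an illustrative bandwidth of 0.25, and the nlsw88 example data (in which tenure is assumed to have no negative values). The names aux, F_smooth, n_below, and F_ecdf are hypothetical.

            ```stata
            * Sketch: a Gaussian-kernel CDF puts mass where the data have none
            sysuse nlsw88, clear
            * illustrative bandwidth
            local bw = 0.25
            * evaluation point assumed to lie below every observed tenure
            local x = -0.1

            * smoothed CDF at x: mean of normal((x - X_i)/h) over nonmissing tenure
            gen double aux = normal((`x' - tenure)/`bw')
            summarize aux, meanonly
            scalar F_smooth = r(mean)

            * empirical CDF at x: share of observations at or below x
            count if tenure <= `x'
            scalar n_below = r(N)
            count if !missing(tenure)
            scalar F_ecdf = n_below / r(N)

            display "smoothed CDF at `x': " F_smooth
            display "ECDF at `x':         " F_ecdf
            ```

            The smoothed estimate is strictly positive at a point where the ECDF is zero, and in a CDF plot that leakage at the lower tail is visible.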
            Last edited by Maarten Buis; 19 Mar 2024, 09:48.


            • #7
              My position is (perhaps unsurprisingly given his point 2 in #6) close to that of Maarten Buis.

              Cumulation already imparts some smoothing, and if the CDF still looks complicated it may be telling you something about your data. (The opposite point is that the appearance of modes may be over-interpreted from histograms, to which kernel density estimation is often, but not always, a valuable corrective.)

              I have found that quantile smoothing sometimes gets rid of noise due to e.g. reporting conventions or other quirks leading to rounded data and/or granular data.



              • #8
                Good points.
                Although I do think a good choice of bandwidth could make for a good smoothed graph for just visualization, preserving the characteristics of the data.
                Code:
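                sysuse nlsw88, clear
                * ECDF of tenure within each race; equal gives tied values the same height
                bys race : cumul tenure, gen(c) equal
                separate c, by(race) veryshortlabel
                sort race c
                
                * evaluation grid 0, 0.25, ..., 26: 105 points, matching the loop below
                range ten_d 0 26 105
                gen fd1 =.
                gen fd2 =.
                gen fd3 =.
                local bw = 0.25
                * row counter for storing results
                local j = 0
                
                forvalues i = 0 (.25) 26 {
                    local ++j
                    capture drop aux    
                    * Gaussian-kernel contribution of each observation to the CDF at grid point i
                    gen aux = 1-normal((tenure-`i')/`bw')
                    * the smoothed CDF for each race is the mean contribution in that race
                    sum aux if race==1, meanonly
                    replace fd1 = r(mean) in `j'
                    sum aux if race==2, meanonly
                    replace fd2 = r(mean) in `j'
                    sum aux if race==3, meanonly
                    replace fd3 = r(mean) in `j'
                }
                label var fd1 "smooth white"
                label var fd2 "smooth black"
                label var fd3 "smooth other"
                * overlay the step-function ECDFs and the smoothed curves
                twoway (line c? tenure  , connect(J ..)) (line fd1 fd2 fd3 ten_d , pstyle(p1 p2 p3))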
                [Image: Graph.jpg]

                Which of course is just a matter of preference.
                F



                • #9
                  Originally posted by Maarten Buis
                  The empirical CDF is by definition a step function, but the steps seem to be in the wrong place. Here is an example of how to fix that

                  Code:
                  sysuse nlsw88, clear
                  bys race : cumul tenure, gen(c) equal
                  separate c, by(race) veryshortlabel
                  sort race c
                  twoway line c? tenure if tenure < 5 , connect(J ..)
                  Thank you so much, Maarten Buis. I see you sorted the data by wave and sorted again before plotting the ECDFs. I followed your code and got these curves. Just to explore, I added the connect(l) option and got a smooth curve, like a theoretical CDF. I'm not sure whether it is correct to use. Could you please help me understand this?
                  My code:

                  *CDF for 3 waves
                  bys wave : cumul a1c, gen(Fa1c_) equal
                  separate Fa1c_, by(wave) veryshortlabel
                  sort a1c Fa1c_
                  scalar Target_level = 5.7
                  twoway (line Fa1c_1 a1c, connect(J) lw(thin) lc(pink) ms(none)) ///
                  (line Fa1c_2 a1c, connect(J) lw(thin) lc(blue) ms(none)) ///
                  (line Fa1c_3 a1c, connect(l) lw(thin) lc(orange) ms(none)) ///
                  if a1c < 6 , xline(`=Target_level', lwidth(1pt) lcolor(green)) legend(label( 1 "1988-1991") label(2 "1992-1994") label(3 "1999-2002") rows(2) size(small)) saving(forum.jpg)


                  • #10
                    Originally posted by Nick Cox
                    You’re missing the equal option of cumul.

                    See also
                    distplot from the Stata Journal — which does the whole task in one.
                    Hi Nick Cox, thank you so much for the suggestion. As I mentioned in my initial post, I first tried your program "distplot", but it doesn't give me the whole distribution if I restrict it to values below a threshold. I added the "equal" option to my original code and got a figure equivalent to the one in #9. My remaining confusion is about the smooth line I got using the connect(l) option: should I use it or not?



                    • #11
                      #10

                      hi Nick Cox , thank you so much for the suggestion. As I mentioned in my initial post, I first tried your program "distplot". But it doesn't give me the whole distribution if I only select a threshold.
                      If I understand this correctly, you just need to calculate the distribution function that provides the context separately and then plot it as well, using the addplot() option, which distplot supports as a standard twoway option.

                      I don't understand your second question.



                      • #12
                        Originally posted by Sanchita Chakrovorty
                        just to explore, i added connect (l) option and got a smooth curve as like theoretical CDF. Im not sure, if it is correct to use.
                        You have an empirical CDF, not a theoretical CDF. An empirical CDF is discrete, which is what the step function represents. So in most situations I would not smooth it. Similarly, if I were to review your paper and I saw the smoothed CDF, then that would confuse me (as a general rule, you don't want to confuse the reviewers). So you can do whatever you like (after all, I have no power over you), but my advice is to use the step function.
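
                        A minimal sketch of the difference between the two connect styles, using the nlsw88 example data rather than the A1c data; the variable name F is introduced here for illustration.

                        ```stata
                        * Sketch: the same ECDF drawn as a step function and with straight-line joins
                        sysuse nlsw88, clear
                        cumul tenure, gen(F) equal
                        sort tenure F

                        * connect(J) draws flat-then-vertical steps;
                        * connect(l) joins successive points with straight lines
                        twoway (line F tenure if tenure < 5, connect(J))  ///
                               (line F tenure if tenure < 5, connect(l)), ///
                               legend(order(1 "connect(J): step function" ///
                                            2 "connect(l): straight-line joins"))
                        ```

                        Both lines pass through the same points. connect(l) only interpolates linearly between them, so the apparent smoothness is a drawing choice, not an estimate of an underlying continuous CDF.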


                        • #13
                          Originally posted by Maarten Buis

                          You have an empirical CDF, not a theoretical CDF. An empirical CDF is discrete, which is what the step function represents. So in most situations I would not smooth it. Similarly, if I were to review your paper and I saw the smoothed CDF, then that would confuse me (as a general rule, you don't want to confuse the reviewers). So you can do whatever you like (after all, I have no power over you), but my advice is to use the step function.
                          Thank you so much Maarten Buis . I understand now.
