Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cumulative sums in descending order within groups

    Hello Statalisters,

    I'm using Stata 16.1, and I have panel data in which the unit of observation is the region (called czone). I have a few measures of interest that, for a given indicator, represents each region’s share of the nation in a given year. Each of these ends in 'shr'. Here is a random sample of my data, generated thru dataex.

    Code:
    year czone  normshr lesshr leoshr leshr
    1960  1202 .0003638664 .00004861347 .00023602384            0
    1990 10000  .000834548  .0003118703  .0005555756  .0002700112
    1990 10102 .0006462617   .000217258  .0004091604  .0000675028
    1970 17501 .0015207013  .0008979761   .001310687  .0004139073
    2010 18400 .0006034978 .00019348213  .0004174155 .00013348313
    2000 26201 .0007506197  .0003190925  .0006227999  .0003383522
    2019 26203 .0006399278  .0001512621  .0004416599 .00013742084
    1970 27603 .0005867054  .0002417628  .0006869808            0
    1960 35002 .0003008295  .0000763926    .00029739 .00022237047
    1950 36404 .0007740941   .000380904  .0007541478            0

    For each year in the data, and within that for each 'shr' variable, what I want is to generate a variable that counts the number of regions it takes to reach a cumulative sum of 0.8. Ultimately, what I was thinking was that I want a summary dataset where the unit of observation is the year, that looks like:


    Code:
    year leshr80 lesshr80 leoshr80 normshr80
    where the variables ending in 80 display the number of regions it took to get to an 80 percent cumulative sum on the underlying share variable. I don't care about the naming - just trying to make my goal clear.

    I’m struggling to figure out how to do this. I see three issues - I am not sure how to solve any of them.

    First, I need cumulative counts, from data sorted in descending order within each year. But ‘gsort’ does not play with ‘by’. I could chop the data up into years, but this seems clumsy.

    Second, I tried dropping all but one year to see if I could make a go of it year by year, but I’m still stumped, this time the (seemingly simple) task of being able to generate a descending cumulative sum of the shares (highest to lowest). As an example, using the share indicator called leshr, I tried the following:

    Code:
    gsort -leshr            
    g cumut=0
    replace cumut=leshr[_n] + cumut[_n-1]
    But this returned all missings. I think I see why, but I'm not sure what to do about it.

    Third, if I could fix the above problems, I'm still not sure how to efficiently create a summary variable for each 'shr' variable that told me how many observations it took to get to the cumulative total of 0.8.

    I hope I'm making sense. Thanks in advance.

    Tom

  • #2
    The description of your data was too complicated and I lost the plot.

    Your command generates all missings, because in the first period the period before is missing, so the first is missing, and then from there on everything is missing.

    Try whether the following is doing what you want:

    Code:
    gsort -leshr           
    gen cumut=sum(leshr)
    Or alternatively:

    Code:
    gsort -leshr           
    gen cumut=0
    replace cumut=leshr[_n] + cumut[_n-1] if _n>1

    Comment


    • #3
      Thanks - this solves part 2 of my problem. I naively thought sum gave a grand not running total. I stand corrected.

      But (part 1), I cannot do this within years - since I cannot do the first line in the following:

      Code:
      bys year: gsort -leshr
      bys year: gen cumut=sum(leshr)
      And for part 3 of my problem: if I had a variable with yearly running sums of leshr, as above, how to summarise it such that the summary variable returns the count of observations it takes for the running sum variable to reach 0.8?

      Thanks.


      Comment


      • #4
        Using another data example such that the cumulative sums exceed 0.8, see if this code helps:
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(year czone) double(normshr lesshr)
        1990 1000  .1  .9
        1990 2000  .2 .04
        1990 3000  .2   0
        1990 4000  .2 .01
        1990 5000  .2 .03
        1990 6000  .1 .02
        1995 1000 .05 .05
        1995 2000  .3 .08
        1995 3000  .4  .8
        1995 4000 .05   0
        1995 5000  .1 .02
        1995 6000   0   0
        1995 7000  .1 .05
        2000 3000 .25  .1
        2000 4000 .25  .4
        2000 5000 .25  .2
        2000 6000 .25  .3
        end
        
        reshape long @shr, i(year czone) j(indicator) string
        
        gsort year indicator -shr
        by year indicator: g double cumut = sum(shr)
        by year indicator: g shr80 = _n if cumut>=.8 & (cumut[_n-1]<.8 | _n==1)
        
        keep if !missing(shr80)
        keep year indicator shr80
        reshape wide @shr80, i(year) j(indicator) string
        
        list, noobs clean
        Output:
        Code:
          year   lesshr80   norms~80  
            1990          1          4  
            1995          1          4  
            2000          3          4  
        Last edited by Kelvin Foo; 20 Mar 2021, 12:10. Reason: Formatting, and misread original question..

        Comment


        • #5
          Thank you Kelvin - this is perfect!

          Comment

          Working...
          X