Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • sum of the 5 largest values (by date)

    Hey everyone,

    I would like some tips from you.

    I'm currently working on a panel database (monthly) and I need to create a variable like a "concentration index" (by month), which is the sum of the 5 largest values of variable "x" scaled by the total sum of variable "x" (by month).

    Unfortunately, I couldn't find a forum with a demand like this.

    Thank you all in advance

  • #2
    You did not provide example data, so I have made up a toy data set to illustrate the technique. Hopefully, this data set is similar enough to yours that the code will work. If not, we will have both wasted our time.

    To avoid such potential difficulties in the future, please always provide example data when asking for help with code, and always use the -dataex- command to do that. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Code:
    //  CREATE A DEMONSTRATION DATA SITE
    clear*
    set obs 10
    gen int panelvar = _n
    expand 12
    by panelvar, sort: gen mdate = tm(1999m12) + _n
    format mdate %tm
    set seed 1234
    gen x = runiform()
    xtset panelvar mdate
    
    //  SPRINKLE IN SOME MISSING VALUES OF X
    replace x = . if runiform() < 0.2
    
    //  CALCULATE "CONCENTRATION INDEX"
    preserve
    drop if missing(x)
    by mdate (x), sort: egen top_5_total = total(cond(_N-_n < 5, x, .))
    by mdate (x): egen all_total = total(x)
    gen concentration_index = top_5_total/all_total
    by mdate, sort: keep if _n == 1
    keep mdate concentration_index
    tempfile working
    save `working'
    
    //  BRING IT INTO THE ORIGINAL DATA SET
    restore
    merge m:1 mdate using `working'
    Added: The code can be considerably simplified if it is guaranteed that variable x never has a missing value. But that seemed too strong an assumption to make.

    Comment


    • #3
      Dear professor Clyde Schechter,

      Thank you so much for your great and quick response!!!!

      That was amazing...it worked totally for my data!!!

      Next time I will use provide example data.

      Comment


      • #4
        I found this within a few seconds of a search.

        https://www.statalist.org/forums/for...e-past-1-month

        Comment


        • #5
          Here is a concrete example building on the code linked to in #4.


          Code:
          mata:  
          mata clear
          
          
          real rowvector conc(real matrix X) {
              X = select(X, (X :< .))
              _sort(X, -1)
              n = rowmin((5, rows(X)))
              return((n, sum(X[1::n,]), sum(X)))  
          }
          
          end 
          
          * sandbox 
          webuse grunfeld, clear 
          
          rangestat (conc) invest, int(year 0 0) 
          
          rename (conc?) (count sum5 sum)
          label var count "ideally 5"
          label var sum5 "highest 5"
          label var sum  "all values"
          gen conc = sum5/sum
          label var conc "ratio" 
          format conc %4.3f 
          format sum* %3.2f 
          
          tabdisp year, c(count sum5 sum conc)
          
          ----------------------------------------------------------
               Year |  ideally 5   highest 5  all values       ratio
          ----------+-----------------------------------------------
               1935 |          5      640.57      727.46       0.881
               1936 |          5      915.59     1016.07       0.901
               1937 |          5     1098.20     1224.81       0.897
               1938 |          5      669.71      775.55       0.864
               1939 |          5      704.36      805.26       0.875
               1940 |          5     1013.09     1132.65       0.894
               1941 |          5     1227.55     1397.19       0.879
               1942 |          5     1075.64     1226.65       0.877
               1943 |          5     1032.14     1177.90       0.876
               1944 |          5     1028.54     1209.25       0.851
               1945 |          5     1065.49     1241.59       0.858
               1946 |          5     1401.79     1613.59       0.869
               1947 |          5     1257.30     1471.35       0.855
               1948 |          5     1329.70     1539.48       0.864
               1949 |          5     1205.64     1392.44       0.866
               1950 |          5     1333.20     1510.61       0.883
               1951 |          5     1735.22     1995.83       0.869
               1952 |          5     1938.49     2240.33       0.865
               1953 |          5     2427.35     2755.83       0.881
               1954 |          5     2443.81     2737.81       0.893
          ----------------------------------------------------------
          
          .

          Comment


          • #6
            That said, this can be done yet more simply. As Clyde Schechter rightly pointed out, the main pitfall here would be any missing values, because sorting the data pushes any missing values to the end of any block of observations. That makes it more difficult to identify the highest 5.

            But we can just use an indicator for the high values. There aren't any missing values in the Grunfeld data but that is no barrier to writing cautious code.

            Code:
            webuse grunfeld, clear
            
            gen high = !missing(invest)
            
            bysort high year (invest) : replace high = high & (_N - _n) < 5  
            
            egen count = total(high), by(year)
            
            egen double sum5 = total(invest * high), by(year)
            
            egen double sum = total(invest), by(year)
            
            gen conc = sum5 / sum
            
            format sum* %3.2f
            format conc %4.3f
            
            tabdisp year, c(count sum* conc)
            
            
            
            ----------------------------------------------------------
                 Year |      count        sum5         sum        conc
            ----------+-----------------------------------------------
                 1935 |          5      640.57      727.46       0.881
                 1936 |          5      915.59     1016.07       0.901
                 1937 |          5     1098.20     1224.81       0.897
                 1938 |          5      669.71      775.55       0.864
                 1939 |          5      704.36      805.26       0.875
                 1940 |          5     1013.09     1132.65       0.894
                 1941 |          5     1227.55     1397.19       0.879
                 1942 |          5     1075.64     1226.65       0.877
                 1943 |          5     1032.14     1177.90       0.876
                 1944 |          5     1028.54     1209.25       0.851
                 1945 |          5     1065.49     1241.59       0.858
                 1946 |          5     1401.79     1613.59       0.869
                 1947 |          5     1257.30     1471.35       0.855
                 1948 |          5     1329.70     1539.48       0.864
                 1949 |          5     1205.64     1392.44       0.866
                 1950 |          5     1333.20     1510.61       0.883
                 1951 |          5     1735.22     1995.83       0.869
                 1952 |          5     1938.49     2240.33       0.865
                 1953 |          5     2427.35     2755.83       0.881
                 1954 |          5     2443.81     2737.81       0.893
            ----------------------------------------------------------



            Comment


            • #7
              Here's another way:

              Code:
              webuse grunfeld, clear
              
              bysort year: egen rank = rank(-invest), unique   
              
              by year: egen count = total(rank <= 5) 
              
              by year: egen double sum5 = total(invest * (rank <= 5))
               
              by year: egen double sum = total(invest) 
              
              gen conc = sum5 / sum
              
              format sum* %3.2f
              format conc %4.3f
              
              tabdisp year, c(count sum* conc)

              Comment

              Working...
              X