sum of the 5 largest values (by date)

Kleber Vasconcellos

Join Date: Jul 2019

Posts: 2
#1

sum of the 5 largest values (by date)

03 Jan 2022, 15:02

Hey everyone,

I would like some tips from you.

I'm currently working on a panel database (monthly) and I need to create a variable like a "concentration index" (by month), which is the sum of the 5 largest values of variable "x" scaled by the total sum of variable "x" (by month).

Unfortunately, I couldn't find a forum with a demand like this.

Thank you all in advance
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

03 Jan 2022, 15:49

You did not provide example data, so I have made up a toy data set to illustrate the technique. Hopefully, this data set is similar enough to yours that the code will work. If not, we will have both wasted our time.

To avoid such potential difficulties in the future, please always provide example data when asking for help with code, and always use the -dataex- command to do that. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Code:

// CREATE A DEMONSTRATION DATA SITE clear* set obs 10 gen int panelvar = _n expand 12 by panelvar, sort: gen mdate = tm(1999m12) + _n format mdate %tm set seed 1234 gen x = runiform() xtset panelvar mdate // SPRINKLE IN SOME MISSING VALUES OF X replace x = . if runiform() < 0.2 // CALCULATE "CONCENTRATION INDEX" preserve drop if missing(x) by mdate (x), sort: egen top_5_total = total(cond(_N-_n < 5, x, .)) by mdate (x): egen all_total = total(x) gen concentration_index = top_5_total/all_total by mdate, sort: keep if _n == 1 keep mdate concentration_index tempfile working save `working' // BRING IT INTO THE ORIGINAL DATA SET restore merge m:1 mdate using `working'

Added: The code can be considerably simplified if it is guaranteed that variable x never has a missing value. But that seemed too strong an assumption to make.
2 likes
Comment
Kleber Vasconcellos

Join Date: Jul 2019

Posts: 2
#3

03 Jan 2022, 17:07

Dear professor Clyde Schechter,

Thank you so much for your great and quick response!!!!

That was amazing...it worked totally for my data!!!

Next time I will use provide example data.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35699
#4

03 Jan 2022, 19:20

I found this within a few seconds of a search.

https://www.statalist.org/forums/for...e-past-1-month
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35699

04 Jan 2022, 04:54

Here is a concrete example building on the code linked to in #4.

Code:

mata:  
mata clear


real rowvector conc(real matrix X) {
    X = select(X, (X :< .))
    _sort(X, -1)
    n = rowmin((5, rows(X)))
    return((n, sum(X[1::n,]), sum(X)))  
}

end 

* sandbox 
webuse grunfeld, clear 

rangestat (conc) invest, int(year 0 0) 

rename (conc?) (count sum5 sum)
label var count "ideally 5"
label var sum5 "highest 5"
label var sum  "all values"
gen conc = sum5/sum
label var conc "ratio" 
format conc %4.3f 
format sum* %3.2f 

tabdisp year, c(count sum5 sum conc)

----------------------------------------------------------
     Year |  ideally 5   highest 5  all values       ratio
----------+-----------------------------------------------
     1935 |          5      640.57      727.46       0.881
     1936 |          5      915.59     1016.07       0.901
     1937 |          5     1098.20     1224.81       0.897
     1938 |          5      669.71      775.55       0.864
     1939 |          5      704.36      805.26       0.875
     1940 |          5     1013.09     1132.65       0.894
     1941 |          5     1227.55     1397.19       0.879
     1942 |          5     1075.64     1226.65       0.877
     1943 |          5     1032.14     1177.90       0.876
     1944 |          5     1028.54     1209.25       0.851
     1945 |          5     1065.49     1241.59       0.858
     1946 |          5     1401.79     1613.59       0.869
     1947 |          5     1257.30     1471.35       0.855
     1948 |          5     1329.70     1539.48       0.864
     1949 |          5     1205.64     1392.44       0.866
     1950 |          5     1333.20     1510.61       0.883
     1951 |          5     1735.22     1995.83       0.869
     1952 |          5     1938.49     2240.33       0.865
     1953 |          5     2427.35     2755.83       0.881
     1954 |          5     2443.81     2737.81       0.893
----------------------------------------------------------

.

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35699

04 Jan 2022, 05:35

That said, this can be done yet more simply. As Clyde Schechter rightly pointed out, the main pitfall here would be any missing values, because sorting the data pushes any missing values to the end of any block of observations. That makes it more difficult to identify the highest 5.

But we can just use an indicator for the high values. There aren't any missing values in the Grunfeld data but that is no barrier to writing cautious code.

Code:

webuse grunfeld, clear

gen high = !missing(invest)

bysort high year (invest) : replace high = high & (_N - _n) < 5  

egen count = total(high), by(year)

egen double sum5 = total(invest * high), by(year)

egen double sum = total(invest), by(year)

gen conc = sum5 / sum

format sum* %3.2f
format conc %4.3f

tabdisp year, c(count sum* conc)



----------------------------------------------------------
     Year |      count        sum5         sum        conc
----------+-----------------------------------------------
     1935 |          5      640.57      727.46       0.881
     1936 |          5      915.59     1016.07       0.901
     1937 |          5     1098.20     1224.81       0.897
     1938 |          5      669.71      775.55       0.864
     1939 |          5      704.36      805.26       0.875
     1940 |          5     1013.09     1132.65       0.894
     1941 |          5     1227.55     1397.19       0.879
     1942 |          5     1075.64     1226.65       0.877
     1943 |          5     1032.14     1177.90       0.876
     1944 |          5     1028.54     1209.25       0.851
     1945 |          5     1065.49     1241.59       0.858
     1946 |          5     1401.79     1613.59       0.869
     1947 |          5     1257.30     1471.35       0.855
     1948 |          5     1329.70     1539.48       0.864
     1949 |          5     1205.64     1392.44       0.866
     1950 |          5     1333.20     1510.61       0.883
     1951 |          5     1735.22     1995.83       0.869
     1952 |          5     1938.49     2240.33       0.865
     1953 |          5     2427.35     2755.83       0.881
     1954 |          5     2443.81     2737.81       0.893
----------------------------------------------------------

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35699

11 Jan 2022, 12:39

Here's another way:

Code:

webuse grunfeld, clear

bysort year: egen rank = rank(-invest), unique   

by year: egen count = total(rank <= 5) 

by year: egen double sum5 = total(invest * (rank <= 5))
 
by year: egen double sum = total(invest) 

gen conc = sum5 / sum

format sum* %3.2f
format conc %4.3f

tabdisp year, c(count sum* conc)

Announcement