How to calculate percentile scores for multidimensional panel data in Stata?

Mohammed Omran

Join Date: Jul 2020

Posts: 25
#1

How to calculate percentile scores for multidimensional panel data in Stata?

23 Apr 2023, 08:08

Hi
I was wondering if anyone can advise me on the following problem:

My dataset:

Code:

input int t byte(j i x a w) 2000 1 1 19 2 0 2001 1 1 60 2 1 2002 1 1 70 2 1 2000 1 2 22 2 1 2001 1 2 33 2 0 2002 1 2 44 2 0 2000 2 3 23 3 0 2001 2 3 53 3 0 2002 2 3 56 2 0 2000 2 4 90 3 2 2001 2 4 81 3 1 2002 2 4 85 2 1 2000 2 5 88 3 1 2001 2 5 92 3 2 2002 2 5 . . . end

time identifier: "t"
country identifier: "j"
panel identifier: "i"
x is covariate
a: refer to the number of observations available on x for all firms from country j at time t, obtained by the following command

Code:

by j t: egen a = count(x) if !missing(x)

E.g., how many observations on x are available for all firms from the US (being country j) in 2009 (being time t)

Required:
As for w, it is calculated manually, and it refers to how many observations on x are lower than the value of x for firm i using all firms from the same country for the same period.
E.g., How many observations on x in the US (j) are lower than x value for firm ABC using all observations on US companies for 2002?

And I would like to have the same for s, another variable that refers to how many observations on x are equal to the value of x for firm i using all firms from the same country for the same period.

I was wondering what Stata procdure I can use to obtain these variables?

Conclusion: by obtaining these variables, I would be able to calculate percentile scores for firms grouped by country-period combinations; that is, the scores are compared to each other only within a specific country and time period.

percentile score = [w + (s/2)]/a

Similar, not identical alternatives
xtine programme uses the same formula but unfortunately does not allow for using the grouping restriction (country-year)

The closest I got to what I need is using

Code:

by j t, sort : egen float sw = rank(x)

Which ranks x using my requested grouping and gives the sum of same and worst scores (denoted by sw) where ties are assigned averaged ranks, but the formula I need to use is based on (s/2), which solves for any ties (I'm restricted to using this formula).

Your advice in this regard is highly appreciated
Many thanks
Mohammed
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#2

23 Apr 2023, 10:11

w and s are jobs for -rangestat-, written by Robert Picard, Nick Cox, and Roberto Ferrer, available from SSC:

Code:

clear input int t byte(j i x) 2000 1 1 19 2001 1 1 60 2002 1 1 70 2000 1 2 22 2001 1 2 33 2002 1 2 44 2000 2 3 23 2001 2 3 53 2002 2 3 56 2000 2 4 90 2001 2 4 81 2002 2 4 85 2000 2 5 88 2001 2 5 92 2002 2 5 . end isid i j t rename j country rename i firm by country t, sort: egen a = count(x) if !missing(x) rangestat (count) w = x, by(country t) interval(x . -1) rangestat (count) s = x, by(country t) interval(x 0 0) mvencode w s, mv(0)

Notes:
1. I renamed variable j and i because I had difficulty keep straight which was which in their original names. This is not necessary: you can go back and replace country by j and firm by i and skip the -rename- commands if you prefer. (Although I do think it is better coding practice to use names that have mnemonic value than single letter names.)

2. The command I show for s counts the number of firms with the same value for x as the current value of x. This necessarily includes the current observation (where x is equal to itself). If you do not wish to count the firm itself, then add the -excludeself- option to that second -rangestat- command.
1 like
Comment
Mohammed Omran

Join Date: Jul 2020

Posts: 25
#3

23 Apr 2023, 10:47

Many thanks, Clyde

This works perfectly
And thanks for the valuable tips

Best wishes
Mohammed
Comment
Mohammed Omran

Join Date: Jul 2020

Posts: 25
#4

27 Apr 2023, 14:23

Dear Clyde, thanks again for the valuable guidance you provided earlier.
I was wondering if you would advise me on a related matter.

Method
I have two groups, A and B and their outcome variable is y. I'm interested in relative y as measured by the percentile score formula discussed earlier. For group A, I already calculated percentile scores for the raw scores. For group B, I want to assign percentile scores for their raw scores based on where their raw scores would fall in the distribution of the raw scores for Group A. I have reason not to compare units within Group B to each other and just simply calculate their percentile scores.

I will construct the empirical cumulative distribution (ECDF) for raw scores of Group A using the following notation:
F^A(Y) = P(y^a < Y)
where Y is the cumulative distribution of scores, and as such, this notation gives the probability a score for unit from Group A is strictly less than Y.

I will then use ECDF of the group B to find where the raw scores for units in Group B fall in the distribution of raw scores for Group A.
F^A(y^b) = P(y^a < y^b)
And then, based on the probability given by F^B(y^b), I assign units from Group B percentile scores for their raw scores.

Example
Table (1): Constructing the ECDF for Group A

(1) (2) (3) (4) (5)

Group A Raw Score Percentile Score F^A(Y)

Unit 1 15 0.1 0/5 0

Unit 2 15 0.1 0/5 0.0

Unit 3 17 0.5 2/5 0.4

Unit 4 19 0.7 3/5 0.6

Unit 5 20 0.9 4/5 0.8

Column (3) is obtained via percentile score formula, column (5) is obtained via F^A(Y) which is explained in column (4)
Table (2): Finding Percentile Scores for Group B Units

(1) (2) (3) (4)

Group B Raw Score F^A(y^b) Percentile Score

Unit 6 15 0.0 0.1

Unit 7 17 0.4 0.5

Unit 8 19 0.6 0.7

Column (3) in Table (2) is calculated as follows in relation to Table (1) information:
F^A(15) = P(y^a < 15)=0.0
F^A(17) = P(y^a < 17)=0.4
F^A(19) = P(y^a < 19)=0.6
For each probability, we can then assign the corresponding percentile score to units in Group A. For instance, Table 1 shows that the percentile score for raw score of 15 is 0.1, therefore, we assign percentile score of 0.1 to unit 6 in Table (2).

Stata Implementation

[[Calculating F^A(Y)]]
Code 1

Code:

// Rank the Group A units by country-year rangestat (count) control_rank = x if !missing(x) & treat!=1, by(country year) interval(x . -1) replace control_rank = 0 if control_rank==. & !missing(x) & treat!=1

Code 2

Code:

// Count the number of units from Group B for each country-year grouping bysort country year: egen control_count = count(x) if !missing(x) & treat!=1

Code 3

Code:

// Compute the ECDF for the units from Group A by country-year bysort country year: gen control_ecdf = control_rank / control_count if !missing(x) & treat!=1

Here is the dataset in my first post extended to include some units from Group B (treatment group), where x includes raw scores and pctile includes the percentile scores of these raw scores and treat is an indicator variable that takes 1 if a unit belongs to Group B and 0 otherwise. The dataset include the calculated ECDF for the Group A (the control group).

Code:

clear input int year byte(country id x) float(treat pctile) double control_rank float(control_count control_ecdf) 2000 1 6 23 1 . . . . 2000 1 2 22 0 .75 1 2 .5 2000 1 1 19 0 .25 0 2 0 2000 2 4 90 0 .8333333 2 3 .6666667 2000 2 7 15 1 . . . . 2000 2 3 23 0 .16666667 0 3 0 2000 2 5 88 0 .5 1 3 .3333333 2001 1 2 33 0 .25 0 2 0 2001 1 1 60 0 .75 1 2 .5 2001 1 6 75 1 . . . . 2001 2 4 81 0 .5 1 3 .3333333 2001 2 3 53 0 .16666667 0 3 0 2001 2 7 44 1 . . . . 2001 2 5 92 0 .8333333 2 3 .6666667 2002 1 2 44 0 .25 0 2 0 2002 1 1 70 0 .75 1 2 .5 2002 1 6 97 1 . . . . 2002 2 7 45 1 . . . . 2002 2 5 . 0 . . . . 2002 2 4 85 0 .75 1 2 .5 2002 2 3 56 0 .25 0 2 0 end

Required:
I was wondering what Stata-based procedure I can use to obtain F^A(y^b) in column (3) in Table 2 and to retrieve the percentile scores in column (4) in the same table from column (3) in Table (1), given that all information is given in the same dataset as in the dataex code in this post.

I would also need to incorporate linear interpolation when no exact score for unit from Group B is found among the scores for Group A, such that we use weighted score as follows

If we have a raw score of 18 for units in Group B, we cannot find a match for it among Group A scores
F^A(18) = P(y^a < 18)

The lowest and highest existing scores in this context are 17 and 19, respectively. The distance between lower and 18 is 1 and the distance between 18 and upper is 1, as such, lower weight 1/2 and upper weight is 1/2.

weighted percentile score = pctile_lower(weight) + pctile_upper(weight)
0.6 = 0.5(1/2)+0.7(1/2)

Thanks again and sorry for the lengthy post
Mohammed

Last edited by Mohammed Omran; 27 Apr 2023, 14:25.
Comment
Mohammed Omran

Join Date: Jul 2020

Posts: 25
#5

27 Apr 2023, 15:29

Just a minor Correction

Code 2
Incorrect description: // Count the number of units from Group B for each country-year grouping
Correct description: // Count the number of units from Group A for each country-year grouping
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#6

27 Apr 2023, 17:19

So, there is a problem here. In every instance in your example data, the treated entity's x-score is completely outside the range of observed values for the controls in the same country and year. You don't say how you want to handle that situation, and in the existing data this is the only situation that arises. I imagine when x is below the range of observed control values of x you would call that the zero percentile. But if x is above the range of observed control values of x, it is less clear what to do since your definition of the cumulative distribution function (Proportion of distribution strictly less than x) means that your control distributions don't top off at 1, but are improper distribution functions that only extend to some lower value. You can see that in your own data.
Comment
Mohammed Omran

Join Date: Jul 2020

Posts: 25
#7

27 Apr 2023, 19:06

Hi Clyde.
Thank you for the explanation.
I realise this definition may depart from a theoretical ideal; it is an empirical definition used in relevant empirical applications in prior literature.

The hypothetical example dataset is not properly constructed, and I should have paid more attention to its content. In the actual dataset, the control group is way larger than the treatment group, with greater variation in the x-scores. Importantly, in none of the country-year combinations the x-scores for the treatment group are outside the range of the control group x-scores. Accordingly, the approach I described in post #4 is practical and feasible for my setting.

Thank you again for your highly valuable comments.
Mohammed
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29953

27 Apr 2023, 19:26

OK. The observations with missing values of x are a colossal nuisance here. They get in the way of solving the problem and they have no information that contributes to the solution. So I decided to rewrite the code you used to create the empirical CDF among the controls with simpler code that starts by getting rid of all observationos with missing x. After that, we fit the treated cases into the eCDF and interpolate. Finally, we restore the observations with missing x that were initially removed.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear*
input int year byte(country id x) float treat
2000 1 6 23 1
2000 1 2 22 0
2000 1 1 19 0
2000 2 4 90 0
2000 2 7 15 1
2000 2 3 23 0
2000 2 5 88 0
2001 1 2 33 0
2001 1 1 60 0
2001 1 6 75 1
2001 2 4 81 0
2001 2 3 53 0
2001 2 7 44 1
2001 2 5 92 0
2002 1 2 44 0
2002 1 1 70 0
2002 1 6 97 1
2002 2 7 45 1
2002 2 5  . 0
2002 2 4 85 0
2002 2 3 56 0
end



//    SEPARATE OUT THE OBSERVATIONS WITH MISSING X
//    THEY COMPLICATE THE CODE AND ALSO CONTRIBUTE NO INFORMATION
//    RELEVANT TO THE PROBLEM AT HAND.
preserve
keep if missing(x)
tempfile holding
save `holding'
restore
drop if missing(x)

//    SEPARATE THE TWO GROUPS
frame put _all if treat == 0, into(controls)
drop if treat == 0
frame rename default treated

//    CALCULATE THE CONTROL DISTRIBUTIONS
frame change controls
rangestat (count) n_dominated = x, by(country year) interval(x . -1)
replace n_dominated = 0 if missing(n_dominated)
by year country (x), sort: gen ecdf_control = n_dominated/_N
drop n_dominated
tempfile controls
save `controls'

//    FIND WHERE THE CASES FALL ON THE CONTROL DISTRIBUTIONS
frame change treated
append using `controls'
frame drop controls
gsort year country x -treat id
//    INTERPOLATE THE VALUES OF ECDF_CONTROL FOR THE TREATED OBSERVATIONS
ipolate ecdf_control x, gen(pctile_among_controls) by(year country)

//    FINALLY, BRING BACK THE OBSERVATIONS WITH MISSING X
append using `holding'
isid year country id, sort

As already noted in #6, in this example data, the code actually ends up doing nothing. If it does not work properly in your real data set, please post back with a different example that illustrates the problem. In the event your real data does contain some treated observations whose values are outside the observed range for controls, their value of pctile_among_controls will be missing.

Comment

Mohammed Omran

Join Date: Jul 2020
Posts: 25

27 Apr 2023, 19:38

I revised the example to not include x-scores for the treatment units that is outside the observed range of x-scores for the control group. I used similar x-scores for some firms from the treatment and control in some country-year combination (country 2 and year 2002, units 4 and 7), both units have the same percentile score now. Here are the results of procedure using the command you provided in #8:

Code:

clear
input int year byte(country id x) float(treat pctile) double control_rank float(control_count control_ecdf ecdf_control) double pctile_among_controls
2000 1 1 19 0       .25 0 2        0        0                   0
2000 1 2 22 0       .75 1 2       .5       .5                  .5
2000 1 6 21 1         . . .        .        .   .3333333333333335
2000 2 3 23 0 .16666667 0 3        0        0                   0
2000 2 4 90 0  .8333333 2 3 .6666667 .6666667   .6666666865348816
2000 2 5 88 0        .5 1 3 .3333333 .3333333   .3333333432674408
2000 2 7 28 1         . . .        .        . .025641026405187756
2001 1 1 60 0       .75 1 2       .5       .5                  .5
2001 1 2 33 0       .25 0 2        0        0                   0
2001 1 6 55 1         . . .        .        .  .40740740740740733
2001 2 3 53 0 .16666667 0 3        0        0                   0
2001 2 4 81 0        .5 1 3 .3333333 .3333333   .3333333432674408
2001 2 5 92 0  .8333333 2 3 .6666667 .6666667   .6666666865348816
2001 2 7 71 1         . . .        .        .   .2142857206719262
2002 1 1 70 0       .75 1 2       .5       .5                  .5
2002 1 2 44 0       .25 0 2        0        0                   0
2002 1 6 59 1         . . .        .        .   .2884615384615383
2002 2 3 56 0       .25 0 2        0        0                   0
2002 2 4 85 0       .75 1 2       .5       .5                  .5
2002 2 7 85 1         . . .        .        .                  .5
end

"In the event your real data does contain some treated observations whose values are outside the observed range for controls, their value of pctile_among_controls will be missing."
Thank you for this important note

Many thanks for your highly appreciated comments and guidance
Best wishes
Mohammed

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#10

28 Apr 2023, 12:30

So, my results for the control eCDF (variable control_ecdf) differ from yours (variable pctile). I think yours are not correct.

For example, look at year 2000 country 1. There are two control observations, with x = 19, and 22, respectively. The observation with x = 19 has no observations with x strictly smaller. So its percentile should be 0. The one with x = 22 has a single observation (x = 19) that is smaller. So 1 out of the total of two control observations, i.e. 50%, is strictly smaller. Accordingly control_ecdf reports the correspondences 19 <=> 0 and 22 <=> 0.5. By contrast, the variable pctile shows 0.25 and 0.75. I don't see where that comes from. The numbers .25 and .75 suggest that the denominators of whatever fractions generated those are 4 or a multiple thereof. But even counting the treated observation, there are only 3 observations for year 2000 country 1, and only 2 of those are controls. And how can 19 be at the 25th percentile when it is the very smallest in the data set and you are defining percentiles by strict inequality?
Comment
Mohammed Omran

Join Date: Jul 2020

Posts: 25
#11

28 Apr 2023, 15:55

Hi Clyde.
Thank you for your comment.

The pctile column refers to the standardised x-scores using percentile scores obtained via the formula in #1,

percentile score = [w + (s/2)]/a
which calculates pctile using control firms only for given country-year combination.
For country-year 1-2000,
For unit 1 with x-score 19, pctile is
.25= [0 + (1/2)]/2
For unit 2 with x-score 22, pctile is
.75= [1 + (1/2)]/2

The percentile ranks are defined differently for the ECDF where distribution of the data is reflected.
This can be observed in Table (1) in #4, where percentile scores that are equivalent to pctile column in the Stata example are given in column 3 and the ECDF results are given column 5. For x-scores of 15, the ECDF is 0, however the percentile score used to standardised the x-score is 0.1.

Many thanks
Mohammed
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29953

#12

28 Apr 2023, 16:49

Got it. So here's how I would put that all together. The approach in #8 is still sound--it just needs a minor modification:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear*
input int year byte(country id x) float treat
2000 1 6 23 1
2000 1 2 22 0
2000 1 1 19 0
2000 2 4 90 0
2000 2 7 15 1
2000 2 3 23 0
2000 2 5 88 0
2001 1 2 33 0
2001 1 1 60 0
2001 1 6 75 1
2001 2 4 81 0
2001 2 3 53 0
2001 2 7 44 1
2001 2 5 92 0
2002 1 2 44 0
2002 1 1 70 0
2002 1 6 97 1
2002 2 7 45 1
2002 2 5  . 0
2002 2 4 85 0
2002 2 3 56 0
end



//    SEPARATE OUT THE OBSERVATIONS WITH MISSING X
//    THEY COMPLICATE THE CODE AND ALSO CONTRIBUTE NO INFORMATION
//    RELEVANT TO THE PROBLEM AT HAND.
preserve
keep if missing(x)
tempfile holding
save `holding'
restore
drop if missing(x)

//    SEPARATE THE TWO GROUPS
frame put _all if treat == 0, into(controls)
drop if treat == 0
frame rename default treated

//    CALCULATE THE CONTROL DISTRIBUTIONS
frame change controls
rangestat (count) n_dominated = x, by(country year) interval(x . -1)
replace n_dominated = 0 if missing(n_dominated)
rangestat (count) n_equal = x, by(country year) interval(x 0 0)
replace n_equal = 0 if missing(n_equal)
by year country (x), sort: gen ecdf_control = (n_dominated + 0.5*n_equal)/_N
drop n_dominated n_equal
tempfile controls
save `controls'

//    FIND WHERE THE CASES FALL ON THE CONTROL DISTRIBUTIONS
frame change treated
append using `controls'
frame drop controls
gsort year country x -treat id
//    INTERPOLATE THE VALUES OF ECDF_CONTROL FOR THE TREATED OBSERVATIONS
ipolate ecdf_control x, gen(pctile_among_controls) by(year country)

//    FINALLY, BRING BACK THE OBSERVATIONS WITH MISSING X
append using `holding'
isid year country id, sort

Comment

Mohammed Omran

Join Date: Jul 2020

Posts: 25
#13

28 Apr 2023, 18:11

The percentile score command defined here is extremely effective. As a reliability check, I managed to replicate some ESG metrics from a well-established proprietary database that uses various layers of calculation based on the percentile score formula as the one defined in #1 using country-year combinations.

However, I have noticed that the command for the ECDF does not produce the intended output. I don't know how I couldn't notice this in #9. The values of the ECDF produced here are identical to the percentile scores of x, which is different from my context. pctile in my context can be viewed as merely another relative variable (it happens to be on a percentile scale).

The ECDF role is just to identify where would the treated units fall on the distribution of scores of the control units; it shouldn't recalculate percentile scores. For example for year 2002, country 2, both units 4 (a control unit) and 7 (a treated unit) have the same x-score (i.e., 85). The percentile score for unit 4 is .75; accordingly, unit 7 should be assigned a score of .75. However, the ECDF command produces a percentile score of .5 for both.

Thank you again Clyde
Comment

Mohammed Omran

Join Date: Jul 2020
Posts: 25

#14

28 Apr 2023, 18:49

I forgot to mention that I excute this command to produce the percentile scores (pctile variable) before excuting the command for the ECDF

Code:

//   [Step:1] Identify the country-year grouping
isid id country year

//   [Step: 2] Generate key components of the percentile score formula

*[a: denotes all scores AVILABLE on x within a specific country-year grouping]
*[w: denotes all scores on x that are LOWER than the value of x for firm i for a specific country-year grouping]
*[s: denotes all scores on x that are EQUAL to the value of x for firm i for a specific country-year grouping]
by country year, sort: egen a = count(x) if !missing(x) & treat !=1
rangestat (count) w = x  if !missing(x) & treat !=1, by(country year) interval(x . -1)  
rangestat (count) s = x if !missing(x) & treat !=1 , by(country year) interval(x 0 0) 
mvencode w s, mv(0)

//  [Step:3] Calculate percentile score
gen pctile = (w+(s/2))/a  if treat!=1
 
replace pctile = 0  if x ==0

drop a w s

The ECDF command should assign percentile scores to the treated from the already calculated percentile scores for the control units if exact match is found based on x-scores and interpolate similar to the example in #4.
I do not separate the two procedures for a particular purpose. Perhaps it is possible to integrate both procedures into a single programme, but I couldn't think of this.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#15

28 Apr 2023, 18:54

Yes, your code in #14 is equivalent to the code underneath the CALCULATE THE CONTROL DISTRIBUTIONS heading in #12. The variable that the latter produces, called ecdf_control, is the same as your variable pctile. And while #12 does it all in one procedure, actually the calculation proceeds sequentially, so it's all a matter of point of view whether it is one procedure or two. The main reason I chose to put it all together is because the exclusion of observations with missing values of x simplifies the code for both pieces--that is the main unifying theme.
Comment

(1)	(2)	(3)	(4)	(5)
Group A	Raw Score	Percentile Score		F^A(Y)
Unit 1	15	0.1	0/5	0
Unit 2	15	0.1	0/5	0.0
Unit 3	17	0.5	2/5	0.4
Unit 4	19	0.7	3/5	0.6
Unit 5	20	0.9	4/5	0.8

(1)	(2)	(3)	(4)
Group B	Raw Score	F^A(y^b)	Percentile Score
Unit 6	15	0.0	0.1
Unit 7	17	0.4	0.5
Unit 8	19	0.6	0.7

Announcement