Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to calculate percentile scores for multidimensional panel data in Stata?

    Hi
    I was wondering if anyone can advise me on the following problem:

    My dataset:
    Code:
    input int t byte(j i x a w)
    2000 1 1 19 2 0
    2001 1 1 60 2 1
    2002 1 1 70 2 1
    2000 1 2 22 2 1
    2001 1 2 33 2 0
    2002 1 2 44 2 0
    2000 2 3 23 3 0
    2001 2 3 53 3 0
    2002 2 3 56 2 0
    2000 2 4 90 3 2
    2001 2 4 81 3 1
    2002 2 4 85 2 1
    2000 2 5 88 3 1
    2001 2 5 92 3 2
    2002 2 5  . . .
    end
    time identifier: "t"
    country identifier: "j"
    panel identifier: "i"
    x is covariate
    a: refer to the number of observations available on x for all firms from country j at time t, obtained by the following command
    Code:
    by j t: egen a = count(x)  if !missing(x)
    E.g., how many observations on x are available for all firms from the US (being country j) in 2009 (being time t)

    Required:
    As for w, it is calculated manually, and it refers to how many observations on x are lower than the value of x for firm i using all firms from the same country for the same period.
    E.g., How many observations on x in the US (j) are lower than x value for firm ABC using all observations on US companies for 2002?

    And I would like to have the same for s, another variable that refers to how many observations on x are equal to the value of x for firm i using all firms from the same country for the same period.

    I was wondering what Stata procdure I can use to obtain these variables?

    Conclusion: by obtaining these variables, I would be able to calculate percentile scores for firms grouped by country-period combinations; that is, the scores are compared to each other only within a specific country and time period.

    percentile score = [w + (s/2)]/a


    Similar, not identical alternatives
    xtine programme uses the same formula but unfortunately does not allow for using the grouping restriction (country-year)

    The closest I got to what I need is using
    Code:
    by j t, sort : egen float sw = rank(x)
    Which ranks x using my requested grouping and gives the sum of same and worst scores (denoted by sw) where ties are assigned averaged ranks, but the formula I need to use is based on (s/2), which solves for any ties (I'm restricted to using this formula).

    Your advice in this regard is highly appreciated
    Many thanks
    Mohammed

  • #2
    w and s are jobs for -rangestat-, written by Robert Picard, Nick Cox, and Roberto Ferrer, available from SSC:

    Code:
    clear
    input int t byte(j i x)
    2000 1 1 19
    2001 1 1 60
    2002 1 1 70
    2000 1 2 22
    2001 1 2 33
    2002 1 2 44
    2000 2 3 23
    2001 2 3 53
    2002 2 3 56
    2000 2 4 90
    2001 2 4 81
    2002 2 4 85
    2000 2 5 88
    2001 2 5 92
    2002 2 5  .
    end
    
    isid i j t
    rename j country
    rename i firm
    
    by country t, sort: egen a = count(x) if !missing(x)
    
    rangestat (count) w = x, by(country t) interval(x . -1)
    
    rangestat (count) s = x, by(country t) interval(x 0 0)
    
    mvencode w s, mv(0)
    Notes:
    1. I renamed variable j and i because I had difficulty keep straight which was which in their original names. This is not necessary: you can go back and replace country by j and firm by i and skip the -rename- commands if you prefer. (Although I do think it is better coding practice to use names that have mnemonic value than single letter names.)

    2. The command I show for s counts the number of firms with the same value for x as the current value of x. This necessarily includes the current observation (where x is equal to itself). If you do not wish to count the firm itself, then add the -excludeself- option to that second -rangestat- command.

    Comment


    • #3
      Many thanks, Clyde

      This works perfectly
      And thanks for the valuable tips

      Best wishes
      Mohammed

      Comment


      • #4
        Dear Clyde, thanks again for the valuable guidance you provided earlier.
        I was wondering if you would advise me on a related matter.

        Method
        I have two groups, A and B and their outcome variable is y. I'm interested in relative y as measured by the percentile score formula discussed earlier. For group A, I already calculated percentile scores for the raw scores. For group B, I want to assign percentile scores for their raw scores based on where their raw scores would fall in the distribution of the raw scores for Group A. I have reason not to compare units within Group B to each other and just simply calculate their percentile scores.

        I will construct the empirical cumulative distribution (ECDF) for raw scores of Group A using the following notation:
        FA (Y) = P(ya < Y)
        where Y is the cumulative distribution of scores, and as such, this notation gives the probability a score for unit from Group A is strictly less than Y.

        I will then use ECDF of the group B to find where the raw scores for units in Group B fall in the distribution of raw scores for Group A.
        FA (yb) = P(ya < yb)
        And then, based on the probability given by FB (yb), I assign units from Group B percentile scores for their raw scores.

        Example
        Table (1): Constructing the ECDF for Group A
        (1) (2) (3) (4) (5)
        Group A Raw Score Percentile Score FA(Y)
        Unit 1 15 0.1 0/5 0
        Unit 2 15 0.1 0/5 0.0
        Unit 3 17 0.5 2/5 0.4
        Unit 4 19 0.7 3/5 0.6
        Unit 5 20 0.9 4/5 0.8
        Column (3) is obtained via percentile score formula, column (5) is obtained via FA (Y) which is explained in column (4)
        Table (2): Finding Percentile Scores for Group B Units
        (1) (2) (3) (4)
        Group B Raw Score FA (yb) Percentile Score
        Unit 6 15 0.0 0.1
        Unit 7 17 0.4 0.5
        Unit 8 19 0.6 0.7

        Column (3) in Table (2) is calculated as follows in relation to Table (1) information:
        FA (15) = P(ya < 15)=0.0
        FA (17) = P(ya < 17)=0.4
        FA (19) = P(ya < 19)=0.6
        For each probability, we can then assign the corresponding percentile score to units in Group A. For instance, Table 1 shows that the percentile score for raw score of 15 is 0.1, therefore, we assign percentile score of 0.1 to unit 6 in Table (2).

        Stata Implementation

        [[Calculating FA(Y)]]
        Code 1

        Code:
        // Rank the Group A units by country-year
        rangestat (count) control_rank = x  if !missing(x) & treat!=1, by(country year) interval(x . -1)
        replace control_rank = 0  if control_rank==. & !missing(x) & treat!=1
        Code 2
        Code:
        // Count the number of units from Group B for each country-year grouping
        bysort country year: egen control_count = count(x)  if !missing(x) & treat!=1
        Code 3
        Code:
        // Compute the ECDF for the units from Group A by country-year
        bysort country year: gen control_ecdf = control_rank / control_count if !missing(x) & treat!=1
        Here is the dataset in my first post extended to include some units from Group B (treatment group), where x includes raw scores and pctile includes the percentile scores of these raw scores and treat is an indicator variable that takes 1 if a unit belongs to Group B and 0 otherwise. The dataset include the calculated ECDF for the Group A (the control group).

        Code:
        clear
        input int year byte(country id x) float(treat pctile) double control_rank float(control_count control_ecdf)
        2000 1 6 23 1         . . .        .
        2000 1 2 22 0       .75 1 2       .5
        2000 1 1 19 0       .25 0 2        0
        2000 2 4 90 0  .8333333 2 3 .6666667
        2000 2 7 15 1         . . .        .
        2000 2 3 23 0 .16666667 0 3        0
        2000 2 5 88 0        .5 1 3 .3333333
        2001 1 2 33 0       .25 0 2        0
        2001 1 1 60 0       .75 1 2       .5
        2001 1 6 75 1         . . .        .
        2001 2 4 81 0        .5 1 3 .3333333
        2001 2 3 53 0 .16666667 0 3        0
        2001 2 7 44 1         . . .        .
        2001 2 5 92 0  .8333333 2 3 .6666667
        2002 1 2 44 0       .25 0 2        0
        2002 1 1 70 0       .75 1 2       .5
        2002 1 6 97 1         . . .        .
        2002 2 7 45 1         . . .        .
        2002 2 5  . 0         . . .        .
        2002 2 4 85 0       .75 1 2       .5
        2002 2 3 56 0       .25 0 2        0
        end

        Required:
        I was wondering what Stata-based procedure I can use to obtain FA (yb) in column (3) in Table 2 and to retrieve the percentile scores in column (4) in the same table from column (3) in Table (1), given that all information is given in the same dataset as in the dataex code in this post.

        I would also need to incorporate linear interpolation when no exact score for unit from Group B is found among the scores for Group A, such that we use weighted score as follows

        If we have a raw score of 18 for units in Group B, we cannot find a match for it among Group A scores
        FA (18) = P(ya < 18)

        The lowest and highest existing scores in this context are 17 and 19, respectively. The distance between lower and 18 is 1 and the distance between 18 and upper is 1, as such, lower weight 1/2 and upper weight is 1/2.

        weighted percentile score = pctile_lower(weight) + pctile_upper(weight)
        0.6 = 0.5(1/2)+0.7(1/2)

        Thanks again and sorry for the lengthy post
        Mohammed

        Last edited by Mohammed Omran; 27 Apr 2023, 14:25.

        Comment


        • #5
          Just a minor Correction

          Code 2

          Incorrect description: // Count the number of units from Group B for each country-year grouping
          Correct description: // Count the number of units from Group A for each country-year grouping

          Comment


          • #6
            So, there is a problem here. In every instance in your example data, the treated entity's x-score is completely outside the range of observed values for the controls in the same country and year. You don't say how you want to handle that situation, and in the existing data this is the only situation that arises. I imagine when x is below the range of observed control values of x you would call that the zero percentile. But if x is above the range of observed control values of x, it is less clear what to do since your definition of the cumulative distribution function (Proportion of distribution strictly less than x) means that your control distributions don't top off at 1, but are improper distribution functions that only extend to some lower value. You can see that in your own data.

            Comment


            • #7
              Hi Clyde.
              Thank you for the explanation.
              I realise this definition may depart from a theoretical ideal; it is an empirical definition used in relevant empirical applications in prior literature.

              The hypothetical example dataset is not properly constructed, and I should have paid more attention to its content. In the actual dataset, the control group is way larger than the treatment group, with greater variation in the x-scores. Importantly, in none of the country-year combinations the x-scores for the treatment group are outside the range of the control group x-scores. Accordingly, the approach I described in post #4 is practical and feasible for my setting.

              Thank you again for your highly valuable comments.
              Mohammed


              Comment


              • #8
                OK. The observations with missing values of x are a colossal nuisance here. They get in the way of solving the problem and they have no information that contributes to the solution. So I decided to rewrite the code you used to create the empirical CDF among the controls with simpler code that starts by getting rid of all observationos with missing x. After that, we fit the treated cases into the eCDF and interpolate. Finally, we restore the observations with missing x that were initially removed.

                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear*
                input int year byte(country id x) float treat
                2000 1 6 23 1
                2000 1 2 22 0
                2000 1 1 19 0
                2000 2 4 90 0
                2000 2 7 15 1
                2000 2 3 23 0
                2000 2 5 88 0
                2001 1 2 33 0
                2001 1 1 60 0
                2001 1 6 75 1
                2001 2 4 81 0
                2001 2 3 53 0
                2001 2 7 44 1
                2001 2 5 92 0
                2002 1 2 44 0
                2002 1 1 70 0
                2002 1 6 97 1
                2002 2 7 45 1
                2002 2 5  . 0
                2002 2 4 85 0
                2002 2 3 56 0
                end
                
                
                
                //    SEPARATE OUT THE OBSERVATIONS WITH MISSING X
                //    THEY COMPLICATE THE CODE AND ALSO CONTRIBUTE NO INFORMATION
                //    RELEVANT TO THE PROBLEM AT HAND.
                preserve
                keep if missing(x)
                tempfile holding
                save `holding'
                restore
                drop if missing(x)
                
                //    SEPARATE THE TWO GROUPS
                frame put _all if treat == 0, into(controls)
                drop if treat == 0
                frame rename default treated
                
                //    CALCULATE THE CONTROL DISTRIBUTIONS
                frame change controls
                rangestat (count) n_dominated = x, by(country year) interval(x . -1)
                replace n_dominated = 0 if missing(n_dominated)
                by year country (x), sort: gen ecdf_control = n_dominated/_N
                drop n_dominated
                tempfile controls
                save `controls'
                
                //    FIND WHERE THE CASES FALL ON THE CONTROL DISTRIBUTIONS
                frame change treated
                append using `controls'
                frame drop controls
                gsort year country x -treat id
                //    INTERPOLATE THE VALUES OF ECDF_CONTROL FOR THE TREATED OBSERVATIONS
                ipolate ecdf_control x, gen(pctile_among_controls) by(year country)
                
                //    FINALLY, BRING BACK THE OBSERVATIONS WITH MISSING X
                append using `holding'
                isid year country id, sort
                As already noted in #6, in this example data, the code actually ends up doing nothing. If it does not work properly in your real data set, please post back with a different example that illustrates the problem. In the event your real data does contain some treated observations whose values are outside the observed range for controls, their value of pctile_among_controls will be missing.

                Comment


                • #9
                  I revised the example to not include x-scores for the treatment units that is outside the observed range of x-scores for the control group. I used similar x-scores for some firms from the treatment and control in some country-year combination (country 2 and year 2002, units 4 and 7), both units have the same percentile score now. Here are the results of procedure using the command you provided in #8:

                  Code:
                  clear
                  input int year byte(country id x) float(treat pctile) double control_rank float(control_count control_ecdf ecdf_control) double pctile_among_controls
                  2000 1 1 19 0       .25 0 2        0        0                   0
                  2000 1 2 22 0       .75 1 2       .5       .5                  .5
                  2000 1 6 21 1         . . .        .        .   .3333333333333335
                  2000 2 3 23 0 .16666667 0 3        0        0                   0
                  2000 2 4 90 0  .8333333 2 3 .6666667 .6666667   .6666666865348816
                  2000 2 5 88 0        .5 1 3 .3333333 .3333333   .3333333432674408
                  2000 2 7 28 1         . . .        .        . .025641026405187756
                  2001 1 1 60 0       .75 1 2       .5       .5                  .5
                  2001 1 2 33 0       .25 0 2        0        0                   0
                  2001 1 6 55 1         . . .        .        .  .40740740740740733
                  2001 2 3 53 0 .16666667 0 3        0        0                   0
                  2001 2 4 81 0        .5 1 3 .3333333 .3333333   .3333333432674408
                  2001 2 5 92 0  .8333333 2 3 .6666667 .6666667   .6666666865348816
                  2001 2 7 71 1         . . .        .        .   .2142857206719262
                  2002 1 1 70 0       .75 1 2       .5       .5                  .5
                  2002 1 2 44 0       .25 0 2        0        0                   0
                  2002 1 6 59 1         . . .        .        .   .2884615384615383
                  2002 2 3 56 0       .25 0 2        0        0                   0
                  2002 2 4 85 0       .75 1 2       .5       .5                  .5
                  2002 2 7 85 1         . . .        .        .                  .5
                  end
                  "In the event your real data does contain some treated observations whose values are outside the observed range for controls, their value of pctile_among_controls will be missing."
                  Thank you for this important note

                  Many thanks for your highly appreciated comments and guidance
                  Best wishes
                  Mohammed

                  Comment


                  • #10
                    So, my results for the control eCDF (variable control_ecdf) differ from yours (variable pctile). I think yours are not correct.

                    For example, look at year 2000 country 1. There are two control observations, with x = 19, and 22, respectively. The observation with x = 19 has no observations with x strictly smaller. So its percentile should be 0. The one with x = 22 has a single observation (x = 19) that is smaller. So 1 out of the total of two control observations, i.e. 50%, is strictly smaller. Accordingly control_ecdf reports the correspondences 19 <=> 0 and 22 <=> 0.5. By contrast, the variable pctile shows 0.25 and 0.75. I don't see where that comes from. The numbers .25 and .75 suggest that the denominators of whatever fractions generated those are 4 or a multiple thereof. But even counting the treated observation, there are only 3 observations for year 2000 country 1, and only 2 of those are controls. And how can 19 be at the 25th percentile when it is the very smallest in the data set and you are defining percentiles by strict inequality?

                    Comment


                    • #11
                      Hi Clyde.
                      Thank you for your comment.

                      The pctile column refers to the standardised x-scores using percentile scores obtained via the formula in #1,
                      percentile score = [w + (s/2)]/a
                      which calculates pctile using control firms only for given country-year combination.
                      For country-year 1-2000,
                      For unit 1 with x-score 19, pctile is
                      .25= [0 + (1/2)]/2
                      For unit 2 with x-score 22, pctile is
                      .75= [1 + (1/2)]/2

                      The percentile ranks are defined differently for the ECDF where distribution of the data is reflected.
                      This can be observed in Table (1) in #4, where percentile scores that are equivalent to pctile column in the Stata example are given in column 3 and the ECDF results are given column 5. For x-scores of 15, the ECDF is 0, however the percentile score used to standardised the x-score is 0.1.

                      Many thanks
                      Mohammed

                      Comment


                      • #12
                        Got it. So here's how I would put that all together. The approach in #8 is still sound--it just needs a minor modification:
                        Code:
                        * Example generated by -dataex-. For more info, type help dataex
                        clear*
                        input int year byte(country id x) float treat
                        2000 1 6 23 1
                        2000 1 2 22 0
                        2000 1 1 19 0
                        2000 2 4 90 0
                        2000 2 7 15 1
                        2000 2 3 23 0
                        2000 2 5 88 0
                        2001 1 2 33 0
                        2001 1 1 60 0
                        2001 1 6 75 1
                        2001 2 4 81 0
                        2001 2 3 53 0
                        2001 2 7 44 1
                        2001 2 5 92 0
                        2002 1 2 44 0
                        2002 1 1 70 0
                        2002 1 6 97 1
                        2002 2 7 45 1
                        2002 2 5  . 0
                        2002 2 4 85 0
                        2002 2 3 56 0
                        end
                        
                        
                        
                        //    SEPARATE OUT THE OBSERVATIONS WITH MISSING X
                        //    THEY COMPLICATE THE CODE AND ALSO CONTRIBUTE NO INFORMATION
                        //    RELEVANT TO THE PROBLEM AT HAND.
                        preserve
                        keep if missing(x)
                        tempfile holding
                        save `holding'
                        restore
                        drop if missing(x)
                        
                        //    SEPARATE THE TWO GROUPS
                        frame put _all if treat == 0, into(controls)
                        drop if treat == 0
                        frame rename default treated
                        
                        //    CALCULATE THE CONTROL DISTRIBUTIONS
                        frame change controls
                        rangestat (count) n_dominated = x, by(country year) interval(x . -1)
                        replace n_dominated = 0 if missing(n_dominated)
                        rangestat (count) n_equal = x, by(country year) interval(x 0 0)
                        replace n_equal = 0 if missing(n_equal)
                        by year country (x), sort: gen ecdf_control = (n_dominated + 0.5*n_equal)/_N
                        drop n_dominated n_equal
                        tempfile controls
                        save `controls'
                        
                        //    FIND WHERE THE CASES FALL ON THE CONTROL DISTRIBUTIONS
                        frame change treated
                        append using `controls'
                        frame drop controls
                        gsort year country x -treat id
                        //    INTERPOLATE THE VALUES OF ECDF_CONTROL FOR THE TREATED OBSERVATIONS
                        ipolate ecdf_control x, gen(pctile_among_controls) by(year country)
                        
                        //    FINALLY, BRING BACK THE OBSERVATIONS WITH MISSING X
                        append using `holding'
                        isid year country id, sort

                        Comment


                        • #13
                          The percentile score command defined here is extremely effective. As a reliability check, I managed to replicate some ESG metrics from a well-established proprietary database that uses various layers of calculation based on the percentile score formula as the one defined in #1 using country-year combinations.

                          However, I have noticed that the command for the ECDF does not produce the intended output. I don't know how I couldn't notice this in #9. The values of the ECDF produced here are identical to the percentile scores of x, which is different from my context. pctile in my context can be viewed as merely another relative variable (it happens to be on a percentile scale).

                          The ECDF role is just to identify where would the treated units fall on the distribution of scores of the control units; it shouldn't recalculate percentile scores. For example for year 2002, country 2, both units 4 (a control unit) and 7 (a treated unit) have the same x-score (i.e., 85). The percentile score for unit 4 is .75; accordingly, unit 7 should be assigned a score of .75. However, the ECDF command produces a percentile score of .5 for both.

                          Thank you again Clyde

                          Comment


                          • #14
                            I forgot to mention that I excute this command to produce the percentile scores (pctile variable) before excuting the command for the ECDF

                            Code:
                            //   [Step:1] Identify the country-year grouping
                            isid id country year
                            
                            //   [Step: 2] Generate key components of the percentile score formula
                            
                            *[a: denotes all scores AVILABLE on x within a specific country-year grouping]
                            *[w: denotes all scores on x that are LOWER than the value of x for firm i for a specific country-year grouping]
                            *[s: denotes all scores on x that are EQUAL to the value of x for firm i for a specific country-year grouping]
                            by country year, sort: egen a = count(x) if !missing(x) & treat !=1
                            rangestat (count) w = x  if !missing(x) & treat !=1, by(country year) interval(x . -1)  
                            rangestat (count) s = x if !missing(x) & treat !=1 , by(country year) interval(x 0 0) 
                            mvencode w s, mv(0)
                            
                            //  [Step:3] Calculate percentile score
                            gen pctile = (w+(s/2))/a  if treat!=1
                             
                            replace pctile = 0  if x ==0
                            
                            drop a w s
                            The ECDF command should assign percentile scores to the treated from the already calculated percentile scores for the control units if exact match is found based on x-scores and interpolate similar to the example in #4.
                            I do not separate the two procedures for a particular purpose. Perhaps it is possible to integrate both procedures into a single programme, but I couldn't think of this.

                            Comment


                            • #15
                              Yes, your code in #14 is equivalent to the code underneath the CALCULATE THE CONTROL DISTRIBUTIONS heading in #12. The variable that the latter produces, called ecdf_control, is the same as your variable pctile. And while #12 does it all in one procedure, actually the calculation proceeds sequentially, so it's all a matter of point of view whether it is one procedure or two. The main reason I chose to put it all together is because the exclusion of observations with missing values of x simplifies the code for both pieces--that is the main unifying theme.

                              Comment

                              Working...
                              X