Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sum highest values within subgroups and assign result to observation

    Hi,

    I am quite lost with the following task and would really appreciate your help.

    One of my datasets (dataset A) contains compensation data for CFOs in various years:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str6 Company_ID str5 Manager_ID str25 Name str4 Year double Compensation
    "001004" "33979" "Rick Poulton"      "2007"      300
    "001004" "33979" "Rick Poulton"      "2008"      330
    "001004" "19999" "Mike Sharp"        "2015"  382.418
    "001004" "19999" "Mike Sharp"        "2016"  227.692
    "001050" ""      "Dennis Blazer"     "2007"        .
    "001050" ""      "Dennis Blazer"     "2008"        .
    "001056" "17853" "John Adamovich Jr" "2007"  823.534
    "001076" "18039" "Gil Danielson"     "2003"      300
    "001076" "18039" "Gil Danielson"     "2007"      400
    "001076" "18039" "Gil Danielson"     "2008"      425
    "001078" "24400" "Tom Freyman"       "2009"  914.461
    "001078" "24400" "Tom Freyman"       "2010"  941.923
    "001078" "24400" "Tom Freyman"       "2013"  969.748
    "001078" "24400" "Tom Freyman"       "2014" 1012.604
    "001161" "37980" "Thomas Seifert"    "2012"  494.826
    "001177" "24433" "Alan Bennett"      "2003"  839.464
    "001177" "24433" "Alan Bennett"      "2004" 1024.113
    "001177" "24433" "Alan Bennett"      "2005"     1090
    "001177" "29788" "Shawn Guertin"     "2014"  697.318
    "001177" "13386" "Joseph Zubretsky"  "2010"  730.728
    "001177" "13386" "Joseph Zubretsky"  "2011"      800
    "001177" "13386" "Joseph Zubretsky"  "2012"   795.52
    "001209" "25777" "John Owings"       "2002"  421.346
    "001228" ""      "John Carlson"      "2001"        .
    "001228" ""      "John Carlson"      "2006"        .
    "001230" "41121" "Brandon Pedersen"  "2016"  390.769
    "001254" "29075" "Chris Benjamin"    "2006"   293.75
    "001254" "42393" "Joel Wine"         "2013"  442.839
    "001254" "42393" "Joel Wine"         "2014"  454.887
    "001266" ""      "W Humphrey"        "2014"        .
    "001274" "42333" "Roger Gorham"      "2007"      510
    "001274" "42333" "Roger Gorham"      "2011"      550
    "001279" "25689" "Bruce Walenczyk"   "2001"  316.584
    "001327" "32403" "Don Palette"       "2009"  327.692
    "001327" "32403" "Don Palette"       "2011"    357.8
    "001327" "32403" "Don Palette"       "2015"   418.75
    "001327" "36045" "Kris Sennesael"    "2018"  456.366
    "001382" ""      "Derek Kerr"        "2005"        .
    "001408" "18591" "Craig Omtvedt"     "2006"      575
    "001408" "42400" "Bob Probst"        "2012"   558.25
    "001410" "34865" "Jim Lusk"          "2010"  465.313
    "001410" "34865" "Jim Lusk"          "2012"  526.763
    end
    Whereas:
    • Company_ID: Unique identifier of the company
    • Name: Name of the Manager
    • Manager_ID: Unique identifier of the manager
    • Year: Reporting year of the compensation
    • Compensation: Salary + Bonus earned by the Manager in the respective Year and company.
    The data on the managers' compensation come from a dataset B, which contains information on compensation for various members of the top management team (including CEO, COO, CFO...) of those companies included in dataset A.


    Code:
    clear
    input str6 Company_ID str5 Manager_ID str50 Name double Compensation str4 Year
    "001004" "09249" "David P. Storch"           946 "2000"
    "001004" "09249" "David P. Storch"         665.4 "2001"
    "001004" "09249" "David P. Storch"       661.466 "2002"
    "001004" "09249" "David P. Storch"        1157.5 "2003"
    "001004" "09249" "David P. Storch"      1287.238 "2004"
    "001004" "09249" "David P. Storch"      1757.651 "2005"
    "001004" "09249" "David P. Storch"         741.5 "2006"
    "001004" "09249" "David P. Storch"       768.248 "2007"
    "001004" "09249" "David P. Storch"       791.295 "2008"
    "001004" "09249" "David P. Storch"       799.208 "2009"
    "001004" "09249" "David P. Storch"           850 "2010"
    "001004" "09249" "David P. Storch"           867 "2011"
    "001004" "09249" "David P. Storch"       877.838 "2012"
    "001004" "09249" "David P. Storch"       906.449 "2013"
    "001004" "09249" "David P. Storch"       906.449 "2014"
    "001004" "09249" "David P. Storch"        755.25 "2015"
    "001004" "09249" "David P. Storch"           835 "2016"
    "001004" "09249" "David P. Storch"           941 "2017"
    "001004" "09250" "Philip C. Slapke"          470 "2000"
    "001004" "09251" "Howard A. Pulsifer"        325 "2000"
    "001004" "09251" "Howard A. Pulsifer"      255.6 "2001"
    "001004" "09251" "Howard A. Pulsifer"      261.1 "2002"
    "001004" "09251" "Howard A. Pulsifer"    388.389 "2003"
    "001004" "09251" "Howard A. Pulsifer"    417.389 "2004"
    "001004" "09251" "Howard A. Pulsifer"    456.855 "2005"
    "001004" "09251" "Howard A. Pulsifer"      286.4 "2006"
    "001004" "09251" "Howard A. Pulsifer"    296.738 "2007"
    "001004" "09252" "Timothy J. Romenesko"      390 "2000"
    "001004" "09252" "Timothy J. Romenesko"    293.7 "2001"
    "001004" "09252" "Timothy J. Romenesko"    300.6 "2002"
    "001004" "09252" "Timothy J. Romenesko"  526.251 "2003"
    "001004" "09252" "Timothy J. Romenesko"  549.805 "2004"
    "001004" "09252" "Timothy J. Romenesko"   666.02 "2005"
    "001004" "09252" "Timothy J. Romenesko"      330 "2006"
    "001004" "09252" "Timothy J. Romenesko"      400 "2007"
    "001004" "09252" "Timothy J. Romenesko"      450 "2008"
    "001004" "09252" "Timothy J. Romenesko"    454.5 "2009"
    "001004" "09252" "Timothy J. Romenesko"   468.18 "2010"
    "001004" "09252" "Timothy J. Romenesko"  477.544 "2011"
    "001004" "09252" "Timothy J. Romenesko"  483.513 "2012"
    "001004" "09252" "Timothy J. Romenesko"  499.272 "2013"
    "001004" "09252" "Timothy J. Romenesko"  499.272 "2014"
    "001004" "09252" "Timothy J. Romenesko"   963.65 "2015"
    "001004" "09252" "Timothy J. Romenesko"    463.5 "2016"
    "001004" "09252" "Timothy J. Romenesko" 2784.869 "2017"
    "001004" "19999" "Michael J. Sharp"          204 "2000"
    "001004" "19999" "Michael J. Sharp"        169.3 "2001"
    "001004" "19999" "Michael J. Sharp"        171.3 "2002"
    "001004" "19999" "Michael J. Sharp"      312.576 "2012"
    "001004" "19999" "Michael J. Sharp"      360.353 "2013"
    "001004" "19999" "Michael J. Sharp"      382.418 "2015"
    "001004" "19999" "Michael J. Sharp"      227.692 "2016"
    "001004" "23780" "Joseph M. Gullion"       103.5 "2000"
    "001004" "23780" "Joseph M. Gullion"       391.9 "2001"
    end
    I would like to compare the compensation of each observation from Data Set A with the sum of the compensations of the five highest-paid managers from that company in the respective year: CFO_Compensationc,i,t/Sum(Compensation five highest paid managers in company)c,t

    The logic behind the code should be: For each observation in dataset A: Look for all observations in dataset B that are assigned to the Company_ID and the year of the corresponding observation -> Sum up the compensation of the five best paid managers (except of course the CFOs own compensation) and assign the result to the observation from A. If there are observations for less than 5 managers, calculate the sum based on the maximum possible observations (so 4, 3, or 2).

    Using the following code I managed to calculate the total compensation paid per company in a given year in dataset B:

    Code:
    bysort Company_ID Year: egen totals = total(Compensation)
    egen tag = tag(Company_ID Year)
    gen Comp_Total = totals if tag == 1
    drop tag drop totals
    This sum however, includes the compensation for all Managers of a company in a given year and not only the 5 highest paid ones.

    Any help is much appreciated

    Happy holidays
    Last edited by Klaus Klausen; 22 Dec 2021, 08:03.

  • #2
    Thanks for your data example. My first thought was to get the 5 highest values after sorting first. But then I see you have missing values, so they need to be segregated. So, the idea is to set up an indicator for what should be added (or counted), which is then necessarily (not missing) & (in highest 5).


    Code:
    clear
    input str6 Company_ID str5 Manager_ID str50 Name double Compensation str4 Year
    "001004" "09249" "David P. Storch"           946 "2000"
    "001004" "09249" "David P. Storch"         665.4 "2001"
    "001004" "09249" "David P. Storch"       661.466 "2002"
    "001004" "09249" "David P. Storch"        1157.5 "2003"
    "001004" "09249" "David P. Storch"      1287.238 "2004"
    "001004" "09249" "David P. Storch"      1757.651 "2005"
    "001004" "09249" "David P. Storch"         741.5 "2006"
    "001004" "09249" "David P. Storch"       768.248 "2007"
    "001004" "09249" "David P. Storch"       791.295 "2008"
    "001004" "09249" "David P. Storch"       799.208 "2009"
    "001004" "09249" "David P. Storch"           850 "2010"
    "001004" "09249" "David P. Storch"           867 "2011"
    "001004" "09249" "David P. Storch"       877.838 "2012"
    "001004" "09249" "David P. Storch"       906.449 "2013"
    "001004" "09249" "David P. Storch"       906.449 "2014"
    "001004" "09249" "David P. Storch"        755.25 "2015"
    "001004" "09249" "David P. Storch"           835 "2016"
    "001004" "09249" "David P. Storch"           941 "2017"
    "001004" "09250" "Philip C. Slapke"          470 "2000"
    "001004" "09251" "Howard A. Pulsifer"        325 "2000"
    "001004" "09251" "Howard A. Pulsifer"      255.6 "2001"
    "001004" "09251" "Howard A. Pulsifer"      261.1 "2002"
    "001004" "09251" "Howard A. Pulsifer"    388.389 "2003"
    "001004" "09251" "Howard A. Pulsifer"    417.389 "2004"
    "001004" "09251" "Howard A. Pulsifer"    456.855 "2005"
    "001004" "09251" "Howard A. Pulsifer"      286.4 "2006"
    "001004" "09251" "Howard A. Pulsifer"    296.738 "2007"
    "001004" "09252" "Timothy J. Romenesko"      390 "2000"
    "001004" "09252" "Timothy J. Romenesko"    293.7 "2001"
    "001004" "09252" "Timothy J. Romenesko"    300.6 "2002"
    "001004" "09252" "Timothy J. Romenesko"  526.251 "2003"
    "001004" "09252" "Timothy J. Romenesko"  549.805 "2004"
    "001004" "09252" "Timothy J. Romenesko"   666.02 "2005"
    "001004" "09252" "Timothy J. Romenesko"      330 "2006"
    "001004" "09252" "Timothy J. Romenesko"      400 "2007"
    "001004" "09252" "Timothy J. Romenesko"      450 "2008"
    "001004" "09252" "Timothy J. Romenesko"    454.5 "2009"
    "001004" "09252" "Timothy J. Romenesko"   468.18 "2010"
    "001004" "09252" "Timothy J. Romenesko"  477.544 "2011"
    "001004" "09252" "Timothy J. Romenesko"  483.513 "2012"
    "001004" "09252" "Timothy J. Romenesko"  499.272 "2013"
    "001004" "09252" "Timothy J. Romenesko"  499.272 "2014"
    "001004" "09252" "Timothy J. Romenesko"   963.65 "2015"
    "001004" "09252" "Timothy J. Romenesko"    463.5 "2016"
    "001004" "09252" "Timothy J. Romenesko" 2784.869 "2017"
    "001004" "19999" "Michael J. Sharp"          204 "2000"
    "001004" "19999" "Michael J. Sharp"        169.3 "2001"
    "001004" "19999" "Michael J. Sharp"        171.3 "2002"
    "001004" "19999" "Michael J. Sharp"      312.576 "2012"
    "001004" "19999" "Michael J. Sharp"      360.353 "2013"
    "001004" "19999" "Michael J. Sharp"      382.418 "2015"
    "001004" "19999" "Michael J. Sharp"      227.692 "2016"
    "001004" "23780" "Joseph M. Gullion"       103.5 "2000"
    "001004" "23780" "Joseph M. Gullion"       391.9 "2001"
    end
    
    gen touse = !missing(Compensation)
    bysort touse Company Year (Compensation) : replace touse = touse & _n > _N - 5 
    
    bysort Company Year : egen wanted = total(touse * Compensation)
    by Company Year : egen count = total(touse)
    
    tabdisp Year Company, c(count wanted) cellwidth(20)
    
    
    --------------------------------
              |      Company_ID     
         Year |               001004
    ----------+---------------------
         2000 |                    5
              |                 2335
              | 
         2001 |                    5
              |               1775.9
              | 
         2002 |                    4
              |             1394.466
              | 
         2003 |                    3
              |              2072.14
              | 
         2004 |                    3
              |             2254.432
              | 
         2005 |                    3
              |             2880.526
              | 
         2006 |                    3
              |               1357.9
              | 
         2007 |                    3
              |             1464.986
              | 
         2008 |                    2
              |             1241.295
              | 
         2009 |                    2
              |             1253.708
              | 
         2010 |                    2
              |              1318.18
              | 
         2011 |                    2
              |             1344.544
              | 
         2012 |                    3
              |             1673.927
              | 
         2013 |                    3
              |             1766.074
              | 
         2014 |                    2
              |             1405.721
              | 
         2015 |                    3
              |             2101.318
              | 
         2016 |                    3
              |             1526.192
              | 
         2017 |                    2
              |             3725.869
    --------------------------------

    Comment


    • #3
      Code:
      bysort Company_ID Year: egen ranking = rank(-Compensation)
      bysort Company_ID Year: egen sum_top_5 = total(Compensation) if ranking <= 5
      collapse (mean) sum_top_5, by(Company_ID Year)
      Crossed with #2.

      Comment


      • #4
        @Ken Chui's method is good, indeed more straightforward than mine. It relies on rank() in egen ignoring missings, which is fine.

        Although it's unlikely with these data, ties in the ranking could be a minor problem with other variables. If so, the unique option is what will fix matters.

        I guess it does no harm and may be helpful with your real full data to keep track of whether there really are 5 values to summarize.
        Last edited by Nick Cox; 22 Dec 2021, 08:50.

        Comment


        • #5
          Both ways work just perfect, thank you very much.

          Comment

          Working...
          X