Sum highest values within subgroups and assign result to observation

Klaus Klausen

Join Date: Mar 2021
Posts: 72

Sum highest values within subgroups and assign result to observation

22 Dec 2021, 07:01

Hi,

I am quite lost with the following task and would really appreciate your help.

One of my datasets (dataset A) contains compensation data for CFOs in various years:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str6 Company_ID str5 Manager_ID str25 Name str4 Year double Compensation
"001004" "33979" "Rick Poulton"      "2007"      300
"001004" "33979" "Rick Poulton"      "2008"      330
"001004" "19999" "Mike Sharp"        "2015"  382.418
"001004" "19999" "Mike Sharp"        "2016"  227.692
"001050" ""      "Dennis Blazer"     "2007"        .
"001050" ""      "Dennis Blazer"     "2008"        .
"001056" "17853" "John Adamovich Jr" "2007"  823.534
"001076" "18039" "Gil Danielson"     "2003"      300
"001076" "18039" "Gil Danielson"     "2007"      400
"001076" "18039" "Gil Danielson"     "2008"      425
"001078" "24400" "Tom Freyman"       "2009"  914.461
"001078" "24400" "Tom Freyman"       "2010"  941.923
"001078" "24400" "Tom Freyman"       "2013"  969.748
"001078" "24400" "Tom Freyman"       "2014" 1012.604
"001161" "37980" "Thomas Seifert"    "2012"  494.826
"001177" "24433" "Alan Bennett"      "2003"  839.464
"001177" "24433" "Alan Bennett"      "2004" 1024.113
"001177" "24433" "Alan Bennett"      "2005"     1090
"001177" "29788" "Shawn Guertin"     "2014"  697.318
"001177" "13386" "Joseph Zubretsky"  "2010"  730.728
"001177" "13386" "Joseph Zubretsky"  "2011"      800
"001177" "13386" "Joseph Zubretsky"  "2012"   795.52
"001209" "25777" "John Owings"       "2002"  421.346
"001228" ""      "John Carlson"      "2001"        .
"001228" ""      "John Carlson"      "2006"        .
"001230" "41121" "Brandon Pedersen"  "2016"  390.769
"001254" "29075" "Chris Benjamin"    "2006"   293.75
"001254" "42393" "Joel Wine"         "2013"  442.839
"001254" "42393" "Joel Wine"         "2014"  454.887
"001266" ""      "W Humphrey"        "2014"        .
"001274" "42333" "Roger Gorham"      "2007"      510
"001274" "42333" "Roger Gorham"      "2011"      550
"001279" "25689" "Bruce Walenczyk"   "2001"  316.584
"001327" "32403" "Don Palette"       "2009"  327.692
"001327" "32403" "Don Palette"       "2011"    357.8
"001327" "32403" "Don Palette"       "2015"   418.75
"001327" "36045" "Kris Sennesael"    "2018"  456.366
"001382" ""      "Derek Kerr"        "2005"        .
"001408" "18591" "Craig Omtvedt"     "2006"      575
"001408" "42400" "Bob Probst"        "2012"   558.25
"001410" "34865" "Jim Lusk"          "2010"  465.313
"001410" "34865" "Jim Lusk"          "2012"  526.763
end

Whereas:

Company_ID: Unique identifier of the company
Name: Name of the Manager
Manager_ID: Unique identifier of the manager
Year: Reporting year of the compensation
Compensation: Salary + Bonus earned by the Manager in the respective Year and company.

The data on the managers' compensation come from a dataset B, which contains information on compensation for various members of the top management team (including CEO, COO, CFO...) of those companies included in dataset A.

Code:

clear
input str6 Company_ID str5 Manager_ID str50 Name double Compensation str4 Year
"001004" "09249" "David P. Storch"           946 "2000"
"001004" "09249" "David P. Storch"         665.4 "2001"
"001004" "09249" "David P. Storch"       661.466 "2002"
"001004" "09249" "David P. Storch"        1157.5 "2003"
"001004" "09249" "David P. Storch"      1287.238 "2004"
"001004" "09249" "David P. Storch"      1757.651 "2005"
"001004" "09249" "David P. Storch"         741.5 "2006"
"001004" "09249" "David P. Storch"       768.248 "2007"
"001004" "09249" "David P. Storch"       791.295 "2008"
"001004" "09249" "David P. Storch"       799.208 "2009"
"001004" "09249" "David P. Storch"           850 "2010"
"001004" "09249" "David P. Storch"           867 "2011"
"001004" "09249" "David P. Storch"       877.838 "2012"
"001004" "09249" "David P. Storch"       906.449 "2013"
"001004" "09249" "David P. Storch"       906.449 "2014"
"001004" "09249" "David P. Storch"        755.25 "2015"
"001004" "09249" "David P. Storch"           835 "2016"
"001004" "09249" "David P. Storch"           941 "2017"
"001004" "09250" "Philip C. Slapke"          470 "2000"
"001004" "09251" "Howard A. Pulsifer"        325 "2000"
"001004" "09251" "Howard A. Pulsifer"      255.6 "2001"
"001004" "09251" "Howard A. Pulsifer"      261.1 "2002"
"001004" "09251" "Howard A. Pulsifer"    388.389 "2003"
"001004" "09251" "Howard A. Pulsifer"    417.389 "2004"
"001004" "09251" "Howard A. Pulsifer"    456.855 "2005"
"001004" "09251" "Howard A. Pulsifer"      286.4 "2006"
"001004" "09251" "Howard A. Pulsifer"    296.738 "2007"
"001004" "09252" "Timothy J. Romenesko"      390 "2000"
"001004" "09252" "Timothy J. Romenesko"    293.7 "2001"
"001004" "09252" "Timothy J. Romenesko"    300.6 "2002"
"001004" "09252" "Timothy J. Romenesko"  526.251 "2003"
"001004" "09252" "Timothy J. Romenesko"  549.805 "2004"
"001004" "09252" "Timothy J. Romenesko"   666.02 "2005"
"001004" "09252" "Timothy J. Romenesko"      330 "2006"
"001004" "09252" "Timothy J. Romenesko"      400 "2007"
"001004" "09252" "Timothy J. Romenesko"      450 "2008"
"001004" "09252" "Timothy J. Romenesko"    454.5 "2009"
"001004" "09252" "Timothy J. Romenesko"   468.18 "2010"
"001004" "09252" "Timothy J. Romenesko"  477.544 "2011"
"001004" "09252" "Timothy J. Romenesko"  483.513 "2012"
"001004" "09252" "Timothy J. Romenesko"  499.272 "2013"
"001004" "09252" "Timothy J. Romenesko"  499.272 "2014"
"001004" "09252" "Timothy J. Romenesko"   963.65 "2015"
"001004" "09252" "Timothy J. Romenesko"    463.5 "2016"
"001004" "09252" "Timothy J. Romenesko" 2784.869 "2017"
"001004" "19999" "Michael J. Sharp"          204 "2000"
"001004" "19999" "Michael J. Sharp"        169.3 "2001"
"001004" "19999" "Michael J. Sharp"        171.3 "2002"
"001004" "19999" "Michael J. Sharp"      312.576 "2012"
"001004" "19999" "Michael J. Sharp"      360.353 "2013"
"001004" "19999" "Michael J. Sharp"      382.418 "2015"
"001004" "19999" "Michael J. Sharp"      227.692 "2016"
"001004" "23780" "Joseph M. Gullion"       103.5 "2000"
"001004" "23780" "Joseph M. Gullion"       391.9 "2001"
end

I would like to compare the compensation of each observation from Data Set A with the sum of the compensations of the five highest-paid managers from that company in the respective year: CFO_Compensation_c,_i,t/Sum(Compensation five highest paid managers in company)_c,t

The logic behind the code should be: For each observation in dataset A: Look for all observations in dataset B that are assigned to the Company_ID and the year of the corresponding observation -> Sum up the compensation of the five best paid managers (except of course the CFOs own compensation) and assign the result to the observation from A. If there are observations for less than 5 managers, calculate the sum based on the maximum possible observations (so 4, 3, or 2).

Using the following code I managed to calculate the total compensation paid per company in a given year in dataset B:

Code:

bysort Company_ID Year: egen totals = total(Compensation)
egen tag = tag(Company_ID Year)
gen Comp_Total = totals if tag == 1
drop tag drop totals

This sum however, includes the compensation for all Managers of a company in a given year and not only the 5 highest paid ones.

Any help is much appreciated
Happy holidays

Last edited by Klaus Klausen; 22 Dec 2021, 07:03.

Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35698

22 Dec 2021, 07:19

Thanks for your data example. My first thought was to get the 5 highest values after sorting first. But then I see you have missing values, so they need to be segregated. So, the idea is to set up an indicator for what should be added (or counted), which is then necessarily (not missing) & (in highest 5).

Code:

clear
input str6 Company_ID str5 Manager_ID str50 Name double Compensation str4 Year
"001004" "09249" "David P. Storch"           946 "2000"
"001004" "09249" "David P. Storch"         665.4 "2001"
"001004" "09249" "David P. Storch"       661.466 "2002"
"001004" "09249" "David P. Storch"        1157.5 "2003"
"001004" "09249" "David P. Storch"      1287.238 "2004"
"001004" "09249" "David P. Storch"      1757.651 "2005"
"001004" "09249" "David P. Storch"         741.5 "2006"
"001004" "09249" "David P. Storch"       768.248 "2007"
"001004" "09249" "David P. Storch"       791.295 "2008"
"001004" "09249" "David P. Storch"       799.208 "2009"
"001004" "09249" "David P. Storch"           850 "2010"
"001004" "09249" "David P. Storch"           867 "2011"
"001004" "09249" "David P. Storch"       877.838 "2012"
"001004" "09249" "David P. Storch"       906.449 "2013"
"001004" "09249" "David P. Storch"       906.449 "2014"
"001004" "09249" "David P. Storch"        755.25 "2015"
"001004" "09249" "David P. Storch"           835 "2016"
"001004" "09249" "David P. Storch"           941 "2017"
"001004" "09250" "Philip C. Slapke"          470 "2000"
"001004" "09251" "Howard A. Pulsifer"        325 "2000"
"001004" "09251" "Howard A. Pulsifer"      255.6 "2001"
"001004" "09251" "Howard A. Pulsifer"      261.1 "2002"
"001004" "09251" "Howard A. Pulsifer"    388.389 "2003"
"001004" "09251" "Howard A. Pulsifer"    417.389 "2004"
"001004" "09251" "Howard A. Pulsifer"    456.855 "2005"
"001004" "09251" "Howard A. Pulsifer"      286.4 "2006"
"001004" "09251" "Howard A. Pulsifer"    296.738 "2007"
"001004" "09252" "Timothy J. Romenesko"      390 "2000"
"001004" "09252" "Timothy J. Romenesko"    293.7 "2001"
"001004" "09252" "Timothy J. Romenesko"    300.6 "2002"
"001004" "09252" "Timothy J. Romenesko"  526.251 "2003"
"001004" "09252" "Timothy J. Romenesko"  549.805 "2004"
"001004" "09252" "Timothy J. Romenesko"   666.02 "2005"
"001004" "09252" "Timothy J. Romenesko"      330 "2006"
"001004" "09252" "Timothy J. Romenesko"      400 "2007"
"001004" "09252" "Timothy J. Romenesko"      450 "2008"
"001004" "09252" "Timothy J. Romenesko"    454.5 "2009"
"001004" "09252" "Timothy J. Romenesko"   468.18 "2010"
"001004" "09252" "Timothy J. Romenesko"  477.544 "2011"
"001004" "09252" "Timothy J. Romenesko"  483.513 "2012"
"001004" "09252" "Timothy J. Romenesko"  499.272 "2013"
"001004" "09252" "Timothy J. Romenesko"  499.272 "2014"
"001004" "09252" "Timothy J. Romenesko"   963.65 "2015"
"001004" "09252" "Timothy J. Romenesko"    463.5 "2016"
"001004" "09252" "Timothy J. Romenesko" 2784.869 "2017"
"001004" "19999" "Michael J. Sharp"          204 "2000"
"001004" "19999" "Michael J. Sharp"        169.3 "2001"
"001004" "19999" "Michael J. Sharp"        171.3 "2002"
"001004" "19999" "Michael J. Sharp"      312.576 "2012"
"001004" "19999" "Michael J. Sharp"      360.353 "2013"
"001004" "19999" "Michael J. Sharp"      382.418 "2015"
"001004" "19999" "Michael J. Sharp"      227.692 "2016"
"001004" "23780" "Joseph M. Gullion"       103.5 "2000"
"001004" "23780" "Joseph M. Gullion"       391.9 "2001"
end

gen touse = !missing(Compensation)
bysort touse Company Year (Compensation) : replace touse = touse & _n > _N - 5 

bysort Company Year : egen wanted = total(touse * Compensation)
by Company Year : egen count = total(touse)

tabdisp Year Company, c(count wanted) cellwidth(20)


--------------------------------
          |      Company_ID     
     Year |               001004
----------+---------------------
     2000 |                    5
          |                 2335
          | 
     2001 |                    5
          |               1775.9
          | 
     2002 |                    4
          |             1394.466
          | 
     2003 |                    3
          |              2072.14
          | 
     2004 |                    3
          |             2254.432
          | 
     2005 |                    3
          |             2880.526
          | 
     2006 |                    3
          |               1357.9
          | 
     2007 |                    3
          |             1464.986
          | 
     2008 |                    2
          |             1241.295
          | 
     2009 |                    2
          |             1253.708
          | 
     2010 |                    2
          |              1318.18
          | 
     2011 |                    2
          |             1344.544
          | 
     2012 |                    3
          |             1673.927
          | 
     2013 |                    3
          |             1766.074
          | 
     2014 |                    2
          |             1405.721
          | 
     2015 |                    3
          |             2101.318
          | 
     2016 |                    3
          |             1526.192
          | 
     2017 |                    2
          |             3725.869
--------------------------------

Comment

Ken Chui

Join Date: Aug 2014
Posts: 1058

22 Dec 2021, 07:24

Code:

bysort Company_ID Year: egen ranking = rank(-Compensation)
bysort Company_ID Year: egen sum_top_5 = total(Compensation) if ranking <= 5
collapse (mean) sum_top_5, by(Company_ID Year)

Crossed with #2.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

22 Dec 2021, 07:30

@Ken Chui's method is good, indeed more straightforward than mine. It relies on rank() in egen ignoring missings, which is fine.

Although it's unlikely with these data, ties in the ranking could be a minor problem with other variables. If so, the unique option is what will fix matters.

I guess it does no harm and may be helpful with your real full data to keep track of whether there really are 5 values to summarize.

Last edited by Nick Cox; 22 Dec 2021, 07:50.
2 likes
Comment
Klaus Klausen

Join Date: Mar 2021

Posts: 72
#5

23 Dec 2021, 01:18

Both ways work just perfect, thank you very much.
Comment

Announcement

Sum highest values within subgroups and assign result to observation

Comment

Comment

Comment

Comment