Calculate total research grant income by investigator

Chris Boulis

Join Date: Feb 2019
Posts: 368

Calculate total research grant income by investigator

14 Apr 2025, 00:31

Hi Statalist.

I'd like to calculate total research grant income between 2001 to 2023 for each investigator in my dataset - with particular interest in identifying the top grant income earners. Investigator names lie within two columns of data "lead investigator" and "other investigator" - while the former includes a single name, the latter includes the names of multiple investigators. A sample of my data follows:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long(id grant2) strL(lead_investigator other_investigator) int grant_year
10000 509899 "Prof Ana Roz" "Prof Ana Roz; Prof Mar Pet; Prof Mic Ler; A/Prof Sus Gua; Dr Stu Geo" 2023
10001 500627 "A/Prof Bry Bor"    "A/Prof Bry Bor; A/Prof Sal Tho; Prof Mar Ama; A/Prof Joe Hur" 2023                                        
10002 684116 "Dr Ale Com"    "Dr Ale Com; Dr Loi Roy; Dr Jua Nun" 2023                                                                                  
10003 266702 "Prof Man Che"       "Prof Man Che; Prof Ker Hum; Prof Ken Trot" 2022                                                                                    
10004 465576 "Dr Geo Fri"         "Prof Ana Roz; Dr Geo Fri; Prof Dr Sus Geb" 2022                                                                                                      
10005 364875 "Dr Kat Bak"       "A/Prof Bry Bor; Dr Kat Bak; Prof Ric Ric" 2021                                                                                                      
10006 819627 "Prof And Bro"     "Prof And Bro; Dr Ale Com; Dr Jul Viv" 2021                                                                                                         
10007 611728 "Prof Owe Atk"        "Prof Dr Gui Tch; Prof Owe Atk; Dr And Sca; Prof Gra Far; Prof Ste Sit" 2020                   
10008 411315 "Dr Dee Adh"     "Prof Man Che; Dr Dee Adh; Prof Joh Car" 2020                                                                                                        
10009 381647 "A/Prof Ivo Lab"       "Prof Joh Car; A/Prof Ivo Lab; Asst Prof Rac Bez" 2020                                                                                                  
10010 500028 "Prof Gra Ste"      "Prof Gra Ste; Prof Qin Li" 2019                                                                                                              
10011 423071 "Prof Roo Haz"    "Prof Roo Haz; Prof Ben Ste" 2019                                                                        
10012 537513 "Prof Aks Tan"   "Prof Gra Ste; Prof Aks Tan; Prof Ala Cha; Prof Adr van" 2019                                                                   
10013 398006 "Prof Vic Che"        "Prof Vic Che; Dr Rut Kni" 2019                                                                                                              
10014 435281 "Prof Joh Car"      "Prof Vic Che; Prof Joh Car; Dr Dee Adh" 2019                                                                                                     
end

I attempted

Code:

bysort lead_investigator : egen grant_total = total(grant2)

however, this does not do what I want. Appreciate some guidance on approach/code.
Sincerely, Chris.

(Running Stata SE 17.0.)

Tags: None

Felix Bittmann

Join Date: Aug 2018

Posts: 694
#2

14 Apr 2025, 00:37

Code:

egen total

treats missings as zero. I assume you want

Code:

egen sum

instead, assuming that the lead investigator is the cluster of interest to you.

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#3

14 Apr 2025, 02:03

Felix Bittmann Not so. sum() is just an undocumented synonym for total(). Here is the code for sum():

Code:

*! version 4.0.0 04oct2004 program define _gsum // -egen- function -sum()- renamed to -total()- for Stata 9 // no -version #- statement _gtotal `0' end

sum() was renamed total() in Stata 9, to sharpen the difference between the function sum() -- which can be used with generate and gives cumulative or running sums -- and the egen function sum(), which gave and still gives ... totals. The original functionality of egen, sum() remains in order not to break ancient scripts -- or habits.

At a guess, Chris needs to move in this direction:

Code:

split other_investigator, gen(wanted) parse(;) reshape long wanted, i(id) j(which) replace wanted = trim(wanted) drop if wanted == ""

However, I suspect that Chris has more problems than this. Over the period between 2001 and 2023, people could easily change status and some could change their names and there could be many other inconistencies in spacing or spelling that Stata will take literally.

Last edited by Nick Cox; 14 Apr 2025, 02:14.
1 like
Comment
Felix Bittmann

Join Date: Aug 2018

Posts: 694
#4

14 Apr 2025, 02:08

Thanks Nick Cox for the correction. TIL I am using totally outdated syntax.

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1398

14 Apr 2025, 21:44

Here is some code that might meet OP's needs. The only major addition to #3 is the use of regular expressions to eliminate the designation, which is the thing that is most likely to change over the years.

Code:

split other_investigator, gen(investigator) p(;)
drop other_investigator
rename lead_investigator investigator0
reshape long investigator, i(id grant2) j(num)
replace investigator = trim(itrim(investigator))
drop if missing(investigator)

gen investigator_name = ustrregexrf(investigator, "^(Prof Dr|Prof|Dr|A\/Prof|Asst Prof) ", "", 1)
duplicates drop id investigator_name, force
egen grant_total = total(grant2), by(investigator_name)

which produces:

Code:

. egen tagged = tag(investigator_name)
. sort investigator_name
. li investigator_name grant_total if tagged, abbrev(20) noobs

  +---------------------------------+
  | investigator_name   grant_total |
  |---------------------------------|
  |           Adr van        537513 |
  |           Aks Tan        537513 |
  |           Ala Cha        537513 |
  |           Ale Com       1503743 |
  |           Ana Roz        975475 |
  |---------------------------------|
  |           And Bro        819627 |
  |           And Sca        611728 |
  |           Ben Ste        423071 |
  |           Bry Bor        865502 |
  |           Dee Adh        846596 |
  |---------------------------------|
  |           Geo Fri        465576 |
  |           Gra Far        611728 |
  |           Gra Ste       1037541 |
  |           Gui Tch        611728 |
  |           Ivo Lab        381647 |
  |---------------------------------|
  |           Joe Hur        500627 |
  |           Joh Car       1228243 |
  |           Jua Nun        684116 |
  |           Jul Viv        819627 |
  |           Kat Bak        364875 |
  |---------------------------------|
  |          Ken Trot        266702 |
  |           Ker Hum        266702 |
  |           Loi Roy        684116 |
  |           Man Che        678017 |
  |           Mar Ama        500627 |
  |---------------------------------|
  |           Mar Pet        509899 |
  |           Mic Ler        509899 |
  |           Owe Atk        611728 |
  |            Qin Li        500028 |
  |           Rac Bez        381647 |
  |---------------------------------|
  |           Ric Ric        364875 |
  |           Roo Haz        423071 |
  |           Rut Kni        398006 |
  |           Sal Tho        500627 |
  |           Ste Sit        611728 |
  |---------------------------------|
  |           Stu Geo        509899 |
  |           Sus Geb        465576 |
  |           Sus Gua        509899 |
  |           Vic Che        833287 |
  +---------------------------------+

Last edited by Hemanshu Kumar; 14 Apr 2025, 22:44. Reason: Added a command to drop duplicates by id & investigator.

Comment

Chris Boulis

Join Date: Feb 2019

Posts: 368
#6

14 Apr 2025, 22:23

Thank you Nick Cox for your code to separate out all the names in the 'other_investigator variable. And for you noting some other issues in the data I will need to address.

First, I'd like to learn how to calculate total grant income per investigator - ensuring I count each investigator once per grant -id- (noting the 'lead investigator' is often also listed in the 'other_investigator').

(Stata SE 17.0).
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1398
#7

14 Apr 2025, 22:45

Chris Boulis does #5 solve your problem? I have added a line for the de-duplication you mentioned in #6.
2 likes
Comment

Chris Boulis

Join Date: Feb 2019
Posts: 368

15 Apr 2025, 01:10

Thank you Hemanshu Kumar for your code/help. In running this code, Stata reported the following error:

Code:

variable id does not uniquely identify the observations
    Your data are currently wide. You are performing a reshape long. You specified
    i(id grant2) and j(num). In the current wide form, variable id grant2 should
    uniquely identify the observations. Remember this picture:
         long                                wide
        +---------------+                   +------------------+
        | i   j   a   b |                   | i   a1 a2  b1 b2 |
        |---------------| <--- reshape ---> |------------------|
        | 1   1   1   2 |                   | 1   1   3   2  4 |
        | 1   2   3   4 |                   | 2   5   7   6  8 |
        | 2   1   5   6 |                   +------------------+
        | 2   2   7   8 |
        +---------------+
    Type reshape error for a list of the problem observations.
r(9);

I did as requested, but did not identify any problem observations.
I note there are over 31,000 observations (different grant ids) from between 2001 and 2023 in my dataset.

Last edited by Chris Boulis; 15 Apr 2025, 01:23. Reason: Added note about observations.

Comment

Hemanshu Kumar

Join Date: Mar 2015

Posts: 1398
#9

15 Apr 2025, 02:05

Ah I see. It's easy to create a temporary variable to uniquely identify each observation. Here's the same code with the changes needed highlighted:

Code:

split other_investigator, gen(investigator) p(;) drop other_investigator rename lead_investigator investigator0 gen `c(obs_t)' x = _n reshape long investigator, i(x id grant2) j(num) replace investigator = trim(itrim(investigator)) drop if missing(investigator) drop x gen investigator_name = ustrregexrf(investigator, "^(Prof Dr|Prof|Dr|A\/Prof|Asst Prof) ", "", 1) duplicates drop id investigator_name, force egen grant_total = total(grant2), by(investigator_name)

While this would be a good workaround for your problem, the error you got is worth investigating. There appear to be multiple observations with the same id and grant2 value. Is this okay? If those are not typos or outright errors, they may not bite for the task at hand (since we are de-duplicating anyway), but it may cause other unexpected problems in your overall code.

See

Code:

help duplicates

to investigate this issue.

Last edited by Hemanshu Kumar; 15 Apr 2025, 02:11.
1 like
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#10

16 Apr 2025, 00:13

Hi Hemanshu Kumar. There is and should only be one observation per grant id but it is possible there are multiple observations with the same grant value as these can sometimes be a set amount - depending on the grant type and scheme.

Thank you for the updated code - it ran to completion this time. Although I now appear to have duplicate grant id numbers (id). When I ran

Code:

duplicates list id

it shows a large number duplicated id numbers. I checked my dataset and it is fine prior to running the code. I appreciate your support. Regards, Chris. Stata SE 17.0.
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1398
#11

16 Apr 2025, 02:03

Thank you for the updated code - it ran to completion this time. Although I now appear to have duplicate grant id numbers (id).

I wouldn't be worried about that. The data has been reshaped to enable us to easily add up the grant values -- there is now an observation for each investigator of each grant. So of course a grant will have as many observations as it has unique investigators.

There is and should only be one observation per grant id but it is possible there are multiple observations with the same grant value

If that had been the case, reshape should not have complained as you reported in #8. reshape is looking for id grant2 to be jointly unique for an observation, not separately. Do check thoroughly what is going on.
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#12

16 Apr 2025, 07:16

I'll clarify. My dataset contains over 30,000 observations between 2001 and 2023. In each observation, I have a range of information, including grant amount (grant2), grant id (id), the lead investigator and other investigators. I am comfortable with the duplications based on your explanation as there will be a repeat of each grant id to coincide with the number of investigators, which will allow me to estimate the investigators that have accumulated the most grant income.

To that end, I'd like to list the top 20 grant income earning investigators - will the following do that?

Code:

list investigator_name grant_total in -20/1 if tagged, abbrev(20) noobs
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1398
#13

16 Apr 2025, 08:13

You should do

Code:

gsort -tagged -grant_total list investigator_name grant_total in 1/20 if tagged, abbrev(20) noobs

Though do be mindful that since you have several people with identical grant sizes, some of those left out of the top 20 might have the same grant amounts as the last few that are included.

Last edited by Hemanshu Kumar; 16 Apr 2025, 08:15.
1 like
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#14

16 Apr 2025, 08:54

Thank you Hemanshu Kumar. Yes I understand that - though I do not believe it will not be an issue for this part of the analysis, which is to find those researchers with the highest grant income. I will keep that in mind with other work that I do. Kind regards, Chris.
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#15

24 Apr 2025, 01:07

Hi Hemanshu Kumar. In addition to reporting the total amount of grant money by investigator, I would also like to report the number of grants received by each investigator. Making some amendments to your earlier code, I tried the following but received no output

Code:

egen nos_grant = count(grant_id), by(investigator_name) egen tag_nos = tag(investigator_name) gsort -tag_nos -nos_grant format %16.0gc nos_grant list investigator_name nos_grant in 1/20 if tag_type, abbrev(20) noobs

I'd appreciate some help/guidance on approach please.
Kind regards, Chris (Stata/SE 17.0)
Comment

Announcement

Calculate total research grant income by investigator

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment