Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calculate total research grant income by investigator

    Hi Statalist.

    I'd like to calculate total research grant income between 2001 to 2023 for each investigator in my dataset - with particular interest in identifying the top grant income earners. Investigator names lie within two columns of data "lead investigator" and "other investigator" - while the former includes a single name, the latter includes the names of multiple investigators. A sample of my data follows:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long(id grant2) strL(lead_investigator other_investigator) int grant_year
    10000 509899 "Prof Ana Roz" "Prof Ana Roz; Prof Mar Pet; Prof Mic Ler; A/Prof Sus Gua; Dr Stu Geo" 2023
    10001 500627 "A/Prof Bry Bor"    "A/Prof Bry Bor; A/Prof Sal Tho; Prof Mar Ama; A/Prof Joe Hur" 2023                                        
    10002 684116 "Dr Ale Com"    "Dr Ale Com; Dr Loi Roy; Dr Jua Nun" 2023                                                                                  
    10003 266702 "Prof Man Che"       "Prof Man Che; Prof Ker Hum; Prof Ken Trot" 2022                                                                                    
    10004 465576 "Dr Geo Fri"         "Prof Ana Roz; Dr Geo Fri; Prof Dr Sus Geb" 2022                                                                                                      
    10005 364875 "Dr Kat Bak"       "A/Prof Bry Bor; Dr Kat Bak; Prof Ric Ric" 2021                                                                                                      
    10006 819627 "Prof And Bro"     "Prof And Bro; Dr Ale Com; Dr Jul Viv" 2021                                                                                                         
    10007 611728 "Prof Owe Atk"        "Prof Dr Gui Tch; Prof Owe Atk; Dr And Sca; Prof Gra Far; Prof Ste Sit" 2020                   
    10008 411315 "Dr Dee Adh"     "Prof Man Che; Dr Dee Adh; Prof Joh Car" 2020                                                                                                        
    10009 381647 "A/Prof Ivo Lab"       "Prof Joh Car; A/Prof Ivo Lab; Asst Prof Rac Bez" 2020                                                                                                  
    10010 500028 "Prof Gra Ste"      "Prof Gra Ste; Prof Qin Li" 2019                                                                                                              
    10011 423071 "Prof Roo Haz"    "Prof Roo Haz; Prof Ben Ste" 2019                                                                        
    10012 537513 "Prof Aks Tan"   "Prof Gra Ste; Prof Aks Tan; Prof Ala Cha; Prof Adr van" 2019                                                                   
    10013 398006 "Prof Vic Che"        "Prof Vic Che; Dr Rut Kni" 2019                                                                                                              
    10014 435281 "Prof Joh Car"      "Prof Vic Che; Prof Joh Car; Dr Dee Adh" 2019                                                                                                     
    end
    I attempted
    Code:
    bysort lead_investigator : egen grant_total = total(grant2)
    however, this does not do what I want. Appreciate some guidance on approach/code.
    Sincerely, Chris.

    (Running Stata SE 17.0.)

  • #2
    Code:
    egen total
    treats missings as zero. I assume you want
    Code:
    egen sum
    instead, assuming that the lead investigator is the cluster of interest to you.
    Best wishes

    (Stata 16.1 MP)

    Comment


    • #3
      Felix Bittmann Not so. sum() is just an undocumented synonym for total(). Here is the code for sum():

      Code:
      *! version 4.0.0  04oct2004
      program define _gsum
              // -egen- function -sum()- renamed to -total()- for Stata 9
      
              // no -version #- statement
              _gtotal `0'
      end
      sum() was renamed total() in Stata 9, to sharpen the difference between the function sum() -- which can be used with generate and gives cumulative or running sums -- and the egen function sum(), which gave and still gives ... totals. The original functionality of egen, sum() remains in order not to break ancient scripts -- or habits.

      At a guess, Chris needs to move in this direction:

      Code:
      split other_investigator, gen(wanted) parse(;)
      
      reshape long wanted, i(id) j(which)
      
      replace wanted = trim(wanted)
      
      drop if wanted == ""
      However, I suspect that Chris has more problems than this. Over the period between 2001 and 2023, people could easily change status and some could change their names and there could be many other inconistencies in spacing or spelling that Stata will take literally.
      Last edited by Nick Cox; 14 Apr 2025, 02:14.

      Comment


      • #4
        Thanks Nick Cox for the correction. TIL I am using totally outdated syntax.
        Best wishes

        (Stata 16.1 MP)

        Comment


        • #5
          Here is some code that might meet OP's needs. The only major addition to #3 is the use of regular expressions to eliminate the designation, which is the thing that is most likely to change over the years.

          Code:
          split other_investigator, gen(investigator) p(;)
          drop other_investigator
          rename lead_investigator investigator0
          reshape long investigator, i(id grant2) j(num)
          replace investigator = trim(itrim(investigator))
          drop if missing(investigator)
          
          gen investigator_name = ustrregexrf(investigator, "^(Prof Dr|Prof|Dr|A\/Prof|Asst Prof) ", "", 1)
          duplicates drop id investigator_name, force
          egen grant_total = total(grant2), by(investigator_name)
          which produces:

          Code:
          . egen tagged = tag(investigator_name)
          . sort investigator_name
          . li investigator_name grant_total if tagged, abbrev(20) noobs
          
            +---------------------------------+
            | investigator_name   grant_total |
            |---------------------------------|
            |           Adr van        537513 |
            |           Aks Tan        537513 |
            |           Ala Cha        537513 |
            |           Ale Com       1503743 |
            |           Ana Roz        975475 |
            |---------------------------------|
            |           And Bro        819627 |
            |           And Sca        611728 |
            |           Ben Ste        423071 |
            |           Bry Bor        865502 |
            |           Dee Adh        846596 |
            |---------------------------------|
            |           Geo Fri        465576 |
            |           Gra Far        611728 |
            |           Gra Ste       1037541 |
            |           Gui Tch        611728 |
            |           Ivo Lab        381647 |
            |---------------------------------|
            |           Joe Hur        500627 |
            |           Joh Car       1228243 |
            |           Jua Nun        684116 |
            |           Jul Viv        819627 |
            |           Kat Bak        364875 |
            |---------------------------------|
            |          Ken Trot        266702 |
            |           Ker Hum        266702 |
            |           Loi Roy        684116 |
            |           Man Che        678017 |
            |           Mar Ama        500627 |
            |---------------------------------|
            |           Mar Pet        509899 |
            |           Mic Ler        509899 |
            |           Owe Atk        611728 |
            |            Qin Li        500028 |
            |           Rac Bez        381647 |
            |---------------------------------|
            |           Ric Ric        364875 |
            |           Roo Haz        423071 |
            |           Rut Kni        398006 |
            |           Sal Tho        500627 |
            |           Ste Sit        611728 |
            |---------------------------------|
            |           Stu Geo        509899 |
            |           Sus Geb        465576 |
            |           Sus Gua        509899 |
            |           Vic Che        833287 |
            +---------------------------------+
          Last edited by Hemanshu Kumar; 14 Apr 2025, 22:44. Reason: Added a command to drop duplicates by id & investigator.

          Comment


          • #6
            Thank you Nick Cox for your code to separate out all the names in the 'other_investigator variable. And for you noting some other issues in the data I will need to address.

            First, I'd like to learn how to calculate total grant income per investigator - ensuring I count each investigator once per grant -id- (noting the 'lead investigator' is often also listed in the 'other_investigator').

            (Stata SE 17.0).

            Comment


            • #7
              Chris Boulis does #5 solve your problem? I have added a line for the de-duplication you mentioned in #6.

              Comment


              • #8
                Thank you Hemanshu Kumar for your code/help. In running this code, Stata reported the following error:
                Code:
                variable id does not uniquely identify the observations
                    Your data are currently wide. You are performing a reshape long. You specified
                    i(id grant2) and j(num). In the current wide form, variable id grant2 should
                    uniquely identify the observations. Remember this picture:
                         long                                wide
                        +---------------+                   +------------------+
                        | i   j   a   b |                   | i   a1 a2  b1 b2 |
                        |---------------| <--- reshape ---> |------------------|
                        | 1   1   1   2 |                   | 1   1   3   2  4 |
                        | 1   2   3   4 |                   | 2   5   7   6  8 |
                        | 2   1   5   6 |                   +------------------+
                        | 2   2   7   8 |
                        +---------------+
                    Type reshape error for a list of the problem observations.
                r(9);
                I did as requested, but did not identify any problem observations.
                I note there are over 31,000 observations (different grant ids) from between 2001 and 2023 in my dataset.
                Last edited by Chris Boulis; 15 Apr 2025, 01:23. Reason: Added note about observations.

                Comment


                • #9
                  Ah I see. It's easy to create a temporary variable to uniquely identify each observation. Here's the same code with the changes needed highlighted:

                  Code:
                  split other_investigator, gen(investigator) p(;)
                  drop other_investigator
                  rename lead_investigator investigator0
                  gen `c(obs_t)' x = _n
                  reshape long investigator, i(x id grant2) j(num)
                  replace investigator = trim(itrim(investigator))
                  drop if missing(investigator)
                  drop x
                  
                  gen investigator_name = ustrregexrf(investigator, "^(Prof Dr|Prof|Dr|A\/Prof|Asst Prof) ", "", 1)
                  duplicates drop id investigator_name, force
                  egen grant_total = total(grant2), by(investigator_name)
                  While this would be a good workaround for your problem, the error you got is worth investigating. There appear to be multiple observations with the same id and grant2 value. Is this okay? If those are not typos or outright errors, they may not bite for the task at hand (since we are de-duplicating anyway), but it may cause other unexpected problems in your overall code.

                  See
                  Code:
                  help duplicates
                  to investigate this issue.
                  Last edited by Hemanshu Kumar; 15 Apr 2025, 02:11.

                  Comment


                  • #10
                    Hi Hemanshu Kumar. There is and should only be one observation per grant id but it is possible there are multiple observations with the same grant value as these can sometimes be a set amount - depending on the grant type and scheme.

                    Thank you for the updated code - it ran to completion this time. Although I now appear to have duplicate grant id numbers (id). When I ran
                    Code:
                    duplicates list id
                    it shows a large number duplicated id numbers. I checked my dataset and it is fine prior to running the code. I appreciate your support. Regards, Chris. Stata SE 17.0.



                    Comment


                    • #11
                      Thank you for the updated code - it ran to completion this time. Although I now appear to have duplicate grant id numbers (id).
                      I wouldn't be worried about that. The data has been reshaped to enable us to easily add up the grant values -- there is now an observation for each investigator of each grant. So of course a grant will have as many observations as it has unique investigators.

                      There is and should only be one observation per grant id but it is possible there are multiple observations with the same grant value
                      If that had been the case, reshape should not have complained as you reported in #8. reshape is looking for id grant2 to be jointly unique for an observation, not separately. Do check thoroughly what is going on.

                      Comment


                      • #12
                        I'll clarify. My dataset contains over 30,000 observations between 2001 and 2023. In each observation, I have a range of information, including grant amount (grant2), grant id (id), the lead investigator and other investigators. I am comfortable with the duplications based on your explanation as there will be a repeat of each grant id to coincide with the number of investigators, which will allow me to estimate the investigators that have accumulated the most grant income.

                        To that end, I'd like to list the top 20 grant income earning investigators - will the following do that?
                        Code:
                        list investigator_name grant_total in -20/1 if tagged, abbrev(20) noobs

                        Comment


                        • #13
                          You should do
                          Code:
                          gsort -tagged -grant_total
                          list investigator_name grant_total in 1/20 if tagged, abbrev(20) noobs
                          Though do be mindful that since you have several people with identical grant sizes, some of those left out of the top 20 might have the same grant amounts as the last few that are included.
                          Last edited by Hemanshu Kumar; 16 Apr 2025, 08:15.

                          Comment


                          • #14
                            Thank you Hemanshu Kumar. Yes I understand that - though I do not believe it will not be an issue for this part of the analysis, which is to find those researchers with the highest grant income. I will keep that in mind with other work that I do. Kind regards, Chris.

                            Comment

                            Working...
                            X