Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calculate total research grant income by investigator

    Hi Statalist.

    I'd like to calculate total research grant income between 2001 to 2023 for each investigator in my dataset - with particular interest in identifying the top grant income earners. Investigator names lie within two columns of data "lead investigator" and "other investigator" - while the former includes a single name, the latter includes the names of multiple investigators. A sample of my data follows:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long(id grant2) strL(lead_investigator other_investigator) int grant_year
    10000 509899 "Prof Ana Roz" "Prof Ana Roz; Prof Mar Pet; Prof Mic Ler; A/Prof Sus Gua; Dr Stu Geo" 2023
    10001 500627 "A/Prof Bry Bor"    "A/Prof Bry Bor; A/Prof Sal Tho; Prof Mar Ama; A/Prof Joe Hur" 2023                                        
    10002 684116 "Dr Ale Com"    "Dr Ale Com; Dr Loi Roy; Dr Jua Nun" 2023                                                                                  
    10003 266702 "Prof Man Che"       "Prof Man Che; Prof Ker Hum; Prof Ken Trot" 2022                                                                                    
    10004 465576 "Dr Geo Fri"         "Prof Ana Roz; Dr Geo Fri; Prof Dr Sus Geb" 2022                                                                                                      
    10005 364875 "Dr Kat Bak"       "A/Prof Bry Bor; Dr Kat Bak; Prof Ric Ric" 2021                                                                                                      
    10006 819627 "Prof And Bro"     "Prof And Bro; Dr Ale Com; Dr Jul Viv" 2021                                                                                                         
    10007 611728 "Prof Owe Atk"        "Prof Dr Gui Tch; Prof Owe Atk; Dr And Sca; Prof Gra Far; Prof Ste Sit" 2020                   
    10008 411315 "Dr Dee Adh"     "Prof Man Che; Dr Dee Adh; Prof Joh Car" 2020                                                                                                        
    10009 381647 "A/Prof Ivo Lab"       "Prof Joh Car; A/Prof Ivo Lab; Asst Prof Rac Bez" 2020                                                                                                  
    10010 500028 "Prof Gra Ste"      "Prof Gra Ste; Prof Qin Li" 2019                                                                                                              
    10011 423071 "Prof Roo Haz"    "Prof Roo Haz; Prof Ben Ste" 2019                                                                        
    10012 537513 "Prof Aks Tan"   "Prof Gra Ste; Prof Aks Tan; Prof Ala Cha; Prof Adr van" 2019                                                                   
    10013 398006 "Prof Vic Che"        "Prof Vic Che; Dr Rut Kni" 2019                                                                                                              
    10014 435281 "Prof Joh Car"      "Prof Vic Che; Prof Joh Car; Dr Dee Adh" 2019                                                                                                     
    end
    I attempted
    Code:
    bysort lead_investigator : egen grant_total = total(grant2)
    however, this does not do what I want. Appreciate some guidance on approach/code.
    Sincerely, Chris.

    (Running Stata SE 17.0.)

  • #2
    Code:
    egen total
    treats missings as zero. I assume you want
    Code:
    egen sum
    instead, assuming that the lead investigator is the cluster of interest to you.
    Best wishes

    (Stata 16.1 MP)

    Comment


    • #3
      Felix Bittmann Not so. sum() is just an undocumented synonym for total(). Here is the code for sum():

      Code:
      *! version 4.0.0  04oct2004
      program define _gsum
              // -egen- function -sum()- renamed to -total()- for Stata 9
      
              // no -version #- statement
              _gtotal `0'
      end
      sum() was renamed total() in Stata 9, to sharpen the difference between the function sum() -- which can be used with generate and gives cumulative or running sums -- and the egen function sum(), which gave and still gives ... totals. The original functionality of egen, sum() remains in order not to break ancient scripts -- or habits.

      At a guess, Chris needs to move in this direction:

      Code:
      split other_investigator, gen(wanted) parse(;)
      
      reshape long wanted, i(id) j(which)
      
      replace wanted = trim(wanted)
      
      drop if wanted == ""
      However, I suspect that Chris has more problems than this. Over the period between 2001 and 2023, people could easily change status and some could change their names and there could be many other inconistencies in spacing or spelling that Stata will take literally.
      Last edited by Nick Cox; Yesterday, 02:14.

      Comment


      • #4
        Thanks Nick Cox for the correction. TIL I am using totally outdated syntax.
        Best wishes

        (Stata 16.1 MP)

        Comment


        • #5
          Here is some code that might meet OP's needs. The only major addition to #3 is the use of regular expressions to eliminate the designation, which is the thing that is most likely to change over the years.

          Code:
          split other_investigator, gen(investigator) p(;)
          drop other_investigator
          rename lead_investigator investigator0
          reshape long investigator, i(id grant2) j(num)
          replace investigator = trim(itrim(investigator))
          drop if missing(investigator)
          
          gen investigator_name = ustrregexrf(investigator, "^(Prof Dr|Prof|Dr|A\/Prof|Asst Prof) ", "", 1)
          duplicates drop id investigator_name, force
          egen grant_total = total(grant2), by(investigator_name)
          which produces:

          Code:
          . egen tagged = tag(investigator_name)
          . sort investigator_name
          . li investigator_name grant_total if tagged, abbrev(20) noobs
          
            +---------------------------------+
            | investigator_name   grant_total |
            |---------------------------------|
            |           Adr van        537513 |
            |           Aks Tan        537513 |
            |           Ala Cha        537513 |
            |           Ale Com       1503743 |
            |           Ana Roz        975475 |
            |---------------------------------|
            |           And Bro        819627 |
            |           And Sca        611728 |
            |           Ben Ste        423071 |
            |           Bry Bor        865502 |
            |           Dee Adh        846596 |
            |---------------------------------|
            |           Geo Fri        465576 |
            |           Gra Far        611728 |
            |           Gra Ste       1037541 |
            |           Gui Tch        611728 |
            |           Ivo Lab        381647 |
            |---------------------------------|
            |           Joe Hur        500627 |
            |           Joh Car       1228243 |
            |           Jua Nun        684116 |
            |           Jul Viv        819627 |
            |           Kat Bak        364875 |
            |---------------------------------|
            |          Ken Trot        266702 |
            |           Ker Hum        266702 |
            |           Loi Roy        684116 |
            |           Man Che        678017 |
            |           Mar Ama        500627 |
            |---------------------------------|
            |           Mar Pet        509899 |
            |           Mic Ler        509899 |
            |           Owe Atk        611728 |
            |            Qin Li        500028 |
            |           Rac Bez        381647 |
            |---------------------------------|
            |           Ric Ric        364875 |
            |           Roo Haz        423071 |
            |           Rut Kni        398006 |
            |           Sal Tho        500627 |
            |           Ste Sit        611728 |
            |---------------------------------|
            |           Stu Geo        509899 |
            |           Sus Geb        465576 |
            |           Sus Gua        509899 |
            |           Vic Che        833287 |
            +---------------------------------+
          Last edited by Hemanshu Kumar; Yesterday, 22:44. Reason: Added a command to drop duplicates by id & investigator.

          Comment


          • #6
            Thank you Nick Cox for your code to separate out all the names in the 'other_investigator variable. And for you noting some other issues in the data I will need to address.

            First, I'd like to learn how to calculate total grant income per investigator - ensuring I count each investigator once per grant -id- (noting the 'lead investigator' is often also listed in the 'other_investigator').

            (Stata SE 17.0).

            Comment


            • #7
              Chris Boulis does #5 solve your problem? I have added a line for the de-duplication you mentioned in #6.

              Comment


              • #8
                Thank you Hemanshu Kumar for your code/help. In running this code, Stata reported the following error:
                Code:
                variable id does not uniquely identify the observations
                    Your data are currently wide. You are performing a reshape long. You specified
                    i(id grant2) and j(num). In the current wide form, variable id grant2 should
                    uniquely identify the observations. Remember this picture:
                         long                                wide
                        +---------------+                   +------------------+
                        | i   j   a   b |                   | i   a1 a2  b1 b2 |
                        |---------------| <--- reshape ---> |------------------|
                        | 1   1   1   2 |                   | 1   1   3   2  4 |
                        | 1   2   3   4 |                   | 2   5   7   6  8 |
                        | 2   1   5   6 |                   +------------------+
                        | 2   2   7   8 |
                        +---------------+
                    Type reshape error for a list of the problem observations.
                r(9);
                I did as requested, but did not identify any problem observations.
                I note there are over 31,000 observations (different grant ids) from between 2001 and 2023 in my dataset.
                Last edited by Chris Boulis; Today, 01:23. Reason: Added note about observations.

                Comment


                • #9
                  Ah I see. It's easy to create a temporary variable to uniquely identify each observation. Here's the same code with the changes needed highlighted:

                  Code:
                  split other_investigator, gen(investigator) p(;)
                  drop other_investigator
                  rename lead_investigator investigator0
                  gen `c(obs_t)' x = _n
                  reshape long investigator, i(x id grant2) j(num)
                  replace investigator = trim(itrim(investigator))
                  drop if missing(investigator)
                  drop x
                  
                  gen investigator_name = ustrregexrf(investigator, "^(Prof Dr|Prof|Dr|A\/Prof|Asst Prof) ", "", 1)
                  duplicates drop id investigator_name, force
                  egen grant_total = total(grant2), by(investigator_name)
                  While this would be a good workaround for your problem, the error you got is worth investigating. There appear to be multiple observations with the same id and grant2 value. Is this okay? If those are not typos or outright errors, they may not bite for the task at hand (since we are de-duplicating anyway), but it may cause other unexpected problems in your overall code.

                  See
                  Code:
                  help duplicates
                  to investigate this issue.
                  Last edited by Hemanshu Kumar; Today, 02:11.

                  Comment

                  Working...
                  X