  • Calculating cumulative h-index with Stata?

    Hello statalisters,

    I've been trying to calculate the h-index for a large dataset of scientists. The h-index is defined as the maximum value of h such that the given author/journal has published h papers that have each been cited at least h times. The dataset looks like this:

    author_id year article_id citation hindex c_hindex
    A 1990 1 7
    A 1990 2 5
    A 1990 3 13
    A 1990 4 12
    A 1990 5 17
    A 1991 6 11
    A 1991 7 9
    A 1991 8 19
    A 1991 9 15
    A 1992 10 14
    A 1992 11 4
    A 1992 12 3
    A 1992 13 7
    A 1992 14 5
    A 1992 15 4
    A 1992 16 11

    With a little help from the Statalist archive (https://www.stata.com/statalist/arch.../msg00625.html), I could calculate the h-index for each author-year (hindex, column 5) using the following commands:

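    * Unique ranks (1 = most cited); ties are split arbitrarily, which does not change h
    bysort author_id year : egen rank = rank(-citation), unique
    * h is the largest rank whose paper has at least that many citations
    bysort author_id year : egen hindextemp = max(rank) if citation >= rank
    bysort author_id year : egen hindex = max(hindextemp)
    replace hindex = 0 if missing(hindex)
    drop rank hindextemp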


    What I'm having a hard time with is calculating the cumulative h-index for each author-year (c_hindex, column 6).

    For instance, counting all articles published from 1990 to 1991, the 7 most-cited articles each have at least 7 citations, while there are not 8 articles with at least 8 citations; therefore the cumulative h-index for A in 1991 would be 7. As of 1992, the cumulative h-index would be 9.
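
    To make the arithmetic above concrete, here is a minimal sketch that checks the 1991 value by hand. It assumes the table above is loaded with the variable names author_id, year, and citation; the variable name rank is only for illustration.

    ```stata
    * Sketch: cumulative h-index for author A through 1991, computed by hand
    preserve
    keep if author_id == "A" & year <= 1991
    gsort -citation                      // most-cited paper first
    gen rank = _n                        // 1 = most cited
    * h is the number of papers whose citation count is at least their rank
    count if citation >= rank & !missing(citation)
    display "cumulative h-index through 1991: " r(N)   // 7 for the example data
    restore
    ```

    Changing the cutoff to year <= 1992 should give 9 for the same data.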

    Could anybody help me with the command to generate the cumulative h-index? Thank you very much in advance!

    Hyeonjin

  • #2
    I believe the following does what you want:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str1 author_id int year byte(article_id citation)
    "A" 1990  1  7
    "A" 1990  2  5
    "A" 1990  3 13
    "A" 1990  4 12
    "A" 1990  5 17
    "A" 1991  6 11
    "A" 1991  7  9
    "A" 1991  8 19
    "A" 1991  9 15
    "A" 1992 10 14
    "A" 1992 11  4
    "A" 1992 12  3
    "A" 1992 13  7
    "A" 1992 14  5
    "A" 1992 15  4
    "A" 1992 16 11
    end
    
    
    capture program drop one_author
    program define one_author
        gsort -citation
        // with papers ranked from most to least cited, h is the number of
        // papers whose citation count is at least their rank (_n)
        gen byte indexable = (citation >= _n) & !missing(citation)
        egen index = total(indexable)
        drop indexable
        keep in L
    end
    
    
    //  CALCULATE CUMULATIVE H-INDEX FOR EACH AUTHOR
    rangerun one_author, by(author_id) interval(year . 0)
    rename index c_hindex
    To use this code you need the -rangerun- program, written by Robert Picard and available from SSC. To use -rangerun- you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
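
    For reference, both packages can be installed from SSC before running the code above. This is only a sketch; it assumes an internet connection and write access to your ado directory.

    ```stata
    * rangerun depends on rangestat, so install both
    ssc install rangestat
    ssc install rangerun
    ```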


    In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
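
    As a sketch of what that looks like for the data in this thread (assuming the variables are named as in the example above), a single command produces the copy-and-paste-ready listing:

    ```stata
    * Print a pasteable example of the listed variables
    * (by default -dataex- shows up to the first 100 observations)
    dataex author_id year article_id citation
    ```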






    • #3
      Inspired by sensei Clyde Schechter's solution, which shows off the beauty of the wonderful -rangerun- package, I would like to contribute a shorter route.
      Code:
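      capture program drop one_author2
      program define one_author2
          // unique ranks, 1 = most cited; field ranks would give tied papers
          // the same rank and understate h when ties straddle the cutoff
          egen f = rank(-citation), unique
          egen c = max(f*(citation >= f))
          drop f
      end
      
      rangerun one_author2, by(author_id) interval(year . 0)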
      Notice that the h-index for each author-year (as in your original post) can be computed with the same mechanism.
      Code:
      bysort author_id year: egen f2 = rank(-citation), unique
      bysort author_id year: egen h = max(f2*(citation >= f2))
      drop f2
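
      Tied citation counts are worth a quick check, since the example data has no tie at the h-index cutoff. The following is a minimal sketch on a made-up four-paper dataset (not from the original post): four papers with 3 citations each should have h = 3. The variable names rank and h are only illustrative.

      ```stata
      * Hypothetical toy data: four papers, each cited 3 times
      clear
      input byte citation
      3
      3
      3
      3
      end

      * Unique ranks run 1..4 even though all values are tied
      egen rank = rank(-citation), unique
      * Largest rank whose paper has at least that many citations
      egen h = max(rank * (citation >= rank))
      assert h == 3
      ```

      With field ranks (egen's field option) all four papers would share rank 1, so the same maximum would come out as 1 rather than 3.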
