Dummy variables for table values for percentile

Laksh Noori

Join Date: Dec 2018

Posts: 23
#1

Dummy variables for table values for percentile

02 Dec 2018, 13:27

Hey guys,

I want to create one dummy variable which assigns a value of 1 for values above 90 percentile cites for each journal.
What I mean is that I want a dummy variable = ( if journal = "A", cites >107 AND if journal = "B", cites >92 AND if journal = "C", cites >55 AND if journal = "D", cites >51 AND if journal = "E", cites >106)
How do I do it?

journal N(cites) p90(cites)
----
A 152 107
B 227 92
C 181 55
D 156 51
E 144 106
----

Thanks!
Tags: None

David Benson

Join Date: Oct 2018
Posts: 489

02 Dec 2018, 14:14

See if this works. I created some toy data, so my p90 won't match yours.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 journal byte article int cites
"A" 1  38
"A" 2 119
"A" 3  76
"A" 4  34
"A" 5  73
"A" 6  89
"A" 7 210
"A" 8 164
"B" 1 189
"B" 2 176
"B" 3 148
"B" 4 130
"B" 5  26
"C" 1 124
"C" 2 182
"C" 3 139
end

Code:

. table journal, c(N cites p90 cites)

----------------------------------
  journal |   N(cites)  p90(cites)
----------+-----------------------
        A |          8         210
        B |          5         189
        C |          3         182
----------------------------------

bysort journal: egen journal_90 = pctile(cites), p(90)
gen article_90 = 0
replace article_90 = 1 if cites >= journal_90 & cites!=.  // these two could be combined into single line to create indicator variable
label var article_90 "1 if article in 90th percentile in cites for journal"

. list journal article cites journal_90 article_90, sepby(journal)

     +-------------------------------------------------+
     | journal   article   cites   journ~90   artic~90 |
     |-------------------------------------------------|
  1. |       A         1      38        210          0 |
  2. |       A         2     119        210          0 |
  3. |       A         3      76        210          0 |
  4. |       A         4      34        210          0 |
  5. |       A         5      73        210          0 |
  6. |       A         6      89        210          0 |
  7. |       A         7     210        210          1 |
  8. |       A         8     164        210          0 |
     |-------------------------------------------------|
  9. |       B         1     189        189          1 |
 10. |       B         2     176        189          0 |
 11. |       B         3     148        189          0 |
 12. |       B         4     130        189          0 |
 13. |       B         5      26        189          0 |
     |-------------------------------------------------|
 14. |       C         1     124        182          0 |
 15. |       C         2     182        182          1 |
 16. |       C         3     139        182          0 |
     +-------------------------------------------------+

.

Comment

Laksh Noori

Join Date: Dec 2018

Posts: 23
#3

02 Dec 2018, 15:19

Thank you David! It works perfectly! Exactly what I was looking for.

I just changed this replace article_90 = 1 if cites >= journal_90 & cites!=. to this replace article_90 = 1 if cites >= journal_90 cause I got an error, but the solution still holds.
Comment
David Benson

Join Date: Oct 2018

Posts: 489
#4

02 Dec 2018, 17:28

Hi Laksh,

Glad I could help! Just check to make sure you don't have observations where cites is missing, because Stata treats missing (cites==. ) as larger than any number. And you probably don't want obs with missing cites as listed in the top 10% . (You probably already know that, but just wanted to make sure.)
Comment
Laksh Noori

Join Date: Dec 2018

Posts: 23
#5

03 Dec 2018, 16:12

Thank you for the tip! Thankfully there were no missing data in my data set, so I haven't faced the issue.

Also, is there anyway to sum up duplicate data?

Code:

Code:

* Example generated by -dataex-. input str3 author str1 journal byte article int cites "Lak" "A" 1 38 "Tak" "A" 2 119 "Sak" "A" 3 16 "Lak" "A" 4 34 "Sak" "B" 1 189 "Mak" "B" 2 176 "Dak" "C" 1 14 "Tak" "C" 2 10 end

So, here the author might make the cut off in one journal and be assigned 1 (article_90 variable) and be assigned 0 for another journal. The author's name may or may not be repeated multiple times. Is there any way to sum up the cites for each author, and give them 1 if they get assigned it once at least, and then remove all duplicate names?
Comment

David Benson

Join Date: Oct 2018
Posts: 489

03 Dec 2018, 16:38

Sure, the only challenge would be how to handle multiple authors on the paper. Also, I would create a numeric author_id (rather than try to do this by author name).

Code:

NOTE: This includes the count_article_90 variable created
* dataex, varlabel
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 journal byte article int cites float(journal_90 article_90) str7 author float count_article_90
"A" 1  38 210 0 "Amanda"  0
"B" 2 176 189 0 "Amanda"  0
"A" 2 119 210 0 "Brooke"  1
"B" 1 189 189 1 "Brooke"  1
"A" 3  76 210 0 "Charles" 0
"B" 3 148 189 0 "Charles" 0
"A" 4  34 210 0 "David"   0
"A" 8 164 210 0 "David"   0
"B" 4 130 189 0 "David"   0
"A" 5  73 210 0 "Emily"   0
"B" 5  26 189 0 "Emily"   0
"C" 3 139 182 0 "Frank"   0
"A" 6  89 210 0 "Laksh"   0
"C" 1 124 182 0 "Laksh"   0
"A" 7 210 210 1 "Preeti"  2
"C" 2 182 182 1 "Preeti"  2
end
label var article_90 "1 if article in 90th percentile in cites for journal"
label var count_article_90 "Count of articles in top 10% an author has"

Code:

egen count_article_90 = total(article_90), by(author)  // this is what creates it
label var count_article_90 "Count of articles in top 10% an author has"

. list journal article cites journal_90 article_90 author count_article_90, sepby(journal)

     +----------------------------------------------------------------------+
     | journal   article   cites   journ~90   artic~90    author   count~90 |
     |----------------------------------------------------------------------|
  1. |       A         1      38        210          0    Amanda          0 |
  2. |       A         2     119        210          0    Brooke          1 |
  3. |       A         3      76        210          0   Charles          0 |
  4. |       A         4      34        210          0     David          0 |
  5. |       A         5      73        210          0     Emily          0 |
  6. |       A         6      89        210          0     Laksh          0 |
  7. |       A         7     210        210          1    Preeti          2 |
  8. |       A         8     164        210          0     David          0 |
     |----------------------------------------------------------------------|
  9. |       B         1     189        189          1    Brooke          1 |
 10. |       B         2     176        189          0    Amanda          0 |
 11. |       B         3     148        189          0   Charles          0 |
 12. |       B         4     130        189          0     David          0 |
 13. |       B         5      26        189          0     Emily          0 |
     |----------------------------------------------------------------------|
 14. |       C         1     124        182          0     Laksh          0 |
 15. |       C         2     182        182          1    Preeti          2 |
 16. |       C         3     139        182          0     Frank          0 |
     +----------------------------------------------------------------------+


sort author journal
list journal article cites journal_90 article_90 author count_article_90, sepby(author)

     +----------------------------------------------------------------------+
     | journal   article   cites   journ~90   artic~90    author   count~90 |
     |----------------------------------------------------------------------|
  1. |       A         1      38        210          0    Amanda          0 |
  2. |       B         2     176        189          0    Amanda          0 |
     |----------------------------------------------------------------------|
  3. |       A         2     119        210          0    Brooke          1 |
  4. |       B         1     189        189          1    Brooke          1 |
     |----------------------------------------------------------------------|
  5. |       A         3      76        210          0   Charles          0 |
  6. |       B         3     148        189          0   Charles          0 |
     |----------------------------------------------------------------------|
  7. |       A         4      34        210          0     David          0 |
  8. |       A         8     164        210          0     David          0 |
  9. |       B         4     130        189          0     David          0 |
     |----------------------------------------------------------------------|
 10. |       A         5      73        210          0     Emily          0 |
 11. |       B         5      26        189          0     Emily          0 |
     |----------------------------------------------------------------------|
 12. |       C         3     139        182          0     Frank          0 |
     |----------------------------------------------------------------------|
 13. |       A         6      89        210          0     Laksh          0 |
 14. |       C         1     124        182          0     Laksh          0 |
     |----------------------------------------------------------------------|
 15. |       A         7     210        210          1    Preeti          2 |
 16. |       C         2     182        182          1    Preeti          2 |
     +----------------------------------------------------------------------+

Comment

Laksh Noori

Join Date: Dec 2018
Posts: 23

03 Dec 2018, 17:17

How can I sum up the cites for each author, and consolidate it to one author entry like the one given below? Would it be possible? (Thank you for your time - really sorry if the questions seem silly)

Code:

    
  +----------------------------------------------------------------------  
        cites    author   count~90                                            
    ----------------------------------------------------------------------
1. |      214   Amanda       0                                            
2. |      308   Brooke       1                                                
3. |      224   Charles      0                                                
4. |      328    David       0                                                                                      
5. |      392    Preeti      2                                              
   +---------------------------------------------------------------------

Last edited by Laksh Noori; 03 Dec 2018, 17:23.

Comment

David Benson

Join Date: Oct 2018
Posts: 489

03 Dec 2018, 17:44

You'll use the collapse command. NOTE: Save the data before collapsing, because collapsing deletes data and creates a new dataset. (For example, save it as "One line per author.dta"). You will probably also want to rename the variable cites in the collapsed data to total_cites or something like that because it now represents the total cites received by an author (over *all* his or her articles), whereas in the original data it represented cites a *specific article* received.

Code:

preserve   // allows you to do restore afterwards
collapse (sum) cites (max) count_article_90, by(author)

. list author cites count_article_90 , noobs

  +----------------------------+
  |  author   cites   count~90 |
  |----------------------------|
  |  Amanda     214          0 |
  |  Brooke     308          1 |
  | Charles     224          0 |
  |   David     328          0 |
  |   Emily      99          0 |
  |----------------------------|
  |   Frank     139          0 |
  |   Laksh     213          0 |
  |  Preeti     392          2 |
  +----------------------------+

. desc

Contains data
  obs:             8                          
 vars:             3                          
 size:           152                          
---------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------------------------------------------------
author          str7    %9s                  
cites           double  %8.0g                 (sum) cites
count_articl~90 float   %9.0g                 (max) count_article_90
---------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: author

Comment

David Benson

Join Date: Oct 2018
Posts: 489

03 Dec 2018, 18:00

After I posted the previous, I thought you might want to keep more data in the collapsed file (like how many articles the person had, the max cites any one of their papers received, etc). So from the data in post #6, I did the following:

Code:

preserve
collapse (sum) cites (count) article_count = cites (mean) avg_cites = cites (min) min_cites = cites (max) max_cites=cites count_article_90, by(author)
* restore (if desired)
* Obviously, you could also create avg_cites = cites / article_count

list author cites article_count avg_cites min_cites max_cites count_article_90, noobs

  +------------------------------------------------------------------------+
  |  author   cites   articl~t   avg_ci~s   min_ci~s   max_ci~s   count~90 |
  |------------------------------------------------------------------------|
  |  Amanda     214          2        107         38        176          0 |
  |  Brooke     308          2        154        119        189          1 |
  | Charles     224          2        112         76        148          0 |
  |   David     328          3    109.333         34        164          0 |
  |   Emily      99          2       49.5         26         73          0 |
  |------------------------------------------------------------------------|
  |   Frank     139          1        139        139        139          0 |
  |   Laksh     213          2      106.5         89        124          0 |
  |  Preeti     392          2        196        182        210          2 |
  +------------------------------------------------------------------------+


. desc

Contains data
  obs:             8                          
 vars:             7                          
 size:           248                          
---------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------------------------------------------------
author          str7    %9s                  
cites           double  %8.0g                 (sum) cites
article_count   long    %8.0g                 (count) cites
avg_cites       float   %8.0g                 (mean) cites
min_cites       int     %8.0g                 (min) cites
max_cites       int     %8.0g                 (max) cites
count_articl~90 float   %9.0g                 (max) count_article_90
---------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: author

Last edited by David Benson; 03 Dec 2018, 18:03.

Comment

Laksh Noori

Join Date: Dec 2018

Posts: 23
#10

03 Dec 2018, 18:15

Thank you so, so, so much! Thank you for taking time to help me out with this.
1 like
Comment

Announcement