How to identify a variable's top/bottom 30% of each year in panel data?

Noah Liu

Join Date: Jul 2019
Posts: 15


15 Aug 2019, 12:21

Dear Stata users,

I am trying to run a cross-sectional regression on firms in the bottom 30% and the top 30% of the distribution of book-to-market value in panel data.
I tried to rank firms every year, but I can't identify the top/bottom 30% of them, because this is an unbalanced panel and the total number of firms differs from year to year.
I would be grateful if someone could help me identify these firms each year.

Here's the code I use:

Code:

sort gvkey year
local i=1964 // the time period is 1964-2014
while `i'<=2014{
    quietly egen per70`i'=pctile(btm), p(70) //btm is the book-to-market value, and I have to find out the firms with top/bottom 30% of the distribution of btm
    quietly drop if btm<70`i'
    quietly drop per70`i'
    local i=`i'+1
}

Here's part of my data

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long gvkey double year float btm
 2403 1964 .16921677
 9103 1964 .16258383
 1608 1964 .18209743
10060 1964 .18040167
 1481 1964 .14549348
 4780 1964  .1590813
 3874 1964  .1805083
 3235 1964  .1769771
11535 1964 .17150614
 4475 1964  .1712105
 4453 1964 .12189302
 4021 1964 .18900825
11264 1964 .17115825
10878 1964 .19931643
 6502 1964 .18697643
 8645 1964 .16715898
 6113 1964 .16654503
 3489 1964  .1921034
11280 1964 .16220094
 9616 1964 .18040165
end

Thank you for your help in advance!


Igor Paploski

Join Date: Oct 2014

Posts: 174
#2

15 Aug 2019, 13:03

Hey Noah, your example is not ideal because it shows only a single year. I took the liberty of simulating data so that there is more than one year.

Code:

clear
set obs 3
gen n = 10
gen year = _n+1963
expand n
sort year
gen btm = runiform()
drop n

So now there are 10 observations per year for 3 years. The not-so-elegant code below creates one variable for each year, flagging which observations fall in the bottom and top 30% of btm in that year. In the "topbottom" variables, 0 codes for the bottom 30% and 1 codes for the top 30%; observations in between are set to missing. This should be robust to years with different numbers of observations. It might not work properly if there are fewer than 10 observations in a year, so you might have to check for that. The code also doesn't pay any attention to ties in the assignment of deciles (to obtain the top and bottom 30%, I first classify each year's observations into deciles).

Code:

levelsof year, local(levels)
foreach year of local levels{
    xtile topbottom`year' = btm if year==`year', nq(10)
}
foreach var of varlist topbottom1964-topbottom1966{
    replace `var' = 0 if `var' <=3
    replace `var' = 1 if `var' >=8 & `var' !=.
    replace `var' = . if `var' >1 & `var' !=.
}
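A possible variation on the loop above, offered only as a sketch: with 51 years of data, one variable per year can get unwieldy, so the deciles could instead be collected into a single variable. The names `decile`, `tmp` and `topbottom` and the seed are illustrative assumptions, not part of the original code, and ties are still ignored.

```stata
* Simulated data as above: 10 observations per year for 3 years
clear
set seed 2019
set obs 3
gen year = _n + 1963
expand 10
sort year
gen btm = runiform()

* Deciles of btm within each year, stored in a single variable
gen decile = .
levelsof year, local(levels)
foreach y of local levels {
    quietly xtile tmp = btm if year == `y', nq(10)
    quietly replace decile = tmp if year == `y'
    drop tmp
}

* 0 = bottom 30%, 1 = top 30%, missing = middle 40% (or missing btm)
gen topbottom = 0 if decile <= 3
replace topbottom = 1 if decile >= 8 & decile < .

* Cross-check how many observations land in each group per year
tabulate year topbottom, missing
```

The final tabulate is there as a sanity check that each year contributes roughly 30% of its observations to each tail.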

Last edited by Igor Paploski; 15 Aug 2019, 13:06.
Noah Liu

Join Date: Jul 2019

Posts: 15
#3

16 Aug 2019, 01:44

Thank you for your help!
There are hundreds of firms each year in my data so the code works well.
Nick Cox

Join Date: Mar 2014

Posts: 35697
#4

16 Aug 2019, 02:59

I can't see any reason to loop here. Or rather, the egen call can use by() to calculate separately by year, after which you have all the ingredients you need.

Code:

egen per70 = pctile(btm), p(70) by(year)
egen per30 = pctile(btm), p(30) by(year)
gen wanted = cond(btm <= per30, 1, cond(btm <= per70, 2, 3)) if btm < .
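To see how this behaves on an unbalanced panel, here is a minimal sketch on simulated data in which the number of firms differs by year. The simulated sample sizes and the seed are assumptions for illustration only; the three classification lines are the same as above.

```stata
* Simulated unbalanced panel: 10 firms in 1964, 20 in 1965, 30 in 1966
clear
set seed 12345
set obs 3
gen year = 1963 + _n
gen n = 10 * _n
expand n
drop n
gen btm = runiform()

* Year-specific 30th and 70th percentiles, no loop needed
egen per70 = pctile(btm), p(70) by(year)
egen per30 = pctile(btm), p(30) by(year)

* 1 = bottom 30%, 2 = middle 40%, 3 = top 30%; missing btm stays missing
gen wanted = cond(btm <= per30, 1, cond(btm <= per70, 2, 3)) if btm < .

* Row percentages show the share of each year's firms in each group
tabulate year wanted, row
```

A regression on one tail can then be restricted with a condition such as `if wanted == 3`, or `if wanted == 1` for the bottom group. With ties in btm at a cutoff, the group shares may deviate somewhat from 30/40/30.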
Noah Liu

Join Date: Jul 2019

Posts: 15
#5

16 Aug 2019, 09:47

Thank you for your code!
Before reading your post, I thought pctile() could not be used together with by().
Since it can be used like this, it is much more convenient. Thank you!
