Cohort Analysis

Raza Jafri

Join Date: May 2017

Posts: 31
#1


09 Jul 2017, 07:28

Hello everyone, I am working on a synthetic panel formed from repeated cross-sectional data. The data are collected every two years from 2004 to 2016. Do I need to use the same cohorts for every round? For instance, I have restricted the sample to people aged 26 to 50, inclusive. The oldest cohort in the first round is therefore 1954, and the youngest cohort in the last round becomes 1986, keeping in mind 2004 (survey year) - 50 = 1954 and 2016 (survey year) - 26 = 1986. So I built the cohorts with a five-year gap for each cohort and used the same cohorts for all rounds, so that I can study the age effect. My code for constructing the cohort bins, shown below, is the same for all years. Is this a correct way to do it? I ask because I am getting some weird results. In addition, my oldest cohort ends at 1954, but by the logic of my code the range extends to 1951. Please see the code below and tell me whether this is the correct way to do it.

Code:

// "1" 26-30 years, "2" 31-35 years, "3" 36-40 years, "4" 41-45 years, "5" 46-50 years.
drop if age > 50
drop if age < 25 | age == 25
gen year = 2004
gen cohort = year - age
summarize cohort, d
// recode cohort (1974/1978=1) (1969/1973=2) (1964/1968=3) (1959/1963=4) (1954/1958=5), gen(cbin)
recode cohort (1986/1990=1) (1981/1985=2) (1976/1980=3) (1971/1975=4) (1966/1970=5) (1961/1965=6) (1956/1960=7) (1951/1955=8), gen(cbin)
gen c_age = 28 if cbin==1 // cbin stands for cohort bin, and we put the median number 28, from the age range of 26 to 30
replace c_age = 33 if cbin==2
replace c_age = 38 if cbin==3
replace c_age = 43 if cbin==4
replace c_age = 48 if cbin==5
replace c_age = 53 if cbin==6
replace c_age = 58 if cbin==7
replace c_age = 63 if cbin==8

This code is for the first round, 2004, and it is the same for 2006, 2008, 2010, 2012, 2014, and 2016.
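
For reference, the birth-year arithmetic behind these bins can be checked directly. A minimal sketch for a do-file, using only the survey years and the 26 to 50 age window stated above:

```stata
* Sketch only: birth-year range implied by ages 26-50 in the first and last rounds.
display 2004 - 50   // oldest birth year observed in the 2004 round
display 2004 - 26   // youngest birth year observed in the 2004 round
display 2016 - 50   // oldest birth year observed in the 2016 round
display 2016 - 26   // youngest birth year observed in the 2016 round
```

These evaluate to 1954, 1978, 1966 and 1990, so the pooled rounds span birth years 1954 to 1990. Eight five-year bins counted down from 1990 cover 1951 to 1990, which is why the lowest bin in the recode reaches 1951 even though the oldest birth year in the data is 1954.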
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#2

09 Jul 2017, 09:56

First, your calculations of c_age and cbin are not consistent with each other. Consider somebody of age 26, 27, or 28. Their birth cohort years are 1978, 1977, and 1976, respectively. That puts them in cbin 3. So you then calculate c_age = 38, but nobody in that cbin is anywhere near that age. The same error propagates throughout the data. Also, the values 28, 33, 38... are not correct medians for the actual ages in the different values of cbin here: the correct medians are 27 (cbin 1 a smaller group), 31, 36, 41, ...

I think rather than working this out through a long series of -gen- and -replace- statements, let Stata do it for you:

Code:

egen c_age2 = median(age), by(cbin)

That's less work, and the results will be correct.
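
To see the mismatch described above, one option is to compare the hand-assigned values with the ages actually present in each bin. A minimal sketch, assuming cbin and c_age have already been created by the code in #1:

```stata
* Sketch only: for each cohort bin, show the range and median of the actual
* ages next to the hand-assigned c_age.
tabstat age c_age, by(cbin) statistics(min median max)
```

If the two are consistent, the median of age and the constant value of c_age agree within each bin.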
Raza Jafri

Join Date: May 2017

Posts: 31
#3

09 Jul 2017, 14:57

Sir, thank you so much for the reply. But actually I am a little confused here. For instance, if we look at the code

Code:

gen cohort= year - age

it gives me the birth year, which I am referring to as the cohort. I put the cohorts into bins 1, 2, 3, ..., 8. Now, as per your example, people aged 26, 27, and 28 will fall in the first cohort. This code puts them in cbin 3, but I have to keep the cohorts the same for all rounds, so I checked the youngest and oldest cohorts in the data and wrote this code. As far as I understand from your reply, my cohorts are fine and the problem is in cbin and c_age. So will this single command fix everything, or do I still need to adjust how I make the cohorts for this first round and the later rounds? I am only considering people between the ages of 26 and 50, both inclusive. Is it correct to go up to cbin 8 in order to keep the cohorts the same throughout the survey rounds? In 2004 there will be people only up to cbin 5 and the remaining cohorts will be empty, I guess. From 2006 onwards, however, some people will fall in cbin 6, 7, and 8, and in the last round, 2016, there will be no one in the first two cohorts, I think.

Last edited by Raza Jafri; 09 Jul 2017, 15:01.

Raza Jafri

Join Date: May 2017
Posts: 31
#4

09 Jul 2017, 15:38

I have changed the code to keep c_age and cbin consistent with each other. But I wanted to draw your attention to one issue: the results for the first three round pairs are fine, but from 2010 onwards I see a negative change in income and consumption for people with university education, whereas there is a positive change in income for people with junior middle education. However, when I look at the data, people with university education are earning more and consuming more as well. For your reference, I am attaching the graphs here, in case you can suggest something. Results are fine for 2004 to 2006, 2006 to 2008, and 2008 to 2010; the problem comes in the later rounds. Here is my code:

Code:

  *************************************************
  *** CHOOSE THE INCOME AND CONSUMPTION MEASURE ***
  *************************************************
  
  
  local income_measure           hour_wage
  *local income_measure      hour_wage
  
  *local consumption_measure cosnumption
  local consumption_measure  ae_consumption
  
  
  keep if cbin ~=.  //cbin means Cohort bin (Age cohorts)
  
  capture program drop residualcy        
  program residualcy, eclass   //eclass stores the results of regression
        
        egen subgroup = group(cbin `1')    //(1 means argument 1 which is pertaining to edu_group 1 to 7)
        keep if year == `2' | year == `3'   //(2 and 3 are also argument e-g year 2004 , 2006 etc)
        bys year subgroup: egen m_c_group = mean(`4')  // (in order to make synthetic panel from repeated cross sections we need to generate subgroups in terms of means. Here it is pertaining to consumption)
        bys year subgroup: egen m_y_group = mean(`5')    //same as above but pertaining to income)
        keep year c_age m_c_group m_y_group subgroup cbin `1'
        duplicates drop
        gen lnm_c_group = ln(m_c_group)            //taking logs
        gen lnm_y_group = ln(m_y_group)        
        
        
        bys subgroup (year): gen d_c = lnm_c_group[2]-lnm_c_group[1] //log group mean in the later year minus the earlier year within the same subgroup (e.g. 2006 minus 2004), for consumption
        bys subgroup (year): gen d_y = lnm_y_group[2]-lnm_y_group[1] //same as above but for income
        gen c_age2 = c_age*c_age  //in order to reduce age effect for those cohorts who were interviewed later in the survey (here we make age square)
        gen c_age3 = c_age2*c_age //age cube
        //keep if year == `3' 
        drop year m_c_group m_y_group lnm_c_group lnm_y_group
        duplicates drop
        reg d_c c_age c_age2 c_age3  //change in consumption on age (residual here is the risk affecting consumption)
        predict eps_c, resid            
        reg d_y c_age c_age2 c_age3  //change in income on age (residual here is the income shock)
        predict eps_y, resid 
        reg eps_c eps_y  // income shock is independent here and consumption shock is dependent here. So we can check the consumption insurance hypothesis.
  end 
        
 
 
 
 capture program drop adgraph2
  program adgraph2
        twoway  (scatter eps_c eps_y if `1' == 1 , mcolor(dknavy) msymbol(O))  /// 
                (scatter eps_c eps_y if `1' == 2 , mcolor(green) msymbol(o))  ///
                (scatter eps_c eps_y if `1' == 3 , mcolor(blue) msymbol(O))  ///
                (lfit eps_c eps_y, lpattern(solid) lcolor(black))  ///
             , ylabel(`10'(0.1)`11', labsize(small)) xlabel(`10'(0.1)`11',labsize(small) angle(vertical)) scheme(s1mono) xtitle("change in log disposable income",size(small)) ///
            ytitle("change in log consumption",size(small) angle(vertical))   ///
            legend(nobox symxsize(3) size(small) pos(12) row(3) region(fcolor(none))  ///
            order(1 "`2'" 2 "`3'" 3 "`4'" 4  "Slope `:di %4.3f _b[eps_y]' with s.e. `:di %4.3f _se[eps_y]'"))
        graph save "HIES Figures\ad-by-`1'-`5'-`6'-`8'-`9'-view1.gph", replace
           
        twoway  (scatter eps_c eps_y if cbin == 1, mcolor(blue)     msymbol(O))  ///
                (scatter eps_c eps_y if cbin == 2, mcolor(green)     msymbol(D))  ///
                (scatter eps_c eps_y if cbin == 3, mcolor(purple)     msymbol(T))  ///
                (scatter eps_c eps_y if cbin == 4, mcolor(magenta)     msymbol(S))  ///
                (scatter eps_c eps_y if cbin == 5, mcolor(red)         msymbol(+))  ///
                (scatter eps_c eps_y if cbin == 6, mcolor(brown)     msymbol(dh))  ///
                (scatter eps_c eps_y if cbin == 7, mcolor(gold)     msymbol(th))  ///
                (scatter eps_c eps_y if cbin == 8, mcolor(lavender) msymbol(sh))  ///
                (lfit eps_c eps_y, lpattern(solid) lcolor(black))  ///
            , ylabel(`10'(0.1)`11', labsize(small)) xlabel(`10'(0.1)`11',labsize(small) angle(vertical)) scheme(s1mono) xtitle("change in log disposable income",size(small)) ///
            ytitle("change in log consumption",size(small) angle(vertical))   ///
            legend(nobox symxsize(3) size(small) pos(12) row(3) region(fcolor(none)) order(1 "26-30" 2 "31-35" 3 "36-40" 4 "41-45" 5 "46-50" 6 "51-55" 7 "56-60" 8 "61-65" 9  "Slope `:di %4.3f _b[eps_y]' with s.e. `:di %4.3f _se[eps_y]'"))
        graph save "HIES Figures\ad-by-`1'-`5'-`6'-`8'-`9'-view2.gph", replace
        graph combine   "HIES Figures\ad-by-`1'-`5'-`6'-`8'-`9'-view1.gph"  ///
                        "HIES Figures\ad-by-`1'-`5'-`6'-`8'-`9'-view2.gph", col(2) scheme(s1mono) title("HIES `5' to `6' by `7'", size(small))
        graph save    "HIES Figures\ad-by-`1'-`5'-`6'-`8'-`9'.gph",replace
        graph export  "HIES Figures\ad-by-`1'-`5'-`6'-`8'-`9'.png",replace
        erase "HIES Figures\ad-by-`1'-`5'-`6'-`8'-`9'.gph"
        erase "HIES Figures\ad-by-`1'-`5'-`6'-`8'-`9'-view1.gph"
        erase "HIES Figures\ad-by-`1'-`5'-`6'-`8'-`9'-view2.gph"
  end  
 
  preserve
  keep if edu_group ~=.
  residualcy edu_group 2004 2006 `consumption_measure' `income_measure'
  adgraph2     edu_group "Junior Middle" "Intermediate" "University" 2004 2006 "Full Education Category" `consumption_measure' `income_measure' -0.2 0.2
  restore
  
  preserve
  keep if edu_group ~=.
  residualcy edu_group 2006 2008 `consumption_measure' `income_measure'
  adgraph2     edu_group "Junior Middle" "Intermediate" "University" 2006 2008 "Full Education Category" `consumption_measure' `income_measure' -0.2 0.2
  restore
  
  preserve
  keep if edu_group ~=.
  residualcy edu_group 2008 2010 `consumption_measure' `income_measure'
  adgraph2     edu_group "Junior Middle" "Intermediate" "University" 2008 2010 "Full Education Category" `consumption_measure' `income_measure' -0.2 0.2
  restore
  
  preserve
  keep if edu_group ~=.
  residualcy edu_group 2010 2012 `consumption_measure' `income_measure'
  adgraph2     edu_group "Junior Middle" "Intermediate" "University" 2010 2012 "Full Education Category" `consumption_measure' `income_measure' -0.2 0.2
  restore
  
  preserve
  keep if edu_group ~=.
  residualcy edu_group 2012 2014 `consumption_measure' `income_measure'
  adgraph2     edu_group "Junior Middle" "Intermediate" "University" 2012 2014 "Full Education Category" `consumption_measure' `income_measure' -0.2 0.2
  restore
  
  preserve
  keep if edu_group ~=.
  residualcy edu_group 2014 2016 `consumption_measure' `income_measure'
  adgraph2     edu_group "Junior Middle" "Intermediate" "University" 2014 2016 "Full Education Category" `consumption_measure' `income_measure' -0.2 0.2
  restore
  
 
  preserve
  keep if edu_group ~=.
  residualcy edu_group 2004 2016 `consumption_measure' `income_measure'
  adgraph2     edu_group "Junior Middle" "Intermediate" "University" 2004 2016 "Full Education Category" `consumption_measure' `income_measure' -0.2 0.2
  restore
  
 
  
  
  
  ******************************************************
  *************4 Years Difference***********************
  
  preserve
  keep if edu_group ~=.
  residualcy edu_group 2004 2008 `consumption_measure' `income_measure'
  adgraph2     edu_group "Junior Middle" "Intermediate" "University" 2004 2008 "Full Education Category" `consumption_measure' `income_measure' -0.2 0.2
  restore
  
  preserve
  keep if edu_group ~=.
  residualcy edu_group 2006 2010 `consumption_measure' `income_measure'
  adgraph2     edu_group "Junior Middle" "Intermediate" "University" 2006 2010 "Full Education Category" `consumption_measure' `income_measure' -0.2 0.2
  restore
  
  preserve
  keep if edu_group ~=.
  residualcy edu_group 2008 2012 `consumption_measure' `income_measure'
  adgraph2     edu_group "Junior Middle" "Intermediate" "University" 2008 2012 "Full Education Category" `consumption_measure' `income_measure' -0.2 0.2
  restore
  
  preserve
  keep if edu_group ~=.
  residualcy edu_group 2010 2014 `consumption_measure' `income_measure'
  adgraph2     edu_group "Junior Middle" "Intermediate" "University" 2010 2014 "Full Education Category" `consumption_measure' `income_measure' -0.2 0.2
  restore

[Image: ad-by-edu_group-2012-2014-ae_consumption-hour_wage.png]

[Image: ad-by-edu_group-2014-2016-ae_consumption-hour_wage.png]

The problem comes in the later rounds, as I have explained.

[Image: ad-by-edu_group-2004-2006-ae_consumption-hour_wage.png]


Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#5

09 Jul 2017, 15:40

"still i need to do adjustment in making cohorts for this first and later rounds"

So, if you are referring here to rounds of the survey, each round of the survey occurs in a different year. So each respondent's age changes, but the birth cohorts and cbins remain the same. It follows that the median age in each bin changes in each round of the survey.

As I understand it, you have a series of data cross-sections, not a panel (longitudinal) data set. So you should not have a "variable" year that is constant at 2004. The variable year should be the actual year of the survey the observation comes from. Then you can calculate birth cohort as year - age. Following that, if you want the median age in each birth cohort bin in each round, the code would be -egen cage_2 = median(age), by(cbin year)-.
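
The advice above can be sketched as follows for a do-file. This assumes the seven rounds have been appended into one dataset in which year already holds each observation's actual survey year, and that cohort and cbin have not yet been created; the pooled layout and the variable name c_age_med are assumptions for illustration, while the bin boundaries are those from #1.

```stata
* Sketch only: assumes all rounds are appended into one dataset and that
* year records the actual survey year of each observation.
keep if inrange(age, 26, 50)
gen cohort = year - age                     // birth year
recode cohort (1986/1990=1) (1981/1985=2) (1976/1980=3) (1971/1975=4) ///
              (1966/1970=5) (1961/1965=6) (1956/1960=7) (1951/1955=8), gen(cbin)
* c_age_med is a hypothetical name: median age within each bin in each round
egen c_age_med = median(age), by(cbin year)
tabulate cbin year                          // which bins are populated in which round
```

The cross-tabulation shows which bins are populated in each round, which speaks to the question in #3 about empty bins in the early and late rounds.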

Edit: Crossed with #4. This is a response only to #3. #4 is too long and complicated for me to deal with today. I would also suggest that even if somebody else wants to jump in on this today, it is going to be very difficult to figure this out without some example data (use -dataex-) to experiment with.
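
A minimal sketch of producing such an example, assuming the variable names used earlier in the thread exist in the data; the choice of 20 observations is arbitrary:

```stata
* Sketch only: print a small, copy-pasteable extract of the relevant variables.
* dataex ships with recent Stata releases; on older ones: ssc install dataex
dataex year age cbin c_age edu_group hour_wage ae_consumption in 1/20
```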

Last edited by Clyde Schechter; 09 Jul 2017, 15:44.
ashar awan

Join Date: Mar 2020

Posts: 23
#6

12 Mar 2020, 03:24

Raza Jafri, brother, I need your help. I am also using various PSLM/HIES surveys to make a panel. Could you please share your do-file? [email protected]