Panel data wage regression: Controlling for occupation (a categorical variable) that has a break in definition over time

Celine Li IHEID

Join Date: Mar 2024
Posts: 4

30 Mar 2024, 04:06

Dear all,

Happy Easter. I am writing a Master's thesis using panel data in which each observation is uniquely identified by survey year and person ID. In a wage regression that estimates the native-immigrant wage gap in Germany, I want to control for occupation group. That is, I want to control for the possibility that migrants sort into different occupation groups and therefore earn lower wages.

In my dataset, the occupation variable is defined in two ways: from the start of the survey to 2013 it follows the standard "ISCO-88"; after 2013 it follows the standard "ISCO-08". The mapping between the two classification schemes involves many n-to-1 merges and 1-to-m splits, so there is no way for me to harmonize the two. Given that there are 9 occupation groups in each standard, I end up with 18 codes.

I define the variable "occup_combined" as a categorical variable that has a unique code for every combination of classification scheme and occupation group. It looks something like this:

person ID	survey year	occup_combined
1	2011	ISCO-88-Group9
1	2012	ISCO-88-Group9
1	2013	ISCO-08-Group8
1	2014	ISCO-08-Group8
2	2014	ISCO-08-Group3
2	2015	ISCO-08-Group3
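
A minimal sketch of how a variable like this could be constructed, for anyone trying to reproduce the setup. The names isco88_major and isco08_major are hypothetical placeholders for the 1-digit occupation group recorded under each standard; they are not taken from my dataset.

Code:

```stata
* Sketch only: isco88_major and isco08_major are hypothetical names for the
* 1-digit occupation group (1-9) recorded under each standard.
generate byte scheme = .
replace scheme = 1 if !missing(isco88_major)           // ISCO-88 waves
replace scheme = 2 if !missing(isco08_major)           // ISCO-08 waves
generate byte occ_major = cond(scheme == 1, isco88_major, isco08_major)
egen occup_combined = group(scheme occ_major), label   // one code per scheme-group pair
tabulate occup_combined scheme, missing                // each code should appear under one scheme only
```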

I run the code below. My question is: Does this accurately control for occupation, no matter which time period it is?

Code:

egen cluster_var = group(bula regtyp)
reghdfe ln_wages_gro immigrant sex age age_sq married no_children i.educ_level years_work_exp i.occup_combined, a(cluster_var syear) vce(cluster cluster_var)

It seems that Stata correctly drops one reference category among the occupations in each classification period (before and after 2013) because of collinearity.

Code:

 reghdfe immigrant sex age age_sq married no_children i.educ_level years_work_exp i.occup_combined, a(cluster_var syear) vce(cluster cluster_var)
(MWFE estimator converged in 5 iterations)
note: 889.occup_combined omitted because of collinearity

HDFE Linear regression                            Number of obs   =    373,914
Absorbing 2 HDFE groups                           F(  24,     27) =    1487.98
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.2058
                                                  Adj R-squared   =     0.2057
                                                  Within R-sq.    =     0.1404
Number of clusters (cluster_var) =         28     Root MSE        =     0.3394

                             (Std. err. adjusted for 28 clusters in cluster_var)
--------------------------------------------------------------------------------
               |               Robust
     immigrant | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
---------------+----------------------------------------------------------------
           sex |  -.0062314   .0053259    -1.17   0.252    -.0171592    .0046965
           age |   .0129305   .0018842     6.86   0.000     .0090644    .0167966
        age_sq |  -.0190114   .0022333    -8.51   0.000    -.0235937   -.0144291
       married |   .0805015     .00617    13.05   0.000     .0678417    .0931612
   no_children |   .0305603   .0022926    13.33   0.000     .0258563    .0352643
               |
    educ_level |
            2  |  -.2275403    .013966   -16.29   0.000    -.2561961   -.1988844
            3  |  -.1845185   .0156826   -11.77   0.000    -.2166965   -.1523404
               |
years_work_exp |   .0012085    .000531     2.28   0.031     .0001189    .0022981
               |
occup_combined |
           82  |   .0365541   .0107535     3.40   0.002     .0144898    .0586184
           83  |   .0550285   .0114384     4.81   0.000     .0315589    .0784981
           84  |   .0754783    .010643     7.09   0.000     .0536408    .0973159
           85  |   .1769228   .0207138     8.54   0.000     .1344216    .2194241
           86  |   .0625833   .0236854     2.64   0.014     .0139848    .1111818
           87  |   .1984632   .0163995    12.10   0.000     .1648143    .2321121
           88  |   .2950671   .0256503    11.50   0.000     .2424369    .3476972
           89  |   .3754826   .0248805    15.09   0.000     .3244321    .4265331
          881  |  -.2014318   .0285681    -7.05   0.000    -.2600487    -.142815
          882  |   -.218334   .0289038    -7.55   0.000    -.2776397   -.1590283
          883  |  -.1955812    .027253    -7.18   0.000    -.2514997   -.1396627
          884  |  -.1969916   .0264007    -7.46   0.000    -.2511613   -.1428219
          885  |  -.1457004   .0210747    -6.91   0.000    -.1889421   -.1024588
          886  |  -.1956609   .0408315    -4.79   0.000    -.2794402   -.1118816
          887  |  -.0568629   .0111428    -5.10   0.000    -.0797259   -.0339998
          888  |   .0205399   .0117739     1.74   0.092    -.0036182     .044698
          889  |          0  (omitted)
               |
         _cons |   .1553774   .0249477     6.23   0.000     .1041889    .2065659
--------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
 cluster_var |        28          28           0    *|
       syear |        37           1          36     |
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17601
#2

30 Mar 2024, 04:19

Celine:
1) if your question refers to Stata dropping one level of your categorical variable to protect your analysis from the so-called "dummy trap" (Dummy variable (statistics) - Wikipedia), the answer is: yes, it does;
2) as far as your panel data regression is concerned, as you're not double-clustering your SE (on panelid and year), you will probably be better off with -xtreg, fe-, which supports -fvvarlist- notation and so allows you to explore the turning point in -age- via -margins- and -marginsplot-, provided that you code that predictor as follows:

Code:

c.age##c.age
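
A rough sketch of how this could look in practice, under stated assumptions: pid is a hypothetical name for the person identifier, and the covariate list is only borrowed from the -reghdfe- call in #1. Note that time-invariant regressors such as -immigrant- and -sex- are absorbed by the person fixed effects, so they are left out here.

Code:

```stata
* Sketch only: pid is a hypothetical name for the person identifier.
xtset pid syear
xtreg ln_wages_gro c.age##c.age married no_children years_work_exp ///
    i.occup_combined i.syear, fe vce(cluster pid)
margins, at(age = (20(5)65))             // linear prediction over an assumed age range
marginsplot
nlcom -_b[age] / (2 * _b[c.age#c.age])   // age at which the quadratic profile turns
```

The age range 20 to 65 is an assumption; adjust it to the ages actually observed in the sample.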

Kind regards,
Carlo
(StataNow 18.5)
Celine Li IHEID

Join Date: Mar 2024

Posts: 4
#3

30 Mar 2024, 05:42

Dear Carlo,

thank you so much for your quick response. Glad to hear that controlling for occupation in this way avoids the dummy trap.

2) as far as your panel data regression is concerned, as you're not double-clustering your SE (on panelid and year), you will probably be better off with -xtreg, fe-, which supports -fvvarlist- notation and so allows you to explore the turning point in -age- via -margins- and -marginsplot-, provided that you code that predictor as follows:

Regarding your second point, I am aware that I am not making use of the panel nature, but rather treating the data as a large cross-section for now (with all the caveats that come with it). You mentioned clustering the standard errors, and here I would kindly like to ask a follow-up on this:

I am not double clustering on panelid and year, but rather on a geographic interaction term between the federal state in Germany and a dummy that indicates whether individual i lives in a rural or urban area. This results in 28 clusters, which I believe is too few. However, I also don't want to cluster at the level of the individual, because this would rule out any correlation of errors across individuals, which I deem too restrictive.

Do you have suggestions on what a reasonable grouping of individuals could be where we would see some correlation of the errors within clusters, but not across?

I was thinking about the following candidates:

federal state x rural-urban x education group (primary, secondary, tertiary)

federal state x rural-urban x industry (agriculture, manufacturing, etc.)

federal state x rural-urban x age cohort (by ten years, e.g. born in 1950-1959, 1960-1969, etc.)
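
To make the comparison concrete, here is a minimal sketch of how the three candidate groupings could be built and their cluster counts checked. The names educ_group, industry and birth_cohort are hypothetical placeholders; bula and regtyp are the variables already used for cluster_var in #1.

Code:

```stata
* Sketch only: educ_group, industry and birth_cohort are hypothetical variable names.
egen cl_educ     = group(bula regtyp educ_group)
egen cl_industry = group(bula regtyp industry)
egen cl_cohort   = group(bula regtyp birth_cohort)

* group() numbers the clusters 1, 2, ..., so the maximum is the cluster count
foreach v of varlist cl_educ cl_industry cl_cohort {
    quietly summarize `v'
    display "`v': " r(max) " clusters"
}
```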
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17601
#4

30 Mar 2024, 08:51

Celine:
I'd shy away from complexity in clustering standard errors but, if I had to choose among the proposed options, I would tweak the second one a bit as:

Code:

federal_state x industry

as urban/rural is probably already captured by -industry-.
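
In Stata terms, a hedged sketch of that tweak might look as follows. Here -industry- is a hypothetical variable name, -bula- is presumed to be the federal state identifier from #1, and keeping the absorbed effects from #1 unchanged is an assumption rather than a recommendation.

Code:

```stata
* Sketch only: industry is a hypothetical variable name; bula is presumed to be the federal state.
egen cl_state_ind = group(bula industry)
reghdfe ln_wages_gro immigrant sex age age_sq married no_children i.educ_level ///
    years_work_exp i.occup_combined, a(cluster_var syear) vce(cluster cl_state_ind)
```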

Kind regards,
Carlo
(StataNow 18.5)
