  • Panel data wage regression: Controlling for occupation (a categorical variable) that has a break in definition over time

    Dear all,

    Happy Easter. I am writing a Master's thesis using panel data in which each observation is uniquely identified by the person ID and the survey year. In a wage regression to estimate the native-immigrant wage gap in Germany, I would like to control for occupation group. That is, I want to account for the possibility that migrants sort into different occupation groups and therefore earn lower wages.

    In my dataset, the occupation variable is defined in two ways: from the start of the survey to 2013 it follows the standard "ISCO-88"; after 2013 it follows the standard "ISCO-08". The two classification schemes differ through many n-to-1 and 1-to-m splits and merges of groups, so there is no way for me to harmonize them. Given that there are 9 occupation groups in each standard, I end up with 18 codes.

    I define the variable "occup_combined" as a categorical variable that has a unique code for every combination of classification scheme and occupation group. Thus, it looks something like this:
    person ID   survey year   occup_combined
    1           2011          ISCO-88-Group9
    1           2012          ISCO-88-Group9
    1           2013          ISCO-08-Group8
    1           2014          ISCO-08-Group8
    2           2014          ISCO-08-Group3
    2           2015          ISCO-08-Group3
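
    A variable like this could be built along the following lines. This is only a sketch: isco88_major and isco08_major are hypothetical names for the 1-digit major group (1-9) under each scheme, assumed to be missing whenever that scheme does not apply, and occup_combined is assumed not to exist yet. The numeric ranges 881-889 and 81-89 are chosen to resemble the levels in the output further down; which range belongs to which scheme is an assumption.

    ```stata
    * Hypothetical inputs: isco88_major and isco08_major (1-9, missing if not applicable)
    generate int occup_combined = .
    replace occup_combined = 880 + isco88_major if !missing(isco88_major)
    replace occup_combined = 80 + isco08_major if missing(occup_combined) & !missing(isco08_major)

    * Attach readable labels such as "ISCO-88-Group9" to the 18 codes
    forvalues g = 1/9 {
        local c88 = 880 + `g'
        local c08 = 80 + `g'
        label define occ_lbl `c88' "ISCO-88-Group`g'", add
        label define occ_lbl `c08' "ISCO-08-Group`g'", add
    }
    label values occup_combined occ_lbl

    * Check that each code only appears on one side of the classification switch
    tabulate syear occup_combined, missing
    ```

    The cross-tabulation by syear is a quick way to confirm that no survey year mixes codes from both schemes.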

    I run the code below. My question is: Does this accurately control for occupation, no matter which time period it is?

    Code:
    egen cluster_var = group(bula regtyp)
    reghdfe ln_wages_gro immigrant sex age age_sq married no_children i.educ_level years_work_exp i.occup_combined, a(cluster_var syear) vce(cluster cluster_var)
    It seems that Stata correctly drops one reference category in the occupations, given collinearity within each of the pre- and post-periods (before and after 2013).

    Code:
     reghdfe immigrant sex age age_sq married no_children i.educ_level years_work_exp i.occup_combined, a(cluster_var syear) vce(cluster cluster_var)
    (MWFE estimator converged in 5 iterations)
    note: 889.occup_combined omitted because of collinearity
    
    HDFE Linear regression                            Number of obs   =    373,914
    Absorbing 2 HDFE groups                           F(  24,     27) =    1487.98
    Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                      R-squared       =     0.2058
                                                      Adj R-squared   =     0.2057
                                                      Within R-sq.    =     0.1404
    Number of clusters (cluster_var) =         28     Root MSE        =     0.3394
    
                                 (Std. err. adjusted for 28 clusters in cluster_var)
    --------------------------------------------------------------------------------
                   |               Robust
         immigrant | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    ---------------+----------------------------------------------------------------
               sex |  -.0062314   .0053259    -1.17   0.252    -.0171592    .0046965
               age |   .0129305   .0018842     6.86   0.000     .0090644    .0167966
            age_sq |  -.0190114   .0022333    -8.51   0.000    -.0235937   -.0144291
           married |   .0805015     .00617    13.05   0.000     .0678417    .0931612
       no_children |   .0305603   .0022926    13.33   0.000     .0258563    .0352643
                   |
        educ_level |
                2  |  -.2275403    .013966   -16.29   0.000    -.2561961   -.1988844
                3  |  -.1845185   .0156826   -11.77   0.000    -.2166965   -.1523404
                   |
    years_work_exp |   .0012085    .000531     2.28   0.031     .0001189    .0022981
                   |
    occup_combined |
               82  |   .0365541   .0107535     3.40   0.002     .0144898    .0586184
               83  |   .0550285   .0114384     4.81   0.000     .0315589    .0784981
               84  |   .0754783    .010643     7.09   0.000     .0536408    .0973159
               85  |   .1769228   .0207138     8.54   0.000     .1344216    .2194241
               86  |   .0625833   .0236854     2.64   0.014     .0139848    .1111818
               87  |   .1984632   .0163995    12.10   0.000     .1648143    .2321121
               88  |   .2950671   .0256503    11.50   0.000     .2424369    .3476972
               89  |   .3754826   .0248805    15.09   0.000     .3244321    .4265331
              881  |  -.2014318   .0285681    -7.05   0.000    -.2600487    -.142815
              882  |   -.218334   .0289038    -7.55   0.000    -.2776397   -.1590283
              883  |  -.1955812    .027253    -7.18   0.000    -.2514997   -.1396627
              884  |  -.1969916   .0264007    -7.46   0.000    -.2511613   -.1428219
              885  |  -.1457004   .0210747    -6.91   0.000    -.1889421   -.1024588
              886  |  -.1956609   .0408315    -4.79   0.000    -.2794402   -.1118816
              887  |  -.0568629   .0111428    -5.10   0.000    -.0797259   -.0339998
              888  |   .0205399   .0117739     1.74   0.092    -.0036182     .044698
              889  |          0  (omitted)
                   |
             _cons |   .1553774   .0249477     6.23   0.000     .1041889    .2065659
    --------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
     cluster_var |        28          28           0    *|
           syear |        37           1          36     |
    -----------------------------------------------------+
    * = FE nested within cluster; treated as redundant for DoF computation

  • #2
    Celine:
    1) if your question refers to Stata dropping one level of your categorical variable to protect your analysis from the so-called "dummy trap" (Dummy variable (statistics) - Wikipedia), the answer is: yes, it does;
    2) as far as your panel data regression is concerned, as you're not double-clustering your SE (on panelid and year), you will probably be better off with -xtreg,fe-, which, supporting the -fvvarlist- notation, allows you to explore via -margins- and -marginsplot- the turning point in -age-, provided that you code that predictor as follows:
    Code:
    c.age##c.age
    Kind regards,
    Carlo
    (StataNow 18.5)
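
    A minimal sketch of what this suggestion might look like in practice, reusing the variable names from the original command. The person identifier pid is a hypothetical name, and the age grid passed to -margins- is purely illustrative.

    ```stata
    * Declare the panel structure (pid is a hypothetical person identifier)
    xtset pid syear

    * c.age##c.age expands to age and age squared, so -margins- treats them jointly.
    * Time-invariant regressors (e.g. immigrant) are absorbed by the person fixed effect.
    * Year dummies are left out: with person fixed effects they are collinear with linear age.
    xtreg ln_wages_gro c.age##c.age married no_children years_work_exp ///
        i.occup_combined, fe vce(cluster pid)

    * Predicted log wage over an illustrative age grid, then plot the profile
    margins, at(age = (20(5)65))
    marginsplot

    * Turning point of the quadratic: -b1 / (2 * b2), with a delta-method standard error
    nlcom -_b[age] / (2 * _b[c.age#c.age])
    ```

    Note that the fixed-effects estimator cannot recover a coefficient on a time-invariant immigrant indicator, so this form mainly serves to explore the age profile.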

    • #3
      Dear Carlo,

      thank you so much for your quick response. Glad to hear that controlling for occupation in this way avoids the dummy trap.

      Quote from #2: "2) as far as your panel data regression is concerned, as you're not double-clustering your SE (on panelid and year), you will probably be better off with -xtreg,fe-, which, supporting the -fvvarlist- notation, allows you to explore via -margins- and -marginsplot- the turning point in -age-, provided that you code that predictor as follows:"
      Regarding your second point, I am aware that I am not making use of the panel nature, but rather treating the data as a large cross-section for now (with all the caveats that come with it). You mentioned clustering the standard errors, and here I would kindly like to ask a follow-up on this:

      I am not double-clustering on panelid and year, but rather clustering on a geographic interaction between the federal state in Germany and a dummy that indicates whether individual i lives in a rural or an urban area. This results in 28 clusters, which I believe is too few. However, I also don't want to cluster at the level of the individual, because this would allow for no correlation of errors across individuals, which I deem too restrictive.

      Do you have suggestions on what a reasonable grouping of individuals could be where we would see some correlation of the errors within clusters, but not across?

      I was thinking about the following candidates:
      • federal state x rural-urban x education group (primary, secondary, tertiary)
      • federal state x rural-urban x industry (agriculture, manufacturing, etc.)
      • federal state x rural-urban x age cohort (by ten years, e.g. born in 1950-1959, 1960-1969, etc.)
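
      To compare the three candidates above, one could build each cluster identifier and count how many clusters it yields. In this sketch bula, regtyp and educ_level are taken from the original command, while industry and birth_decade are hypothetical variable names.

      ```stata
      * Candidate cluster identifiers (industry and birth_decade are hypothetical names)
      egen cl_educ   = group(bula regtyp educ_level)
      egen cl_indus  = group(bula regtyp industry)
      egen cl_cohort = group(bula regtyp birth_decade)

      * Number of clusters and observations for each candidate
      foreach v of varlist cl_educ cl_indus cl_cohort {
          quietly tabulate `v'
          display as text "`v': " as result r(r) as text " clusters, " ///
              as result r(N) as text " observations"
      }
      ```

      Observations with a missing value in any component get a missing cluster identifier from -egen, group()- and would be dropped from the estimation sample, which is worth checking before switching.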


      • #4
        Celine:
        I'd shy away from complexity in clustering standard errors but, if I had to choose among the proposed options, I would tweak the second one a bit as:
        Code:
        federal_state x industry
        as urban/rural is probably already captured by -industry-.
        Kind regards,
        Carlo
        (StataNow 18.5)
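
        As a sketch, this clustering could be plugged into the original -reghdfe- call as follows. The variable industry is a hypothetical name; bula is assumed to be the federal state identifier, as in the original command.

        ```stata
        * Cluster identifier: federal state x industry (industry is a hypothetical name)
        egen cl_state_ind = group(bula industry)

        * Same specification as in #1, with only the clustering variable changed
        reghdfe ln_wages_gro immigrant sex age age_sq married no_children ///
            i.educ_level years_work_exp i.occup_combined, ///
            absorb(cluster_var syear) vce(cluster cl_state_ind)
        ```

        The output header then reports the number of clusters in cl_state_ind, which can be compared with the 28 clusters obtained before.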
