  • failure to get F test when running xtreg with clustering on a group (more predictors than groups)

    My question concerns what to do when clustering in -xtreg- leaves too few degrees of freedom for a model with many variables. (Using Stata/MP 15.1)

    I'm running a cross-sectional time-series analysis of individual responses in the November Supplement to the Current Population Survey (US Census/Dept of Labor). The November Supplement, in even-numbered years, asks a few questions about voter registration and voting.

    My dependent variable is registered ("are you registered to vote?" 0=no; 1=yes). The individual-level (or household-level) variables are categorical and fairly standard in the extensive literature: gender, age groups, last level of schooling, income quartile, race/ethnicity, and time at current address. Other than gender, each has four to six categories. The data set contains all 50 states plus DC (treated as a state) and several years. The sample size is >470,000 adult citizens. I'm using -xtreg- even with a binary dependent variable because registration rates are about 60-80%, depending on sub-population, so I'm taking the Angrist and Pischke attitude that OLS is no worse than logistic. Plus, running -xtlogit- and then requesting -margins-, even on a strong multicore computer, takes hours (i.e., overnight) with this sample size.

    However, I wish to interact most of these level-one variables with one or more of three state-level policy variables (also binary). Year and state are included in -xtreg- in the usual way:

    Code:
    xtreg registered i.year x1-x6 policy1 policy2 policy3, i(stateid) cluster(stateid) fe
    The problem arises when interacting one policy variable with the individual-level variables while clustering on states: Stata won't report an overall F test. After reading -help j_robustsingular-, I see the problem is that I run out of degrees of freedom.
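
    For reference, here is a minimal sketch of the kind of interacted specification that triggers the problem. It reuses the placeholder names from the command above (x1-x6, policy1-policy3, stateid); the -xtset- and factor-variable syntax is only illustrative, not the exact model I ran:

    Code:
    * declare the panel once, then add the policy-by-demographic interactions
    xtset stateid
    xtreg registered i.year i.x1##i.policy1 i.x2##i.policy1 i.x3##i.policy1 ///
        i.x4##i.policy1 i.x5##i.policy1 i.x6##i.policy1 ///
        i.policy2 i.policy3, fe vce(cluster stateid)
    * each ## term adds (number of categories - 1) interaction coefficients,
    * so the coefficients to be tested quickly exceed the 51 state clusters
    * and the overall F test cannot be computed (see -help j_robustsingular-)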

    What are my options? Here are those I thought of:

    1. Reduce categories in the variables? I would rather not drop or condense any, as the goal is to get at the impact of policy on demographic groups. I could convert education and income into a scale using -alpha-, but that is hard to explain to a lay audience. The effect of the policies on age seems to hit those in the middle range, so I'd prefer not to use a continuous variable.
    2. Run separate models with only a few batches of interactions at a time? It seems the policies do interact with nearly all the variables (but not all categories).
    3. Determine whether the standard errors are trustworthy even without the F test "working"? If I run the model as -reg, robust- with state ids included as indicators, the change in p-values is between .01 and .1 for most interactions, and F(95, 472633) = 780.86. Is that reason to run the model as -xtreg, cluster- and rely on the reported clustered standard errors? (A rough sketch of this comparison follows the list.)
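
    To be concrete, here is a rough sketch of the comparison in option 3, again using the placeholder variable names from the command above rather than my full set of interactions:

    Code:
    * pooled OLS with state indicators and heteroskedasticity-robust SEs
    reg registered i.year i.stateid x1-x6 policy1 policy2 policy3, vce(robust)
    estimates store ols_robust

    * the same specification as a fixed-effects model with state-clustered SEs
    xtset stateid
    xtreg registered i.year x1-x6 policy1 policy2 policy3, fe vce(cluster stateid)
    estimates store fe_cluster

    * compare coefficients and standard errors side by side
    estimates table ols_robust fe_cluster, se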

    Are there other solutions or things to be aware of?

    Thank you.
    Last edited by Doug Hess; 09 Jan 2022, 20:52.

  • #2
    Doug:
    this happens frequently.
    I would check whether the regression suffers from misspecification or other issues.
    The fact that the F-test is not reported is basically negligible.
    Kind regards,
    Carlo
    (StataNow 18.5)



    • #3
      Originally posted by Carlo Lazzaro
      Doug:
      this happens frequently.
      I would check whether the regression suffers from misspecification or other issues.
      The fact that the F-test is not reported is basically negligible.
      Thanks, Carlo Lazzaro. I don't think it's a specification problem; it's that the number of clusters is less than the number of predictors.
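
      In case it helps anyone else hitting this, a quick way to see the imbalance after the clustered -xtreg- is to compare the stored model degrees of freedom with the cluster count; the lines below are only a sketch:

      Code:
      * run immediately after -xtreg ..., fe vce(cluster stateid)-
      display "model df = " e(df_m)
      display "clusters = " e(N_clust)
      * the overall F test is not reported when the coefficients to be tested
      * exceed what the cluster-robust VCE can support, roughly the number of
      * clusters minus one (see -help j_robustsingular-)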



      • #4
        Doug:
        nothing to worry about, then.
        Just to be clearer, I recommended checking for model misspecification independently of whether the F-test appears (its absence has no bearing on that issue).
        You can use a procedure that heavily draws upon the -linktest- methodology, as in the following toy example (where the -xtreg, fe- regression is deliberately misspecified):
        Code:
        . use "https://www.stata-press.com/data/r17/nlswork.dta"
        (National Longitudinal Survey of Young Women, 14-24 years old in 1968)
        
        . xtreg ln_wage c.age##c.age i.year, fe vce(cluster idcode)
        
        Fixed-effects (within) regression               Number of obs     =     28,510
        Group variable: idcode                          Number of groups  =      4,710
        
        R-squared:                                      Obs per group:
             Within  = 0.1162                                         min =          1
             Between = 0.1078                                         avg =        6.1
             Overall = 0.0932                                         max =         15
        
                                                        F(16,4709)        =      79.11
        corr(u_i, Xb) = 0.0613                          Prob > F          =     0.0000
        
                                     (Std. err. adjusted for 4,710 clusters in idcode)
        ------------------------------------------------------------------------------
                     |               Robust
             ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
        -------------+----------------------------------------------------------------
                 age |   .0728746    .013687     5.32   0.000     .0460416    .0997075
                     |
         c.age#c.age |  -.0010113   .0001076    -9.40   0.000    -.0012224   -.0008003
                     |
                year |
                 69  |   .0647054   .0155249     4.17   0.000     .0342693    .0951415
                 70  |   .0284423   .0264639     1.07   0.283    -.0234395     .080324
                 71  |   .0579959   .0384111     1.51   0.131    -.0173078    .1332996
                 72  |   .0510671   .0502675     1.02   0.310    -.0474808     .149615
                 73  |   .0424104   .0624924     0.68   0.497    -.0801038    .1649247
                 75  |   .0151376    .086228     0.18   0.861    -.1539096    .1841848
                 77  |   .0340933   .1106841     0.31   0.758    -.1828994     .251086
                 78  |   .0537334   .1232232     0.44   0.663    -.1878417    .2953084
                 80  |   .0369475   .1473725     0.25   0.802    -.2519716    .3258667
                 82  |   .0391687   .1715621     0.23   0.819    -.2971733    .3755108
                 83  |    .058766   .1836086     0.32   0.749    -.3011928    .4187249
                 85  |   .1042758   .2080199     0.50   0.616    -.3035406    .5120922
                 87  |   .1242272   .2327328     0.53   0.594    -.3320379    .5804922
                 88  |   .1904977   .2486083     0.77   0.444    -.2968909    .6778863
                     |
               _cons |   .3937532   .2469015     1.59   0.111    -.0902893    .8777957
        -------------+----------------------------------------------------------------
             sigma_u |  .40275174
             sigma_e |  .30127563
                 rho |  .64120306   (fraction of variance due to u_i)
        ------------------------------------------------------------------------------
        
        . predict fitted, xb
        (24 missing values generated)
        
        . gen sq_fitted=fitted^2
        (24 missing values generated)
        
        . xtreg ln_wage c.age##c.age i.year fitted sq_fitted , fe vce(cluster idcode)
        note: c.age#c.age omitted because of collinearity.
        
        Fixed-effects (within) regression               Number of obs     =     28,510
        Group variable: idcode                          Number of groups  =      4,710
        
        R-squared:                                      Obs per group:
             Within  = 0.1173                                         min =          1
             Between = 0.1121                                         avg =        6.1
             Overall = 0.0952                                         max =         15
        
                                                        F(17,4709)        =      76.35
        corr(u_i, Xb) = 0.0636                          Prob > F          =     0.0000
        
                                     (Std. err. adjusted for 4,710 clusters in idcode)
        ------------------------------------------------------------------------------
                     |               Robust
             ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
        -------------+----------------------------------------------------------------
                 age |  -.0004375   .0123334    -0.04   0.972    -.0246168    .0237418
                     |
         c.age#c.age |          0  (omitted)
                     |
                year |
                 69  |   -.016534   .0175908    -0.94   0.347    -.0510202    .0179523
                 70  |  -.0127288   .0270511    -0.47   0.638    -.0657616    .0403039
                 71  |  -.0164466   .0396213    -0.42   0.678    -.0941229    .0612297
                 72  |    -.01567   .0511885    -0.31   0.760    -.1160234    .0846835
                 73  |  -.0163476   .0631829    -0.26   0.796    -.1402158    .1075205
                 75  |  -.0170026   .0864874    -0.20   0.844    -.1865584    .1525532
                 77  |  -.0111413   .1109886    -0.10   0.920    -.2287309    .2064483
                 78  |  -.0029997   .1236291    -0.02   0.981    -.2453706    .2393712
                 80  |  -.0007318   .1475088    -0.00   0.996     -.289918    .2884544
                 82  |   .0058067   .1716208     0.03   0.973    -.3306503    .3422638
                 83  |   .0158354   .1837029     0.09   0.931    -.3443083     .375979
                 85  |     .04142   .2083538     0.20   0.842    -.3670508    .4498909
                 87  |   .0523993   .2330342     0.22   0.822    -.4044568    .5092553
                 88  |   .0938441   .2496481     0.38   0.707     -.395583    .5832712
                     |
              fitted |   5.201776   1.085644     4.79   0.000     3.073405    7.330147
           sq_fitted |  -1.321262   .3415637    -3.87   0.000    -1.990887   -.6516372
               _cons |  -3.307108    .892101    -3.71   0.000    -5.056043   -1.558172
        -------------+----------------------------------------------------------------
             sigma_u |  .40189262
             sigma_e |   .3011033
                 rho |  .64048345   (fraction of variance due to u_i)
        ------------------------------------------------------------------------------
        
        . test sq_fitted
        
         ( 1)  sq_fitted = 0
        
               F(  1,  4709) =   14.96
                    Prob > F =    0.0001
        
        .
        As the -test- outcome (which simply echoes the coefficient p-value reported in the -xtreg, fe- output table) reaches statistical significance, the model is misspecified.
        Kind regards,
        Carlo
        (StataNow 18.5)



        • #5
          Thanks. I ran this and Prob > F = 0.723 for the squared prediction. Two questions:
          1. Can I interpret that to mean the model is unlikely to be misspecified?
          2. If another version of the model had Prob > F = 0.430, does that mean it is not as well specified but still unlikely to be problematic? Just wondering if you can compare models this way.



          • #6
            Doug:
            1) neither test shows evidence that your regressions are misspecified;
            2) I would use the -adjusted Rsq-, when available, to compare two different regressions; a minimal sketch follows.
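
            For instance, assuming the comparison is between two -regress- specifications (which store the adjusted R-squared in e(r2_a)); the variable names below are just placeholders from the earlier posts:

            Code:
            * fit two candidate specifications and compare their adjusted R-squared
            quietly regress registered i.year i.stateid x1-x6 policy1
            display "adj. R-squared, model 1 = " e(r2_a)

            quietly regress registered i.year i.stateid x1-x6 policy1 policy2 policy3
            display "adj. R-squared, model 2 = " e(r2_a)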
            Kind regards,
            Carlo
            (StataNow 18.5)
