  • failure to get F test when running xtreg with clustering on a group (more predictors than groups)

    My question concerns what to do when clustering in -xtreg- leaves too few degrees of freedom for a model with many variables. (Using Stata/MP 15.1)

    I'm running a cross-sectional time-series analysis of individual responses in the November Supplement to the Current Population Survey (US Census/Dept of Labor). The November Supplement, in even-numbered years, asks a few questions about voter registration and voting.

    My dependent variable is registered ("are you registered to vote?" 0=no; 1=yes). The individual-level (or household-level) variables are categorical and fairly standard in the extensive literature: gender, age groups, last level of schooling, income quartile, race/ethnicity, and time at current address. Other than gender, each has four to six categories. The data set contains all 50 states plus DC (treated as a state) and several years. The sample size is >470,000 adult citizens. I'm using -xtreg- even with a binary dependent variable because registration rates are about 60-80%, depending on sub-population, so I'm taking the Angrist and Pischke attitude that OLS is no worse than logistic. Plus, running -xtlogit- and then requesting -margins-, even on a strong multicore computer, takes hours (i.e., overnight) with this sample size.

    However, I wish to interact most of these level-one variables with one or more of three state-level policy variables (also binary). Year and state are included in -xtreg- in the usual way:

    Code:
    xtreg registered i.year x1-x6 policy1 policy2 policy3, i(stateid) cluster(stateid) fe
    The problem arises when interacting one policy variable with the individual-level variables while clustering on states: Stata won't report an overall F test. After reading -help j_robustsingular-, I see the problem is that I run out of degrees of freedom.
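
    For reference, here is a minimal sketch of the kind of interacted specification that triggers the problem. It reuses the placeholder names from the command above (x1-x6, policy1-policy3, stateid); the -xtset- and factor-variable syntax is only illustrative, not the exact model I ran:

    Code:
    * declare the panel once, then add the policy-by-demographic interactions
    xtset stateid
    xtreg registered i.year i.x1##i.policy1 i.x2##i.policy1 i.x3##i.policy1 ///
        i.x4##i.policy1 i.x5##i.policy1 i.x6##i.policy1 ///
        i.policy2 i.policy3, fe vce(cluster stateid)
    * each ## term adds (number of categories - 1) interaction coefficients,
    * so the coefficients to be tested quickly exceed the 51 state clusters
    * and the overall F test cannot be computed (see -help j_robustsingular-)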

    What are my options? Here are those I thought of:

    1. Reduce categories in the variables? I would rather not drop or condense any, as the goal is to get at the impact of policy on demographic groups. I could convert education and income into a scale using -alpha-, but that is hard to explain to a lay audience. The effect of the policies on age seems to hit those in the middle range, so I'd prefer not to use a continuous variable.
    2. Run separate models with only a few batches of interactions at a time? It seems the policies do interact with nearly all the variables (but not all categories).
    3. Determine whether the standard errors are trustworthy even without the F test "working"? If I run the model as -reg, robust- with state ids included as indicators, the change in p-values is between .01 and .1 for most interactions, and F(95, 472633) = 780.86. Is that reason to run the model as -xtreg, cluster- and rely on the reported clustered standard errors? (A rough sketch of this comparison follows the list.)
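
    To be concrete, here is a rough sketch of the comparison in option 3, again using the placeholder variable names from the command above rather than my full set of interactions:

    Code:
    * pooled OLS with state indicators and heteroskedasticity-robust SEs
    reg registered i.year i.stateid x1-x6 policy1 policy2 policy3, vce(robust)
    estimates store ols_robust

    * the same specification as a fixed-effects model with state-clustered SEs
    xtset stateid
    xtreg registered i.year x1-x6 policy1 policy2 policy3, fe vce(cluster stateid)
    estimates store fe_cluster

    * compare coefficients and standard errors side by side
    estimates table ols_robust fe_cluster, se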

    Are there other solutions or things to be aware of?

    Thank you.
    Last edited by Doug Hess; 09 Jan 2022, 20:52.

  • #2
    Doug:
    this happens frequently.
    I would check whether the regression suffers from misspecification or other issues.
    The fact that the F-test is not reported is basically negligible.
    Kind regards,
    Carlo
    (StataNow 18.5)



    • #3
      Originally posted by Carlo Lazzaro
      Doug:
      this happens frequently.
      I would check whether the regression suffers from misspecification or other issues.
      The fact that the F-test is not reported is basically negligible.
      Thanks, Carlo Lazzaro. I don't think it's a specification problem; it's that the number of clusters is less than the number of predictors.
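
      In case it helps anyone else hitting this, a quick way to see the imbalance after the clustered -xtreg- is to compare the stored model degrees of freedom with the cluster count; the lines below are only a sketch:

      Code:
      * run immediately after -xtreg ..., fe vce(cluster stateid)-
      display "model df = " e(df_m)
      display "clusters = " e(N_clust)
      * the overall F test is not reported when the coefficients to be tested
      * exceed what the cluster-robust VCE can support, roughly the number of
      * clusters minus one (see -help j_robustsingular-)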



      • #4
        Doug:
        nothing to worry about, then.
        Just to be clearer, I recommended checking for model misspecification independently of whether the F-test appears (its absence has no bearing on that issue).
        You can use a procedure that heavily draws upon the -linktest- methodology, as in the following toy example (where the -xtreg, fe- regression is deliberately misspecified):
        Code:
        . use "https://www.stata-press.com/data/r17/nlswork.dta"
        (National Longitudinal Survey of Young Women, 14-24 years old in 1968)
        
        . xtreg ln_wage c.age##c.age i.year, fe vce(cluster idcode)
        
        Fixed-effects (within) regression               Number of obs     =     28,510
        Group variable: idcode                          Number of groups  =      4,710
        
        R-squared:                                      Obs per group:
             Within  = 0.1162                                         min =          1
             Between = 0.1078                                         avg =        6.1
             Overall = 0.0932                                         max =         15
        
                                                        F(16,4709)        =      79.11
        corr(u_i, Xb) = 0.0613                          Prob > F          =     0.0000
        
                                     (Std. err. adjusted for 4,710 clusters in idcode)
        ------------------------------------------------------------------------------
                     |               Robust
             ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
        -------------+----------------------------------------------------------------
                 age |   .0728746    .013687     5.32   0.000     .0460416    .0997075
                     |
         c.age#c.age |  -.0010113   .0001076    -9.40   0.000    -.0012224   -.0008003
                     |
                year |
                 69  |   .0647054   .0155249     4.17   0.000     .0342693    .0951415
                 70  |   .0284423   .0264639     1.07   0.283    -.0234395     .080324
                 71  |   .0579959   .0384111     1.51   0.131    -.0173078    .1332996
                 72  |   .0510671   .0502675     1.02   0.310    -.0474808     .149615
                 73  |   .0424104   .0624924     0.68   0.497    -.0801038    .1649247
                 75  |   .0151376    .086228     0.18   0.861    -.1539096    .1841848
                 77  |   .0340933   .1106841     0.31   0.758    -.1828994     .251086
                 78  |   .0537334   .1232232     0.44   0.663    -.1878417    .2953084
                 80  |   .0369475   .1473725     0.25   0.802    -.2519716    .3258667
                 82  |   .0391687   .1715621     0.23   0.819    -.2971733    .3755108
                 83  |    .058766   .1836086     0.32   0.749    -.3011928    .4187249
                 85  |   .1042758   .2080199     0.50   0.616    -.3035406    .5120922
                 87  |   .1242272   .2327328     0.53   0.594    -.3320379    .5804922
                 88  |   .1904977   .2486083     0.77   0.444    -.2968909    .6778863
                     |
               _cons |   .3937532   .2469015     1.59   0.111    -.0902893    .8777957
        -------------+----------------------------------------------------------------
             sigma_u |  .40275174
             sigma_e |  .30127563
                 rho |  .64120306   (fraction of variance due to u_i)
        ------------------------------------------------------------------------------
        
        . predict fitted, xb
        (24 missing values generated)
        
        . gen sq_fitted=fitted^2
        (24 missing values generated)
        
        . xtreg ln_wage c.age##c.age i.year fitted sq_fitted , fe vce(cluster idcode)
        note: c.age#c.age omitted because of collinearity.
        
        Fixed-effects (within) regression               Number of obs     =     28,510
        Group variable: idcode                          Number of groups  =      4,710
        
        R-squared:                                      Obs per group:
             Within  = 0.1173                                         min =          1
             Between = 0.1121                                         avg =        6.1
             Overall = 0.0952                                         max =         15
        
                                                        F(17,4709)        =      76.35
        corr(u_i, Xb) = 0.0636                          Prob > F          =     0.0000
        
                                     (Std. err. adjusted for 4,710 clusters in idcode)
        ------------------------------------------------------------------------------
                     |               Robust
             ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
        -------------+----------------------------------------------------------------
                 age |  -.0004375   .0123334    -0.04   0.972    -.0246168    .0237418
                     |
         c.age#c.age |          0  (omitted)
                     |
                year |
                 69  |   -.016534   .0175908    -0.94   0.347    -.0510202    .0179523
                 70  |  -.0127288   .0270511    -0.47   0.638    -.0657616    .0403039
                 71  |  -.0164466   .0396213    -0.42   0.678    -.0941229    .0612297
                 72  |    -.01567   .0511885    -0.31   0.760    -.1160234    .0846835
                 73  |  -.0163476   .0631829    -0.26   0.796    -.1402158    .1075205
                 75  |  -.0170026   .0864874    -0.20   0.844    -.1865584    .1525532
                 77  |  -.0111413   .1109886    -0.10   0.920    -.2287309    .2064483
                 78  |  -.0029997   .1236291    -0.02   0.981    -.2453706    .2393712
                 80  |  -.0007318   .1475088    -0.00   0.996     -.289918    .2884544
                 82  |   .0058067   .1716208     0.03   0.973    -.3306503    .3422638
                 83  |   .0158354   .1837029     0.09   0.931    -.3443083     .375979
                 85  |     .04142   .2083538     0.20   0.842    -.3670508    .4498909
                 87  |   .0523993   .2330342     0.22   0.822    -.4044568    .5092553
                 88  |   .0938441   .2496481     0.38   0.707     -.395583    .5832712
                     |
              fitted |   5.201776   1.085644     4.79   0.000     3.073405    7.330147
           sq_fitted |  -1.321262   .3415637    -3.87   0.000    -1.990887   -.6516372
               _cons |  -3.307108    .892101    -3.71   0.000    -5.056043   -1.558172
        -------------+----------------------------------------------------------------
             sigma_u |  .40189262
             sigma_e |   .3011033
                 rho |  .64048345   (fraction of variance due to u_i)
        ------------------------------------------------------------------------------
        
        . test sq_fitted
        
         ( 1)  sq_fitted = 0
        
               F(  1,  4709) =   14.96
                    Prob > F =    0.0001
        
        .
        As the -test- outcome (which simply echoes the coefficient p-value reported in the -xtreg, fe- output table) reaches statistical significance, the model is misspecified.
        Kind regards,
        Carlo
        (StataNow 18.5)



        • #5
          Thanks. I ran this and Prob > F = 0.723 for the squared prediction. Two questions:
          1. Can I interpret that to mean the model is unlikely to be misspecified?
          2. If another version of the model had Prob > F = 0.430, does that mean it is not as well specified but still unlikely to be problematic? Just wondering if you can compare models this way.



          • #6
            Doug:
            1) neither test shows evidence that your regressions are misspecified;
            2) I would use the -adjusted Rsq-, when available, to compare two different regressions; a minimal sketch follows.
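
            For instance, assuming the comparison is between two -regress- specifications (which store the adjusted R-squared in e(r2_a)); the variable names below are just placeholders from the earlier posts:

            Code:
            * fit two candidate specifications and compare their adjusted R-squared
            quietly regress registered i.year i.stateid x1-x6 policy1
            display "adj. R-squared, model 1 = " e(r2_a)

            quietly regress registered i.year i.stateid x1-x6 policy1 policy2 policy3
            display "adj. R-squared, model 2 = " e(r2_a)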
            Kind regards,
            Carlo
            (StataNow 18.5)
