
  • Census GEOIDs and Factor Variables

    I have a panel of census tracts, where each is identified by an 11-digit GEOID. I would like to run a regression with fixed effects, e.g.,

    Code:
    reg y X i.geoid
    However, I get the following error message:

    Code:
    geoid:  factor variables may not contain noninteger values
    This is strange, as all GEOIDs are integers in the mathematical sense. I believe the error occurs because Stata restricts the integer data type to [–32767, 32740]. Regardless, I can circumvent the issue with the following trick:

    Code:
    tostring geoid, gen(temp) format("%011.0f")
    encode temp, gen(geoid2)
    drop id geoid
    ren geoid2 geoid
    xtset geoid year, yearly
    This solution, however, causes issues when working with R and Python. Namely, the -geoid- value label is ignored, and the arbitrary integer values are used instead. Thus arises my question: Does anyone have a better solution? Thanks in advance!

  • #2
    No, this is not the source of the error. -xtset- is perfectly happy to work with 11 digit integer values:
    Code:
    . * Example generated by -dataex-. For more info, type help dataex
    . clear
    
    . input double geoid
    
             geoid
      1. 60263366656
      2. 66147524608
      3. 88040579072
      4. 9.00631e+10
      5. 39655878656
      6. 15050482688
      7. 75617173504
      8. 39581945856
      9. 11651555328
     10. 1.29019e+10
     11. end
    
    .
    . xtset geoid
    
    Panel variable: geoid (balanced)
    (The values shown here are not real census geoids; they are just random 11-digit integers.)

    So, I conclude that your data is not what you think it is. Something in the geoid variable is not an integer. You can easily find the offending observations with:
    Code:
    browse if geoid != int(geoid)
    and then figure out how to fix the data.
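    A non-interactive version of that check might look like the sketch below. It assumes the variable is named geoid, as in the thread; the format width is only a display choice. A storage type of float is worth catching here, since a float cannot hold every 11-digit integer exactly.
    Code:
    * Report the storage type; an 11-digit ID needs double, not float
    describe geoid
    * Display all 11 digits instead of values like 9.00631e+10
    format geoid %11.0f
    * Count and list any observations whose geoid is not a whole number
    count if geoid != int(geoid)
    list geoid if geoid != int(geoid), noobs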

    Added: I responded to your post as if the problem had arisen with -xtset-. But it actually arose when you tried to use i.geoid. With regard to this, you are correct, factor variable notation will not work with 11 digit numbers (or things that cannot be held within an -int- data storage type). Also, though probably not an issue here, factor variable notation can only be applied to non-negative integer variables. I do not think there is any solution within Stata to your dilemma that preserves compatibility with R and Python. However, instead of -regress Y X i.geoid- you can instead do
    Code:
    xtset geoid
    xtreg y x, fe
    to get the same results. This will work with your original, untransformed, 11 digit geoid variable.
    Last edited by Clyde Schechter; 25 Sep 2023, 09:53.



    • #3
      Thanks @Clyde for your response.

      To clarify, I can use -xtset- without any issue. The error occurs when I try to run the regression. Modifying the example code you generously provided:

      Code:
      . clear
      
      .
      . input float geoid
      
               geoid
        1. 60263366656
        2. 66147524608
        3. 88040579072
        4. 9.00631e+10
        5. 39655878656
        6. 15050482688
        7. 75617173504
        8. 39581945856
        9. 11651555328
       10. 1.29019e+10
       11. end
      
      .
      . expand 4
      (30 observations created)
      
      . bysort geoid: egen year = seq(), from(2000) to(2004)
      
      .
      . xtset geoid year, yearly
      
      Panel variable: geoid (strongly balanced)
       Time variable: year, 2000 to 2003
               Delta: 1 year
      
      .
      . reg year i.geoid
      geoid:  factor variables may not contain noninteger values
      r(452);
      Do you have any insight into the cause of the error in the regression, if not what I mentioned in the original post?

      Update: I just saw your addendum. Thanks again!
      Last edited by Noah Blake Smith; 25 Sep 2023, 10:08.



      • #4
        Actually, using the method @Clyde recommended, I encountered an error when I add population weights to the regression:

        Code:
        . xtreg y X i.year [aweight = pop], fe
        weight must be constant within geoid
        Is this requirement for constant weights a limitation of Stata? Or are there econometric/statistical reasons for this? (I have seen other threads on this error, but I haven't found an answer explaining the reason behind it.)

        For reference, I can run the same regression using my transformed GEOID variable without any issue:

        Code:
        reg y X i.geoid i.year [aweight = pop]
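
        A variant of that transformation, sketched here as one possible approach rather than a tested fix, keeps the original 11-digit geoid in the dataset and adds a compact integer index for factor-variable notation. The name geoid_idx is hypothetical, and the sketch assumes geoid is stored as a double so that no digits are lost. R and Python would then still see the real identifier in geoid.
        Code:
        * Build a consecutive integer index (1, 2, ...) from the original ID;
        * geoid itself is left untouched for export to R or Python
        egen long geoid_idx = group(geoid)
        xtset geoid_idx year, yearly
        * Same regression as above, using the index for the fixed effects
        reg y X i.geoid_idx i.year [aweight = pop]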



        • #5
          There is, as far as I know, no statistical reason why the weights must be constant within the panels. There may be an econometric reason why varying weights within panel is problematic--I wouldn't know, it's not my discipline. I think that the reason that Stata requires this in -xtreg, fe- is that its method of handling the panel fixed effects (within-panel demeaning) would be complicated to implement without this restriction. To get around this, you can use Sergio Correia's -reghdfe- instead of -xtreg, fe-.

          Code:
          reghdfe Y X [aweight = pop], absorb(geoid)
          This will work with the original 11 digit geoid variable.

          -reghdfe- is available from SSC.
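
          If -reghdfe- is not yet installed, a minimal setup might look like the sketch below. It assumes an internet connection and that the installed release depends on the -ftools- package (true of recent versions, as far as I know). Adding year to absorb() mirrors the i.year term from #4 and is an assumption about the intended model.
          Code:
          * Install the dependency first, then reghdfe itself
          ssc install ftools
          ssc install reghdfe
          * Absorb tract and year fixed effects; weights may vary within geoid
          reghdfe y X [aweight = pop], absorb(geoid year)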

