
  • Problem with Fixed Effects in Cross-Sectional Data Regression

    Dear Statalist participants,

    I am writing to ask about fixed effects in cross-sectional data regression. I have a model to evaluate the performance of private equity funds, and I want to add fixed-effect dummy variables based on geography, industry, and time period (for example, pre-crisis, crisis, post-crisis). How can I do that in Stata? I have read several papers with cross-sectional models that include not only fixed-effect dummy variables but also group dummy variables based on vintage, geography, and industry, but I did not really understand how to implement this in my regression in Stata. Can anyone assist me?

    Kind regards,
    Firangiz Aghayeva

  • #2
    Firangiz:
    when we deal with cross-sectional data, it is not correct to speak of fixed (or random) effects, which relate to panel data regression (where we have at least two waves of data), as we have one wave of data only.
    What you can do is add a categorical predictor for each of the dimensions you're interested in and see its effect on the variation of the regressand when adjusted for the remaining predictors.
    Something like:
    Code:
    . sysuse auto.dta
    (1978 Automobile Data)
    
    . regress price i.foreign i.rep78
    
          Source |       SS           df       MS      Number of obs   =        69
    -------------+----------------------------------   F(5, 63)        =      0.19
           Model |  8372481.37         5  1674496.27   Prob > F        =    0.9670
        Residual |   568424478        63  9022610.75   R-squared       =    0.0145
    -------------+----------------------------------   Adj R-squared   =   -0.0637
           Total |   576796959        68  8482308.22   Root MSE        =    3003.8
    
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         foreign |
        Foreign  |    36.7572   1010.484     0.04   0.971    -1982.533    2056.048
                 |
           rep78 |
              2  |   1403.125   2374.686     0.59   0.557    -3342.306    6148.556
              3  |   1861.058   2195.967     0.85   0.400    -2527.232    6249.347
              4  |   1488.621   2295.176     0.65   0.519    -3097.921    6075.164
              5  |   1318.426   2452.565     0.54   0.593    -3582.634    6219.485
                 |
           _cons |     4564.5   2123.983     2.15   0.035     320.0579    8808.942
    ------------------------------------------------------------------------------
    
    .
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Dear Mr. Lazzaro,

      Thank you for your quick reply.

      I understand. One quick question: how can I create a group dummy variable in Stata (for instance, based on industry and region categories)?

      Kind regards,
      Firangiz



      • #4
        Firangiz:
        extracting a cross-sectional sample from a panel dataset:
        Code:
        . use "https://www.stata-press.com/data/r16/nlswork.dta"
        (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
        
        . egen wanted = group(ind_code south)
        (349 missing values generated)
        
        . reg ln_wage i.wanted if year==70
        
              Source |       SS           df       MS      Number of obs   =     1,654
        -------------+----------------------------------   F(21, 1632)     =     27.64
               Model |  68.9621455        21  3.28391169   Prob > F        =    0.0000
            Residual |  193.866444     1,632  .118790713   R-squared       =    0.2624
        -------------+----------------------------------   Adj R-squared   =    0.2529
               Total |  262.828589     1,653  .159000962   Root MSE        =    .34466
        
        ------------------------------------------------------------------------------
             ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
              wanted |
                  2  |  -.2403066   .1749217    -1.37   0.170    -.5834014    .1027881
                  5  |   .4498844   .1861378     2.42   0.016     .0847902    .8149786
                  6  |   .5003821   .2224773     2.25   0.025     .0640111    .9367532
                  7  |   .3065661   .1424477     2.15   0.032     .0271665    .5859658
                  8  |   .1791387   .1431401     1.25   0.211     -.101619    .4598964
                  9  |   .5020912   .1481833     3.39   0.001     .2114418    .7927407
                 10  |   .2220843   .1541368     1.44   0.150    -.0802424     .524411
                 11  |   .0881686   .1425464     0.62   0.536    -.1914244    .3677617
                 12  |  -.1128865   .1443001    -0.78   0.434    -.3959193    .1701464
                 13  |   .3815219    .145175     2.63   0.009      .096773    .6662708
                 14  |   .2308109   .1502033     1.54   0.125    -.0638006    .5254224
                 15  |   .3333188   .1541368     2.16   0.031     .0309921    .6356455
                 16  |   .2307913   .1701064     1.36   0.175    -.1028586    .5644413
                 17  |   -.191698   .1455326    -1.32   0.188    -.4771484    .0937524
                 18  |  -.3109592   .1473582    -2.11   0.035    -.5999902   -.0219281
                 19  |    .094099   .1861378     0.51   0.613    -.2709952    .4591932
                 20  |  -.0910608   .2814139    -0.32   0.746    -.6430314    .4609098
                 21  |   .3760486   .1424995     2.64   0.008     .0965474    .6555497
                 22  |   .1668051   .1437998     1.16   0.246    -.1152465    .4488567
                 23  |   .4149318    .149794     2.77   0.006      .121123    .7087405
                 24  |   .3839549   .1533319     2.50   0.012      .083207    .6847029
                     |
               _cons |   1.320147    .140707     9.38   0.000     1.044162    1.596132
        ------------------------------------------------------------------------------
        That said, it's probably more helpful, in a cross-sectional study, to interact -year- with -region- (please see the -fvvarlist- help file for further details).
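        For instance, a minimal sketch of that interaction using the same dataset, with -south- standing in for region (the ## operator requests both main effects and the interaction):
        Code:
        . regress ln_wage i.south##i.year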
        As an aside, please call me Carlo, like all on (and many more off) this forum do. Thanks.
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          The fixed and random effect ESTIMATION methods are commonly used in cross-sectional settings, particularly when one knows geographical location. For example, I may have census data on households and I know the zip code of each household. To control for differences across zip codes, fixed effects estimation using zip code is often used. Statistically, it's like panel data without the natural ordering of the data within a zip code. (Time is the natural ordering in a true panel data set.) One can even get xtreg to do the appropriate estimation by using

          Code:
          xtset zipcode                                          // declare zipcode as the group identifier
          xtreg y x1 ... xK, fe vce(cluster zipcode)             // fixed effects (within) estimation
          xtreg y x1 ... xK z1 ... zJ, re vce(cluster zipcode)   // random effects; z's vary only across zipcodes
          where z1, ..., zJ vary only by zipcode and not by household. One can even use the Mundlak version of the Hausman test to choose between them, although FE will be more desirable from a robustness standpoint.
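          A minimal sketch of that Mundlak test, with hypothetical covariates x1 and x2: add the zipcode-level means of the covariates to the RE regression and test them jointly.
          Code:
          egen x1bar = mean(x1), by(zipcode)     // zipcode-level mean of x1
          egen x2bar = mean(x2), by(zipcode)     // zipcode-level mean of x2
          xtreg y x1 x2 x1bar x2bar, re vce(cluster zipcode)
          test x1bar x2bar                       // rejection favors FE over RE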

          An alternative and very useful command is the community-contributed command -reghdfe-. It can be used for panel data or cross-sectional data.

          If you have many region/industry categories, I would find a way to obtain a unique identifier and then use it in xtreg or reghdfe (a sketch of that route follows the example below). Or, you can construct the dummies and use a (long) regression:

          Code:
          reg y x1 ... xK i.id, vce(cluster id)
          The standard errors on the dummies will be nonsense, though. The standard errors on the xj are fine.
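          For the unique-identifier route, a minimal sketch (y, x1, x2, region, and industry are hypothetical placeholders for your variables):
          Code:
          ssc install reghdfe                            // community-contributed; install once
          egen id = group(region industry)               // one identifier per region-industry cell
          reghdfe y x1 x2, absorb(id) vce(cluster id)    // absorbs the id dummies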



          • #6
            Hi Jeff Wooldridge. Did you mean something like this?

            Code:
            use "https://www.stata-press.com/data/r16/nlswork.dta"
            (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
            
            . rename ind_code zip_code
            
            . xtset zip_code
            
            Panel variable: zip_code (unbalanced)
            
            . xtreg ln_wage age hours, fe vce(cluster zip_code)
            
            Fixed-effects (within) regression               Number of obs     =     28,106
            Group variable: zip_code                        Number of groups  =         12
            
            R-squared:                                      Obs per group:
                 Within  = 0.0682                                         min =         52
                 Between = 0.5401                                         avg =    2,342.2
                 Overall = 0.0837                                         max =      8,459
            
                                                            F(2,11)           =     271.32
            corr(u_i, Xb) = 0.1267                          Prob > F          =     0.0000
            
                                          (Std. err. adjusted for 12 clusters in zip_code)
            ------------------------------------------------------------------------------
                         |               Robust
                 ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                     age |    .017222   .0010558    16.31   0.000     .0148982    .0195458
                   hours |   .0020541   .0008156     2.52   0.029      .000259    .0038493
                   _cons |   1.100761   .0258992    42.50   0.000     1.043757    1.157765
            -------------+----------------------------------------------------------------
                 sigma_u |  .21355508
                 sigma_e |  .42592316
                     rho |  .20089205   (fraction of variance due to u_i)
            ------------------------------------------------------------------------------
            
            . reg ln_wage age hours i.zip_code, vce(cluster zip_code)
            
            Linear regression                               Number of obs     =     28,106
                                                            F(1, 11)          =          .
                                                            Prob > F          =          .
                                                            R-squared         =     0.2074
                                                            Root MSE          =     .42592
            
                                          (Std. err. adjusted for 12 clusters in zip_code)
            ------------------------------------------------------------------------------
                         |               Robust
                 ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                     age |    .017222    .001056    16.31   0.000     .0148977    .0195463
                   hours |   .0020541   .0008158     2.52   0.029     .0002586    .0038496
                         |
                zip_code |
                      2  |   .4633736   .0034908   132.74   0.000     .4556904    .4710567
                      3  |    .405033   .0019544   207.24   0.000     .4007314    .4093346
                      4  |   .2861882   .0038659    74.03   0.000     .2776794     .294697
                      5  |   .6141848   .0023495   261.41   0.000     .6090137     .619356
                      6  |   .0949379   .0007282   130.38   0.000     .0933353    .0965406
                      7  |   .4096318   .0017923   228.55   0.000      .405687    .4135767
                      8  |   .2921583   .0014332   203.85   0.000     .2890039    .2953127
                      9  |  -.1239015   .0018778   -65.98   0.000    -.1280345   -.1197686
                     10  |   .2884519   .0030983    93.10   0.000     .2816326    .2952711
                     11  |   .3655057   .0012967   281.88   0.000     .3626518    .3683596
                     12  |   .4918411   .0020961   234.65   0.000     .4872277    .4964546
                         |
                   _cons |   .8100116   .0253592    31.94   0.000     .7541963     .865827
            ------------------------------------------------------------------------------
            
            .



            • #7
              Yes, a couple of points. It looks like ind_code is something like an industry identifier. There are so few that it makes sense to use reg -- and there's no need to change the name. I was thinking of cases where you had many groups and so clustering can make sense. In a situation like the above, you shouldn't cluster. You essentially have a random sample and the variables of interest vary at the individual level. People use the term "industry fixed effects" but it's really just industry dummies. This was part of Carlo's point. Do you know which case you'll be in?
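              In that few-groups case, a minimal sketch would be plain regression with industry dummies and heteroskedasticity-robust (not clustered) standard errors, assuming the nlswork data from #6 before the rename:
              Code:
              reg ln_wage age hours i.ind_code, vce(robust)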



              • #8
                Thanks Jeff Wooldridge for the rapid reply and the insightful comments. However, I am not sure I understood everything correctly.

                "It looks like ind_code is something like an industry identifier. There are so few that it makes sense to use reg -- and there's no need to change the name."
                Yes, ind_code is the industry identifier, and I thought that, just as many individual households share a ZIP code when they are located in the same place, many firms belong to a common industry. Is my analogy correct?

                "I was thinking of cases where you had many groups and so clustering can make sense. In a situation like the above, you shouldn't cluster."
                I am sorry, I didn't understand this. It is true that if we classify firms into industries using some industry classification, we may have some 48-60 industries with firms nested within them. In that case, don't you recommend the clustering option? If not, would the standard errors be robust? Is there a minimum threshold for using the cluster option? For instance, I have seen papers that use INDUSTRY FIXED EFFECTS instead of firm fixed effects and CLUSTER AT THE INDUSTRY LEVEL. Is this not correct practice?

                "You essentially have a random sample and the variables of interest vary at the individual level."
                This applies to all cases involving unit-level analysis (firm level, household level) where both the dependent and independent variables vary at the individual level, doesn't it?

                "People use the term 'industry fixed effects' but it's really just industry dummies. This was part of Carlo's point. Do you know which case you'll be in?"
                Yes, by industry fixed effects I meant industry dummies, but if I use reg instead of xtreg, does the former use the GLS method? I wish I had an answer to the question of whether we should use firm (household) fixed effects or industry (ZIP code) fixed effects in general.



                • #9
                  Dear Jeff Wooldridge, though I haven't read your paper "When Should You Adjust Standard Errors for Clustering", is there something from it that applies to the above?



                  • #10
                    Originally posted by Jeff Wooldridge in #5 (quoted in full above)

                    Dear Prof. Jeff Wooldridge, thank you for your reply. I am in a very similar case, in which I am using census data on individuals. Specifically, I have cross-sectional data, and I know the state codes and years. I have only 10 years, so I could easily use dummies. However, I am interested in state fixed effects too, and because I have 51 states, the methodology you mentioned sounds appealing. I have only one question: is the following output an issue?

                    Panel variable: zip_code (unbalanced)
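                    For reference, a minimal sketch of the setup I describe above (state, year, y, and x1 are hypothetical names for my variables):
                    Code:
                    xtset state                                // states are the groups; no time ordering needed
                    xtreg y x1 i.year, fe vce(cluster state)   // state fixed effects plus year dummies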

                    PS: it is an honor to get your help on this forum, having studied your books for 6 years. Thanks for that.

                    Best,

