
  • Problem with Fixed Effects in Cross-Sectional Data Regression

    Dear Statalist participants,

    I am writing to ask about fixed effects in cross-sectional data regression. I have a model to evaluate the performance of private equity funds, and I want to add fixed-effect dummy variables based on geography, industry, and time period (for example, pre-crisis, crisis, post-crisis). How can I do that in Stata? I have read several papers with cross-sectional models that include not only fixed-effect dummy variables but also group dummy variables based on vintage, geography, and industry, but I did not really understand how to implement this in my regression in Stata. Can anyone assist me?

    Kind regards,
    Firangiz Aghayeva

  • #2
    Firangiz:
    when we deal with cross-sectional data, it is not correct to speak of fixed (or random) effects, which relate to panel data regression (where we have at least two waves of data), as we have one wave of data only.
    What you can do is add a categorical predictor for each of the dimensions you're interested in and see its effect on the variation of the regressand when adjusted for the remaining predictors.
    Something like:
    Code:
    . sysuse auto.dta
    (1978 Automobile Data)
    
    . regress price i.foreign i.rep78
    
          Source |       SS           df       MS      Number of obs   =        69
    -------------+----------------------------------   F(5, 63)        =      0.19
           Model |  8372481.37         5  1674496.27   Prob > F        =    0.9670
        Residual |   568424478        63  9022610.75   R-squared       =    0.0145
    -------------+----------------------------------   Adj R-squared   =   -0.0637
           Total |   576796959        68  8482308.22   Root MSE        =    3003.8
    
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         foreign |
        Foreign  |    36.7572   1010.484     0.04   0.971    -1982.533    2056.048
                 |
           rep78 |
              2  |   1403.125   2374.686     0.59   0.557    -3342.306    6148.556
              3  |   1861.058   2195.967     0.85   0.400    -2527.232    6249.347
              4  |   1488.621   2295.176     0.65   0.519    -3097.921    6075.164
              5  |   1318.426   2452.565     0.54   0.593    -3582.634    6219.485
                 |
           _cons |     4564.5   2123.983     2.15   0.035     320.0579    8808.942
    ------------------------------------------------------------------------------
    
    .
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Dear Mr. Lazzaro,

      Thank you for your quick reply.

      I understand. One quick question: how can I create a group dummy variable in Stata (for instance, based on industry and region categories)?

      Kind regards,
      Firangiz



      • #4
        Firangiz:
        extracting a cross-sectional sample from a panel dataset:
        Code:
        . use "https://www.stata-press.com/data/r16/nlswork.dta"
        (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
        
        . egen wanted = group(ind_code south)
        (349 missing values generated)
        
        . reg ln_wage i.wanted if year==70
        
              Source |       SS           df       MS      Number of obs   =     1,654
        -------------+----------------------------------   F(21, 1632)     =     27.64
               Model |  68.9621455        21  3.28391169   Prob > F        =    0.0000
            Residual |  193.866444     1,632  .118790713   R-squared       =    0.2624
        -------------+----------------------------------   Adj R-squared   =    0.2529
               Total |  262.828589     1,653  .159000962   Root MSE        =    .34466
        
        ------------------------------------------------------------------------------
             ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
              wanted |
                  2  |  -.2403066   .1749217    -1.37   0.170    -.5834014    .1027881
                  5  |   .4498844   .1861378     2.42   0.016     .0847902    .8149786
                  6  |   .5003821   .2224773     2.25   0.025     .0640111    .9367532
                  7  |   .3065661   .1424477     2.15   0.032     .0271665    .5859658
                  8  |   .1791387   .1431401     1.25   0.211     -.101619    .4598964
                  9  |   .5020912   .1481833     3.39   0.001     .2114418    .7927407
                 10  |   .2220843   .1541368     1.44   0.150    -.0802424     .524411
                 11  |   .0881686   .1425464     0.62   0.536    -.1914244    .3677617
                 12  |  -.1128865   .1443001    -0.78   0.434    -.3959193    .1701464
                 13  |   .3815219    .145175     2.63   0.009      .096773    .6662708
                 14  |   .2308109   .1502033     1.54   0.125    -.0638006    .5254224
                 15  |   .3333188   .1541368     2.16   0.031     .0309921    .6356455
                 16  |   .2307913   .1701064     1.36   0.175    -.1028586    .5644413
                 17  |   -.191698   .1455326    -1.32   0.188    -.4771484    .0937524
                 18  |  -.3109592   .1473582    -2.11   0.035    -.5999902   -.0219281
                 19  |    .094099   .1861378     0.51   0.613    -.2709952    .4591932
                 20  |  -.0910608   .2814139    -0.32   0.746    -.6430314    .4609098
                 21  |   .3760486   .1424995     2.64   0.008     .0965474    .6555497
                 22  |   .1668051   .1437998     1.16   0.246    -.1152465    .4488567
                 23  |   .4149318    .149794     2.77   0.006      .121123    .7087405
                 24  |   .3839549   .1533319     2.50   0.012      .083207    .6847029
                     |
               _cons |   1.320147    .140707     9.38   0.000     1.044162    1.596132
        ------------------------------------------------------------------------------
        That said, it's probably more helpful, in a cross-sectional study, to interact -year- with -region- (please see the -fvvarlist- help file for further details).
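        For instance, a minimal sketch of that interaction using the same dataset, with -south- standing in for region (the ## operator requests both main effects and the interaction):
        Code:
        . regress ln_wage i.south##i.year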
        As an aside, please call me Carlo, like all on (and many more off) this forum do. Thanks.
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          The fixed and random effect ESTIMATION methods are commonly used in cross-sectional settings, particularly when one knows geographical location. For example, I may have census data on households and I know the zip code of each household. To control for differences across zip codes, fixed effects estimation using zip code is often used. Statistically, it's like panel data without the natural ordering of the data within a zip code. (Time is the natural ordering in a true panel data set.) One can even get xtreg to do the appropriate estimation by using

          Code:
          xtset zipcode                                          // declare zipcode as the group identifier
          xtreg y x1 ... xK, fe vce(cluster zipcode)             // fixed effects (within) estimation
          xtreg y x1 ... xK z1 ... zJ, re vce(cluster zipcode)   // random effects; z's vary only across zipcodes
          where z1, ..., zJ vary only by zipcode and not by household. One can even use the Mundlak version of the Hausman test to choose between them, although FE will be more desirable from a robustness standpoint.
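          A minimal sketch of that Mundlak test, with hypothetical covariates x1 and x2: add the zipcode-level means of the covariates to the RE regression and test them jointly.
          Code:
          egen x1bar = mean(x1), by(zipcode)     // zipcode-level mean of x1
          egen x2bar = mean(x2), by(zipcode)     // zipcode-level mean of x2
          xtreg y x1 x2 x1bar x2bar, re vce(cluster zipcode)
          test x1bar x2bar                       // rejection favors FE over RE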

          An alternative and very useful command is the community-contributed command -reghdfe-. It can be used for panel data or cross-sectional data.

          If you have many region/industry categories, I would find a way to obtain a unique identifier and then use it in xtreg or reghdfe (a sketch of that route follows the example below). Or, you can construct the dummies and use a (long) regression:

          Code:
          reg y x1 ... xK i.id, vce(cluster id)
          The standard errors on the dummies will be nonsense, though. The standard errors on the xj are fine.
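          For the unique-identifier route, a minimal sketch (y, x1, x2, region, and industry are hypothetical placeholders for your variables):
          Code:
          ssc install reghdfe                            // community-contributed; install once
          egen id = group(region industry)               // one identifier per region-industry cell
          reghdfe y x1 x2, absorb(id) vce(cluster id)    // absorbs the id dummies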



          • #6
            Hi Jeff Wooldridge. Did you mean something like this?

            Code:
            use "https://www.stata-press.com/data/r16/nlswork.dta"
            (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
            
            . rename ind_code zip_code
            
            . xtset zip_code
            
            Panel variable: zip_code (unbalanced)
            
            . xtreg ln_wage age hours, fe vce(cluster zip_code)
            
            Fixed-effects (within) regression               Number of obs     =     28,106
            Group variable: zip_code                        Number of groups  =         12
            
            R-squared:                                      Obs per group:
                 Within  = 0.0682                                         min =         52
                 Between = 0.5401                                         avg =    2,342.2
                 Overall = 0.0837                                         max =      8,459
            
                                                            F(2,11)           =     271.32
            corr(u_i, Xb) = 0.1267                          Prob > F          =     0.0000
            
                                          (Std. err. adjusted for 12 clusters in zip_code)
            ------------------------------------------------------------------------------
                         |               Robust
                 ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                     age |    .017222   .0010558    16.31   0.000     .0148982    .0195458
                   hours |   .0020541   .0008156     2.52   0.029      .000259    .0038493
                   _cons |   1.100761   .0258992    42.50   0.000     1.043757    1.157765
            -------------+----------------------------------------------------------------
                 sigma_u |  .21355508
                 sigma_e |  .42592316
                     rho |  .20089205   (fraction of variance due to u_i)
            ------------------------------------------------------------------------------
            
            . reg ln_wage age hours i.zip_code, vce(cluster zip_code)
            
            Linear regression                               Number of obs     =     28,106
                                                            F(1, 11)          =          .
                                                            Prob > F          =          .
                                                            R-squared         =     0.2074
                                                            Root MSE          =     .42592
            
                                          (Std. err. adjusted for 12 clusters in zip_code)
            ------------------------------------------------------------------------------
                         |               Robust
                 ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            -------------+----------------------------------------------------------------
                     age |    .017222    .001056    16.31   0.000     .0148977    .0195463
                   hours |   .0020541   .0008158     2.52   0.029     .0002586    .0038496
                         |
                zip_code |
                      2  |   .4633736   .0034908   132.74   0.000     .4556904    .4710567
                      3  |    .405033   .0019544   207.24   0.000     .4007314    .4093346
                      4  |   .2861882   .0038659    74.03   0.000     .2776794     .294697
                      5  |   .6141848   .0023495   261.41   0.000     .6090137     .619356
                      6  |   .0949379   .0007282   130.38   0.000     .0933353    .0965406
                      7  |   .4096318   .0017923   228.55   0.000      .405687    .4135767
                      8  |   .2921583   .0014332   203.85   0.000     .2890039    .2953127
                      9  |  -.1239015   .0018778   -65.98   0.000    -.1280345   -.1197686
                     10  |   .2884519   .0030983    93.10   0.000     .2816326    .2952711
                     11  |   .3655057   .0012967   281.88   0.000     .3626518    .3683596
                     12  |   .4918411   .0020961   234.65   0.000     .4872277    .4964546
                         |
                   _cons |   .8100116   .0253592    31.94   0.000     .7541963     .865827
            ------------------------------------------------------------------------------
            
            .



            • #7
              Yes, a couple of points. It looks like ind_code is something like an industry identifier. There are so few that it makes sense to use reg -- and there's no need to change the name. I was thinking of cases where you had many groups and so clustering can make sense. In a situation like the above, you shouldn't cluster. You essentially have a random sample and the variables of interest vary at the individual level. People use the term "industry fixed effects" but it's really just industry dummies. This was part of Carlo's point. Do you know which case you'll be in?
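              In that few-groups case, a minimal sketch would be plain regression with industry dummies and heteroskedasticity-robust (not clustered) standard errors, assuming the nlswork data from #6 before the rename:
              Code:
              reg ln_wage age hours i.ind_code, vce(robust)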



              • #8
                Thanks Jeff Wooldridge for the rapid reply and the insightful comments. However, I am not sure I understood everything correctly.

                "It looks like ind_code is something like an industry identifier. There are so few that it makes sense to use reg -- and there's no need to change the name."
                Yes, ind_code is the industry identifier, and I thought that, just as many individual households share a ZIP code when they are located in the same place, many firms belong to a common industry. Is my analogy correct?

                "I was thinking of cases where you had many groups and so clustering can make sense. In a situation like the above, you shouldn't cluster."
                I am sorry, I didn't understand this. It is true that if we classify firms into industries using some industry classification, we may have some 48-60 industries with firms nested within them. In that case, don't you recommend the clustering option? If not, would the standard errors be robust? Is there a minimum threshold for using the cluster option? For instance, I have seen papers that use INDUSTRY FIXED EFFECTS instead of firm fixed effects and CLUSTER AT THE INDUSTRY LEVEL. Is this not correct practice?

                "You essentially have a random sample and the variables of interest vary at the individual level."
                This applies to all cases involving unit-level analysis (firm level, household level) where both the dependent and independent variables vary at the individual level, doesn't it?

                "People use the term 'industry fixed effects' but it's really just industry dummies. This was part of Carlo's point. Do you know which case you'll be in?"
                Yes, by industry fixed effects I meant industry dummies, but if I use reg instead of xtreg, does the former use the GLS method? I wish I had an answer to the question of whether we should use firm (household) fixed effects or industry (ZIP code) fixed effects in general.



                • #9
                  Dear Jeff Wooldridge, though I haven't read your paper "When Should You Adjust Standard Errors for Clustering", is there something from it that applies to the above?



                  • #10
                    Originally posted by Jeff Wooldridge in #5 (quoted in full above)

                    Dear Prof. Jeff Wooldridge, thank you for your reply. I am in a very similar case, in which I am using census data on individuals. Specifically, I have cross-sectional data, and I know the state codes and years. I have only 10 years, so I could easily use dummies. However, I am interested in state fixed effects too, and because I have 51 states, the methodology you mentioned sounds appealing. I have only one question: is the following output an issue?

                    Panel variable: zip_code (unbalanced)
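                    For reference, a minimal sketch of the setup I describe above (state, year, y, and x1 are hypothetical names for my variables):
                    Code:
                    xtset state                                // states are the groups; no time ordering needed
                    xtreg y x1 i.year, fe vce(cluster state)   // state fixed effects plus year dummies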

                    PS: it is an honor to get your help on this forum, having studied your books for 6 years. Thanks for that.

                    Best,

