Regression with multiple categorical variables

Enrico Azzini

Join Date: Jul 2020

Posts: 79
#1

08 Jan 2022, 08:23

Good morning, I have cross-sectional data and I want to run a multiple regression. My model includes 5 categorical variables: Years, Sex, Maritial_status, Education and Regions. Is it possible to combine some of these variables into a single group of controls? I'm not interested in the separate effect of each of these variables, only in using them as fixed effects, and I'm afraid that adding too many dummy variables separately would make the interpretation of the results confusing.

Code:

local controls Sex Maritial_status Education Regions
reg wage prox1 `controls' i.Years

Last edited by Enrico Azzini; 08 Jan 2022, 08:27. Reason: categorical variables
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#2

08 Jan 2022, 11:47

What variable do you want to regroup? Please give an example of your dataset using dataex.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17603
#3

08 Jan 2022, 11:58

Enrico:
as Jared wisely highlighted, the lack of any example makes replying more difficult.
That said, you may want to consider one of the -egen- functions (e.g., -group()-).
As an aside, I find your approach questionable from a methodological point of view: if you're not interested in some control variables, simply exclude them from the right-hand side of your regression equation.
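
A minimal sketch of the -egen, group()- idea, assuming the nlsw88 example data shipped with Stata stand in for the poster's dataset (race, married, collgrad and tenure are placeholders, not the variables from this thread). It collapses several categorical controls into one cell identifier and absorbs it with -areg-, so the control dummies are never displayed. Note that this amounts to fully interacting the controls, which is a richer specification than entering each one separately.

```stata
* Stand-in data shipped with Stata; variable names are placeholders
sysuse nlsw88, clear

* One identifier per observed combination of the categorical controls
egen long cell = group(race married collgrad)

* Absorb the combined categories: only the coefficient of interest is shown
areg wage tenure, absorb(cell)
```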

Kind regards,
Carlo
(StataNow 18.5)

Enrico Azzini

Join Date: Jul 2020
Posts: 79

08 Jan 2022, 12:14

Sorry, I have the following dataset.
I want to estimate the effect of proxy1_t on retric, which represents the wage earned by that person.
I also want to include Married, age, region, and education to properly specify the model, but I'm not interested in the separate effect of each of these variables. I put them in the model only to better specify the regression. If I don't include them, won't the model suffer from omitted variable bias?
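
One way to keep the controls in the model while reporting only the coefficient of interest is sketched below. It uses the nlsw88 example data shipped with Stata as a stand-in (tenure plays the role of the regressor of interest; all variable names are placeholders) and the built-in -estimates table- with keep(). -testparm- gives a joint test for a block of dummies, should one be wanted.

```stata
* Stand-in data shipped with Stata; variable names are placeholders
sysuse nlsw88, clear

* Controls enter the model but are not the focus
quietly regress wage tenure c.age i.race i.married i.collgrad

* Report only the coefficient of interest
estimates table, keep(tenure) b(%9.3f) se(%9.3f)

* Optional: joint test of one block of control dummies
testparm i.race
```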

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(region age) int retric float(education proxy1_t Sesso Condprof Married)
 1 37  500 3           . 1 1 1
 1 42 1000 3 .0009782822 2 1 1
12 44  800 3           . 2 1 1
 5 45    . 5           . 2 1 1
12 43    . 3           . 1 1 1
19 41  700 1   .01730224 1 1 1
 3 46    . 4   .15106286 1 1 1
 1 32  540 3           . 2 1 0
 8 45  500 6           . 1 1 1
10 23 1350 4           . 1 1 0
11 49  800 4           . 1 1 1
 7 39  950 3           . 2 1 1
16 52  900 2           . 2 1 1
 2 48 2100 6 .0009313878 1 1 1
 1 34 1200 6  .004532125 2 1 1
 1 40  700 3   .00065185 2 1 0
 4 50  770 3   .03750375 2 1 1
 3 52  750 4           . 1 1 1
20 48 1200 2           . 1 1 1
16 35 1000 5           . 1 1 0
13 48    . 1  .006454535 1 1 0
 8 19  250 5 .0011673005 2 1 0
 9 28 1100 3    .4254324 2 1 0
10 57  900 5           . 2 1 1
 3 44    . 3           . 1 1 1
 7 41  500 5           . 2 1 1
 1 37 1450 3           . 1 1 1
11 29    . 3           . 1 1 0
17 46 1000 3           . 1 1 1
12 37    . 6           . 1 1 0
 5 51 1300 6           . 1 1 1
19 60 2000 3           . 1 1 1
 3 36 1600 6           . 2 1 0
 2 38    . 5           . 1 1 0
 1 45    . 6           . 1 1 1
 1 52 1700 5 .0011495574 1 1 1
 9 35 1100 3    .1122975 1 1 0
 5 29    . 3  .000961756 2 1 1
 8 21  600 4           . 2 1 0
11 56  980 4  .002398492 1 1 1
18 49  900 1           . 1 1 1
 8 46 1100 3           . 1 1 1
 8 48  600 2           . 1 1 1
 1 26 1300 3     .179375 1 1 1
 3 42 1300 3           . 1 1 1
 6 45 1250 5           . 1 1 1
 5 34 1020 3    .0396105 2 1 0
10 42  700 3           . 2 1 1
 3 45    . 3           . 1 1 1
15 34  600 3   .01332741 1 1 0
 1 37  350 3           . 2 1 0
 4 20  800 4           . 1 1 0
18 47    . 3           . 1 1 1
10 42  700 2           . 1 1 1
 7 33    . 2           . 1 1 1
 1 31 1600 5           . 1 1 1
 3 33    . 5  .000961756 2 1 1
13 51 1000 5           . 2 1 0
 9 35    . 3           . 1 1 1
19 42 1000 2   .00519818 1 1 0
19 48 1000 5           . 1 1 1
 6 41  570 3           . 1 1 1
 3 37 2300 2    .0591716 1 1 1
 2 26 1100 3 .0011673005 1 1 1
19 49 1540 3           . 1 1 1
17 48 1400 3           . 1 1 1
 4 30 1200 5           . 2 1 1
12 62 1080 3           . 1 1 1
 5 33 1300 6           . 2 1 1
10 35 1400 3           . 1 1 1
 2 28  500 5           . 2 1 1
 4 49 1200 3           . 1 1 1
 3 24 1250 5           . 1 1 0
 3 31 1100 3           . 2 1 0
12 44 1500 5 .0014688977 2 1 1
12 58  700 5           . 2 1 0
 1 51  900 3           . 1 1 0
10 28  800 5  .008932039 2 1 0
 5 45  950 5           . 1 1 1
16 53  860 3  .009070295 2 1 1
12 52 1060 2           . 2 1 1
 1 47  920 5  .002398492 1 1 1
 1 37    . 3           . 2 1 1
 1 42 1420 3           . 1 1 1
 4 25 1480 3           . 1 1 0
 3 51 1100 5           . 1 1 1
 9 38    . 1  .005045526 2 1 0
12 33  600 5 .0008383903 1 1 0
 8 40 1000 1           . 2 1 1
12 35 1900 6 .0008383903 1 1 0
 1 39 1200 3           . 2 1 0
 9 39 1200 3           . 1 1 1
 5 47  800 5           . 1 1 1
10 33  950 1    .7132353 1 1 1
 7 57  300 3           . 2 1 0
 8 27 1300 3           . 1 1 1
 6 44  980 6   .04130435 2 1 1
 3 26 1450 5           . 1 1 1
 3 43 1550 5           . 1 1 1
 6 41 1300 3    .6294156 1 1 0
end
label values education w_all
label def w_all 1 "No qualification", modify
label def w_all 2 "Elementary education", modify
label def w_all 3 "Middle school education", modify
label def w_all 4 "Diploma 2-3 years", modify
label def w_all 5 "Diploma 4-5 years", modify
label def w_all 6 "Degree", modify
label values Sesso x_all
label def x_all 1 "Male", modify
label def x_all 2 "Female", modify
label values Condprof u_all
label def u_all 1 "Occupati", modify

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17603

09 Jan 2022, 04:17

Enrico:
I would go with this code:

Code:

. reg retric proxy1_t c.age##c.age i.education i.Sesso i.Married i.Condprof
note: 1.Condprof omitted because of collinearity.

      Source |       SS           df       MS      Number of obs   =        26
-------------+----------------------------------   F(10, 15)       =      2.21
       Model |  3333033.47        10  333303.347   Prob > F        =    0.0806
    Residual |  2265416.53        15  151027.768   R-squared       =    0.5953
-------------+----------------------------------   Adj R-squared   =    0.3256
       Total |     5598450        25      223938   Root MSE        =    388.62

------------------------------------------------------------------------------------------
                  retric | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------------------+----------------------------------------------------------------
                proxy1_t |   697.0022     488.08     1.43   0.174    -343.3156     1737.32
                     age |    63.0875    80.9472     0.78   0.448    -109.4474    235.6224
                         |
             c.age#c.age |  -.7893141   1.092342    -0.72   0.481    -3.117586    1.538957
                         |
               education |
   Elementary education  |   1227.902   439.6489     2.79   0.014     290.8128    2164.992
Middle school education  |   732.7444   360.9857     2.03   0.060    -36.67831    1502.167
      Diploma 2-3 years  |   591.3479    660.891     0.89   0.385    -817.3079    2000.004
      Diploma 4-5 years  |   813.5435     413.45     1.97   0.068    -67.70436    1694.791
                 Degree  |   1214.937   392.6527     3.09   0.007     378.0176    2051.856
                         |
                   Sesso |
                 Female  |  -295.5228    173.048    -1.71   0.108    -664.3659    73.32039
               1.Married |   370.4006   200.3919     1.85   0.084    -56.72469    797.5258
                         |
                Condprof |
               Occupati  |          0  (omitted)
                   _cons |  -1041.031   1600.838    -0.65   0.525    -4453.135    2371.073
------------------------------------------------------------------------------------------

.

And then run the usual postestimation tests:

checking for heteroskedasticity

Code:

. estat hettest

Breusch–Pagan/Cook–Weisberg test for heteroskedasticity
Assumption: Normal error terms
Variable: Fitted values of retric

H0: Constant variance

    chi2(1) =   1.06
Prob > chi2 = 0.3033

.

checking for misspecification of the functional form of the regressand:

Code:

. predict fitted, xb
(69 missing values generated)

. g sq_fitted=fitted^2
(69 missing values generated)

. reg retric proxy1_t c.age##c.age i.education i.Sesso i.Married i.Condprof fitted sq_fitted
note: c.age#c.age omitted because of collinearity.
note: 1.Condprof omitted because of collinearity.

      Source |       SS           df       MS      Number of obs   =        26
-------------+----------------------------------   F(11, 14)       =      2.32
       Model |  3617104.54        11  328827.685   Prob > F        =    0.0699
    Residual |  1981345.46        14  141524.676   R-squared       =    0.6461
-------------+----------------------------------   Adj R-squared   =    0.3680
       Total |     5598450        25      223938   Root MSE        =     376.2

------------------------------------------------------------------------------------------
                  retric | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------------------+----------------------------------------------------------------
                proxy1_t |  -419.2607   1096.421    -0.38   0.708     -2770.85    1932.328
                     age |   .3840776   12.10963     0.03   0.975     -25.5885    26.35666
                         |
             c.age#c.age |          0  (omitted)
                         |
               education |
   Elementary education  |  -1087.086   1840.533    -0.59   0.564    -5034.636    2860.463
Middle school education  |  -384.1059   997.6089    -0.39   0.706    -2523.764    1755.552
      Diploma 2-3 years  |  -171.2475    680.766    -0.25   0.805    -1631.345     1288.85
      Diploma 4-5 years  |  -474.3915   1073.208    -0.44   0.665    -2776.194    1827.411
                 Degree  |  -923.6652   1761.888    -0.52   0.608     -4702.54     2855.21
                         |
                   Sesso |
                 Female  |   176.0326   466.0397     0.38   0.711    -823.5232    1175.588
               1.Married |  -235.1549   496.5516    -0.47   0.643    -1300.152    829.8422
                         |
                Condprof |
               Occupati  |          0  (omitted)
                  fitted |  -.3248515   1.633758    -0.20   0.845    -3.828914    3.179211
               sq_fitted |   .0009455   .0006674     1.42   0.178    -.0004859    .0023769
                   _cons |    776.184   734.3992     1.06   0.308    -798.9458    2351.314
------------------------------------------------------------------------------------------
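
Stata also ships two built-in versions of this check, sketched here on the nlsw88 example data as a stand-in (variable names are placeholders). -estat ovtest- (Ramsey's RESET) adds powers of the fitted values to the original model and tests them jointly, while -linktest- regresses the outcome on the fitted values and their square only. Neither puts the fitted values themselves next to the original regressors: the fitted values are an exact linear combination of those regressors, which is why Stata drops c.age#c.age in the output above.

```stata
* Stand-in data shipped with Stata; variable names are placeholders
sysuse nlsw88, clear
regress wage tenure c.age##c.age i.race i.married i.collgrad

* Ramsey RESET: joint F test on powers of the fitted values
estat ovtest

* Link test: regresses wage on the prediction (_hat) and its square (_hatsq)
linktest
```

In both tests, a significant result points to a possibly misspecified functional form.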

As usual, this kind of research may suffer from a source of endogeneity (a latent variable) embedded in the residuals, as your predictors do not capture interindividual heterogeneity. Other things being equal, on average smarter persons obtain higher educational degrees (predictor) and negotiate better wages (your regressand). I would discuss this issue with your supervisor/teacher/mentor.
An example of this source of endogeneity is presented and addressed in

https://www.stata.com/bookstore/microeconometrics-stata, pages 177-209.

Kind regards,
Carlo
(StataNow 18.5)

Enrico Azzini

Join Date: Jul 2020

Posts: 79
#6

09 Jan 2022, 10:13

Hi Carlo, thank you for the suggestion. I will look further into the issue of endogeneity as you suggested, thanks!
Enrico Azzini

Join Date: Jul 2020

Posts: 79
#7

19 Jan 2022, 09:13

Hi, why did you use the interaction of age with itself in the model, rather than age alone?
c.age##c.age
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17603
#8

19 Jan 2022, 09:42

Enrico:
because the original idea was to search for potential turning points (i.e., a quadratic relationship between -age- and the regressand), which do not seem to be present in the example based on your excerpt.
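
For a quadratic specification b1*age + b2*age^2, the implied turning point is at age = -b1/(2*b2). With the coefficients in the regression table of the earlier post (63.0875 and -.7893141), that is roughly 40 years, although neither term is statistically significant there. A hedged sketch of how to obtain the turning point with a delta-method standard error via -nlcom-, using the nlsw88 example data as a stand-in (variable names are placeholders, not this thread's data):

```stata
* Stand-in data shipped with Stata; variable names are placeholders
sysuse nlsw88, clear
regress wage c.age##c.age i.race i.married i.collgrad

* Turning point of the quadratic in age, with a delta-method standard error
nlcom (turning_point: -_b[age] / (2*_b[c.age#c.age]))

* Joint test of the linear and squared age terms
test age c.age#c.age
```

A turning point that falls outside the observed age range, or a very wide confidence interval around it, suggests the quadratic term adds little.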

Kind regards,
Carlo
(StataNow 18.5)
Enrico Azzini

Join Date: Jul 2020

Posts: 79
#9

20 Jan 2022, 05:21

Thank you Carlo, very helpful. Actually, when I also include the linear prediction and its square in the regression, all the other variables lose significance.
You also fit the model with these variables to check for functional form misspecification, and if the model is properly specified the linear prediction and its square should not be significant, right? However, I find it strange that when I run the regression controlling for misspecification, variables like education or sex are no longer significant, when it is commonly believed that they have an impact on salary. In interpreting the results, must I rely on statistical validity only, or can I believe that, even if the model suffers from some form of bias, it still points in the correct direction if it estimates that education has a positive effect and sex a negative one?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17603
#10

20 Jan 2022, 06:45

Enrico:
1) as you do not share what you typed and what Stata gave you back, it is difficult to say. As a general rule, if the squared term is not statistically significant, you can get rid of it and re-run the model with the linear term only;
2) you can also check for potential misspecification of your model via a restricted regression:

Code:

reg retric fitted sq_fitted

That said, the recommendation to test for latent-variable-led endogeneity still holds. Possible instruments are the father's and/or mother's education level.
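
A minimal sketch of the instrumental-variable idea on simulated data. Everything here is hypothetical: the variable names (ability, par_educ, educ, lnwage) and the data-generating process are assumptions made purely for illustration, and in real data ability is unobserved and the validity of parental education as an instrument has to be argued, not assumed.

```stata
clear
set seed 12345
set obs 20000

* Simulated (hypothetical) data; the true return to education is 0.10
gen ability  = rnormal()                            // unobserved in practice
gen par_educ = rnormal()                            // instrument: parental education
gen educ     = 12 + par_educ + ability + rnormal()
gen lnwage   = 1 + 0.10*educ + 0.50*ability + rnormal()

* OLS omits ability, so the coefficient on educ is biased upward
regress lnwage educ

* 2SLS using parental education as the instrument for educ
ivregress 2sls lnwage (educ = par_educ)

* Instrument strength and a test of whether educ is endogenous
estat firststage
estat endogenous
```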

Last edited by Carlo Lazzaro; 20 Jan 2022, 06:51.

Kind regards,
Carlo
(StataNow 18.5)

Enrico Azzini

Join Date: Jul 2020
Posts: 79

#11

20 Jan 2022, 07:38

Hi Carlo, here is my regression:

Code:

*5) regression log wage robust standard error with new proxy1_t
quietly reg In_retric newproxy1_t c.age##c.age  i.education i.sex i.Married  i.year if working==1
 
estat hettest
 
eststo: quietly reg In_retric newproxy1_t c.age##c.age i.education i.sex i.Married  i.year if working==1, vce(robust)
 
predict fittednew, xb
g sq_fittednew=fittednew^2

eststo: quietly reg In_retric newproxy1_t c.age##c.age  i.education i.sex i.Married  i.year fittednew sq_fittednew if working==1, vce(robust)
esttab, p compress label nobaselevels interaction(" X ")


*6) regression of log wage with cluster-robust standard errors and new proxy1_t

quietly reg In_retric newproxy1_t c.age##c.age  i.education i.sex i.Married  i.year if working==1
 
estat hettest
 
eststo: quietly reg In_retric newproxy1_t c.age##c.age i.education i.sex i.Married  i.year if working==1, vce(cluster Countryoforign)
 
predict fittednew2, xb
g sq_fittednew2=fittednew2^2
 
eststo: quietly reg In_retric newproxy1_t c.age##c.age i.education i.sex i.Married  i.year fittednew2 sq_fittednew2 if working==1, vce(cluster Countryoforign)
 
esttab, p compress label nobaselevels interaction(" X ")

and the results are:

Code:

. esttab, p compress label nobaselevels interaction(" X ")

--------------------------------------------------------------------
                       (1)          (2)          (3)          (4)  
                 In_retric    In_retric    In_retric    In_retric  
--------------------------------------------------------------------
newproxy1_t         -0.346***   0.00417       -0.346***   0.00417  
                   (0.000)      (0.922)      (0.000)      (0.959)  

ETAM                0.0370***  0.000180       0.0370***  0.000180  
                   (0.000)      (0.469)      (0.000)      (0.783)  

ETAM X ETAM      -0.000319***         0    -0.000319***         0  
                   (0.000)          (.)      (0.000)          (.)  

Elementary edu~n    0.0602**   -0.00301       0.0602     -0.00301  
                   (0.008)      (0.896)      (0.196)      (0.941)  

Middle school ~n     0.265*** -0.000776        0.265*** -0.000776  
                   (0.000)      (0.973)      (0.000)      (0.983)  

Diploma 2-3 ye~s     0.398***   0.00161        0.398***   0.00161  
                   (0.000)      (0.946)      (0.000)      (0.966)  

Diploma 4-5 ye~s     0.464***   0.00313        0.464***   0.00313  
                   (0.000)      (0.898)      (0.000)      (0.927)  

Degree               0.677***    0.0101        0.677***    0.0101  
                   (0.000)      (0.713)      (0.000)      (0.719)  

Female              -0.299***  -0.00695       -0.299***  -0.00695  
                   (0.000)      (0.369)      (0.000)      (0.822)  

Married=1           0.0330***   0.00103       0.0330***   0.00103  
                   (0.000)      (0.696)      (0.000)      (0.879)  

ANNO=2015           0.0111**   0.000236       0.0111***  0.000236  
                   (0.002)      (0.947)      (0.000)      (0.894)  

ANNO=2016           0.0180***  0.000342       0.0180***  0.000342  
                   (0.000)      (0.924)      (0.000)      (0.719)  

ANNO=2017           0.0232***  0.000393       0.0232***  0.000393  
                   (0.000)      (0.913)      (0.000)      (0.921)  

ANNO=2018           0.0251***  0.000452       0.0251***  0.000452  
                   (0.000)      (0.900)      (0.000)      (0.924)  

ANNO=2019           0.0367***  0.000671       0.0367***  0.000671  
                   (0.000)      (0.851)      (0.000)      (0.906)  

ANNO=2020           0.0462***  0.000946       0.0462***  0.000946  
                   (0.000)      (0.795)      (0.000)      (0.856)  

Linear predict~n                  1.672***                          
                                (0.000)                            

sq_fittednew                    -0.0489**                          
                                (0.002)                            

Linear predict~n                                            1.672*  
                                                          (0.010)  

sq_fittednew2                                             -0.0489  
                                                          (0.235)  

Constant             5.799***    -2.316**      5.799***    -2.316  
                   (0.000)      (0.003)      (0.000)      (0.351)  
--------------------------------------------------------------------
Observations        172695       172695       172695       172695  
--------------------------------------------------------------------
p-values in parentheses
* p<0.05, ** p<0.01, *** p<0.001

When I run the model using robust standard errors the squared fitted values are significant, but not when I use clustered standard errors; in both cases, however, all the other coefficients lose their significance when I add the fitted values and their squares. I was wondering whether, in the discussion of the results, I can report as having a significant effect the variables that are significant in the first and third columns, or whether I cannot because the model suffers from misspecification.

Last edited by Enrico Azzini; 20 Jan 2022, 07:48.

Enrico Azzini

Join Date: Jul 2020

Posts: 79
#12

20 Jan 2022, 07:41

I can't control for endogeneity using an instrumental variable. Can I use the command eteffects instead?
Thanks for your help!
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17603
#13

20 Jan 2022, 09:48

Enrico:
1) you do not have to include -fitted- and -sq_fitted- in the regressions you discuss; they're simply used to test for possible misspecification of the functional form of the regressand (in brief, if there's evidence of a non-linear relationship between the regressand and -sq_fitted-, some predictor and/or interaction is missing from the right-hand side of your regression equation);
2) please note that, unlike -xtreg-, -regress- options for non-default standard errors deal with heteroskedasticity (-robust-) and serial correlation of the residuals (-vce(cluster clusterid)-), respectively. Put differently, they are not interchangeable.
3) I'm not familiar with -eteffects- hence I cannot advise you on that.
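
Whatever the choice of vce(), the point estimates from -regress- do not change; only the standard errors do, which matches the table in the earlier post, where columns (1) and (3) share coefficients and differ only in p-values. A small sketch on the nlsw88 example data, with industry as an arbitrary placeholder cluster variable (both are assumptions, not this thread's data):

```stata
* Stand-in data shipped with Stata; variable names are placeholders
sysuse nlsw88, clear

* Same estimation sample for both fits (industry has a few missing values)
regress wage tenure i.race i.collgrad if !missing(industry), vce(robust)
scalar b_robust  = _b[tenure]
scalar se_robust = _se[tenure]

regress wage tenure i.race i.collgrad if !missing(industry), vce(cluster industry)
scalar b_cluster  = _b[tenure]
scalar se_cluster = _se[tenure]

* Identical coefficients, different standard errors
display "coef:  " b_robust "  vs  " b_cluster
display "s.e.:  " se_robust "  vs  " se_cluster
```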

Kind regards,
Carlo
(StataNow 18.5)
Enrico Azzini

Join Date: Jul 2020

Posts: 79
#14

21 Jan 2022, 02:45

Hi Carlo, thanks for your help. With respect to point 2), I read that with reg, vce(cluster clusterid) can be used to deal with heteroskedasticity and serial correlation at the same time.
I would like to ask whether it is possible to use xtset to declare the data as panel data without the time dimension. My units of observation are individuals who were interviewed only once between 2014 and 2020.
I had thought of using Nationality as the panel variable.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17603
#15

21 Jan 2022, 03:01

Enrico:
if, as it seems, you have cross-sectional data:
1) the recommended approach (see the really valuable
https://www.stata.com/bookstore/environmental-econometrics-using-stata,
page 28) for dealing with both heteroskedasticity and autocorrelation is switching from -regress- to -newey- (assuming that your data are cross-sectional and your regressand is continuous);
2) while it's absolutely legal to -xtset- a panel dataset with -panelid- only, I fail to see what you would gain by following this approach with cross-sectional data.
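
To illustrate what -xtset- with a group identifier alone does in a cross-section, here is a hedged sketch on the nlsw88 example data, with industry as a placeholder for a grouping variable such as nationality (an assumption, not this thread's data). The fixed-effects slope from -xtreg, fe- is the same within-group estimate that -areg- returns when it absorbs the group dummies.

```stata
* Stand-in data shipped with Stata; variable names are placeholders
sysuse nlsw88, clear
drop if missing(industry)

* Declare only a panel (group) identifier, no time variable
xtset industry
xtreg wage tenure, fe
scalar b_xt = _b[tenure]

* Same within-group slope from a regression absorbing the group dummies
areg wage tenure, absorb(industry)
display b_xt "  vs  " _b[tenure]
```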

Kind regards,
Carlo
(StataNow 18.5)