  • Regression with multiple categorical variables

    Good morning. I have cross-sectional data and I want to run a multiple regression. My model includes 5 categorical variables: Years, Sex, Maritial_status, Education and Regions. Is it possible to regroup some of these variables into a single group of controls? I'm not interested in the separate effect of each of these variables, only in using them as fixed effects, and I'm afraid that adding too many dummy variables separately would make the interpretation of the results confusing.
    Code:
     local controls Sex Maritial_status Education Regions
    reg wage prox1 `controls' i.Years
    Last edited by Enrico Azzini; 08 Jan 2022, 08:27. Reason: categorical variables

  • #2
    What variable do you want to regroup? Please give an example of your dataset using dataex.


    • #3
      Enrico:
      as Jared wisely highlighted, the lack of any example makes replying more difficult.
      That said, you may want to consider one of the -egen- functions (e.g., -group-).
      As an aside, I find your approach questionable from a methodological point of view: if you're not interested in some control variables, simply exclude them from the right-hand side of your regression equation.
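
A minimal sketch of both ideas, run on Stata's bundled nlsw88 data rather than on the poster's dataset, so every variable name here is a stand-in. It assumes the controls are categorical: entered without the i. prefix, a variable such as Regions would be treated as continuous. The local macro keeps the command short, -testparm- tests a whole block of dummies jointly, and -egen, group()- builds one identifier per observed combination of categories.

```stata
* Illustration on the bundled nlsw88 data, not the poster's dataset
sysuse nlsw88, clear

* Collect the categorical controls once; i. makes each enter as a set of dummies
local controls i.race i.married i.industry i.occupation

regress wage ttl_exp `controls'

* Joint test of one block of dummies instead of reading each coefficient
testparm i.industry

* One identifier per observed combination of race and marital status
egen cell = group(race married), label
tabulate cell
```

Because the local macro exists only within one run, the lines should be executed together, for example from a single do-file.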

      Kind regards,
      Carlo
      (StataNow 18.5)


      • #4
        Sorry, I have the following dataset.
        I want to estimate the effect of proxy1_t on retric, which represents the wage earned by that person.
        I also want to include Married, age, region, and education to properly define the model, but I'm not interested in the separate effect of each of these variables. I put them in the model only to better specify the regression. If I leave them out, won't the model suffer from omitted variable bias?
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input byte(region age) int retric float(education proxy1_t Sesso Condprof Married)
         1 37  500 3           . 1 1 1
         1 42 1000 3 .0009782822 2 1 1
        12 44  800 3           . 2 1 1
         5 45    . 5           . 2 1 1
        12 43    . 3           . 1 1 1
        19 41  700 1   .01730224 1 1 1
         3 46    . 4   .15106286 1 1 1
         1 32  540 3           . 2 1 0
         8 45  500 6           . 1 1 1
        10 23 1350 4           . 1 1 0
        11 49  800 4           . 1 1 1
         7 39  950 3           . 2 1 1
        16 52  900 2           . 2 1 1
         2 48 2100 6 .0009313878 1 1 1
         1 34 1200 6  .004532125 2 1 1
         1 40  700 3   .00065185 2 1 0
         4 50  770 3   .03750375 2 1 1
         3 52  750 4           . 1 1 1
        20 48 1200 2           . 1 1 1
        16 35 1000 5           . 1 1 0
        13 48    . 1  .006454535 1 1 0
         8 19  250 5 .0011673005 2 1 0
         9 28 1100 3    .4254324 2 1 0
        10 57  900 5           . 2 1 1
         3 44    . 3           . 1 1 1
         7 41  500 5           . 2 1 1
         1 37 1450 3           . 1 1 1
        11 29    . 3           . 1 1 0
        17 46 1000 3           . 1 1 1
        12 37    . 6           . 1 1 0
         5 51 1300 6           . 1 1 1
        19 60 2000 3           . 1 1 1
         3 36 1600 6           . 2 1 0
         2 38    . 5           . 1 1 0
         1 45    . 6           . 1 1 1
         1 52 1700 5 .0011495574 1 1 1
         9 35 1100 3    .1122975 1 1 0
         5 29    . 3  .000961756 2 1 1
         8 21  600 4           . 2 1 0
        11 56  980 4  .002398492 1 1 1
        18 49  900 1           . 1 1 1
         8 46 1100 3           . 1 1 1
         8 48  600 2           . 1 1 1
         1 26 1300 3     .179375 1 1 1
         3 42 1300 3           . 1 1 1
         6 45 1250 5           . 1 1 1
         5 34 1020 3    .0396105 2 1 0
        10 42  700 3           . 2 1 1
         3 45    . 3           . 1 1 1
        15 34  600 3   .01332741 1 1 0
         1 37  350 3           . 2 1 0
         4 20  800 4           . 1 1 0
        18 47    . 3           . 1 1 1
        10 42  700 2           . 1 1 1
         7 33    . 2           . 1 1 1
         1 31 1600 5           . 1 1 1
         3 33    . 5  .000961756 2 1 1
        13 51 1000 5           . 2 1 0
         9 35    . 3           . 1 1 1
        19 42 1000 2   .00519818 1 1 0
        19 48 1000 5           . 1 1 1
         6 41  570 3           . 1 1 1
         3 37 2300 2    .0591716 1 1 1
         2 26 1100 3 .0011673005 1 1 1
        19 49 1540 3           . 1 1 1
        17 48 1400 3           . 1 1 1
         4 30 1200 5           . 2 1 1
        12 62 1080 3           . 1 1 1
         5 33 1300 6           . 2 1 1
        10 35 1400 3           . 1 1 1
         2 28  500 5           . 2 1 1
         4 49 1200 3           . 1 1 1
         3 24 1250 5           . 1 1 0
         3 31 1100 3           . 2 1 0
        12 44 1500 5 .0014688977 2 1 1
        12 58  700 5           . 2 1 0
         1 51  900 3           . 1 1 0
        10 28  800 5  .008932039 2 1 0
         5 45  950 5           . 1 1 1
        16 53  860 3  .009070295 2 1 1
        12 52 1060 2           . 2 1 1
         1 47  920 5  .002398492 1 1 1
         1 37    . 3           . 2 1 1
         1 42 1420 3           . 1 1 1
         4 25 1480 3           . 1 1 0
         3 51 1100 5           . 1 1 1
         9 38    . 1  .005045526 2 1 0
        12 33  600 5 .0008383903 1 1 0
         8 40 1000 1           . 2 1 1
        12 35 1900 6 .0008383903 1 1 0
         1 39 1200 3           . 2 1 0
         9 39 1200 3           . 1 1 1
         5 47  800 5           . 1 1 1
        10 33  950 1    .7132353 1 1 1
         7 57  300 3           . 2 1 0
         8 27 1300 3           . 1 1 1
         6 44  980 6   .04130435 2 1 1
         3 26 1450 5           . 1 1 1
         3 43 1550 5           . 1 1 1
         6 41 1300 3    .6294156 1 1 0
        end
        label values education w_all
        label def w_all 1 "No qualification", modify
        label def w_all 2 "Elementary education", modify
        label def w_all 3 "Middle school education", modify
        label def w_all 4 "Diploma 2-3 years", modify
        label def w_all 5 "Diploma 4-5 years", modify
        label def w_all 6 "Degree", modify
        label values Sesso x_all
        label def x_all 1 "Male", modify
        label def x_all 2 "Female", modify
        label values Condprof u_all
        label def u_all 1 "Occupati", modify


        • #5
          Enrico:
          I would go with this code:
          Code:
          . reg retric proxy1_t c.age##c.age i.education i.Sesso i.Married i.Condprof
          note: 1.Condprof omitted because of collinearity.
          
                Source |       SS           df       MS      Number of obs   =        26
          -------------+----------------------------------   F(10, 15)       =      2.21
                 Model |  3333033.47        10  333303.347   Prob > F        =    0.0806
              Residual |  2265416.53        15  151027.768   R-squared       =    0.5953
          -------------+----------------------------------   Adj R-squared   =    0.3256
                 Total |     5598450        25      223938   Root MSE        =    388.62
          
          ------------------------------------------------------------------------------------------
                            retric | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------------------+----------------------------------------------------------------
                          proxy1_t |   697.0022     488.08     1.43   0.174    -343.3156     1737.32
                               age |    63.0875    80.9472     0.78   0.448    -109.4474    235.6224
                                   |
                       c.age#c.age |  -.7893141   1.092342    -0.72   0.481    -3.117586    1.538957
                                   |
                         education |
             Elementary education  |   1227.902   439.6489     2.79   0.014     290.8128    2164.992
          Middle school education  |   732.7444   360.9857     2.03   0.060    -36.67831    1502.167
                Diploma 2-3 years  |   591.3479    660.891     0.89   0.385    -817.3079    2000.004
                Diploma 4-5 years  |   813.5435     413.45     1.97   0.068    -67.70436    1694.791
                           Degree  |   1214.937   392.6527     3.09   0.007     378.0176    2051.856
                                   |
                             Sesso |
                           Female  |  -295.5228    173.048    -1.71   0.108    -664.3659    73.32039
                         1.Married |   370.4006   200.3919     1.85   0.084    -56.72469    797.5258
                                   |
                          Condprof |
                         Occupati  |          0  (omitted)
                             _cons |  -1041.031   1600.838    -0.65   0.525    -4453.135    2371.073
          ------------------------------------------------------------------------------------------
          
          .
          And then run the usual postestimation tests:

          checking for heteroskedasticity
          Code:
          . estat hettest
          
          Breusch–Pagan/Cook–Weisberg test for heteroskedasticity
          Assumption: Normal error terms
          Variable: Fitted values of retric
          
          H0: Constant variance
          
              chi2(1) =   1.06
          Prob > chi2 = 0.3033
          
          .
          checking for misspecification of the functional form of the regressand:
          Code:
          . predict fitted, xb
          (69 missing values generated)
          
          . g sq_fitted=fitted^2
          (69 missing values generated)
          
          . reg retric proxy1_t c.age##c.age i.education i.Sesso i.Married i.Condprof fitted sq_fitted
          note: c.age#c.age omitted because of collinearity.
          note: 1.Condprof omitted because of collinearity.
          
                Source |       SS           df       MS      Number of obs   =        26
          -------------+----------------------------------   F(11, 14)       =      2.32
                 Model |  3617104.54        11  328827.685   Prob > F        =    0.0699
              Residual |  1981345.46        14  141524.676   R-squared       =    0.6461
          -------------+----------------------------------   Adj R-squared   =    0.3680
                 Total |     5598450        25      223938   Root MSE        =     376.2
          
          ------------------------------------------------------------------------------------------
                            retric | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------------------+----------------------------------------------------------------
                          proxy1_t |  -419.2607   1096.421    -0.38   0.708     -2770.85    1932.328
                               age |   .3840776   12.10963     0.03   0.975     -25.5885    26.35666
                                   |
                       c.age#c.age |          0  (omitted)
                                   |
                         education |
             Elementary education  |  -1087.086   1840.533    -0.59   0.564    -5034.636    2860.463
          Middle school education  |  -384.1059   997.6089    -0.39   0.706    -2523.764    1755.552
                Diploma 2-3 years  |  -171.2475    680.766    -0.25   0.805    -1631.345     1288.85
                Diploma 4-5 years  |  -474.3915   1073.208    -0.44   0.665    -2776.194    1827.411
                           Degree  |  -923.6652   1761.888    -0.52   0.608     -4702.54     2855.21
                                   |
                             Sesso |
                           Female  |   176.0326   466.0397     0.38   0.711    -823.5232    1175.588
                         1.Married |  -235.1549   496.5516    -0.47   0.643    -1300.152    829.8422
                                   |
                          Condprof |
                         Occupati  |          0  (omitted)
                            fitted |  -.3248515   1.633758    -0.20   0.845    -3.828914    3.179211
                         sq_fitted |   .0009455   .0006674     1.42   0.178    -.0004859    .0023769
                             _cons |    776.184   734.3992     1.06   0.308    -798.9458    2351.314
          ------------------------------------------------------------------------------------------
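
In the augmented regression, the fitted values are an exact linear combination of the original regressors, which is why Stata drops a term for collinearity. As a hedged alternative, Stata ships two postestimation commands for the same purpose: -estat ovtest- (Ramsey RESET, based on powers of the fitted values) and -linktest- (which regresses the outcome on the prediction and its square only). The sketch below uses the bundled nlsw88 data, so the variable names are stand-ins for those in the thread.

```stata
* Illustration on the bundled nlsw88 data, not the poster's dataset
sysuse nlsw88, clear

regress wage ttl_exp c.age##c.age i.race i.married

* Ramsey RESET: H0 is that the model has no omitted powers of the fitted values
estat ovtest

* Link test: a significant _hatsq points to a misspecified functional form
linktest
```

Both commands are run here after -regress- with default standard errors.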
          As usual, this kind of research may suffer from a source of endogeneity (a latent variable) embedded in the residuals, as your predictors do not capture inter-individual heterogeneity. Other things being equal, smarter people on average obtain higher educational degrees (predictor) and negotiate better wages (your regressand). I would discuss this issue with your supervisor/teacher/mentor.
          An example of this source of endogeneity is reported and fixed in
          https://www.stata.com/bookstore/microeconometrics-stata, pages 177-209.
          Kind regards,
          Carlo
          (StataNow 18.5)


          • #6
            Hi Carlo, thank you for the suggestion. I will look more closely at the issue of endogeneity, as you suggested. Thanks!


            • #7
              Hi, why did you use the interaction of age with itself in the model, rather than age alone?
              c.age##c.age


              • #8
                Enrico:
                because the original idea was to search for potential turning points (i.e., a quadratic relationship between -age- and the regressand), which do not seem to be present in the example based on your excerpt.
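
To make the turning-point idea concrete: with coefficients b1 on age and b2 on age squared, the fitted curve peaks or bottoms out at age = -b1/(2*b2). The following is only a sketch on the bundled nlsw88 data (variable names are stand-ins), showing how that point and its delta-method standard error can be obtained, and how the two age terms can be tested jointly.

```stata
* Illustration on the bundled nlsw88 data, not the poster's dataset
sysuse nlsw88, clear

regress wage c.age##c.age i.race i.married

* Turning point of b1*age + b2*age^2, with a delta-method standard error
nlcom -_b[age] / (2 * _b[c.age#c.age])

* Joint test of the linear and squared age terms
test age c.age#c.age

* Predicted wage at a few ages, chosen to fall within the ages in nlsw88
margins, at(age = (34(3)46))
```

A turning point that falls far outside the observed age range, or a jointly insignificant pair of terms, would support dropping the squared term.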
                Kind regards,
                Carlo
                (StataNow 18.5)


                • #9
                  Thank you, Carlo. Very helpful. Actually, when I also include the linear prediction and its square in the regression, all the other variables lose significance.
                  You estimate the model with these variables to check for functional form misspecification, and if the model is properly specified, the linear prediction and its square should not be significant, right? However, I find it strange that when I run the regression controlling for misspecification, variables like education or sex are no longer significant, when it is commonly believed that they have an impact on salary. In interpreting the results, must I rely on statistical validity only, or can I believe that, even if the model suffers from some form of bias, it nevertheless goes in the correct direction if it estimates that education has a positive effect and sex a negative one?


                  • #10
                    Enrico:
                    1) as you do not share what you typed and what Stata gave you back, it is difficult to say. As a general rule, if the squared term is not statistically significant, you can get rid of it and re-run the model with the linear term only;
                    2) you can also check for potential misspecification of your model via a restricted regression:
                    Code:
                    reg retric  fitted sq_fitted
                    That said, the recommendation to test for endogeneity due to a latent variable still holds. Possible instruments are the father's and/or mother's education level.
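
What such an instrumental-variable check could look like is sketched below on simulated data. Everything in it is hypothetical: ability, father_educ, mother_educ, educ_years and ln_wage are invented for the illustration and do not exist in the poster's dataset, and whether parental education is a valid instrument in real data is an assumption that has to be argued separately.

```stata
* Simulated data only: every variable below is hypothetical
clear
set seed 12345
set obs 2000
generate ability     = rnormal()            // latent, unobserved in real data
generate father_educ = rnormal(10, 3)       // hypothetical instrument
generate mother_educ = rnormal(10, 3)       // hypothetical instrument
generate educ_years  = 4 + 0.3*father_educ + 0.3*mother_educ + 2*ability + rnormal()
generate ln_wage     = 1 + 0.08*educ_years + 0.5*ability + rnormal(0, 0.5)

* OLS leaves ability in the error term, so the education coefficient absorbs it
regress ln_wage educ_years

* 2SLS with parental education as excluded instruments
ivregress 2sls ln_wage (educ_years = father_educ mother_educ), vce(robust)

estat firststage     // strength of the instruments
estat endogenous     // H0: educ_years is exogenous
estat overid         // overidentifying restrictions (two instruments, one endogenous regressor)
```

In the simulation the true education coefficient is 0.08, which gives a reference point for comparing the OLS and 2SLS estimates.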
                    Last edited by Carlo Lazzaro; 20 Jan 2022, 06:51.
                    Kind regards,
                    Carlo
                    (StataNow 18.5)


                    • #11
                      Hi Carlo, here is my regression:
                      Code:
                      *5) regression log wage robust standard error with new proxy1_t
                      quietly reg In_retric newproxy1_t c.age##c.age  i.education i.sex i.Married  i.year if working==1
                       
                      estat hettest
                       
                      eststo: quietly reg In_retric newproxy1_t c.age##c.age i.education i.sex i.Married  i.year if working==1, vce(robust)
                       
                      predict fittednew, xb
                      g sq_fittednew=fittednew^2
                      
                      eststo: quietly reg In_retric newproxy1_t c.age##c.age  i.education i.sex i.Married  i.year fittednew sq_fittednew if working==1, vce(robust)
                      esttab, p compress label nobaselevels interaction(" X ")
                      
                      
                      *6) regression of log wage with clustered standard errors, new proxy1_t
                      
                      quietly reg In_retric newproxy1_t c.age##c.age  i.education i.sex i.Married  i.year if working==1
                       
                      estat hettest
                       
                      eststo: quietly reg In_retric newproxy1_t c.age##c.age i.education i.sex i.Married  i.year if working==1, vce(cluster Countryoforign)
                       
                      predict fittednew2, xb
                      g sq_fittednew2=fittednew2^2
                       
                      eststo: quietly reg In_retric newproxy1_t c.age##c.age i.education i.sex i.Married  i.year fittednew2 sq_fittednew2 if working==1, vce(cluster Countryoforign)
                       
                      esttab, p compress label nobaselevels interaction(" X ")
                      and the results are:
                      Code:
                      . esttab, p compress label nobaselevels interaction(" X ")
                      
                      --------------------------------------------------------------------
                                             (1)          (2)          (3)          (4)  
                                       In_retric    In_retric    In_retric    In_retric  
                      --------------------------------------------------------------------
                      newproxy1_t         -0.346***   0.00417       -0.346***   0.00417  
                                         (0.000)      (0.922)      (0.000)      (0.959)  
                      
                      ETAM                0.0370***  0.000180       0.0370***  0.000180  
                                         (0.000)      (0.469)      (0.000)      (0.783)  
                      
                      ETAM X ETAM      -0.000319***         0    -0.000319***         0  
                                         (0.000)          (.)      (0.000)          (.)  
                      
                      Elementary edu~n    0.0602**   -0.00301       0.0602     -0.00301  
                                         (0.008)      (0.896)      (0.196)      (0.941)  
                      
                      Middle school ~n     0.265*** -0.000776        0.265*** -0.000776  
                                         (0.000)      (0.973)      (0.000)      (0.983)  
                      
                      Diploma 2-3 ye~s     0.398***   0.00161        0.398***   0.00161  
                                         (0.000)      (0.946)      (0.000)      (0.966)  
                      
                      Diploma 4-5 ye~s     0.464***   0.00313        0.464***   0.00313  
                                         (0.000)      (0.898)      (0.000)      (0.927)  
                      
                      Degree               0.677***    0.0101        0.677***    0.0101  
                                         (0.000)      (0.713)      (0.000)      (0.719)  
                      
                      Female              -0.299***  -0.00695       -0.299***  -0.00695  
                                         (0.000)      (0.369)      (0.000)      (0.822)  
                      
                      Married=1           0.0330***   0.00103       0.0330***   0.00103  
                                         (0.000)      (0.696)      (0.000)      (0.879)  
                      
                      ANNO=2015           0.0111**   0.000236       0.0111***  0.000236  
                                         (0.002)      (0.947)      (0.000)      (0.894)  
                      
                      ANNO=2016           0.0180***  0.000342       0.0180***  0.000342  
                                         (0.000)      (0.924)      (0.000)      (0.719)  
                      
                      ANNO=2017           0.0232***  0.000393       0.0232***  0.000393  
                                         (0.000)      (0.913)      (0.000)      (0.921)  
                      
                      ANNO=2018           0.0251***  0.000452       0.0251***  0.000452  
                                         (0.000)      (0.900)      (0.000)      (0.924)  
                      
                      ANNO=2019           0.0367***  0.000671       0.0367***  0.000671  
                                         (0.000)      (0.851)      (0.000)      (0.906)  
                      
                      ANNO=2020           0.0462***  0.000946       0.0462***  0.000946  
                                         (0.000)      (0.795)      (0.000)      (0.856)  
                      
                      Linear predict~n                  1.672***                          
                                                      (0.000)                            
                      
                      sq_fittednew                    -0.0489**                          
                                                      (0.002)                            
                      
                      Linear predict~n                                            1.672*  
                                                                                (0.010)  
                      
                      sq_fittednew2                                             -0.0489  
                                                                                (0.235)  
                      
                      Constant             5.799***    -2.316**      5.799***    -2.316  
                                         (0.000)      (0.003)      (0.000)      (0.351)  
                      --------------------------------------------------------------------
                      Observations        172695       172695       172695       172695  
                      --------------------------------------------------------------------
                      p-values in parentheses
                      * p<0.05, ** p<0.01, *** p<0.001
                      When I run the model using robust standard errors, the squared fitted values are significant, but not when I use clustered standard errors. In both cases, however, all the other coefficients lose their significance when I add the fitted values and their squares. I was wondering whether, in discussing the results, I can report as having a significant effect the variables that are significant in the first and third columns, or not, because the model suffers from misspecification.
                      Last edited by Enrico Azzini; 20 Jan 2022, 07:48.


                      • #12
                        I can't control for endogeneity using an instrumental variable. Can I use the command eteffects instead?
                        Thanks for your help!


                        • #13
                          Enrico:
                          1) you do not have to include -fitted- and -sq_fitted- in the regressions you discuss; they're simply used to test for possible misspecification of the functional form of the regressand (in brief, if there's evidence of a non-linear relationship between the regressand and -sq_fitted-, some predictor and/or interaction is missing from the right-hand side of your regression equation);
                          2) please note that, unlike -xtreg-, -regress- options for non-default standard errors deal with heteroskedasticity (-robust-) and serial correlation of the residuals (-vce(cluster clusterid)-), respectively. Put differently, they are not interchangeable.
                          3) I'm not familiar with -eteffects- hence I cannot advise you on that.
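
For readers who want to see how the two standard-error options behave, here is a small sketch on the bundled nlsw88 data (variable names are stand-ins, and industry is used as the cluster variable purely for illustration). Both fits use the same sample, so the point estimates are identical and only the standard errors change. vce(robust) requests heteroskedasticity-robust standard errors; vce(cluster ...) additionally allows the errors to be correlated within each cluster.

```stata
* Illustration on the bundled nlsw88 data, not the poster's dataset
sysuse nlsw88, clear

* Same sample in both fits: drop observations with a missing cluster variable
regress wage ttl_exp i.race i.married if !missing(industry), vce(robust)
estimates store robust_se

regress wage ttl_exp i.race i.married if !missing(industry), vce(cluster industry)
estimates store cluster_se

* Identical coefficients, different standard errors
estimates table robust_se cluster_se, b(%9.3f) se(%9.3f) stats(N)
```

With only a handful of clusters, as here, cluster-robust standard errors can be unreliable, so the comparison is illustrative rather than a recommendation.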
                          Kind regards,
                          Carlo
                          (StataNow 18.5)


                          • #14
                            Hi Carlo, thanks for your help. With respect to point 2), I read that, with reg, vce(cluster clusterid) can be used to deal with heteroskedasticity and serial correlation at the same time.
                            I would like to ask whether it is possible to use xtset to declare the data as panel data without the time dimension. My units of observation are individuals who were each interviewed only once between 2014 and 2020.
                            I had thought of setting nationality as the panel variable.


                            • #15
                              Enrico:
                              if, as it seems, you have cross-sectional data:
                              1) the recommended approach (see the really valuable
                              https://www.stata.com/bookstore/environmental-econometrics-using-stata,
                              page 28) for dealing with both heteroskedasticity and autocorrelation is switching from -regress- to -newey- (assuming that your data are cross-sectional and your regressand is continuous);
                              2) while it's absolutely legal to -xtset- a panel dataset with -panelid- only, I fail to see what you would gain from following this approach with cross-sectional data.
                              Kind regards,
                              Carlo
                              (StataNow 18.5)
