How to include two dummy variables with intercept in cross-sectional regression for panel data

Sabrina Gong

Join Date: Nov 2014

Posts: 19
#1

17 Nov 2014, 15:21

Hello, I am new to Stata and I know my question may be simple. However, I have tried almost everything I could find on Google and still can't figure it out.

I have panel data with firm-year observations. I want to apply the Fama-MacBeth method, which is to run a cross-sectional regression each year and then average the coefficients over time.

I want to include two dummy variables: one for whether company i pays dividends in year t, and the other for whether the company is regulated in year t.

When I run the regression for year t, using code of the form reg depvar indepvars, one dummy is always omitted because of multicollinearity. However, I want to see the effect of both dummy variables, and I want to include an intercept in the regression.

Is there any way to fix this? I tried xi and areg, but neither worked. I don't want fixed effects in the regression, so I can't find any way to include the two dummies and an intercept in one regression. Help from anyone will be greatly appreciated.

Thank you
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#2

17 Nov 2014, 16:10

I think that in order to get a useful answer you will need to show us the exact code you ran and the exact response you got from Stata. Please post those in a code block. [Click on the underlined A button, then click on the # button. A pair of code-block delimiters will appear. Copy your commands and output from the Stata Results window and paste it between those delimiters. Please do not retype your code: copy and paste it to avoid introducing errors that might look trivial but could be important.] And, in addition to the code and output you have already run, after your regression, please also run the command -tab dummy1 dummy2 if e(sample)- and include that and the resulting output.

There is no reason you cannot have two dummy variables and a constant term in a regression: it's done all the time. So either you are not generating the dummies correctly, or there is something in the data that leads to collinearity among these three terms. To see what's going on we will need to see, at the least, your code and Stata's response.

Sabrina Gong

Join Date: Nov 2014
Posts: 19
#3

19 Nov 2014, 13:35

Code:

 reg logcr mb logaat cfat nwat ceat is lat rs dd reg_d if fyear==1973,robust
note: reg_d omitted because of collinearity

Linear regression                                      Number of obs =      26
                                                       F(  9,    16) =    4.07
                                                       Prob > F      =  0.0071
                                                       R-squared     =  0.3594
                                                       Root MSE      =  1.0996

------------------------------------------------------------------------------
             |               Robust
       logcr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          mb |    .466772   .3095702     1.51   0.151    -.1894874    1.123032
      logaat |   -.284593   .1925173    -1.48   0.159    -.6927115    .1235254
        cfat |  -1.736845   2.636832    -0.66   0.519     -7.32668    3.852989
        nwat |  -.7571572   1.487423    -0.51   0.618    -3.910354     2.39604
        ceat |   .7957093   1.867384     0.43   0.676    -3.162967    4.754386
          is |   .0214257   .0373236     0.57   0.574    -.0576969    .1005482
         lat |  -1.437894   1.806325    -0.80   0.438    -5.267132    2.391343
          rs |  -2.045013   12.55942    -0.16   0.873     -28.6698    24.57978
          dd |   .7462403   .7127653     1.05   0.311    -.7647546    2.257235
       reg_d |          0  (omitted)
       _cons |   -2.03492   1.256298    -1.62   0.125    -4.698153    .6283125
------------------------------------------------------------------------------

. 
end of do-file

. do "C:\Users\Owner\AppData\Local\Temp\STD00000000.tmp"

. tab dd reg_d if e(logcr)

           |         reg_d
        dd |         0          1 |     Total
-----------+----------------------+----------
         0 |    65,206        294 |    65,500 
         1 |    13,145        575 |    13,720 
-----------+----------------------+----------
     Total |    78,351        869 |    79,220

Last edited by Sabrina Gong; 19 Nov 2014, 13:38.


Sabrina Gong

Join Date: Nov 2014

Posts: 19
#4

19 Nov 2014, 13:42

Originally posted by Clyde Schechter:

There is no reason you cannot have two dummy variables and a constant term in a regression: it's done all the time. So either you are not generating the dummies correctly, or there is something in the data that leads to collinearity among these three terms. To see what's going on we will need to see, at the least, your code and Stata's response.

Thanks a lot for the reply!! I have attached the code and output from the simple regression I tried. To generate the dummies, I simply use

Code:

gen reg_d=1 if sic=="4011" & fyear<1980
replace reg_d=1 if sic=="4210" & fyear<1980
**generate dividend payout dummy
gen dd=1 if !missing(dvc) & dvc!=0
replace dd=0 if missing(dd)

I really couldn't figure out why the problem happened. Your help will be greatly appreciated.
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#5

19 Nov 2014, 13:50

You're running the regression for a small subsample of your data (only 26 observations are included in the regression). You need to look at the relationship between reg_d and your other covariates for that particular sample.

You should also confirm that you expect a regression for a single year to only contain 26 observations. You may have missing data decreasing your sample size further than you expect.
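
One way to check this is sketched below. It assumes the data from #3 are in memory with the variable names used in that regression; n_miss is a made-up helper name, not something from the original do-file.

```stata
* Assumes the firm-year data from #3 are in memory.
* Which regression variables have missing values in 1973, and how many?
misstable summarize logcr mb logaat cfat nwat ceat is lat rs dd reg_d if fyear == 1973

* n_miss (hypothetical name) counts the missing regression variables per observation.
egen n_miss = rowmiss(logcr mb logaat cfat nwat ceat is lat rs dd reg_d)

* Only observations with n_miss == 0 can enter the regression.
count if fyear == 1973
count if fyear == 1973 & n_miss == 0
```

The second count should match the "Number of obs" that -regress- reports for that year. If it is far below the first count, the -misstable- output shows which variables are responsible.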
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#6

19 Nov 2014, 13:52

Your -tab dd reg_d if e(logcr)- is not what I asked for, but it actually sheds some additional light on the problem.

What I did ask for is -tab dd reg_d if e(sample)-, which would have shown the cross-tabulation between dd and reg_d restricted to the observations that participated in the regression. e(logcr) doesn't actually exist, so it is evaluated as missing, which in Stata translates to Boolean true. So your cross tab shows the relationship of dd and reg_d in your entire data set. It is striking that your entire data set contains 79,220 observations, yet your regression analysis sample is only 26 observations. Given how rare dd=1 and reg_d = 1 both are in your full data, it would not be surprising at all to find that when restricted to just the 26 in your regression reg_d never varies (it is always 0 or always 1), or, alternatively, that reg_d is always equal to dd. Either of those conditions would lead to reg_d being dropped for collinearity (the first case being collinearity with _cons, and the second being collinearity with dd).
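
To see the difference between the two qualifiers, here is a small illustration on the auto data that ships with Stata (not the data in this thread). e(nosuchresult) is a deliberately nonexistent name standing in for e(logcr).

```stata
* Illustration on Stata's shipped auto data, not the data from this thread.
sysuse auto, clear

* rep78 has missing values, so -regress- drops some cars.
regress price mpg rep78

* A nonexistent e() result evaluates to missing and displays as a period.
display e(nosuchresult)

* Missing is nonzero, hence true: this counts every observation in memory.
count if e(nosuchresult)

* This counts only the observations actually used by -regress-.
count if e(sample)
```

The last two counts differ whenever the regression dropped observations, which is why the cross-tabulation has to be qualified with e(sample).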

Anyway, I suggest you run -tab dd reg_d if e(sample)- to get the actual results. If that doesn't make the problem clear, then the next step is to run

Code:

regress reg_d mb logaat cfat nwat ceat is lat rs dd if e(sample)

Since reg_d is known to be collinear with these variables, the output of this regression will show you exactly which variable(s) it is collinear with, and exactly what the offending linear combination is. At that point you can decide to remove one of the involved variables to break the collinearity, or you can conclude that there is something wrong with your data and investigate fixing it, or you can conclude that this is just a fluke that happened to occur in this very tiny subsample of your data and is nothing to worry about.
Sabrina Gong

Join Date: Nov 2014

Posts: 19
#7

19 Nov 2014, 13:59

Originally posted by Sarah Edgington:

You're running the regression for a small subsample of your data (only 26 observations are included in the regression). You need to look at the relationship between reg_d and your other covariates for that particular sample.

You should also confirm that you expect a regression for a single year to only contain 26 observations. You may have missing data decreasing your sample size further than you expect.

Thanks a lot. It's actually a large sample covering 1972-1994. However, since I am using the Fama-MacBeth method, I am hoping to get 24 cross-sectional regressions. The sample size in 1973 is 638. Please see:

Code:

. count if fyear==1973
  638

I wonder whether Stata dropped some observations because one of the independent variables may be missing. There are some missing values, and I can't simply set them to 0.

Sabrina Gong

Join Date: Nov 2014
Posts: 19
#8

19 Nov 2014, 14:11

Originally posted by Clyde Schechter:

Since reg_d is known to be collinear with these variables, the output of this regression will show you exactly which variable(s) it is collinear with, and exactly what the offending linear combination is. At that point you can decide to remove one of the involved variables to break the collinearity, or you can conclude that there is something wrong with your data and investigate fixing it, or you can conclude that this is just a fluke that happened to occur in this very tiny subsample of your data and is nothing to worry about.

Thank you very much, Clyde. I really appreciate your detailed reply.

Yes, the sample is actually large, with 79,220 firm-year observations. However, what I am doing is running cross-sectional regressions by year and then averaging the coefficients, basically the Fama-MacBeth (1973) approach. I do have 638 observations in 1973, but some variables, like cfat or nwat, may be missing, so I assume Stata dropped every observation in which any one of the variables is missing.
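
The by-year procedure described here could be sketched with -statsby- roughly as follows. This is only an outline under assumptions: the data from #3 are in memory, no Newey-West or other adjustment is applied to the time series of coefficients, and a coefficient omitted for collinearity is assumed to be recorded as 0, which should be verified in the listing before relying on it.

```stata
* Assumes the firm-year data from #3 are in memory.
preserve

* One cross-sectional regression per year; keep only the coefficients.
statsby _b, by(fyear) clear: regress logcr mb logaat cfat nwat ceat is lat rs dd reg_d

* Assumption: a coefficient omitted for collinearity is recorded as 0.
* Inspect the collected values first, then exclude those years from the average.
list fyear _b_dd _b_reg_d
replace _b_reg_d = . if _b_reg_d == 0

* Time-series mean of each yearly coefficient, with its t statistic.
ttest _b_dd == 0
ttest _b_reg_d == 0

restore
```

Years in which reg_d does not vary contribute nothing to the average for reg_d, so its mean rests on fewer years than the means of the other coefficients.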

If I understand you correctly, reg_d as I set it up equals 1 only before 1980, so from 1980-1994 reg_d=0, which causes the multicollinearity problem with the constant. That's why Stata dropped it. However, in some years reg_d=1 does occur. Please see the code and the result when I ran it again on the 1981 observations:

Code:

 reg logcr mb logaat cfat nwat ceat is lat rs dd reg_d if fyear==1981,robust

Linear regression                                      Number of obs =     864
                                                       F( 10,   853) =   24.27
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.2398
                                                       Root MSE      =  1.6164

------------------------------------------------------------------------------
             |               Robust
       logcr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          mb |   .1105858    .015035     7.36   0.000      .081076    .1400957
      logaat |  -.2749471   .0473331    -5.81   0.000    -.3678501   -.1820441
        cfat |   .1195163   .1497906     0.80   0.425    -.1744851    .4135176
        nwat |   .6455212   .1263569     5.11   0.000     .3975143    .8935281
        ceat |   2.383476   .3612025     6.60   0.000     1.674526    3.092425
          is |   .0099727   .0055972     1.78   0.075    -.0010131    .0209586
         lat |   .1614295     .38309     0.42   0.674    -.5904799    .9133389
          rs |   .0181547   .0123518     1.47   0.142    -.0060888    .0423981
          dd |   .3676682   .1375784     2.67   0.008     .0976363    .6377001
       reg_d |   1.100781   .7298431     1.51   0.132     -.331718     2.53328
       _cons |   -2.63111   .1701292   -15.47   0.000    -2.965031   -2.297189
------------------------------------------------------------------------------

. 
end of do-file

. do "C:\Users\Owner\AppData\Local\Temp\STD00000000.tmp"

. tab dd reg_d if e(sample)

           |         reg_d
        dd |         0          1 |     Total
-----------+----------------------+----------
         0 |       680          2 |       682 
         1 |       181          1 |       182 
-----------+----------------------+----------
     Total |       861          3 |       864 


. 
end of do-file

.

It seems to work now. However, I am replicating a paper in which the author did get the coefficient... I really can't figure it out.

Thank you so much.

Last edited by Sabrina Gong; 19 Nov 2014, 14:14.


Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#9

19 Nov 2014, 14:24

Something is amiss. In #4, you said you created reg_d with the code:

Code:

gen reg_d=1 if sic=="4011" & fyear<1980
replace reg_d=1 if sic=="4210" & fyear<1980

But there must be more to it, because in that case, reg_d would always be either 1 or missing. There is nothing here that leads to zero values for reg_d. So there must be additional code that modifies reg_d before you get to your regressions. Similarly, you would not be able to obtain:

Code:

. count if reg_d==1 & fyear==1981
   93

based only on the two commands you showed us in #4, since reg_d must be missing when fyear == 1981 according to #4. So somewhere, something is changing the values of reg_d from those initial values.

Now in #8 you show us that a regression done subject to the qualification -if fyear==1994- again drops reg_d for collinearity. This result is entirely consistent with the subsequent tabulation showing that reg_d is always zero in this estimation sample. This time, however, your estimation sample is reasonably hefty, so it is unlikely to be a fluke due to small subsample size.

So the issue boils down entirely to this: you need to be clear on how reg_d is calculated. The two lines posted in #4 are not the whole story of that variable's creation. You need to review every line of code that might change the values of reg_d until you understand what is going on with this variable. The regressions are behaving in accordance with the actual values of reg_d, but it appears the values of reg_d are not what you expect them to be.
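
A quick way to compare the actual values of reg_d with what you expect, year by year, might look like this. It assumes the data are in memory with the variable names from the regressions above; n_incomplete is a hypothetical helper name.

```stata
* Assumes the firm-year data are in memory; n_incomplete is a hypothetical helper.
egen n_incomplete = rowmiss(logcr mb logaat cfat nwat ceat is lat rs dd reg_d)

* Year-by-year split of reg_d among the observations a regression can use.
* A year with a zero in either column will have reg_d omitted.
tabulate fyear reg_d if n_incomplete == 0

* Which SIC codes and years actually carry reg_d == 1?
tabulate sic fyear if reg_d == 1
```

If the second table shows SIC codes or years that the posted generate/replace lines could not have produced, some other line of the do-file is changing reg_d.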

Sabrina Gong

Join Date: Nov 2014
Posts: 19

#10

19 Nov 2014, 15:20

Originally posted by Clyde Schechter:

Something is amiss. In #4, you said you created reg_d with the code:

So the issue boils down entirely to this: you need to be clear on how reg_d is calculated. The two lines posted in #4 are not the whole story of that variable's creation. You need to review every line of code that might change the values of reg_d until you understand what is going on with this variable. The regressions are behaving in accordance with the actual values of reg_d, but it appears the values of reg_d are not what you expect them to be.

Yes you are right. Sorry I missed something when pasting my code. The code to generate reg_d should be

Code:

gen reg_d=1 if sic=="4011" & fyear<1980
replace reg_d=1 if sic=="4210" & fyear<1980
replace reg_d=1 if sic=="4213" & fyear<1980
replace reg_d=1 if sic=="4512" & fyear<1978
replace reg_d=1 if sic=="4812" & fyear<1982
replace reg_d=1 if sic=="4813" & fyear<1982
replace reg_d=0 if missing(reg_d)
 codebook reg_d

-----------------------------------------------------------------------------------------------------------------
reg_d                                                                                                 (unlabeled)
-----------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/79220

            tabulation:  Freq.  Value
                         78351  0
                           869  1

. 
end of do-file
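
As a side note, the same dummies could probably be built in one step each, since a true/false expression in Stata evaluates to 1/0. The sketch below assumes sic is a string variable and dvc is numeric, as the code above implies. The names reg_d_alt and dd_alt are made up so that the original variables are left untouched, and the /// line continuation only works inside a do-file.

```stata
* reg_d_alt and dd_alt are hypothetical names; the originals are not modified.
* A true/false expression evaluates to 1/0, so no observation is left missing.
generate byte reg_d_alt = (inlist(sic, "4011", "4210", "4213") & fyear < 1980) ///
    | (sic == "4512" & fyear < 1978) ///
    | (inlist(sic, "4812", "4813") & fyear < 1982)

generate byte dd_alt = !missing(dvc) & dvc != 0

* Both should reproduce the variables built with gen/replace above.
assert reg_d_alt == reg_d
assert dd_alt == dd
```

Writing the definition as a single expression makes it harder to paste or run only part of it, which is what caused the confusion in #4.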


Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#11

19 Nov 2014, 15:31

Thanks, that makes more sense.
Sabrina Gong

Join Date: Nov 2014

Posts: 19
#12

21 Nov 2014, 10:57

Originally posted by Clyde Schechter:

Thanks, that makes more sense.

Thank you for helping me to solve the problem.