  • How to include two dummy variables with intercept in cross-sectional regression for panel data


    Hello, I am new to Stata and I know my question may be simple. However, I have tried almost everything I could find on Google and still couldn't figure it out.

    I have panel data with firm-year observations. I want to apply the Fama-MacBeth method, which is to run a cross-sectional regression for each year and then average the coefficients over time.

    I want to include two dummy variables: one indicates whether company i pays dividends in year t, and the other indicates whether the company is regulated in year t.

    When I run the regression for year t, using code of the form reg depvar indepvars, one dummy is always omitted because of multicollinearity. However, I want to see the effect of both dummy variables, and I want to include an intercept in the regression.

    Is there any way to fix this? I tried xi and areg, but neither worked. I don't want fixed effects in the regression, so I couldn't find any way to include the two dummies and an intercept in one regression. Help from anyone will be greatly appreciated.

    Thank you
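
The two-step procedure described in the question (one cross-sectional regression per year, then an average of the yearly coefficients) might be sketched as follows. This is a minimal illustration on simulated data, not the poster's code: the variables y, x, d1 and d2 and the ten-year panel are assumptions made up for the example, and the standard error shown is the plain Fama-MacBeth one (standard deviation of the yearly coefficients divided by the square root of the number of years).

```stata
* Simulated firm-year data (hypothetical names, not the poster's variables)
clear
set seed 12345
set obs 2000
gen fyear = 1972 + mod(_n, 10)
gen x = rnormal()
gen byte d1 = runiform() < 0.3
gen byte d2 = runiform() < 0.1
gen y = 1 + 0.5*x + 0.3*d1 + 0.8*d2 + rnormal()

* Step 1: one cross-sectional regression per year.
* statsby replaces the data with one row of coefficients per year.
statsby _b, by(fyear) clear: regress y x d1 d2

* Step 2: average each coefficient over the years and compute the
* Fama-MacBeth standard error, sd / sqrt(number of years).
foreach v of varlist _b_* {
    quietly summarize `v'
    display "`v'" _col(12) "mean = " %8.4f r(mean) "   FM se = " %8.4f r(sd)/sqrt(r(N))
}
```

One caveat worth checking on real data: in a year where a dummy is omitted for collinearity, its stored coefficient is likely to be zero rather than missing, so such years may need to be excluded before averaging.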

  • #2
    I think that in order to get a useful answer you will need to show us the exact code you ran and the exact response you got from Stata. Please post those in a code block. [Click on the underlined A button, then click on the # button. A pair of code-block delimiters will appear. Copy your commands and output from the Stata Results window and paste them between those delimiters. Please do not retype your code: copy and paste it to avoid introducing errors that might look trivial but could be important.] And, in addition to the code and output you have already run, after your regression, please also run the command -tab dummy1 dummy2 if e(sample)- and include that and the resulting output.

    There is no reason you cannot have two dummy variables and a constant term in a regression: it's done all the time. So either you are not generating the dummies correctly, or there is something in the data that leads to collinearity among these three terms. To see what's going on we will need to see, at the least, your code and Stata's response.
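
As a small illustration of that point, the sketch below fits a regression with two dummies and a constant on Stata's shipped auto dataset. The dummy heavy and its 3,000 lb cutoff are assumptions invented for the example; nothing here comes from the poster's data.

```stata
* Two dummy variables plus a constant, using the auto dataset that ships with Stata
sysuse auto, clear

* heavy is a hypothetical second dummy; foreign is already in the data
gen byte heavy = weight > 3000 if !missing(weight)

* Both dummies and _cons are estimated as long as neither dummy is constant
* in the estimation sample and the two are not identical there
regress price mpg foreign heavy

* Cross-tabulate the dummies within the estimation sample
tab foreign heavy if e(sample)
```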



    • #3
      Code:
       reg logcr mb logaat cfat nwat ceat is lat rs dd reg_d if fyear==1973,robust
      note: reg_d omitted because of collinearity
      
      Linear regression                                      Number of obs =      26
                                                             F(  9,    16) =    4.07
                                                             Prob > F      =  0.0071
                                                             R-squared     =  0.3594
                                                             Root MSE      =  1.0996
      
      ------------------------------------------------------------------------------
                   |               Robust
             logcr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
                mb |    .466772   .3095702     1.51   0.151    -.1894874    1.123032
            logaat |   -.284593   .1925173    -1.48   0.159    -.6927115    .1235254
              cfat |  -1.736845   2.636832    -0.66   0.519     -7.32668    3.852989
              nwat |  -.7571572   1.487423    -0.51   0.618    -3.910354     2.39604
              ceat |   .7957093   1.867384     0.43   0.676    -3.162967    4.754386
                is |   .0214257   .0373236     0.57   0.574    -.0576969    .1005482
               lat |  -1.437894   1.806325    -0.80   0.438    -5.267132    2.391343
                rs |  -2.045013   12.55942    -0.16   0.873     -28.6698    24.57978
                dd |   .7462403   .7127653     1.05   0.311    -.7647546    2.257235
             reg_d |          0  (omitted)
             _cons |   -2.03492   1.256298    -1.62   0.125    -4.698153    .6283125
      ------------------------------------------------------------------------------
      
      . 
      end of do-file
      
      . do "C:\Users\Owner\AppData\Local\Temp\STD00000000.tmp"
      
      . tab dd reg_d if e(logcr)
      
                 |         reg_d
              dd |         0          1 |     Total
      -----------+----------------------+----------
               0 |    65,206        294 |    65,500 
               1 |    13,145        575 |    13,720 
      -----------+----------------------+----------
           Total |    78,351        869 |    79,220
      Last edited by Sabrina Gong; 19 Nov 2014, 13:38.



      • #4
        Originally posted by Clyde Schechter View Post

        There is no reason you cannot have two dummy variables and a constant term in a regression: it's done all the time. So either you are not generating the dummies correctly, or there is something in the data that leads to collinearity among these three terms. To see what's going on we will need to see, at the least, your code and Stata's response.
        Thanks a lot for the reply! I have attached the code I ran for the simple regression. To generate the dummies, I simply use
        Code:
        gen reg_d=1 if sic=="4011" & fyear<1980
        replace reg_d=1 if sic=="4210" & fyear<1980
        **generate dividend payout dummy
        gen dd=1 if !missing(dvc) & dvc!=0
        replace dd=0 if missing(dd)
        I really couldn't figure out why the problem happened. Your help will be greatly appreciated.



        • #5
          You're running the regression for a small subsample of your data (only 26 observations are included in the regression). You need to look at the relationship between reg_d and your other covariates for that particular sample.

          You should also confirm that you expect a regression for a single year to only contain 26 observations. You may have missing data decreasing your sample size further than you expect.
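
One way to check how missing values shrink a single year's sample is sketched below. It assumes the poster's data are in memory and reuses the variable names from the regression in #3; the local macro vars and the variable nmiss are hypothetical names introduced only for this example.

```stata
* Assumes the poster's firm-year data are loaded; names follow the regression in #3
local vars logcr mb logaat cfat nwat ceat is lat rs dd reg_d

* Which variables have missing values among the 1973 observations?
misstable summarize `vars' if fyear == 1973

* Number of missing values per observation across the regression variables
egen nmiss = rowmiss(`vars')

* Complete cases in 1973: this count should match the regression's Number of obs
count if fyear == 1973 & nmiss == 0
```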



          • #6
            Your -tab dd reg_d if e(logcr)- is not what I asked for, but it actually sheds some additional light on the problem.

            What I did ask for is -tab dd reg_d if e(sample)-, which would have shown the cross-tabulation between dd and reg_d restricted to the observations that participated in the regression. e(logcr) doesn't actually exist, so it is evaluated as missing, which in Stata translates to Boolean true. So your cross tab shows the relationship of dd and reg_d in your entire data set. It is striking that your entire data set contains 79,220 observations, yet your regression analysis sample is only 26 observations. Given how rare dd=1 and reg_d = 1 both are in your full data, it would not be surprising at all to find that when restricted to just the 26 in your regression reg_d is always 0, or, alternatively, that reg_d is always equal to dd. Either of those conditions would lead to reg_d being dropped for collinearity (the first case being collinearity with _cons, and the second being collinearity with dd).

            Anyway, I suggest you run -tab dd reg_d if e(sample)- to get the actual results. If that doesn't make the problem clear, then the next step is to run

            Code:
            regress reg_d mb logaat cfat nwat ceat is lat rs dd if e(sample)
            Since reg_d is known to be collinear with these variables, the output of this regression will show you exactly which variable(s) it is collinear with, and exactly what the offending linear combination is. At that point you can decide to remove one of the involved variables to break the collinearity, or you can conclude that there is something wrong with your data and investigate fixing it, or you can conclude that this is just a fluke that happened to occur in this very tiny subsample of your data and is nothing to worry about.
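
The mechanism can be reproduced on simulated data. The sketch below is only an illustration under made-up assumptions (the variables y, x, d1, d2 and the pattern of missing values are invented): a covariate that is missing in exactly the rows where a rare dummy equals 1 leaves that dummy constant in the estimation sample, so it is omitted.

```stata
* Simulated example (hypothetical variables, not the poster's data)
clear
set seed 2014
set obs 200
gen x = rnormal()
gen byte d1 = runiform() < 0.5
* d2 equals 1 only in the last 10 rows
gen byte d2 = _n > 190
gen y = 1 + x + d1 + d2 + rnormal()

* A missing covariate removes every row with d2 == 1 from the regression
replace x = . if _n > 150

* d2 is always 0 among the 150 usable rows, so it is omitted for collinearity
regress y x d1 d2

* Restricted to the estimation sample: only d2 == 0 appears
tab d1 d2 if e(sample)

* e(nosuch) does not exist, evaluates to missing, and counts as true,
* so this tabulates all 200 rows instead
tab d1 d2 if e(nosuch)
```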



            • #7
              Originally posted by Sarah Edgington View Post
              You're running the regression for a small subsample of your data (only 26 observations are included in the regression). You need to look at the relationship between reg_d and your other covariates for that particular sample.

              You should also confirm that you expect a regression for a single year to only contain 26 observations. You may have missing data decreasing your sample size further than you expect.

              Thanks a lot. It's actually a large sample from 1972-1994. However, since I am using the Fama-MacBeth method, I am hoping to get 24 cross-sectional regressions. The sample size in 1973 is 638. Please see
              Code:
              . count if fyear==1973
                638
              I wonder whether Stata dropped some observations because one independent variable may be missing. There are indeed some missing values, and I can't simply set them to 0.



              • #8
                Originally posted by Clyde Schechter View Post
                Since reg_d is known to be collinear with these variables, the output of this regression will show you exactly which variable(s) it is collinear with, and exactly what the offending linear combination is. At that point you can decide to remove one of the involved variables to break the collinearity, or you can conclude that there is something wrong with your data and investigate fixing it, or you can conclude that this is just a fluke that happened to occur in this very tiny subsample of your data and is nothing to worry about.
                Thank you very much, Clyde. I really appreciate your detailed reply.

                Yes, the sample is actually large, with 79,220 firm-year obs. However, what I am doing is running cross-sectional regressions by year and then averaging the coefficients, basically the Fama-MacBeth (1973) approach. I do have 638 obs in 1973, but some variables, like cfat or nwat, may be missing, so I assume Stata dropped every observation in which any independent variable is missing.

                If I understand you correctly, reg_d as I defined it equals 1 only before 1980, so from 1980-1994 reg_d=0, which causes the multicollinearity problem with the constant. That's why Stata dropped it. However, there are some years in which reg_d=1. Please see the code and the results when I ran it again for the 1981 observations:

                Code:
                 reg logcr mb logaat cfat nwat ceat is lat rs dd reg_d if fyear==1981,robust
                
                Linear regression                                      Number of obs =     864
                                                                       F( 10,   853) =   24.27
                                                                       Prob > F      =  0.0000
                                                                       R-squared     =  0.2398
                                                                       Root MSE      =  1.6164
                
                ------------------------------------------------------------------------------
                             |               Robust
                       logcr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                          mb |   .1105858    .015035     7.36   0.000      .081076    .1400957
                      logaat |  -.2749471   .0473331    -5.81   0.000    -.3678501   -.1820441
                        cfat |   .1195163   .1497906     0.80   0.425    -.1744851    .4135176
                        nwat |   .6455212   .1263569     5.11   0.000     .3975143    .8935281
                        ceat |   2.383476   .3612025     6.60   0.000     1.674526    3.092425
                          is |   .0099727   .0055972     1.78   0.075    -.0010131    .0209586
                         lat |   .1614295     .38309     0.42   0.674    -.5904799    .9133389
                          rs |   .0181547   .0123518     1.47   0.142    -.0060888    .0423981
                          dd |   .3676682   .1375784     2.67   0.008     .0976363    .6377001
                       reg_d |   1.100781   .7298431     1.51   0.132     -.331718     2.53328
                       _cons |   -2.63111   .1701292   -15.47   0.000    -2.965031   -2.297189
                ------------------------------------------------------------------------------
                
                . 
                end of do-file
                
                . do "C:\Users\Owner\AppData\Local\Temp\STD00000000.tmp"
                
                . tab dd reg_d if e(sample)
                
                           |         reg_d
                        dd |         0          1 |     Total
                -----------+----------------------+----------
                         0 |       680          2 |       682 
                         1 |       181          1 |       182 
                -----------+----------------------+----------
                     Total |       861          3 |       864 
                
                
                . 
                end of do-file
                
                .

                It seems to work now. However, I am replicating a paper in which the author did get the coefficient... I really can't figure it out.

                Thank you so much.
                Last edited by Sabrina Gong; 19 Nov 2014, 14:14.



                • #9
                  Something is amiss. In #4, you said you created reg_d with the code:

                  Code:
                  gen reg_d=1 if sic=="4011" & fyear<1980
                  replace reg_d=1 if sic=="4210" & fyear<1980
                  But there must be more to it, because in that case, reg_d would always be either 1 or missing. There is nothing here that leads to zero values for reg_d. So there must be additional code that modifies reg_d before you get to your regressions. Similarly, you would not be able to obtain:

                  Code:
                   count if reg_d==1 & fyear==1981
                     93
                  based only on the two commands you showed us in #4, since reg_d must be missing when fyear == 1981 according to #4. So somewhere, something is changing the values of reg_d from those initial values.

                  Now in #8 you show us that a regression done subject to the qualification -if fyear==1994- again drops reg_d for collinearity. This result is entirely consistent with the subsequent tabulation showing that reg_d is always zero in this estimation sample. This time, however, your estimation sample is reasonably hefty, so it is unlikely to be a fluke due to small subsample size.

                  So the issue boils down entirely to this: you need to be clear on how reg_d is calculated. The two lines posted in #4 are not the whole story of that variable's creation. You need to review every line of code that might change the values of reg_d until you understand what is going on with this variable. The regressions are behaving in accordance with the actual values of reg_d, but it appears the values of reg_d are not what you expect them to be.



                  • #10
                    Originally posted by Clyde Schechter View Post
                    Something is amiss. In #4, you said you created reg_d with the code:


                    So the issue boils down entirely to this: you need to be clear on how reg_d is calculated. The two lines posted in #4 are not the whole story of that variable's creation. You need to review every line of code that might change the values of reg_d until you understand what is going on with this variable. The regressions are behaving in accordance with the actual values of reg_d, but it appears the values of reg_d are not what you expect them to be.
                    Yes, you are right. Sorry, I missed something when pasting my code. The code to generate reg_d should be

                    Code:
                    gen reg_d=1 if sic=="4011" & fyear<1980
                    replace reg_d=1 if sic=="4210" & fyear<1980
                    replace reg_d=1 if sic=="4213" & fyear<1980
                    replace reg_d=1 if sic=="4512" & fyear<1978
                    replace reg_d=1 if sic=="4812" & fyear<1982
                    replace reg_d=1 if sic=="4813" & fyear<1982
                    replace reg_d=0 if missing(reg_d)
                     codebook reg_d
                    
                    -----------------------------------------------------------------------------------------------------------------
                    reg_d                                                                                                 (unlabeled)
                    -----------------------------------------------------------------------------------------------------------------
                    
                                      type:  numeric (float)
                    
                                     range:  [0,1]                        units:  1
                             unique values:  2                        missing .:  0/79220
                    
                                tabulation:  Freq.  Value
                                             78351  0
                                               869  1
                    
                    . 
                    end of do-file
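
For comparison, the same two dummies could probably be built in one statement each, which avoids the intermediate missing values that caused the confusion earlier in the thread. This is a sketch under the thread's own assumptions (sic stored as a string, dvc numeric); reg_d2 and dd2 are hypothetical names chosen so the result can be checked against the existing variables.

```stata
* Assumes the poster's data are in memory, with sic stored as a string
* A logical expression evaluates to 0 or 1, so no replace step is needed
gen byte reg_d2 = (inlist(sic, "4011", "4210", "4213") & fyear < 1980) ///
                | (sic == "4512" & fyear < 1978)                       ///
                | (inlist(sic, "4812", "4813") & fyear < 1982)

* Dividend payer dummy: 1 if dvc is nonmissing and nonzero, 0 otherwise
gen byte dd2 = !missing(dvc) & dvc != 0

* Confirm the one-line versions agree with the variables built step by step
assert reg_d2 == reg_d
assert dd2 == dd
```

The continuation marker /// only works in a do-file, so these lines should be run from the Do-file Editor rather than typed at the command line.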



                    • #11
                      Thanks, that makes more sense.



                      • #12
                        Originally posted by Clyde Schechter View Post
                        Thanks, that makes more sense.

                        Thank you for helping me to solve the problem.
