  • Need help with replication of research results for further analysis

    Hello everyone,
    I need help with a research project in which I am using Stata for the first time, so I do not know it very well yet; I hope my question is not too long. I have been struggling with this for some time. I would like to reproduce the results of Brav et al. (https://www.sciencedirect.com/scienc...26tMaSI7sdaczn) so that I can be sure I have proceeded correctly, and then extend the setup with additional variables for further investigations.

    I am concerned here with Table 7, Panels B and C of the paper. So far I have reduced the dataset to the specific requirements given there. For Panel B there are almost 72,000 observations in total (so posting them via dataex is not possible), and in the next step I estimate two parameters with an OLS regression of the paper's formula.

    Here are my steps so far for Panel B (Panel C is similar, though not identical):

    Code:
    egen firmid = group(gvkey)
    egen timeid = group(fyear)
    duplicates report firmid timeid
    duplicates tag firmid timeid, gen(isdup)
    drop if isdup
    
    tsset firmid timeid
    sum fyear, d
    return list
    
    // generate 3 new variables
    gen prior_year_dvpsx_f = L.dvpsx_f
    gen earning = epspx
    gen deltadvpsx = dvpsx_f - prior_year_dvpsx_f
    
    // drop if missing
    drop if earning==.
    drop if dvpsx_f==.
    
    // generate subperiod indicator
    generate fy = .
    
    // assign the 3 subperiods
    replace fy=1 if fyear <= 1964
    replace fy=2 if fyear > 1964 & fyear <= 1983
    replace fy=3 if fyear > 1983 & fyear <= 2002
    
    // count the number of observations per firm in each subperiod
    bysort firmid: egen counter1=count(firmid) if fyear <= 1964
    bysort firmid: egen counter2=count(firmid) if fyear > 1964 & fyear <= 1983
    bysort firmid: egen counter3=count(firmid) if fyear > 1983 & fyear <= 2002
    
    
    // drop firms whose data do not cover the full subperiod
    drop if fy == 1 & counter1 < 15
    drop if fy == 2 & counter2 < 19
    drop if fy == 3 & counter3 < 19
    
    // summarize the observations in each subperiod
    sum counter1 if fy == 1 & counter1==15, d
    sum counter2 if fy == 2 & counter2==19, d
    sum counter3 if fy == 3 & counter3==19, d
    
    // Regression
    reg deltadvpsx prior_year_dvpsx_f earning if fy==1
    reg deltadvpsx prior_year_dvpsx_f earning if fy==2
    reg deltadvpsx prior_year_dvpsx_f earning if fy==3
    And here the problems begin, and I have several questions. On the one hand, my estimated values in the first two subperiods are somewhat low, and it also seems that my estimated earnings coefficient (beta2) is too high (or beta1 too low): after converting my regression results in my head, I get a higher estimate for beta2 than for beta1, although according to the research paper this should not be the case. The third and last subperiod, however, is completely out of range, because there my beta1 is much larger than in the first two subperiods. According to Brav et al. this should not happen.

    This raises the question: what mistake did I make, so that I cannot reproduce the results, especially for the third subperiod? Should I be using a pooled OLS regression or a cross-sectional OLS regression, as mentioned in the description of the table? If so, how do I do that? The whole thing confuses me, especially because the number of observations I have corresponds quite closely to the numbers in the paper.

    Code:
    . reg deltadvpsx prior_year_dvpsx_f earning if fy==1
    
          Source |       SS           df       MS      Number of obs   =     7,210
    -------------+----------------------------------   F(2, 7207)      =   1530.15
           Model |  713.729123         2  356.864562   Prob > F        =    0.0000
        Residual |  1680.83347     7,207  .233222349   R-squared       =    0.2981
    -------------+----------------------------------   Adj R-squared   =    0.2979
           Total |  2394.56259     7,209  .332162934   Root MSE        =    .48293
    
    ------------------------------------------------------------------------------------
            deltadvpsx | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------------+----------------------------------------------------------------
    prior_year_dvpsx_f |  -.3149017   .0058851   -53.51   0.000    -.3264382   -.3033653
               earning |   .1047265   .0025142    41.65   0.000     .0997979    .1096552
                 _cons |   .1120818   .0101159    11.08   0.000     .0922517    .1319118
    ------------------------------------------------------------------------------------
    
    . reg deltadvpsx prior_year_dvpsx_f earning if fy==2
    
          Source |       SS           df       MS      Number of obs   =    32,144
    -------------+----------------------------------   F(2, 32141)     =   3662.12
           Model |  1016.22144         2  508.110722   Prob > F        =    0.0000
        Residual |  4459.48791    32,141   .13874764   R-squared       =    0.1856
    -------------+----------------------------------   Adj R-squared   =    0.1855
           Total |  5475.70936    32,143  .170354645   Root MSE        =    .37249
    
    ------------------------------------------------------------------------------------
            deltadvpsx | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------------+----------------------------------------------------------------
    prior_year_dvpsx_f |  -.2169661   .0028232   -76.85   0.000    -.2224997   -.2114324
               earning |   .0737496   .0010292    71.66   0.000     .0717322    .0757669
                 _cons |   .0330613   .0029897    11.06   0.000     .0272013    .0389212
    ------------------------------------------------------------------------------------
    
    . reg deltadvpsx prior_year_dvpsx_f earning if fy==3
    
          Source |       SS           df       MS      Number of obs   =    33,464
    -------------+----------------------------------   F(2, 33461)     =   8296.23
           Model |  56096.8386         2  28048.4193   Prob > F        =    0.0000
        Residual |  113127.127    33,461  3.38086511   R-squared       =    0.3315
    -------------+----------------------------------   Adj R-squared   =    0.3315
           Total |  169223.966    33,463  5.05704707   Root MSE        =    1.8387
    
    ------------------------------------------------------------------------------------
            deltadvpsx | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------------+----------------------------------------------------------------
    prior_year_dvpsx_f |   -.658861   .0051149  -128.81   0.000    -.6688864   -.6488356
               earning |   7.94e-07   8.72e-06     0.09   0.927    -.0000163    .0000179
                 _cons |   .3111938   .0103593    30.04   0.000     .2908892    .3314985
    ------------------------------------------------------------------------------------
    Here you can see that the coefficient on prior_year_dvpsx_f (beta1) in the third regression is much larger in magnitude than in the other two, which shouldn't be the case.
    Also, what is the difference if I add vce(robust) to my regression? I saw a big difference in my F-value and t-values, but what do these changes mean?

    Furthermore, I come to the second question: how do I turn my regression results into a table matching the one in the paper, i.e. a table of the regression results for the 3 subperiods with the years and the SD, median, 25th and 75th percentiles of SOA, TP and adjusted R-squared, where SOA is -beta1 and TP is -beta2/beta1?
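
    (What I have in mind is roughly this kind of conversion after each regression; this is only a sketch with the variable names from above, and I am not sure it is the right way to build the table:)
    Code:
    regress deltadvpsx prior_year_dvpsx_f earning if fy==1
    scalar SOA   = -_b[prior_year_dvpsx_f]               // speed of adjustment = -beta1
    scalar TP    = -_b[earning]/_b[prior_year_dvpsx_f]   // target payout = -beta2/beta1
    scalar adjR2 = e(r2_a)                                // adjusted R-squared
    display "SOA = " SOA "   TP = " TP "   adj. R2 = " adjR2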

    Answers that can move me forward here, so that I can finally continue the investigation, would be super helpful.

    Greetings
    Steffen

  • #2
    Steffen:
    while I cannot comment on the results of your OLS regressions (except to note that -earning- is totally negligible in your 3rd one), -robust- standard errors take care of heteroskedasticity in the residual distribution (you probably do have heteroskedasticity, which you can test via -estat hettest-, if your results change markedly versus those obtained with default standard errors).
    In addition, I would also run -linktest- to check the functional form of the regressand (an OLS with only two predictors always looks suspicious).
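    For example (just a sketch on your first subperiod; note that -vce(robust)- changes only the standard errors and the t- and F-statistics, not the coefficient estimates):
    Code:
    regress deltadvpsx prior_year_dvpsx_f earning if fy==1
    estat hettest        // Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
    linktest             // a significant _hatsq hints at a misspecified functional form
    regress deltadvpsx prior_year_dvpsx_f earning if fy==1, vce(robust)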
    As far as the table you're after is concerned, take a look at -table- suite in Stata 17.
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Thanks for your message Carlo.
      True, -earning- in the 3rd OLS is negligible; it shouldn't be that low, and it confuses me. That is the main reason for this post: the big question is whether I did something wrong along the way.
      Also, can my OLS regressions be viewed as cross-sectional, or what command is needed to run an OLS regression on the cross-sectional distribution, since that is what the authors of the paper did, right? All these different types confuse me at first glance.
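      What I imagine (but am not sure about) is one regression per fiscal year and then summarizing the coefficients across the years, roughly like this sketch with my variable names; please correct me if that is not what is meant:
      Code:
      * one cross-sectional regression per fyear, collecting coefficients and adj. R2
      statsby b1=_b[prior_year_dvpsx_f] b2=_b[earning] r2a=e(r2_a), by(fyear) clear: ///
          regress deltadvpsx prior_year_dvpsx_f earning
      summarize b1 b2 r2a, detail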

      I tried -estat hettest- for every independent and dependent variable and got a p-value of 0.0000 for almost every one, except for -earning- in the 3rd OLS (huh?). So is it better to work with -robust- standard errors from here on because of the heteroskedasticity? With robust SEs I get very different t-values and F-values, but then I no longer have the adj. R-squared as in the paper above.
      Code:
      . estat hettest
      
      Breusch–Pagan/Cook–Weisberg test for heteroskedasticity
      Assumption: Normal error terms
      Variable: Fitted values of deltadvpsx
      
      H0: Constant variance
      
          chi2(1) = 1.40e+06
      Prob > chi2 =   0.0000
      With -linktest- I get these values, compared with the plain -reg- for the 1st OLS:
      Code:
      . reg deltadvpsx prior_year_dvpsx_f earning if fy==1
      
            Source |       SS           df       MS      Number of obs   =     7,210
      -------------+----------------------------------   F(2, 7207)      =   1530.15
             Model |  713.729123         2  356.864562   Prob > F        =    0.0000
          Residual |  1680.83347     7,207  .233222349   R-squared       =    0.2981
      -------------+----------------------------------   Adj R-squared   =    0.2979
             Total |  2394.56259     7,209  .332162934   Root MSE        =    .48293
      
      ------------------------------------------------------------------------------------
              deltadvpsx | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------------+----------------------------------------------------------------
      prior_year_dvpsx_f |  -.3149017   .0058851   -53.51   0.000    -.3264382   -.3033653
                 earning |   .1047265   .0025142    41.65   0.000     .0997979    .1096552
                   _cons |   .1120818   .0101159    11.08   0.000     .0922517    .1319118
      ------------------------------------------------------------------------------------
      
      .
      end of do-file
      
      . linktest
      
            Source |       SS           df       MS      Number of obs   =     7,210
      -------------+----------------------------------   F(2, 7207)      =   2254.15
             Model |  921.480408         2  460.740204   Prob > F        =    0.0000
          Residual |  1473.08219     7,207   .20439603   R-squared       =    0.3848
      -------------+----------------------------------   Adj R-squared   =    0.3847
             Total |  2394.56259     7,209  .332162934   Root MSE        =     .4521
      
      ------------------------------------------------------------------------------
        deltadvpsx | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
              _hat |   1.060913   .0170302    62.30   0.000     1.027528    1.094297
            _hatsq |  -.2748068   .0086197   -31.88   0.000     -.291704   -.2579097
             _cons |   .0321113    .005522     5.82   0.000     .0212866     .042936
      ------------------------------------------------------------------------------
      What does this tell me? That the prediction terms for deltadvpsx are significant (given the high t-values) and therefore the prediction is not good, or how should I understand this output?

      Regarding the -table- suite, I have tried to understand it, but so far I have not found a good guide or video that helps me achieve what I specifically want. For example, if I try to restrict the observations to the subperiods fy==1, fy==2 and fy==3, I get an "invalid" message. It seems it will take some time to understand the tables builder, because I cannot find any clue how to calculate new results from my current regression results and put all the subperiods together in one big table.

      • #4
        Steffen:
        1) you actually ran three cross-sectional regressions;
        2) you have a clear problem of heteroskedasticity; therefore you should go -robust-;
        3) the -linktest- outcome tells you that the functional form of your regressand is misspecified. Put differently, your model is not that informative because of too small a number of predictors. I do not know whether the authors of the paper you're trying to replicate actually investigated this possible nuisance;
        4) you can retrieve the adj_R_sq via:
        Code:
        . use "C:\Program Files\Stata17\ado\base\a\auto.dta"
        (1978 automobile data)
        
        . regress price mpg, robust
        
        Linear regression                               Number of obs     =         74
                                                        F(1, 72)          =      17.28
                                                        Prob > F          =     0.0001
                                                        R-squared         =     0.2196
                                                        Root MSE          =     2623.7
        
        ------------------------------------------------------------------------------
                     |               Robust
               price | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
        -------------+----------------------------------------------------------------
                 mpg |  -238.8943   57.47701    -4.16   0.000    -353.4727    -124.316
               _cons |   11253.06   1376.393     8.18   0.000     8509.272    13996.85
        ------------------------------------------------------------------------------
        
        . di e(r2_a)
        .20874373
        
        .
        5) the -table- suite needs a bit of training and does not cover all the needs;
        6) what's the role of your supervisor in your research?
        Kind regards,
        Carlo
        (Stata 19.0)

        • #5
          Thanks for your input Carlo.
          You are right, my values are misleading, which is why I asked my supervisor about it. My supervisor's role is only to review my master's thesis, and I cannot ask questions frequently because this research is mostly done independently, so I cannot expect as much help from the supervisor as I get here in the forum.

          However, the supervisor did tell me to run the regression as a loop. My whole work so far is of course strongly distorted by the fact that the regressions were not run at the firm level. So the regression has to be done for each company individually.

          Therefore I want to create a looped OLS regression for each company within the three subperiods (fy) that I have, taking each company into account individually for the years 1950-2020, for around 74,000 observations depending on subperiod membership.

          I tried something like this:
          Code:
          forvalues fyear = 1950(1)2020 {
              
              reg deltadividend prior_dividend earning if fy==1
              reg deltadividend prior_dividend earning if fy==2
              reg deltadividend prior_dividend earning if fy==3
          }
          The problem, however, is that this treats each subperiod as a whole rather than each year individually, so I get the same three regressions displayed over and over again for every year of the loop. What am I doing wrong here? Also, how do I bring the different companies (firmid) into play?
          Or is it enough to loop the regression over firms within each subperiod? Is that the same as regressing each company and each year, then summarizing the regression estimates for one company within one subperiod, so that I can finally summarize the average, the percentiles and the median across all firms in one subperiod as one result?

          In addition, I would like to convert the estimated coefficients after each regression (for example -(_b[earning]/_b[prior_dividend]) as the target payout) and then summarize the converted values within a company and subperiod, and finally summarize all values of a subperiod as an average in the last step. Do I have to do this with the -table- suite, or is there a command for that? The conversion has to happen before I summarize the converted values of the individual companies for each subperiod.
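          To make it concrete, what I am picturing is roughly the following (only a sketch with my variable names; I do not know whether this is the right approach):
          Code:
          * one regression per firm and subperiod, keeping the converted quantities directly
          statsby SOA=(-_b[prior_dividend]) TP=(-_b[earning]/_b[prior_dividend]) adj_r2=e(r2_a), ///
              by(firmid fy) clear: regress deltadividend prior_dividend earning
          
          * then, for example, summarize the converted values across firms within each subperiod
          bysort fy: summarize SOA TP adj_r2, detail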

          • #6
            Steffen:
            1) I doubt that your -forval- loop can work properly the way it is coded.
            2) my guess is that (setting 1) aside for a while) the years and -fy- get mixed up, and the whole thing becomes less informative than expected.
            I would go something along the following lines:
            Code:
             forvalues fyear = 1950(1)2020 {
                 reg deltadividend prior_dividend earning i.fy if `fyear' != .
             }
            I would fix the regression first and then deal with the remaining items.
            Kind regards,
            Carlo
            (Stata 19.0)

            • #7
              Thanks for your reply Carlo.
              I noticed that I had set things up upside down. What I actually want first is a firm-level regression over the years, separated into the 3 subperiods fy (1950-1963, 1964-1983 and 1984-2002). 1950-2002 is the whole sample, with 73k+ observations and 3,023 firms. Because some firms had no observations at all, I tidied up the firm list (it was around 34,000 firms before).

              So first I fixed my regression:
              Code:
               egen newid = group(firmid)
               sum newid, d
               return list
               
               local max = r(max)
               
               forvalues id = 1/`max' {
                   forvalues p = 1/3 {
                       capture noisily regress deltadividend prior_dividend earning if newid == `id' & fy == `p'
                   }
               }
               This gives me the regression results for each firmid in each of the three subperiods fy.
               Although some results are simply empty because a specific firm has no observations in a subperiod, the results seem okay to me.

               My next step is to calculate 2 new variables from the results of each regression and, in the end, summarize all firms in one subperiod as one result with the average, the SD, the median and the percentiles.

               I tried it with -statsby-, and this works out pretty well for the calculation.
              Code:
               statsby _b _se SOA = (-(_b[prior_dividend])) TP = (-((_b[earning])/(_b[prior_dividend]))) r_squared = e(r2_a), by(newid) : regress deltadividend prior_dividend earning
               Now my problem is that I cannot combine the extracted regression values and the recalculated values with my dataset; I get a message that it will not run because the data in memory would be replaced. If I use the -clear- option, the dataset is replaced by the statsby results, but then I no longer have any reference to the subperiods, so I cannot combine the data into the 3 subperiods.
               How do I solve this? Do I have to perform some kind of merge that brings my dataset together with the statsby values, so that I can then summarize them?

              Code:
              summarize _b _se SOA TP adj_r_squared if fy==1, r(N) r(mean) r(p25) r(p50) r(p75) r(sd)
              summarize _b _se SOA TP adj_r_squared if fy==2, r(N) r(mean) r(p25) r(p50) r(p75) r(sd)
              summarize _b _se SOA TP adj_r_squared if fy==3, r(N) r(mean) r(p25) r(p50) r(p75) r(sd)
               Also, is this command formulated correctly for my next step, or do I need to write out all the coefficient variables in full rather than just _b?
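
               (As far as I understand, -summarize, detail- stores these figures as returned results rather than taking them as options, so I suppose the check would look something like this sketch:)
               Code:
               summarize SOA if fy==1, detail
               return list          // shows r(N), r(mean), r(sd), r(p25), r(p50), r(p75), ...
               local soa_mean = r(mean)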

              I can't test it as long as the previous step is not solved.



              • #8
                Steffen:
                1) I would give -merge- a shot;
                 2) it depends on how your regression results were named by -statsby-.
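                 Something along these lines might work (only a sketch; the file name -firm_estimates- is illustrative, and keeping -fy- in -by()- lets the subperiod travel with the estimates):
                 Code:
                 statsby SOA=(-_b[prior_dividend]) TP=(-_b[earning]/_b[prior_dividend]) adj_r2=e(r2_a), ///
                     by(newid fy) saving(firm_estimates, replace): ///
                     regress deltadividend prior_dividend earning
                 
                 merge m:1 newid fy using firm_estimates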
                Kind regards,
                Carlo
                (Stata 19.0)

                • #9
                  Thanks.
                  I have managed to summarize my values into the 3 subperiods. They now look like this:
                  Code:
                  summarize Constant Speed_of_adjustment Target_Payout Adj_R_squared if fy==1, detail
                  summarize Constant Speed_of_adjustment Target_Payout Adj_R_squared if fy==2, detail
                  summarize Constant Speed_of_adjustment Target_Payout Adj_R_squared if fy==3, detail
                  
                  
                   // part of the output for fy==3:
                             -((_b[earning])/(_b[ prior_dividend]))
                  -------------------------------------------------------------
                        Percentiles      Smallest
                   1%    -1.221968      -113.1293
                   5%    -.0717447      -21.63607
                  10%    -.0026554      -17.49833       Obs               1,767
                  25%            0      -13.54007       Sum of wgt.       1,767
                  
                  50%     .0167484                      Mean           .0404623
                                          Largest       Std. dev.      2.930668
                  75%     .1800658       6.307414
                  90%     .4503128       11.60898       Variance       8.588813
                  95%     .6568139       14.80244       Skewness      -32.75353
                  99%     1.459117       21.33253       Kurtosis       1264.848
                  
                                             e(r2_a)
                  -------------------------------------------------------------
                        Percentiles      Smallest
                   1%    -.1039925      -.1206766
                   5%    -.0559962       -.120493
                  10%    -.0136272      -.1169248       Obs               1,211
                  25%     .0991301      -.1161157       Sum of wgt.       1,211
                  
                  50%     .2846498                      Mean           .3109122
                                          Largest       Std. dev.      .2623919
                  75%     .4787887              1
                  90%     .6726665              1       Variance       .0688495
                  95%     .8111694              1       Skewness       .5467246
                  99%     .9997442              1       Kurtosis       2.708484
                  My next and last step is to display the summarized values side by side for each subperiod, with the statistics I want, using the table builder. I have done some research and read about -collect get- several times, but it only captures the last of the 3,000+ regressions as a table. What I want is to collect the summaries for each variable in the 3 subperiods so that I can display them clearly as one table at the end.

                  What command can I use here to capture the subperiod summaries for each variable? Or can I just do it without the builder, although I guess that would be more complicated, right?
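
                  (In case the builder turns out to be too cumbersome, I wonder whether a plain tabulation command would already give roughly this layout; a sketch, assuming the summary variables exist by subperiod as above:)
                  Code:
                  tabstat Constant Speed_of_adjustment Target_Payout Adj_R_squared, ///
                      by(fy) statistics(mean sd p25 p50 p75) columns(statistics)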

                  • #10
                    Steffen:
                    Stata 17 includes a totally revised version of the -table- command.
                    Take a look at it; I've found it really helpful.
                    Kind regards,
                    Carlo
                    (Stata 19.0)

                    • #11
                      Hello Carlo, hopefully my last question.
                      Over the last couple of days I did this with the -collect- commands (instead of -table-, which worked better for me) and now have this:
                      Code:
                      collect clear
                      
                      foreach i in Constant Speed_of_adjustment Target_Payout Adj_R_squared {
                       
                          collect: summarize `i' if fy == 2 , detail
                      }
                      
                      collect title Subperiod 1965-1983 (N=1697)
                      collect label values result mean "Average" sd "Standard Deviation" p25 "25th Percentile" p50 "Median" p75 "75th Percentile", modify
                      collect style column, extraspace(1)
                      collect label levels cmdset 1 "Constant" 2 "Speed of adjustment" 3 "Target Payout" 4 "Adjusted R²" , modify
                      collect style cell result[mean sd p25 p50 p75], nformat(%9.3f)
                      collect stars mean p25 p50 p75 sd 0.01 "***" 0.05 "**" 0.1 "*", attach(mean median p25 p75 sd) nformat(%9.7g) shownote
                      
                      collect layout (cmdset) (result[mean sd p25 p50 p75]) (), name(default)
                      
                      
                      Subperiod 1965-1983 (N=1697)
                      ------------------------------------------------------------------------------------------------
                                          |  Average   Standard Deviation   25th Percentile   Median   75th Percentile
                      --------------------+---------------------------------------------------------------------------
                      Constant            |  0.107**              0.303**          -0.013**    0.023           0.163**
                      Speed of adjustment |    0.394                0.311             0.157    0.348             0.596
                      Target Payout       |   0.216*               3.165*            0.080*    0.220            0.404*
                      Adjusted R²         |    0.384                0.254             0.189    0.364             0.577
                      ------------------------------------------------------------------------------------------------
                      *** p<.01, ** p<.05, * p<.1

                      My last question: I would like to have the t-values below my coefficients. How can I do that? I could not find any helpful pages that answer this question.
                      On the one hand I would like to clarify the significance again, and on the other hand I want to check whether my whole command is correct with regard to the significance levels. Thanks in advance.
                      Last edited by Steffen Scheifele; 09 Aug 2022, 09:54.

                      • #12
                        Steffen:
                        see Stata blog entry "Customizable tables in Stata 17".
                        Kind regards,
                        Carlo
                        (Stata 19.0)
