  • Need help with replication of research results for further analysis

    Hello everyone,
    I need help with a research project in which I am using Stata for the first time, so I do not know it very well yet; I hope my question is not too long. I have been struggling with this for some time. I would like to reproduce the results of Brav et al. (https://www.sciencedirect.com/scienc...26tMaSI7sdaczn) so that I can be sure I have proceeded correctly, and then extend the setup with additional variables for further investigations.

    I am concerned here with Table 7, Panels B and C of the paper. So far I have reduced the dataset to the specific requirements given there. For Panel B there are almost 72,000 observations in total (so posting them via dataex is not possible), and in the next step I estimate two parameters with an OLS regression of the paper's formula.

    Here are my steps so far for Panel B (Panel C is similar, though not identical):

    Code:
    egen firmid = group(gvkey)
    egen timeid = group(fyear)
    duplicates report firmid timeid
    duplicates tag firmid timeid, gen(isdup)
    drop if isdup
    
    tsset firmid timeid
    sum fyear, d
    return list
    
    // generate 3 new variables
    gen prior_year_dvpsx_f = L.dvpsx_f
    gen earning = epspx
    gen deltadvpsx = dvpsx_f - prior_year_dvpsx_f
    
    // drop if missing
    drop if earning==.
    drop if dvpsx_f==.
    
    // generate subperiod indicator
    generate fy = .
    
    // assign the 3 subperiods
    replace fy=1 if fyear <= 1964
    replace fy=2 if fyear > 1964 & fyear <= 1983
    replace fy=3 if fyear > 1983 & fyear <= 2002
    
    // count the number of observations per firm in each subperiod
    bysort firmid: egen counter1=count(firmid) if fyear <= 1964
    bysort firmid: egen counter2=count(firmid) if fyear > 1964 & fyear <= 1983
    bysort firmid: egen counter3=count(firmid) if fyear > 1983 & fyear <= 2002
    
    
    // drop firms whose data do not cover the full subperiod
    drop if fy == 1 & counter1 < 15
    drop if fy == 2 & counter2 < 19
    drop if fy == 3 & counter3 < 19
    
    // summarize the observations in each subperiod
    sum counter1 if fy == 1 & counter1==15, d
    sum counter2 if fy == 2 & counter2==19, d
    sum counter3 if fy == 3 & counter3==19, d
    
    // Regression
    reg deltadvpsx prior_year_dvpsx_f earning if fy==1
    reg deltadvpsx prior_year_dvpsx_f earning if fy==2
    reg deltadvpsx prior_year_dvpsx_f earning if fy==3
    And here the problems begin, and I have several questions. On the one hand, my estimated values in the first two subperiods are somewhat low, and it also seems that my estimated earnings coefficient (beta2) is too high (or beta1 too low): after converting my regression results in my head, I get a higher estimate for beta2 than for beta1, although according to the research paper this should not be the case. The third and last subperiod, however, is completely out of range, because there my beta1 is much larger than in the first two subperiods. According to Brav et al. this should not happen.

    This raises the question: what mistake did I make, so that I cannot reproduce the results, especially for the third subperiod? Should I be using a pooled OLS regression or a cross-sectional OLS regression, as mentioned in the description of the table? If so, how do I do that? The whole thing confuses me, especially because the number of observations I have corresponds quite closely to the numbers in the paper.

    Code:
    . reg deltadvpsx prior_year_dvpsx_f earning if fy==1
    
          Source |       SS           df       MS      Number of obs   =     7,210
    -------------+----------------------------------   F(2, 7207)      =   1530.15
           Model |  713.729123         2  356.864562   Prob > F        =    0.0000
        Residual |  1680.83347     7,207  .233222349   R-squared       =    0.2981
    -------------+----------------------------------   Adj R-squared   =    0.2979
           Total |  2394.56259     7,209  .332162934   Root MSE        =    .48293
    
    ------------------------------------------------------------------------------------
            deltadvpsx | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------------+----------------------------------------------------------------
    prior_year_dvpsx_f |  -.3149017   .0058851   -53.51   0.000    -.3264382   -.3033653
               earning |   .1047265   .0025142    41.65   0.000     .0997979    .1096552
                 _cons |   .1120818   .0101159    11.08   0.000     .0922517    .1319118
    ------------------------------------------------------------------------------------
    
    . reg deltadvpsx prior_year_dvpsx_f earning if fy==2
    
          Source |       SS           df       MS      Number of obs   =    32,144
    -------------+----------------------------------   F(2, 32141)     =   3662.12
           Model |  1016.22144         2  508.110722   Prob > F        =    0.0000
        Residual |  4459.48791    32,141   .13874764   R-squared       =    0.1856
    -------------+----------------------------------   Adj R-squared   =    0.1855
           Total |  5475.70936    32,143  .170354645   Root MSE        =    .37249
    
    ------------------------------------------------------------------------------------
            deltadvpsx | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------------+----------------------------------------------------------------
    prior_year_dvpsx_f |  -.2169661   .0028232   -76.85   0.000    -.2224997   -.2114324
               earning |   .0737496   .0010292    71.66   0.000     .0717322    .0757669
                 _cons |   .0330613   .0029897    11.06   0.000     .0272013    .0389212
    ------------------------------------------------------------------------------------
    
    . reg deltadvpsx prior_year_dvpsx_f earning if fy==3
    
          Source |       SS           df       MS      Number of obs   =    33,464
    -------------+----------------------------------   F(2, 33461)     =   8296.23
           Model |  56096.8386         2  28048.4193   Prob > F        =    0.0000
        Residual |  113127.127    33,461  3.38086511   R-squared       =    0.3315
    -------------+----------------------------------   Adj R-squared   =    0.3315
           Total |  169223.966    33,463  5.05704707   Root MSE        =    1.8387
    
    ------------------------------------------------------------------------------------
            deltadvpsx | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------------+----------------------------------------------------------------
    prior_year_dvpsx_f |   -.658861   .0051149  -128.81   0.000    -.6688864   -.6488356
               earning |   7.94e-07   8.72e-06     0.09   0.927    -.0000163    .0000179
                 _cons |   .3111938   .0103593    30.04   0.000     .2908892    .3314985
    ------------------------------------------------------------------------------------
    Here you can see that the coefficient on prior_year_dvpsx_f (beta1) in the third regression is much larger in magnitude than in the other two, which shouldn't be the case.
    Also, what is the difference if I add vce(robust) to my regression? I saw a big difference in my F-value and t-values, but what do these changes mean?

    Furthermore, I come to the second question: how do I turn my regression results into a table matching the one in the paper, i.e. a table of the regression results for the 3 subperiods with the years and the SD, median, 25th and 75th percentiles of SOA, TP and adjusted R-squared, where SOA is -beta1 and TP is -beta2/beta1?
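
    (What I have in mind is roughly this kind of conversion after each regression; this is only a sketch with the variable names from above, and I am not sure it is the right way to build the table:)
    Code:
    regress deltadvpsx prior_year_dvpsx_f earning if fy==1
    scalar SOA   = -_b[prior_year_dvpsx_f]               // speed of adjustment = -beta1
    scalar TP    = -_b[earning]/_b[prior_year_dvpsx_f]   // target payout = -beta2/beta1
    scalar adjR2 = e(r2_a)                                // adjusted R-squared
    display "SOA = " SOA "   TP = " TP "   adj. R2 = " adjR2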

    Answers that can move me forward here, so that I can finally continue the investigation, would be super helpful.

    Greetings
    Steffen

  • #2
    Steffen:
    while I cannot comment on the results of your OLS regressions (except to note that -earning- is totally negligible in your 3rd one), -robust- standard errors take care of heteroskedasticity in the residual distribution (you probably do have heteroskedasticity, which you can test via -estat hettest-, if your results change markedly versus those obtained with default standard errors).
    In addition, I would also run -linktest- to check the functional form of the regressand (an OLS with only two predictors always looks suspicious).
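    For example (just a sketch on your first subperiod; note that -vce(robust)- changes only the standard errors and the t- and F-statistics, not the coefficient estimates):
    Code:
    regress deltadvpsx prior_year_dvpsx_f earning if fy==1
    estat hettest        // Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
    linktest             // a significant _hatsq hints at a misspecified functional form
    regress deltadvpsx prior_year_dvpsx_f earning if fy==1, vce(robust)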
    As far as the table you're after is concerned, take a look at -table- suite in Stata 17.
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Thanks for your message Carlo.
      True, -earning- in the 3rd OLS is negligible; it shouldn't be that low, and it confuses me. That is the main reason for this post: the big question is whether I did something wrong along the way.
      Also, can my OLS regressions be viewed as cross-sectional, or what command is needed to run an OLS regression on the cross-sectional distribution, since that is what the authors of the paper did, right? All these different types confuse me at first glance.
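      What I imagine (but am not sure about) is one regression per fiscal year and then summarizing the coefficients across the years, roughly like this sketch with my variable names; please correct me if that is not what is meant:
      Code:
      * one cross-sectional regression per fyear, collecting coefficients and adj. R2
      statsby b1=_b[prior_year_dvpsx_f] b2=_b[earning] r2a=e(r2_a), by(fyear) clear: ///
          regress deltadvpsx prior_year_dvpsx_f earning
      summarize b1 b2 r2a, detail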

      I tried -estat hettest- for every independent and dependent variable and got a p-value of 0.0000 for almost every one, except for -earning- in the 3rd OLS (huh?). So is it better to work with -robust- standard errors from here on because of the heteroskedasticity? With robust SEs I get very different t-values and F-values, but then I no longer have the adj. R-squared as in the paper above.
      Code:
      . estat hettest
      
      Breusch–Pagan/Cook–Weisberg test for heteroskedasticity
      Assumption: Normal error terms
      Variable: Fitted values of deltadvpsx
      
      H0: Constant variance
      
          chi2(1) = 1.40e+06
      Prob > chi2 =   0.0000
      With -linktest- I get these values, compared with the plain -reg- for the 1st OLS:
      Code:
      . reg deltadvpsx prior_year_dvpsx_f earning if fy==1
      
            Source |       SS           df       MS      Number of obs   =     7,210
      -------------+----------------------------------   F(2, 7207)      =   1530.15
             Model |  713.729123         2  356.864562   Prob > F        =    0.0000
          Residual |  1680.83347     7,207  .233222349   R-squared       =    0.2981
      -------------+----------------------------------   Adj R-squared   =    0.2979
             Total |  2394.56259     7,209  .332162934   Root MSE        =    .48293
      
      ------------------------------------------------------------------------------------
              deltadvpsx | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------------+----------------------------------------------------------------
      prior_year_dvpsx_f |  -.3149017   .0058851   -53.51   0.000    -.3264382   -.3033653
                 earning |   .1047265   .0025142    41.65   0.000     .0997979    .1096552
                   _cons |   .1120818   .0101159    11.08   0.000     .0922517    .1319118
      ------------------------------------------------------------------------------------
      
      .
      end of do-file
      
      . linktest
      
            Source |       SS           df       MS      Number of obs   =     7,210
      -------------+----------------------------------   F(2, 7207)      =   2254.15
             Model |  921.480408         2  460.740204   Prob > F        =    0.0000
          Residual |  1473.08219     7,207   .20439603   R-squared       =    0.3848
      -------------+----------------------------------   Adj R-squared   =    0.3847
             Total |  2394.56259     7,209  .332162934   Root MSE        =     .4521
      
      ------------------------------------------------------------------------------
        deltadvpsx | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
              _hat |   1.060913   .0170302    62.30   0.000     1.027528    1.094297
            _hatsq |  -.2748068   .0086197   -31.88   0.000     -.291704   -.2579097
             _cons |   .0321113    .005522     5.82   0.000     .0212866     .042936
      ------------------------------------------------------------------------------
      What does this tell me? That the prediction terms for deltadvpsx are significant (given the high t-values) and therefore the prediction is not good, or how should I understand this output?

      Regarding the -table- suite, I have tried to understand it, but so far I have not found a good guide or video that helps me achieve what I specifically want. For example, if I try to restrict the observations to the subperiods fy==1, fy==2 and fy==3, I get an "invalid" message. It seems it will take some time to understand the tables builder, because I cannot find any clue how to calculate new results from my current regression results and put all the subperiods together in one big table.

      • #4
        Steffen:
        1) you actually ran three cross-sectional regressions;
        2) you have a clear problem of heteroskedasticity; therefore you should go -robust-;
        3) the -linktest- outcome tells you that the functional form of your regressand is misspecified. Put differently, your model is not that informative because of too small a number of predictors. I do not know whether the authors of the paper you're trying to replicate actually investigated this possible nuisance;
        4) you can retrieve the adj_R_sq via:
        Code:
        . use "C:\Program Files\Stata17\ado\base\a\auto.dta"
        (1978 automobile data)
        
        . regress price mpg, robust
        
        Linear regression                               Number of obs     =         74
                                                        F(1, 72)          =      17.28
                                                        Prob > F          =     0.0001
                                                        R-squared         =     0.2196
                                                        Root MSE          =     2623.7
        
        ------------------------------------------------------------------------------
                     |               Robust
               price | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
        -------------+----------------------------------------------------------------
                 mpg |  -238.8943   57.47701    -4.16   0.000    -353.4727    -124.316
               _cons |   11253.06   1376.393     8.18   0.000     8509.272    13996.85
        ------------------------------------------------------------------------------
        
        . di e(r2_a)
        .20874373
        
        .
        5) the -table- suite needs a bit of training and does not cover all the needs;
        6) what's the role of your supervisor in your research?
        Kind regards,
        Carlo
        (Stata 19.0)

        • #5
          Thanks for your input Carlo.
          You are right, my values are misleading, which is why I asked my supervisor about it. My supervisor's role is only to review my master's thesis, and I cannot ask questions frequently because this research is mostly done independently, so I cannot expect as much help from the supervisor as I get here in the forum.

          However, the supervisor did tell me to run the regression as a loop. My whole work so far is of course strongly distorted by the fact that the regressions were not run at the firm level. So the regression has to be done for each company individually.

          Therefore I want to create a looped OLS regression for each company within the three subperiods (fy) that I have, taking each company into account individually for the years 1950-2020, for around 74,000 observations depending on subperiod membership.

          I tried something like this:
          Code:
          forvalues fyear = 1950(1)2020 {
              
              reg deltadividend prior_dividend earning if fy==1
              reg deltadividend prior_dividend earning if fy==2
              reg deltadividend prior_dividend earning if fy==3
          }
          The problem, however, is that this treats each subperiod as a whole rather than each year individually, so I get the same three regressions displayed over and over again for every year of the loop. What am I doing wrong here? Also, how do I bring the different companies (firmid) into play?
          Or is it enough to loop the regression over firms within each subperiod? Is that the same as regressing each company and each year, then summarizing the regression estimates for one company within one subperiod, so that I can finally summarize the average, the percentiles and the median across all firms in one subperiod as one result?

          In addition, I would like to convert the estimated coefficients after each regression (for example -(_b[earning]/_b[prior_dividend]) as the target payout) and then summarize the converted values within a company and subperiod, and finally summarize all values of a subperiod as an average in the last step. Do I have to do this with the -table- suite, or is there a command for that? The conversion has to happen before I summarize the converted values of the individual companies for each subperiod.
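          To make it concrete, what I am picturing is roughly the following (only a sketch with my variable names; I do not know whether this is the right approach):
          Code:
          * one regression per firm and subperiod, keeping the converted quantities directly
          statsby SOA=(-_b[prior_dividend]) TP=(-_b[earning]/_b[prior_dividend]) adj_r2=e(r2_a), ///
              by(firmid fy) clear: regress deltadividend prior_dividend earning
          
          * then, for example, summarize the converted values across firms within each subperiod
          bysort fy: summarize SOA TP adj_r2, detail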

          • #6
            Steffen:
            1) I doubt that your -forval- loop can work properly the way it is coded.
            2) my guess is that (setting 1) aside for a while) the years and -fy- get mixed up, and the whole thing becomes less informative than expected.
            I would go something along the following lines:
            Code:
             forvalues fyear = 1950(1)2020 {
                 reg deltadividend prior_dividend earning i.fy if `fyear' != .
             }
            I would fix the regression first and then deal with the remaining items.
            Kind regards,
            Carlo
            (Stata 19.0)

            • #7
              Thanks for your reply Carlo.
              I noticed that I had set things up upside down. What I actually want first is a firm-level regression over the years, separated into the 3 subperiods fy (1950-1963, 1964-1983 and 1984-2002). 1950-2002 is the whole sample, with 73k+ observations and 3,023 firms. Because some firms had no observations at all, I tidied up the firm list (it was around 34,000 firms before).

              So first I fixed my regression:
              Code:
               egen newid = group(firmid)
               sum newid, d
               return list
               
               local max = r(max)
               
               forvalues id = 1/`max' {
                   forvalues p = 1/3 {
                       capture noisily regress deltadividend prior_dividend earning if newid == `id' & fy == `p'
                   }
               }
               This gives me the regression results for each firmid in each of the three subperiods fy.
               Although some results are simply empty because a specific firm has no observations in a subperiod, the results seem okay to me.

               My next step is to calculate 2 new variables from the results of each regression and, in the end, summarize all firms in one subperiod as one result with the average, the SD, the median and the percentiles.

               I tried it with -statsby-, and this works out pretty well for the calculation.
              Code:
               statsby _b _se SOA = (-(_b[prior_dividend])) TP = (-((_b[earning])/(_b[prior_dividend]))) r_squared = e(r2_a), by(newid) : regress deltadividend prior_dividend earning
               Now my problem is that I cannot combine the extracted regression values and the recalculated values with my dataset; I get a message that it will not run because the data in memory would be replaced. If I use the -clear- option, the dataset is replaced by the statsby results, but then I no longer have any reference to the subperiods, so I cannot combine the data into the 3 subperiods.
               How do I solve this? Do I have to perform some kind of merge that brings my dataset together with the statsby values, so that I can then summarize them?

              Code:
              summarize _b _se SOA TP adj_r_squared if fy==1, r(N) r(mean) r(p25) r(p50) r(p75) r(sd)
              summarize _b _se SOA TP adj_r_squared if fy==2, r(N) r(mean) r(p25) r(p50) r(p75) r(sd)
              summarize _b _se SOA TP adj_r_squared if fy==3, r(N) r(mean) r(p25) r(p50) r(p75) r(sd)
               Also, is this command formulated correctly for my next step, or do I need to write out all the coefficient variables in full rather than just _b?
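
               (As far as I understand, -summarize, detail- stores these figures as returned results rather than taking them as options, so I suppose the check would look something like this sketch:)
               Code:
               summarize SOA if fy==1, detail
               return list          // shows r(N), r(mean), r(sd), r(p25), r(p50), r(p75), ...
               local soa_mean = r(mean)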

              I can't test it as long as the previous step is not solved.



              • #8
                Steffen:
                1) I would give -merge- a shot;
                 2) it depends on how your regression results were named by -statsby-.
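                 Something along these lines might work (only a sketch; the file name -firm_estimates- is illustrative, and keeping -fy- in -by()- lets the subperiod travel with the estimates):
                 Code:
                 statsby SOA=(-_b[prior_dividend]) TP=(-_b[earning]/_b[prior_dividend]) adj_r2=e(r2_a), ///
                     by(newid fy) saving(firm_estimates, replace): ///
                     regress deltadividend prior_dividend earning
                 
                 merge m:1 newid fy using firm_estimates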
                Kind regards,
                Carlo
                (Stata 19.0)

                • #9
                  Thanks.
                  I have managed to summarize my values into the 3 subperiods. They now look like this:
                  Code:
                  summarize Constant Speed_of_adjustment Target_Payout Adj_R_squared if fy==1, detail
                  summarize Constant Speed_of_adjustment Target_Payout Adj_R_squared if fy==2, detail
                  summarize Constant Speed_of_adjustment Target_Payout Adj_R_squared if fy==3, detail
                  
                  
                   // part of the output for fy==3:
                             -((_b[earning])/(_b[ prior_dividend]))
                  -------------------------------------------------------------
                        Percentiles      Smallest
                   1%    -1.221968      -113.1293
                   5%    -.0717447      -21.63607
                  10%    -.0026554      -17.49833       Obs               1,767
                  25%            0      -13.54007       Sum of wgt.       1,767
                  
                  50%     .0167484                      Mean           .0404623
                                          Largest       Std. dev.      2.930668
                  75%     .1800658       6.307414
                  90%     .4503128       11.60898       Variance       8.588813
                  95%     .6568139       14.80244       Skewness      -32.75353
                  99%     1.459117       21.33253       Kurtosis       1264.848
                  
                                             e(r2_a)
                  -------------------------------------------------------------
                        Percentiles      Smallest
                   1%    -.1039925      -.1206766
                   5%    -.0559962       -.120493
                  10%    -.0136272      -.1169248       Obs               1,211
                  25%     .0991301      -.1161157       Sum of wgt.       1,211
                  
                  50%     .2846498                      Mean           .3109122
                                          Largest       Std. dev.      .2623919
                  75%     .4787887              1
                  90%     .6726665              1       Variance       .0688495
                  95%     .8111694              1       Skewness       .5467246
                  99%     .9997442              1       Kurtosis       2.708484
                  My next and last step is to display the summarized values side by side for each subperiod, with the statistics I want, using the table builder. I have done some research and read about -collect get- several times, but it only captures the last of the 3,000+ regressions as a table. What I want is to collect the summaries for each variable in the 3 subperiods so that I can display them clearly as one table at the end.

                  What command can I use here to capture the subperiod summaries for each variable? Or can I just do it without the builder, although I guess that would be more complicated, right?
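
                  (In case the builder turns out to be too cumbersome, I wonder whether a plain tabulation command would already give roughly this layout; a sketch, assuming the summary variables exist by subperiod as above:)
                  Code:
                  tabstat Constant Speed_of_adjustment Target_Payout Adj_R_squared, ///
                      by(fy) statistics(mean sd p25 p50 p75) columns(statistics)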

                  • #10
                    Steffen:
                    Stata 17 includes a totally revised version of the -table- command.
                    Take a look at it; I've found it really helpful.
                    Kind regards,
                    Carlo
                    (Stata 19.0)

                    • #11
                      Hello Carlo, hopefully my last question.
                      Over the last couple of days I did this with the -collect- commands (instead of -table-, which worked better for me) and now have this:
                      Code:
                      collect clear
                      
                      foreach i in Constant Speed_of_adjustment Target_Payout Adj_R_squared {
                       
                          collect: summarize `i' if fy == 2 , detail
                      }
                      
                      collect title Subperiod 1965-1983 (N=1697)
                      collect label values result mean "Average" sd "Standard Deviation" p25 "25th Percentile" p50 "Median" p75 "75th Percentile", modify
                      collect style column, extraspace(1)
                      collect label levels cmdset 1 "Constant" 2 "Speed of adjustment" 3 "Target Payout" 4 "Adjusted R²" , modify
                      collect style cell result[mean sd p25 p50 p75], nformat(%9.3f)
                      collect stars mean p25 p50 p75 sd 0.01 "***" 0.05 "**" 0.1 "*", attach(mean median p25 p75 sd) nformat(%9.7g) shownote
                      
                      collect layout (cmdset) (result[mean sd p25 p50 p75]) (), name(default)
                      
                      
                      Subperiod 1965-1983 (N=1697)
                      ------------------------------------------------------------------------------------------------
                                          |  Average   Standard Deviation   25th Percentile   Median   75th Percentile
                      --------------------+---------------------------------------------------------------------------
                      Constant            |  0.107**              0.303**          -0.013**    0.023           0.163**
                      Speed of adjustment |    0.394                0.311             0.157    0.348             0.596
                      Target Payout       |   0.216*               3.165*            0.080*    0.220            0.404*
                      Adjusted R²         |    0.384                0.254             0.189    0.364             0.577
                      ------------------------------------------------------------------------------------------------
                      *** p<.01, ** p<.05, * p<.1

                      My last question: I would like to have the t-values below my coefficients. How can I do that? I could not find any helpful pages that answer this question.
                      On the one hand I would like to clarify the significance again, and on the other hand I want to check whether my whole command is correct with regard to the significance levels. Thanks in advance.
                      Last edited by Steffen Scheifele; 09 Aug 2022, 09:54.

                      • #12
                        Steffen:
                        see Stata blog entry "Customizable tables in Stata 17".
                        Kind regards,
                        Carlo
                        (Stata 19.0)
