Reproducing regression results: Unexpected coefficients

Immo Bock

Join Date: Jan 2022

Posts: 3
#1

22 Mar 2022, 12:50

Hello everybody!

I am currently writing my master's thesis in accounting, and part of it is replicating the analyses in a research paper by Francis et al. (2005) (download here, in case you are interested). I am not entirely sure whether this is more of a question about Stata or about statistics, but here we go:

I tried recreating their procedure 1:1 (as far as possible). Right now I use data from the same time period as the paper and the variables that I calculated appear to be pretty close to theirs (judging by the means, medians and other quantiles they list). The last step is plugging them into a regression about which they provide the following info:

"Our analyses are based on annual regressions […] for the period t = 1970-2001: […] To control for cross-sectional correlations, we assess the significance of the 32 annual regression results using the time-series standard errors of the estimated coefficients (Fama-MacBeth, 1973)." (p. 308)

So I went ahead and tried:

Code:

xtset firm_j period_t, yearly
asreg CostDebt Leverage Size ROA IntCov sigmaNIBE AQ_deciles, fmb

Which gave me the following results:

Code:

Fama-MacBeth (1973) Two-Step procedure           Number of obs     =     86368
                                                 Num. time periods =        32
                                                 F(  6,    31)     =    314.49
                                                 Prob > F          =    0.0000
                                                 avg. R-squared    =    0.0664
                                                 Adj. R-squared    =    0.0641
------------------------------------------------------------------------------
             |            Fama-MacBeth
    CostDebt | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    Leverage |  -.0565528   .0018785   -30.10   0.000    -.0603841   -.0527216
        Size |  -.0013385   .0003525    -3.80   0.001    -.0020574   -.0006195
         ROA |  -.0386261   .0043771    -8.82   0.000    -.0475532    -.029699
      IntCov |   .0000469    .000015     3.12   0.004     .0000163    .0000775
   sigmaNIBE |   .0456804    .007419     6.16   0.000     .0305494    .0608115
  AQ_deciles |   .0023378   .0001652    14.15   0.000     .0020009    .0026747
       _cons |   .1135654   .0036623    31.01   0.000      .106096    .1210347
------------------------------------------------------------------------------

Which are pretty different from the results reported in the paper (p. 309):

Code:

    CostDebt | Coefficient        t
-------------+---------------------
    Leverage |       -2.5     -9.76
        Size |      -0.01     -0.55
         ROA |      -1.65     -5.02
      IntCov |      -0.00     -5.24
   sigmaNIBE |       5.44     12.35
  AQ_deciles |       0.14     13.36
-----------------------------------

Considering that the period of observation is the same as the paper's, that all variables are calculated strictly following the paper, and that a quick check showed they are distributed similarly (from what I can tell), I am very surprised by how different some of the coefficients turned out. I also tried other regression methods/commands (plain old reg, xtreg and xtfmb) and they all give me pretty much the same results.

So I'm wondering: Am I doing something wrong on the Stata side of things? Am I using the right commands? Do the authors mean some completely different procedure? Is there anything else I could look into?

I'm using Stata 17 on Windows 10.

Every hint will be greatly appreciated.

Thanks for your attention, have a good one!

Immo
George Ford

Join Date: Aug 2014

Posts: 3051
#2

23 Mar 2022, 18:54

Did you limit to 20 obs per year for an industry and winsorize at the 1% tails? Is the distribution of the DV similar to theirs?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17612
#3

24 Mar 2022, 05:09

Immo:
On pages 311-12, the authors explain the approach they followed in creating the variables plugged into their regression.
Did you follow their very same steps?

Kind regards,
Carlo
(StataNow 18.5)

Immo Bock

Join Date: Jan 2022
Posts: 3
#4

24 Mar 2022, 16:32

George Ford: The 20-observation minimum applies to the first regression, which is used to calculate AQ, an independent variable in the regression I am asking about. They don't report any results for that first regression, but the results I get for AQ are close enough for me. I winsorized all variables at the 1% tails after calculating them, except for AQ. I did not winsorize AQ because the regression uses its decile ranks rather than the raw values (p. 308), so I figured it wouldn't make a difference. See the table below, which shows my results compared to those reported in the paper.
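A minimal Stata sketch of the winsorizing and decile-ranking steps described above. The thread does not say whether the 1% cutoffs are pooled or computed by year, nor how the deciles are grouped, so the pooled cutoffs and within-year deciles here are assumptions. `AQ` (the raw accruals-quality variable) and `AQ_dec` are hypothetical names; `fyear` is the fiscal-year variable used later in the thread.

```stata
* Sketch only: winsorize at the 1st and 99th percentiles (pooled cutoffs are an assumption)
foreach v of varlist CostDebt Leverage Size ROA IntCov sigmaNIBE {
    quietly _pctile `v', percentiles(1 99)
    local lo = r(r1)
    local hi = r(r2)
    replace `v' = `lo' if `v' < `lo'
    replace `v' = `hi' if `v' > `hi' & !missing(`v')
}

* Decile ranks of the raw AQ variable within each fiscal year (1 = lowest AQ)
* AQ and AQ_dec are hypothetical variable names
gen byte AQ_dec = .
levelsof fyear, local(years)
foreach y of local years {
    xtile tmp_dec = AQ if fyear == `y', nquantiles(10)
    replace AQ_dec = tmp_dec if fyear == `y'
    drop tmp_dec
}
```

Because decile ranks depend only on the ordering of AQ, trimming its extreme values before ranking should leave the within-year ranks essentially unchanged, which is consistent with the reasoning in the post.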

Carlo Lazzaro: I followed all the steps carefully, and whenever something wasn't 100% clear, I followed the approaches of related papers. See the table below to compare my results to those in the paper.

A little bit of background: The sample consists of US firms that have all the data needed to calculate the ominous AQ variable (accruals quality). It covers 1970-2001 and has ca. 96000 observations (compared to ca. 91000 reported for the same period in the paper). Following the paper, I drop all observations that don't have all variables required for the regression. After that I'm still left with 86000 observations (compared to 76000 reported in the paper). The table below compares the values I calculated to those reported by Francis et al. (2005), p. 307.

(For each variable, the first line gives the value reported in the paper and the second line gives my value.)

Variable                     Mean        10%        25%     Median        75%        90%
----------------------------------------------------------------------------------------
Paper                      0.0442     0.0107     0.0179     0.0313     0.0558     0.0943
                         .0498852   .0110657   .0190715   .0338772   .0619193   .1098362
Market value equity        1206.6        4.7       14.3       64.2      374.8     1702.1
                         1214.252   4.033469    13.0305   59.43625     356.82   1663.572
Assets                     1283.5        8.5       25.6      102.0      511.3     2333.6
                          1320.24      6.079     22.155      97.45    529.923   2436.755
Sales                      1240.1        8.9       30.7      127.6      575.2     2297.8
                         1236.908      6.054     26.674    124.499    583.826   2284.962
ROA                         0.003     -0.101      0.005      0.042      0.076      0.114
                         .0025771  -.1221669   .0024974    .045052   .0847903    .134294
Market to book ratio         2.02       0.44       0.77       1.32       2.29       4.07
                         2.128535   .4407023   .7760555   1.336096   2.358652    4.31252
Cost Debt                   0.099      0.059      0.074      0.092      0.114      0.144
                         .1075387   .0583497   .0745965   .0933688   .1178158   .1550633
Leverage                    0.276      0.010      0.109      0.248      0.381      0.520
                         .2791926   .0082177   .1087576   .2528269   .3844805   .5292025
sigmaNibe                   0.065      0.011      0.020      0.038      0.077      0.151
                         .0804085   .0106844   .0204683    .040088   .0860647   .1884758
Earnings-price ratio        0.089      0.026      0.047      0.073      0.114      0.166
                         .1260663   .0258647   .0487951   .0783496   .1259355   .1882353
IndEP                       0.008     -0.045     -0.022      0.001      0.027      0.062
                          .010871  -.0480226  -.0223617   .0015346   .0299525   .0738549
----------------------------------------------------------------------------------------

(the variables that are relevant for the regression are bold)

So I think that most of the variables are at least somewhat close to those used by the authors. But maybe I'm underestimating the effects of the differences ...
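The sample filter and the descriptive statistics compared in the table above could be produced along these lines. This is only a sketch: it assumes the regression variables carry the names used in the first post and that a fiscal-year variable `fyear` exists, and `n_missing` is a hypothetical helper variable.

```stata
* Sketch only: keep observations with all regression variables present
egen byte n_missing = rowmiss(CostDebt Leverage Size ROA IntCov sigmaNIBE AQ_deciles)
drop if n_missing > 0
drop n_missing

* Same statistics as in the paper's table: mean, 10%, 25%, median, 75%, 90%
tabstat CostDebt Leverage ROA sigmaNIBE, statistics(mean p10 p25 p50 p75 p90) columns(statistics) format(%9.4f)

* Observations per year, to see in which years the sample is larger than expected
tabulate fyear
```

Comparing the yearly counts against the roughly 76000 firm-years reported in the paper may show whether the extra observations are spread evenly or concentrated in particular years.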

I found some papers that also repeated the analysis and I'll thoroughly check them for other evidence of what I'm doing wrong tomorrow.

Once again, thank you for looking into this!

Best regards,
Immo

Last edited by Immo Bock; 24 Mar 2022, 16:35.


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17612
#5

25 Mar 2022, 02:05

Immo:
my gut-feeling is that the difference in observations (paper vs. your dataset) can explain most of the small differences in results.
In addition, papers might be unclear/incomplete about some methodological issues due to (say) word-count constraints.
I would not be concerned about that and simply report on this issue in your dissertation/research report/article/working-paper/whatever else.

Kind regards,
Carlo
(StataNow 18.5)
George Ford

Join Date: Aug 2014

Posts: 3051
#6

25 Mar 2022, 09:00

It's strange that the means are close (so not a scale problem) but the coefficients are very different. The correlations, obviously, are not the same, which makes me think of a coding error of some sort in creating the data. Maybe something is out of sync across the variables. Are you merging data or does all of it come from the same place?

Are the coefficients from the first stage close?

Might want to run the regression by year and compute the average of the coefficients. Perhaps asreg is doing something you don't want it to.
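A minimal sketch of that check in Stata, assuming the variable names from the first post and a fiscal-year variable `fyear`. It runs one cross-sectional regression per year with `statsby`, then reports the time-series mean of each coefficient and the Fama-MacBeth t-statistic, mean / (sd / sqrt(T)), where T is the number of yearly regressions.

```stata
* Sketch only: Fama-MacBeth by hand, to cross-check asreg's fmb output
preserve

* Step 1: one OLS regression per fiscal year; statsby keeps the coefficients as _b_* variables
statsby _b, by(fyear) clear: regress CostDebt Leverage Size ROA IntCov sigmaNIBE AQ_deciles

* Step 2: average the yearly coefficients and compute the time-series t-statistic
foreach v of varlist _b_* {
    quietly summarize `v'
    local tstat = r(mean) / (r(sd) / sqrt(r(N)))
    display "`v'" _col(20) %10.6f r(mean) _col(34) %8.2f `tstat'
}

restore
```

If these averages match the asreg output, the command itself is unlikely to be the source of the discrepancy, and attention can shift to how the variables were constructed.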

Size variable is in logs in paper. s(NIBE) is scaled by average assets, which may not have been done for the table of means (and required 5 obs to construct).

So, AQ is a prediction from another regression? That might be the source. Try leaving it out, using its raw form, or some other simple procedure to see if the scale of the coef becomes closer. Also, after you figure this out, you need to bootstrap second stage since you have a prediction as a regressor.

I'd contact the authors of the paper and ask for the data (or a correlation matrix, code, ... whatever you can get). You need to get this figured out. You won't be comfortable until you do and it will make the paper difficult to publish.
Immo Bock

Join Date: Jan 2022

Posts: 3
#7

25 Mar 2022, 11:54

The data I'm using in this regression all comes from the same source (Compustat). I merge this data with stock market data from CRSP (CAPM betas based on monthly return data) for use in another regression. However, I just removed the whole merging part from my .do file and the results are still the same.

I also played around with different combinations of independent variables and the coefficients still did not go anywhere near those in the paper. Leaving out or changing AQ also didn't have a noticeable effect. Then I tried doing yearly regressions like this (which was also my original approach):

Code:

forvalues i = 1970/2001 {
    capture noisily reg CostDebt Leverage Size ROA IntCov sigmaNIBE AQ_old_deciles if fyear == `i'
}

A quick look at the coefficients showed that they were in the same range as before. This, and the fact that xtfmb (which is apparently a specialized command for Fama-MacBeth regressions) gives exactly the same results as asreg with the fmb option, makes me think that the problem is probably not the regression.

As for the variables:

The size variable was calculated like this:

Code:

gen Size = log(at)

(at are Assets - Total)

s(NIBE) was calculated like this:

Code:

gen NIBE_scaled = NIBE / ((at + L.at) / 2)
rangestat (sd) sigmaNIBE = NIBE_scaled, interval(period_t -10 0) by(firm_j)
rangestat (count) sigmaNIBE_count = NIBE_scaled, interval(period_t -10 0) by(firm_j)
replace sigmaNIBE = . if sigmaNIBE_count < 5
drop NIBE_scaled sigmaNIBE_count

I played around with a few versions of s(NIBE), but the results I got from this one were the closest to those in the paper.

I already contacted the authors a while ago and they responded that they do not have any of the data or code anymore since they wrote the paper about 20 years ago.

I am running out of ideas, but I'm sure I'll find the problem at some point.

Once again, thank you for your time!

Last edited by Immo Bock; 25 Mar 2022, 12:17.
George Ford

Join Date: Aug 2014

Posts: 3051
#8

25 Mar 2022, 12:45

You cannot assume their results are legitimate and you have failed to reproduce them. That is important.