Financial statement comparability (rangerun after rangestat)?

River Huang

Join Date: Mar 2016
Posts: 1908

10 Oct 2019, 20:19

Dear All, In an earlier post (https://www.statalist.org/forums/for...fter-rangestat), I asked how to compute a measure of financial statement comparability. Thanks to Robert Picard, who offered helpful code for doing this (using runby and rangerun, both from SSC). However, because the dataset is large (all listed A shares in China over 1991-2018), the code ran for more than 10 hours (according to my friend) without finishing! I wonder whether Robert's code can be sped up somehow. Any suggestions are highly appreciated. The following is taken from #4 of the above link (by Robert):

Code:

clear all
set seed 3123

* demonstration dataset, 50 firms over 70 quarters in 10 industries
set obs 50
gen firmid = _n
gen industry = runiformint(1,10)
expand 70
bysort firmid: gen qdate = yq(1999,4) + _n
format %tq qdate
gen returns = runiform()
gen earnings = runiform()

* pick a quarter to calculate measure, use quarters in 2 previous years
gen q2use = quarter(dofq(qdate)) == 4
gen qlow  = cond(q2use, qdate - 11, 1)
gen qhigh = cond(q2use, qdate - 4, 0)
format %tq qlow qhigh

program get_CompAcct
    reg earnings returns
    predict pearn, xb
    reg earnings2 returns2
    gen pearn2 =  _b[returns2] * returns + _b[_cons]
    count if !mi(pearn,pearn2)
    
    gen CompAcct_nobs = r(N)
    gen CompAcct = -sum(abs(pearn-pearn2)) / CompAcct_nobs
    drop pearn pearn2
end

program pair_by_quarters
    tempfile hold
    save "`hold'"
    rename (firmid returns earnings) (firmid2 returns2 earnings2)
    joinby qdate using "`hold'"
    keep if firmid != firmid2
    sort firmid firmid2 qdate
    rangerun get_CompAcct, by(firmid firmid2) interval(qdate qlow qhigh)
end
runby pair_by_quarters, by(industry) verbose

save "results.dta", replace

sort industry qdate firmid firmid2

* to install, type: ssc install listsome
listsome industry qdate firmid firmid2 CompAcct_nobs CompAcct ///
    if q2use & !mi(CompAcct), sepby(qdate)

Ho-Chuan (River) Huang
Stata 19.0, MP(4)

Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#2

13 Oct 2019, 11:09

It's a tall order asking people to improve on code written by Robert Picard! Unsurprisingly, I have not found any ways to materially speed this up.

If you change program get_CompAcct as follows:

Code:

program get_CompAcct
    reg earnings returns
    predict pearn, xb
    reg earnings2 returns2
    gen pearn2 = _b[returns2] * returns + _b[_cons]
    gen CompAcct_nobs = sum(!mi(pearn, pearn2))
    gen CompAcct = -sum(abs(pearn-pearn2)) / CompAcct_nobs
    drop pearn pearn2
end

on my setup it saves about 0.5 seconds on the demonstration data set you show, which is about a 1% improvement. But I couldn't come up with anything better than that. You can also perhaps shave another fraction of a percent off the run time by eliminating the -verbose- option from the -runby- command. That improvement will come at the price of not having any indication of what went wrong in any industry that didn't yield results (as in the example, where one industry has no observations).

There are a few things you can consider that might get you results more quickly using the same code:

1a. Get (or borrow, or rent in the cloud) a computer with a much faster processor and more RAM.

1b. Split the data set into separate industries, and run them in parallel on separate computers. This doesn't reduce total computational effort but you get the results more quickly.

2. Be patient. In my world, a 10 hour run would not be considered exceptionally long. I routinely do things that run for more than 24 hours, and have occasionally had calculations that took longer than a week to conclude. To make it easier to be patient, consider adding the -status- option to the -runby- command. That way you will get periodic progress reports, along with an estimate of the remaining time.
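The split described in 1b can be sketched as follows. This is only a minimal illustration, not part of the code in #1: the file names industry_1.dta, industry_2.dta, and so on are hypothetical, and the toy data merely stand in for the real panel.

```stata
* toy data standing in for the real panel (assumption, for illustration only)
clear
set seed 3123
set obs 50
gen firmid = _n
gen industry = runiformint(1,10)

* write one file per industry so that each can be processed on its own machine
levelsof industry, local(inds)
foreach i of local inds {
    preserve
    keep if industry == `i'
    save "industry_`i'.dta", replace
    restore
}
```

Each file could then be loaded on a separate machine, run through the same programs, and the saved results appended afterwards.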

Finally, if this is an analysis that will be run recurrently with different data sets, it might be worth hiring somebody to program this in a compiled language, rather than doing it in Stata.

Added note to any Stata developers who might be following this thread: one part of the code that undoubtedly is a major time sink is the place in program pair_by_quarters where the current data are copied into a tempfile. If it were possible to do something analogous to -joinby- with frames (in the same sense that -frlink- is analogous to -merge-), this could likely be sped up considerably by using a frame instead of a tempfile.
River Huang

Join Date: Mar 2016

Posts: 1908
#3

13 Oct 2019, 16:38

Dear Clyde, Thanks a lot, and I will follow your suggestions to see what I can do.

PS: By the way, do you think that using -rangestat- (ssc install rangestat) to obtain the coefficients and residuals before running the procedure would save a little time? I suspect that the -reg- command calculates many unnecessary statistics.

Last edited by River Huang; 13 Oct 2019, 16:46.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#4

13 Oct 2019, 17:26

I doubt using -rangestat- will speed things up much. You can try it on a smaller data set and time it both ways to see. But -rangestat- and -reg- do the same computations for calculating regression coefficients. They have somewhat different overhead for setup time, but that would, I think, be quite small compared to the time required for the regressions themselves.
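Timing two approaches side by side can be done with Stata's built-in -timer- command. The sketch below shows only the mechanics: the auto dataset, the loop count of 500, and the covariance-based slope are placeholders chosen for illustration, not the -rangestat- comparison itself. To test the real question, the two loop bodies would be replaced by the -reg- version and the -rangestat- version of the calculation.

```stata
sysuse auto, clear
timer clear

* timer 1: slope obtained with -regress-
timer on 1
forvalues i = 1/500 {
    quietly regress price mpg
}
timer off 1
local b_reg = _b[mpg]

* timer 2: the same slope computed as cov(y,x) / var(x)
timer on 2
forvalues i = 1/500 {
    quietly correlate price mpg, covariance
    local b_cov = r(cov_12) / r(Var_2)
}
timer off 2

* report elapsed seconds for both timers, then confirm the slopes agree
timer list
display "regress slope:    " `b_reg'
display "covariance slope: " `b_cov'
```

Running both versions on the same small subset, as suggested above, keeps the comparison fair before committing to a multi-hour run.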
River Huang

Join Date: Mar 2016

Posts: 1908
#5

13 Oct 2019, 17:34

Dear Clyde, I see, and thanks again.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
River Huang

Join Date: Mar 2016

Posts: 1908
#6

18 Oct 2019, 21:46

Dear Clyde, Suppose that I only want regressions with exactly 16 (quarterly) observations, is it possible (and how) to skip the regressions with fewer observations (so that I can save time)?

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#7

18 Oct 2019, 22:05

So you could recode the program as:

Code:

program get_CompAcct
    reg earnings returns
    predict pearn, xb
    gen CompAcct_nobs = sum(!mi(pearn, pearn2))
    if CompAcct_nobs[_N] == 16 {
        drop CompAcct_nobs
        reg earnings2 returns2
        gen pearn2 = _b[returns2] * returns + _b[_cons]
        gen CompAcct = -sum(abs(pearn-pearn2)) / CompAcct_nobs
        drop pearn pearn2
    }
end

and this would skip the regressions when the number of observations in the estimation sample would be different from 16 (more, as well as less). Whether the time savings would be appreciable depends on how many firm pairs will turn out to have other than 16 observations to contribute to the regression. If there are a lot of those, you will save a lot of time. If only a few, it won't make a noticeable difference.
River Huang

Join Date: Mar 2016

Posts: 1908
#8

19 Oct 2019, 01:27

Dear Clyde, Thanks a lot, and I'll give it a try. One final question: for "each firm-year", I want to estimate the following equation using the "previous 16 quarters" of data.

Code:

reg earnings returns

Is it correct to modify the above code to

Code:

gen qlow  = cond(q2use, qdate - 19, 1)
gen qhigh = cond(q2use, qdate - 4, 0)

or

Code:

gen qlow  = cond(q2use, qdate - 20, 1)
gen qhigh = cond(q2use, qdate - 5, 0)

or something else? Thanks again.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
River Huang

Join Date: Mar 2016

Posts: 1908
#9

19 Oct 2019, 03:58

Dear Clyde, I am a little confused by the code you offered in #7. Before the `if' condition, we need to run

Code:

gen CompAcct_nobs = sum(!mi(pearn, pearn2))

However, pearn2 is only created inside the `if' block below.

Code:

if CompAcct_nobs[_N] == 16 {
    drop CompAcct_nobs
    reg earnings2 returns2
    gen pearn2 = _b[returns2] * returns + _b[_cons]
    gen CompAcct = -sum(abs(pearn-pearn2)) / CompAcct_nobs
    drop pearn pearn2
}

Am I wrong about this?

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#10

19 Oct 2019, 10:49

Re #8: It depends on what you mean by the previous 16 quarters. Your second version excludes the current quarter and counts back from a year before the immediately preceding one. Your first version includes the current quarter and counts back from one year ago. Actually if you really mean the 16 quarters preceding the current one, you would set qlow to qdate-16 and qhigh to qdate-1. If you mean 16 quarters back, including the present one, it's qdate-15 and qdate.

Re #9. Sorry, yes you are right. It should be -gen CompAcct_nobs = sum(!mi(earnings2, returns2))-, as those are the variables used in the regression.

River Huang

Join Date: Mar 2016
Posts: 1908

#11

19 Oct 2019, 18:48

Dear Clyde, Thanks again. Do you think that the following code is OK?

Code:

program get_CompAcct
    gen CompAcct_nobs = sum(!mi(earnings, returns, earnings2, returns2))
    if CompAcct_nobs[_N] == 16 {
        reg earnings returns
        predict pearn, xb
        reg earnings2 returns2
        gen pearn2 =  _b[returns2] * returns + _b[_cons]
        gen CompAcct = -sum(abs(pearn-pearn2)) / 16
        drop pearn pearn2
    }
end

Last edited by River Huang; 19 Oct 2019, 19:08.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)


Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#12

19 Oct 2019, 19:02

Yes, of course. I'm sorry. I guess I wasn't paying close enough attention. But you are absolutely right: it has to be based on observations that have all of the variables needed for both regressions.
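One way to check the program from #11 before handing it to -rangerun- is to call it once on a single toy window. The sketch below uses only built-in Stata commands; the random data are placeholders, and -quietly- is added here only to suppress the regression output.

```stata
clear all
set seed 12345

* one toy 16-quarter window for a single pair of firms (placeholder data)
set obs 16
gen returns   = runiform()
gen earnings  = runiform()
gen returns2  = runiform()
gen earnings2 = runiform()

program get_CompAcct
    * running count of observations complete for both regressions
    gen CompAcct_nobs = sum(!mi(earnings, returns, earnings2, returns2))
    if CompAcct_nobs[_N] == 16 {
        quietly reg earnings returns
        predict pearn, xb
        quietly reg earnings2 returns2
        gen pearn2 = _b[returns2] * returns + _b[_cons]
        gen CompAcct = -sum(abs(pearn-pearn2)) / 16
        drop pearn pearn2
    }
end

* complete window: both regressions run and CompAcct is created
get_CompAcct
display "nobs = " CompAcct_nobs[_N] "   CompAcct = " CompAcct[_N]

* incomplete window: only 15 usable observations, so the regressions are skipped
drop CompAcct_nobs CompAcct
replace earnings2 = . in 1
get_CompAcct
capture confirm variable CompAcct, exact
display "nobs = " CompAcct_nobs[_N] "   CompAcct created? " (_rc == 0)
```

Because CompAcct is a running sum, the value of interest is the one in the last observation of the window.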
River Huang

Join Date: Mar 2016

Posts: 1908
#13

19 Oct 2019, 19:43

Dear Clyde, Got it and many thanks. I am still confused with what you said

Code:

Your first version includes the current quarter and counts back from one year ago. Actually if you really mean the 16 quarters preceding the current one, you would set qlow to qdate-16 and qhigh to qdate-1. If you mean 16 quarters back, including the present one, it's qdate-15 and qdate.

Let me make it clearer: my purpose is to obtain a measure of FSC (financial statement comparability) for each pair of firms and for "each year", using the previous 16 quarters (excluding any quarter in the current year). Professor Robert Picard suggested the following setup to save time (avoiding unnecessary repetitions, I think) since, for each year, we only need to calculate the measure once.

Another question, though.

Robert suggested calculating the measure at the fourth quarter of each year (and not repeating the procedure for the other three quarters). This saves a lot of time!
My problem is: What is the difference between the code

Code:

gen q2use = quarter(dofq(qdate)) == 4
gen qlow  = cond(q2use, qdate - 19, 1)
gen qhigh = cond(q2use, qdate - 4, 0)

and your suggested code

Code:

gen q2use = quarter(dofq(qdate)) == 4
gen qlow  = cond(q2use, qdate - 16, 1)
gen qhigh = cond(q2use, qdate - 1, 0)

In my case above, what would be your suggestion?

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#14

19 Oct 2019, 20:00

The first block of code skips over a year and begins four quarters in the past and goes back through 19 quarters in the past. My suggested code starts one quarter in the past and goes back through 16 quarters in the past. Since you want to exclude any quarter in the present year, my suggested code would not be appropriate for you.
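The windows discussed in #8, #10 and #13 can be checked by evaluating them for one example fourth quarter. The quarter 2018q4 below is an arbitrary choice for illustration; the offsets are the ones from the thread.

```stata
* an example fourth quarter (arbitrary choice)
local q = yq(2018, 4)

* first block in #13, qdate-19 through qdate-4: shows 2014q1 to 2017q4
display %tq (`q' - 19) " to " %tq (`q' - 4)

* second alternative in #8, qdate-20 through qdate-5: shows 2013q4 to 2017q3
display %tq (`q' - 20) " to " %tq (`q' - 5)

* 16 quarters immediately preceding, qdate-16 through qdate-1: shows 2014q4 to 2018q3
display %tq (`q' - 16) " to " %tq (`q' - 1)

* number of quarters in the first window: shows 16
display (`q' - 4) - (`q' - 19) + 1
```

Only the first window covers four complete calendar years and stops before the current year, which matches the stated goal of excluding every quarter of the current year.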
River Huang

Join Date: Mar 2016

Posts: 1908
#15

19 Oct 2019, 20:28

Dear Clyde, I see, and thanks a lot.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)