
  • #16
    As suggested by Clyde

    Originally posted by Clyde Schechter View Post
    1b. Split the data set into separate industries, and run them in parallel on separate computers. This doesn't reduce total computational effort but you get the results more quickly.
    You can use the -parallel- module to do such a task (it was just published in the Stata Journal: https://journals.sagepub.com/doi/ful...36867X19874242). In general, parallelization is designed for data, but you can skip passing the data and, in essence, run multiple Stata sessions simultaneously, each one doing something different, such as different sets of simulations. To do so, you can make use of the parallel macros; here is an example: https://github.com/gvegayon/parallel...nstance-macros

    Code:
    clear all
    set more off
    set trace off
    
    parallel setclusters 4
    
    // Generating a variable called code that goes from 1/4
    sysuse auto
    set seed 112321
    gen code = floor(runiform()*4) + 1
    tab code
    /*
           code |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |         20       27.03       27.03
              2 |         11       14.86       41.89
              3 |         21       28.38       70.27
              4 |         22       29.73      100.00
    ------------+-----------------------------------
          Total |         74      100.00
    
    */
    
    // Storing
    save mytempdata, replace
    clear
    
    // Program that stores a dataset for each parallel instance
    program myprogram
        use if code == $pll_instance using mytempdata.dta, clear
        collapse (mean) price rep78 (max) code
        save dataset_$pll_instance.dta, replace
    end
    
    // Processing the data and taking a look at the datasets
    parallel, prog(myprogram) nodata: myprogram
    ls dataset_*.dta
    /*
    -rw-rw-r-- 1 george george 907 Sep 21 09:28 dataset_1.dta
    -rw-rw-r-- 1 george george 907 Sep 21 09:28 dataset_2.dta
    -rw-rw-r-- 1 george george 907 Sep 21 09:28 dataset_3.dta
    -rw-rw-r-- 1 george george 907 Sep 21 09:28 dataset_4.dta
    */
    
    // Now appending (using parallel append)
    parallel append, do(di) e("dataset_%g.dta, 1/4")
    list
    /*
         +------------------------------------------+
         |   price     rep78   code      dta_source |
         |------------------------------------------|
      1. | 6,292.5       3.3      1   dataset_1.dta |
      2. |   4,489       3.5      2   dataset_2.dta |
      3. | 6,532.1      3.35      3   dataset_3.dta |
      4. | 6,537.5   3.52632      4   dataset_4.dta |
         +------------------------------------------+
    */
    
    // Removing files using shell
    !rm dataset_*.dta  mytempdata.dta


    Here is another example from the manual: https://rawgit.com/gvegayon/parallel.../parallel.html

    Code:
    program def myprog
        gen x = $pll_instance
        gen y = $PLL_CHILDREN

        // For the first child process
        if ($pll_instance == 1) gen z = exp(2)

        // For the second child process
        else if ($pll_instance == 2) {
            summ price
            gen z = r(mean)
        }

        // For the third and fourth child processes
        else gen z = 0
    end
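
    To run it, you would invoke the program through the -parallel- prefix. A minimal sketch of the call, assuming the four clusters were already set with -parallel setclusters 4- as above (this call is my illustration, not from the manual):

    Code:
    sysuse auto, clear
    parallel, prog(myprog): myprog
    list x y z in 1/5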
    HTH



    • #17
      Dear George, Thanks a lot, and I will give it a try.
      Ho-Chuan (River) Huang
      Stata 17.0, MP(4)



      • #18
        Resolved!! Thanks for all the helpful posts.
        Last edited by Jahan Ismat; 26 Feb 2024, 13:14.



        • #19
          Dear stata experts,

          I am trying to measure the DeFranco comparability score with Robert Picard's code (using -runby- and -rangerun-, both from SSC). It has been 3 days and Stata is still running. Can anyone please suggest what the problem may be? (My machine has 16GB of RAM, a 64-bit operating system, an x64-based processor, and Stata SE 17.)

          Any advice is highly appreciated.

          Best regards,
          Jahan
          Last edited by Jahan Ismat; 28 Feb 2024, 09:52.



          • #20
            It is impossible to say anything specific about your situation without knowing anything about your data or the code that "measure[s] DeFranco Comparability score." (I have no idea what a DeFranco Comparability score is. I suspect I am not alone. This is a multidisciplinary, international forum. It is best to avoid specialized language here: anything that would not be understood by a university graduate, in any field, anywhere in the world, other than basic statistics and introductory level Stata, should be omitted if not necessary, or briefly explained if it is needed to adequately pose the question.)

            I have little to add with regard to -rangerun-. Although I use it regularly myself, I do not know its inner workings. As I understand it, however, it is not intended to speed up calculations (although by using Mata for some aspects it does so to some extent) so much as to simplify the programming of tasks like calculations over rolling windows and similar situations.
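
            For readers who have not used it, the basic -rangerun- pattern looks like this. A minimal sketch computing an 8-quarter trailing mean of price within firm; the program and variable names here are illustrative, not from the original code:

            Code:
            capture program drop rollmean
            program rollmean
                // -rangerun- runs this once per observation, on the subset of
                // the firm's observations whose qdate lies in [qlow, qhigh]
                summarize price, meanonly
                gen roll_price = r(mean)
            end

            gen qlow = qdate - 7
            gen qhigh = qdate
            rangerun rollmean, by(firmid) interval(qdate qlow qhigh)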

            As a co-author of -runby-, I can tell you that the speed-up that -runby- provides arises primarily from eliminating -if- clauses in the code. In the absence of -runby-, tasks that iterate over values of a variable were programmed along these lines:
            Code:
            levelsof group_var, local(group_var_values)
            foreach g of local group_var_values {
                do stuff if group_var == `g'
                ...
                do more stuff if group_var == `g'
                ... etc.
            }
            That kind of code forces Stata to repeatedly check every observation in the data set to determine whether its value of group_var equals the current value of `g' or not, and it must do that on every iteration of the loop. (And the situation is worse still if instead of a single group variable we are iterating over a grouping defined by multiple variables.) The computational work for this is proportional to N1*N2*N, where N1 is the number of distinct values of the group variable, N2 is the number of commands inside the loop that include an -if- condition, and N is the number of observations in the data set.

            What -runby- does is allow you to encapsulate the commands of the loop in a program that processes just a single group at a time. No -if group_var == `g'- clauses are needed because the program is written to deal with only a subset of the data in which group_var is constant. -runby- then "chunks" the data set into subsets defined by their value of group_var, feeds one such subset at a time to Stata for processing, and accumulates the results. As a result, the expected computational burden is proportional only to N. Secondary speedups may result from the fact that -runby- does the chunking, feeding, and accumulating in Mata, and, in some cases, such as when the commands in the loop entail sorting the data, from the non-linearity of the computational burden of sorting: it is faster to sort N1 subsets of average size N/N1 once each than to sort the full data set of size N.
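
            For contrast, here is what the loop above looks like rewritten in the -runby- style. A minimal sketch with a concrete stand-in task (one regression per group, storing the slope); the name do_one_group and the variables are illustrative:

            Code:
            capture program drop do_one_group
            program do_one_group
                // Only one group's observations are in memory when this runs,
                // so no -if group_var == `g'- clauses are needed
                regress price mpg
                gen slope = _b[mpg]
            end

            runby do_one_group, by(group_var)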

            I expound on this not to bore you with the details of -runby-'s operations, but to point out that if your data set is not large, or if the code inside your loop involves no -if group_var == `g'- clauses and does not benefit from the non-linearity of sorting time as a function of size, -runby- won't speed things up much. And if your data set really is huge, and the code involves many -if group_var == `g'- clauses and does lots of sorting, then the 3-day execution time may well be a great bargain compared to perhaps 3 weeks or longer without -runby-. Some things take a long time, even when optimized.




            • #21
              Dear Clyde Schechter,

              Thank you for your insightful response.

              I apologize for the incomplete post. Here is the data example:

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input long firmid double(fyearq fyr) float(earnings returns qdate industry)
              1004 1992 5 .015158799 -.06796114 130 50
              1004 1992 5 .00828494 -.010416667 131 50
              1004 1992 5 -.03039101 .04210523 132 50
              1004 1992 5 .006658611 .09090906 133 50
              1004 1993 5 .011562283 -.018518591 134 50
              1004 1993 5 .01128682 .03773593 135 50
              1004 1993 5 .010111423 .1545454 136 50
              1004 1993 5 .009552783 -.0944882 137 50
              1004 1994 5 .008768911 -.06086954 138 50
              1004 1994 5 .009625393 -.027777815 139 50
              1004 1994 5 .01376752 .04761905 140 50
              1004 1994 5 .016011298 .10909092 141 50
              1004 1995 5 .013253618 .09016388 142 50
              1004 1995 5 .01391159 .1052632 143 50
              1004 1995 5 .013947392 .06802722 144 50
              1004 1995 5 .015917804 .12738857 145 50
              1043 1992 7 -.3217588 .25000054 129 50
              1043 1992 7 -1.661019 -.4504004 130 50
              1043 1993 7 -.6698821 -.6360989 131 50
              1043 1993 7 -.9187924 .8759985 132 50
              1043 1993 7 -.7325704 -.4669505 133 50
              1043 1993 7 -6.566287 -.936 134 50
              1043 1994 7 .0386691 -.06329118 137 50
              1043 1994 7 .036962837 .14864875 138 50
              1043 1995 7 .004020745 -.04705886 139 50
              1043 1995 7 .003612511 -.0987655 140 50
              1043 1995 7 .002708804 .013698643 141 50
              1043 1995 7 .059923 -.1216217 142 50
              1094 1992 6 .03012081 .10169485 129 51
              1094 1993 6 .01135727 -.01538448 130 51
              1094 1993 6 -.0481853 .07812508 131 51
              1094 1993 6 .032862727 -.15942037 132 51
              1094 1993 6 .026803134 -.04310341 133 51
              1094 1994 6 .014933725 -.08108105 134 51
              1094 1994 6 .02294867 .05882345 135 51
              1094 1994 6 .03721369 .12962964 136 51
              1094 1994 6 .025629094 -1.0705368e-09 137 51
              1094 1995 6 .013966362 -.09016396 138 51
              1094 1995 6 .026371697 .009009028 139 51
              1094 1995 6 .036918778 .08928572 140 51
              1094 1995 6 .02982299 -.03278691 141 51
              1108 1992 6 -.03660391 -.3773585 129 50
              1108 1993 6 -.04893167 -.06060606 130 50
              1108 1993 6 -.011041937 -.0967742 131 50
              1108 1993 6 -.04260697 -.071428575 132 50
              1108 1993 6 -.4997292 -.3076923 133 50
              1108 1994 6 .0125173 -.13866666 134 50
              1108 1994 6 .04778947 .2254902 135 50
              1108 1994 6 .006157296 -.10526316 136 50
              1108 1994 6 -.05309941 .1764706 137 50
              1108 1995 6 -.008593609 -.1 138 50
              1108 1995 6 -.000722152 -.11111111 139 50
              1108 1995 6 -.15426973 -.3125 140 50
              1108 1995 6 -.1404914 -.136 141 50
              1121 1992 12 .04756691 .037037037 129 51
              1121 1992 12 .02980769 .7857143 130 51
              1121 1992 12 -.018952426 -.18 131 51
              1121 1993 12 .018790564 -.04878049 132 51
              1121 1993 12 .016560087 .1794872 133 51
              1121 1993 12 .013898018 -.26086956 134 51
              1121 1993 12 .02157004 .05882353 135 51
              1121 1994 12 .031890783 .11111111 136 51
              1121 1994 12 .05266699 .1 137 51
              1121 1994 12 .030761175 .4772727 138 51
              1121 1994 12 .01819654 .2153846 139 51
              1121 1995 12 .015961295 -.24050634 140 51
              1121 1995 12 .01444285 .04159991 141 51
              1121 1995 12 .0007344482 -.1839477 142 51
              1121 1995 12 .00332558 .11764706 143 51
              1155 1992 2 -.26473716 -.11032015 129 50
              1155 1992 2 -.24789006 -.12400008 130 50
              1155 1992 2 -.6578999 -.28767136 131 50
              1240 1992 1 .006226964 .0375 129 54
              1240 1992 1 .012027224 .036144577 130 54
              1240 1992 1 .012580484 .06686047 131 54
              1240 1992 1 .017420597 .065395094 132 54
              1240 1993 1 .011462865 .13043478 133 54
              1240 1993 1 .010850204 -.07466064 134 54
              1240 1993 1 .009691708 -.48899755 135 54
              1240 1993 1 .019185754 .023923445 136 54
              1240 1994 1 .012562554 .07009346 137 54
              1240 1994 1 .012905888 -.05676856 138 54
              1240 1994 1 .013777371 .11111111 139 54
              1240 1994 1 .01894681 -.004166667 140 54
              1240 1995 1 .01308342 .05857741 141 54
              1240 1995 1 .013222873 -.06324111 142 54
              1240 1995 1 .014076 .12658228 143 54
              1240 1995 1 .018263206 .014981274 144 54
              1246 1992 9 .016919486 -.01320132 129 50
              1246 1992 9 .01882111 -.04013378 130 50
              1246 1993 9 .015019833 .013937282 131 50
              1246 1993 9 .017476838 .2508591 132 50
              1246 1993 9 .014462348 .071428575 133 50
              1246 1993 9 -.033825107 -.0974359 134 50
              1246 1994 9 .015415096 .2443182 135 50
              1246 1994 9 .013086487 -.04337899 136 50
              1246 1994 9 -.017225599 .09069213 137 50
              1246 1994 9 .015871108 .08752735 138 50
              1246 1995 9 .015785297 .1553785 140 50
              1246 1995 9 .007759815 .10172414 141 50
              end


              I am using the following code (from #1):

              Code:
              * pick a quarter to calculate measure, use quarters in 2 previous years
              gen q2use = quarter(dofq(qdate)) == 4
              gen qlow = cond(q2use, qdate - 15, 1)
              gen qhigh = cond(q2use, qdate - 4, 1)
              format %tq qlow qhigh

              program get_CompAcct
                  reg earnings returns
                  predict pearn, xb
                  reg earnings2 returns2
                  gen pearn2 = _b[returns2] * returns + _b[_cons]
                  count if !mi(pearn, pearn2)
                  gen CompAcct_nobs = r(N)
                  gen CompAcct = -sum(abs(pearn - pearn2)) / 16
              end

              program pair_by_quarters
                  tempfile hold
                  save "`hold'"
                  rename (firmid returns earnings) (firmid2 returns2 earnings2)
                  joinby qdate using "`hold'"
                  keep if firmid != firmid2
                  sort firmid firmid2 qdate
                  rangerun get_CompAcct, by(firmid firmid2) interval(qdate qlow qhigh)
              end

              runby pair_by_quarters, by(industry) verbose

              save "results.dta", replace

              sort industry qdate firmid firmid2



              PLEASE HELP!!

              Best regards,
              Jahan



              • #22
                Dear Clyde Schechter,

                I am including some details of how DeFranco et al. (2011) measured comparability here for your kind information.



                A firm's financial statements are a function of the economic events and of the accounting of these events:

                $$\text{Financial Statements}_i = f_i(\text{Economic Events}_i) \tag{1}$$

                For each firm-year, we first estimate the following equation using the 16 previous quarters of data:

                $$\text{Earnings}_{it} = \alpha_i + \beta_i \,\text{Return}_{it} + \varepsilon_{it} \tag{2}$$

                We then use the two estimated accounting functions for each firm with the economic events of a single firm. We calculate:

                $$E(\text{Earnings})_{iit} = \hat{\alpha}_i + \hat{\beta}_i \,\text{Return}_{it} \tag{3}$$

                $$E(\text{Earnings})_{ijt} = \hat{\alpha}_j + \hat{\beta}_j \,\text{Return}_{it} \tag{4}$$

                By using firm i's return in both predictions, we explicitly hold the economic events constant.

                $\text{CompAcct}_{ijt}$ is the negative of the average absolute difference between the predicted earnings using firm i's and firm j's functions:

                $$\text{CompAcct}_{ijt} = -\frac{1}{16} \sum_{\tau = t-15}^{t} \left| E(\text{Earnings})_{ii\tau} - E(\text{Earnings})_{ij\tau} \right| \tag{5}$$

                We estimate accounting comparability for each firm i - firm j combination among the J firms within the same two-digit SIC industry classification.
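
                For concreteness, for a single pair of firms equations (2)-(5) can be computed directly from two regressions. A minimal sketch, assuming a data set of the 16 relevant quarters with hypothetical variables earnings_i, returns_i, earnings_j, returns_j:

                Code:
                * equation (2), estimated separately for firms i and j
                reg earnings_i returns_i
                predict e_ii, xb                                 // equation (3)
                reg earnings_j returns_j
                gen e_ij = _b[_cons] + _b[returns_j]*returns_i   // equation (4)

                * equation (5): negative mean absolute difference over the 16 quarters
                gen absdiff = abs(e_ii - e_ij)
                summarize absdiff, meanonly
                display "CompAcct = " -r(mean)

                This mirrors what get_CompAcct in #21 computes on the joined data, where each pair's series sit side by side as earnings/returns and earnings2/returns2.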


                I hope this helps you to understand what I am trying to do.

                Best regards,

                Jahan



                • #23
                  I'm afraid I don't see a whole lot you can do to speed this up.

                  One thing that will help a little is, instead of having
                  Code:
                  tempfile hold
                  save "`hold'"
                  rename (firmid returns earnings) (firmid2 returns2 earnings2)
                  joinby qdate using "`hold'"
                  keep if firmid != firmid2
                  sort firmid firmid2 qdate
                  inside program pair_by_quarters, move it to just after you create the variables q2use, qlow, and qhigh. -joinby- is a bottleneck in any program that uses it, and some of the bottleneck is overhead: waiting for the operating system to give you the huge amount of additional memory required for it, and for Stata's memory management to absorb it and then disgorge it. The -save- to the tempfile is also slow, and both are better done once, on the whole data set joined to itself, than repeatedly on partial data sets. So when you do this inside the -runby- loop, a fair amount of time is wasted on those things. Also, I would eliminate the -sort firmid firmid2 qdate- command altogether: it serves no purpose that I can discern, and sorts are slow, too. (Keep the single -sort- at the end: it puts your data into a more user-friendly order for looking at and working with the results--but that's done only once, so it's not as big a deal as it would be inside -runby-.)

                  So the start of your program would look like this:
                  Code:
                  * pick a quarter to calculate measure, use quarters in 2 previous years
                  gen q2use = quarter(dofq(qdate)) == 4
                  gen qlow = cond(q2use, qdate-15, 1)
                  gen qhigh = cond(q2use, qdate-4, 1)
                  format %tq qlow qhigh qdate
                  tempfile copy
                  save `copy'
                  rename (firmid returns earnings) =2
                  joinby industry qdate using `copy' // NOTE THE INCLUSION OF industry HERE
                  keep if firmid != firmid2
                  And from there you would go on to program get_CompAcct (no changes) and program pair_by_quarters (with everything that precedes the -rangerun- command stripped out), as sketched below.
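
                  Concretely, the slimmed-down remainder would look something like this (a sketch based on the programs in #21; not tested):

                  Code:
                  program pair_by_quarters
                      rangerun get_CompAcct, by(firmid firmid2) interval(qdate qlow qhigh)
                  end

                  runby pair_by_quarters, by(industry) verbose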

                  On your example data, you get about a 15% reduction in run time by making this change.

                  Note: In the example data this change results in a slight change in the output. Specifically, industry 54 has only one firmid within it, so its observations get wiped out with the -keep if firmid != firmid2- command. Consequently -runby- never even gets to see industry 54, and the output summary reports 2 groups processed with no errors. By contrast, with the original code, industry 54's observations don't get wiped out until it is already inside -runby-. But, because all of the data for industry 54 is eliminated, -rangerun- returns it as an error because there are no observations for its regression command(s). So this time the output summary reports 3 groups processed with 1 error. The actual data set containing the original data and the newly computed variables, however, is the same either way: industry 54 does not appear.

                  Other than that, I don't see any way to speed this up. I imagine your complete data set is huge, with many quarters of data on many firms, and probably even a number of industries in the 2-digit range. So it's a lot of data to process and a lot of computing to do on it, so it's going to take time no matter how you go about it.

                  I don't know what the impact on speed will be when you make this change to the full data set. Just how much time is saved by doing -joinby- only once is not something I can intuit quantitatively. The 15% reduction in the example data is based on actually timing multiple runs both ways. But when the size of the data set and number of industries changes, I can't predict what the impact will be.

                  Sorry I can't suggest something that will be more dramatic, but I just don't see any other opportunities for speedup here.



                  • #24
                    Dear Clyde Schechter,

                    This is so kind of you!!

                    Just one last point of confusion, please. I want the estimate over the last 16 quarters. How do I modify the following code? I see this has been discussed earlier as well. I just did not understand the value in -gen q2use = quarter(dofq(qdate)) == 4- (is 4 still okay?) or the range for qhigh (are -4 and 1 okay if I want to include the current quarter too?)

                    Code:
                    * pick a quarter to calculate measure, use quarters in 2 previous years
                    gen q2use = quarter(dofq(qdate)) == 4
                    gen qlow = cond(q2use, qdate-15, 1)
                    gen qhigh = cond(q2use, qdate-4, 1)

                    Best regards,
                    Jahan



                    • #25
                      Well, the first command -gen q2use = quarter(dofq(qdate)) == 4- has nothing to do with the number of quarters in the estimate; it just specifies that you are doing the estimates only in relation to the fourth quarters of each year.

                      Now, when you say you want the "last 16 quarters," it isn't clear to me exactly what that includes. So let's think about the estimates we will do in relation to 2023q4. The "last 16 quarters" could mean 2020q1 through 2023q4, or it could be 2019q4 through 2023q3, or it could be 2019q1 through 2022q4. That is, we might include 2023q4 itself and count back 16 quarters from there, or we might exclude 2023q4 (but include the rest of 2023) and count back from there, or we might exclude all of 2023 and count back from the end of 2022. All of these are possible interpretations of "last 16 quarters" in this context. Which of them is appropriate for your purposes I cannot say; I leave that to you. The code, evidently, would be different for each. (And none of them looks like what you currently have.)

                      If you mean 2020q1 through 2023q4, I would do this as:
                      Code:
                      gen qhigh = cond(q2use, qdate, 0)
                      gen qlow = cond(q2use, qdate-15, 1)
                      If you mean 2019q4 through 2023q3, I would do this as:
                      Code:
                      gen qhigh = cond(q2use, qdate-1, 0)
                      gen qlow = cond(q2use, qdate-16, 1)
                      And if you mean 2019q1 through 2022q4, I would do it as:
                      Code:
                      gen qhigh = cond(q2use, qdate-4, 0)
                      gen qlow = cond(q2use, qdate-19, 1)
                      Note: not tested -- beware of typos.
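
                      If in doubt, you can verify which window each version produces by displaying its endpoints for a concrete quarter. A quick sketch using 2023q4:

                      Code:
                      display %tq tq(2023q4)-15 "  through  " %tq tq(2023q4)      // 2020q1 through 2023q4
                      display %tq tq(2023q4)-16 "  through  " %tq tq(2023q4)-1    // 2019q4 through 2023q3
                      display %tq tq(2023q4)-19 "  through  " %tq tq(2023q4)-4    // 2019q1 through 2022q4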



                      • #26
                        Dear Clyde Schechter,

                        I cannot thank you enough!! Highly appreciate your kind help.

                        Best regards,
                        Jahan
