
  • #16
    As suggested by Clyde

    Originally posted by Clyde Schechter View Post
    1b. Split the data set into separate industries, and run them in parallel on separate computers. This doesn't reduce total computational effort but you get the results more quickly.
    You can use the -parallel- module to do such a task (it was just published in the Stata Journal: https://journals.sagepub.com/doi/ful...36867X19874242). In general, parallelization is designed for data, but you can skip passing the data and, in essence, run multiple Stata sessions simultaneously, each one doing something different, such as different sets of simulations. To do so, you can make use of the parallel macros; here is an example: https://github.com/gvegayon/parallel...nstance-macros

    Code:
    clear all
    set more off
    set trace off
    
    parallel setclusters 4
    
    // Generating a variable called code that goes from 1/4
    sysuse auto
    set seed 112321
    gen code = floor(runiform()*4) + 1
    tab code
    /*
           code |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |         20       27.03       27.03
              2 |         11       14.86       41.89
              3 |         21       28.38       70.27
              4 |         22       29.73      100.00
    ------------+-----------------------------------
          Total |         74      100.00
    
    */
    
    // Storing
    save mytempdata, replace
    clear
    
    // Program that stores a dataset for each parallel instance
    program myprogram
        use if code == $pll_instance using mytempdata.dta, clear
        collapse (mean) price rep78 (max) code
        save dataset_$pll_instance.dta, replace
    end
    
    // Processing the data and taking a look at the datasets
    parallel, prog(myprogram) nodata: myprogram
    ls dataset_*.dta
    /*
    -rw-rw-r-- 1 george george 907 Sep 21 09:28 dataset_1.dta
    -rw-rw-r-- 1 george george 907 Sep 21 09:28 dataset_2.dta
    -rw-rw-r-- 1 george george 907 Sep 21 09:28 dataset_3.dta
    -rw-rw-r-- 1 george george 907 Sep 21 09:28 dataset_4.dta
    */
    
    // Now appending (using parallel append)
    parallel append, do(di) e("dataset_%g.dta, 1/4")
    list
    /*
         +------------------------------------------+
         |   price     rep78   code      dta_source |
         |------------------------------------------|
      1. | 6,292.5       3.3      1   dataset_1.dta |
      2. |   4,489       3.5      2   dataset_2.dta |
      3. | 6,532.1      3.35      3   dataset_3.dta |
      4. | 6,537.5   3.52632      4   dataset_4.dta |
         +------------------------------------------+
    */
    
    // Removing files using shell
    !rm dataset_*.dta  mytempdata.dta


    Here is another example from the manual: https://rawgit.com/gvegayon/parallel.../parallel.html

    Code:
    program def myprog
        gen x = $pll_instance
        gen y = $PLL_CHILDREN

        // For the first child process
        if ($pll_instance == 1) gen z = exp(2)

        // For the second child process
        else if ($pll_instance == 2) {
            summ price
            gen z = r(mean)
        }

        // For the third and fourth child processes
        else gen z = 0
    end
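
    To run it, you would invoke the program through the -parallel- prefix. A minimal sketch of the call, assuming the four clusters were already set with -parallel setclusters 4- as above (this call is my illustration, not from the manual):

    Code:
    sysuse auto, clear
    parallel, prog(myprog): myprog
    list x y z in 1/5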
    HTH



    • #17
      Dear George, Thanks a lot, and I will give it a try.
      Ho-Chuan (River) Huang
      Stata 17.0, MP(4)



      • #18
        Resolved!! Thanks for all the helpful posts.
        Last edited by Jahan Ismat; 26 Feb 2024, 13:14.



        • #19
          Dear stata experts,

          I am trying to measure the DeFranco comparability score with Robert Picard's code (using -runby- and -rangerun-, both from SSC). It has been 3 days and Stata is still running. Can anyone please suggest what the problem may be? (My machine has 16GB of RAM, a 64-bit operating system, an x64-based processor, and Stata SE 17.)

          Any advice is highly appreciated.

          Best regards,
          Jahan
          Last edited by Jahan Ismat; 28 Feb 2024, 09:52.



          • #20
            It is impossible to say anything specific about your situation without knowing anything about your data or the code that "measure[s] DeFranco Comparability score." (I have no idea what a DeFranco Comparability score is. I suspect I am not alone. This is a multidisciplinary, international forum. It is best to avoid specialized language here: anything that would not be understood by a university graduate, in any field, anywhere in the world, other than basic statistics and introductory level Stata, should be omitted if not necessary, or briefly explained if it is needed to adequately pose the question.)

            I have little to add with regard to -rangerun-. Although I use it regularly myself, I do not know its inner workings. As I understand it, however, it is not intended to speed up calculations (although by using Mata for some aspects it does so to some extent) so much as to simplify the programming of tasks like calculations over rolling windows and similar situations.
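
            For readers who have not used it, the basic -rangerun- pattern looks like this. A minimal sketch computing an 8-quarter trailing mean of price within firm; the program and variable names here are illustrative, not from the original code:

            Code:
            capture program drop rollmean
            program rollmean
                // -rangerun- runs this once per observation, on the subset of
                // the firm's observations whose qdate lies in [qlow, qhigh]
                summarize price, meanonly
                gen roll_price = r(mean)
            end

            gen qlow = qdate - 7
            gen qhigh = qdate
            rangerun rollmean, by(firmid) interval(qdate qlow qhigh)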

            As a co-author of -runby-, I can tell you that the speed-up that -runby- provides arises primarily from eliminating -if- clauses in the code. In the absence of -runby-, tasks that iterate over values of a variable were programmed along these lines:
            Code:
            levelsof group_var, local(group_var_values)
            foreach g of local group_var_values {
                do stuff if group_var == `g'
                ...
                do more stuff if group_var == `g'
                ... etc.
            }
            That kind of code forces Stata to repeatedly check every observation in the data set to determine whether its value of group_var equals the current value of `g' or not, and it must do that on every iteration of the loop. (And the situation is worse still if instead of a single group variable we are iterating over a grouping defined by multiple variables.) The computational work for this is proportional to N1*N2*N, where N1 is the number of distinct values of the group variable, N2 is the number of commands inside the loop that include an -if- condition, and N is the number of observations in the data set.

            What -runby- does is allow you to encapsulate the commands of the loop in a program that processes just a single group at a time. No -if group_var == `g'- clauses are needed because the program is written to deal with only a subset of the data in which group_var is constant. -runby- then "chunks" the data set into subsets defined by their value of group_var, feeds one such subset at a time to Stata for processing, and accumulates the results. As a result, the expected computational burden is proportional only to N. Secondary speedups may result from the fact that -runby- does the chunking, feeding, and accumulating in Mata, and, in some cases, such as when the commands in the loop entail sorting the data, from the non-linearity of the computational burden of sorting: it is faster to sort N1 subsets of average size N/N1 once each than to sort the full data set of size N.
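
            For contrast, here is what the loop above looks like rewritten in the -runby- style. A minimal sketch with a concrete stand-in task (one regression per group, storing the slope); the name do_one_group and the variables are illustrative:

            Code:
            capture program drop do_one_group
            program do_one_group
                // Only one group's observations are in memory when this runs,
                // so no -if group_var == `g'- clauses are needed
                regress price mpg
                gen slope = _b[mpg]
            end

            runby do_one_group, by(group_var)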

            I expound on this not to bore you with the details of -runby-'s operations, but to point out that if your data set is not large, or if the code inside your loop involves no -if group_var == `g'- clauses and does not benefit from the non-linearity of sorting time as a function of size, -runby- won't speed things up much. And if your data set really is huge, and the code involves many -if group_var == `g'- clauses and does lots of sorting, then the 3-day execution time may well be a great bargain compared to perhaps 3 weeks or longer without -runby-. Some things take a long time, even when optimized.




            • #21
              Dear Clyde Schechter,

              Thank you for your insightful response.

              I apologize for the incomplete post. Here is the data example:

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input long firmid double(fyearq fyr) float(earnings returns qdate industry)
              1004 1992 5 .015158799 -.06796114 130 50
              1004 1992 5 .00828494 -.010416667 131 50
              1004 1992 5 -.03039101 .04210523 132 50
              1004 1992 5 .006658611 .09090906 133 50
              1004 1993 5 .011562283 -.018518591 134 50
              1004 1993 5 .01128682 .03773593 135 50
              1004 1993 5 .010111423 .1545454 136 50
              1004 1993 5 .009552783 -.0944882 137 50
              1004 1994 5 .008768911 -.06086954 138 50
              1004 1994 5 .009625393 -.027777815 139 50
              1004 1994 5 .01376752 .04761905 140 50
              1004 1994 5 .016011298 .10909092 141 50
              1004 1995 5 .013253618 .09016388 142 50
              1004 1995 5 .01391159 .1052632 143 50
              1004 1995 5 .013947392 .06802722 144 50
              1004 1995 5 .015917804 .12738857 145 50
              1043 1992 7 -.3217588 .25000054 129 50
              1043 1992 7 -1.661019 -.4504004 130 50
              1043 1993 7 -.6698821 -.6360989 131 50
              1043 1993 7 -.9187924 .8759985 132 50
              1043 1993 7 -.7325704 -.4669505 133 50
              1043 1993 7 -6.566287 -.936 134 50
              1043 1994 7 .0386691 -.06329118 137 50
              1043 1994 7 .036962837 .14864875 138 50
              1043 1995 7 .004020745 -.04705886 139 50
              1043 1995 7 .003612511 -.0987655 140 50
              1043 1995 7 .002708804 .013698643 141 50
              1043 1995 7 .059923 -.1216217 142 50
              1094 1992 6 .03012081 .10169485 129 51
              1094 1993 6 .01135727 -.01538448 130 51
              1094 1993 6 -.0481853 .07812508 131 51
              1094 1993 6 .032862727 -.15942037 132 51
              1094 1993 6 .026803134 -.04310341 133 51
              1094 1994 6 .014933725 -.08108105 134 51
              1094 1994 6 .02294867 .05882345 135 51
              1094 1994 6 .03721369 .12962964 136 51
              1094 1994 6 .025629094 -1.0705368e-09 137 51
              1094 1995 6 .013966362 -.09016396 138 51
              1094 1995 6 .026371697 .009009028 139 51
              1094 1995 6 .036918778 .08928572 140 51
              1094 1995 6 .02982299 -.03278691 141 51
              1108 1992 6 -.03660391 -.3773585 129 50
              1108 1993 6 -.04893167 -.06060606 130 50
              1108 1993 6 -.011041937 -.0967742 131 50
              1108 1993 6 -.04260697 -.071428575 132 50
              1108 1993 6 -.4997292 -.3076923 133 50
              1108 1994 6 .0125173 -.13866666 134 50
              1108 1994 6 .04778947 .2254902 135 50
              1108 1994 6 .006157296 -.10526316 136 50
              1108 1994 6 -.05309941 .1764706 137 50
              1108 1995 6 -.008593609 -.1 138 50
              1108 1995 6 -.000722152 -.11111111 139 50
              1108 1995 6 -.15426973 -.3125 140 50
              1108 1995 6 -.1404914 -.136 141 50
              1121 1992 12 .04756691 .037037037 129 51
              1121 1992 12 .02980769 .7857143 130 51
              1121 1992 12 -.018952426 -.18 131 51
              1121 1993 12 .018790564 -.04878049 132 51
              1121 1993 12 .016560087 .1794872 133 51
              1121 1993 12 .013898018 -.26086956 134 51
              1121 1993 12 .02157004 .05882353 135 51
              1121 1994 12 .031890783 .11111111 136 51
              1121 1994 12 .05266699 .1 137 51
              1121 1994 12 .030761175 .4772727 138 51
              1121 1994 12 .01819654 .2153846 139 51
              1121 1995 12 .015961295 -.24050634 140 51
              1121 1995 12 .01444285 .04159991 141 51
              1121 1995 12 .0007344482 -.1839477 142 51
              1121 1995 12 .00332558 .11764706 143 51
              1155 1992 2 -.26473716 -.11032015 129 50
              1155 1992 2 -.24789006 -.12400008 130 50
              1155 1992 2 -.6578999 -.28767136 131 50
              1240 1992 1 .006226964 .0375 129 54
              1240 1992 1 .012027224 .036144577 130 54
              1240 1992 1 .012580484 .06686047 131 54
              1240 1992 1 .017420597 .065395094 132 54
              1240 1993 1 .011462865 .13043478 133 54
              1240 1993 1 .010850204 -.07466064 134 54
              1240 1993 1 .009691708 -.48899755 135 54
              1240 1993 1 .019185754 .023923445 136 54
              1240 1994 1 .012562554 .07009346 137 54
              1240 1994 1 .012905888 -.05676856 138 54
              1240 1994 1 .013777371 .11111111 139 54
              1240 1994 1 .01894681 -.004166667 140 54
              1240 1995 1 .01308342 .05857741 141 54
              1240 1995 1 .013222873 -.06324111 142 54
              1240 1995 1 .014076 .12658228 143 54
              1240 1995 1 .018263206 .014981274 144 54
              1246 1992 9 .016919486 -.01320132 129 50
              1246 1992 9 .01882111 -.04013378 130 50
              1246 1993 9 .015019833 .013937282 131 50
              1246 1993 9 .017476838 .2508591 132 50
              1246 1993 9 .014462348 .071428575 133 50
              1246 1993 9 -.033825107 -.0974359 134 50
              1246 1994 9 .015415096 .2443182 135 50
              1246 1994 9 .013086487 -.04337899 136 50
              1246 1994 9 -.017225599 .09069213 137 50
              1246 1994 9 .015871108 .08752735 138 50
              1246 1995 9 .015785297 .1553785 140 50
              1246 1995 9 .007759815 .10172414 141 50
              end


              I am using the following code (from #1):

              Code:
              * pick a quarter to calculate measure, use quarters in 2 previous years
              gen q2use = quarter(dofq(qdate)) == 4
              gen qlow = cond(q2use, qdate - 15, 1)
              gen qhigh = cond(q2use, qdate - 4, 1)
              format %tq qlow qhigh

              program get_CompAcct
                  reg earnings returns
                  predict pearn, xb
                  reg earnings2 returns2
                  gen pearn2 = _b[returns2] * returns + _b[_cons]
                  count if !mi(pearn, pearn2)
                  gen CompAcct_nobs = r(N)
                  gen CompAcct = -sum(abs(pearn - pearn2)) / 16
              end

              program pair_by_quarters
                  tempfile hold
                  save "`hold'"
                  rename (firmid returns earnings) (firmid2 returns2 earnings2)
                  joinby qdate using "`hold'"
                  keep if firmid != firmid2
                  sort firmid firmid2 qdate
                  rangerun get_CompAcct, by(firmid firmid2) interval(qdate qlow qhigh)
              end

              runby pair_by_quarters, by(industry) verbose

              save "results.dta", replace

              sort industry qdate firmid firmid2



              PLEASE HELP!!

              Best regards,
              Jahan



              • #22
                Dear Clyde Schechter,

                I am including some details of how DeFranco et al. (2011) measured comparability here for your kind information.



                A firm's financial statements are a function of the economic events and of the accounting of these events:

                $$\text{Financial Statements}_i = f_i(\text{Economic Events}_i) \tag{1}$$

                For each firm-year, we first estimate the following equation using the 16 previous quarters of data:

                $$\text{Earnings}_{it} = \alpha_i + \beta_i \,\text{Return}_{it} + \varepsilon_{it} \tag{2}$$

                We then use the two estimated accounting functions for each firm with the economic events of a single firm. We calculate:

                $$E(\text{Earnings})_{iit} = \hat{\alpha}_i + \hat{\beta}_i \,\text{Return}_{it} \tag{3}$$

                $$E(\text{Earnings})_{ijt} = \hat{\alpha}_j + \hat{\beta}_j \,\text{Return}_{it} \tag{4}$$

                By using firm i's return in both predictions, we explicitly hold the economic events constant.

                $\text{CompAcct}_{ijt}$ is the negative of the average absolute difference between the predicted earnings using firm i's and firm j's functions:

                $$\text{CompAcct}_{ijt} = -\frac{1}{16} \sum_{\tau = t-15}^{t} \left| E(\text{Earnings})_{ii\tau} - E(\text{Earnings})_{ij\tau} \right| \tag{5}$$

                We estimate accounting comparability for each firm i - firm j combination among the J firms within the same two-digit SIC industry classification.
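
                For concreteness, for a single pair of firms equations (2)-(5) can be computed directly from two regressions. A minimal sketch, assuming a data set of the 16 relevant quarters with hypothetical variables earnings_i, returns_i, earnings_j, returns_j:

                Code:
                * equation (2), estimated separately for firms i and j
                reg earnings_i returns_i
                predict e_ii, xb                                 // equation (3)
                reg earnings_j returns_j
                gen e_ij = _b[_cons] + _b[returns_j]*returns_i   // equation (4)

                * equation (5): negative mean absolute difference over the 16 quarters
                gen absdiff = abs(e_ii - e_ij)
                summarize absdiff, meanonly
                display "CompAcct = " -r(mean)

                This mirrors what get_CompAcct in #21 computes on the joined data, where each pair's series sit side by side as earnings/returns and earnings2/returns2.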


                I hope this helps you to understand what I am trying to do.

                Best regards,

                Jahan



                • #23
                  I'm afraid I don't see a whole lot you can do to speed this up.

                  One thing that will help a little is, instead of having
                  Code:
                  tempfile hold
                  save "`hold'"
                  rename (firmid returns earnings) (firmid2 returns2 earnings2)
                  joinby qdate using "`hold'"
                  keep if firmid != firmid2
                  sort firmid firmid2 qdate
                  inside program pair_by_quarters, move it to just after you create the variables q2use, qlow, and qhigh. -joinby- is a bottleneck in any program that uses it, and some of the bottleneck is overhead: waiting for the operating system to give you the huge amount of additional memory required for it, and for Stata's memory management to absorb it and then disgorge it. The -save- to the tempfile is also slow, and both are better done once, on the whole data set joined to itself, than repeatedly on partial data sets. So when you do this inside the -runby- loop, a fair amount of time is wasted on those things. Also, I would eliminate the -sort firmid firmid2 qdate- command altogether: it serves no purpose that I can discern, and sorts are slow, too. (Keep the single -sort- at the end: it puts your data into a more user-friendly order for looking at and working with the results--but that's done only once, so it's not as big a deal as it would be inside -runby-.)

                  So the start of your program would look like this:
                  Code:
                  * pick a quarter to calculate measure, use quarters in 2 previous years
                  gen q2use = quarter(dofq(qdate)) == 4
                  gen qlow = cond(q2use, qdate-15, 1)
                  gen qhigh = cond(q2use, qdate-4, 1)
                  format %tq qlow qhigh qdate
                  tempfile copy
                  save `copy'
                  rename (firmid returns earnings) =2
                  joinby industry qdate using `copy' // NOTE THE INCLUSION OF industry HERE
                  keep if firmid != firmid2
                  And from there you would go on to program get_CompAcct (no changes) and program pair_by_quarters (with everything that precedes the -rangerun- command stripped out), as sketched below.
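
                  Concretely, the slimmed-down remainder would look something like this (a sketch based on the programs in #21; not tested):

                  Code:
                  program pair_by_quarters
                      rangerun get_CompAcct, by(firmid firmid2) interval(qdate qlow qhigh)
                  end

                  runby pair_by_quarters, by(industry) verbose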

                  On your example data, you get about a 15% reduction in run time by making this change.

                  Note: In the example data this change results in a slight change in the output. Specifically, industry 54 has only one firmid within it, so its observations get wiped out with the -keep if firmid != firmid2- command. Consequently -runby- never even gets to see industry 54, and the output summary reports 2 groups processed with no errors. By contrast, with the original code, industry 54's observations don't get wiped out until it is already inside -runby-. But, because all of the data for industry 54 is eliminated, -rangerun- returns it as an error because there are no observations for its regression command(s). So this time the output summary reports 3 groups processed with 1 error. The actual data set containing the original data and the newly computed variables, however, is the same either way: industry 54 does not appear.

                  Other than that, I don't see any way to speed this up. I imagine your complete data set is huge, with many quarters of data on many firms, and probably even a number of industries in the 2-digit range. So it's a lot of data to process and a lot of computing to do on it, so it's going to take time no matter how you go about it.

                  I don't know what the impact on speed will be when you make this change to the full data set. Just how much time is saved by doing -joinby- only once is not something I can intuit quantitatively. The 15% reduction in the example data is based on actually timing multiple runs both ways. But when the size of the data set and number of industries changes, I can't predict what the impact will be.

                  Sorry I can't suggest something that will be more dramatic, but I just don't see any other opportunities for speedup here.



                  • #24
                    Dear Clyde Schechter,

                    This is so kind of you!!

                    Just one last point of confusion, please. I want the estimate over the last 16 quarters. How do I modify the following code? I see this has been discussed earlier as well. I just did not understand the value in -gen q2use = quarter(dofq(qdate)) == 4- (is 4 still okay?) or the range for qhigh (are -4 and 1 okay if I want to include the current quarter too?)

                    Code:
                    * pick a quarter to calculate measure, use quarters in 2 previous years
                    gen q2use = quarter(dofq(qdate)) == 4
                    gen qlow = cond(q2use, qdate-15, 1)
                    gen qhigh = cond(q2use, qdate-4, 1)

                    Best regards,
                    Jahan



                    • #25
                      Well, the first command -gen q2use = quarter(dofq(qdate)) == 4- has nothing to do with the number of quarters in the estimate; it just specifies that you are doing the estimates only in relation to the fourth quarters of each year.

                      Now, when you say you want the "last 16 quarters," it isn't clear to me exactly what that includes. So let's think about the estimates we will do in relation to 2023q4. The "last 16 quarters" could mean 2020q1 through 2023q4, or it could be 2019q4 through 2023q3, or it could be 2019q1 through 2022q4. That is, we might include 2023q4 itself and count back 16 quarters from there, or we might exclude 2023q4 (but include the rest of 2023) and count back from there, or we might exclude all of 2023 and count back from the end of 2022. All of these are possible interpretations of "last 16 quarters" in this context. Which of them is appropriate for your purposes I cannot say; I leave that to you. The code, evidently, would be different for each. (And none of them looks like what you currently have.)

                      If you mean 2020q1 through 2023q4, I would do this as:
                      Code:
                      gen qhigh = cond(q2use, qdate, 0)
                      gen qlow = cond(q2use, qdate-15, 1)
                      If you mean 2019q4 through 2023q3, I would do this as:
                      Code:
                      gen qhigh = cond(q2use, qdate-1, 0)
                      gen qlow = cond(q2use, qdate-16, 1)
                      And if you mean 2019q1 through 2022q4, I would do it as:
                      Code:
                      gen qhigh = cond(q2use, qdate-4, 0)
                      gen qlow = cond(q2use, qdate-19, 1)
                      Note: not tested -- beware of typos.
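
                      If in doubt, you can verify which window each version produces by displaying its endpoints for a concrete quarter. A quick sketch using 2023q4:

                      Code:
                      display %tq tq(2023q4)-15 "  through  " %tq tq(2023q4)      // 2020q1 through 2023q4
                      display %tq tq(2023q4)-16 "  through  " %tq tq(2023q4)-1    // 2019q4 through 2023q3
                      display %tq tq(2023q4)-19 "  through  " %tq tq(2023q4)-4    // 2019q1 through 2022q4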



                      • #26
                        Dear Clyde Schechter,

                        I cannot thank you enough!! Highly appreciate your kind help.

                        Best regards,
                        Jahan
