Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating a variable over all observation except the current one

    Hello,

    I am currently analysing a dataset containing firm data such as Total assets and the tax rate of the country the company resides in. In order to analyse the income shifting behaviour of these firms, the literature presents one variable (here called C) which has to be calculated.
    This variable C consists for each observations in its denominator of a sum over all observations. The basic format of what has to summed is: TA[j] * (STR[i] - STR[j]) with i being the current observation and j the other observation. This formula has to be summed up for each observation over all observations without j = i.

    I have tried this with a loop:

    local n = _N

    gen nomi_C = .

    forvalues i = 1/`n' {
    gen double temp_sum = 0

    forvalues j = 1/`n' {
    if `j' != `i' {
    replace temp_sum = temp_sum + TA[`j'] * (STR[`i'] - STR[`j'])
    }
    }
    replace nomi_C = temp_sum in `i'
    }


    As the dataset consists of roughly 90,000 observations, this loop took over 8 h and still was not finished.
    I am now wondering whether that is normal to take so long and whether there is a faster way to complete the described problem.

    Thank you very much in advance!

    Best,
    Katharina

  • #2
    First, I don't think you have accurately reproduced here the code you are running, because the code you show does not run at all. After the first iteration of the -forvalues i = 1/`n'- loop, it aborts at -gen double temp_sum = 0- because temp_sum was created during the first iteration and still exists, so it cannot be -gen-erated again.

    Next, I don't understand the necessity for -if `j' != `i'- guarding the -replace temp_sum = temp_sum + TA[`j'] * (STR[`i'] - STR[`j'])- command. I say that because if `j' = `i', then STR[`i'] - STR[`j'] will be 0, so that nothing is added to temp_sum anyway.

    Putting all that aside, there is no need for any explicit loops to do this calculation, and the direct calculation you are doing is actually enormously repetitious. We can apply a little algebra to the original formula as follows:

    Code:
    nomi_Ci = SUM j=1/_N {TAj* (STRi - STRj) }
                = SUMj=1/_N {TAj * STRi} - SUMj=1/_N {TAj * STRj}
                = STRi * {SUMj=1/_N TAj} - SUMj=1/_N {TAj * STRj}
    Notice now that both summations are independent of i. So each of those can be calculated once and stored, and then the calculation becomes much shorter. You did not post example data, so I created a toy data set to illustrate the approach

    Code:
    clear*
    
    //    CREATE TOY DATA SET TO DEMONSTRATE THE APPROACH
    set obs 90000
    set seed 1234
    gen TA = runiform()
    gen STR = runiform()
    
    //    THE CALCULATION
    egen TA_total = total(TA)
    egen TA_STR_total = total(TA*STR)
    gen nomi_C = STR*TA_total - TA_STR_total
    The code is reduced to three lines, and on my setup runs in 0.02 seconds with 90,000 observations.

    Added: This code does produce the same results (with occasional rounding errors in the 6th decimal place) as your nested loop approach when I run it on a smaller data set so that the nested loops run quickly.
    Last edited by Clyde Schechter; 16 Jun 2024, 17:52.

    Comment


    • #3
      Thank you for the quick response!

      That was really helpful!!

      Comment


      • #4
        See also the FAQ https://www.stata.com/support/faqs/d...ng-properties/ for a riff on identities such as

        sum of others = sum of all MINUS this value

        Comment

        Working...
        X