
  • Calculate pairwise differences

    Hi friends,

    I have data:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int(group price)
     1  4499
     2  6229
     3  3667
     3  4172
     3  4934
     4  3984
     4  4589
     4  5079
     4  6486
     4  8129
     5  3299
     5  3748
     5  3829
     5  4389
     6  3895
     6  4187
     6  4195
     6  4453
     6  5799
     7  3798
     7  3995
     7  4099
     7  4425
     7  4647
     7  4749
     7  5719
     7  6295
     8  3799
     8  7140
     8  9735
     9  3955
     9  4082
     9  4424
     9 15906
    10  4181
    10  5899
    10 11995
    10 12990
    11  4516
    11  4697
    11  5397
    11  9690
    11 13466
    12  4060
    12  4296
    12  4733
    12  4816
    12  5104
    12  5172
    12  5189
    12  5222
    12  5379
    12  6303
    12  6850
    12 14500
    13  3291
    13  4010
    13  4482
    13  4504
    13  4723
    13  5886
    13 10371
    13 10372
    14 13594
    16  4890
    16  5705
    16  5798
    16  7827
    16  8814
    16 11385
    17  5788
    17  6342
    18 11497
    19  6165
    end

    What I want to do is to compute the sum of pairwise differences (in absolute value) in price across groups. Here I have 19 groups, each with a different number of observations. My goal is:
    1. calculate the differences (in abs. value) between observation 1 in group 1 and all other observations that are not in group 1.

    2. repeat the calculation for the remaining observations in group 1.

    3. repeat the calculation for each member in group 2, 3, ... 19. For instance, calculate the pairwise differences (in abs. value) between observation 1 in group 2 and all other observations that are not in group 2. Then calculate the pairwise differences (in abs. value) between observation 2 in group 2 and all other observations that are not in group 2...

    4. sum up all of the absolute differences.
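
    One way to write steps 1-4 in symbols, offered as a reading of the description rather than a definition taken from the post: let x_i be the price and g_i the group of observation i (this notation is introduced here for illustration). S_g is the total for the members of group g, and T is the grand total over all groups. Because every cross-group pair is visited once from each side, T counts each unordered pair twice.

    ```latex
    \[
      S_g = \sum_{i:\, g_i = g} \; \sum_{j:\, g_j \neq g} \lvert x_i - x_j \rvert ,
      \qquad
      T = \sum_{g} S_g
        = 2 \sum_{\substack{i < j \\ g_i \neq g_j}} \lvert x_i - x_j \rvert .
    \]
    ```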
    I have already written the code (below). However, my real dataset is relatively large, and my code runs too slowly. Can anybody kindly tell me how to improve the speed? Thank you so much!

    Code:
    gen sumdiff = . // the variable that I finally want to produce
    local N = _N // sample size

    forvalues p = 1/19 { // group labels run from 1 to 19
        qui gen temp = 1 if group == `p'
        sort temp // move all observations in group `p' to the top (missing sorts last)
        qui count if group == `p'
        local inend = r(N) // observations in group `p' occupy positions 1 to `inend'
        local outstart = r(N) + 1 // observations not in group `p' start from position `outstart'
        local outend = `N'
        local diff = 0
        forvalues n = 1/`inend' {
            forvalues m = `outstart'/`outend' {
                local diff = abs(price[`n'] - price[`m']) + `diff'
            }
        }
        qui replace sumdiff = `diff' if group == `p'
        drop temp
    }

  • #2
    First of all: As you seem to be interested in some kind of Gini-related calculation, you should try -search Gini-. That search reveals many hits, so the procedure of interest may already have been developed by someone else, and exist as a downloadable community-contributed package (See -help ssc-).

    Anyway, I worked from your text description, and did not try to understand your code. Your text describes something involving "all possible pairs," so the built-in command -cross- is relevant and will do all the hard work. The following gives what I would describe as "The sum of the absolute differences in price between all pairs of individuals, with distinct pairs defined without regard to order, and in which members of pairs are in different groups." The following takes time proportional to N^2, and required about 4 sec on my machine for an example with N = about 1500.

    Code:
    clear
    input int(group price)
     1  4499
     2  6229
     3  3667
     3  4172
     3  4934
     4  3984
     4  4589
     4  5079
     4  6486
     4  8129
     5  3299
     5  3748
     5  3829
     5  4389
     6  3895
     6  4187
     6  4195
     6  4453
     6  5799
     7  3798
     7  3995
     7  4099
     7  4425
     7  4647
     7  4749
     7  5719
     7  6295
     8  3799
     8  7140
     8  9735
     9  3955
     9  4082
     9  4424
     9 15906
    10  4181
    10  5899
    10 11995
    10 12990
    11  4516
    11  4697
    11  5397
    11  9690
    11 13466
    12  4060
    12  4296
    12  4733
    12  4816
    12  5104
    12  5172
    12  5189
    12  5222
    12  5379
    12  6303
    12  6850
    12 14500
    13  3291
    13  4010
    13  4482
    13  4504
    13  4723
    13  5886
    13 10371
    13 10372
    14 13594
    16  4890
    16  5705
    16  5798
    16  7827
    16  8814
    16 11385
    17  5788
    17  6342
    18 11497
    19  6165
    end
    // An individual id is necessary, I think.
    gen int id = _n
    preserve
    rename * *2  // change names to distinguish individuals
    tempfile temp
    save `temp'
    restore
    compress // Save space as we are about to make a big file
    // If you have any unnecessary variables, drop them here to save space
    cross using `temp'  // the crucial command
    // Drop self pairs and duplicates
    drop if (id >= id2)
    // Absolute difference, computed only for pairs in different groups
    gen diff = abs(price-price2) if (group != group2)
    summ diff
    di "Sum abs. diffs. = " r(sum)
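
    If N is too large for the N^2 file that -cross- creates, a sort-based calculation may be worth trying. The sketch below is not the method used above. It relies on the fact that, once prices are sorted, the sum of absolute differences between one observation and all others can be read off a running sum. It assumes the original data (variables group and price, before -cross- is run) and no missing prices. All new variable names (cum_all, d_all, cum_in, d_in, d_out, sumdiff_g) are made up for illustration.

    ```stata
    * Sketch only: run on the original data, not on the file produced by -cross-.

    * Sum of |price - price[j]| over ALL other observations j, from a running sum
    sort price
    gen double cum_all = sum(price)
    gen double d_all = price*(2*_n - _N) - 2*cum_all + cum_all[_N]

    * The same quantity restricted to observations in the same group
    bysort group (price): gen double cum_in = sum(price)
    by group: gen double d_in = price*(2*_n - _N) - 2*cum_in + cum_in[_N]

    * What remains is the sum over observations in other groups
    gen double d_out = d_all - d_in

    * Per-group total, corresponding to sumdiff in the question
    by group: egen double sumdiff_g = total(d_out)

    * d_out counts each unordered pair twice, so halve the grand total
    summarize d_out, meanonly
    display "Sum abs. diffs. = " r(sum)/2
    ```

    The halved total should correspond to the figure displayed by the -cross- approach, which counts each cross-group pair once. The running time is dominated by the two sorts, so it grows roughly as N log N rather than N^2.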









    • #3
      Originally posted by Mike Lacy

      Thank you so much, Mike! This is incredible! Much, much faster than my code.
