
  • calculate the sum of absolute paired differences

    Dear All, I found this question somewhere, and wish to see your suggestion on how to calculate the value of interest. The data set is
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str12 region int year float(nlmy ck zhz dl)
    "A" 2003 2.91  2.46 102.21  3.78
    "A" 2004 2.98  2.16 105.08  6.28
    "B" 2003  .97  6.97  77.17  3.55
    "B" 2004  .87  7.69  76.81  3.07
    "C"   2003 9.09  28.1 126.72 16.75
    "C"   2004 8.64 27.87 121.54 17.14
    end
    There are three regions (region = A, B, C) and two years (year = 2003, 2004). Four industries (nlmy, ck, zhz, dl) are included. For each year, I'd like to calculate the value DIVI as displayed below:
    [Image: abs-pair.png, the formula for DIVI]

    where i, j denote regions (A, B, C), t denotes year (2003, 2004), k denotes industry (nlmy, ck, zhz, dl), and `abs' means absolute value. Thank you in advance.
    Ho-Chuan (River) Huang
    Stata 19.0, MP(4)

  • #2
    I assume xitk refers to the value in the corresponding industry(k) variable for year t in region i. But what is xit supposed to be? It seems to depend only on region and year, but there is nothing in the data you show that has that property.


    • #3
      Hi, Clyde: Thanks. I will ask the person who raised this question, and get back to you later.

      Ho-Chuan (River) Huang
      Stata 19.0, MP(4)


      • #4
        Code:
        search dissimilarity index
        reveals various commands in this territory.

        Setting aside the rather specific and rococo notation, consider just the sum of absolute differences between vectors of probabilities with typical elements p_i, q_i, i.e. the sum of |p_i - q_i|.

        Applied to such probabilities, the measure varies between 0 (identical vectors) and 2 (vectors with no overlap). For that reason a prefactor of (1/2) is often applied. I recall that this index has many names and was used early by Gini (as were various other measures).
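
        A quick way to see those bounds is to try two extreme cases in Mata. This is only an illustrative sketch, not part of the original post; the vectors p and q are made up for the purpose.

        ```stata
        mata:
        p = (1, 0, 0)            // hypothetical: all mass on the first category
        q = (0, 0.5, 0.5)        // hypothetical: no overlap with p
        sum(abs(p - q))          // no overlap: the measure reaches its maximum
        sum(abs(p - p))          // identical vectors: the measure is zero
        0.5 * sum(abs(p - q))    // with the conventional prefactor of 1/2
        end
        ```

        With the prefactor the measure is rescaled to run from 0 to 1.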

        Then one way to think about this is just to use Mata as a calculator.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str12 region int year float(nlmy ck zhz dl)
        "A" 2003 2.91  2.46 102.21  3.78
        "A" 2004 2.98  2.16 105.08  6.28
        "B" 2003  .97  6.97  77.17  3.55
        "B" 2004  .87  7.69  76.81  3.07
        "C"   2003 9.09  28.1 126.72 16.75
        "C"   2004 8.64 27.87 121.54 17.14
        end
        
        forval i = 1(2)5 {
            matrix P2003 = (nlmy[`i'], ck[`i'], zhz[`i'], dl[`i'])'
            matrix P2004 = (nlmy[`i'+1], ck[`i'+1], zhz[`i'+1], dl[`i'+1])'
            mata : M1 = st_matrix("P2003"); M2 = st_matrix("P2004")
            mata : M1 = M1 / sum(M1); M2 = M2 / sum(M2)
            mata: ustrword("A A B B C C", `i')
            mata : sum(abs(M1 - M2))
        }
        
          A
          .0399232294
          B
          .0166733515
          C
          .0173293267
        Here I am comparing data for a region in two years. For comparing regions in a given year, which is what the notation seems to imply, the implication I imagine is that you should be interested in a (symmetric) matrix of pairwise dissimilarities.
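
        The same year-to-year comparison can also be sketched without Mata, using variables only. This is an alternative written for illustration, not the code from the post; the variable names total, d and s_* are assumptions, and it relies on each region having exactly the two years shown.

        ```stata
        clear
        input str12 region int year float(nlmy ck zhz dl)
        "A" 2003 2.91  2.46 102.21  3.78
        "A" 2004 2.98  2.16 105.08  6.28
        "B" 2003  .97  6.97  77.17  3.55
        "B" 2004  .87  7.69  76.81  3.07
        "C" 2003 9.09  28.1 126.72 16.75
        "C" 2004 8.64 27.87 121.54 17.14
        end

        * row total over the four industries, used to turn values into shares
        egen double total = rowtotal(nlmy ck zhz dl)
        gen double d = 0

        foreach v of varlist nlmy ck zhz dl {
            gen double s_`v' = `v' / total
            * within region, 2003 sorts first, so s_`v'[1] is the 2003 share;
            * the 2003 row adds 0 and the 2004 row adds |share 2004 - share 2003|
            bysort region (year): replace d = d + abs(s_`v' - s_`v'[1])
        }

        list region d if year == 2004, noobs
        ```

        The values of d listed for 2004 should agree with the Mata results above, without the prefactor of 1/2.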
        Last edited by Nick Cox; 21 Feb 2018, 18:18.


        • #5
          Dear Nick, Thanks for the suggestion. I will have a look at the (search) dissimilarity index.
          Ho-Chuan (River) Huang
          Stata 19.0, MP(4)


          • #6
            For regional comparisons, I'd prefer a different data layout. This contains steps towards a more general program.

            A Mata function. Put it in a separate file, say dissim.mata, and run that file with do before you use the second code block. Note that the prefactor 0.5 would be conventional, but isn't included here. No reason except that I left it out and I have to do something else now rather than re-run results.

            Code:
            mata :
            
            void dissim(string vector varlist, string scalar select) {
                real matrix data, result
                real vector vj, vk
                real scalar J, j, k

                // one column per variable in varlist, rows where select is nonzero
                data = st_data(., varlist, select)
                J = cols(data)
                result = J(J, J, 0)

                for (j = 1; j < J; j++) {
                    for (k = j + 1; k <= J; k++) {
                        // scale each column to proportions that sum to 1
                        vj = data[., j]; vj = vj / sum(vj)
                        vk = data[., k]; vk = vk / sum(vk)
                        result[k, j] = result[j, k] = sum(abs(vj - vk))
                    }
                }
                st_matrix("result", result)
            }
            
            end
            Sample do-file. This isn't truly general. For example, there is an assumption that region names are single words.

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str12 region int year float(nlmy ck zhz dl)
            "A" 2003 2.91  2.46 102.21  3.78
            "A" 2004 2.98  2.16 105.08  6.28
            "B" 2003  .97  6.97  77.17  3.55
            "B" 2004  .87  7.69  76.81  3.07
            "C"   2003 9.09  28.1 126.72 16.75
            "C"   2004 8.64 27.87 121.54 17.14
            end
            
            rename (nlmy-dl) ind=
            reshape long ind, i(region year) j(which) string
            reshape wide ind, i(year which) j(region) string
            
            list
            
            unab varlist : ind*
            local regions : subinstr local varlist "ind" "", all
            
            gen select = .
            
            forval y = 2003/2004 {
                quietly replace select = year == `y'
                mata: dissim("`varlist'", "select")
                matrix rownames result = `regions'
                matrix colnames result = `regions'
                di _n "{title:`y'}"  
                matrix list result, noheader  
                di
            }
            What happens:

            Code:
            . * Example generated by -dataex-. To install: ssc install dataex
            . clear
            
            . input str12 region int year float(nlmy ck zhz dl)
            
                       region      year       nlmy         ck        zhz         dl
              1. "A" 2003 2.91  2.46 102.21  3.78
              2. "A" 2004 2.98  2.16 105.08  6.28
              3. "B" 2003  .97  6.97  77.17  3.55
              4. "B" 2004  .87  7.69  76.81  3.07
              5. "C"   2003 9.09  28.1 126.72 16.75
              6. "C"   2004 8.64 27.87 121.54 17.14
              7. end
            
            .
            . rename (nlmy-dl) ind=
            
            . reshape long ind, i(region year) j(which) string
            (note: j = ck dl nlmy zhz)
            
            Data                               wide   ->   long
            -----------------------------------------------------------------------------
            Number of obs.                        6   ->      24
            Number of variables                   6   ->       4
            j variable (4 values)                     ->   which
            xij variables:
                             indck inddl ... indzhz   ->   ind
            -----------------------------------------------------------------------------
            
            . reshape wide ind, i(year which) j(region) string
            (note: j = A B C)
            
            Data                               long   ->   wide
            -----------------------------------------------------------------------------
            Number of obs.                       24   ->       8
            Number of variables                   4   ->       5
            j variable (3 values)            region   ->   (dropped)
            xij variables:
                                                ind   ->   indA indB indC
            -----------------------------------------------------------------------------
            
            .
            . list
            
                 +----------------------------------------+
                 | year   which     indA    indB     indC |
                 |----------------------------------------|
              1. | 2003      ck     2.46    6.97     28.1 |
              2. | 2003      dl     3.78    3.55    16.75 |
              3. | 2003    nlmy     2.91     .97     9.09 |
              4. | 2003     zhz   102.21   77.17   126.72 |
              5. | 2004      ck     2.16    7.69    27.87 |
                 |----------------------------------------|
              6. | 2004      dl     6.28    3.07    17.14 |
              7. | 2004    nlmy     2.98     .87     8.64 |
              8. | 2004     zhz   105.08   76.81   121.54 |
                 +----------------------------------------+
            
            .
            . unab varlist : ind*
            
            . local regions : subinstr local varlist "ind" "", all
            
            .
            . gen select = .
            (8 missing values generated)
            
            .
            . forval y = 2003/2004 {
              2.     quietly replace select = year == `y'
              3.     mata: dissim("`varlist'", "select")
              4.     matrix rownames result = `regions'
              5.     matrix colnames result = `regions'
              6.         di _n "{title:`y'}"  
              7.     matrix list result, noheader  
              8.         di
              9. }
            
            2003
            
                       A          B          C
            A          0
            B  .12524211          0
            C  .43281191  .33795138          0
            
            
            2004
            
                       A          B          C
            A          0
            B  .13682167          0
            C  .41642638  .34947471          0
            Last edited by Nick Cox; 22 Feb 2018, 01:25.


            • #7
              Hi Nick, Many thanks for the suggestion. I guess I need more time to digest your code.

              Ho-Chuan (River) Huang
              Stata 19.0, MP(4)


              • #8
                I pushed this further. Now a program. No help file, but look at the examples carefully. The prefactor of 1/2 is now included in this version.

                On version: see https://www.stata.com/support/faqs/p...stata-version/ especially #4d.

                Code:
                *! 1.0.0 NJC 22 February 2018 
                program dissimmat, rclass  
                    version 15 
                    syntax varlist(numeric min=2) [if] [in] [, by(varname) * ] 
                
                    marksample touse 
                    if "`by'" != "" markout `touse' `by', strok 
                
                    qui count if `touse' 
                    if r(N) == 0 error 2000 
                
                    * check for negative values
                    foreach v of local varlist { 
                        su `v' if `touse', meanonly 
                        if r(min) < 0 { 
                            di as err "`v' contains negative values; no go" 
                            exit 411 
                        }
                    }
                
                    tempvar select which 
                    gen byte `select' = 0 
                   
                    if "`by'" == "" {
                        gen byte `which' = `touse' 
                        local onegroup = 1 
                    }
                    else {
                        egen `which' = group(`by') if `touse', label
                        local onegroup = 0 
                    } 
                
                    su `which', meanonly 
                    tempname result 
                
                    forval i = 1/`r(max)' { 
                       quietly replace `select' = `which' == `i' 
                       mata: dissim("`varlist'", "`select'", "`result'")
                       matrix rownames `result' = `varlist'
                       matrix colnames `result' = `varlist'
                
                       if !`onegroup' { 
                          local title : label (`which') `i' 
                          di _n `"{title:`title'}"'   
                       }
                
                       matrix list `result', noheader `options'  
                       return matrix dissim`i' = `result' 
                       di
                    }
                
                end 
                
                mata :
                
                void dissim(string vector varlist, string scalar select, string scalar matname) {
                real matrix data, result
                real vector vj, vk
                real scalar J, j, k
                        
                data = st_data(., varlist, select)
                J = cols(data)
                result = J(J, J, 0)
                        
                for(j = 1; j < J; j++) {
                    for(k = j + 1; k <= J; k++) {
                        vj = data[., j]; vj = vj / sum(vj); vk = data[., k]; vk = vk/sum(vk)  
                        result[k, j] = result[j, k] = sum(abs(vj - vk)) / 2 
                    }
                }
                
                st_matrix(matname, result)
                
                }
                
                end

                Code:
                * note the data structure expected! 
                clear
                input float(id year) str4 which float(indA indB indC)
                1 2003 "ck"     2.46  6.97   28.1
                2 2003 "dl"     3.78  3.55  16.75
                3 2003 "nlmy"   2.91   .97   9.09
                4 2003 "zhz"  102.21 77.17 126.72
                5 2004 "ck"     2.16  7.69  27.87
                6 2004 "dl"     6.28  3.07  17.14
                7 2004 "nlmy"   2.98   .87   8.64
                8 2004 "zhz"  105.08 76.81 121.54
                end
                
                * lumping years together makes little substantive sense; it's just to show syntax 
                . dissimmat ind*
                
                           indA       indB       indC
                indA          0
                indB  .06595852          0
                indC  .21207176  .17179878          0
                
                
                . dissimmat ind*, by(year)
                
                2003
                
                           indA       indB       indC
                indA          0
                indB  .06262105          0
                indC  .21640595  .16897569          0
                
                
                2004
                
                           indA       indB       indC
                indA          0
                indB  .06841084          0
                indC  .20821319  .17473735          0
                
                
                . ret li
                
                matrices:
                            r(dissim2) :  3 x 3
                            r(dissim1) :  3 x 3
                
                . mat li r(dissim1)
                
                symmetric r(dissim1)[3,3]
                           indA       indB       indC
                indA          0
                indB  .06262105          0
                indC  .21640595  .16897569          0


                • #9
                  Hi Nick, That's great. Thank you so much for your help.
                  Ho-Chuan (River) Huang
                  Stata 19.0, MP(4)


                  • #10
                    On #8: The program works in Stata 11.2 and fails in 10.1 (because st_data() in Mata was then more demanding in its syntax).


                    • #11
                      Originally posted by Nick Cox
                      On #8: The program works in Stata 11.2 and fails in 10.1 (because st_data() in Mata was then more demanding in its syntax).
                      Got it.
                      Ho-Chuan (River) Huang
                      Stata 19.0, MP(4)


                      • #12
                        In fact, a program dissim has been available from SSC since 1999. It requires Stata 5.

                        If you installed it, help dissim is already hijacked by StataCorp for [MV] purposes, so you need other ways to look at its help.

                        It did not support a by() option.


                        • #13
                          Dear Nick, Thanks again for the information. I was wondering if we can use something like the -joinby- command to form pairs (to change the original structure), e.g., from A, B, C to AB, AC, BC in one column (rather than a 3 by 3 matrix) for each year?

                          Ho-Chuan (River) Huang
                          Stata 19.0, MP(4)


                          • #14
                            I imagine that you can do that. Why would you want to do it?
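
                            One possible route, sketched here for illustration only: it does not use -joinby-, but simply walks the upper triangle of a dissimilarity matrix and writes each pair to one observation. The matrix Dmat is typed in by hand from the 2003 output in #8, and the names Dmat, pair and dissim are assumptions. Run it as a do-file in one go, because it uses local macros.

                            ```stata
                            clear

                            * 2003 dissimilarities, copied by hand from the output in #8
                            matrix Dmat = (0, .06262105, .21640595 \ ///
                                           .06262105, 0, .16897569 \ ///
                                           .21640595, .16897569, 0)
                            matrix rownames Dmat = A B C
                            matrix colnames Dmat = A B C

                            local names : rownames Dmat
                            local n = rowsof(Dmat)

                            * one observation per unordered pair: n(n-1)/2 of them
                            set obs `=`n' * (`n' - 1) / 2'
                            gen str20 pair = ""
                            gen double dissim = .

                            * walk the upper triangle: (1,2), (1,3), (2,3)
                            local row = 0
                            forval i = 1/`=`n' - 1' {
                                forval j = `=`i' + 1'/`n' {
                                    local ++row
                                    local a : word `i' of `names'
                                    local b : word `j' of `names'
                                    quietly replace pair = "`a'`b'" in `row'
                                    quietly replace dissim = Dmat[`i', `j'] in `row'
                                }
                            }

                            list, noobs
                            ```

                            To cover several years, the same loop could be repeated over the matrices returned by dissimmat with by(), adding a year variable; that extension is not shown here.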


                            • #15
                              I didn't really have anything in mind. Just for curiosity!

                              Ho-Chuan (River) Huang
                              Stata 19.0, MP(4)
