  • How to calculate Jaccard Similarity

    Dear Statalisters,

    I am trying to calculate a pairwise Jaccard similarity measure and am having trouble figuring out how to do so. My data are in the following format: the first variable, assignee_id, identifies the firm, and the other variables (law_1 to law_5) are dummy variables for its legal partners (a 1 indicates that the firm has worked with that partner). I want to calculate a pairwise similarity measure for firms based on how similar they are in their use of legal partners. I have tried a few different things but haven't gotten anywhere, so your help with the syntax would be much appreciated. I've attached a data example below.

    Thanks for your help.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str32 assignee_id byte(law_1 law_2 law_3 law_4 law_5)
    "00d92f99f43508d37de79da7051b43c7" 1 0 0 0 0
    "00e5262f320cbda9f15490debbe80858" 0 1 0 1 0
    "031b354668d5ceefc7b4bb3ba57664d4" 0 0 0 1 1
    "03810188291c60318b5b0da566c266fb" 0 1 0 1 0
    "054d563b447b317f56d940f5e3dd7b39" 1 0 0 0 0
    "05695a60b69eb9a0f6e781debe23e9cc" 1 0 0 0 0
    "062af6b4d9f7708cfd5e659cd13a3726" 1 0 1 0 0
    "081507e638fca84980f88a3c3f5cd1fa" 0 0 0 0 0
    "099c2e138f83bf0366539bddfda6b2e2" 0 0 1 0 0
    "09fc005ad2872886a676a2f4197ce018" 0 0 0 0 0
    "0a00649f54947198768fa954f8756563" 0 0 0 0 0
    "0a21a0cbd50fe6558b13d773effc9eb1" 0 0 1 0 0
    "0a302a7b505844998614e26c7c26d4a0" 0 1 0 1 1
    "0a4642a77d52197c97f5d592966b68d7" 0 1 1 1 1
    "0a74e8eea755f3ab33162a52dc87bb5d" 1 0 0 0 0
    "0bb9626cc72bbfaf9ae174a022ceb086" 0 1 0 0 0
    "0c65f80fcfe79b0c4732a7ebc645da8c" 0 1 0 0 0
    "0ceb8b624ea012dea6d0c3705d4f547e" 0 1 1 0 0
    "0d5c37ddbc9800bfc84774afe4b36faa" 1 0 0 0 1
    "0d5fb33b90b1825b0003a1573d7477fe" 0 0 0 0 0
    "0d6c6c25cf34819e50fd97318db9b699" 0 1 0 0 0
    "0ee26da954c6572b783432f619a301e3" 1 0 0 0 0
    "0f4a6ddb6c4a854440e1123924820706" 0 0 0 0 1
    "0fa5a08e051f6bb467854f4bbb913a46" 0 0 1 1 0
    "1005528d1a3c548b2403fba94f0927f5" 1 0 0 1 0
    "107da3bb737c53c0d39645f72ede8b86" 0 0 0 0 1
    "10b108b4ee97bab2304d092590c0bf7c" 0 0 0 0 1
    "11127e943b93352979514b124179eb94" 1 0 0 1 0
    "11f00a94b4fe1138e00af83137db2fac" 0 0 1 0 0
    "134d75dd2f4984f02db90d441336fd2e" 0 1 1 0 0
    end

  • #2
    I am using the following definition of Jaccard Similarity:

    How to Calculate the Jaccard Index
    1. Count the number of members which are shared between both sets.
    2. Count the total number of members in both sets (shared and un-shared).
    3. Divide the number of shared members (1) by the total number of members (2).
    4. Multiply the number you found in (3) by 100.
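
    In set notation, the four steps above amount to the formula below, where A and B are the sets of legal partners of two firms (the symbols are introduced here only for illustration). The worked case uses two rows of the example data: firm 062af6b4... works with law_1 and law_3, and firm 00d92f99... works with law_1 only. The ratio is undefined when neither firm has any partner.

    ```latex
    J(A, B) = \frac{|A \cap B|}{|A \cup B|} \times 100,
    \qquad
    A = \{1, 3\},\; B = \{1\}
    \;\Rightarrow\;
    J(A, B) = \frac{1}{2} \times 100 = 50
    ```
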
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str32 assignee_id byte(law_1 law_2 law_3 law_4 law_5)
    "00d92f99f43508d37de79da7051b43c7" 1 0 0 0 0
    "00e5262f320cbda9f15490debbe80858" 0 1 0 1 0
    "031b354668d5ceefc7b4bb3ba57664d4" 0 0 0 1 1
    "03810188291c60318b5b0da566c266fb" 0 1 0 1 0
    "054d563b447b317f56d940f5e3dd7b39" 1 0 0 0 0
    "05695a60b69eb9a0f6e781debe23e9cc" 1 0 0 0 0
    "062af6b4d9f7708cfd5e659cd13a3726" 1 0 1 0 0
    "081507e638fca84980f88a3c3f5cd1fa" 0 0 0 0 0
    "099c2e138f83bf0366539bddfda6b2e2" 0 0 1 0 0
    "09fc005ad2872886a676a2f4197ce018" 0 0 0 0 0
    "0a00649f54947198768fa954f8756563" 0 0 0 0 0
    "0a21a0cbd50fe6558b13d773effc9eb1" 0 0 1 0 0
    "0a302a7b505844998614e26c7c26d4a0" 0 1 0 1 1
    "0a4642a77d52197c97f5d592966b68d7" 0 1 1 1 1
    "0a74e8eea755f3ab33162a52dc87bb5d" 1 0 0 0 0
    "0bb9626cc72bbfaf9ae174a022ceb086" 0 1 0 0 0
    "0c65f80fcfe79b0c4732a7ebc645da8c" 0 1 0 0 0
    "0ceb8b624ea012dea6d0c3705d4f547e" 0 1 1 0 0
    "0d5c37ddbc9800bfc84774afe4b36faa" 1 0 0 0 1
    "0d5fb33b90b1825b0003a1573d7477fe" 0 0 0 0 0
    "0d6c6c25cf34819e50fd97318db9b699" 0 1 0 0 0
    "0ee26da954c6572b783432f619a301e3" 1 0 0 0 0
    "0f4a6ddb6c4a854440e1123924820706" 0 0 0 0 1
    "0fa5a08e051f6bb467854f4bbb913a46" 0 0 1 1 0
    "1005528d1a3c548b2403fba94f0927f5" 1 0 0 1 0
    "107da3bb737c53c0d39645f72ede8b86" 0 0 0 0 1
    "10b108b4ee97bab2304d092590c0bf7c" 0 0 0 0 1
    "11127e943b93352979514b124179eb94" 1 0 0 1 0
    "11f00a94b4fe1138e00af83137db2fac" 0 0 1 0 0
    "134d75dd2f4984f02db90d441336fd2e" 0 1 1 0 0
    end
    
    preserve
    rename * *2
    tempfile id2
    save `id2'
    restore
    cross using `id2'
    drop if assignee_id== assignee_id2
    drop if assignee_id< assignee_id2
    forval i=1/5{
    gen tlaw_`i'= cond((law_`i'+law_`i'2) > 1, 1, (law_`i'+law_`i'2))
    }
    egen total= rowtotal(tlaw_*)
    gen similarity= (((law_1*law_12)+ (law_2*law_22)+ (law_3*law_32)+ (law_4*law_42)+ (law_5*law_52))/ total)*100


    Result (partial):

    Code:
    . l assignee_id assignee_id2 similarity in 1/20, noobs clean
    
                             assignee_id                       assignee_id2   simila~y  
        00e5262f320cbda9f15490debbe80858   00d92f99f43508d37de79da7051b43c7          0  
        031b354668d5ceefc7b4bb3ba57664d4   00d92f99f43508d37de79da7051b43c7          0  
        03810188291c60318b5b0da566c266fb   00d92f99f43508d37de79da7051b43c7          0  
        054d563b447b317f56d940f5e3dd7b39   00d92f99f43508d37de79da7051b43c7        100  
        05695a60b69eb9a0f6e781debe23e9cc   00d92f99f43508d37de79da7051b43c7        100  
        062af6b4d9f7708cfd5e659cd13a3726   00d92f99f43508d37de79da7051b43c7         50  
        081507e638fca84980f88a3c3f5cd1fa   00d92f99f43508d37de79da7051b43c7          0  
        099c2e138f83bf0366539bddfda6b2e2   00d92f99f43508d37de79da7051b43c7          0  
        09fc005ad2872886a676a2f4197ce018   00d92f99f43508d37de79da7051b43c7          0  
        0a00649f54947198768fa954f8756563   00d92f99f43508d37de79da7051b43c7          0  
        0a21a0cbd50fe6558b13d773effc9eb1   00d92f99f43508d37de79da7051b43c7          0  
        0a302a7b505844998614e26c7c26d4a0   00d92f99f43508d37de79da7051b43c7          0  
        0a4642a77d52197c97f5d592966b68d7   00d92f99f43508d37de79da7051b43c7          0  
        0a74e8eea755f3ab33162a52dc87bb5d   00d92f99f43508d37de79da7051b43c7        100  
        0bb9626cc72bbfaf9ae174a022ceb086   00d92f99f43508d37de79da7051b43c7          0  
        0c65f80fcfe79b0c4732a7ebc645da8c   00d92f99f43508d37de79da7051b43c7          0  
        0ceb8b624ea012dea6d0c3705d4f547e   00d92f99f43508d37de79da7051b43c7          0  
        0d5c37ddbc9800bfc84774afe4b36faa   00d92f99f43508d37de79da7051b43c7         50  
        0d5fb33b90b1825b0003a1573d7477fe   00d92f99f43508d37de79da7051b43c7          0  
        0d6c6c25cf34819e50fd97318db9b699   00d92f99f43508d37de79da7051b43c7          0
    Last edited by Andrew Musau; 16 Jan 2019, 14:16.



    • #3
      Thanks so much for your help, Andrew Musau. This makes a lot of sense. However, outside of the example I have given, I am working with a very large data set, so I am wondering how I would code the last line (gen similarity = ...), since I cannot write out the expression for thousands of variables.



      • #4
        The same indicators that we used to create the total variable can be re-used.

        Code:
        preserve
        rename * *2
        tempfile id2
        save `id2'
        restore
        cross using `id2'
        drop if assignee_id== assignee_id2
        drop if assignee_id< assignee_id2
        forval i=1/5{
        gen tlaw_`i'= cond((law_`i'+law_`i'2) > 1, 1, (law_`i'+law_`i'2))
        }
        egen total= rowtotal(tlaw_*)
        forval i=1/5{
        replace tlaw_`i'= cond((law_`i'+law_`i'2) == 2, 1, 0)
        }
        egen count= rowtotal(tlaw_*)
        gen similarity= (count/total)*100
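
        With thousands of indicators, the hard-coded 1/5 range and the intermediate tlaw_ variables can be avoided by collecting the variable names before the cross and accumulating both counts in one loop. This is only a sketch, assuming the wide data from #1 is in memory and every indicator matches law_*; the _b suffix and the names lawvars, shared, and either are choices made here, not part of the code above. A suffix such as _b is used instead of 2 because, with 12 or more indicators, law_1 renamed to law_12 would clash with the original law_12 after the cross.

        ```stata
        * Sketch only: assumes the wide data from #1 is in memory
        * Capture the indicator names before the cross adds the second set
        unab lawvars : law_*

        preserve
        * "_b" marks the comparison firm
        rename * *_b
        tempfile id2
        save `id2'
        restore
        cross using `id2'
        * Keep each unordered pair of firms once
        drop if assignee_id <= assignee_id_b

        * shared: partners used by both firms; either: used by at least one
        gen long shared = 0
        gen long either = 0
        foreach v of local lawvars {
            replace shared = shared + (`v' == 1 & `v'_b == 1)
            replace either = either + (`v' == 1 | `v'_b == 1)
        }
        * Missing when neither firm has any partner
        gen similarity = 100 * shared / either
        ```

        The cross still creates one observation per pair of firms, so this addresses the number of variables, not the number of observations.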



        • #5
          Oh right, that's smart. Thanks Andrew!



          • #6
            So I'm not sure whether this deserves a new topic or not but let me start by trying to pose my question here:

            I am still trying to measure the Jaccard similarity (thanks on that front to Andrew Musau) but my data set is a bit too large to use the initial approach that I was planning on using. I am starting with the following data structure:

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str32(assignee_id lawyer_id)
            "914b749bd9f0e7a256216f4de7f2b5a6" "9d09de38cd429c759bb62d889befa5c6"
            "58177709a5aa0d0850859da5ccc0a468" "09f6e19b2bd8da5cc40ca48edd56ea94"
            "d960d2575fbdb96d5740a47a1c2360df" "7e4642c6bd279509097d906dfe4db5fd"
            "0d5c37ddbc9800bfc84774afe4b36faa" "47fde3d66cd164adcb2a02927f714271"
            "1d6bdd879a4950b07cad461e3bd31a05" "3065ee7a0f5db92441a29f3026aee12a"
            "ab8fca5cbb381916bf75767f9d4a8d40" "9d09de38cd429c759bb62d889befa5c6"
            "03810188291c60318b5b0da566c266fb" "c7fba32a9af42df344510a9dc12180ad"
            "a18ca3c22b82fef7af1efd9b0712712f" "34c1475b3954bba9b4e911067e8b1193"
            "d8bef40b1fe4378272a6dda74244b1f4" "13694a440357f873e7d310efc6af3ce8"
            "fd5c3510323696410e42f08bbbc9ae73" "b0ddc5c538cbac817a71b3bc57d359a5"
            "80936ad23fa87f9d8a07e42a0171ed71" "8729dfa7dc300b6311c361b59dadf9fe"
            "b2b9cab4ede9a4760ada3c125cef95f8" "1f4f411279658c6b0ed0d452a0a8830a"
            "cd43f1d18d3947d2ccd2a25e3a7eba2a" "30a12e049680fdf710be863654a77b49"
            "9be1e7000d02e9568cb8315543f1cf99" "cba1ad56a859ce2f9b93360098d9192d"
            "b9bed5c26d82998aae60c288b0ce44c8" "c2aac2a808e169e6553874705dc129cd"
            "f0c20278be910af51be78ebe77604cd7" "5210d577a2e578b9665923f9794340c9"
            "2c4293e2a32ced00126c738309f31d3c" "65dba96232f2dbd70d4be1cfa5172b61"
            "1d3402344f70c0a10e277a3bb3e8c862" "478acb28c34aea8efc75c606485fb2ad"
            "6fde9a279d234bdd67598cf9d3b39581" "f71b2d688d79df83a0670e4dcad04545"
            "b66653d47d8e39f3a1d5c392ca5e3a1e" "5210d577a2e578b9665923f9794340c9"
            "1ee8fae2374654bcd82a1d668f50acde" "8ffec053d94dfabc1a343f88ae72e6a5"
            "0a21a0cbd50fe6558b13d773effc9eb1" "4a6633dae5cde7d2b8fe2c624d45781f"
            "031b354668d5ceefc7b4bb3ba57664d4" "3e50a746174ad603ab4e5f28a2a24b90"
            "2cbd222707fc3fc258301155967fe016" "0339fd5da00b6be59b8db084daab77fa"
            "a73ce9b7f1aee36695dcf0f6638ee1d8" "968c9325b421773e8a2803f550776a4a"
            "cd43f1d18d3947d2ccd2a25e3a7eba2a" "30a12e049680fdf710be863654a77b49"
            "f2b7bf2808a4a1e2e4162cc5016b918b" "0fa9ae8e22f6f35d91b08f5c68c4efb0"
            "3f71a4facd890af6522d475b26815d96" "d3d77f528852261e04ff90d588a4dd6e"
            "b8e28428d90c4b0048da119925b1d6cc" "364d514180b7e743654ca585d3db1f8f"
            "7efb83b4195b23919eabe7c95a9e617e" "37cb4484fde0e6c45d0eed06568cb827"
            end
            This is essentially the same data as before without having used
            Code:
             tab lawyer_id, gen(law_)
            to create dummies for the lawyer_id variable. The reason I cannot do this is that I have more than 120k categories for that variable, so these would be too many new variables for Stata to construct. Is there any way I could get around this restriction? Is there a way to calculate the Jaccard similarity without creating dummies for all categories?

            P.S.: I know I could take a sample of the data, but I would really like to avoid doing so. (Also there seems to be a 12k restriction on tab tables, which means I would have to figure out another way to generate the dummies).
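
            One possible way around the dummy variables altogether, offered as a sketch rather than a tested solution for the full data: the Jaccard index only needs, for each pair of firms, the number of lawyers they share and the number of lawyers each firm has, because |A ∪ B| = |A| + |B| - |A ∩ B|. Both can be obtained from the long data with joinby. The names n_a, n_b, shared, and other are made up for this example.

            ```stata
            * Sketch only: assumes the long data above is in memory
            * One row per firm-lawyer combination
            duplicates drop assignee_id lawyer_id, force
            * Number of distinct lawyers per firm
            bysort assignee_id: gen long n_a = _N

            preserve
            rename (assignee_id n_a) (assignee_id2 n_b)
            tempfile other
            save `other'
            restore

            * All pairs of firms that share a given lawyer
            joinby lawyer_id using `other'
            * Keep each unordered pair of firms once
            keep if assignee_id > assignee_id2
            * One row per pair; shared = number of lawyers in common
            contract assignee_id assignee_id2 n_a n_b, freq(shared)
            gen similarity = 100 * shared / (n_a + n_b - shared)
            ```

            Pairs of firms with no lawyer in common never appear in the result; their similarity is 0 by definition. A lawyer who works with very many firms makes the joinby step large, so memory use depends on how concentrated the data are.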



            • #7
              What flavor of Stata do you have? For such large data sets, Stata MP is preferable because it allows a much higher variable limit. Stata SE is capped at 32,767.

                                     Stata/MP             Stata/SE
              # of observations      1,099,511,627,775    2,147,483,647
              # of variables         120,000              32,767

              You need to have two firms in the same observation (the reference firm and the comparison firm) because you are calculating pairwise similarity. It may be the case that you have 120,000+ law firms in your data set, but are these distinct?

              (Also there seems to be a 12k restriction on tab tables, which means I would have to figure out another way to generate the dummies).
              Check out this link for a work-around. If you can successfully generate a data set with one set of dummies for the law firms, the rest can be done in a matrix in Mata and the result exported back to Stata since you just need a single variable at the end, i.e., the similarity index.
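
              A quick way to check how many distinct values there are, sketched under the assumption that the long data from #6 (assignee_id, lawyer_id) is in memory; the variable names first_lawyer and first_firm are invented here.

              ```stata
              * Sketch only: assumes the long data from #6 is in memory
              * tag() marks exactly one row per distinct value
              egen byte first_lawyer = tag(lawyer_id)
              egen byte first_firm = tag(assignee_id)
              * Number of distinct lawyers
              count if first_lawyer
              * Number of distinct firms
              count if first_firm
              ```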
              Last edited by Andrew Musau; 18 Jan 2019, 09:08.



              • #8
                Originally posted by Andrew Musau
                What flavor of Stata do you have? For such large data sets, Stata MP is preferable because it allows a much higher variable limit. Stata SE is capped at 32,767.
                I do have access to the MP version so I could potentially work with 120k variables.


                Originally posted by Andrew Musau
                You need to have two firms in the same observation (the reference firm and the comparison firm) because you are calculating pairwise similarity. It may be the case that you have 120,000+ law firms in your data set, but are these distinct?
                Originally posted by Andrew Musau
                Check out this link for a work-around. If you can successfully generate a data set with one set of dummies for the law firms, the rest can be done in a matrix in Mata and the result exported back to Stata since you just need a single variable at the end, i.e., the similarity index.
                Thanks, this seems like it should work. After restricting my sample a little, I managed to get the number of law firms just below 120k. However, I have never worked with Mata before. Do you know how I would run (essentially) the same code that you previously suggested in Mata?

                Again, thank you so much for your help Andrew, I really appreciate it!





                • #9
                  Apologies, John Kirk. I am on vacation this month, so I am slow to reply. The calculations require only simple matrix operations. In addition, I use moremata from SSC (for mm_cond()).


                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input str32 assignee_id byte(law_1 law_2 law_3 law_4 law_5)
                  "00d92f99f43508d37de79da7051b43c7" 1 0 0 0 0
                  "00e5262f320cbda9f15490debbe80858" 0 1 0 1 0
                  "031b354668d5ceefc7b4bb3ba57664d4" 0 0 0 1 1
                  "03810188291c60318b5b0da566c266fb" 0 1 0 1 0
                  "054d563b447b317f56d940f5e3dd7b39" 1 0 0 0 0
                  "05695a60b69eb9a0f6e781debe23e9cc" 1 0 0 0 0
                  "062af6b4d9f7708cfd5e659cd13a3726" 1 0 1 0 0
                  "081507e638fca84980f88a3c3f5cd1fa" 0 0 0 0 0
                  "099c2e138f83bf0366539bddfda6b2e2" 0 0 1 0 0
                  "09fc005ad2872886a676a2f4197ce018" 0 0 0 0 0
                  "0a00649f54947198768fa954f8756563" 0 0 0 0 0
                  "0a21a0cbd50fe6558b13d773effc9eb1" 0 0 1 0 0
                  "0a302a7b505844998614e26c7c26d4a0" 0 1 0 1 1
                  "0a4642a77d52197c97f5d592966b68d7" 0 1 1 1 1
                  "0a74e8eea755f3ab33162a52dc87bb5d" 1 0 0 0 0
                  "0bb9626cc72bbfaf9ae174a022ceb086" 0 1 0 0 0
                  "0c65f80fcfe79b0c4732a7ebc645da8c" 0 1 0 0 0
                  "0ceb8b624ea012dea6d0c3705d4f547e" 0 1 1 0 0
                  "0d5c37ddbc9800bfc84774afe4b36faa" 1 0 0 0 1
                  "0d5fb33b90b1825b0003a1573d7477fe" 0 0 0 0 0
                  "0d6c6c25cf34819e50fd97318db9b699" 0 1 0 0 0
                  "0ee26da954c6572b783432f619a301e3" 1 0 0 0 0
                  "0f4a6ddb6c4a854440e1123924820706" 0 0 0 0 1
                  "0fa5a08e051f6bb467854f4bbb913a46" 0 0 1 1 0
                  "1005528d1a3c548b2403fba94f0927f5" 1 0 0 1 0
                  "107da3bb737c53c0d39645f72ede8b86" 0 0 0 0 1
                  "10b108b4ee97bab2304d092590c0bf7c" 0 0 0 0 1
                  "11127e943b93352979514b124179eb94" 1 0 0 1 0
                  "11f00a94b4fe1138e00af83137db2fac" 0 0 1 0 0
                  "134d75dd2f4984f02db90d441336fd2e" 0 1 1 0 0
                  end
                  
                  preserve
                  rename * *2
                  tempfile id2
                  save `id2'
                  restore
                  cross using `id2'
                  drop if assignee_id<= assignee_id2
                  mata:  st_view(D=., ., "law_1 - law_5")
                  mata:  st_view(D2=., ., "law_12 - law_52")
                  gen count=.
                  gen total=.
                  *ssc install moremata
                  mata: D5 = mm_cond((D+D2) :> 1, 1, (D+D2))
                  mata: D6 = mm_cond((D+D2) :> 1, 1, 0)
                  mata: J=J(5, 1, 1)
                  mata: st_store(., "count", D6*J)
                  mata: st_store(., "total", D5*J)
                  gen similarity= (count/total)*100
                  You should also be able to recreate in Mata what the cross command does. If you fail to find code in the archives, let me know and I can work something out.
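
                  One way to sidestep cross entirely, sketched here under the assumption that both the firms-by-indicators matrix and a firms-by-firms result fit in memory: with X holding the 0/1 indicators, X*X' counts the partners each pair of firms shares, and the union follows from the row sums. The names X, N, n, shared, total, and jac are illustrative and not part of the code above.

                  ```stata
                  * Sketch only: assumes the wide data (one row per firm) is in memory, before any cross
                  mata:
                  // firms x indicators, entries 0/1
                  X = st_data(., "law_1 law_2 law_3 law_4 law_5")
                  N = rows(X)
                  // number of partners per firm
                  n = rowsum(X)
                  // shared[i,j] = partners used by both firm i and firm j
                  shared = X * X'
                  // union size: |A| + |B| - |A and B|
                  total = n * J(1, N, 1) + J(N, 1, 1) * n' - shared
                  // missing where neither firm has a partner
                  jac = 100 * shared :/ total
                  // firm in row 7 compared with firm in row 1
                  jac[7, 1]
                  end
                  ```

                  For the example data, rows 7 and 1 are the firms 062af6b4... and 00d92f99..., which share one of two partners, so the last line should display 50. The diagonal of jac compares each firm with itself and can be ignored.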



                  • #10
                    Sorry to jump into this late--it sounds like you have worked out what John needs to do with his data. I just thought I would point out a couple of string similarity measures that Stata has:

                    See the Statalist entry "New package for phonetic string encoding and string distance/similarity metrics".

                    Code:
                     net inst strutil, from("http://wbuchanan.github.io/StataStringUtilities/")
                    project repository (on GitHub) is https://github.com/wbuchanan/StataStringUtilities/

                    help cluster programming, see cluster measures
                    help measure_option has Jaccard, Russell & Rao, Hamann, etc

                    binary_measure   Description
                    matching         simple matching similarity coefficient
                    Jaccard          Jaccard binary similarity coefficient
                    Russell          Russell and Rao similarity coefficient
                    Hamann           Hamann similarity coefficient
                    Dice             Dice similarity coefficient
                    antiDice         anti-Dice similarity coefficient
                    Sneath           Sneath and Sokal similarity coefficient
                    Rogers           Rogers and Tanimoto similarity coefficient
                    Ochiai           Ochiai similarity coefficient
                    Yule             Yule similarity coefficient
                    Anderberg        Anderberg similarity coefficient
                    Kulczynski       Kulczyński similarity coefficient
                    Pearson          Pearson's phi similarity coefficient
                    Gower2           similarity coefficient with same denominator as Pearson
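
                    As a sketch of how the built-in Jaccard measure might be applied to the wide example from #1 (not checked against the results earlier in the thread): matrix dissimilarity compares observations and, for a similarity measure such as Jaccard, returns similarities unless dissim() is specified. The matrix name Jac is arbitrary. Values are on a 0 to 1 scale rather than multiplied by 100, and how pairs of all-zero firms are scored should be checked in help measure_option.

                    ```stata
                    * Sketch only: assumes the wide data (one row per firm) is in memory
                    * Jac[i,j] = Jaccard similarity between observations i and j
                    matrix dissimilarity Jac = law_1-law_5, Jaccard
                    matrix list Jac
                    ```

                    The result is a firms-by-firms matrix, so this is practical only when the number of firms is moderate.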
