How to calculate Jaccard Similarity

John Kirk

Join Date: Apr 2017
Posts: 57

16 Jan 2019, 11:29

Dear Statalisters,

I am trying to calculate a pairwise Jaccard similarity measure and am having trouble figuring out how to do so. My data are in the following format: the first variable, assignee_id, identifies the firm, and the other variables (law_1 to law_5) are dummy variables for legal partners, with a 1 indicating that the firm has worked with that partner. I want to calculate a pairwise measure of how similar firms are in their use of legal partners. I have tried a few different things without success, so your help with the syntax would be much appreciated. I've attached a data example below.

Thanks for your help

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str32 assignee_id byte(law_1 law_2 law_3 law_4 law_5)
"00d92f99f43508d37de79da7051b43c7" 1 0 0 0 0
"00e5262f320cbda9f15490debbe80858" 0 1 0 1 0
"031b354668d5ceefc7b4bb3ba57664d4" 0 0 0 1 1
"03810188291c60318b5b0da566c266fb" 0 1 0 1 0
"054d563b447b317f56d940f5e3dd7b39" 1 0 0 0 0
"05695a60b69eb9a0f6e781debe23e9cc" 1 0 0 0 0
"062af6b4d9f7708cfd5e659cd13a3726" 1 0 1 0 0
"081507e638fca84980f88a3c3f5cd1fa" 0 0 0 0 0
"099c2e138f83bf0366539bddfda6b2e2" 0 0 1 0 0
"09fc005ad2872886a676a2f4197ce018" 0 0 0 0 0
"0a00649f54947198768fa954f8756563" 0 0 0 0 0
"0a21a0cbd50fe6558b13d773effc9eb1" 0 0 1 0 0
"0a302a7b505844998614e26c7c26d4a0" 0 1 0 1 1
"0a4642a77d52197c97f5d592966b68d7" 0 1 1 1 1
"0a74e8eea755f3ab33162a52dc87bb5d" 1 0 0 0 0
"0bb9626cc72bbfaf9ae174a022ceb086" 0 1 0 0 0
"0c65f80fcfe79b0c4732a7ebc645da8c" 0 1 0 0 0
"0ceb8b624ea012dea6d0c3705d4f547e" 0 1 1 0 0
"0d5c37ddbc9800bfc84774afe4b36faa" 1 0 0 0 1
"0d5fb33b90b1825b0003a1573d7477fe" 0 0 0 0 0
"0d6c6c25cf34819e50fd97318db9b699" 0 1 0 0 0
"0ee26da954c6572b783432f619a301e3" 1 0 0 0 0
"0f4a6ddb6c4a854440e1123924820706" 0 0 0 0 1
"0fa5a08e051f6bb467854f4bbb913a46" 0 0 1 1 0
"1005528d1a3c548b2403fba94f0927f5" 1 0 0 1 0
"107da3bb737c53c0d39645f72ede8b86" 0 0 0 0 1
"10b108b4ee97bab2304d092590c0bf7c" 0 0 0 0 1
"11127e943b93352979514b124179eb94" 1 0 0 1 0
"11f00a94b4fe1138e00af83137db2fac" 0 0 1 0 0
"134d75dd2f4984f02db90d441336fd2e" 0 1 1 0 0
end

Andrew Musau

Join Date: Oct 2014
Posts: 10078

16 Jan 2019, 12:29

I am using the following definition of Jaccard Similarity:

How to Calculate the Jaccard Index
1. Count the number of members which are shared between both sets.
2. Count the total number of members in both sets (shared and un-shared).
3. Divide the number of shared members (1) by the total number of members (2).
4. Multiply the number you found in (3) by 100.
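
In set notation, the four steps amount to the formula below. This is only a sketch: the symbols A and B (the sets of legal partners of two firms) are my notation and are not part of the quoted definition.

```latex
J(A, B) = 100 \times \frac{|A \cap B|}{|A \cup B|},
\qquad |A \cup B| = |A| + |B| - |A \cap B|
```

For example, a firm with law_1 and law_3 compared with a firm that has only law_1 shares one partner out of two distinct partners in total, so J = 100 × 1/2 = 50.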

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str32 assignee_id byte(law_1 law_2 law_3 law_4 law_5)
"00d92f99f43508d37de79da7051b43c7" 1 0 0 0 0
"00e5262f320cbda9f15490debbe80858" 0 1 0 1 0
"031b354668d5ceefc7b4bb3ba57664d4" 0 0 0 1 1
"03810188291c60318b5b0da566c266fb" 0 1 0 1 0
"054d563b447b317f56d940f5e3dd7b39" 1 0 0 0 0
"05695a60b69eb9a0f6e781debe23e9cc" 1 0 0 0 0
"062af6b4d9f7708cfd5e659cd13a3726" 1 0 1 0 0
"081507e638fca84980f88a3c3f5cd1fa" 0 0 0 0 0
"099c2e138f83bf0366539bddfda6b2e2" 0 0 1 0 0
"09fc005ad2872886a676a2f4197ce018" 0 0 0 0 0
"0a00649f54947198768fa954f8756563" 0 0 0 0 0
"0a21a0cbd50fe6558b13d773effc9eb1" 0 0 1 0 0
"0a302a7b505844998614e26c7c26d4a0" 0 1 0 1 1
"0a4642a77d52197c97f5d592966b68d7" 0 1 1 1 1
"0a74e8eea755f3ab33162a52dc87bb5d" 1 0 0 0 0
"0bb9626cc72bbfaf9ae174a022ceb086" 0 1 0 0 0
"0c65f80fcfe79b0c4732a7ebc645da8c" 0 1 0 0 0
"0ceb8b624ea012dea6d0c3705d4f547e" 0 1 1 0 0
"0d5c37ddbc9800bfc84774afe4b36faa" 1 0 0 0 1
"0d5fb33b90b1825b0003a1573d7477fe" 0 0 0 0 0
"0d6c6c25cf34819e50fd97318db9b699" 0 1 0 0 0
"0ee26da954c6572b783432f619a301e3" 1 0 0 0 0
"0f4a6ddb6c4a854440e1123924820706" 0 0 0 0 1
"0fa5a08e051f6bb467854f4bbb913a46" 0 0 1 1 0
"1005528d1a3c548b2403fba94f0927f5" 1 0 0 1 0
"107da3bb737c53c0d39645f72ede8b86" 0 0 0 0 1
"10b108b4ee97bab2304d092590c0bf7c" 0 0 0 0 1
"11127e943b93352979514b124179eb94" 1 0 0 1 0
"11f00a94b4fe1138e00af83137db2fac" 0 0 1 0 0
"134d75dd2f4984f02db90d441336fd2e" 0 1 1 0 0
end

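* Save a copy of the data with every variable name suffixed by 2
preserve
rename * *2
tempfile id2
save `id2'
restore
* Form every pairwise combination of firms
cross using `id2'
* Drop self-pairs and keep each unordered pair only once
drop if assignee_id== assignee_id2
drop if assignee_id< assignee_id2
* tlaw_i = 1 if either firm in the pair used legal partner i (the union)
forval i=1/5{
gen tlaw_`i'= cond((law_`i'+law_`i'2) > 1, 1, (law_`i'+law_`i'2))
}
egen total= rowtotal(tlaw_*)
* Shared partners divided by the union, in percent (missing when total is 0)
gen similarity= (((law_1*law_12)+ (law_2*law_22)+ (law_3*law_32)+ (law_4*law_42)+ (law_5*law_52))/ total)*100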

Result (partial):

Code:

. l assignee_id assignee_id2 similarity in 1/20, noobs clean

                         assignee_id                       assignee_id2   simila~y  
    00e5262f320cbda9f15490debbe80858   00d92f99f43508d37de79da7051b43c7          0  
    031b354668d5ceefc7b4bb3ba57664d4   00d92f99f43508d37de79da7051b43c7          0  
    03810188291c60318b5b0da566c266fb   00d92f99f43508d37de79da7051b43c7          0  
    054d563b447b317f56d940f5e3dd7b39   00d92f99f43508d37de79da7051b43c7        100  
    05695a60b69eb9a0f6e781debe23e9cc   00d92f99f43508d37de79da7051b43c7        100  
    062af6b4d9f7708cfd5e659cd13a3726   00d92f99f43508d37de79da7051b43c7         50  
    081507e638fca84980f88a3c3f5cd1fa   00d92f99f43508d37de79da7051b43c7          0  
    099c2e138f83bf0366539bddfda6b2e2   00d92f99f43508d37de79da7051b43c7          0  
    09fc005ad2872886a676a2f4197ce018   00d92f99f43508d37de79da7051b43c7          0  
    0a00649f54947198768fa954f8756563   00d92f99f43508d37de79da7051b43c7          0  
    0a21a0cbd50fe6558b13d773effc9eb1   00d92f99f43508d37de79da7051b43c7          0  
    0a302a7b505844998614e26c7c26d4a0   00d92f99f43508d37de79da7051b43c7          0  
    0a4642a77d52197c97f5d592966b68d7   00d92f99f43508d37de79da7051b43c7          0  
    0a74e8eea755f3ab33162a52dc87bb5d   00d92f99f43508d37de79da7051b43c7        100  
    0bb9626cc72bbfaf9ae174a022ceb086   00d92f99f43508d37de79da7051b43c7          0  
    0c65f80fcfe79b0c4732a7ebc645da8c   00d92f99f43508d37de79da7051b43c7          0  
    0ceb8b624ea012dea6d0c3705d4f547e   00d92f99f43508d37de79da7051b43c7          0  
    0d5c37ddbc9800bfc84774afe4b36faa   00d92f99f43508d37de79da7051b43c7         50  
    0d5fb33b90b1825b0003a1573d7477fe   00d92f99f43508d37de79da7051b43c7          0  
    0d6c6c25cf34819e50fd97318db9b699   00d92f99f43508d37de79da7051b43c7          0

Last edited by Andrew Musau; 16 Jan 2019, 13:16.

John Kirk

Join Date: Apr 2017

Posts: 57
#3

16 Jan 2019, 15:30

Thanks so much for your help, Andrew Musau. This makes a lot of sense. However, outside of the example I have given, I am working with a very large data set, so I am wondering how I would code the last line (gen similarity = ...), as I would not be able to write out the code for thousands of variables.

Andrew Musau

Join Date: Oct 2014
Posts: 10078

16 Jan 2019, 16:47

The same indicators that we used to create the total variable can be re-used.

Code:

preserve
rename * *2
tempfile id2
save `id2'
restore
cross using `id2'
drop if assignee_id== assignee_id2
drop if assignee_id< assignee_id2
forval i=1/5{
gen tlaw_`i'= cond((law_`i'+law_`i'2) > 1, 1, (law_`i'+law_`i'2))
}
egen total= rowtotal(tlaw_*)
forval i=1/5{
replace tlaw_`i'= cond((law_`i'+law_`i'2) == 2, 1, 0)
}
egen count= rowtotal(tlaw_*)
gen similarity= (count/total)*100
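
If the number of indicators is not known in advance, the hard-coded 1/5 range can probably be avoided as well. The following is a minimal sketch under assumptions, not a tested solution for the full data: the toy firms a, b, c and the three indicators are made up, and the local macro name lawvars is arbitrary. The variable names are captured with unab before crossing, because afterwards law_* would also match the renamed copies.

```stata
clear
input str1 id byte(law_1 law_2 law_3)
"a" 1 0 1
"b" 1 1 0
"c" 0 0 1
end

* Capture the indicator names BEFORE crossing, whatever their number
unab lawvars : law_*

preserve
rename * *2
tempfile id2
save `id2'
restore
cross using `id2'
drop if id <= id2

* Accumulate the intersection (count) and the union (total) one indicator at a time
gen int count = 0
gen int total = 0
foreach v of local lawvars {
    replace count = count + (`v' == 1 & `v'2 == 1)
    replace total = total + (`v' == 1 | `v'2 == 1)
}
gen similarity = 100 * count / total
list id id2 count total similarity, noobs clean
```

This keeps only two running totals instead of one tlaw_ variable per indicator, which may matter when there are thousands of them.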

John Kirk

Join Date: Apr 2017

Posts: 57
#5

17 Jan 2019, 08:34

Oh right, that's smart. Thanks Andrew!

John Kirk

Join Date: Apr 2017
Posts: 57

17 Jan 2019, 12:05

So I'm not sure whether this deserves a new topic or not but let me start by trying to pose my question here:

I am still trying to measure the Jaccard similarity (thanks on that front to Andrew Musau) but my data set is a bit too large to use the initial approach that I was planning on using. I am starting with the following data structure:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str32(assignee_id lawyer_id)
"914b749bd9f0e7a256216f4de7f2b5a6" "9d09de38cd429c759bb62d889befa5c6"
"58177709a5aa0d0850859da5ccc0a468" "09f6e19b2bd8da5cc40ca48edd56ea94"
"d960d2575fbdb96d5740a47a1c2360df" "7e4642c6bd279509097d906dfe4db5fd"
"0d5c37ddbc9800bfc84774afe4b36faa" "47fde3d66cd164adcb2a02927f714271"
"1d6bdd879a4950b07cad461e3bd31a05" "3065ee7a0f5db92441a29f3026aee12a"
"ab8fca5cbb381916bf75767f9d4a8d40" "9d09de38cd429c759bb62d889befa5c6"
"03810188291c60318b5b0da566c266fb" "c7fba32a9af42df344510a9dc12180ad"
"a18ca3c22b82fef7af1efd9b0712712f" "34c1475b3954bba9b4e911067e8b1193"
"d8bef40b1fe4378272a6dda74244b1f4" "13694a440357f873e7d310efc6af3ce8"
"fd5c3510323696410e42f08bbbc9ae73" "b0ddc5c538cbac817a71b3bc57d359a5"
"80936ad23fa87f9d8a07e42a0171ed71" "8729dfa7dc300b6311c361b59dadf9fe"
"b2b9cab4ede9a4760ada3c125cef95f8" "1f4f411279658c6b0ed0d452a0a8830a"
"cd43f1d18d3947d2ccd2a25e3a7eba2a" "30a12e049680fdf710be863654a77b49"
"9be1e7000d02e9568cb8315543f1cf99" "cba1ad56a859ce2f9b93360098d9192d"
"b9bed5c26d82998aae60c288b0ce44c8" "c2aac2a808e169e6553874705dc129cd"
"f0c20278be910af51be78ebe77604cd7" "5210d577a2e578b9665923f9794340c9"
"2c4293e2a32ced00126c738309f31d3c" "65dba96232f2dbd70d4be1cfa5172b61"
"1d3402344f70c0a10e277a3bb3e8c862" "478acb28c34aea8efc75c606485fb2ad"
"6fde9a279d234bdd67598cf9d3b39581" "f71b2d688d79df83a0670e4dcad04545"
"b66653d47d8e39f3a1d5c392ca5e3a1e" "5210d577a2e578b9665923f9794340c9"
"1ee8fae2374654bcd82a1d668f50acde" "8ffec053d94dfabc1a343f88ae72e6a5"
"0a21a0cbd50fe6558b13d773effc9eb1" "4a6633dae5cde7d2b8fe2c624d45781f"
"031b354668d5ceefc7b4bb3ba57664d4" "3e50a746174ad603ab4e5f28a2a24b90"
"2cbd222707fc3fc258301155967fe016" "0339fd5da00b6be59b8db084daab77fa"
"a73ce9b7f1aee36695dcf0f6638ee1d8" "968c9325b421773e8a2803f550776a4a"
"cd43f1d18d3947d2ccd2a25e3a7eba2a" "30a12e049680fdf710be863654a77b49"
"f2b7bf2808a4a1e2e4162cc5016b918b" "0fa9ae8e22f6f35d91b08f5c68c4efb0"
"3f71a4facd890af6522d475b26815d96" "d3d77f528852261e04ff90d588a4dd6e"
"b8e28428d90c4b0048da119925b1d6cc" "364d514180b7e743654ca585d3db1f8f"
"7efb83b4195b23919eabe7c95a9e617e" "37cb4484fde0e6c45d0eed06568cb827"
end

This is essentially the same data as before without having used

Code:

 tab lawyer_id, gen(law_)

to create dummies for the lawyer_id variable. The reason I cannot do this is that I have more than 120k categories for that variable, so these would be too many new variables for Stata to construct. Is there any way I could get around this restriction? Is there a way to calculate the Jaccard similarity without creating dummies for all categories?

P.S.: I know I could take a sample of the data, but I would really like to avoid doing so. (Also there seems to be a 12k restriction on tab tables, which means I would have to figure out another way to generate the dummies).

Andrew Musau

Join Date: Oct 2014

Posts: 10078
#7

18 Jan 2019, 08:04

What flavor of Stata do you have? For such large data sets, Stata MP is preferable because it allows a much higher variable limit. Stata SE is capped at 32,767.

Limit               Stata/MP             Stata/SE
# of observations   1,099,511,627,775    2,147,483,647
# of variables      120,000              32,767
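
To check what applies on a given installation, something like the following should work. It is a minimal sketch: c(MP) and c(maxvar) are Stata's stored system values, and the 120000 in the commented-out line simply mirrors the MP limit quoted above.

```stata
* 1 if this is Stata/MP, 0 otherwise
display c(MP)
* Maximum number of variables currently allowed
display c(maxvar)
* On MP, raise the limit (assumption: no data in memory and enough RAM)
* set maxvar 120000
```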

You need to have two firms in the same observation (the reference firm and the comparison firm) because you are calculating pairwise similarity. It may be the case that you have 120,000+ law firms in your data set, but are these distinct?

(Also there seems to be a 12k restriction on tab tables, which means I would have to figure out another way to generate the dummies).

Check out this link for a work-around. If you can successfully generate a data set with one set of dummies for the law firms, the rest can be done in a matrix in Mata and the result exported back to Stata since you just need a single variable at the end, i.e., the similarity index.
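
If generating 120k dummies proves impractical, one possible alternative that stays in the long firm-lawyer layout is sketched below. This is not the dummy-based approach described above, only a sketch under assumptions: the toy IDs (a, b, c, l1, l2, l3) and the names n, n2, shared and other are made up for illustration. It relies on the identity |A ∪ B| = |A| + |B| - |A ∩ B|, so only pairs of firms that share at least one lawyer are ever formed.

```stata
clear
input str1 assignee_id str2 lawyer_id
"a" "l1"
"a" "l3"
"b" "l1"
"b" "l2"
"c" "l3"
"c" "l3"
end

* One row per distinct firm-lawyer link
duplicates drop assignee_id lawyer_id, force

* n = number of distinct lawyers per firm, i.e. |A|
bysort assignee_id: gen long n = _N

preserve
rename (assignee_id n) (assignee_id2 n2)
tempfile other
save `other'
restore

* Pair up firms through a common lawyer; each row is one shared lawyer
joinby lawyer_id using `other'
drop if assignee_id <= assignee_id2

* shared = |A ∩ B| = number of rows per pair of firms
contract assignee_id assignee_id2 n n2, freq(shared)
gen similarity = 100 * shared / (n + n2 - shared)
list, noobs clean
```

Pairs that do not appear in the result share no lawyer and therefore have a similarity of 0. Note that joinby can still produce a very large data set if a single lawyer works with many firms.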

Last edited by Andrew Musau; 18 Jan 2019, 08:08.
John Kirk

Join Date: Apr 2017

Posts: 57
#8

18 Jan 2019, 09:41

Originally posted by Andrew Musau

What flavor of Stata do you have? For such large data sets, Stata MP is preferable because it allows a much higher variable limit. Stata SE is capped at 32,767.

I do have access to the MP version so I could potentially work with 120k variables.

Originally posted by Andrew Musau

You need to have two firms in the same observation (the reference firm and the comparison firm) because you are calculating pairwise similarity. It may be the case that you have 120,000+ law firms in your data set, but are these distinct?

Originally posted by Andrew Musau

Check out this link for a work-around. If you can successfully generate a data set with one set of dummies for the law firms, the rest can be done in a matrix in Mata and the result exported back to Stata since you just need a single variable at the end, i.e., the similarity index.

Thanks, this seems like it should work. After restricting my sample a little, I managed to get the number of law firms just below 120k. I have, however, never worked with Mata before. Do you know how I would run (essentially) the same code that you previously suggested in Mata?

Again, thank you so much for your help Andrew, I really appreciate it!

Andrew Musau

Join Date: Oct 2014
Posts: 10078

21 Jan 2019, 14:00

Apologies John Kirk, I am on vacation this month so I am slow to reply. The calculations require simple matrix operations. Here, in addition, I use moremata from SSC.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str32 assignee_id byte(law_1 law_2 law_3 law_4 law_5)
"00d92f99f43508d37de79da7051b43c7" 1 0 0 0 0
"00e5262f320cbda9f15490debbe80858" 0 1 0 1 0
"031b354668d5ceefc7b4bb3ba57664d4" 0 0 0 1 1
"03810188291c60318b5b0da566c266fb" 0 1 0 1 0
"054d563b447b317f56d940f5e3dd7b39" 1 0 0 0 0
"05695a60b69eb9a0f6e781debe23e9cc" 1 0 0 0 0
"062af6b4d9f7708cfd5e659cd13a3726" 1 0 1 0 0
"081507e638fca84980f88a3c3f5cd1fa" 0 0 0 0 0
"099c2e138f83bf0366539bddfda6b2e2" 0 0 1 0 0
"09fc005ad2872886a676a2f4197ce018" 0 0 0 0 0
"0a00649f54947198768fa954f8756563" 0 0 0 0 0
"0a21a0cbd50fe6558b13d773effc9eb1" 0 0 1 0 0
"0a302a7b505844998614e26c7c26d4a0" 0 1 0 1 1
"0a4642a77d52197c97f5d592966b68d7" 0 1 1 1 1
"0a74e8eea755f3ab33162a52dc87bb5d" 1 0 0 0 0
"0bb9626cc72bbfaf9ae174a022ceb086" 0 1 0 0 0
"0c65f80fcfe79b0c4732a7ebc645da8c" 0 1 0 0 0
"0ceb8b624ea012dea6d0c3705d4f547e" 0 1 1 0 0
"0d5c37ddbc9800bfc84774afe4b36faa" 1 0 0 0 1
"0d5fb33b90b1825b0003a1573d7477fe" 0 0 0 0 0
"0d6c6c25cf34819e50fd97318db9b699" 0 1 0 0 0
"0ee26da954c6572b783432f619a301e3" 1 0 0 0 0
"0f4a6ddb6c4a854440e1123924820706" 0 0 0 0 1
"0fa5a08e051f6bb467854f4bbb913a46" 0 0 1 1 0
"1005528d1a3c548b2403fba94f0927f5" 1 0 0 1 0
"107da3bb737c53c0d39645f72ede8b86" 0 0 0 0 1
"10b108b4ee97bab2304d092590c0bf7c" 0 0 0 0 1
"11127e943b93352979514b124179eb94" 1 0 0 1 0
"11f00a94b4fe1138e00af83137db2fac" 0 0 1 0 0
"134d75dd2f4984f02db90d441336fd2e" 0 1 1 0 0
end

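* Requires moremata for mm_cond(): ssc install moremata
preserve
rename * *2
tempfile id2
save `id2'
restore
cross using `id2'
* Keep each unordered pair of distinct firms once
drop if assignee_id<= assignee_id2
* Views onto the indicators of the reference firm (D) and the comparison firm (D2)
mata:  st_view(D=., ., "law_1 - law_5")
mata:  st_view(D2=., ., "law_12 - law_52")
gen count=.
gen total=.
* D5 marks partners used by either firm (union), D6 partners used by both (intersection)
mata: D5 = mm_cond((D+D2) :> 1, 1, (D+D2))
mata: D6 = mm_cond((D+D2) :> 1, 1, 0)
* Row sums via multiplication by a column of ones
mata: J=J(5, 1, 1)
mata: st_store(., "count", D6*J)
mata: st_store(., "total", D5*J)
gen similarity= (count/total)*100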

You should also be able to recreate in Mata what the cross command does. If you fail to find code in the archives, let me know and I can work something out.
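
As a starting point, here is a minimal sketch of one way to skip cross entirely, under assumptions: the toy data (firms a, b, c with three indicators) and the names X, inter, sizes, R, uni and S are made up for illustration. With X the firms-by-partners 0/1 matrix, X*X' holds the number of shared partners for every pair of firms, and the union follows from |A| + |B| - |A ∩ B|.

```stata
clear
input str1 id byte(law_1 law_2 law_3)
"a" 1 0 1
"b" 1 1 0
"c" 0 0 1
end

mata:
X     = st_data(., ("law_1", "law_2", "law_3"))  // firms x partners 0/1 matrix
n     = rows(X)
inter = X * X'                  // inter[i,j] = partners shared by firms i and j
sizes = rowsum(X)               // number of partners per firm
R     = sizes * J(1, n, 1)      // R[i,j] = sizes[i]
uni   = R + R' - inter          // |A| + |B| - |A ∩ B|
S     = 100 * (inter :/ uni)    // missing where both firms have no partners
S
end
```

S is a firms-by-firms matrix, so memory grows with the square of the number of firms; for very many firms the computation would likely have to be done in blocks.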

David Benson

Join Date: Oct 2018
Posts: 489

#10

21 Jan 2019, 14:41

Sorry to jump into this late--it sounds like you have worked out what John needs to do with his data. I just thought I would point out a couple of string similarity measures that Stata has:

See the Statalist entry "New package for phonetic string encoding and string distance/similarity metrics".

Code:

 net inst strutil, from("http://wbuchanan.github.io/StataStringUtilities/")

project repository (on GitHub) is https://github.com/wbuchanan/StataStringUtilities/

See also help cluster programming (under cluster measures). help measure_option lists Jaccard, Russell and Rao, Hamann, and others:

binary_measure   Description
matching         simple matching similarity coefficient
Jaccard          Jaccard binary similarity coefficient
Russell          Russell and Rao similarity coefficient
Hamann           Hamann similarity coefficient
Dice             Dice similarity coefficient
antiDice         anti-Dice similarity coefficient
Sneath           Sneath and Sokal similarity coefficient
Rogers           Rogers and Tanimoto similarity coefficient
Ochiai           Ochiai similarity coefficient
Yule             Yule similarity coefficient
Anderberg        Anderberg similarity coefficient
Kulczynski       Kulczyński similarity coefficient
Pearson          Pearson's phi similarity coefficient
Gower2           similarity coefficient with same denominator as Pearson
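
For data small enough to fit in a Stata matrix, these built-in measures can be tried through matrix dissimilarity. The following is only a sketch: the toy data (firms a, b, c with three indicators) and the matrix name S are made up. The result is a proportion between 0 and 1 rather than a percentage, and it is worth checking in help measure_option how pairs of all-zero observations are treated.

```stata
clear
input str1 id byte(law_1 law_2 law_3)
"a" 1 0 1
"b" 1 1 0
"c" 0 0 1
end

* Observation-by-observation matrix of Jaccard similarity coefficients
matrix dissimilarity S = law_1-law_3, Jaccard
matrix list S
```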
