I have two sets of binary variables (one with ~80 variables and one with ~180) and about 15000 observations, each with certain characteristics like year and genre.
I would like to measure the pairwise cosine similarity between observations for each set of binary variables, and then average those similarities by certain characteristics (e.g. to compare one observation against all earlier observations in the same genre). As best I can tell, the way to do this is to calculate a matrix of all pairwise combinations (or several matrices, because there are too many observations for one matrix in my version of Stata) with something along the lines of the matrix dissimilarity command shown in the code below.
However, because I need to compute the averages by several characteristics (e.g. year, genre), I think I need one long list of cosine similarities for all pairwise combinations, to which I can then attach the characteristics of each observation. How do I generate this list from the matrix?
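For what it's worth, here is a rough sketch of the kind of reshaping I have in mind. It assumes the matrix is called matrix1 (as in my code) and fits within my Stata's matrix limits; the names sim, i, j, cosine and the file name pairs are placeholders I made up, and I have not tried this on the full data:
Code:
* sketch only: assumes matrix1 already exists in memory
preserve
clear
svmat double matrix1, names(sim)   // one variable per matrix column: sim1, sim2, ...
generate long i = _n               // row number = observation number in the original data
reshape long sim, i(i) j(j)        // one row per (i, j) pair
keep if i < j                      // the matrix is symmetric, so keep each pair once
rename sim cosine
save pairs, replace                // "pairs" is a hypothetical file name
restore
The idea would be to then merge year and genre onto i and j by observation number and compute the averages from that file.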
Using MDS, I was able to do something similar to measure Euclidean distance using the code below, but it took about 2 full days to calculate and I had to reduce the number of binary variables for each set:
Hopefully that makes sense and I have provided everything needed; my apologies if not. I don't think this is a Mata-specific question, but if the post belongs in that forum, please let me know.
Example data:
id   | year | genre   | var1 | var2 |
obs1 | 2007 | Scifi   | 0    | 1    |
obs2 | 2010 | Fantasy | 1    | 0    |
obs3 | 2015 | Scifi   | 1    | 1    |
Code:
matrix dissimilarity matrix1 = var1 var2 var3 var4, angular
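Since the full matrix for about 15000 observations is too large for my version of Stata, an alternative I have been wondering about is to skip the matrix and compute the averages directly in Mata. This is only a sketch under assumptions: genre_id is a numeric version of genre that I would create first, avg_cos_prior is a made-up name for the new variable, none of the variables have missing values, and my Stata version has selectindex():
Code:
egen genre_id = group(genre)
mata:
X   = st_data(., "var1 var2 var3 var4")
yr  = st_data(., "year")
g   = st_data(., "genre_id")
nrm = sqrt(rowsum(X:^2))           // vector length of each observation
avg = J(rows(X), 1, .)
for (i = 1; i <= rows(X); i++) {
    // earlier observations in the same genre that have a nonzero vector
    sel = selectindex((g :== g[i]) :& (yr :< yr[i]) :& (nrm :> 0))
    if (length(sel) == 0 | nrm[i] == 0) continue
    // cosine = dot product divided by the product of the two vector lengths
    avg[i] = mean((X[sel, .] * X[i, .]') :/ (nrm[sel] :* nrm[i]))
}
st_store(., st_addvar("double", "avg_cos_prior"), avg)
end
If this is right, each observation would get the mean cosine similarity with all earlier observations in its genre, without ever storing the full set of pairs, but I am not sure it is the best approach.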
Code:
mds var1 var2 var3, id(id) method(classical) measure(L2)
predict d1, pairwise(distances) full saving(mds2)