From symmetric to asymmetric matrix

dilandilan

Join Date: Sep 2014

Posts: 5
#16

22 Sep 2014, 01:11

I agree with Oded Mcdossi's last comment. That is good.

Graduated from Soran University with First Class Degree with Honours in Computer Science.

Aspen Chen

Join Date: Apr 2014
Posts: 114

#17

23 Sep 2014, 02:23

Originally posted by Oded Mcdossi:

Calculating the binary measures on large co-occurrence matrix becomes tedious.

Oded, not sure if you still need it, but here's a program that should make the process less onerous. You input the name of the co-occurrence matrix (assuming the diagonal fits Modesto's description), the total number of cases, and the similarity measure. It uses the built-in similarity-measure parsing routine, so all binary options listed in -help measure_option- should work. The program returns a proximity matrix named PRX, which you could then run through -mdsmat-.

Code:

capture program drop coo_prx
program coo_prx
syntax anything, sum(integer) Measure(string)    // coo_prx matrixname, sum(number_of_cases) m(measure_type)
loc COO `anything'            // co-occurrence matrix
parse_dissim `measure'
loc measure `s(dist)'
if "`s(binary)'"==""    {
    di as error "only binary measures accepted"
    exit 198
}
qui mat list `COO'            // assert matrix exists
if issymmetric(`COO')!=1    {
    di as error "matrix `COO' not symmetric"
    exit 505
}
mata: dissmat("`COO'", `sum', "`measure'")
mat coln PRX=`:colnames `COO''
mat rown PRX=`:rownames `COO''
mat list PRX, f(%8.4f)        // proximity matrix stored in PRX
end

mata:
function dissmat(string scalar cooname, real scalar sum, string scalar measure)
{
    // cooname holds the name of the Stata matrix; read its values into COO
    COO=st_matrix(cooname)
    PRX=J(cols(COO),rows(COO),.)
    for (i=1;i<=rows(COO);i++)    {
        for (j=1;j<=rows(COO);j++)    {
            if (i!=j)                     {
                a=COO[i,j]
                b=COO[j,j]; b=b-a
                c=COO[i,i]; c=c-a
                d=sum-(a+b+c)
                if (measure=="matching") PRX[i,j]=(a+d)/(a+b+c+d)
                else if (measure=="Jaccard") PRX[i,j]=a/(a+b+c)
                else if (measure=="Russell") PRX[i,j]=a/(a+b+c+d)
                else if (measure=="Hamann") PRX[i,j]=((a+d)-(b+c))/(a+b+c+d)
                else if (measure=="Dice") PRX[i,j]=(2*a)/(2*a+b+c)
                else if (measure=="antiDice") PRX[i,j]=a/(a+2*(b+c))
                else if (measure=="Sneath")    PRX[i,j]=(2*(a+d))/(2*(a+d)+(b+c))
                else if (measure=="Rogers")    PRX[i,j]=(a+d)/((a+d)+2*(b+c))
                else if (measure=="Ochiai")    PRX[i,j]=a/sqrt((a+b)*(a+c))
                else if (measure=="Yule") PRX[i,j]=(a*d-b*c)/(a*d+b*c)
                else if (measure=="Anderberg") PRX[i,j]=(a/(a+b) + a/(a+c) + d/(c+d) + d/(b+d))/4
                else if (measure=="Kulczynski")    PRX[i,j]=(a/(a+b) + a/(a+c))/2
                else if (measure=="Pearson") PRX[i,j]=(a*d-b*c)/sqrt((a+b)*(a+c)*(d+b)*(d+c))
                else if (measure=="Gower2")    PRX[i,j]=a*d/sqrt((a+b)*(a+c)*(d+b)*(d+c))
                else _error(3888)
                
            }
            else {
                PRX[i,j]=0
            }
        }
    }
    st_matrix("PRX",PRX)
}
end

Save this code as coo_prx.ado in your personal ado folder.
Just for demonstration, here is a test with random data consisting of 70 variables and 10,000 cases.

Code:

clear
mat drop _all
// Some random data
set obs 10000
loc n=70
forval i=1/`n'    {
    gen v`i'=floor(2*runiform())
}
// produce co-occurrence matrix
mat C=J(`n',`n',.)
loc nlist ""
forval i=1/`n'    {
    loc nlist="`nlist' v`i'"
}
mat coln C=`nlist'
mat rown C=`nlist'
forval i=1/`n'{
    forval j=1/`n'    {
        if (`i'!=`j')    {
            qui count if (v`i'==v`j' & v`i'==1)
            mat C[`i',`j']=`r(N)'
        }
        else {
            qui count if v`i'==1
            mat C[`i',`i']=`r(N)'
        }
    }
}
// run the program
coo_prx C, sum(10000) m(match) // sum() takes the total number of cases (10,000 here), and m() takes any of the binary similarity measures
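
The double loop above issues one -count- per cell. As a hedged alternative, assuming every variable is coded strictly 0/1 and has no missing values, the same co-occurrence matrix is the cross-product X'X, which -matrix accum- with the noconstant option returns in one step. The four-case toy data below are made up purely for illustration.

```stata
clear
input v1 v2 v3
1 1 0
1 0 1
0 1 1
1 1 1
end
// X'X of 0/1 variables: diagonal = occurrences, off-diagonal = co-occurrences
matrix accum C = v1 v2 v3, noconstant
matrix list C    // 3 on the diagonal, 2 in every off-diagonal cell
```

Note that -matrix accum- drops any case with a missing value on any variable, so the result can differ from the loop when the data contain missings.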


Modesto Escobar

Join Date: Apr 2014
Posts: 17

#18

23 Sep 2014, 09:55

Good program, Aspen! I like coo_prx a lot, because you can choose among many distance measures. But perhaps that can be a little confusing for Oded.
I will try to explain a program that may offer him a solution. I don't know whether I have understood him correctly. Let's see.
First, I suppose that he has a co-occurrence matrix that he wants to convert into an MDS graph via binary distances.
The first step is to fill the diagonal of the co-occurrence matrix with the occurrences: the number of citations of every paper. It could also be convenient to keep in mind the total number of citations across all the papers.
The second step is to convert this matrix into a distance matrix. That is what Aspen's coo_prx or my todistance command does.
Once we have a distance matrix, the Stata command mdsmat can be applied to obtain a multidimensional scaling solution.
Here is my code:

Code:

capture program drop todistance
program define todistance, rclass // to convert a co-occurrence matrix into distances
// If no total is given as a second argument, use the sum of the diagonal
if "A`2'"=="A" {
local 2 trace(`1')
}
matrix J=diag(J(1,`=rowsof(`1')',1))
matrix K=diag(J(1,`=rowsof(`1')',1))
forvalues X=2/`=rowsof(`1')' {
 forvalues Y=1/`=`X'-1' {
   // 2x2 table for the pair (X,Y): both, X only, Y only, neither
   matrix O`X'_`Y'=J(2,2,.)
   matrix O`X'_`Y'[1,1]=`1'[`X',`Y']
   matrix O`X'_`Y'[1,2]=`1'[`X',`X']-`1'[`X',`Y']
   matrix O`X'_`Y'[2,2]=`2'-`1'[`X',`X']-`1'[`Y',`Y']+`1'[`X',`Y']
   matrix O`X'_`Y'[2,1]=`1'[`Y',`Y']-`1'[`X',`Y']
   matrix rownames O`X'_`Y'=Paper`X'(Si) Paper`X'(No)
   matrix colnames O`X'_`Y'=Paper`Y'(Si) Paper`Y'(No)
   // matlist O`X'_`Y'
   matrix J[`X',`Y']=O`X'_`Y'[1,1]/(O`X'_`Y'[1,1]+O`X'_`Y'[1,2]+O`X'_`Y'[2,1])
   matrix J[`Y', `X']=J[`X',`Y']
   matrix K[`X',`Y']=((O`X'_`Y'[1,1]/(O`X'_`Y'[1,1]+O`X'_`Y'[1,2]))+(O`X'_`Y'[1,1]/(O`X'_`Y'[1,1]+O`X'_`Y'[2,1])))/2
   matrix K[`Y', `X']=K[`X',`Y']
   return matrix O`X'_`Y'=O`X'_`Y'
 }
}
matrix colnames J=`: colnames `1''
matrix rownames J=`: rownames `1''
matrix colnames K=`: colnames `1''
matrix rownames K=`: rownames `1''

matlist J, title("Jaccard distances") format(%3.2f)
matlist K, title("Kulczynski distances") format(%3.2f)

end

// Input the co-occurrence matrix with input (or with the data editor).
clear
input p1 p2 p3 p4
60 10 20 25 
10 50 30 15 
20 30 40 12 
25 15 12 30 
end
// Convert your data file into a matrix (A)
mkmat p1-p4, mat(A)
matrix rownames A=`: colnames A'

// Convert your co-occurrence matrix (a similarity matrix, but without a diagonal of 1's)
// into a distance (dissimilarity) matrix
// (Note that in this case you obtain Jaccard -J- and Kulczynski -K-)
todistance A // you may add a number if the total is different from the sum of the diagonal

// Once you have J (Jaccard) and K (Kulczynski), you can apply MDS to these matrices.
// s2d(standard) tells mdsmat to convert the similarities to dissimilarities.
mdsmat J, s2d(standard) // you can add other options as convenient (see help mdsmat)
mdsconfig, name(Jaccard, replace)
mdsmat K, s2d(standard)
mdsconfig, name(Kulczynski, replace)
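
To make the arithmetic inside todistance concrete, here is a minimal sketch that works through a single pair (p2 and p1) of the example matrix A by hand. The scalar names (ntot, n11, n10, n01, n00, jac, kul) are made up for this illustration and are not part of the program; the total is assumed to be the trace of A, which is what todistance uses when no second argument is given.

```stata
matrix A = (60,10,20,25 \ 10,50,30,15 \ 20,30,40,12 \ 25,15,12,30)
scalar ntot = trace(A)                 // default total: 60+50+40+30 = 180
scalar n11  = A[2,1]                   // p2 and p1 together: 10
scalar n10  = A[2,2] - n11             // p2 without p1: 40
scalar n01  = A[1,1] - n11             // p1 without p2: 50
scalar n00  = ntot - n11 - n10 - n01   // neither: 80
scalar jac  = n11/(n11 + n10 + n01)
scalar kul  = (n11/(n11 + n10) + n11/(n11 + n01))/2
display "Jaccard = " %5.3f jac "   Kulczynski = " %5.3f kul
```

For this pair the Jaccard coefficient is 0.10 and the Kulczynski coefficient is about 0.18, which should match the [2,1] cells of J and K.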

I hope this helps!


Aspen Chen

Join Date: Apr 2014

Posts: 114
#19

23 Sep 2014, 13:30

Great, Modesto. That's some neat and clean code. Mine was a bit messy as I was concerned about the speed as well as the potential for expansion. But even with a 70x70 matrix, the difference is probably negligible for most of us.
Oded Mcdossi

Join Date: Jun 2014

Posts: 577
#20

23 Sep 2014, 15:10

Dear Aspen, thank you very much for the program (chapeau!). It is really helpful to me, and I am sure it will help others. Like Modesto, what I liked in this code is that it automates a variety of binary measures. The code deals with zero co-occurrences as I expected. I guess you just need, at the end, to set the diagonal of the similarity matrix to 1 (as Modesto's -todistance- program does).
Many thanks also to you, Modesto. A step-by-step comparison of both programs, together with your explanations, was exactly what I was looking for.

I'm a bit confused about how the diagonal is filled. Modesto recommends putting the total number of occurrences of each item on the diagonal. Is it possible to extract that number from a given symmetric co-occurrence matrix (is it the row sum)?
Modesto also advises supplying the grand total of occurrences. I'm trying to figure out the purpose of adding the diagonal and the grand total. Is it a way to normalize the size of the objects?

Warm regards.

Aspen Chen

Join Date: Apr 2014
Posts: 114

#21

23 Sep 2014, 16:57

Oded,
The questions are somewhat unclear to me. But I think we are now assuming that, in your data, co-occurrence always and only takes place in pairs (for example, paper 1 is always cited with one and only one other paper). Under this restriction, two things hold:

a) The row sums (or column sums) of the co-occurrence matrix (excluding the diagonal) equal the total number of occurrences of the corresponding variable.

b) Assuming the total number of occurrences of each variable is now placed along the diagonal, the sum of these values is twice the grand total.

These two types of values are essential to the calculation of b, c, and d used for the binary measures.

Here's some code that can be run and played with to better understand the process.

Code:

mat A=(.,20,20,10\20,.,15,10\20,15,.,25\10,10,25,.) // fake co-occurrence matrix with missing along the diagonal
mat list A // display matrix A
mata: st_matrix("B",rowsum(st_matrix("A"))) // returns the row sums in 4x1 matrix named B; see *1* for Stata equivalent

// this loop fills the diagonal with row sums
forval i=1/`=colsof(A)' {
    mat A[`i',`i']=B[`i',1]
}

// calculate the grand total from the diagonal (could just use matrix B, but let's pretend we don't have B); see *2* for Stata equivalent
mat C=vecdiag(A) // extracts the diagonal in a 1x4 matrix named C
mata: st_local("ttl",strofreal(rowsum(st_matrix("C"))/2))    // calculate grand total and store in Stata local `ttl'

mat list A

/*------------
symmetric A[4,4]
    c1  c2  c3  c4
r1  50
r2  20  45
r3  20  15  60
r4  10  10  25  45
----------------*/

/*----------------
        v1
       1 0
v2   1 a b
     0 c d
----------------*/

// calculate a-d based on A[2,1]
loc a=A[2,1]                // 20
loc b=A[2,2]-`a'            // 25 (A[2,2] from the diagonal)
loc c=A[1,1]-`a'            // 30 (A[1,1] from the diagonal)
loc d=`ttl'-(`a'+`b'+`c')   // 25 (`ttl' from the grand total)

di "| a=`a' | b=`b' | c=`c' | d=`d' |"

//----Stata equivalent of mata lines----//
*1*
/*---
mat B=J(`=colsof(A)',1,0)
forval i=1/`=colsof(A)'    {    
    forval j=1/`=colsof(A)'    {
        if `i'!=`j'    {
            mat B[`i',1]=B[`i',1]+A[`i',`j']
        }
    }
}
----*/

*2*
/*----
loc ttl=0
forval i=1/`=colsof(C)'    {
    loc ttl=`ttl'+A[`i',`i']
}
loc ttl=`ttl'/2
----*/

Last edited by Aspen Chen; 23 Sep 2014, 17:02.


Oded Mcdossi

Join Date: Jun 2014

Posts: 577
#22

23 Sep 2014, 18:19

I thought so too. In one of the previous examples, Modesto put sums on the diagonal that were not derived from the table, and I guess that confused me a bit.
Excellent, I got the point. Thank you again.
One last question, I hope: is it possible to use binary similarity measures in classical MDS, or is it better to use non-metric MDS and treat the distances between pairs of objects as ordinal?
This question deserves a separate post, but it is also a continuation of the discussion above.
Modesto Escobar

Join Date: Apr 2014

Posts: 17
#23

25 Sep 2014, 02:41

Let me explain, with your initial example, why I put information on the diagonal of the symmetric matrix, and why this frequency is (or can be) different from the sum of the co-occurrences.
Let us suppose that we are analyzing 5 papers (Paper1-Paper5) citing 4 other papers. The cited papers are PaperA, PaperB, PaperC and PaperD.
The asymmetrical matrix could be as follows:

        PaperA  PaperB  PaperC  PaperD
Paper1       0       0       0       0
Paper2       1       0       0       0
Paper3       1       1       0       0
Paper4       1       1       1       0
Paper5       1       1       1       1

This means that we are studying the citations made by 5 papers. The first (Paper1) makes no citations and the last (Paper5) makes 4: it cites PaperA, PaperB, PaperC and PaperD.
Let this matrix be Y. Then the co-citation matrix would be Y*Y':
        Paper1  Paper2  Paper3  Paper4  Paper5
Paper1       0       0       0       0       0
Paper2       0       1       1       1       1
Paper3       0       1       2       2       2
Paper4       0       1       2       3       3
Paper5       0       1       2       3       4

In this co-citation matrix, the diagonal holds the number of citations made by every paper, and the off-diagonal cells (co-citations) hold the number of citations that each pair of papers has in common. For example, Paper4 cited A, B and C, and Paper5 also cited A, B and C, so they share 3 citations.
It is easy to see that the numbers on the diagonal are the number of citations of every paper, from 0 to 4. Note also that these diagonal numbers are not the sum of the frequencies outside the diagonal: in the second column (Paper2), the diagonal value 1 is not equal to 0+1+1+1. Why? Because the categories, in this case the citations of papers, are not mutually exclusive.
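
The product can be checked directly with Stata's matrix commands. This is only a small sketch of the example above; the matrix names Y and S are chosen here for illustration.

```stata
matrix Y = (0,0,0,0 \ 1,0,0,0 \ 1,1,0,0 \ 1,1,1,0 \ 1,1,1,1)
matrix rownames Y = Paper1 Paper2 Paper3 Paper4 Paper5
matrix colnames Y = PaperA PaperB PaperC PaperD
// Y*Y' : diagonal = citations made by each paper, off-diagonal = shared citations
matrix S = Y*Y'
matrix list S
```

In S, the Paper2 diagonal cell is 1 while the other cells of its column add up to 3, which illustrates why the diagonal cannot be recovered from the off-diagonal sums here.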
I am writing a Stata program to solve these kinds of problems. If you would like to have a look at it, it can be downloaded as follows:

Code:

net install coin, from(http://sociocav.usal.es/stata/)

It has a help file (help coin) and several data examples.
Oded Mcdossi

Join Date: Jun 2014

Posts: 577
#24

25 Sep 2014, 18:23

Dear Modesto,
Thank you for elaborating on the technique and the rationale behind it.
I installed your program -coin-, and it is exactly the kind of program I was looking for, so thanks for sharing it with me (I am familiar with -netplot- and -netsis-). However, if I understand it correctly, the program requires a kind of data that I can't supply. It requires asymmetric data, where each row is a case (a setting, in the terminology of your help file) and each column is an attribute (an incidence, in your terminology), but that just takes me back to the beginning of this thread. Could the program accept a matrix of combinations of events as input? I find -coin- very useful and versatile for analyzing the data, and it fits my needs very well. So it would be great if the program also accepted a matrix of coincidences (co-occurrence, co-membership, etc.) as input data.
Modesto Escobar

Join Date: Apr 2014

Posts: 17
#25

29 Sep 2014, 13:14

Dear Oded,
Thank you for your opinion of my program coin.
You are right. For the time being, this program requires an asymmetric matrix. It is still a beta version and I plan to incorporate improvements. One of them is to divide the program into a central program, coin, and two or three postcommands: at least pcoin for bar plots of coincidences, and gcoin for different graphs of nodes.
However, if you want to work with a symmetrical matrix right now, I recommend that you download the program again and enter your data as if every pair of co-occurring papers were a scenario. This is not exactly true, for the reason I gave in previous posts; but if you do not have the exact citation behavior of every paper, it could serve as an approximation.
As it is hard to explain the details in a post, I am giving you the exact code in case you want to apply coin to your symmetrical data.
Please be careful, because I have not sufficiently tested the weighting process in coin. I would appreciate hearing about any errors you detect.

Code:

. clear
. net install coin, from(http://sociocav.usal.es/stata) replace
. input p1 p2 p3 p4 co
  1. 1 1 0 0 10
  2. 1 0 1 0 20
  3. 1 0 0 1 25
  4. 0 1 1 0 30
  5. 0 1 0 1 15
  6. 0 0 1 1 12
  7. end
. coin p1-p4 [fweight=co], f plot(pca) p(.65)
112 scenarios. 2 p<=.65 coincidences amongst 4 events. Density: 0.33
4 events(n>=5): p1 p2 p3 p4

         Frequencies |   p1   p2   p3   p4
---------------------+--------------------
                  p1 |   55
                  p2 |   10   55
                  p3 |   20   30   62
                  p4 |   25   15   12   52

Take care: if you want to analyze 70 cited papers, you need to write 2415 lines of co-citations (70*69/2), and even so, entering data through a symmetrical matrix is an imperfect way to analyze them properly. I also changed the value of p, because there are few co-citations and there were no probable coincidences.
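
Typing one line per pair quickly becomes impractical for larger matrices. As a rough sketch, assuming the symmetric co-occurrence matrix is already stored as a Stata matrix (named A here, with the p1-p4 and co variable names from the example above), the pair-wise records can be generated with a loop. Only the cells of A above the diagonal are used; the diagonal is ignored.

```stata
clear
matrix A = (60,10,20,25 \ 10,50,30,15 \ 20,30,40,12 \ 25,15,12,30)
local n = rowsof(A)
local npairs = `n'*(`n'-1)/2
set obs `npairs'                      // one record per pair of papers
forvalues k = 1/`n' {
    generate byte p`k' = 0
}
generate long co = .
local row = 0
forvalues i = 1/`=`n'-1' {
    forvalues j = `=`i'+1'/`n' {
        local ++row
        quietly replace p`i' = 1 in `row'
        quietly replace p`j' = 1 in `row'
        quietly replace co = A[`i',`j'] in `row'   // weight = co-occurrence count
    }
}
list, noobs
```

The resulting six records reproduce the hand-typed input above, so they could be passed to coin with [fweight=co] in the same way.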
Oded Mcdossi

Join Date: Jun 2014

Posts: 577
#26

30 Sep 2014, 18:20

Thanks a lot for the detailed explanation. I'm learning to use the coin program and will get in touch if I have questions or problems. The idea of dividing the program into three related postcommands seems right to me, since the program is slow with a large data file. If you just want bar charts, there is no need to run everything again; the postcommand can simply use the last estimation. In addition, since the program uses existing procedures (such as MDS and PCA), it would be informative to report some model fit measures (the percentage of explained common variance in PCA, or stress in MDS). I guess they are already computed and only need to be printed. These are just wishes.
