  • From symmetric to asymmetric matrix

    Dear Stata listers,
    I'm interested in reshaping the symmetric matrix in Figure 1 (copied from Leydesdorff and Vaughan, 2006) into an asymmetric matrix like Figure 2. I guess this involves reshaping the data to long form, but I couldn't find the right way to do it.
    All suggestions are welcome.

    mat input A=(.,10,20,25\10,.,30,15\20,30,.,12\25,15,12,.) // symmetrical matrix, Figure 1
    mat colnames A=Paper1 Paper2 Paper3 Paper4
    mat rownames A=Paper1 Paper2 Paper3 Paper4
    matlist A
    svmat A
    Figure 1: Co-citation matrix (symmetrical matrix)
              Paper1  Paper2  Paper3  Paper4
    Paper1         .      10      20      25
    Paper2        10       .      30      15
    Paper3        20      30       .      12
    Paper4        25      15      12       .

    Figure 2: Citation Matrix (asymmetrical matrix)
                      Cited Paper A  Cited Paper B  Cited Paper C  Cited Paper D
    Citing Paper 1                1              1              0              0
    Citing Paper 2                0              0              1              1
    Citing Paper 3                0              0              1              1
    Citing Paper 4                1              1              0              0
    Citing Paper 5                0              1              0              1

    *L. Leydesdorff, L. Vaughan. Co-occurrence matrices and their applications in information science: Extending ACA to the Web environment. Journal of the American Society for Information Science and Technology, 57 (12) (2006), pp. 1616–1628.
    Last edited by Oded Mcdossi; 19 Sep 2014, 01:37.

  • #2
    I don't see how figure 1 provides sufficient information to create figure 2. Figure 1 reports nothing on papers A through D, which is essential to figure 2. Perhaps you could clarify how values are converted between these two figures?
    Last edited by Aspen Chen; 19 Sep 2014, 02:15.

    • #3
      Thank you Aspen!

      In fact, I want to change Figure 1

      Figure 1: Co-citation matrix (symmetrical matrix)
                Paper1  Paper2  Paper3  Paper4
      Paper1         .      10      20      25
      Paper2        10       .      30      15
      Paper3        20      30       .      12
      Paper4        25      15      12       .

      into a series of 2x2 matrices, each containing the following cells (as in -help measure_option-):

                     Paper2
                     1    0
      Paper1    1    a    b
                0    c    d

      Is it possible to do this with a single asymmetric matrix (or file), or only with a series of 2x2 matrices?


      I hope I've explained myself well.
      Last edited by Oded Mcdossi; 19 Sep 2014, 07:18.

      • #4
        Oded,

        I'm afraid your question still isn't clear (to me anyway). I see that the A, B, C, and D in the columns of Figure 2 refer to the four cells of the 2x2 table, but where do the cells of the 2x2 come from? Figure 1 has values like 15, 20, 25, and 30, but Figure 2 (and your 2x2 table) has values 0 and 1. Further, which of the dimensions (rows or columns) in Figure 1 is the Citing paper and which is the Cited Paper?

        Regards,
        Joe

        • #5
          Thanks for trying to help.
          Let's take, for example, the cell with the number 10 in Figure 1. This cell corresponds to the letter "a" in the 2x2 matrix, since it is the total co-occurrence of Paper1 and Paper2. The letter "b" equals 45 and represents the number of citations of Paper1 that do not involve Paper2 (20+25), and vice versa for the letter "c", which in this case also equals 45 (30+15). The letter "d" equals 12 and denotes the co-citations that involve neither Paper1 nor Paper2.
          The next step is to reshape the 2x2 matrix into a long-format file, which is a simple task, but with a large matrix this task becomes very problematic.
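          As a minimal sketch of that arithmetic for the Paper1/Paper2 pair, using the Figure 1 values (the local macro names a, b, c and d are only illustrative, and the sketch takes the pairwise counts at face value):

          Code:
          mat input A=(.,10,20,25\10,.,30,15\20,30,.,12\25,15,12,.)
          local a = A[1,2]              // Paper1 and Paper2 together
          local b = A[1,3] + A[1,4]     // Paper1 with a paper other than Paper2
          local c = A[2,3] + A[2,4]     // Paper2 with a paper other than Paper1
          local d = A[3,4]              // pairs involving neither Paper1 nor Paper2
          display "a=`a' b=`b' c=`c' d=`d'"

          With these values the display line shows a=10 b=45 c=45 d=12.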
          Regards,
          Oded

          • #6
            Still makes no sense, to me at least. I'd like to help, but can't figure out what you want.

            • #7
              Mechanically it is not difficult to produce the 2x2 matrices. I will attach the code at the very end.
              But I agree that it is unclear why you are trying to do this. The numbers you labeled a, b, c, and d in the 2x2 table have nothing to do with papers A, B, C, and D. There is simply no information on the cited papers in figure 1.

              Perhaps you are thinking about something else? I studied the paper and found no indication that figure 1 can be converted into figure 2. Rather, the paper suggests that information from figure 2 can be used to produce something similar to (but not the same as) figure 1. It generated figure 3, which contains a matrix of similarity measures based on a (1+r)/2 transformation of the Pearson correlation. It looks like this:

                       Paper A  Paper B  Paper C  Paper D
              Paper A        1        1        0     .295
              Paper B        1        1        0     .295
              Paper C        0        0        1     .705
              Paper D     .295     .295     .705        1

              Stata can do this transformation. Here is an example. You could tweak it a bit to make the matrix for Paper 1-Paper 5.

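              Code:
              clear
              // one row per citing paper, one column per cited paper (A-D)
              input pa pb pc pd
              1 1 0 1
              0 0 1 1
              0 0 1 1
              1 1 0 0
              1 1 0 1
              end
              // transpose so that each cited paper becomes an observation
              xpose, clear
              // Pearson correlations between observations (cited papers)
              mat dissimilarity D=v1-v5, pearson
              mat list D
              // rescale r from [-1,1] to [0,1] via (1+r)/2
              forval i=1/4 {
                  forval j=1/4 {
                      mat D[`i',`j']=(1+D[`i',`j'])/2
                  }
              }
              mat coln D=PaperA PaperB PaperC PaperD
              mat rown D=PaperA PaperB PaperC PaperD
              mat list D, nohalf f(%8.3f)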
              Now for completeness, here's the code to make the 2x2 matrices. I'd be interested in knowing how you plan to make use of these matrices.

              Code:
              mat input A=(.,10,20,25\10,.,30,15\20,30,.,12\25,15,12,.) //symmetrical matrix-Figure 1
              mat colnames A=Paper1 Paper2 Paper3 Paper4
              mat rownames A=Paper1 Paper2 Paper3 Paper4
              
              // calculate marginal sums
              mata: st_matrix("B",colsum(st_matrix("A")))
              mat rownames B=sum
              mat A=A\B
              mat list A
              
              // sum below diagonal line
              loca diagsum=A[2,1]+A[3,1]+A[4,1]+A[3,2]+A[4,2]+A[3,1]
              
              // create 2x2 matrices
              forval i=1/4    {
                  forval j=1/4    {
                      if `i'!=`j'    {
                          mat A`i'`j'=J(2,2,.)
                          // label cells 1 (cited) then 0 (not cited), so that [1,1] is the co-occurrence cell a
                          mat coleq A`i'`j'=p`j' p`j'
                          mat coln A`i'`j'=1 0
                          mat roweq A`i'`j'=p`i' p`i'
                          mat rown A`i'`j'=1 0
                          mat A`i'`j'[1,1]=A[`i',`j']
                          mat A`i'`j'[1,2]=A[5,`i']-A`i'`j'[1,1]
                          mat A`i'`j'[2,1]=A[5,`j']-A`i'`j'[1,1]
                          mat A`i'`j'[2,2]=`diagsum'-(A`i'`j'[1,1]+A`i'`j'[1,2]+A`i'`j'[2,1])
                          mat list A`i'`j'
                      }
                  }
              }

              • #8
                Thank you!
                Finally... I think you understand exactly my goal. I'll read your code carefully.

                I attached the article in order to give an example of what I want. Specifically, I would like to take these matrices to calculate binary similarity indices. As far as I know the command matrix dissimilarity does not perform binary similarity measures based on co-occurrence matrix, therefore I need to change the matrix to a file that will allow the calculation of the indices. Any other suggestion that will shorten the process is welcome.
                I thank you very much.

                • #9
                  Dear Aspen,
                  I ran the code, but unfortunately it does not lead to the desired results. First, negative values were corrected by replacing missing values with zeros in the input matrix, but still, the 2x2 matrices are not parallel to the matrix in Figure-1.

                  • #10
                    Originally posted by Oded Mcdossi
                    I ran the code, but unfortunately it does not lead to the desired results.
                    It is not quite clear which block of code is under discussion. I am assuming the second?

                    Originally posted by Oded Mcdossi
                    First, negative values were corrected by replacing missing values with zeros in the input matrix,
                    Please elaborate on which negative values you are referring to. Neither figure 1 nor figure 2 contains negative values. Also, the code did not replace the missing values along the diagonal of matrix A with zeros. Typing -mat list A- would confirm that.

                    Originally posted by Oded Mcdossi
                    The 2x2 matrices are not parallel to the matrix in Figure-1.
                    There's one typo in the code. Please change

                    Code:
                    loca diagsum=A[2,1]+A[3,1]+A[4,1]+A[3,2]+A[4,2]+A[3,1]
                    to

                    Code:
                    loca diagsum=A[2,1]+A[3,1]+A[4,1]+A[3,2]+A[4,2]+A[3,4]
                    This should produce the 2x2 matrices you were asking for.

                    • #11
                      Now I finally think I understand why you liked the 2x2 matrices. Unfortunately, it would not work.

                      Originally posted by Oded Mcdossi
                      ...I would like to take these matrices to calculate binary similarity indices. As far as I know the command matrix dissimilarity does not perform binary similarity measures based on co-occurrence matrix, therefore I need to change the matrix to a file that will allow the calculation of the indices.
                      You cannot do this, at least not for the binary similarity indices under -help measure_option-. As the manual shows, these indices require you to have the count values for a, b, c, and d based on the cross-table below (I suspect that's why you used the same notation for the 2x2 table).

                                    obs j
                                    1    0
                      obs i    1    a    b
                               0    c    d

                      By definition, a co-occurrence matrix contains only the count values for the condition (xj==xi==1), which is the a value in this case. There is no reliable information for b, c, and d.

                      Let's use the case for Paper 1 and Paper 2 in figure 1 to explain. You suggested that a=10, b=30+15=45, c=20+25=45, and d=12. This appears intuitive, but there are two problems.

                      Figure 1: Co-citation matrix (symmetrical matrix)
                                Paper1  Paper2  Paper3  Paper4
                      Paper1         .      10      20      25
                      Paper2        10       .      30      15
                      Paper3        20      30       .      12
                      Paper4        25      15      12       .

                      First, the elements for both b and c likely overcount the co-citation. In the case of b, 30 suggests that paper 2 and paper 3 are co-cited in 30 sources. But in some of these 30 sources, paper 1 and paper 4 may also be cited. Similarly, in the 15 sources in which paper 2 and paper 4 are co-cited, papers 1 and 3 may also be on the reference list. Your method would overcount the citations unless every single source cites exactly two papers.

                      Second, assuming d=12 involves another problem. This matrix only counts sources that cite two or more papers, so sources that cited only paper 3, only paper 4, or none of the four papers would not be included in this matrix. And yet these sources should be part of the value d. Your method of calculating d is likely to undercount.
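
                      A small sketch of the first problem, assuming one hypothetical source that cites papers 1, 2 and 3 (the matrix names S and C are illustrative and not part of the code earlier in the thread):

                      Code:
                      mat input S=(1,1,1,0)    // one citing source: papers 1, 2 and 3 cited, paper 4 not
                      mat C=S'*S               // pairwise co-occurrence counts implied by this single source
                      mat list C
                      local b=C[1,3]+C[1,4]    // "paper 1 without paper 2" as computed from the pairwise counts
                      display "b from pairwise counts = `b'"

                      The pairwise counts give b=1 for the Paper 1/Paper 2 pair, although this source never cites paper 1 without paper 2, so the true b is 0.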

                      Now, while values in a co-occurrence matrix are insufficient for the calculation of binary similarity indices, I wonder why you would like to do that in the first place. A co-occurrence matrix still provides information on proximity, and can still be used to conduct multidimensional scaling. In case it helps, Stata has the native routine -mdsmat- for proximity matrices.

                      • #12
                        Thank you Aspen for giving it a second thought.
                        Indeed, my case is a bit different. My data file refers to combinations of only two objects at a time, and I have reliable information about "b", "c" and "d" (in the 2x2 matrix). I guess this relieves some of the serious concerns you raised for the case of citations. Based on preliminary assumptions, I removed all the entries on the diagonal, and therefore this problem is, at least in my case, less relevant.
                        As for your question of why to use binary measures in the first place: MDS requires a proximity matrix, and therefore I have to normalize this co-occurrence matrix somehow. My intention was to do so with the simple and intuitive binary similarity measures, but I'll read more about the basic requirements for building a proximity matrix from a co-occurrence matrix.

                        • #13
                          That makes sense. If your real case doesn't have over- or under-counting issues, then values in the co-occurrence matrix can reasonably be converted into binary indices. The code above for 2x2 tables still works. You can calculate each index value within the inner loop and store it in the corresponding cell of a separately created matrix (in the example, declare a 4x4 matrix before all loops).
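
                          A minimal sketch of that idea for the Jaccard index, a/(a+b+c), using the Figure 1 values. It assumes the pairwise counts can be taken at face value (no over- or under-counting); the matrix names T and Jac are illustrative only:

                          Code:
                          mat input A=(.,10,20,25\10,.,30,15\20,30,.,12\25,15,12,.)
                          // row totals over the off-diagonal cells (the missing diagonal is skipped)
                          mat T=J(1,4,0)
                          forval i=1/4 {
                              forval j=1/4 {
                                  if `i'!=`j' mat T[1,`i']=T[1,`i']+A[`i',`j']
                              }
                          }
                          // 4x4 matrix for the index, declared before the loops, with 1 on the diagonal
                          mat Jac=I(4)
                          forval i=1/4 {
                              forval j=1/4 {
                                  if `i'!=`j' {
                                      local a=A[`i',`j']
                                      local b=T[1,`i']-`a'
                                      local c=T[1,`j']-`a'
                                      mat Jac[`i',`j']=`a'/(`a'+`b'+`c')
                                  }
                              }
                          }
                          mat rown Jac=Paper1 Paper2 Paper3 Paper4
                          mat coln Jac=Paper1 Paper2 Paper3 Paper4
                          mat list Jac, nohalf f(%6.3f)

                          For Paper1 and Paper2 this gives 10/(10+45+45)=0.100. Another binary index could be swapped in by changing only the line that fills Jac.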

                          As to whether and how to normalize, I think the consideration has to be mostly theoretical or empirical (in cases of small N, for example). Technically, -mdsmat- can take a co-occurrence matrix like figure 1 without any problems.

                          • #14
                            Oded, you are trying to obtain an asymmetric matrix from a symmetric matrix, and this is impossible because the former has more information than the latter.
                            The other way around is different, because if you had the asymmetric information, it could be converted into the symmetric information by multiplying it by its transpose.
                            What would be needed to convert your initial co-occurrence matrix into 2x2 tables? Easy answer: just the diagonal data, which are missing in your example, as well as the total number of citations (in this case, it could be assumed to be the sum of the diagonal).
                            I have written this short program to apply my ideas. Are these the results that you wanted?
                            Code:
                            capture program drop atable
                            program define atable, rclass
                                // `1' = symmetric matrix with citation totals on the diagonal
                                // `2' = total number of citations (defaults to the trace of `1')
                                if "A`2'"=="A" {
                                    local 2 trace(`1')
                                }
                                // one 2x2 table per pair below the diagonal
                                forvalues X=2/`=rowsof(`1')' {
                                    forvalues Y=1/`=`X'-1' {
                                        matrix O`X'_`Y'=J(2,2,.)
                                        matrix O`X'_`Y'[1,1]=`1'[`X',`Y']
                                        matrix O`X'_`Y'[1,2]=`1'[`X',`X']-`1'[`X',`Y']
                                        matrix O`X'_`Y'[2,2]=`2'-`1'[`X',`X']-`1'[`Y',`Y']+`1'[`X',`Y']
                                        matrix O`X'_`Y'[2,1]=`1'[`Y',`Y']-`1'[`X',`Y']
                                        matrix rownames O`X'_`Y'=Paper`X'(Yes) Paper`X'(No)
                                        matrix colnames O`X'_`Y'=Paper`Y'(Yes) Paper`Y'(No)
                                        matlist O`X'_`Y'
                                        return matrix O`X'_`Y'=O`X'_`Y'
                                    }
                                }
                            end
                             
                            mat input A=(60,10,20,25\10,50,30,15\20,30,40,12\25,15,12,30) // Figure 1 matrix with assumed citation totals on the diagonal
                            atable A
                            atable A 200

                                         Paper1(Yes)  Paper1(No)
                            Paper2(Yes)           10          40
                            Paper2(No)            50          80

                            Note that I added to the diagonal of matrix A an assumed number of citations for each paper (60, 50, 40, 30).
                            There are also two example calls of the program. Without a number, it calculates the total number of citations by summing the diagonal, while if you supply a number (the second parameter of the program), that number is taken as the total.
                            In the first example of the output, you can see that the first column (Paper1) sums to 60, and the first row (Paper2) to 50. The sum of all the frequencies of the table is 180 (60+50+40+30).
                            In any case, my advice is: work with the asymmetrical matrix, as the Leydesdorff and Vaughan paper suggests.
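
                            As a minimal sketch of the transpose remark above, using the Figure 2 values as posted in this thread (the matrix names B and C are illustrative only):

                            Code:
                            mat input B=(1,1,0,0\0,0,1,1\0,0,1,1\1,1,0,0\0,1,0,1)
                            mat colnames B=PaperA PaperB PaperC PaperD
                            mat C=B'*B       // 4x4 symmetric matrix of cited papers
                            mat list C

                            The off-diagonal cells of C are the co-citation counts, and the diagonal holds the number of citing papers for each cited paper, which is exactly the information missing from Figure 1.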

                            • #15
                              Dear Modesto,
                              Thanks for sharing the elegant code to produce the tables.
                              Based on Aspen's code (see above), I've computed the Jaccard and Kulczynski binary similarity indexes to test whether these measures give a better representation of the objects on the MDS map. If I understand it right, the built-in dissimilarity measures in Stata (-matrix dissimilarity-) are calculated on raw data (or an asymmetric matrix) and not directly on a co-occurrence matrix. Calculating the binary measures on a large co-occurrence matrix becomes tedious. My matrix is large (about 70 objects) and includes many zeros, that is, no co-occurrence, so I am trying to explore different similarity measures.

                              I am not an expert in the subject and thus far the comments in the forum helped me.
