  • Use stored correlation matrix from PWCORR in CORR2DATA

    Hey Stata-listers,

    I will preface this by saying that I am not super familiar with matrices in Stata. That said, I am trying to use CORR2DATA to create a dataset that matches a correlation structure from a different dataset. I know I can manually enter the correlation structure into CORR2DATA, but it is rather large, so I would like to avoid that, if possible. I have run PWCORR on the existing dataset and saved the stored correlation matrix as a matrix, but I am getting an error when I try to use that matrix in CORR2DATA. I believe the problem here is that PWCORR gives the lower half of the correlation matrix, while CORR2DATA requires the upper half (I tried the 'cstorage(lower)' option and that didn't work). Here is the code I am using (variables changed for ease):

    Code:
    pwcorr var1-var3
    matrix C_var1_3 = r(C)
    corr2data var1 var2 var3, n(100) means(10197.48 11269.10 490.71 80.84) sds(12380.57 15761.20 448.63 65.49) corr(C_var1_3) clear
    The error I get is:
    corr() incorrectly specified
    diagonal elements should be 1

    When I add the cstorage(lower) option to the corr2data code, the error I get is:
    invalid cstorage; matrix is found to be square

    I'm not sure what is going on here. I tried transposing the correlation matrix, but when I list each one, I get the same exact looking matrix. The code I used looked like this:
    Code:
    matrix C_var1_3_Trans = C_var1_3'
    Any help on this would be greatly appreciated!!

  • #2
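    Whenever you list a symmetric matrix in Stata, Stata shows you only the lower half. That is purely a display convention: the stored matrix is full and square, which is why the transpose looks exactly the same and why cstorage(lower) complains that the matrix is square.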
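
    To make the display convention concrete, here is a minimal sketch using the auto dataset that ships with Stata. The names C, v, x, and y and the rounded value 0.946 are illustrative only. It assumes, as I read the -corr2data- documentation, that cstorage(lower) expects a vector holding the lower triangle row by row rather than a square matrix.

    ```stata
    sysuse auto, clear
    quietly correlate weight length
    matrix C = r(C)

    * default listing shows only the lower triangle of a symmetric matrix
    matrix list C
    * nohalf shows that the full square matrix is what is actually stored
    matrix list C, nohalf

    * cstorage(lower) is meant for a vector C11 C21 C22 (row by row), not a square matrix
    matrix v = (1, 0.946, 1)
    corr2data x y, n(100) corr(v) cstorage(lower) clear
    correlate x y
    ```

    With a square matrix such as C, the default cstorage(full) is the one to use.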

    I did some experimenting here. There are two problems, one of which I understand, and the other I do not.

    The one I understand is that you are specifying four means and four standard deviations, but only three new variables and a 3x3 correlation matrix. That won't fly: -corr2data- needs exactly one mean and one standard deviation per generated variable.

    However, I also discovered that even after eliminating one mean and one standard deviation, the code still produces an error message. This can be fixed by using -corr- instead of -pwcorr- to generate the correlation matrix. I'm perplexed, because I do not understand how -corr2data- can tell the difference between the correlation matrix generated by -corr- and the one generated by -pwcorr- (because in my test example, they are the same!) But somehow it can.

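    Now, that might be a feature, not a bug. It is not, in general, legitimate to use the matrix output of -pwcorr- for this purpose. When there are missing values, each pairwise correlation is computed on a different subset of observations, so the resulting matrix is not guaranteed to be positive semidefinite, and a matrix that is not positive semidefinite cannot be the correlation matrix of any distribution. (The pairwise correlations may turn out to form a positive definite matrix, but that is just coincidental.)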
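
    One way to check this for a particular matrix is to look at its eigenvalues. The following is only a sketch: it uses the auto dataset (where rep78 has missing values), the matrix names P, X, and lambda are made up for illustration, and it assumes -pwcorr- leaves its matrix in r(C) and that Stata accepts that matrix as symmetric.

    ```stata
    sysuse auto, clear

    * rep78 has missing values, so each pairwise correlation may use a different sample
    pwcorr price mpg rep78
    matrix P = r(C)

    * eigenvalues of a symmetric matrix, returned in descending order
    matrix symeigen X lambda = P
    matrix list lambda

    * the last element is the smallest eigenvalue; it must be positive
    * for P to be positive definite
    local k = colsof(lambda)
    display "smallest eigenvalue = " lambda[1, `k']
    ```

    A negative smallest eigenvalue would mean the pairwise matrix cannot serve as a correlation structure without some adjustment.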

    Anyway, if you use -corr-, not -pwcorr- to calculate your correlation matrix, and if you have the number of means and standard deviations correct, you'll get results.
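
    A minimal sketch of that workflow, using the auto dataset rather than your data. The variable choice and the names C, S, m, s, and v1-v3 are illustrative assumptions. Here -tabstat- with the save option is used to pick up the means and standard deviations so that nothing has to be typed by hand.

    ```stata
    sysuse auto, clear

    * casewise correlation matrix from -correlate-
    correlate price mpg weight
    matrix C = r(C)

    * means and SDs of the same three variables, saved as a 2 x 3 matrix
    tabstat price mpg weight, statistics(mean sd) save
    matrix S = r(StatTotal)
    matrix m = S[1, 1...]
    matrix s = S[2, 1...]

    * one mean and one SD per generated variable
    corr2data v1 v2 v3, n(100) means(m) sds(s) corr(C) clear
    correlate v1 v2 v3, means
    ```

    Note that -correlate- drops any observation with a missing value on any listed variable, whereas -tabstat- uses all available observations for each variable, so the two can be based on different samples when there are missing data. These three variables have none.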



    • #3
      Perfect! That worked!! Thank you so much. Also, the reason for the 4 means/SDs was because when I was playing around with it, I was trying it with 4 variables. When I changed the code to include it in this post, I just put 3 in, so, that was my mistake. Thanks again for your quick, helpful response!



      • #4
        The problem seems to be that pwcorr does not enforce 1's on the diagonal of the correlation matrix, as corr apparently does, and corr2data rejects any matrix whose diagonal is not exactly 1. Consider the following.
        Code:
        . sysuse auto, clear
        (1978 Automobile Data)
        
        . corr weight length
        (obs=74)
        
                     |   weight   length
        -------------+------------------
              weight |   1.0000
              length |   0.9460   1.0000
        
        
        . matrix list r(C), nohalf format(%20.17f)
        
        symmetric r(C)[2,2]
                             weight               length
        weight  1.00000000000000000  0.94600864341781110
        length  0.94600864341781110  1.00000000000000000
        
        . pwcorr weight length
        
                     |   weight   length
        -------------+------------------
              weight |   1.0000 
              length |   0.9460   1.0000 
        
        . matrix list r(C), nohalf format(%20.17f)
        
        symmetric r(C)[2,2]
                             weight               length
        weight  1.00000000000000000  0.94600864341781110
        length  0.94600864341781110  1.00000000000000022
        
        . matrix C = r(C)
        
        . corr2data var1 var2, n(100) corr(C) clear
        corr() incorrectly specified
        diagonal elements should be 1
        r(198);
        
        . matrix C[2,2] = 1
        
        . corr2data var1 var2, n(100) corr(C) clear
        (obs 100)
        
        .
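
        The same repair can be applied to a larger matrix with a loop. This is a sketch, not a recommendation: the four variables and the names C and v1-v4 are chosen only for illustration, and it is reasonable only when there are no missing values, so that the -pwcorr- matrix differs from the -corr- matrix by rounding error alone.

        ```stata
        sysuse auto, clear
        quietly pwcorr price mpg weight length
        matrix C = r(C)

        * overwrite each diagonal element with an exact 1
        local k = rowsof(C)
        forvalues i = 1/`k' {
            matrix C[`i', `i'] = 1
        }

        corr2data v1 v2 v3 v4, n(100) corr(C) clear
        ```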



        • #5
          Aha! Thanks, William Lisowski. I didn't think to examine the matrices at higher precision: I just took the default formatted -matrix list- output at face value. I should have known better.



          • #6
            Originally posted by Clyde Schechter
            I didn't think to examine the matrices at higher precision: I just took the default formatted -matrix list- output at face value. I should have known better.
            Well, I didn't think to use a higher precision originally, either. I was grasping at straws, so I subtracted the identity matrix from the correlation matrix output by pwcorr, and then matrix list showed me a nonzero value on the order of 1e-16.
            Code:
            . sysuse auto, clear
            (1978 Automobile Data)
            
            . quietly corr weight length
            
            . matrix C1 = r(C)
            
            . quietly pwcorr weight length
            
            . matrix C2 = r(C)
            
            . matrix D = C2 - C1
            
            . matrix list D
            
            symmetric D[2,2]
                       weight     length
            weight          0
            length          0  2.220e-16
            
            . matrix list C2, format(%21x)
            
            symmetric C2[2,2]
                                   weight                 length
            weight  +1.0000000000000X+000
            length  +1.e45b3eb26cf75X-001  +1.0000000000001X+000
            
            .
            Is it just me, or is anyone else disturbed that pwcorr does not produce results identical to those from corr when there are in fact no missing values?
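
            For anyone who wants to test for this without reading hexadecimal output, here is a small sketch that measures how far the diagonal is from 1. The names C2, d1, ones, and d are illustrative.

            ```stata
            sysuse auto, clear
            quietly pwcorr weight length
            matrix C2 = r(C)

            * diagonal of C2 as a row vector, and a row vector of exact 1s
            matrix d1   = vecdiag(C2)
            matrix ones = J(1, colsof(C2), 1)

            * any nonzero entry here is rounding error on the diagonal
            matrix d = d1 - ones
            matrix list d, format(%12.3e)

            * largest relative difference between the diagonal and exact 1s
            display mreldif(d1, ones)
            ```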



            • #7
              Originally posted by William Lisowski
              Is it just me, or is anyone else disturbed that pwcorr does not produce results identical to those from corr when there are in fact no missing values?
              You are not alone: That is disturbing.
              --
              Bruce Weaver
              Version: Stata/MP 18.5 (Windows)



              • #8
                Yes, it is disturbing to me, as well.



                • #9
                  The following example suggests to me that
                  • precision issues can lead to a calculated correlation of a variable with itself not having the value 1
                  • the correlate command and Mata correlation() function take some care to ensure that correlations on the main diagonal are 1
                  • the correlate command and Mata correlation() function do not ensure off-diagonal correlations are in the interval [-1,+1]
                  • the pwcorr command does not ensure either of these conditions
                  • the Mata quadcorrelation() function does not experience the precision problem in this example and likely takes no special care
                  • it appears that the correlate command is not built on the Mata correlation() function, since the two give different off-diagonal values in the last example
                  I remain uncomfortable seeing reported values that are not 1 (for the correlation of a variable with itself) or are outside the range [-1,+1] (for any correlation). I don't see a general solution. Since it appears the correlate command takes special care (perhaps plugging in a constant 1 on the main diagonal), I'd think the pwcorr command should do so as well. (I've looked at pwcorr.ado; it basically calculates the main diagonal correlations by using correlate(var,var).)

                  Code:
                  about
                  clear all
                  sysuse auto
                  
                  // original problem - pwcorr weight length
                  // correlation of length with itself is not precisely 1
                  pwcorr weight length
                  matrix list r(C), format(%21x)
                  
                  // simplify - pwcorr length length
                  // correlations are not precisely one
                  pwcorr length length
                  matrix list r(C), format(%21x)
                  
                  // demonstrate problem using corr - corr length length
                  // correlations on diagonal are precisely 1 - fudging behind the scenes?
                  // off-diagonal correlation is not precisely 1
                  correlate length length
                  matrix list r(C), format(%21x)
                  
                  // reproduce problem in mata - correlation((length,length))
                  // correlations on diagonal are precisely 1 - fudging behind the scenes?
                  // off-diagonal correlation is not precisely 1
                  mata length = st_data(.,st_varindex("length"))
                  mata C = correlation((length,length))
                  mata printf("%21x\n%21x  %21x",C[1,1],C[2,1],C[2,2])
                  
                  // quadcorrelation resolves this problem - quadcorrelation((length,length)) 
                  // all elements are precisely 1
                  mata length = st_data(.,st_varindex("length"))
                  mata C = quadcorrelation((length,length))
                  mata printf("%21x\n%21x  %21x",C[1,1],C[2,1],C[2,2])
                  
                  // shift gears: what if the correlation is just below 1?
                  // remove a carefully selected observation
                  
                  // demonstrate problem using corr - corr length length
                  // correlations on diagonal are precisely 1 - fudging behind the scenes?
                  // off-diagonal correlation is not precisely 1
                  correlate length length if _n!=4
                  matrix list r(C), format(%21x)
                  
                  // reproduce problem in mata - correlation((length,length))
                  // correlations on diagonal are precisely 1 - fudging behind the scenes?
                  // off-diagonal correlation is not precisely 1
                  mata length = st_data((1,2,3,5..74)',st_varindex("length"))
                  mata C = correlation((length,length))
                  mata printf("%21x\n%21x  %21x",C[1,1],C[2,1],C[2,2])
                  
                  // quadcorrelation resolves this problem - quadcorrelation((length,length)) 
                  // all elements are precisely 1
                  mata length = st_data((1,2,3,5..74)',st_varindex("length"))
                  mata C = quadcorrelation((length,length))
                  mata printf("%21x\n%21x  %21x",C[1,1],C[2,1],C[2,2])
                  Code:
                  . about
                  
                  Stata/SE 17.0 for Mac (Intel 64-bit)
                  Revision 16 Dec 2021
                  ...
                  
                  . clear all
                  
                  . sysuse auto
                  (1978 automobile data)
                  
                  . 
                  . // original problem - pwcorr weight length
                  . // correlation of length with itself is not precisely 1
                  . pwcorr weight length
                  
                               |   weight   length
                  -------------+------------------
                        weight |   1.0000 
                        length |   0.9460   1.0000 
                  
                  . matrix list r(C), format(%21x)
                  
                  symmetric r(C)[2,2]
                                         weight                 length
                  weight  +1.0000000000000X+000
                  length  +1.e45b3eb26cf75X-001  +1.0000000000001X+000
                  
                  . 
                  . // simplify - pwcorr length length
                  . // correlations are not precisely one
                  . pwcorr length length
                  
                               |   length   length
                  -------------+------------------
                        length |   1.0000 
                        length |   1.0000   1.0000 
                  
                  . matrix list r(C), format(%21x)
                  
                  symmetric r(C)[2,2]
                                         length                 length
                  length  +1.0000000000001X+000
                  length  +1.0000000000001X+000  +1.0000000000001X+000
                  
                  . 
                  . // demonstrate problem using corr - corr length length
                  . // correlations on diagonal are precisely 1 - fudging behind the scenes?
                  . // off-diagonal correlation is not precisely 1
                  . correlate length length
                  (obs=74)
                  
                               |   length   length
                  -------------+------------------
                        length |   1.0000
                        length |   1.0000   1.0000
                  
                  
                  . matrix list r(C), format(%21x)
                  
                  symmetric r(C)[2,2]
                                         length                 length
                  length  +1.0000000000000X+000
                  length  +1.0000000000001X+000  +1.0000000000000X+000
                  
                  . 
                  . // reproduce problem in mata - correlation((length,length))
                  . // correlations on diagonal are precisely 1 - fudging behind the scenes?
                  . // off-diagonal correlation is not precisely 1
                  . mata length = st_data(.,st_varindex("length"))
                  
                  . mata C = correlation((length,length))
                  
                  . mata printf("%21x\n%21x  %21x",C[1,1],C[2,1],C[2,2])
                  +1.0000000000000X+000
                  +1.0000000000001X+000  +1.0000000000000X+000
                  . 
                  . // quadcorrelation resolves this problem - quadcorrelation((length,length)) 
                  . // all elements are precisely 1
                  . mata length = st_data(.,st_varindex("length"))
                  
                  . mata C = quadcorrelation((length,length))
                  
                  . mata printf("%21x\n%21x  %21x",C[1,1],C[2,1],C[2,2])
                  +1.0000000000000X+000
                  +1.0000000000000X+000  +1.0000000000000X+000
                  . 
                  . // shift gears: what if the correlation is just below 1?
                  . // remove a carefully selected observation
                  . 
                  . // demonstrate problem using corr - corr length length
                  . // correlations on diagonal are precisely 1 - fudging behind the scenes?
                  . // off-diagonal correlation is not precisely 1
                  . correlate length length if _n!=4
                  (obs=73)
                  
                               |   length   length
                  -------------+------------------
                        length |   1.0000
                        length |   1.0000   1.0000
                  
                  
                  . matrix list r(C), format(%21x)
                  
                  symmetric r(C)[2,2]
                                         length                 length
                  length  +1.0000000000000X+000
                  length  +1.fffffffffffffX-001  +1.0000000000000X+000
                  
                  . 
                  . // reproduce problem in mata - correlation((length,length))
                  . // correlations on diagonal are precisely 1 - fudging behind the scenes?
                  . // off-diagonal correlation is not precisely 1
                  . mata length = st_data((1,2,3,5..74)',st_varindex("length"))
                  
                  . mata C = correlation((length,length))
                  
                  . mata printf("%21x\n%21x  %21x",C[1,1],C[2,1],C[2,2])
                  +1.0000000000000X+000
                  +1.ffffffffffffeX-001  +1.0000000000000X+000
                  . 
                  . // quadcorrelation resolves this problem - quadcorrelation((length,length)) 
                  . // all elements are precisely 1
                  . mata length = st_data((1,2,3,5..74)',st_varindex("length"))
                  
                  . mata C = quadcorrelation((length,length))
                  
                  . mata printf("%21x\n%21x  %21x",C[1,1],C[2,1],C[2,2])
                  +1.0000000000000X+000
                  +1.0000000000000X+000  +1.0000000000000X+000
                  . 
                  end of do-file
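
                  Until pwcorr does something like this itself, a user-side cleanup along the following lines may be a workable stopgap. It is only a sketch: the matrix name C is arbitrary, and it does nothing about positive definiteness, so it is sensible only when the departures are at rounding-error level as in the examples above.

                  ```stata
                  sysuse auto, clear
                  quietly pwcorr weight length
                  matrix C = r(C)

                  mata:
                  C = st_matrix("C")
                  C = (C + C') / 2                                     // enforce exact symmetry
                  C = C :* (abs(C) :<= 1) + sign(C) :* (abs(C) :> 1)   // clamp entries to [-1, +1]
                  _diag(C, 1)                                          // exact 1s on the main diagonal
                  st_replacematrix("C", C)                             // write back, keeping row/column names
                  end

                  matrix list C, format(%21x)
                  ```

                  Doing the cleanup in Mata and writing back with st_replacematrix() keeps the row and column names that corr2data and matrix list display.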
