  • Calculating summary statistics for correlation coefficients

    Hello,
    I have calculated pairwise correlation coefficients between observations in a panel data set, within a grouping variable, and I have successfully transferred the correlation matrix into a data set. I now want to calculate summary statistics for these correlation coefficients (mean, percentiles, standard deviation), but without double counting any coefficient. For example, since the correlation coefficient at column C1, row 2 is the same as the one at column C2, row 1, collapsing and summing by column would double count these coefficients. Similarly, I do not want to include correlation coefficients between a variable and itself. How can I avoid double counting correlation coefficients?

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(C1 C2 C3 C4 C5 C6 C7 C8)
           1 .9937328 .9937328  .  . . 1 .9032351
    .9937328        1        1  .  . . 1  .946541
    .9937328        1        1  .  . . 1  .946541
           .        .        .  1  1 . .       -1
           .        .        .  1  1 . .       -1
           .        .        .  .  . . .        .
           1        1        1  .  . . 1        .
    .9032351  .946541  .946541 -1 -1 . .        1
    
    end
    Thanks for any help!

  • #2
    Set to missing what you do not need, stack them into one variable, and then calculate the statistics you need. Like this:

    Code:
    . forvalues i=1/8 {
      2. replace C`i' = . in 1/`i'
      3. }
    (1 real change made, 1 to missing)
    (2 real changes made, 2 to missing)
    (3 real changes made, 3 to missing)
    (1 real change made, 1 to missing)
    (2 real changes made, 2 to missing)
    (0 real changes made)
    (4 real changes made, 4 to missing)
    (6 real changes made, 6 to missing)
    
    . stack C1-C8, into(C) clear
    
    . summ C, detail
    
                                  C
    -------------------------------------------------------------
          Percentiles      Smallest
     1%           -1             -1
     5%           -1             -1
    10%           -1       .9032351       Obs                  12
    25%      .924888        .946541       Sum of wgt.          12
    
    50%     .9937328                      Mean           .6486486
                            Largest       Std. dev.      .7707012
    75%            1              1
    90%            1              1       Variance       .5939803
    95%            1              1       Skewness       -1.78249
    99%            1              1       Kurtosis       4.188306
    
    .



    • #3
      corrci from the Stata Journal includes an option to save correlations to a new dataset. Each correlation is included just once, and correlations between a variable and itself are not included, as you wish.

      Code:
      . sysuse auto, clear
      (1978 automobile data)
      
      . corrci headroom-gear_ratio, saving(corr_results)
      
      (obs=74)
      
                                 correlations and 95% limits
      headroom     trunk             0.662    0.511    0.774
      headroom     weight            0.483    0.287    0.641
      headroom     length            0.516    0.326    0.666
      headroom     turn              0.424    0.217    0.595
      headroom     displacement      0.474    0.276    0.634
      headroom     gear_ratio       -0.378   -0.558   -0.163
      trunk        weight            0.672    0.524    0.781
      trunk        length            0.727    0.597    0.819
      trunk        turn              0.601    0.432    0.729
      trunk        displacement      0.609    0.442    0.735
      trunk        gear_ratio       -0.509   -0.660   -0.317
      weight       length            0.946    0.915    0.966
      weight       turn              0.857    0.782    0.908
      weight       displacement      0.895    0.838    0.933
      weight       gear_ratio       -0.759   -0.842   -0.642
      length       turn              0.864    0.792    0.913
      length       displacement      0.835    0.750    0.893
      length       gear_ratio       -0.696   -0.798   -0.556
      turn         displacement      0.777    0.667    0.854
      turn         gear_ratio       -0.676   -0.784   -0.530
      displacement gear_ratio       -0.829   -0.889   -0.741
      
      . u corr_results, clear
      
      . su
      
          Variable |        Obs        Mean    Std. dev.       Min        Max
      -------------+---------------------------------------------------------
              var1 |          0
              var2 |          0
                 r |         21     .309352    .6376442  -.8288772   .9460086
             lower |         21    .1821128     .639221  -.8890014   .9153801
             upper |         21    .4233829    .6108544  -.7406569   .9657493
      
      . l in 1
      
           +---------------------------------------------------+
           |     var1    var2          r      lower      upper |
           |---------------------------------------------------|
        1. | headroom   trunk   .6620111   .5107769   .7735031 |
           +---------------------------------------------------+

      Install from pr0041_4. Read pr0041 if interested.



      Code:
      SJ-21-3 pr0041_4  . . . . . . . . . . . . . . . . . Software update for corrci
              (help corrci, corrcii if installed) . . . . . . . . . . . .  N. J. Cox
              Q3/21   SJ 21(3):847
              improves explanation of the format() option and fixes a bug
              concerning saving results to a new dataset
      
      SJ-20-4 pr0041_3  . . . . . . . . . . . . . . . . . Software update for corrci
              (help corrci, corrcii if installed) . . . . . . . . . . . .  N. J. Cox
              Q4/20   SJ 20(4):1028--1030
              corrects code for a bias correction used if (and only if) the
              fisher option is specified
      
      SJ-17-3 pr0041_2  . . . . . . . . . . . . . . . . . Software update for corrci
              (help corrci, corrcii if installed) . . . . . . . . . . . .  N. J. Cox
              Q3/17   SJ 17(3):779
              new options added
      
      SJ-10-4 pr0041_1  . . . . . . . . . . . . . . . . . Software update for corrci
              (help corrci, corrcii if installed) . . . . . . . . . . . .  N. J. Cox
              Q4/10   SJ 10(4):691
              update to fix corrci so that it always saves r-class results
      
      SJ-8-3  pr0041  .  Speaking Stata: Corr. with confidence, Fisher's z revisited
              (help corrci, corrcii if installed) . . . . . . . . . . . .  N. J. Cox
              Q3/08   SJ 8(3):413--439
              reviews Fisher's z transformation and its inverse, the
              hyperbolic tangent, and reviews their use in inference
              with correlations



      • #4
        Here is another line of attack exemplified:

        Code:
        . sysuse auto, clear
        (1978 automobile data)
        
        . corr headroom-gear_ratio
        (obs=74)
        
                     | headroom    trunk   weight   length     turn displa~t gear_r~o
        -------------+---------------------------------------------------------------
            headroom |   1.0000
               trunk |   0.6620   1.0000
              weight |   0.4835   0.6722   1.0000
              length |   0.5163   0.7266   0.9460   1.0000
                turn |   0.4245   0.6011   0.8574   0.8643   1.0000
        displacement |   0.4745   0.6086   0.8949   0.8351   0.7768   1.0000
          gear_ratio |  -0.3779  -0.5087  -0.7593  -0.6964  -0.6763  -0.8289   1.0000
        
        
        . matrix rho = vech(r(C))
        
        . svmat rho 
        
        . su rho if rho < 1 
        
            Variable |        Obs        Mean    Std. dev.       Min        Max
        -------------+---------------------------------------------------------
                rho1 |         21     .309352    .6376442  -.8288772   .9460086



        • #5
          Originally posted by Nick Cox
          Here is another line of attack exemplified:

          Code:
          . sysuse auto, clear
          . corr headroom-gear_ratio
          . matrix rho = vech(r(C))
          . svmat rho
          . su rho if rho < 1
          This was the first thing that came to my mind, but the thing is that the matrix -vech(.)- function does not exist in Stata. The code above does not work.

          -vech()- probably does exist in Mata, but then one needs to switch back and forth.



          • #6
            This is great! Thanks for the help Joro and Nick!



            • #7
              #5 is wrong, fortunately. How did my code work at all as shown if vech() does not exist?

              This link is evidence for existence:

              Stata 17 help for vech()
              Last edited by Nick Cox; 30 Jan 2023, 10:36.



              • #8
                The versions of Stata that Nick is getting must be some special versions, custom built. Because this is what is happening in Stata 17:

                Code:
                . sysuse auto, clear
                (1978 automobile data)
                
                . corr headroom-gear_ratio
                (obs=74)
                
                             | headroom    trunk   weight   length     turn displa~t gear_r~o
                -------------+---------------------------------------------------------------
                    headroom |   1.0000
                       trunk |   0.6620   1.0000
                      weight |   0.4835   0.6722   1.0000
                      length |   0.5163   0.7266   0.9460   1.0000
                        turn |   0.4245   0.6011   0.8574   0.8643   1.0000
                displacement |   0.4745   0.6086   0.8949   0.8351   0.7768   1.0000
                  gear_ratio |  -0.3779  -0.5087  -0.7593  -0.6964  -0.6763  -0.8289   1.0000
                
                
                . matrix rho = vech(r(C))
                unknown function vech()
                r(133);



                • #9
                  Which is not surprising, because the function vech() does not appear in the list of Stata functions. Here is the list:
                  Code:
                  [FN] Matrix functions    
                  (View complete PDF manual entry)
                  
                  
                  Matrix functions returning a matrix
                  
                      cholesky(M)
                         Description:  the Cholesky decomposition of the matrix:
                                       if R = cholesky(S), then RR^T = S
                  
                                       R^T indicates the transpose of R.
                                       Row and column names are obtained from M.
                         Domain:       n x n, positive-definite, symmetric matrices
                         Range:        n x n lower-triangular matrices
                  
                      corr(M)
                         Description:  the correlation matrix of the variance matrix
                  
                                       Row and column names are obtained from M.
                         Domain:       n x n symmetric variance matrices
                         Range:        n x n symmetric correlation matrices
                  
                      diag(M)
                         Description:  the square, diagonal matrix created from the row or column vector
                  
                                       Row and column names are obtained from the column names of M if M is a row vector or from the row names of M if M is a column vector.
                         Domain:       1 x n and n x 1 vectors
                         Range:        n x n diagonal matrices
                  
                      get(systemname)
                         Description:  a copy of Stata internal system matrix systemname
                  
                                       This function is included for backward compatibility with previous versions of Stata.
                         Domain:       existing names of system matrices
                         Range:        matrices
                  
                      hadamard(M,N)
                         Description:  a matrix whose i, j element is M[i,j]*N[i,j] (if M and N are not the same size, this function reports a conformability error)
                         Domain M:     m x n matrices
                         Domain N:     m x n matrices
                         Range:        m x n matrices
                  
                      I(n)
                         Description:  an n x n identity matrix if n is an integer; otherwise, a round(n) x round(n) identity matrix
                         Domain:       real scalars 1 to c(max_matdim)
                         Range:        identity matrices
                  
                      inv(M)
                         Description:  the inverse of the matrix M
                  
                                       If M is singular, this will result in an error.
                  
                                       The function invsym() should be used in preference to inv() because invsym() is more accurate.  The row names of the result are obtained from the
                                       column names of M, and the column names of the result are obtained from the row names of M.
                         Domain:       n x n nonsingular matrices
                         Range:        n x n matrices
                  
                      invsym(M)
                         Description:  the inverse of M if M is positive definite
                  
                                       If M is not positive definite, rows will be inverted until the diagonal terms are zero or negative; the rows and columns corresponding to these
                                       terms will be set to 0, producing a g2 inverse.  The row names of the result are obtained from the column names of M, and the column names of the
                                       result are obtained from the row names of M.
                         Domain:       n x n symmetric matrices
                         Range:        n x n symmetric matrices
                  
                      J(r,c,z)
                         Description:  the r x c matrix containing elements z
                         Domain r:     integer scalars 1 to c(max_matdim)
                         Domain c:     integer scalars 1 to c(max_matdim)
                         Domain z:     scalars -8e+307 to 8e+307
                         Range:        r x c matrices
                  
                      matuniform(r,c)
                         Description:  the r x c matrices containing uniformly distributed pseudorandom numbers on the interval (0,1)
                         Domain r:     integer scalars 1 to c(max_matdim)
                         Domain c:     integer scalars 1 to c(max_matdim)
                         Range:        r x c matrices
                  
                      nullmat(matname)
                         Description:  use with the row-join (,) and column-join (\) operators
                          
                                       Consider the following code fragment, which is an attempt to create the vector (1,2,3,4):
                  
                                             forvalues i = 1/4 {
                                                     mat v = (v, `i')
                                             }
                  
                                       The above program will not work because, the first time through the loop, v will not yet exist, and thus forming (v, `i') makes no sense.
                                       nullmat() relaxes that restriction:
                  
                                             forvalues i = 1/4 {
                                                     mat v = (nullmat(v), `i')
                                             }
                  
                                       The nullmat() function informs Stata that if v does not exist, the function row-join is to be generalized.  Joining nothing with `i' results in
                                       (`i').  Thus the first time through the loop, v = (1) is formed.  The second time through, v does exist, so v = (1,2) is formed, and so on.
                  
                                       nullmat() can be used only with the , and \ operators.
                         Domain:       matrix names, existing and nonexisting
                         Range:        matrices including null if matname does not exist
                  
                      sweep(M,i)
                         Description:  matrix M with ith row/column swept
                  
                                       The row and column names of the resultant matrix are obtained from M, except that the nth row and column names are interchanged.
                         Domain M:     n x n matrices
                         Domain i:     integer scalars 1 to n
                         Range:        n x n matrices
                  
                      vec(M)
                         Description:  a column vector formed by listing the elements of M, starting with the first column and proceeding column by column
                         Domain:       matrices
                         Range:        column vectors (n x 1 matrices)
                  
                      vecdiag(M)
                         Description:  the row vector containing the diagonal of matrix M
                  
                                       vecdiag() is the opposite of diag().  The row name is set to r1; the column names are obtained from the column names of M.
                         Domain:       n x n matrices
                         Range:        1 x n vectors
                  There is no vech() here as far as I can see. The rest of the functions return a scalar.



                  • #10
                    EDIT vech() was added on 6 April 2022. See help whatsnew.

                    The age-old support for vech I was remembering in #7 [now corrected] was my own add-on matvech from 2000: package dm79 from http://www.stata.com/stb/stb56


                    I don't have any special version, just Stata 17 updated. I tried again on a different machine.

                    Code:
                    . sysuse auto, clear
                    (1978 automobile data)
                    
                    .
                    . corr headroom-gear_ratio
                    (obs=74)
                    
                                 | headroom    trunk   weight   length     turn displa~t gear_r~o
                    -------------+---------------------------------------------------------------
                        headroom |   1.0000
                           trunk |   0.6620   1.0000
                          weight |   0.4835   0.6722   1.0000
                          length |   0.5163   0.7266   0.9460   1.0000
                            turn |   0.4245   0.6011   0.8574   0.8643   1.0000
                    displacement |   0.4745   0.6086   0.8949   0.8351   0.7768   1.0000
                      gear_ratio |  -0.3779  -0.5087  -0.7593  -0.6964  -0.6763  -0.8289   1.0000
                    
                    
                    . matrix rho = vech(r(C))
                    
                    . mat li rho
                    
                    rho[28,1]
                                                       c1
                            headroom:headroom           1
                               headroom:trunk   .66201113
                              headroom:weight   .48345581
                              headroom:length   .51629547
                                headroom:turn   .42446462
                        headroom:displacement   .47449149
                          headroom:gear_ratio  -.37785196
                                  trunk:trunk           1
                                 trunk:weight   .67220573
                                 trunk:length   .72659561
                                   trunk:turn   .60105949
                           trunk:displacement   .60863505
                             trunk:gear_ratio  -.50866456
                                weight:weight           1
                                weight:length   .94600864
                                  weight:turn   .85744287
                          weight:displacement   .89489577
                            weight:gear_ratio  -.75925828
                                length:length           1
                                  length:turn   .86426115
                          length:displacement   .83514003
                            length:gear_ratio  -.69638335
                                    turn:turn           1
                            turn:displacement   .77676473
                              turn:gear_ratio  -.67629957
                    displacement:displacement           1
                      displacement:gear_ratio  -.82887716
                        gear_ratio:gear_ratio           1
                    Last edited by Nick Cox; 30 Jan 2023, 10:38.



                    • #11
                      The resolution of the mystery is that functions like vech() were added in an update to Stata 17. That is, a copy of Stata 17 that has not been updated does not recognise them, but once you run -update all- and the update completes, it does.

                      When I updated my Stata 17, it started to recognise these functions.
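
                      A minimal sketch of how one might check whether a given installation already recognises vech(). It assumes only what #8 shows, namely that an unrecognised function produces return code 133. The matrix A is made up purely for illustration.

                      Code:
                      * hypothetical 2 x 2 symmetric matrix, for illustration only
                      matrix A = (1, .5 \ .5, 1)
                      * capture suppresses the error message; _rc holds the return code
                      capture matrix v = vech(A)
                      if _rc == 133 {
                          display "vech() not recognised here: try -update all-"
                      }
                      else {
                          matrix list v
                      }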



                      • #12
                        That's it. Otherwise put, Stata on your machine can't know about newer stuff if you haven't updated.



                        • #13
                          This is a footnote to a tangled thread. Suppose you were interested in the vech method of #4 but

                          * have Stata 17, but are unable to update (awkward administrators or ultra-rigid site policies?)

                          * or have Stata 16 or earlier

                          and are unable to, or do not care to, download matvech (#10) or write your own vech code.

                          There is still an alternative, which is to push the correlation matrix into Mata and pull out the vech.

                          Here is some code

                          Code:
                          sysuse auto, clear
                          corr headroom-gear_ratio
                          mata : rho = vech(st_matrix("r(C)"))
                          getmata rho, force 
                          su rho if rho < 1
                          The use of force there does not itself connote brutality to data or results, but is indirectly a sign that this method needs care if the number of correlations exceeds the number of observations in the dataset.

                          The methods of #2 and #3 remain.
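
                          One caveat on the condition rho < 1: it screens out the diagonal by value, so any off-diagonal correlation exactly equal to 1 (there are several in the data of #1) would be dropped too. Here is a sketch, not a definitive recipe, that drops the diagonal by position instead. It relies on vech() of an identity matrix being 1 exactly at the diagonal positions. The names R and offdiag are made up for illustration. Clearing the data before getmata keeps the correlations in their own dataset, so the original data are not padded; preserve and restore bring the original data back afterwards.

                          Code:
                          sysuse auto, clear
                          corr headroom-gear_ratio
                          mata:
                          R = st_matrix("r(C)")
                          // 1 where the half-vectorised element is off the diagonal, 0 on it
                          offdiag = vech(I(rows(R))) :== 0
                          // keep only the rows flagged as off-diagonal
                          rho = select(vech(R), offdiag)
                          end
                          preserve
                          clear
                          getmata rho, force
                          su rho
                          restore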
