
  • Pairwise correlations

    Hi,

    I am very new to Stata and need help obtaining only the relevant correlation coefficients when running the pwcorr or spearman commands.

    Example:
    I have the following variables in my data set: Weight-a, Weight-b, BMI-a, BMI-b, Height-a, Height-b
    I want to get the following table directly:
    Variable   r-value
    Weight*    1.0
    BMI        1.0
    Height     0.9
    *for the correlation between Weight-a and Weight-b, etc.

    Is there any way to get the table above instead of the usual one, in which every variable is correlated with every other?

  • #2
    Maybe this text may interest you. That said, there are at least two aspects here, one is having specific paired correlations, the other is presenting them in a table. Personally, I just copy the results and paste them in a customized table.
    Best regards,

    Marcos



    • #3
      Yes. cpcorr (SSC) does this.

      Code:
      . sysuse auto, clear
      (1978 Automobile Data)
      
      . cpcorr price headroom-gear \ mpg
      (obs=74)
      
                        mpg
             price  -0.4686
          headroom  -0.4138
             trunk  -0.5816
            weight  -0.8072
            length  -0.7958
              turn  -0.7192
      displacement  -0.7056
        gear_ratio   0.6162
      
      . cpcorr price headroom-gear \ mpg, format(%04.3f)
      (obs=74)
      
                       mpg
             price  -0.469
          headroom  -0.414
             trunk  -0.582
            weight  -0.807
            length  -0.796
              turn  -0.719
      displacement  -0.706
        gear_ratio   0.616
      
      . ssc desc cpcorr
      
      --------------------------------------------------------------------------------------------------------------------------------------------------
      package cpcorr from http://fmwww.bc.edu/repec/bocode/c
      --------------------------------------------------------------------------------------------------------------------------------------------------
      
      TITLE
            'CPCORR': module for correlations for each row vs each column variable
      
      DESCRIPTION/AUTHOR(S)
            
            cpcorr produces a matrix of correlations for rowvarlist versus
            colvarlist. cpspear does the same for Spearman correlations. This
            matrix may thus be oblong, and need not be square. Both also
            allow a single varlist.
            
            KW: correlate
            KW: matrix
            KW: oblong
            
            Requires: Stata version 10.0 (6.0 for cpcorr6, cpspear6)
            
            
            Author: Nicholas J. Cox, Durham University
            Support: email [email protected]
            
            Distribution-Date: 20150916
            
      
      INSTALLATION FILES                               (type net install cpcorr)
            cpcorr.ado
            cpcorr.sthlp
            cpspear.ado
            cpspear.sthlp
            cpcorr6.ado
            cpcorr6.hlp
            cpspear6.ado
            cpspear6.hlp
      --------------------------------------------------------------------------------------------------------------------------------------------------
      (type ssc install cpcorr to install)
      .



      • #4
        Thank you Nick! It worked perfectly. Since I am comparing results for analytes each measured in analyzer X and analyzer Y, I tried two pieces of code to avoid writing out the whole varlist. First, I tried the following loop, but it did not work for variables without a suffix (_X or _Y) or for string variables:
        Code:
        foreach var of varlist `var'_* {
            cpcorr `var'_X \ `var'_Y
        }
        Then I tried the following loop and it worked:
        Code:
        foreach stub in AL_ BS_ CD_ FR_ MAC_ RAW_ RD_ PT_ PV_ CT_ PW_ LP_ JB_ WBC_ LM_ MY_ GN_ LY_ MID_ GO_  {
            cpcorr `stub'X \ `stub'Y 
        }
        Still, I have to write the prefix (the name of the analyte) for each variable. Is there a way to get the results without typing the whole varlist? Even better would be to get the results in one table, instead of the many tables that the loop creates by repeating the same command. I'm a new user of Stata and appreciate every single tip.



        • #5
          As I understand it, AL_X and AL_Y are just two variables so there is no gain in using cpcorr rather than correlate. Something like this may be more to your taste:


          Code:
          * sandbox dataset
          clear
          set obs 7
          set seed 2803
          foreach s in AL BS CD {
              gen `s'_X = rnormal()
              gen `s'_Y = rnormal()
          }
          
          * you start here
          unab stubs : *_X
          local stubs : subinstr local stubs "_X" "", all
          local nstubs : word count `stubs'
          
          matrix results = J(`nstubs', 1, .)
          
          local i = 1
          foreach s of local stubs {
              quietly corr `s'_X `s'_Y
              mat results[`i', 1] = r(rho)
              local ++i
          }
          
          mat rownames results = `stubs'
          mat li results, format(%05.3f)
          
          * results I got follow here
          
          results[3,1]
                  c1
          AL   0.366
          BS  -0.490
          CD  -0.454
          The trick in getting the stubs in one place is described at https://www.stata.com/support/faqs/d...-with-reshape/



          • #6

            Thank you Nick for the code. Almost everything in the code is new for me so I have a long list of commands to learn about, which I find fun and challenging. Right now I get the error message r(198) for the "matrix results = J(`nstubs', 1, .)" when I run the code for the whole dataset. I'm still investigating the reason...



            • #7
              I guess you are trying to execute the commands line by line from a do-file editor window. If so, don't do that. The local macro definitions of the form `foo' are not intervisible that way.
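
              A minimal sketch of the difference, assuming the r(198) arose because `nstubs' was empty when the matrix line was run on its own (the value 3 is purely illustrative):

              Code:
              * Run these lines together as one block: the local macro is
              * still defined when J() is evaluated
              local nstubs = 3
              matrix results = J(`nstubs', 1, .)
              matrix list results
              
              * Run alone in a later, separate execution, the same line becomes
              *     matrix results = J(, 1, .)
              * because `nstubs' now expands to nothing, which is a syntax error
              Selecting the whole block in the do-file editor and executing it in one go, or running the complete do-file, avoids the problem.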

              Here's another approach using the reshape suggested in your other thread. https://www.statalist.org/forums/for...-using-tabstat

              You need to install rangestat (SSC) to do this, but the command to do that is included here.


              Code:
              * sandbox dataset
              clear
              set obs 12
              set seed 2803
              foreach s in AL BS CD {
                  gen `s'_X = rnormal()
                  gen `s'_Y = rnormal()
                  replace `s'_Y = . if runiform() < 0.4
              }
              
              * you start here
              
              gen id = _n
              reshape long @_X @_Y, i(id) j(which) string
              encode which, gen(WHICH)
              ssc install rangestat
              rangestat (corr) _X _Y if !missing(_X, _Y), int(WHICH 0 0)
              sort WHICH corr_x
              tabdisp WHICH, c(corr*)
              
              ------------------------------------------------------
                  WHICH |            corr_nobs  correlation of _X _Y
              ----------+-------------------------------------------
                     AL |                    7             .11365776
                     BS |                    7            -.05947031
                     CD |                    5             .33442341
              ------------------------------------------------------
              Plenty of scope for nicer variable names, etc. In practice rangestat would ignore missings anyway, but the principle is worth making explicit.



              • #8
                Thanks Nick. You were right. I executed the command from a Do-file. When I run the first code now, Stata complains about the number of variables: too few variables specified r(102) at the end of the loop. Am I doing something wrong? Or maybe I should do something with my data before?



                • #9
                  Clearly you are doing something wrong. But I can't see what it is because you don't give a data example or show the exact code you're using. https://www.statalist.org/forums/help#stata applies, always.

                  I recommend the approach in #7 over that in #5. The code in #7 can be simplified to


                  Code:
                  * sandbox dataset
                  clear
                  set obs 12
                  set seed 2803
                  foreach s in AL BS CD {    
                      gen `s'_X = rnormal()    
                      gen `s'_Y = rnormal()    
                      replace `s'_Y = . if runiform() < 0.4
                  }  
                  
                  * you start here  
                  gen id = _n
                  reshape long @_X @_Y, i(id) j(which) string
                  encode which, gen(WHICH)
                  
                  ssc install rangestat
                  rangestat (corr) _X _Y, int(WHICH 0 0)
                  tabdisp WHICH, c(corr*)

                  Last edited by Nick Cox; 25 Mar 2019, 03:36.



                  • #10
                    Dear Nick,

                    As I mentioned in my other post, we analyzed blood samples (with unique Sampleid, then recoded to sampleno) in duplicates (Run 1 and 2) in a reference analyzer (X) and a test analyzer (Y). T3, MoK, LyS etc. are analytes measured in the blood samples by the analyzers. We need to compare the results of the analytes between the analyzers. I hereby attach a sample of my data:

                    Code:
                    Sampleid  Sampleno    Run     Date       T3_X     T3_Y   MoK_X  MoK_Y   LyS_X   LyS_Y    etc.
                    -------------------------------------------------------------------------------------------                                         
                    321           1       1       23/1       89       90.4   31.2   32.1    12      12.3    
                    321           1       2       23/1       92.8     91.5   31.9   31.3    13.1    12.5    
                    345           2       1       23/1       86.4     83.1   30.4   30.2    12.3    12.5    
                    345           2       2       23/1       86.7     84.9   31     30.5    12.2    12.7    
                    600           3       1       25/1       84.7     85.4   31.1   31.5    12.1    12.9    
                    600           3       2       25/1       85.8     86.1   31.6   31.1    12.8    12.7
                    When I run the last code you sent, I get the following:

                    Code:
                    gen id = _n
                    reshape long @_X @_Y, i(id) j(which) string
                    encode which, gen(WHICH)
                    
                    ssc install rangestat
                    rangestat (corr) _X _Y, int(WHICH 0 0)
                    tabdisp WHICH, c(corr*)
                    
                    // RESULTS //
                    
                    . gen id = _n
                    
                    . reshape long @_X @_Y, i(id) j(which) string
                    (note: j = APNA Flags T3 MoK LyS GB PR LM MH CH CV MU PW CT DW PT RY RW etc.)
                    variable T3_X type mismatch with other @_X variables
                    r(198);
                    
                    . encode which, gen(WHICH)
                    variable which not found
                    r(111);
                    
                    . ssc install rangestat
                    checking rangestat consistency and verifying not already installed...
                    all files already exist and are up to date.
                    
                    . rangestat (corr) _X _Y, int(WHICH 0 0)
                    variable WHICH not found
                    r(111);
                    
                    . tabdisp WHICH, c(corr*)
                    variable WHICH not found
                    r(111);
                    It seems like the analysis stops when it comes to the numeric variables.

                    And when I try your first code, I get the following:

                    Code:
                    unab stubs : *_BM800
                    local stubs : subinstr local stubs "_BM800" "", all
                    local nstubs : word count `stubs'
                    
                    matrix results = J(`nstubs', 1, .) 
                    
                    local i = 1
                    foreach s of local stubs {
                        quietly corr `s'_BM800 `s'_BM850
                        mat results[`i', 1] = r(rho)
                        local ++i
                    }
                    
                    mat rownames results = `stubs'
                    mat li results, format(%05.3f)
                    
                    // RESULTS //
                    
                    unab stubs : *_X
                    local stubs : subinstr local stubs "_X" "", all
                    local nstubs : word count `stubs'
                     matrix results = J(`nstubs', 1, .) 
                    
                    local i = 1
                    
                    foreach s of local stubs {
                      2. 
                    .     quietly corr `s'_X `s'_Y
                      3. 
                    .     mat results[`i', 1] = r(rho)
                      4. 
                    .     local ++i
                      5. 
                    . }
                    too few variables specified
                    r(102);
                    
                    mat rownames results = `stubs'
                    mat li results, format(%05.3f)
                    
                    results[23,1]
                           c1
                     APNA   .
                    Flags   .
                      T3    .
                      MoK   .
                      LyS   .
                    Am I doing something wrong using both codes? Should I consider something when performing the analysis on the first replicates (Run 1) only?



                    • #11
                      You have yet to give a data example in our sense, namely as produced by dataex.

                      What I can guess -- nothing in your posts makes it completely explicit -- is that APNA_* and Flags_* are completely different kinds of variables, string variables containing other information, and so are not suitable for correlation analysis. So the reshape fails, and once the reshape fails nothing else of consequence can possibly work; keeping going with the later commands is over-optimistic.
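
                      One way to check that guess is sketched here, on the assumption that the paired variables all follow the *_X and *_Y naming pattern. ds with the has(type string) option lists only the string variables among them. The local name strvars is made up for illustration, and the block should be run in one go so that the local stays defined.

                      Code:
                      * list any string variables among the paired variables
                      ds *_X *_Y, has(type string)
                      local strvars `r(varlist)'
                      display "string variables: `strvars'"
                      
                      * drop them only if any were found, so that reshape
                      * sees numeric variables alone (work on a copy of the data)
                      if "`strvars'" != "" {
                          drop `strvars'
                      }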

                      This code works with the sandbox dataset and is more likely to work with yours, but without a copy of your dataset I cannot be certain.

                      Code:
                      * sandbox dataset
                      clear
                      set obs 12
                      set seed 2803
                      foreach s in AL BS CD {    
                          gen `s'_X = rnormal()    
                          gen `s'_Y = rnormal()    
                          replace `s'_Y = . if runiform() < 0.4
                      }  
                      
                      * you start here  
                      keep *_X *_Y 
                      capture drop APNA* Flags* 
                      
                      gen id = _n
                      reshape long @_X @_Y, i(id) j(which) string
                      encode which, gen(WHICH)
                      
                      ssc install rangestat
                      rangestat (corr) _X _Y, int(WHICH 0 0)
                      tabdisp WHICH, c(corr*)
                      That code does not distinguish between samples and runs and pools everything that comes from each prefix, such as AL BS CD.
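
                      If only the first replicates are wanted, one possibility is to restrict the observations before the keep statement, because keep *_X *_Y discards Run. This is a sketch, assuming Run is a numeric variable coded 1 and 2 as in the listing in #10, and that rangestat is already installed:

                      Code:
                      * you start here, first replicates only
                      keep if Run == 1
                      keep *_X *_Y
                      capture drop APNA* Flags*
                      
                      gen id = _n
                      reshape long @_X @_Y, i(id) j(which) string
                      encode which, gen(WHICH)
                      
                      * one correlation per prefix, now based on Run 1 alone
                      rangestat (corr) _X _Y, int(WHICH 0 0)
                      tabdisp WHICH, c(corr*)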

                      I haven't looked closely at your second block of code. It starts out with variables *_BM800 which you haven't told us about.

                      If you're reliant on Statalist for support you have to give us enough information for us to understand your questions. It's as simple as that.



                      • #12
                        Thank you Nick for your help. I appreciate it. The code is working now. I will read about how to include a copy of my data for future posts. And also how to reshape wide again.



                        • #13
                          Excellent. Thanks for the closure!

