Pairwise correlations

Amanda Ode

Join Date: Mar 2019

Posts: 62
#1

20 Mar 2019, 07:34

Hi,

I am very new to Stata and need help obtaining only the relevant correlation coefficients when running the pwcorr or spearman commands.

Example:
I have the following variables in my data set: Weight-a, Weight-b, BMI-a, BMI-b, Height-a, Height-b
I want to get the following table directly:

Variable   r-value
Weight*    1.0
BMI        1.0
Height     0.9

*for the correlation between Weight-a and Weight-b, etc.

Is there any way to get the table above instead of the traditional one, where every variable is compared with all the others?
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

20 Mar 2019, 07:48

This text may interest you. That said, there are at least two aspects here: one is obtaining specific paired correlations, the other is presenting them in a table. Personally, I just copy the results and paste them into a customized table.

Best regards,

Marcos

Nick Cox

Join Date: Mar 2014
Posts: 35418
#3

20 Mar 2019, 08:05

Yes. cpcorr (SSC) does this.

Code:

. sysuse auto, clear
(1978 Automobile Data)

. cpcorr price headroom-gear \ mpg
(obs=74)

                  mpg
       price  -0.4686
    headroom  -0.4138
       trunk  -0.5816
      weight  -0.8072
      length  -0.7958
        turn  -0.7192
displacement  -0.7056
  gear_ratio   0.6162

. cpcorr price headroom-gear \ mpg, format(%04.3f)
(obs=74)

                 mpg
       price  -0.469
    headroom  -0.414
       trunk  -0.582
      weight  -0.807
      length  -0.796
        turn  -0.719
displacement  -0.706
  gear_ratio   0.616

. ssc desc cpcorr

--------------------------------------------------------------------------------------------------------------------------------------------------
package cpcorr from http://fmwww.bc.edu/repec/bocode/c
--------------------------------------------------------------------------------------------------------------------------------------------------

TITLE
      'CPCORR': module for correlations for each row vs each column variable

DESCRIPTION/AUTHOR(S)
      
      cpcorr produces a matrix of correlations for rowvarlist versus
      colvarlist. cpspear does the same for Spearman correlations. This
      matrix may thus be oblong, and need not be square. Both also
      allow a single varlist.
      
      KW: correlate
      KW: matrix
      KW: oblong
      
      Requires: Stata version 10.0 (6.0 for cpcorr6, cpspear6)
      
      
      Author: Nicholas J. Cox, Durham University
      Support: email [email protected]
      
      Distribution-Date: 20150916
      

INSTALLATION FILES                               (type net install cpcorr)
      cpcorr.ado
      cpcorr.sthlp
      cpspear.ado
      cpspear.sthlp
      cpcorr6.ado
      cpcorr6.hlp
      cpspear6.ado
      cpspear6.hlp
--------------------------------------------------------------------------------------------------------------------------------------------------
(type ssc install cpcorr to install)
.


Amanda Ode

Join Date: Mar 2019

Posts: 62
#4

21 Mar 2019, 11:56

Thank you Nick! It worked perfectly. As I am comparing results for analytes that were each measured in analyzer X and analyzer Y, I tried a couple of loops to avoid writing out the whole varlist. First, I tried the following loop, but it did not work for variables without a suffix (_X or _Y) or for string variables:

Code:

foreach var of varlist `var'_* {
    cpcorr `var'_X \ `var'_Y
}

Then I tried the following loop and it worked:

Code:

foreach stub in AL_ BS_ CD_ FR_ MAC_ RAW_ RD_ PT_ PV_ CT_ PW_ LP_ JB_ WBC_ LM_ MY_ GN_ LY_ MID_ GO_ {
    cpcorr `stub'X \ `stub'Y
}

Still, I have to write the prefix (= name of the analyte) for each variable. Is there a way to get the results without typing the whole varlist? Even better would be to get the results in one table (instead of repeating the same command in a loop, which creates many tables). I'm a new Stata user and appreciate every single tip.

Nick Cox

Join Date: Mar 2014
Posts: 35418
#5

21 Mar 2019, 15:32

As I understand it, AL_X and AL_Y are just two variables so there is no gain in using cpcorr rather than correlate. Something like this may be more to your taste:

Code:

* sandbox dataset
clear
set obs 7
set seed 2803
foreach s in AL BS CD {
    gen `s'_X = rnormal()
    gen `s'_Y = rnormal()
}

* you start here
unab stubs : *_X
local stubs : subinstr local stubs "_X" "", all
local nstubs : word count `stubs'

matrix results = J(`nstubs', 1, .)

local i = 1
foreach s of local stubs {
    quietly corr `s'_X `s'_Y
    mat results[`i', 1] = r(rho)
    local ++i
}

mat rownames results = `stubs'
mat li results, format(%05.3f)

* results I got follow here

results[3,1]
        c1
AL   0.366
BS  -0.490
CD  -0.454

The trick in getting the stubs in one place is described at https://www.stata.com/support/faqs/d...-with-reshape/


Amanda Ode

Join Date: Mar 2019

Posts: 62
#6

22 Mar 2019, 04:31

Thank you Nick for the code. Almost everything in the code is new to me, so I have a long list of commands to learn about, which I find fun and challenging. Right now I get the error message r(198) for the "matrix results = J(`nstubs', 1, .)" line when I run the code on the whole dataset. I'm still investigating the reason...

Nick Cox

Join Date: Mar 2014
Posts: 35418
#7

22 Mar 2019, 04:40

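I guess you are trying to execute the commands line by line from a do-file editor window. If so, don't do that. Local macros of the form `foo' are visible only within the chunk of code that is executed together, so a definition run on one occasion is not seen by a line run separately afterwards.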
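As a minimal sketch of that point (the macro value 3 is purely illustrative and not taken from any dataset in this thread), the lines below work when executed together in one go. If the matrix line is run on its own, `nstubs' expands to nothing and Stata is handed J(, 1, .), which would be consistent with the r(198) reported in #6.

```stata
* Run these lines together in one go: select them all in the do-file editor
* and execute the selection, or run the whole do-file.
local nstubs = 3

* `nstubs' is still defined here because both lines ran in the same chunk
matrix results = J(`nstubs', 1, .)
matrix list results

* Executed separately, the matrix line would see an empty `nstubs',
* that is, J(, 1, .), which is not valid syntax.
```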

Here's another approach using the reshape suggested in your other thread. https://www.statalist.org/forums/for...-using-tabstat

You need to install rangestat (SSC) to do this, but the command to do that is included here.

Code:

* sandbox dataset
clear
set obs 12
set seed 2803
foreach s in AL BS CD {
    gen `s'_X = rnormal()
    gen `s'_Y = rnormal()
    replace `s'_Y = . if runiform() < 0.4
}

* you start here

gen id = _n
reshape long @_X @_Y, i(id) j(which) string
encode which, gen(WHICH)
ssc install rangestat
rangestat (corr) _X _Y if !missing(_X, _Y), int(WHICH 0 0)
sort WHICH corr_x
tabdisp WHICH, c(corr*)

------------------------------------------------------
    WHICH |            corr_nobs  correlation of _X _Y
----------+-------------------------------------------
       AL |                    7             .11365776
       BS |                    7            -.05947031
       CD |                    5             .33442341
------------------------------------------------------

Plenty of scope for nicer variable names, etc. In practice rangestat would ignore missings anyway, but the principle is worth making explicit.


Amanda Ode

Join Date: Mar 2019

Posts: 62
#8

25 Mar 2019, 02:16

Thanks Nick. You were right. I executed the commands line by line from a do-file. When I run the first code now, Stata complains about the number of variables at the end of the loop: "too few variables specified", r(102). Am I doing something wrong? Or should I do something with my data first?
Nick Cox

Join Date: Mar 2014

Posts: 35418
#9

25 Mar 2019, 03:17

Clearly you are doing something wrong. But I can't see what it is because you don't give a data example or show the exact code you're using. https://www.statalist.org/forums/help#stata applies, always.

I recommend the approach in #7 over that in #5. The code in #7 can be simplified to

Code:

* sandbox dataset
clear
set obs 12
set seed 2803
foreach s in AL BS CD {
    gen `s'_X = rnormal()
    gen `s'_Y = rnormal()
    replace `s'_Y = . if runiform() < 0.4
}

* you start here
gen id = _n
reshape long @_X @_Y, i(id) j(which) string
encode which, gen(WHICH)
ssc install rangestat
rangestat (corr) _X _Y, int(WHICH 0 0)
tabdisp WHICH, c(corr*)

Last edited by Nick Cox; 25 Mar 2019, 03:36.

Amanda Ode

Join Date: Mar 2019
Posts: 62

#10

25 Mar 2019, 07:21

Dear Nick,

As I mentioned in my other post, we analyzed blood samples (with unique Sampleid, then recoded to sampleno) in duplicates (Run 1 and 2) in a reference analyzer (X) and a test analyzer (Y). T3, MoK, LyS etc. are analytes measured in the blood samples by the analyzers. We need to compare the results of the analytes between the analyzers. I hereby attach a sample of my data:

Code:

Sampleid  Sampleno    Run     Date       T3_X     T3_Y   MoK_X  MoK_Y   LyS_X   LyS_Y    etc.
-------------------------------------------------------------------------------------------                                         
321           1       1       23/1       89       90.4   31.2   32.1    12      12.3    
321           1       2       23/1       92.8     91.5   31.9   31.3    13.1    12.5    
345           2       1       23/1       86.4     83.1   30.4   30.2    12.3    12.5    
345           2       2       23/1       86.7     84.9   31     30.5    12.2    12.7    
600           3       1       25/1       84.7     85.4   31.1   31.5    12.1    12.9    
600           3       2       25/1       85.8     86.1   31.6   31.1    12.8    12.7

When I run the last code you sent, I get the following:

Code:

gen id = _n
reshape long @_X @_Y, i(id) j(which) string
encode which, gen(WHICH)

ssc install rangestat
rangestat (corr) _X _Y, int(WHICH 0 0)
tabdisp WHICH, c(corr*)

// RESULTS //

. gen id = _n

. reshape long @_X @_Y, i(id) j(which) string
(note: j = APNA Flags T3 MoK LyS GB PR LM MH CH CV MU PW CT DW PT RY RW etc.)
variable T3_X type mismatch with other @_X variables
r(198);

. encode which, gen(WHICH)
variable which not found
r(111);

. ssc install rangestat
checking rangestat consistency and verifying not already installed...
all files already exist and are up to date.

. rangestat (corr) _X _Y, int(WHICH 0 0)
variable WHICH not found
r(111);

. tabdisp WHICH, c(corr*)
variable WHICH not found
r(111);

It seems like the analysis stops when it comes to the numeric variables.

And when I try your first code, I get the following:

Code:

unab stubs : *_BM800
local stubs : subinstr local stubs "_BM800" "", all
local nstubs : word count `stubs'

matrix results = J(`nstubs', 1, .) 

local i = 1
foreach s of local stubs {
    quietly corr `s'_BM800 `s'_BM850
    mat results[`i', 1] = r(rho)
    local ++i
}

mat rownames results = `stubs'
mat li results, format(%05.3f)

// RESULTS //

unab stubs : *_X
local stubs : subinstr local stubs "_X" "", all
local nstubs : word count `stubs'
 matrix results = J(`nstubs', 1, .) 

local i = 1

foreach s of local stubs {
  2. 
.     quietly corr `s'_X `s'_Y
  3. 
.     mat results[`i', 1] = r(rho)
  4. 
.     local ++i
  5. 
. }
too few variables specified
r(102);

mat rownames results = `stubs'
mat li results, format(%05.3f)

results[23,1]
       c1
 APNA   .
Flags   .
  T3    .
  MoK   .
  LyS   .

Am I doing something wrong using both codes? Should I consider something when performing the analysis on the first replicates (Run 1) only?


Nick Cox

Join Date: Mar 2014

Posts: 35418
#11

25 Mar 2019, 08:11

You have yet to give a data example in our sense, namely as produced by dataex.

What I can guess (nothing in your posts makes it completely explicit) is that APNA_* and Flags_* are completely different kinds of variables, namely string variables containing other information, and so are not suitable for correlation analysis. So the reshape fails, and once the reshape fails nothing else of consequence can possibly work; carrying on with the later commands is over-optimistic.
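If string variables sharing the _X and _Y suffixes are indeed the problem, one possible way to sidestep them in the loop-based approach is to collect only the numeric *_X variables before stripping the suffix. The following is a sketch under that assumption: the sandbox variables, including the string pair Flags_X and Flags_Y, are made up for illustration and are not the poster's data.

```stata
* sandbox dataset (made-up names): two numeric pairs plus one string pair
clear
set obs 12
set seed 2803
foreach s in AL BS {
    gen `s'_X = rnormal()
    gen `s'_Y = rnormal()
}
gen Flags_X = "ok"
gen Flags_Y = "ok"

* collect only the numeric *_X variables, then strip the suffix to get stubs
ds *_X, has(type numeric)
local stubs `r(varlist)'
local stubs : subinstr local stubs "_X" "", all

foreach s of local stubs {
    * skip any stub whose _Y partner does not exist or is not numeric
    capture confirm numeric variable `s'_Y
    if _rc continue
    quietly corr `s'_X `s'_Y
    display "`s'" _col(12) %6.3f r(rho)
}
```

Note that subinstr removes every occurrence of "_X" in the list, so a variable name containing "_X" somewhere other than at the end would need different handling.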

This code works with the sandbox dataset and is more likely to work with yours, but without a copy of your dataset I cannot be certain.

Code:

* sandbox dataset
clear
set obs 12
set seed 2803
foreach s in AL BS CD {
    gen `s'_X = rnormal()
    gen `s'_Y = rnormal()
    replace `s'_Y = . if runiform() < 0.4
}

* you start here
keep *_X *_Y
capture drop APNA* Flags*
gen id = _n
reshape long @_X @_Y, i(id) j(which) string
encode which, gen(WHICH)
ssc install rangestat
rangestat (corr) _X _Y, int(WHICH 0 0)
tabdisp WHICH, c(corr*)

That code does not distinguish between samples and runs and pools everything that comes from each prefix, such as AL BS CD.
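If separate correlations per run are wanted, one possible variation is to carry a run identifier through the reshape and use the by() option of rangestat. This is only a sketch: the variable Run and its values are invented for the sandbox, and rangestat must already be installed from SSC.

```stata
* sandbox dataset with a made-up Run variable (1 or 2)
clear
set obs 12
set seed 2803
gen Run = 1 + mod(_n - 1, 2)
foreach s in AL BS CD {
    gen `s'_X = rnormal()
    gen `s'_Y = rnormal()
}

* keep Run alongside the paired variables; it is constant within id
keep Run *_X *_Y
gen id = _n
reshape long @_X @_Y, i(id) j(which) string
encode which, gen(WHICH)

* one correlation per prefix, computed separately within each run
* (requires rangestat: ssc install rangestat)
rangestat (corr) _X _Y, int(WHICH 0 0) by(Run)
tabdisp WHICH Run, c(corr_x)
```

To analyze the first replicates only, `keep if Run == 1` before the reshape would be a simpler alternative.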

I haven't looked closely at your second block of code. It starts out with variables *_BM800 which you haven't told us about.

If you're reliant on Statalist for support you have to give us enough information for us to understand your questions. It's as simple as that.
Amanda Ode

Join Date: Mar 2019

Posts: 62
#12

26 Mar 2019, 02:00

Thank you Nick for your help. I appreciate it. The code is working now. I will read about how to include a copy of my data for future posts. And also how to reshape wide again.
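On reshaping wide again, a minimal sketch using the same sandbox data as the thread would be to repeat the reshape with the same stubs and options in the other direction. Any variables created while the data are in long layout that vary within id (for example WHICH or the rangestat results) would need to be dropped first; none are created in this sketch.

```stata
* sandbox dataset, as in the thread
clear
set obs 12
set seed 2803
foreach s in AL BS CD {
    gen `s'_X = rnormal()
    gen `s'_Y = rnormal()
}
gen id = _n

* wide to long: AL_X, BS_X, CD_X become _X, with the prefix stored in which
reshape long @_X @_Y, i(id) j(which) string

* ... analysis in long layout goes here ...

* long to wide again, using the same stubs, i() and j()
reshape wide @_X @_Y, i(id) j(which) string
describe
```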
Nick Cox

Join Date: Mar 2014

Posts: 35418
#13

26 Mar 2019, 02:32

Excellent. Thanks for the closure!
