  • Correlation analysis

    I want to evaluate the correlation between variable X and variable Y measured in each eye. Since each subject contributed two eyes, I was told that the Pearson and Spearman correlation cannot be used. Apart from using data from 1 eye per subject, is there any other way I can study the correlation? Thanks!

  • #2
    Kelvin:
    You can consider a regression model with standard errors clustered on -patientid-.
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Thanks for your reply, Carlo. May I know how the syntax would be constructed? (Sorry, my knowledge is very limited.) This is my data set; thanks again!


      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte PatientID int VarX str7 VarY byte Side
       1 295 "1.28" 1
       1 284 "1.2"  0
       2 345 "1.1"  1
       2 356 "1.13" 0
       3 371 "1.12" 1
       3 377 "0.95" 0
       4 283 "1.02" 1
       4 316 "1.18" 0
       5 279 "1.43" 1
       5 285 "1.16" 0
       6 313 "1.24" 1
       6 313 "1.36" 0
       7 336 "1.22" 1
       7 294 "1.14" 0
       8 309 "1.07" 1
       8 292 "1.08" 0
       9 374 "1.49" 1
       9 359 "1.53" 0
      10 330 "1.01" 1
      10 310 "0.95" 0
      11 337 "1.21" 1
      11 325 "1.1"  0
      12 348 "1.03" 1
      12 327 "1.06" 0
      13 352 "1.08" 1
      13 351 "1.12" 0
      end

      • #4
        Kelvin:
        thanks for providing an example of your data via -dataex-.
        What you're probably after is:
        Code:
        . input byte PatientID int VarX str7 VarY byte Side

             Patien~D      VarX       VarY      Side
          1.  1 295 "1.28" 1
          2.  1 284 "1.2"  0
          3.  2 345 "1.1"  1
          4.  2 356 "1.13" 0
          5.  3 371 "1.12" 1
          6.  3 377 "0.95" 0
          7.  4 283 "1.02" 1
          8.  4 316 "1.18" 0
          9.  5 279 "1.43" 1
         10.  5 285 "1.16" 0
         11.  6 313 "1.24" 1
         12.  6 313 "1.36" 0
         13.  7 336 "1.22" 1
         14.  7 294 "1.14" 0
         15.  8 309 "1.07" 1
         16.  8 292 "1.08" 0
         17.  9 374 "1.49" 1
         18.  9 359 "1.53" 0
         19. 10 330 "1.01" 1
         20. 10 310 "0.95" 0
         21. 11 337 "1.21" 1
         22. 11 325 "1.1"  0
         23. 12 348 "1.03" 1
         24. 12 327 "1.06" 0
         25. 13 352 "1.08" 1
         26. 13 351 "1.12" 0
         27. end
        
        . destring VarY, generate(new_VarY) // -destring- -VarY- to numeric: -string- variables are not suitable as regressand or regressors
        VarY: all characters numeric; new_VarY generated as double
        
        . regress new_VarY VarX, vce(cluster PatientID)
        
        Linear regression                               Number of obs     =         26
                                                        F(1, 12)          =       0.00
                                                        Prob > F          =     0.9902
                                                        R-squared         =     0.0000
                                                        Root MSE          =     .15505
        
                                     (Std. Err. adjusted for 13 clusters in PatientID)
        ------------------------------------------------------------------------------
                     |               Robust
            new_VarY |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                VarX |   .0000204   .0016302     0.01   0.990    -.0035315    .0035723
               _cons |   1.157206   .5121653     2.26   0.043      .041294    2.273119
        ------------------------------------------------------------------------------
        
        .
        Kind regards,
        Carlo
        (Stata 19.0)

        • #5
          Kelvin did not really say why a simple correlation coefficient would not suffice, given that this is really what he is after. Carlo's suggestion implies that the standard error for a simple correlation coefficient might be off, since the two eyes of a subject cannot be regarded as independent observations. Carlo corrects for this non-independence by means of clustered standard errors in a regression framework.

          However, some details are rather implicit here. First, regressing VarY on VarX will yield a different point estimate than regressing VarX on VarY. That is fine, since Kelvin is not interested in regression coefficients, anyway. But where is the correlation coefficient that Kelvin is after? Picking up where Carlo's example ends, Kelvin may type

          Code:
          display sqrt(e(r2))
          to find the desired correlation coefficient. The p-value reported for the F-test can be used to test it against zero.
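
          Note that the square root is always non-negative, so the sign of the relationship must be taken from the slope. A minimal sketch, assuming the regression of new_VarY on VarX from #4 was the last estimation:

          ```stata
          * signed Pearson correlation from the last -regress- fit (assumes
          * -regress new_VarY VarX, vce(cluster PatientID)- was just run)
          display sign(_b[VarX]) * sqrt(e(r2))
          ```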

          Alternatively, Kelvin might want to standardize his variables before running the regression. That will yield the standardized regression coefficient, which equals the correlation. To test against zero, Kelvin should probably run both the model regressing VarY on VarX and the one regressing VarX on VarY, then pick whichever standard error is larger to get a conservative test result.
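
          The standardization route can be sketched like this, assuming the -dataex- data from #3 are in memory and VarY has been converted to the numeric new_VarY (as in #4):

          ```stata
          * standardize both variables, then regress with clustered SEs
          egen z_x = std(VarX)                      // (VarX - mean) / sd
          egen z_y = std(new_VarY)
          regress z_y z_x, vce(cluster PatientID)   // slope equals Pearson's r
          regress z_x z_y, vce(cluster PatientID)   // same slope; SE may differ
          ```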

          Taking one step back and starting with a simple scatter plot, there does not seem to be an obvious relationship between the variables in the dataset. Type

          Code:
          graph twoway scatter VarX new_VarY || lfit VarX new_VarY, by(Side)
          to check.

          Last, I am not completely convinced that a correlation coefficient (e.g., Pearson, Spearman, Intraclass, ...) is the best choice here, but that depends on the exact research question.

          Best
          Daniel
          Last edited by daniel klein; 03 Jul 2018, 02:16.

          • #6
            Carlo and Daniel: Thanks so much for your help and detailed explanation!

            Daniel:
            I am just testing whether there is a relationship between these two variables (VarX and VarY), so it appears that a correlation coefficient (Pearson or Spearman) would be the most appropriate. The reason a "simple" correlation coefficient wouldn't work (as I was told) is that I have used two eyes from the same subject, which violates the independence assumption.

            And yes, I don't expect any relationship between VarX and VarY, so the scatterplot supports my hypothesis.

            By running
            Code:
            display sqrt(e(r2))
            after the regression suggested by Carlo, what should I call this correlation coefficient? Pearson? Spearman? Or another term?

            Also, may I ask how I can "standardize the variables before running the regression" to get the correlation, as you suggest?

            Thanks again!

            • #7
              Originally posted by Kelvin Wan View Post
              A "simple" correlation coefficient wouldn't work (as I was told) is because I have used two eyes from the same subject, so this violates the assumption.
              Which assumption do you (or the person who told you) mean?

              by running
              Code:
              display sqrt(e(r2))
              after the regression suggested by Carlo, what should I call this correlation coefficient? Pearson? Spearman? Or another term?
              Do not be afraid to play around with these things. Type:

              Code:
              correlate new_VarY VarX
              and compare the resulting coefficient with the square root of R-squared (or: r-squared) before. Hint: correlate estimates Pearson's correlation coefficient; it is usually called r.

              By standardizing variables, I mean subtracting the mean and dividing by the standard deviation. The result will be exactly the same as what you get from the square root of R-squared.
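
              Spelled out by hand, and assuming the numeric new_VarY from #4 is in memory, that is:

              ```stata
              * subtract the mean and divide by the standard deviation
              quietly summarize VarX
              generate double z_VarX = (VarX - r(mean)) / r(sd)
              quietly summarize new_VarY
              generate double z_VarY = (new_VarY - r(mean)) / r(sd)
              regress z_VarY z_VarX, vce(cluster PatientID)
              ```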

              Best
              Daniel

              • #8
                To the best of my (limited) understanding: because variables X and Y were each measured twice per subject (once per eye), the measurements within X and within Y are correlated. Therefore I cannot use Pearson's correlation coefficient, because it assumes that the measurements are independent.

                I just tried:
                Code:
                correlate new_VarY VarX
                I got the same correlation coefficient as taking the square root of R-squared after running the syntax suggested by Carlo.

                So does this mean I can just use the "simple" Pearson correlation instead of running a regression model clustered on each patient (previously suggested by Carlo)? I am so confused.

                • #9
                  To the best of my (limited) understanding: because variables X and Y were each measured twice per subject, the measurements within X and within Y are correlated.
                  They probably are.

                  Therefore I cannot use Pearson's correlation coefficient, because it assumes that the measurements are independent.
                  A correlation coefficient is not a 'test' in the statistical sense; it can be regarded as a point estimate. We are often interested in testing a point estimate (against zero or other values of interest). This is (part of) what is called statistical inference. For statistical inference, we usually need a standard error, and many estimators of this standard error indeed assume independent observations. That is why Carlo suggested one possible way to deal with the clustered observations.
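
                  One way to get cluster-aware inference for Pearson's r directly is a bootstrap that resamples patients rather than eyes; a sketch, assuming the numeric new_VarY from #4:

                  ```stata
                  * cluster bootstrap of Pearson's r: resample whole patients
                  bootstrap r(rho), cluster(PatientID) reps(1000) seed(123): ///
                      correlate new_VarY VarX
                  ```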

                  I just tried:
                  Code:
                   correlate new_VarY VarX
                  I got the same correlation coefficient as taking the square root of R-squared after running the syntax suggested by Carlo.

                  So does this mean I can just use the "simple" Pearson correlation instead of running a regression model clustered on each patient (previously suggested by Carlo)? I am so confused.
                  That depends on what you want. If you are only interested in the correlation coefficient as a point estimate then correlate is the most direct way to get what you want. If you want to test this coefficient against Zero, then you need to think about the clustered observations. If you want something entirely different you need something different.

                  Best
                  Daniel
                  Last edited by daniel klein; 03 Jul 2018, 09:09.

                  • #10
                    Thanks, Daniel, for your very elaborate explanation. I have a much better understanding of the issue now.
