  • Correlation analysis

    I want to evaluate the correlation between variable X and variable Y measured in each eye. Since each subject contributed two eyes, I was told that the Pearson and Spearman correlation cannot be used. Apart from using data from 1 eye per subject, is there any other way I can study the correlation? Thanks!

  • #2
    Kelvin:
    You can consider a regression model with standard errors clustered on -patientid-.
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Thanks for your reply, Carlo. May I know how the syntax would be constructed? (Sorry, my knowledge is very limited.) This is my data set; thanks again!


      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte PatientID int VarX str7 VarY byte Side
       1 295 "1.28" 1
       1 284 "1.2"  0
       2 345 "1.1"  1
       2 356 "1.13" 0
       3 371 "1.12" 1
       3 377 "0.95" 0
       4 283 "1.02" 1
       4 316 "1.18" 0
       5 279 "1.43" 1
       5 285 "1.16" 0
       6 313 "1.24" 1
       6 313 "1.36" 0
       7 336 "1.22" 1
       7 294 "1.14" 0
       8 309 "1.07" 1
       8 292 "1.08" 0
       9 374 "1.49" 1
       9 359 "1.53" 0
      10 330 "1.01" 1
      10 310 "0.95" 0
      11 337 "1.21" 1
      11 325 "1.1"  0
      12 348 "1.03" 1
      12 327 "1.06" 0
      13 352 "1.08" 1
      13 351 "1.12" 0
      end

      • #4
        Kelvin:
        thanks for providing an example of your data via -dataex-.
        What you're probably after is:
        Code:
        . input byte PatientID int VarX str7 VarY byte Side

             Patien~D      VarX       VarY      Side
          1.  1 295 "1.28" 1
          2.  1 284 "1.2"  0
          3.  2 345 "1.1"  1
          4.  2 356 "1.13" 0
          5.  3 371 "1.12" 1
          6.  3 377 "0.95" 0
          7.  4 283 "1.02" 1
          8.  4 316 "1.18" 0
          9.  5 279 "1.43" 1
         10.  5 285 "1.16" 0
         11.  6 313 "1.24" 1
         12.  6 313 "1.36" 0
         13.  7 336 "1.22" 1
         14.  7 294 "1.14" 0
         15.  8 309 "1.07" 1
         16.  8 292 "1.08" 0
         17.  9 374 "1.49" 1
         18.  9 359 "1.53" 0
         19. 10 330 "1.01" 1
         20. 10 310 "0.95" 0
         21. 11 337 "1.21" 1
         22. 11 325 "1.1"  0
         23. 12 348 "1.03" 1
         24. 12 327 "1.06" 0
         25. 13 352 "1.08" 1
         26. 13 351 "1.12" 0
         27. end
        
        . destring VarY, generate(new_VarY) // -destring- -VarY- to numeric: -string- variables are not suitable as regressand or regressors
        VarY: all characters numeric; new_VarY generated as double
        
        . regress new_VarY VarX, vce(cluster PatientID)
        
        Linear regression                               Number of obs     =         26
                                                        F(1, 12)          =       0.00
                                                        Prob > F          =     0.9902
                                                        R-squared         =     0.0000
                                                        Root MSE          =     .15505
        
                                     (Std. Err. adjusted for 13 clusters in PatientID)
        ------------------------------------------------------------------------------
                     |               Robust
            new_VarY |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                VarX |   .0000204   .0016302     0.01   0.990    -.0035315    .0035723
               _cons |   1.157206   .5121653     2.26   0.043      .041294    2.273119
        ------------------------------------------------------------------------------
        
        .
        Kind regards,
        Carlo
        (Stata 19.0)

        • #5
          Kelvin did not really say why a simple correlation coefficient would not suffice, given that this is really what he is after. Carlo's suggestion implies that the standard error for a simple correlation coefficient might be off, since the two eyes of a subject cannot be regarded as independent observations. Carlo corrects for this non-independence by means of clustered standard errors in a regression framework.

          However, some details are rather implicit here. First, regressing VarY on VarX will yield a different point estimate than regressing VarX on VarY. That is fine, since Kelvin is not interested in regression coefficients, anyway. But where is the correlation coefficient that Kelvin is after? Picking up where Carlo's example ends, Kelvin may type

          Code:
          display sqrt(e(r2))
          to find the desired correlation coefficient. The p-value reported for the F-test can be used to test it against zero.
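
          Note that the square root is always non-negative, so the sign of the relationship must be taken from the slope. A minimal sketch, assuming the regression of new_VarY on VarX from #4 was the last estimation:

          ```stata
          * signed Pearson correlation from the last -regress- fit (assumes
          * -regress new_VarY VarX, vce(cluster PatientID)- was just run)
          display sign(_b[VarX]) * sqrt(e(r2))
          ```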

          Alternatively, Kelvin might want to standardize his variables before running the regression. That will yield the standardized regression coefficient, which equals the correlation. To test against zero, Kelvin should probably run both the model regressing VarY on VarX and the one regressing VarX on VarY, then pick whichever standard error is larger to get a conservative test result.
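
          The standardization route can be sketched like this, assuming the -dataex- data from #3 are in memory and VarY has been converted to the numeric new_VarY (as in #4):

          ```stata
          * standardize both variables, then regress with clustered SEs
          egen z_x = std(VarX)                      // (VarX - mean) / sd
          egen z_y = std(new_VarY)
          regress z_y z_x, vce(cluster PatientID)   // slope equals Pearson's r
          regress z_x z_y, vce(cluster PatientID)   // same slope; SE may differ
          ```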

          Taking one step back and starting with a simple scatter plot, there does not seem to be an obvious relationship between the variables in the dataset. Type

          Code:
          graph twoway scatter VarX new_VarY || lfit VarX new_VarY, by(Side)
          to check.

          Last, I am not completely convinced that a correlation coefficient (e.g., Pearson, Spearman, Intraclass, ...) is the best choice here, but that depends on the exact research question.

          Best
          Daniel
          Last edited by daniel klein; 03 Jul 2018, 02:16.

          • #6
            Carlo and Daniel: Thanks so much for your help and detailed explanation!

            Daniel:
            I am just testing whether there is a relationship between these two variables (VarX and VarY), so it appears that a correlation coefficient (Pearson or Spearman) would be the most appropriate. The reason a "simple" correlation coefficient wouldn't work (as I was told) is that I have used two eyes from the same subject, which violates the independence assumption.

            And yes, I don't expect any relationship between VarX and VarY, so the scatterplot supports my hypothesis.

            By running
            Code:
            display sqrt(e(r2))
            after the regression suggested by Carlo, what should I call this correlation coefficient? Pearson? Spearman? Or another term?

            Also, may I ask how I can "standardize the variables before running the regression" to get the correlation, as you suggest?

            Thanks again!

            • #7
              Originally posted by Kelvin Wan View Post
              A "simple" correlation coefficient wouldn't work (as I was told) is because I have used two eyes from the same subject, so this violates the assumption.
              Which assumption do you (or the person who told you) mean?

              by running
              Code:
              display sqrt(e(r2))
              after the regression suggested by Carlo, what should I call this correlation coefficient? Pearson? Spearman? Or another term?
              Do not be afraid to play around with these things. Type:

              Code:
              correlate new_VarY VarX
              and compare the resulting coefficient with the square root of R-squared (or: r-squared) before. Hint: correlate estimates Pearson's correlation coefficient; it is usually called r.

              By standardizing variables, I mean subtracting the mean and dividing by the standard deviation. The result will be exactly the same as what you get from the square root of R-squared.
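
              Spelled out by hand, and assuming the numeric new_VarY from #4 is in memory, that is:

              ```stata
              * subtract the mean and divide by the standard deviation
              quietly summarize VarX
              generate double z_VarX = (VarX - r(mean)) / r(sd)
              quietly summarize new_VarY
              generate double z_VarY = (new_VarY - r(mean)) / r(sd)
              regress z_VarY z_VarX, vce(cluster PatientID)
              ```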

              Best
              Daniel

              • #8
                To the best of my (limited) understanding: because variables X and Y were each measured twice per subject (once per eye), the measurements within X and within Y are correlated. Therefore I cannot use Pearson's correlation coefficient, because it assumes that the measurements are independent.

                I just tried:
                Code:
                correlate new_VarY VarX
                I got the same correlation coefficient as taking the square root of R-squared after running the syntax suggested by Carlo.

                So does this mean I can just use the "simple" Pearson correlation instead of running a regression model clustered on each patient (previously suggested by Carlo)? I am so confused.

                • #9
                  To the best of my (limited) understanding: because variables X and Y were each measured twice per subject, the measurements within X and within Y are correlated.
                  They probably are.

                  Therefore I cannot use Pearson's correlation coefficient, because it assumes that the measurements are independent.
                  A correlation coefficient is not a 'test' in the statistical sense; it can be regarded as a point estimate. We are often interested in testing a point estimate (against zero or other values of interest). This is (part of) what is called statistical inference. For statistical inference, we usually need a standard error, and many estimators of this standard error indeed assume independent observations. That is why Carlo suggested one possible way to deal with the clustered observations.
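
                  One way to get cluster-aware inference for Pearson's r directly is a bootstrap that resamples patients rather than eyes; a sketch, assuming the numeric new_VarY from #4:

                  ```stata
                  * cluster bootstrap of Pearson's r: resample whole patients
                  bootstrap r(rho), cluster(PatientID) reps(1000) seed(123): ///
                      correlate new_VarY VarX
                  ```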

                  I just tried:
                  Code:
                   correlate new_VarY VarX
                  I got the same correlation coefficient as taking the square root of R-squared after running the syntax suggested by Carlo.

                  So does this mean I can just use the "simple" Pearson correlation instead of running a regression model clustered on each patient (previously suggested by Carlo)? I am so confused.
                  That depends on what you want. If you are only interested in the correlation coefficient as a point estimate then correlate is the most direct way to get what you want. If you want to test this coefficient against Zero, then you need to think about the clustered observations. If you want something entirely different you need something different.

                  Best
                  Daniel
                  Last edited by daniel klein; 03 Jul 2018, 09:09.

                  • #10
                    Thanks, Daniel, for your very elaborate explanation. I have a much better understanding of the issue now.
