  • Average correlation coefficient matrix

    Hello statalisters,

    I am trying to compute a matrix of correlation coefficients. I have observations spread across 84 quarters, and would like to compute the correlation coefficients of my variables within each quarter first, and then take the average across those 84 quarters.

    My data looks as follows:

    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double permno float(num_QUARTER EXRET ln_MV ln_MB) double(INCVOL RETVOL) float(DISP ln_AGE) double TURN float INSOWN
    10011 43 7.620689 4.475688 1.6873487 . 2.474256321751049 . -1.9906852 .09708822701941244 5.335223
    10011 44 3.627314 4.579211 .9194773 . 2.6414308282639967 . -2.0281482 .3400871268120305 12.897806
    10011 46 6.99464 4.776778 1.0684394 . 2.289853477549846 . -2.0927093 .28441275235626007 20.15268
    10011 47 -7.960851 4.6077433 .8482187 2.717672115669003 1.9411344246290014 . -2.123655 .2173368682463964 17.707294
    10011 48 .23876584 4.4052424 .6102028 2.67519623302301 3.0389313309939645 . -2.1514435 .4065265440382063 25.705116
    10011 49 -3.2104526 4.1643367 .3479661 2.532721266515545 3.061159067153737 . -2.1944811 .18068486331932007 34.033783
    10011 50 -5.325699 4.463703 .6218236 2.4463803968178697 1.588931415467767 . -2.2090268 .31440199141701064 19.195
    10011 51 -3.56842 4.446916 .5785106 2.334583952508747 1.2160774760870083 . -2.236621 .17153829795618852 14.916426
    10011 52 -.11500657 4.491021 .5861638 2.25453858594281 1.12889420875828 . -2.262904 .12345527016025569 39.19349
    10015 1 .2713797 2.975351 .7527837 . 4.805263268270954 . -.4768296 .14994475125680648 8.153242
    10015 2 -.029075697 3.2543395 .836909 . 2.9251311048068147 . -.6028927 .06539496875384115 3.864492
    10015 3 -13.90464 3.461979 .9384125 . 3.949698512442778 . -.7107987 .09006608091294765 15.327478
    10015 4 -2.116692 3.6982975 1.1322718 . 1.3982119956314099 . -.9212101 .26447932474735764 7.276555
    10016 5 -.8200542 5.824858 .4808456 . 2.480334731412854 . 1.4810567 .11132207642852639 0
    10016 6 3.152819 5.80124 .1965566 . 1.2837863481800118 . .7014003 .2595312267852326 0
    10016 7 6.870953 5.84867 .2159648 . 1.5579642600848447 . .29777858 .1645664483619233 0
    10016 8 -.6537237 5.945834 .26646048 . 1.4112492930481495 . -.016304731 .41616357962290446 0
    10016 10 3.6085796 6.238881 .496259 . 1.0891969206006966 . -.38981825 .09270969603676349 .
    10016 11 -2.0152378 6.24295 .4743074 . 1.4152395212290263 . -.55687237 .07975471673172808 .
    10016 12 -2.138123 5.894248 .11781038 . 2.854929422700498 . -.6999731 .13787257610820233 0
    10016 13 4.791698 6.025914 .28185353 . 2.0678123179526513 . -.8069649 .1536628291127272 0
    10016 14 -.3678725 6.168983 .29632685 . 1.3229930957585092 . -.9135472 .24335319901195665 .
    10016 15 5.258823 6.15718 .25509027 . 1.4379051690596525 . -1.0078579 .2500570541868607 .
    10016 16 -3.352444 6.30959 .4075858 . 1.8850191956611202 . -1.1067979 .05414249434071625 .
    10016 17 .4465417 6.355889 .4272098 . 1.4218536782028925 . -1.1784443 .09322355929907644 1.1767303
    10016 18 -6.536911 6.370505 .4192401 . 1.7918183611182745 . -1.2523714 .06250662739330437 0
    10016 19 1.9284724 6.20778 .50872856 . 1.519524575002625 . -1.3219385 .3172460706799381 0
    10016 20 .9492266 6.015412 .25807664 . 2.2227151377096934 . -1.402599 .12681499511624375 0
    10016 21 .5641308 6.105686 .31303495 . 1.6846580782507354 . -1.4441755 .19486639008391649 0
    10016 22 2.715161 6.139966 .3224233 . 2.713157257543933 . -1.5031638 .18034091155277565 0
    10016 23 7.286383 5.66947 -.1667544 . 5.316748029667575 . -1.5571347 .14007795741781592 0
    10016 24 -16.9453 5.549669 -.3145498 . 4.542604698957943 . -1.6143574 .179431816128393 0
    10016 25 8.111321 5.563562 -.3036111 . 3.7807510550224555 . -1.660704 .18063096818514168 70.212
    10016 26 2.3040948 5.727489 -.1513888 . 3.5056606391570195 . -1.703502 .21603378257714212 0
    10016 27 .8245112 5.633414 -.2612757 . 3.166245923705227 . -1.7464564 .11039680681161342 0
    10016 28 10.352745 5.508663 -.4272647 .8673182866282676 3.2682498175588663 . -1.7944955 .2028837788464694 0
    10016 29 -.07700622 5.94789 -.0005551329 .8129080140783217 2.16127910612237 . -1.8311558 .24367767465159748 0
    10016 30 -13.59457 5.823988 -.13622601 .7665897111991958 2.6940783891492117 . -1.8690586 .06557833659850681 74.952095
    10016 31 .3153377 5.626577 -.3344666 .7665897111991958 2.999377545392024 . -1.909644 .1504327189552808 0
    10016 32 2.0904431 5.737465 -.24028544 .836974770270605 2.014673960258895 . -1.948646 .13948722334268193 0
    10016 33 .6409574 5.826255 -.16725224 .7992138311994667 2.3724611581318595 . -1.98204 .135667040332919 70.53607
    10016 34 -3.5840454 5.752853 -.24011984 .7651898483983927 2.4007010173477474 . -2.015816 .08782547323062317 .4283751
    10016 35 -9.149167 5.771093 -.17797285 .7651898483983927 1.8627983959656782 . -2.0498998 .06947739201507741 0
    10016 36 6.431603 5.542513 .2281552 3.5851541593250156 4.82198026765289 . -2.0977657 .5772816515217225 0
    10016 37 -.322807 5.19246 -.014202964 3.4605721486564005 3.906781726763971 . -2.1203732 .48718820872406166 0
    10016 38 -9.567686 5.242568 .0891724 3.343341188878521 2.1095490468226696 . -2.147612 .13064499737154092 0
    10016 39 -12.797285 5.080534 .026705606 3.343341188878521 2.1512528659023964 . -2.1763072 .20890262837565388 0
    10016 40 12.408442 4.7252727 .547476 4.249474619486368 4.278965217368482 . -2.204504 .2218316688105978 0
    10016 41 -.8127576 5.108107 1.0574877 4.140813296672247 1.770487771512002 . -2.231045 .17409410091931932 0
    10016 42 5.496794 5.063531 1.1142317 4.150593166610387 2.814240918097085 . -2.2554648 .18607185781002045 0
    10016 43 -5.719497 5.494717 1.516213 4.1449059607530545 1.518472658709454 . -2.2759316 .305235676591595 0
    end


    I have tried to run the following code, but encounter an error message stating "no observations":

    Code:
    matrix R = J(9,9,0)
    levelsof num_QUARTER, local(levels)
    foreach qtr of local levels {
        qui correl EXRET ln_MV ln_MB INCVOL RETVOL DISP ln_AGE TURN INSOWN if num_QUARTER==`qtr'
        matrix R = R + r(C)
    }
    matrix R = R/84
    matrix list R
    Can anyone explain what I am doing wrong? Thanks

  • #2

    Try

    Code:
    levelsof num_QUARTER, local(levels)
    foreach qtr of local levels {
        di "`qtr'"
        correl EXRET ln_MV ln_MB INCVOL RETVOL DISP ln_AGE TURN INSOWN if num_QUARTER==`qtr'
        di
    }
    to see where the problem occurs.
    Last edited by Nick Cox; 24 Jan 2024, 10:27.


    • #3
      I'll save you the trouble. Just visually scanning your example data, the variable DISP has all missing values. The -corr- command, similar to regression commands, excludes from the estimation sample any observation where there is a missing value in any of the variables mentioned in the command. So, in every quarter, all the observations are omitted due to the missingness of DISP, and you get the error message informing you of that.

      Potential solutions: just omit DISP from your analysis--there is no way you can extract any information from an empty variable anyway. Another alternative is to use -pwcorr- instead of -corr-. -pwcorr- calculates the correlation of each pair of variables from the observations that have non-missing values for both of those variables; it doesn't care if some other variable has a missing value. Be careful in doing this: because -pwcorr- calculates the correlations of different pairs of variables from different sets of observations, the resulting "correlation matrix" does not have all of the nice properties of a true correlation matrix (the kind produced by -corr-). For example, it can fail to be positive semidefinite. And you certainly shouldn't use it as an input for, say, a structural equation model. But it looks like you are not going in those directions with this (since you are averaging a bunch of correlation matrices), so perhaps -pwcorr- is what you really want. That's your call.
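
      For what it is worth, the -pwcorr- route might be sketched as follows. This is an illustration only, not something tested on your full data. It assumes that -pwcorr- leaves the pairwise matrix in r(C), it drops DISP because that variable is empty in the example, and the matrix names S, N, C and R are made up. Each cell is averaged over the quarters in which that particular correlation could be computed, so different cells may rest on different numbers of quarters.

      Code:
      * Sketch only: cell-by-cell average of quarterly pairwise correlations
      local vars EXRET ln_MV ln_MB INCVOL RETVOL ln_AGE TURN INSOWN
      local k : word count `vars'
      matrix S = J(`k', `k', 0)    // running sum of correlations, cell by cell
      matrix N = J(`k', `k', 0)    // number of quarters contributing to each cell
      levelsof num_QUARTER, local(levels)
      foreach qtr of local levels {
          capture pwcorr `vars' if num_QUARTER == `qtr'
          if _rc continue          // skip a quarter in which -pwcorr- fails outright
          matrix C = r(C)
          forvalues i = 1/`k' {
              forvalues j = 1/`k' {
                  if !missing(C[`i', `j']) {
                      matrix S[`i', `j'] = S[`i', `j'] + C[`i', `j']
                      matrix N[`i', `j'] = N[`i', `j'] + 1
                  }
              }
          }
      }
      * elementwise division in Mata; a cell with no contributing quarter comes out missing
      mata: st_matrix("R", st_matrix("S") :/ st_matrix("N"))
      matrix rownames R = `vars'
      matrix colnames R = `vars'
      matrix list R
      matrix list N

      Listing N alongside R shows how many quarters stand behind each averaged correlation.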

      Added: Note that while DISP is the only variable in your example that is always missing, there are other variables, INCVOL and INSOWN, that sometimes have missing values. It follows that sometimes observations will be excluded from estimation due to the missing values in those variables, and perhaps in some quarters in your full data set, that, too, will lead to a no-observations problem. If you do run into that difficulty, post back--there are ways of handling this problem that just skip over the problematic quarters.
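
      One way of skipping the problematic quarters, as a sketch only: wrap -corr- in -capture-, move on when it fails or returns a matrix with missing entries, and divide by the number of quarters actually used, not by 84. The local macro used and the matrix names are my own inventions, and DISP is left out because it is empty in the example.

      Code:
      * Sketch only: average -correlate- matrices over the quarters that work
      local vars EXRET ln_MV ln_MB INCVOL RETVOL ln_AGE TURN INSOWN
      local k : word count `vars'
      matrix R = J(`k', `k', 0)
      local used = 0
      levelsof num_QUARTER, local(levels)
      foreach qtr of local levels {
          capture correlate `vars' if num_QUARTER == `qtr'
          if _rc {
              display "quarter `qtr' skipped: return code " _rc
              continue
          }
          matrix C = r(C)
          if matmissing(C) {
              display "quarter `qtr' skipped: correlation matrix has missing entries"
              continue
          }
          matrix R = R + C
          local ++used
      }
      if `used' > 0 {
          matrix R = R / `used'    // divide by quarters used, not by a fixed 84
          matrix rownames R = `vars'
          matrix colnames R = `vars'
          matrix list R
      }
      display "quarters used: `used'"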
      Last edited by Clyde Schechter; 24 Jan 2024, 10:38.


      • #4
        Thank you, that makes sense!


        • #5
          Is there a way for me to obtain the significance levels of the correlation coefficients in the same matrix? When I use the pwcorr command in the code from above, the "sig" option doesn't seem to work anymore. Any thoughts?


          • #6
            The short answer is No. You can't hold two numbers as a single value in a Stata matrix, or indeed in a matrix as usually defined, or at least according to any definition I've ever encountered. Otherwise I may be misunderstanding the question. (Being able to hold complex numbers in Mata seems unlikely to be helpful here.)
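
            To illustrate with a single quarter: as far as I know, -pwcorr- with the sig option leaves the correlations in r(C) and the P-values in a separate matrix r(sig), so what you get is two matrices of the same shape, not one matrix holding pairs of numbers. Quarter 43 and the three variables below are arbitrary choices, and the presence of r(sig) is an assumption worth checking with -return list- in your version of Stata.

            Code:
            * Sketch only: correlations and P-values come back as two separate matrices
            pwcorr EXRET ln_MV ln_MB if num_QUARTER == 43, sig
            return list              // check which r() results are actually present
            matrix C = r(C)          // correlations
            matrix P = r(sig)        // P-values (assumed to be stored under this name)
            matrix list C
            matrix list P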

            On a different level, I worry about averaging correlations as I don't think they often lend themselves to averaging. It is easy enough to imagine correlations that have the same sign in subsets but a different sign for the combined data, and other not very weird configurations can frustrate averaging as a useful procedure.

            In some contexts, it is recommended to combine correlations by averaging correlations on the atanh() scale before back-transforming that result.
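
            A sketch of that atanh() route, for what it is worth. The matrix names Z and Rz and the local macro used are hypothetical, DISP is omitted because it is empty in the example, and quarters for which no complete correlation matrix can be computed are simply skipped. It also assumes that no off-diagonal correlation is exactly 1 or -1, for which atanh() is not finite.

            Code:
            * Sketch only: average correlations on the atanh() scale, then back-transform
            local vars EXRET ln_MV ln_MB INCVOL RETVOL ln_AGE TURN INSOWN
            local k : word count `vars'
            matrix Z = J(`k', `k', 0)    // running sum of atanh(r), off-diagonal cells only
            local used = 0
            levelsof num_QUARTER, local(levels)
            foreach qtr of local levels {
                capture correlate `vars' if num_QUARTER == `qtr'
                if _rc continue
                matrix C = r(C)
                if matmissing(C) continue
                forvalues i = 1/`k' {
                    forvalues j = 1/`k' {
                        if `i' != `j' matrix Z[`i', `j'] = Z[`i', `j'] + atanh(C[`i', `j'])
                    }
                }
                local ++used
            }
            matrix Rz = I(`k')           // diagonal stays at 1
            forvalues i = 1/`k' {
                forvalues j = 1/`k' {
                    if `i' != `j' matrix Rz[`i', `j'] = tanh(Z[`i', `j'] / `used')
                }
            }
            matrix rownames Rz = `vars'
            matrix colnames Rz = `vars'
            matrix list Rz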

            I'd worry even more about combining significance levels.

            This is all very gloomy, but a simple direct check is to compare the average correlations so obtained with the correlations for the entire dataset. If the results are consistent, you don't need to do what you are doing. If they are sometimes or always different, the question is why that happens.
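
            That check might look like the following sketch. It uses -correlate- (listwise deletion) both within quarters and for the pooled data, omits DISP because it is empty in the example, and skips quarters in which no complete matrix results; the matrix names are arbitrary. mreldif() reports the largest relative difference between the two matrices.

            Code:
            * Sketch only: averaged quarterly correlations versus pooled correlations
            local vars EXRET ln_MV ln_MB INCVOL RETVOL ln_AGE TURN INSOWN
            local k : word count `vars'
            matrix R = J(`k', `k', 0)
            local used = 0
            levelsof num_QUARTER, local(levels)
            foreach qtr of local levels {
                capture correlate `vars' if num_QUARTER == `qtr'
                if _rc continue
                matrix C = r(C)
                if matmissing(C) continue
                matrix R = R + C
                local ++used
            }
            assert `used' > 0              // stop if no quarter was usable
            matrix R = R / `used'          // average of the quarterly matrices
            quietly correlate `vars'       // all quarters pooled
            matrix P = r(C)
            matrix D = R - P               // averaged minus pooled
            matrix rownames D = `vars'
            matrix colnames D = `vars'
            matrix list D
            display "largest relative difference: " mreldif(R, P)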

            Naturally the significance levels are likely to be completely different, if only because the total sample size is enormously different from that for any quarter. (Set aside the very awkward fact that the usual machinery for correlation significance calculations assumes independent observations.)


            • #7
              Very valid points and certainly appreciated. I am currently trying to replicate a published paper, and this is the approach that they took. The coefficients don't differ hugely when compared to the coefficients for the entire dataset. Still, I would like to present both, and can critique the approach of the benchmark paper in the process. Is there an efficient way to obtain the t-statistics of the average correlation coefficients separately from the matrix, so that I can combine the two in my own table? Much appreciated.


              • #8
                There is usually little point in trying to replicate a dubious analysis, unless there is a goal of calling out researchers who are incompetent or worse. However, I appreciate that this may be a project assigned to you.

                I don't know quite what efficient means here, but more importantly I don't have a different answer because I can't see that you can even in principle easily set up significance calculations for average correlations that make any sense.

                The implication of looking at correlations quarter by quarter is presumably that changes in correlation from quarter to quarter are of interest, should show some interpretable features, and emphatically are not just noise. That flat out contradicts standard procedures which would treat different correlations as independent estimates of an underlying constant. There could be a much more elaborate approach with a stochastic model for correlation changes with explicit dependence structure. That would be way beyond my expertise but I fear that it would be far more complicated than you want to do. This may already be discussed in some literature, but I am not an econometrician or even an economist to know about that.

                I hope you get a more interesting and more encouraging answer. If you are, say, a graduate student or research assistant, you may need to talk to your supervisor.
