  • Correlation matrix with dummy variables

    Dear statalisters,

    I aim to test my data for multicollinearity, first using a correlation matrix. However, some of my independent variables are dummy variables (FORCED and OUTSIDE), and I expect a strong correlation between them (as I am including an interaction term between the two in the regression model). I ran a Pearson correlation test with the following code and result:

    Code:
    . pwcorr ΔOROAt1_adjusted ΔOROAt2_adjusted ΔOROAt3_adjusted ΔOROAt4_adjusted ΔOROAt5_adjusted FORCED OUTSIDE SIZE 
    > FISCALYEAR, sig star(.05) obs
    
                 | ΔOROAt.. ΔOROAt.. ΔOROAt.. ΔOROAt.. ΔOROAt..   FORCED  OUTSIDE
    -------------+---------------------------------------------------------------
    ΔOROAt1_ad~d |   1.0000 
                 |
                 |      422
                 |
    ΔOROAt2_ad~d |   0.4980*  1.0000 
                 |   0.0000
                 |      321      331
                 |
    ΔOROAt3_ad~d |   0.3534*  0.6523*  1.0000 
                 |   0.0000   0.0000
                 |      232      229      242
                 |
    ΔOROAt4_ad~d |   0.2417*  0.4223*  0.6253*  1.0000 
                 |   0.0011   0.0000   0.0000
                 |      180      181      185      189
                 |
    ΔOROAt5_ad~d |   0.2070*  0.3981*  0.5098*  0.5875*  1.0000 
                 |   0.0196   0.0000   0.0000   0.0000
                 |      127      127      131      132      133
                 |
          FORCED |   0.0598   0.1187*  0.2041*  0.1432*  0.1906*  1.0000 
                 |   0.2204   0.0308   0.0014   0.0493   0.0280
                 |      422      331      242      189      133      510
                 |
         OUTSIDE |  -0.0008   0.0497   0.0310   0.0164   0.0068   0.0364   1.0000 
                 |   0.9868   0.3673   0.6313   0.8225   0.9384   0.4126
                 |      422      331      242      189      133      510      510
                 |
            SIZE |   0.0342   0.0177   0.0820   0.0128   0.1408  -0.1443* -0.0028 
                 |   0.4886   0.7502   0.2044   0.8614   0.1074   0.0012   0.9503
                 |      413      327      241      188      132      500      500
                 |
      FISCALYEAR |  -0.0472  -0.0879  -0.0237  -0.0935  -0.1436   0.1040*  0.0127 
                 |   0.3334   0.1104   0.7137   0.2007   0.0992   0.0188   0.7755
                 |      422      331      242      189      133      510      510
                 |
    
                 |     SIZE FISCAL~R
    -------------+------------------
            SIZE |   1.0000 
                 |
                 |      500
                 |
      FISCALYEAR |   0.0309   1.0000 
                 |   0.4905
                 |      500      510
                 |
    Looking at the results, they seem to match my expectations in terms of correlations - but I want to make sure that (1) I am performing the correct correlation test on this type of data (I have read online that the variables have to be continuous for a Pearson correlation to make sense) and that (2) I am interpreting the results in the appropriate way.

    Please advise.

    Thanks!

  • #2
    Carl-Johan:
    Why not rely on -estat vce, corr- after -regress-?
    Kind regards,
    Carlo
    (Stata 19.0)


    • #3
      pwcorr does not address your problem. For multicollinearity you need to focus only on the observations used in a model fit. The large variation in the number of observations here points to missing values, but only observations with no missing values on any of the included variables are relevant to thinking about a model fit.

      Fit a model first, then use
      correlate ... if e(sample)
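
      As a minimal sketch of that sequence: after an estimation command, e(sample) is a function that returns 1 for observations used in the most recent fit and 0 otherwise, so it can restrict any later command to the estimation sample. The example below uses Stata's bundled auto dataset as a stand-in (an assumption; it is not the data from post #1), where rep78 has missing values and so shrinks the estimation sample.

      Code:
      * Stand-in data shipped with Stata; rep78 contains missing values
      sysuse auto, clear

      * Fit the model first; observations missing on any model variable are dropped
      regress price mpg weight i.foreign rep78

      * Number of observations actually used in the fit
      count if e(sample)

      * Correlations among the predictors, restricted to the estimation sample
      correlate mpg weight foreign rep78 if e(sample)

      * Optional: keep the sample marker as a variable for later use
      generate byte insample = e(sample)

      With the variables from post #1, the same pattern would apply after fitting the actual model, listing its predictors in the correlate command.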


      • #4
        If you're using regress to fit your model, and Stata doesn't omit predictors with the collinearity message, then I think that pairwise correlation coefficients wouldn't be all that additionally informative. You could always use regress and then follow it with estat vif to explore suspected problematic predictors in anticipation of their use with other estimation commands that might be more sensitive to near collinearity, but I think that the algorithms used in most modern least-squares linear regression software, including Stata's, are fairly resilient.
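
        A short sketch of that workflow, again using Stata's bundled auto dataset as a stand-in (an assumption; the model in question is not shown in the thread). estat vif is a postestimation command for regress and reports a variance inflation factor and its reciprocal (the tolerance) for each predictor.

        Code:
        * Stand-in data shipped with Stata
        sysuse auto, clear

        * Fit the linear model; regress itself flags and omits perfectly collinear predictors
        regress price mpg weight i.foreign

        * Variance inflation factors for the fitted predictors
        estat vif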


        • #5
          Appreciate the input.

          1. I am also planning to include a correlation table as part of the descriptive statistics. For that purpose, do the results in my first post make more sense?
          2. For the purpose of investigating multicollinearity, -estat vif- gives the following output:
          Code:
          . estat vif
          
              Variable |       VIF       1/VIF  
          -------------+----------------------
              1.FORCED |      2.31    0.432907
             1.OUTSIDE |      1.72    0.581479
                FORCED#|
               OUTSIDE |
                  1 1  |      2.67    0.374190
                  SIZE |      1.14    0.880448
            FISCALYEAR |
                 2001  |     10.44    0.095786
                 2002  |     13.72    0.072907
                 2003  |      8.58    0.116492
                 2004  |      9.58    0.104345
                 2005  |     12.10    0.082674
                 2006  |     13.01    0.076862
                 2007  |     12.16    0.082213
                 2008  |     13.65    0.073257
                 2009  |     12.06    0.082953
                 2010  |      8.75    0.114301
                 2011  |     11.93    0.083824
                 2012  |      7.81    0.128054
          -------------+----------------------
              Mean VIF |      8.85
          
          .
          I interpret this as indicating no multicollinearity problems.

          -estat vce, corr- gives the following output:

          Code:
          . estat vce, corr
          
          Correlation matrix of coefficients of regress model
          
                       |        1.        1. 1.FORCED#               2001.     2002.     2003.     2004.     2005.     2006.
                  e(V) |   FORCED   OUTSIDE  1.OUTS~E      SIZE  FISCAL~R  FISCAL~R  FISCAL~R  FISCAL~R  FISCAL~R  FISCAL~R 
          -------------+----------------------------------------------------------------------------------------------------
              1.FORCED |   1.0000                                                                                           
             1.OUTSIDE |   0.4528    1.0000                                                                                 
              1.FORCED#|                                                                                                    
             1.OUTSIDE |  -0.6889   -0.6007    1.0000                                                                       
                  SIZE |   0.2617    0.1398   -0.1888    1.0000                                                             
          2001.FISCA~R |  -0.0884   -0.1194    0.0589    0.0620    1.0000                                                   
          2002.FISCA~R |  -0.0980   -0.0938    0.0520    0.0539    0.9224    1.0000                                         
          2003.FISCA~R |  -0.0862   -0.0882    0.0581    0.0354    0.8998    0.9119    1.0000                               
          2004.FISCA~R |  -0.0674   -0.1298    0.0655    0.0764    0.9064    0.9161    0.8942    1.0000                     
          2005.FISCA~R |  -0.0719   -0.0812    0.0153    0.0684    0.9169    0.9290    0.9058    0.9106    1.0000           
          2006.FISCA~R |  -0.0473   -0.1164    0.0551    0.1070    0.9190    0.9287    0.9063    0.9161    0.9236    1.0000 
          2007.FISCA~R |  -0.1301   -0.1199    0.0988    0.0414    0.9168    0.9289    0.9066    0.9108    0.9214    0.9227 
          2008.FISCA~R |  -0.0689   -0.1011    0.0504    0.0524    0.9224    0.9338    0.9118    0.9175    0.9282    0.9305 
          2009.FISCA~R |  -0.0717   -0.1081    0.0639    0.0594    0.9174    0.9284    0.9066    0.9130    0.9224    0.9262 
          2010.FISCA~R |  -0.1128   -0.1028    0.0207    0.0291    0.8984    0.9108    0.8874    0.8904    0.9065    0.9012 
          2011.FISCA~R |  -0.0141   -0.0333    0.0194    0.0832    0.9119    0.9241    0.9032    0.9080    0.9191    0.9230 
          2012.FISCA~R |  -0.0370   -0.1188    0.0410    0.1031    0.8927    0.9018    0.8799    0.8900    0.8975    0.9041 
                 _cons |  -0.0899   -0.0480    0.0648   -0.3434   -0.9050   -0.9156   -0.8909   -0.9048   -0.9154   -0.9281 
          
                       |     2007.     2008.     2009.     2010.     2011.     2012.          
                  e(V) | FISCAL~R  FISCAL~R  FISCAL~R  FISCAL~R  FISCAL~R  FISCAL~R     _cons 
          -------------+----------------------------------------------------------------------
          2007.FISCA~R |   1.0000                                                             
          2008.FISCA~R |   0.9278    1.0000                                                   
          2009.FISCA~R |   0.9231    0.9294    1.0000                                         
          2010.FISCA~R |   0.9037    0.9085    0.9022    1.0000                               
          2011.FISCA~R |   0.9168    0.9258    0.9208    0.8953    1.0000                     
          2012.FISCA~R |   0.8954    0.9038    0.8994    0.8760    0.8959    1.0000           
                 _cons |  -0.9036   -0.9173   -0.9138   -0.8805   -0.9267   -0.9007    1.0000 
          
          .
          I am not really sure how to use -correlate ... if e(sample)-. What does e(sample) refer to?

          Thanks!
