  • Choosing optimal number of principal components

    I'm writing code for this method. It's a form of synthetic control analysis which predicts the counterfactual of a treated unit based on principal components analysis. The theory, at least, is that PCA de-noises the outcomes matrix in the pre-intervention period, after which we can run a linear regression to predict the post-intervention counterfactual. My question here, however, is about the principal components analysis step. Let's look at my dataset. What we see here is the real, non-normalized GDP per capita for the Basque Country (gdpcap15) and the normalized GDP per capita of the other 16 regions in Spain.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double(year gdpcap15) float(normgdpcap1 normgdpcap2 normgdpcap3 normgdpcap4 normgdpcap5 normgdpcap6 normgdpcap7 normgdpcap8 normgdpcap9 normgdpcap10 normgdpcap11 normgdpcap12 normgdpcap13 normgdpcap14 normgdpcap16 normgdpcap17)
    1955  3.853184630005267  -1.596707  -1.3327312  -.9565113 -1.4974374 -1.2136703 -1.5789266 -1.7555073   -.7793649  -1.206382  -1.792608  -1.620488  -.3183888 -1.6007596  -1.215555  -1.238519  -1.287997
    1956 3.9456582961508766 -1.5660152  -1.2639334  -.8668543 -1.4281683 -1.1545168 -1.5308938 -1.7170875   -.7160962 -1.1348827 -1.7534026 -1.5804973  -.2338523 -1.5634706 -1.1526319 -1.1889784 -1.2243198
    1957  4.033561734872626  -1.535606   -1.194319  -.7780455  -1.360313 -1.0988817 -1.4827982  -1.678165   -.6560946 -1.0638859 -1.7138518 -1.5405066 -.15606996  -1.525616 -1.0903056 -1.1404744 -1.1606112
    1958  4.023421896896646  -1.524548  -1.1786431  -.7371125 -1.3626063 -1.0730591 -1.4723686 -1.6659132    -.634607 -1.0358956 -1.7060295  -1.529763 -.18126445  -1.509312 -1.0718025 -1.1290082 -1.1401918
    1959  4.013781968405232 -1.5134274  -1.1618992  -.6965562 -1.3658735 -1.0445975 -1.4619076 -1.6536303   -.6143446 -1.0083137  -1.698176 -1.5189877 -.20755836  -1.493165 -1.0502521 -1.1177617 -1.1197723
    1960  4.285918396222732 -1.4553105   -1.071991  -.5540287 -1.3024162  -.9595584 -1.3987017 -1.6060373   -.4735448   -.923369 -1.6639656 -1.4671224 -.06911468  -1.407592  -.9479037 -1.0342307 -1.0330998
    1961  4.574336095797406 -1.4029425    -.976051  -.4214283  -1.263682  -.8757132 -1.3544072 -1.5686854   -.3268078  -.8603829  -1.637389 -1.4572268   .1383151 -1.3253802  -.8720691  -.9565427  -.9524587
    1962  4.898957353563045  -1.336438   -.8618279 -.29564452 -1.1745907  -.7755323  -1.259944  -1.494767   -.2112338  -.7690924 -1.5892934 -1.3780937  .22991987 -1.2303828  -.7455943  -.8555137  -.8419426
    1963  5.197014981629133 -1.2701535   -.7496468 -.17890836  -1.088138   -.680472 -1.1640353  -1.419341   -.1042046  -.6832051 -1.5399725 -1.2985836  .31071785 -1.1346314  -.6261879   -.757092  -.7254261
    1964 5.3389029787527225 -1.2359117   -.7206513 -.11768155 -1.0413303  -.6505656 -1.1176047 -1.3794446 -.070434034  -.6574768 -1.5113225 -1.2579334   .3436404 -1.0852792  -.5864485  -.7239498  -.6661471
    1965  5.465153005251848 -1.2025496   -.6919699 -.05874794  -.9959993  -.6212243 -1.0716767   -1.33892  -.04159556  -.6352667  -1.482264 -1.2179426   .3668243 -1.0348276  -.5478087  -.6928181  -.6143131
    1966  5.545915627064143 -1.1542655   -.6312456 .065370746  -.9178715  -.5657779  -1.025089  -1.293306  .006280098 -.58952725  -1.445415 -1.1651664    .348541  -.9762081  -.4875243  -.6197794 -.56552655
    1967  5.614895726639487 -1.1067982   -.5728147   .1845259  -.8423824  -.5113051  -.9782501 -1.2475665   .04894103 -.54762024 -1.4080317 -1.1131753   .3251686  -.9198505  -.4248838  -.5477458 -.50938886
    1968 5.8521849330715785 -1.0254031    -.472728   .3709708   -.729353  -.4145485  -.8868653 -1.1478255    .1566927  -.4568323 -1.3517684 -1.0293615   .4113071   -.816968  -.3348814  -.4441409 -.41580495
    1969 6.0814054173695915  -.9410553   -.3704112   .5558452  -.6182399 -.31119475  -.7906428 -1.0429639    .2627793  -.3659816 -1.2917353  -.9418093   .4915711  -.7091849  -.2430567  -.3380855  -.3143991
    1970   6.17009424134957  -.8639642   -.3176034   .6901736  -.4809588 -.23174755  -.7248607  -.9596213    .3293466 -.29187495 -1.2316707  -.8606028  .53897566  -.6210988 -.14903314  -.3020532 -.23086786
    1971  6.283633404546246  -.7897945  -.26152843   .8140724  -.3627773 -.15333693  -.6582934  -.8792316    .4000921 -.22043823 -1.1720461  -.7845797   .5967782  -.5418713 -.05215097  -.2721151  -.1473054
    1972 6.5555553986528405  -.6871634  -.13942033   .9909357 -.23592573  -.0716278  -.5439448  -.7430497   .53709066 -.07888454 -1.0874155  -.6752887   .7781027 -.41297776 .065370746 -.13005887 -.03876837
    1973  6.810768561103078  -.5853176 -.015270197   1.160574 -.11582801 .009044589  -.4279623  -.6036323    .6685919   .0577056   -.999769  -.5667203   .9488406   -.286503  .18138444 .012060257 .068826266
    1974  7.105184302810804 -.55710745   .05462705  1.1677995 -.15415373    .083654 -.37546885 -.56753707    .7343109  .10548712  -.9745118 -.52167183   1.028319  -.2729319   .2932514    .081078  .11334074
    1975  7.377891682175629 -.53056216    .1243985  1.1684276  -.1921025  .15706965  -.3242633 -.53065634    .7948152   .1485877   -.950888  -.4782884  1.0990016 -.26196828   .3980188   .1468285   .1574466
    1976  7.232933621922754   -.482341    .1840546    1.19513 -.16266713   .2102229 -.25757018  -.4652201    .7994331  .19495553  -.9043317  -.4219623  1.0509377 -.23485756   .4309412  .12056603  .21923886
    1977  7.089831372119127 -.43399405    .2411663  1.2152352  -.1364988  .25725052 -.19072025  -.3985899     .802763   .2388102  -.8590949  -.3662644   1.003816 -.20837517   .4690784  .09700516   .2804971
    1978  6.786703607144611  -.4456487    .2669575  1.1982714 -.12198532   .2632506  -.1633896  -.3835739    .7483845   .2167572  -.8055646 -.35040015   .9114573  -.1995477  .44256455  .13039878    .329221
    1979 6.6398173868571035 -.48007905   .27917776  1.1819359 -.03726034  .25878978   -.175013  -.4000977    .7447091  .20117553  -.7929046 -.33466145   .8812994 -.21195647   .4053697  .16985537   .3443943
    1980  6.562839171369564  -.4868645   .28882203  1.2940855  .05230221    .255554 -.18673053   -.435282    .7745529   .2079924   -.782475  -.3240119   .9233946  -.2441563   .3916416  .21063133   .3477556
    1981   6.50078545499277 -.48884365    .3048433  1.4134606    .149593  .26249668  -.1923224  -.4649687     .820041  .25580537  -.7678673  -.3077078   .9739722  -.2684082   .3781963   .2716382   .3615152
    1982  6.545058606999563  -.4703406    .3587193  1.4976516   .1623787  .26337618 -.17381923  -.4289364    .8662204   .3079221  -.7510605 -.26721445   .9786844 -.22307713   .4592142   .2338467   .4362188
    1983  6.595329801139407  -.4516175    .4146999   1.586869   .1767037  .27427712  -.1528658  -.3930923    .9149128   .3614523  -.7343479 -.22518185   .9855954  -.1809504  .53957254  .19834834   .5208494
    1984  6.761496750091492  -.4221193    .4639893   1.741114  .21254756   .3126343  -.0900681  -.3379285    .9529243   .4041445  -.6955512  -.1842803  1.0474818 -.12591213   .6506541  .25037074   .5757306
    1985  6.937160671727721  -.3922441    .5179279   1.906668  .24961662  .36079255 -.02472599 -.28179082    .9906216   .4478107  -.6553719  -.1409596  1.1090544  -.0719735   .7608874  .30490625   .6411355
    1986  7.332191151300521  -.2881366    .7075457   2.173063   .4129093   .4849112  .14478648 -.14849916   1.2139786   .6252084  -.5087605 -.01772062  1.3327255  .06744402   .9375311   .4032336   .8125016
    1987  7.742788123594152 -.18368343    .8976348   2.421866    .569479   .6137106   .3195767 -.00996114   1.4367074   .8033913  -.3605784  .10891119  1.5535696  .20692444  1.1266465   .5046396   .9953338
    1988   8.12053664075889 -.06977446   1.0930328    2.55255   .7297558   .7501752   .4558528  .15823197   1.6449857    .986538 -.24139184   .2601406  1.7316896   .3548866   1.333668   .6026841  1.1809933
    1989  8.509711162324157  .04378901   1.2865463  2.6791506   .8885245   .8866398   .5964641    .328687   1.8510646  1.1662288  -.1173673    .411904  1.9091815  .49810535  1.5334642   .7018282   1.379219
    1990  8.776777889074104  .11748728    1.381418  2.7250156    .891666   .9381596   .6828225   .4228676   1.9650993   1.240681 -.03700916   .4993306  1.9745237   .5543059   1.706558   .7324574  1.5576532
    1991   9.02527866619582  .18961503   1.4737767  2.7985256   .8926086  1.0022452   .7676731    .508692   2.0819612   1.314505 .035244167    .582202  2.0436356    .612454  1.8799658    .757746  1.7206944
    1992  8.873892824706335  .14211632   1.3851875  2.6401966   .8366907   .9431858   .7598823   .4454232   1.9883457  1.2152352  .02079358   .5394782  1.9892883  .59156346   1.771586   .7059121  1.6993325
    1993  8.718223539089278   .0953401   1.2928293  2.4862654   .7804273   .8841264   .7520288   .3829085   1.8947308  1.1169081 .006342988   .4968173  1.9358838  .57108116   1.671374   .6511567   1.677971
    1994  9.018137849286365  .14104815   1.4323093   2.684177   .9205676   .9821399   .7725423   .4015372   2.0624843   1.209895  .05352742   .5447558   2.007823   .6452194   1.780696   .7197031   1.838813
    1995  9.440873861653367  .17708066    1.552313   2.839993   1.011041  1.0817237   .8643355   .4445751     2.20919   1.306966  .08057525    .602904  2.1030087   .6896394   1.953476   .7911084  1.9506484
    1996   9.68651813767495   .2981521   1.6622636    2.90722  1.0993156  1.1392124  1.0088422   .5702331   2.3131716   1.368852  .25850698    .679367  2.2016501   .7598509  2.0863593   .8354342  2.0844743
    1997 10.170665872808662   .4323548    1.847923   3.093508  1.2064394   1.279635  1.1307302   .6806549    2.519565  1.4989083   .3987413   .8008153  2.3819695   .8696759  2.2896109   .9491547  2.2691915
    end
    format %ty year
    To generate the principal components, I do:
    Code:
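    * PCA on the donor regions, pre-intervention years only
    pca norm* if year < 1975
    screeplot
    * Score every year on all retained components (pred1, pred2, ...)
    qui {
        cap drop pred*
        predict double pred*
    }
    drop norm*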
    Okay, so this leaves us with our main outcome, our principal component scores, and the time variable, and it also generates a screeplot. How do I choose the optimal number of principal components to include in my regression? That is, how do I choose the number of pred scores beyond which adding more of them doesn't improve my model by very much?
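
    To put numbers on what the screeplot shows, here is a rough sketch (not a selection rule) that lists the cumulative share of variance explained by the leading components. It assumes it is run right after the pca call above, while the PCA results are still in e(); ev, ev_total and ev_cum are just names made up for illustration.
    Code:
    * Assumes -pca- was the last estimation command, so e(Ev) is still available
    matrix ev = e(Ev)                 // row vector of eigenvalues
    local k = colsof(ev)
    scalar ev_total = 0
    forvalues j = 1/`k' {
        scalar ev_total = scalar(ev_total) + ev[1,`j']
    }
    * Running share of total variance after each additional component
    scalar ev_cum = 0
    forvalues j = 1/`k' {
        scalar ev_cum = scalar(ev_cum) + ev[1,`j']
        display "components: `j'   cumulative share: " %6.4f scalar(ev_cum)/scalar(ev_total)
    }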

    Currently, this code
    Code:
    reg gdp pred1-pred2 if year < 1975 // intervention begins here
    predict cf, xb
    line gdp cf year, lcol(black red) xli(1975, lcol(blue) lpat(dash) lwid(thick))
    doesn't look so bad. Here, I use two principal components, and the pre-intervention fit is good. But suppose I had more donors? Suppose I'd used covariates? What if I should use the top 3 principal components, or the top 4? How could I objectively (or defensibly, rather) choose the number of scores to include here? I know I can eyeball the screeplot, but I was hoping there'd be a more objective way of determining this. Any ideas?
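
    To make the "doesn't improve my model by very much" part concrete, this is the kind of comparison I have in mind, as a sketch only. It refits the pre-intervention regression with the first k scores and prints the fit each time. The upper limit of 4 is arbitrary, gdpcap15 is the outcome variable spelled out in full, and since this is only in-sample fit it shows where the gains flatten out rather than giving a formal criterion.
    Code:
    * Compare pre-intervention fit as more component scores are added
    forvalues k = 1/4 {
        quietly regress gdpcap15 pred1-pred`k' if year < 1975
        display "k = `k'   RMSE = " %7.4f e(rmse) "   adj. R2 = " %6.4f e(r2_a)
    }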