The goal of my code here is to select the ideal number of principal components. According to this website which details the method for Stata, we essentially want to select the number of components where the average is at its lowest and begins to increase after that. Note my use of the user written crossfold prefix. To be concrete, here's my code:
The result we get is
So in this case (and after inspecting the screeplot from PCA), the ideal number of components is 2.
Conceptually, the way I think that makes the most sense to think about this is by starting with the second mean, .569. Then, move to the next one, and ask if .5784063 is less than .5696208. In this case it obviously isn't, so we'd select 2 as the number of principal components. But if it weren't (if mean 3 were less than mean 2), we'd move on to mean 3, and ask if mean 4 was less than mean 3. And if true, we'd use 3 components in our regression.
How would I implement this programmatically, though? My intuition tells me that looping through the columns of the r(table) matrix would be a good first start with a few assert statements, but how would I actually go about this?
Code:
* Example generated by -dataex-. For more info, type help dataex clear input double(year gdpcap15 pred1 pred2 pred3 pred4 pred5 pred6 pred7 pred8 pred9) 1955 3.853184630005267 -5.6359681692686685 -.2796493308499519 .14076150844533958 .031033665292261155 -.15112918386469865 .01822882385195665 -.008124304083252737 -.0008164576572658733 .0002865166184990442 1956 3.9456582961508766 -4.80336876033442 -.20493869698996953 .10893913008301967 .03810150847092765 .0002628159616413356 -.013868786188177551 -.006429933515285535 .004279014881456822 -.00028320987191921 1957 4.033561734872626 -3.984097559990941 -.15062350735637864 .07413980200961344 .03615715202065638 .14698408848701347 -.044033646814635424 -.0032634351303035825 .004594817232454029 -.0007239532350134525 1958 4.023421896896646 -3.7909617209543813 -.21861309305703375 -.04366336241116735 -.025601833127530615 .0751777915675255 -.001905947589295215 -.0008236563661104535 -.008400687934551607 .0017323740685059597 1959 4.013781968405232 -3.5960191421690597 -.2875851019749088 -.16428162980925726 -.0946026919255327 -.0006760918780774616 .03819602222038847 -.0020393501911901285 -.02575956769585215 -.0015446361508075906 1960 4.285918396222732 -2.40520552130941 -.022971459410171417 -.1621336397189589 -.04263588859617323 -.01744175533074968 .022601202686801186 .00613835442027974 .04103215499285389 .0008280633007628213 1961 4.574336095797406 -1.3303790071789818 .5309343857741721 -.09294057379692988 .1413480762231831 -.017890179062513074 .0010103711436177006 .05044703538868318 -.012004946934204473 .0024779905452485124 1962 4.898957353563045 .04228213763346514 .5412352518357144 -.06031624412421429 .061584272382696835 -.017632543983173845 -.016308012125437872 -.015943505697339336 .00196428212353659 -.005299340712424142 1963 5.197014981629133 1.3818361379201034 .5112547238876536 -.010022155165356008 -.038436535244485114 -.01853674225273269 -.02088969728520911 -.0708273493439845 -.005548450291739709 -.001131061857750193 1964 5.3389029787527225 1.9611439921091653 .4226917985221951 .07081727130789066 -.055061427587660666 .015383650681875 .0179673876598935 -.013718314773666251 .0005227960766076845 .008660399468473308 1965 5.465153005251848 2.512032836587504 .30387269313558785 .1466876568856701 -.07180058997304965 .039132084392686554 .057970807460515444 .037219005780505854 -.00016834621438262948 -.0018753329397272062 1966 5.545915627064143 3.2785311878240764 .027584452000544053 .05577526983594565 -.05756744855911607 -.00868821871424516 -.010843189126278005 .0274838487765844 .003844001145529616 -.006352518350698069 1967 5.614895726639487 4.02807594553564 -.2627014003748662 -.0261268019214803 -.06720710738888518 -.059637885581341164 -.0805153379207898 .02669785039648495 -.007198561000218567 .0019571102831647536 1968 5.8521849330715785 5.445083506715171 -.38111146454733813 -.022780059389372984 .034014966033170654 -.01183480815152145 -.02182652234407402 -.0006919939231301575 .006452130851168247 .0018004998243573156 1969 6.0814054173695915 6.89701413688074 -.5293792505952488 -.014856172230743148 .11067388197953773 .026526977728309448 .054216524370719235 -.026124251738275844 -.0027921795753868273 -.000532900990673435 1970 6.17009424134957 8.038199924990405 -.779090200321954 -.0018207213227130759 .12695146249770906 .1955640050140302 .09036882537098212 .05438415156659038 .08471243381069524 -.06810464939518154 1971 6.283633404546246 9.145831123757889 -.9581022929408727 .017284947847694132 .11449644613898544 .36138691243717824 .14376089776334022 .1331524970023707 .15367048597972616 -.14412247783215515 1972 6.5555553986528405 10.983920152860067 -.9224148612405898 .13741117390715954 .245384955266249 .5872056509280951 .26958023328948866 .07744013991540999 .21669144214265626 -.15132619386978885 1973 6.810768561103078 12.797898389916256 -.9188239314052387 .2597731099547717 .3572339027620418 .8054123560884386 .40500328959475046 .016267002212098647 .2504152654263079 -.16684961304785007 1974 7.105184302810804 13.483436916330389 -.6912008846040067 .28920628417672295 .14529159677906842 .7418375140135989 .45178882518772734 -.14565015366024225 .22189576108402365 -.23540893913360122 1975 7.377891682175629 14.129962334361577 -.48910478275826064 .32139331002556326 -.068432647396291 .6771019665423366 .502216641352912 -.3144457664927772 .17822165303778037 -.2978813916241576 1976 7.232933621922754 14.655724797410553 -.8195517558088924 .26024964665872374 -.3163213335059729 .7952233480881981 .6995804758293602 -.3200185575710963 .07668854994396535 -.3872792441288567 1977 7.089831372119127 15.167983198809889 -1.1452589771500479 .2188386890861237 -.5715493203851785 .9057103652666365 .8945935592653703 -.3328679210115676 -.0197146767761191 -.47157613190480197 1978 6.786703607144611 15.246498160982224 -1.5832636509535927 .22551363384770295 -.7152678695111958 .957237911150079 .8379966416176281 -.2432014360926641 -.2847825631831978 -.45359152166116035 1979 6.6398173868571035 15.282575725478978 -1.8033324476838084 .24964799981841534 -.6306966220394404 1.0691825799505508 .598332032776653 -.25245212207980755 -.28343486092719855 -.33770168385929183 1980 6.562839171369564 15.481585723352795 -1.7898147574012568 .16304890403427746 -.3846219297832588 1.2248834214840256 .3111588530473055 -.16289279236345866 -.15255742392170735 -.21394982926744094 1981 6.50078545499277 15.83232897218898 -1.7530443413968633 .0029852721653992886 -.1523853234088376 1.409805175418291 .04795807738992236 -.10564646169219682 .0009063780809936972 -.056123793728779736 1982 6.545058606999563 16.309593606525535 -1.784468040316007 -.0726225266832779 -.3377780502558716 1.4013275464652095 .15133987275970073 -.000053022571553817954 .07272124169259164 -.051658536760092244 1983 6.595329801139407 16.818135788104822 -1.8086859669856088 -.1571816084016392 -.5320848509254261 1.3939288004168728 .2550847467631834 .10542367660701446 .13244122904770378 -.027459190827567814 1984 6.761496750091492 17.631311010318974 -1.8338991911875449 -.14492183674859432 -.5292312090822835 1.395928637503537 .2710251943297569 .17052067085894462 .1387845069763436 -.08029841987964442 1985 6.937160671727721 18.482714837431047 -1.8663538287215398 -.14652742073317787 -.5327547376820521 1.3966905041997464 .28842298383402054 .24143251557977286 .1244133748715433 -.1162693874289108 1986 7.332191151300521 20.892869429520093 -1.7475666762165913 -.13909364881807473 -.5506734053689885 1.7850752708803073 .3647781877540621 .43439427377534434 .14344758997858148 -.26621243848354903 1987 7.742788123594152 23.329675833037232 -1.6339961910462066 -.11048057343516143 -.6291803374256126 2.1625926284644375 .45589124349554333 .6112319347274746 .14721949164321602 -.41005372527654516 1988 8.12053664075889 25.666380884615926 -1.623124252105773 -.024982883199390127 -.8412939241332308 2.4714871099866884 .6037780941976278 .5484206501680482 .19720849028273124 -.505166091967399 1989 8.509711162324157 28.00935858726377 -1.6270558649915419 .07714389394393995 -1.0689817172597456 2.7992441225289584 .7641988311988135 .5069535199084416 .2186007035341937 -.5746335620073871 1990 8.776777889074104 29.21914765014005 -1.615727337035631 .1764241804855109 -1.5291831551065178 2.8537456687094784 .9311299698192872 .6194858367631093 .16545387960640268 -.5461770419550871 1991 9.02527866619582 30.409858803872673 -1.566357132663052 .2323741917230142 -1.9438485588235395 2.8705228767967435 1.093448814239629 .7305398867925607 .1346147572802343 -.5508702936327751 1992 8.873892824706335 29.566071506714888 -1.6418149885222633 .48775876767527426 -2.0171636139907068 2.8624876012920994 1.099338293986793 .8854525641695832 .01689859152810802 -.5518044941246998 1993 8.718223539089278 28.72833388460929 -1.7139242650504576 .7402916614293393 -2.0968489496131446 2.852399408805832 1.1080889010104502 1.0489654461608868 -.08842898572482327 -.5600206940772299 1994 9.018137849286365 30.023375376148266 -1.6216341006721546 .520095804113017 -2.1118696608999152 2.8595663671585525 .8238863268288168 1.2045879912620205 -.07684458007656847 -.4100289494428013 1995 9.440873861653367 31.312531772654815 -1.4532279796489098 .4309155176565326 -2.260235052254674 2.8446523366154746 .6876126485134306 1.129642678084732 .027223013634157245 -.3848566318605062 1996 9.68651813767495 32.85221669474676 -1.6749058326253756 .625399108002632 -2.497716504314482 3.2912265949405093 .7858230737927119 1.4317549486919998 -.2274248843919251 -.6434678836236071 1997 10.170665872808662 35.00403089746422 -1.549512124557728 .6428125484515941 -2.6900621400867153 3.5104278648230483 .7612511825986432 1.5787132481189765 -.2957021623046656 -.7591290137769755 end format %ty year cls qui ds pred* loc v : di word("`r(varlist)'", `c(k)') loc vars `r(varlist)' local nwords : word count `r(varlist)' cls set seed 1000 tokenize `vars' local predictors qui forval i = 1/`nwords' { loc a: word `i' of `c(alpha)' local predictors `predictors' ``i'' crossfold: reg gdpcap15 `predictors' if year < 1975, k(10) mat `a' = r(est) svmat double `a', name(model`a') } mean model*
Code:
Mean estimation Number of obs = 10 -------------------------------------------------------------- | Mean Std. err. [95% conf. interval] -------------+------------------------------------------------ modela1 | .657484 .037877 .5718001 .7431678 modelb1 | .5696208 .0614974 .430504 .7087376 modelc1 | .5784063 .0485904 .4684871 .6883254 modeld1 | 1.342622 .1057873 1.103315 1.58193 modele1 | .3425936 .0374513 .2578728 .4273145 modelf1 | .3795547 .0326934 .3055972 .4535122 modelg1 | .3340048 .0401717 .2431302 .4248794 modelh1 | .3870031 .0443458 .2866858 .4873203 modeli1 | .3789193 .0671177 .2270885 .53075 --------------------------------------------------------------
Conceptually, the way I think that makes the most sense to think about this is by starting with the second mean, .569. Then, move to the next one, and ask if .5784063 is less than .5696208. In this case it obviously isn't, so we'd select 2 as the number of principal components. But if it weren't (if mean 3 were less than mean 2), we'd move on to mean 3, and ask if mean 4 was less than mean 3. And if true, we'd use 3 components in our regression.
How would I implement this programmatically, though? My intuition tells me that looping through the columns of the r(table) matrix would be a good first start with a few assert statements, but how would I actually go about this?
Comment