  • Pooled OLS, fixed & random effects: Panel Data

    Hey everyone (the data description is posted at the bottom). I'm currently writing my bachelor's thesis at Copenhagen Business School, and I ran into an issue with Stata that I haven't been able to solve on my own.

    Since it is a university assignment, the normal approach (as I have been taught, and as recommended in https://www.iuj.ac.jp/faculty/kucc62...blq5Qmk7KvdJLg) would be to start off with a simple model like pooled OLS and then, if that isn't sufficient or the model's assumptions don't seem to hold, move on to fixed or random effects models. Please correct me if this approach isn't optimal.

    My first issue with the pooled OLS is figuring out whether it is actually done correctly (as I have seen different approaches from different sources). From what I can tell, you do this by using clustered standard errors.

    Code:
    reg Covid19_cases x1 x2 x3 Country, vce(cluster Country)
    Question 1. Is this approach to pooled OLS correct, and how should I include my time variable in -reg-?

    Question 2. How do I test for heteroskedasticity and autocorrelation when using clustered standard errors, as this seems to make it impossible to run a Breusch-Pagan test?

    Code:
    . hettest
    hettest not appropriate after robust cluster()
    r(498);



    Furthermore, I know that -xtreg- usually outperforms -reg- (with clustered standard errors) when it comes to panel data regression.

    So my Question 3 (see the output from pooled OLS and random effects below) is: how do I determine, based on the Stata output, whether I should use a pooled OLS, fixed effects or random effects model? (As almost all my variables are static, I know that I'll probably end up with an -re- model. I just haven't been able to argue for this statistically, as I can't even test for things like heteroskedasticity and autocorrelation.)

    output from Pooled OLS:

    Code:
    Linear regression                               Number of obs     =      4,592
                                                    F(19, 41)         =     303.69
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.7294
                                                    Root MSE          =       1242
    
                                                    (Std. Err. adjusted for 42 clusters in Country)
    -----------------------------------------------------------------------------------------------
                                  |               Robust
                    Covid19_cases |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ------------------------------+----------------------------------------------------------------
                      Ages0_14Pct |   8123.495   11156.99     0.73   0.471     -14408.5    30655.49
                     Ages65_99Pct |   3038.334   11320.49     0.27   0.790    -19823.86    25900.53
                     Ages15_64Pct |   7851.408   11239.37     0.70   0.489    -14846.96    30549.78
                   Covid19_deaths |   9.630924   1.715408     5.61   0.000     6.166587    13.09526
                       CrimeIndex |  -8.226521   5.282897    -1.56   0.127    -18.89555    2.442507
                      DAI_B_index |   1999.881   1276.736     1.57   0.125    -578.5393    4578.301
                      DAI_G_index |   290.8351   454.5301     0.64   0.526     -627.107    1208.777
                      DAI_P_index |   3.885746    690.839     0.01   0.996    -1391.292    1399.064
                          Gdp2018 |   .1692471   .0260085     6.51   0.000     .1167219    .2217724
               GdpAgriculturalPct |    1772.46   2558.133     0.69   0.492    -3393.794    6938.715
                 GdpIndustrialPct |  -35.28193   2158.263    -0.02   0.987    -4393.983    4323.419
                    GdpServicePct |   150.5583   2216.818     0.07   0.946    -4326.396    4627.512
             InternetUsage2014Pct |  -283.9534   701.2951    -0.40   0.688    -1700.248    1132.341
                      popData2018 |  -9.46e-07   4.60e-07    -2.06   0.046    -1.87e-06   -1.64e-08
    pop_AnnualGrowthPct_2010_2018 |  -15230.27   11371.27    -1.34   0.188    -38195.02    7734.475
                    pop_density18 |  -.3554776   .4322918    -0.82   0.416    -1.228509    .5175533
              SocialMobilityIndex |  -9.681271    21.7794    -0.44   0.659    -53.66567    34.30313
                  StringencyIndex |   3.398476   1.255605     2.71   0.010     .8627302    5.934222
                          Country |   .3061796     1.2523     0.24   0.808    -2.222892    2.835251
                            _cons |  -7908.159   12120.52    -0.65   0.518    -32386.04    16569.72
    -----------------------------------------------------------------------------------------------


    output from Random effects:
    Code:
    xtset Country Date
    Code:
    Random-effects GLS regression                   Number of obs     =      4,592
    Group variable: Country                         Number of groups  =         42
    
    R-sq:                                           Obs per group:
         within  = 0.6600                                         min =         51
         between = 0.9450                                         avg =      109.3
         overall = 0.7293                                         max =        113
    
                                                    Wald chi2(18)     =    9283.67
    corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
    
    -----------------------------------------------------------------------------------------------
                    Covid19_cases |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    ------------------------------+----------------------------------------------------------------
                      Ages0_14Pct |   7719.581   17073.57     0.45   0.651       -25744    41183.17
                     Ages65_99Pct |   2600.686   16826.48     0.15   0.877    -30378.61    35579.98
                     Ages15_64Pct |   7477.483   17549.24     0.43   0.670    -26918.39    41873.36
                   Covid19_deaths |   9.728414   .1100511    88.40   0.000     9.512718     9.94411
                       CrimeIndex |   -8.30544   6.737789    -1.23   0.218    -21.51126    4.900383
                      DAI_B_index |   1956.444   1257.037     1.56   0.120    -507.3033    4420.191
                      DAI_G_index |   308.9836   463.2871     0.67   0.505    -599.0424     1217.01
                      DAI_P_index |   64.28808   1023.292     0.06   0.950    -1941.327    2069.903
                          Gdp2018 |   .1685496   .0209757     8.04   0.000      .127438    .2096611
               GdpAgriculturalPct |   1816.435   4797.581     0.38   0.705    -7586.651    11219.52
                 GdpIndustrialPct |  -124.5615   4119.166    -0.03   0.976    -8197.979    7948.856
                    GdpServicePct |   135.0255   4126.775     0.03   0.974    -7953.304    8223.356
             InternetUsage2014Pct |  -392.6615   1069.255    -0.37   0.713    -2488.362    1703.039
                      popData2018 |  -9.59e-07   3.03e-07    -3.16   0.002    -1.55e-06   -3.64e-07
    pop_AnnualGrowthPct_2010_2018 |  -15717.27   16056.61    -0.98   0.328    -47187.65    15753.11
                    pop_density18 |  -.3674106   .5401536    -0.68   0.496    -1.426092     .691271
              SocialMobilityIndex |  -8.118353   21.31227    -0.38   0.703    -49.88963    33.65293
                  StringencyIndex |   3.613653   .5180906     6.97   0.000     2.598214    4.629092
                            _cons |  -7511.384    18162.4    -0.41   0.679    -43109.03    28086.26
    ------------------------------+----------------------------------------------------------------
                          sigma_u |   319.2514
                          sigma_e |  1214.6213
                              rho |  .06462069   (fraction of variance due to u_i)
    -----------------------------------------------------------------------------------------------


    Data description:
    21 variables and 4,592 observations (unbalanced dataset).

    Variable                       Description
    Date                           Time indicator (in days)
    StringencyIndex                Index measuring the government response to Covid19 (100 = most severe response, 0 = loosest response)
    Covid19_cases                  Dependent variable: number of recorded Covid19 cases
    Covid19_deaths                 Number of recorded deaths caused by Covid19
    popData2018                    2018 country population data
    DAI_index                      Digital adoption index: a country's digital adoption across three dimensions of the economy (people, government, and business)
    DAI_B_index                    A country's digital adoption across business
    DAI_P_index                    A country's digital adoption across people
    DAI_G_index                    A country's digital adoption across government
    pop_AnnualGrowthPct_2010_2018  A country's annual population growth from 2010 to 2018, in pct.
    Ages0_14Pct                    Pct. of a country's population between 0 and 14 years of age
    Ages15_64Pct                   Pct. of a country's population between 15 and 64 years of age
    Ages65_99Pct                   Pct. of a country's population between 65 and 99 years of age
    Ages0_99Pct                    Pct. of a country's population between 0 and 99 years of age
    CrimeIndex                     Index measuring crime rates by country (100 = highest crime rate, 0 = lowest)
    SocialMobilityIndex            Index measuring social mobility by country (100 = highest social mobility, 0 = lowest)
    Gdp2018                        Country GDP, 2018 figures
    GdpAgriculturalPct             Pct. of a country's GDP that comes from the agricultural sector
    GdpIndustrialPct               Pct. of a country's GDP that comes from the industrial sector
    GdpServicePct                  Pct. of a country's GDP that comes from the service sector
    InternetUsage2014Pct           Pct. of a country's population that uses the internet, 2014 figures
    Country                        Entity indicator
    Continent                      Continent
    pop_density18                  Population density by country, 2018 figures

    I hope I have been as precise and informative as possible.

    Best regards, Walther Larsen
    Last edited by Walther Larsen; 28 Apr 2020, 02:15.

  • #2
    Walther:
    welcome to this forum.
    Some comments about your queries:
    1) usually, -xtreg- outperforms pooled OLS, regardless of non-default standard errors. That said, I would have started off with -xtreg- and switched to pooled OLS only in the absence of a panel-wise effect.
    2) If you actually have 109 dates and 42 panels, you should consider estimators developed for long panels (see -xtgls- and -xtregar, fe-);
    3) you have sky-rocketing R-sq (-regress-) and R-sq between (-xtreg,re-) but most of your coefficients do not reach statistical significance: you may have a quasi-extreme multicollinearity issue to deal with;
    4) I'm not clear about the reason underlying non-default standard errors in -xtreg,re-. Did you detect heteroskedasticity and/or autocorrelation?
    5) Re-checking for heteroskedasticity after imposing non-default standard errors in -reg- is not allowed, in order to save you time, as these options already change the way standard errors are calculated to take heteroskedasticity into account.
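    To illustrate points 3) and 5), here is a minimal sketch. It assumes Stata's shipped -nlswork- example data rather than Walther's dataset, and the regressors are picked only for illustration. The Breusch-Pagan test and the variance inflation factors are requested after -regress- with default standard errors; the clustered fit comes afterwards.

    ```stata
    * Illustration only: nlswork ships with Stata; variables chosen arbitrarily
    webuse nlswork, clear

    * Pooled OLS with default (non-robust) standard errors
    regress ln_wage age tenure hours

    * Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
    estat hettest

    * Variance inflation factors as a multicollinearity check
    estat vif

    * Refit with cluster-robust standard errors for inference;
    * -estat hettest- is refused after this fit (the r(498) error in #1)
    regress ln_wage age tenure hours, vce(cluster idcode)
    ```

    The order matters: the diagnostics have to be run before the clustered refit, since they act on the most recent estimation results.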
    Kind regards,
    Carlo
    (Stata 19.0)


    • #3
      Hey Carlo, thanks for the quick response.

      1) Alright, and how would one go about checking whether or not there is a panel-wise effect? (This might sound dumb, as I'm nowhere near a statistics expert.)

      2) I have 137 countries, and a varying number of dates for each country, since some countries started tracking Covid19 later or earlier than others. Does this change anything?

      3) I didn't want to drop insignificant variables before knowing whether the approach was actually correct. (At university the usual approach has been to drop the most insignificant variable until the model consists only of significant variables.)

      I did check for multicollinearity using the -correlate- command, and even though there were some high correlations, there shouldn't be any perfect linear correlation.
      Code:
      (obs=4,592)
      
                   | Covid~es Ag~14Pct Ages65~t Ages15~t Covid~hs CrimeI~x DAI_B_~x DAI_G_~x DAI_P_~x  Gdp2018 GdpAgr~t
      -------------+---------------------------------------------------------------------------------------------------
      Covid19_ca~s |   1.0000
       Ages0_14Pct |  -0.0650   1.0000
      Ages65_99Pct |   0.0729  -0.8413   1.0000
      Ages15_64Pct |  -0.0192  -0.2351  -0.3243   1.0000
      Covid19_de~s |   0.8284  -0.0804   0.1027  -0.0471   1.0000
        CrimeIndex |   0.0735   0.5069  -0.5786   0.1635   0.0773   1.0000
       DAI_B_index |   0.0462  -0.6566   0.7053  -0.1302   0.0556  -0.4926   1.0000
       DAI_G_index |   0.0803  -0.3036   0.2144   0.1369   0.0817  -0.1012   0.1430   1.0000
       DAI_P_index |   0.0529  -0.7608   0.7580  -0.0227   0.0562  -0.4781   0.8669   0.2378   1.0000
           Gdp2018 |   0.4099  -0.1009   0.0619   0.0760   0.2912   0.1024  -0.0623   0.1350   0.0212   1.0000
      GdpAgricul~t |  -0.1049   0.7472  -0.7381   0.0206  -0.1109   0.3785  -0.7905  -0.3257  -0.8314  -0.1010   1.0000
      GdpIndustr~t |  -0.1235   0.2016  -0.2698   0.1609  -0.1442   0.1165  -0.3897  -0.0471  -0.2602  -0.0313   0.1847
      GdpService~t |   0.1549  -0.5576   0.5992  -0.1236   0.1727  -0.2731   0.7056   0.1864   0.6379   0.0874  -0.6770
      InternetUs~t |   0.0828  -0.6886   0.7333  -0.1119   0.0803  -0.4299   0.9325   0.1832   0.9251   0.0672  -0.8192
       popData2018 |   0.0705   0.2050  -0.3260   0.2435   0.0339   0.1451  -0.4350   0.0730  -0.4413   0.5231   0.3975
      pop_Ann~2018 |  -0.0481   0.6871  -0.7264   0.0985  -0.0540   0.4207  -0.2593  -0.1416  -0.3387  -0.1141   0.4084
      pop_densi~18 |  -0.0435   0.2362  -0.1174  -0.2076  -0.0102  -0.0853  -0.2068   0.1238  -0.2611  -0.0327   0.2311
      SocialMobi~x |   0.0406  -0.7323   0.7647  -0.0932   0.0517  -0.5304   0.9203   0.1618   0.9260   0.0043  -0.8167
      Stringency~x |   0.2341   0.0336  -0.0450   0.0214   0.2466   0.0006  -0.1070   0.0318  -0.0860   0.0192   0.0762
           Country |   0.1387   0.0778  -0.1600   0.1357   0.1142   0.1075  -0.1323  -0.0079  -0.0756   0.1096   0.1351
              Date |   0.2247  -0.0155   0.0158  -0.0019   0.2402  -0.0086   0.0046   0.0155   0.0070  -0.0020  -0.0085
      
                   | GdpInd~t GdpSer~t Intern~t popDat~8 pop_An~8 pop_d~18 Social~x String~x  Country     Date
      -------------+------------------------------------------------------------------------------------------
      GdpIndustr~t |   1.0000
      GdpService~t |  -0.8280   1.0000
      InternetUs~t |  -0.3244   0.6751   1.0000
       popData2018 |   0.1418  -0.3188  -0.4196   1.0000
      pop_Ann~2018 |  -0.0456  -0.1789  -0.2812   0.0664   1.0000
      pop_densi~18 |  -0.1784  -0.0728  -0.2572   0.2385   0.0846   1.0000
      SocialMobi~x |  -0.2180   0.5891   0.9548  -0.3984  -0.3351  -0.1964   1.0000
      Stringency~x |   0.0249  -0.0585  -0.0945   0.0747   0.0100   0.0837  -0.0713   1.0000
           Country |  -0.0081  -0.0843  -0.1189  -0.1015   0.0637   0.1292  -0.1378   0.0133   1.0000
              Date |  -0.0092   0.0120   0.0019  -0.0032  -0.0146  -0.0031   0.0068   0.8703   0.0075   1.0000
      4) Before using the clustered standard errors I found heteroskedasticity. The reason I wanted to use -re- was that almost all of my variables are static over time, and from what I understand it's appropriate to use -re- in those cases. I might be wrong, though.


      • #4
        Walther:
        1) see -xttest0- to test whether there's evidence of a panel-wise effect after -xtreg,re-;
        2) the output from -xtreg,re- tells that you actually have 42 countries, and this detail is confirmed by (Std. Err. adjusted for 42 clusters in Country). I cannot follow you on 137 countries as the cross-sectional dimension of your panel dataset, then.
        3) If perfect multicollinearity had been detected, Stata would have omitted one of the variables involved. I would have checked the -vce- matrix after -regress- and -xtreg,re- via -estat vce-.
        4) if by static (which means something different in a panel data regression setting) you mean time-invariant, I follow you. However, it may well be that the -fe- specification fits your data better. You can check this via the user-written command -xtoverid- (see -search xtoverid-). Please note that, unlike -hausman-, -xtoverid- needs only the -re- regression to work properly, as you can see from the following toy-example:
        Code:
        use "https://www.stata-press.com/data/r16/nlswork.dta"
        . xtreg ln_wage age, vce(cluster idcode)
        
        Random-effects GLS regression                   Number of obs     =     28,510
        Group variable: idcode                          Number of groups  =      4,710
        
        R-sq:                                           Obs per group:
             within  = 0.1026                                         min =          1
             between = 0.0877                                         avg =        6.1
             overall = 0.0774                                         max =         15
        
                                                        Wald chi2(1)      =    1064.91
        corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
        
                                     (Std. Err. adjusted for 4,710 clusters in idcode)
        ------------------------------------------------------------------------------
                     |               Robust
             ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                 age |   .0185667    .000569    32.63   0.000     .0174516    .0196819
               _cons |   1.120439   .0159154    70.40   0.000     1.089245    1.151632
        -------------+----------------------------------------------------------------
             sigma_u |  .36972456
             sigma_e |  .30349389
                 rho |  .59743613   (fraction of variance due to u_i)
        ------------------------------------------------------------------------------
        
        . xtoverid
        
        Test of overidentifying restrictions: fixed vs random effects
        Cross-section time-series model: xtreg re  robust cluster(idcode)
        Sargan-Hansen statistic  14.529  Chi-sq(1)    P-value = 0.0001
        
        .
        *The -xtoverid- output points to the -fe- specification, the null being, loosely speaking, that the -re- specification is OK for your data*
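
        For points 1) and 3), a minimal sketch, again assuming the -nlswork- toy data and an arbitrary pair of regressors (this is not Walther's actual model). It fits -xtreg,re- with default standard errors, inspects the correlation matrix of the coefficient estimates, and then runs -xttest0-.

        ```stata
        * Illustration only: nlswork toy data, regressors chosen arbitrarily
        webuse nlswork, clear
        xtset idcode year

        * Random-effects model with default standard errors
        xtreg ln_wage age tenure, re

        * Correlation matrix of the coefficient estimates (multicollinearity check)
        estat vce, correlation

        * Breusch-Pagan Lagrange multiplier test for a panel-wise effect;
        * the null is that the variance of u_i is zero (pooled OLS would do)
        xttest0
        ```

        A small p-value from -xttest0- is evidence of a panel-wise effect, in which case pooled OLS is not the way to go.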
        As an aside, I cannot help wondering what support you are getting from your supervisor, as you seem a bit lost with these actually demanding statistical methods.
        Kind regards,
        Carlo
        (Stata 19.0)
