  • Estimation with too many missing values (unbalanced panel data)

    Dear all, I need help with handling a large number of missing values.

    I'm using Stata 13 and have gone through most of the online tutorials on missing values, which typically handle just a couple of missing observations. My dataset is quite small, with only 503 observations, but many of the variables have missing values.

    It is an unbalanced panel sorted by 'c_id year'. I have attached the Stata output for the 'missing data' summaries and a regression output showing the loss of observations as more variables are included.

    My dependent variable is the 'gini' and all the explanatory variables are important in this equation, given the hypothesis being tested.

    I will appreciate your guidance on what to do.

    Thank you.
    Attached Files

  • #2
    Concerning your missing values, have you tried multiple imputation? Since the fraction of missing values is quite high in some variables, I would suggest creating at least 100 imputed datasets (maybe more). Start by reading the manual entry for mi.

    You do not tell us a lot about your research question, but are you sure you are making the best of your panel data? A simple OLS model with (cluster-)robust standard errors is neither efficient (as a 'random effects' model would be) nor does it fully control for (time-invariant) unobserved heterogeneity (as would be possible with a 'fixed effects' model, or better, the 'within estimator').
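
    A minimal sketch of the mi workflow from the first paragraph (the variable names are borrowed from later posts in this thread, and the chained-equations setup is only an illustrative assumption -- the manual entry discusses how to choose an imputation model):

    Code:
    mi set wide
    mi register imputed educ_exp subs lit_ym
    mi impute chained (regress) educ_exp subs lit_ym = gini gdppc trade, add(100) rseed(12345)
    mi estimate: regress gini educ_exp subs lit_ym gdppc trade, vce(cluster c_id)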

    Best
    Daniel



    • #3
      ...my research question is 'the determinants of income inequality and its link to crime'. I initially had 214 countries, but since only 134 countries had gini data I had to limit it to those. The countries are grouped into 7 regions, so I'll create regional dummies to control for 'fixed effects'. Due to many 'holes' in the data, I lose lots of data points even with pooled OLS.

      I've gone through lots of online tutorials, one of which is Nick Cox's at: http://www.stata.com/support/faqs/da...issing-values/ from where I got this:

      5. Complications: several variables and panel structure

      Two common complications are:
      • You want to do this with several variables: use foreach. sort or gsort once, replace all variables using foreach, and, if necessary, sort back again.
      • You have panel data, so the appropriate replacement is a neighboring nonmissing value for each individual in the panel.

      Suppose that individuals are identified by id. There are just a few extra details to review, such as

      Code:
      . by id (time), sort: replace myvar = myvar[_n-1] if myvar >= .

      or

      Code:
      . gsort id -time
      . quietly by id: replace myvar = myvar[_n-1] if myvar >= .
      . sort id time

      The key to many data management problems with panel data lies in following a sort by some computations under by:. For more information, see the sections of the manual indexed under by:.
      I tried to construct this for the multiple missing values but didn't get it right. I used:

      foreach var of varlist homi v_rob gdppc gdpgr gdpcgr cons_exp rents educ_exp subs trade dcredit corrupt une_m lit_ym police pol_inst {
      by c_id (year): replace x 'var' = var[_n-1] if var>=.
      }

      I got error msgs. Kindly assist with the correct command.

      Thanks.



      • #4
        Wisconsin has a nice introduction to multiple imputation: http://www.ssc.wisc.edu/sscc/pubs/stata_mi_intro.htm

        The last line with by looks wrong. You aren't using ` and ' correctly. I think it should be more like

        by c_id (year): replace x`var' = `var'[_n-1] if `var'>=.

        Assuming x should be there at all -- I don't know why it is there unless there is a parallel set of vars that start with x. It probably would be a good idea to have such a set of variables so you aren't wiping out the originals.

        Anyway, assuming the data are xtset, my best guess would be something like


        Code:
        foreach var of varlist homi v_rob gdppc gdpgr gdpcgr cons_exp rents educ_exp subs trade dcredit corrupt une_m lit_ym police pol_inst {
        clonevar x`var' = `var'
        replace  x`var' = L.`var' if `var'>=.
        }
        If this doesn't work then we need to see the output and error messages.

        You might also consider the use of -tssmooth-, which we discussed in another thread of yours.
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://www3.nd.edu/~rwilliam



        • #5
          ...Hi Richard, I modified the code a bit:

          foreach var of varlist homi v_rob gdppc gdpgr gdpcgr cons_exp rents educ_exp subs trade dcredit corrupt une_m lit_ym police pol_inst {
          by c_id (year): replace `var' = L.`var' if `var'>=.
          }

          but only a very few changes were made:
          (2 real changes made)
          (78 real changes made)
          (0 real changes made)
          (1 real change made)
          (5 real changes made)
          (0 real changes made)
          (0 real changes made)
          (28 real changes made)
          (20 real changes made)
          (0 real changes made)
          (5 real changes made)
          (15 real changes made)
          (0 real changes made)
          (88 real changes made)
          (0 real changes made)
          (0 real changes made)

          My guess is that this is because the previous cell value is also missing. The tssmooth approach advised in the previous thread (on moving averages) worked quite well until I realised that it didn't fill in any missing values for that variable, so the data came out really 'short' on total observations.
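
          One difference worth noting (not spelled out in the thread): with L.`var', Stata looks specifically for the value at year-1, so the replacement is missing whenever that year's row is absent from an unbalanced panel. The subscript form from the FAQ quoted in #3 instead copies from the previous observation for that country, whatever its year, and because replace works down the data sequentially each filled-in value can feed the next one, carrying a value across a run of consecutive missing observations. A sketch of that version (it silently carries values across gaps of any length, which may or may not be defensible here):

          Code:
          sort c_id year
          foreach var of varlist homi v_rob gdppc gdpgr gdpcgr cons_exp rents educ_exp subs trade dcredit corrupt une_m lit_ym police pol_inst {
              by c_id: replace `var' = `var'[_n-1] if missing(`var')
          }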

          I'm currently going through Stata's Multiple-Imputation Reference Manual as suggested by Daniel (though it's 380 pages!)....and funnily enough I've always gained tremendous knowledge from Wisconsin's online tutorials (SSCC); I just never stumbled on this particular link.

          ....a quicker idea on how to solve this would be greatly appreciated, though.

          Thanks as always.....you've always been helpful.
          Ngozi




          • #6
            I hate it when Political Science students take my classes. They always seem to have 200 years of data on 30 countries with 80% of the values missing. They have inspired me to learn more about panel data, though.

            It would be better to get feedback from a panel data expert, but it seems to me that other options would be lagging further back in time, e.g. 2 or 3 periods. You could also go ahead in time, e.g. fill in the value from the next time period or 2 time periods ahead (tssmooth can go forward in time too). Maybe plug in the mean. See what multiple imputation can do. Think about how far back (or forward) in time you can go and still get reasonable results.
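
            The 'go ahead in time' option can reuse the reverse-sort trick from the FAQ quoted in #3 (shown for a single variable, educ_exp, purely as an illustration):

            Code:
            gsort c_id -year
            by c_id: replace educ_exp = educ_exp[_n-1] if missing(educ_exp)
            sort c_id year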

            Why is so much data missing in the first place? My guess is that the quality of data collection varies across the world. But if it was just sloppy data collection by whoever gathered the data, maybe you could go back and clean things up a bit.



            • #7
              ...yeah, you're right. A badly 'holed' World Bank (WDI) secondary dataset is definitely the result of 'sloppy data collection'. I guess that's one of the reasons for mi. I'll give you feedback if I run into a hitch....

              Thanks again.



              • #8
                Looking at your originally posted .smcl file, it seems like one variable, educ_exp, accounts for most of your problem. Nearly half of its values are missing. Perhaps you can find an alternative data source for that variable and merge it into your data.

                If not, I would think long and hard about whether that variable is really so crucial to your modeling. Because including it means that your ultimate results are going to be largely a reflection of how you impute the missing educ_exp values. And I've rarely met an imputation model that struck me as credible enough to rely on for the core of my argument.



                • #9
                  ....yeah Clyde......not sure how to handle this, really. It's an important variable indicating the 'share of education in total expenditure'. Its sister variable educ_gdp is even worse....I had to drop it.

                  90% of my data is from the World Bank (WDI), which includes educ_exp, subs (subsidies) and lit_ym (literacy rate, male youth), covering 2000 - 2012.
                  From the World Governance Indicators (WGI) I got police and pol_inst (political instability), which cover only 2008 - 2012.

                  I won't mind opinions from fellow Statalisters on other data sources covering 2000 - 2012.

                  Thanks a lot for contributing!
                  Ngozi

