Imputing variables

Maarten Loomans

Join Date: Jun 2022

Posts: 46
#1


27 Jun 2024, 04:49

Hi all,

I am imputing variables using multiple imputation and would like to hear your general view on when to impute and when not to. I cannot find any paper that deals with questions like: what percentage of missing values is too much? Or: should we impute observations that are missing for developing economies using information from advanced economies?

To give a bit of context:
I am researching the effect of sovereign ESG scores on country productivity from 1990 to 2021. My data come from the World Bank and my sample consists of 196 countries. I have a wide array of variables spanning the social, governance and environmental dimensions of ESG.

With my data, I am going to build a sovereign ESG index per country per year, following the study by Jiang et al. (2021): new measurement of ESG index. However, I want to improve upon this paper because the authors, for example, used mean imputation and in general did not really justify their statistical methods.

One of the problems I am facing is that, due to differences in statistical capacity and in the quality of documentation across countries, some countries are missing a lot of observations for certain variables. I need a complete dataset in order to create a correct ESG score and want to obtain it via multiple imputation.

The questions I am asking are the following:
What percentage of missing observations is too much in your view, ideally backed up by literature?
Is it okay to use several different imputation methods within one dataset? I was thinking: certain variables don't have a lot of variability, so I could perform mean imputation or simply carry values forward or backward.
Can I impute by country grouping (developing, emerging, advanced economy), and if so, how would I put that into Stata code?
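
As a rough illustration of the last two questions, here is a minimal sketch of how carry-forward within a country and group-wise multiple imputation could be written in Stata. Everything in it is an assumption made for illustration: the data are simulated, and the names country, year, econ_group, gdp and energy, as well as the 1/2/3 coding of the economy groups, are hypothetical and not taken from the World Bank data. The by() option of -mi impute- fits the imputation model separately within each group.

```stata
clear
set seed 12345
set obs 300

* Hypothetical panel: 30 countries x 10 years, three economy groups
generate int country = ceil(_n/10)
bysort country: generate int year = 2011 + _n
* Made-up coding: 1 = developing, 2 = emerging, 3 = advanced
generate byte econ_group = mod(country, 3) + 1
generate double gdp = rnormal(econ_group, 1)
generate double energy = 2*gdp + rnormal()
* Set roughly 20% of the energy values to missing
replace energy = . if runiform() < 0.2

* (a) Carry-forward within country (only sensible for slow-moving variables)
generate double energy_cf = energy
bysort country (year): replace energy_cf = energy_cf[_n-1] if missing(energy_cf) & _n > 1

* (b) Multiple imputation, run separately within each economy group
mi set mlong
mi register imputed energy
mi register regular gdp econ_group
mi impute regress energy gdp, add(5) rseed(2024) by(econ_group)
```

Note that carry-forward and mean imputation are single-imputation methods, so they do not reflect imputation uncertainty the way -mi impute- does; whether mixing them is acceptable is exactly the judgment call raised in the question.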

Looking forward to your answers and discussion

Last edited by Maarten Loomans; 27 Jun 2024, 04:58.
Tags: multiple imputation

Maarten Loomans

Join Date: Jun 2022
Posts: 46

27 Jun 2024, 04:49

For clarity, here is an example of my data:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str3 iso3 double(AG_SRF_TOTL_K2 EG_USE_PCAP_KG_OE)
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"     180           .
"ABW"       .           .
"ABW"       .           .
"ABW"       .           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"  652860           .
"AFG"       .           .
"AFG"       .           .
"AFG"       .           .
"AGO" 1246700 496.5365322
"AGO" 1246700 491.8025707
"AGO" 1246700  478.585781
"AGO" 1246700 479.8200297
"AGO" 1246700 470.9384125
"AGO" 1246700 455.6660874
"AGO" 1246700 454.0550814
"AGO" 1246700 449.2543895
"AGO" 1246700  433.882523
"AGO" 1246700 441.1631632
"AGO" 1246700 438.5499123
"AGO" 1246700 442.9788396
"AGO" 1246700 448.2616945
"AGO" 1246700 467.6441109
"AGO" 1246700 464.8080209
"AGO" 1246700 433.5724861
"AGO" 1246700 458.7940737
"AGO" 1246700 471.6942449
"AGO" 1246700 491.9785712
"AGO" 1246700 515.2172566
"AGO" 1246700 520.9623361
"AGO" 1246700 521.7807029
"AGO" 1246700 552.3637656
"AGO" 1246700 533.7608866
"AGO" 1246700 544.6094435
"AGO" 1246700           .
"AGO" 1246700           .
"AGO" 1246700           .
"AGO" 1246700           .
"AGO"       .           .
"AGO"       .           .
"AGO"       .           .
"ALB"   28750 813.2556955
"ALB"   28750  572.781844
"ALB"   28750 418.2866298
"ALB"   28750 412.3788805
end
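
A quick way to see how much is missing, per variable and per country, could look like the following minimal sketch. It uses a handful of rows copied from the excerpt above; the helper variables miss_energy and share_miss_energy are hypothetical names introduced only for this illustration.

```stata
clear
input str3 iso3 double(AG_SRF_TOTL_K2 EG_USE_PCAP_KG_OE)
"ABW"     180           .
"ABW"     180           .
"ABW"       .           .
"AGO" 1246700 533.7608866
"AGO" 1246700 544.6094435
"AGO" 1246700           .
"AGO"       .           .
end

* Count of missing values per variable over the whole sample
misstable summarize AG_SRF_TOTL_K2 EG_USE_PCAP_KG_OE

* Share of missing values per country for one variable
generate byte miss_energy = missing(EG_USE_PCAP_KG_OE)
bysort iso3: egen double share_miss_energy = mean(miss_energy)
tabstat share_miss_energy, by(iso3) statistics(mean)
```

Countries where the share is 1 (such as ABW in these rows) have no observed values at all for that variable, so any imputation for them rests entirely on information from other countries or other variables.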


Rich Goldstein

Join Date: Mar 2014

Posts: 4408
#3

27 Jun 2024, 06:39

Your text is a bit unclear. Is the issue not how much missing data there is for any particular variable, but how many observations Stata will drop because of at least one missing value? There is literature on this, but not much. If I recall correctly, Harrell's book on regression modeling strategies (second edition) suggests that losing 3% of the total N means you should do something about it, and there is an article (I can't immediately find it) suggesting 5%. However, the theory points to the amount of missing information rather than the amount of missing data, and this is somewhat harder to assess. You can look at the user-written how_many_imputations and the article cited in its help file; use -search- to find and install it.
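
Both quantities mentioned here can be inspected in Stata. The following is a minimal sketch on simulated data, not on the poster's data: the variables y, x1 and x2, the 15% missingness rate and the choice of 20 imputations are all assumptions made for illustration. It first counts how many observations a complete-case regression would drop, then reports the fraction of missing information (FMI) through the vartable option of -mi estimate-. The user-written how_many_imputations is not called here.

```stata
clear
set seed 2024
set obs 500

* Simulated stand-ins (hypothetical names): outcome y, predictors x1 and x2
generate double x1 = rnormal()
generate double x2 = 0.5*x1 + rnormal()
generate double y  = 1 + x1 + x2 + rnormal()
* Set roughly 15% of x2 to missing
replace x2 = . if runiform() < 0.15

* How many observations would a complete-case regression drop?
egen byte n_missing = rowmiss(y x1 x2)
count if n_missing > 0
display "Share of observations lost: " %5.3f r(N)/_N

* Fraction of missing information after multiple imputation
mi set wide
mi register imputed x2
mi impute regress x2 y x1, add(20) rseed(2024)
mi estimate, vartable: regress y x1 x2
```

The FMI column in the variance table is reported per coefficient, so it can differ noticeably from the raw percentage of missing values in any single variable.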
Maarten Loomans

Join Date: Jun 2022

Posts: 46
#4

02 Jul 2024, 08:39

Hi Rich,

Thanks for the answer, and sorry for the delayed response. My question is not how many observations Stata will drop, but rather: when imputing missing observations for different variables, at what point do you think the percentage of missing observations is too high for a correct imputation? I found one article where, even with 40% missing observations, multiple imputation still generated unbiased parameter estimates.

Kind regards,
Maarten