Hi all,
I am currently imputing variables using multiple imputation, and I would like to hear your general view on when to impute and when not to. I cannot find any paper that deals with questions such as: what percentage of missing values is too much? Or: should we impute observations that are missing for developing economies using information from advanced economies?
To give a bit of context:
I am researching the effect of sovereign ESG scores on country productivity from 1990 to 2021. My data come from the World Bank and my sample consists of 196 countries. I have a wide array of variables spanning the social, governance, and environmental dimensions of ESG.
With these data, I am going to build a sovereign ESG index per country per year, following the study by Jiang et al. (2021), "new measurement of ESG index". However, I want to improve upon that paper: the authors used mean imputation, for example, and in general did not justify their statistical methods.
One of the problems I am facing is that, because statistical capacity and the quality of documentation differ across countries, some countries are missing many observations for certain variables. I need a complete dataset in order to compute a valid ESG score, and I want to fill the gaps via multiple imputation.
My questions are the following:
1. What percentage of missing observations is too much in your view, ideally backed up by literature?
2. Is it acceptable to use several different imputation methods within one dataset? Certain variables show little variability, so for those I was thinking of mean imputation or simply carrying values forward or backward.
3. Can I impute by country grouping (developing, emerging, advanced economy), and if so, how would I write that in Stata code?
I look forward to your answers and the discussion.