I would like help with my dataset, which covers publicly traded companies from the US, UK, and Australia over the last 10 years. The raw panel includes approximately 2900 companies and 29000 observations. I am interested in studying the impact of ESG (Environmental, Social, and Governance) performance on financial performance.
My first challenge is handling missing data. I understand there are several ways to fill in missing values in Stata, such as linear interpolation with the `ipolate` command. Alternatively, one can delete companies with missing data. However, I've read that interpolating dependent variables, such as return on equity, return on assets, and Tobin's Q, can be particularly problematic. So I face a dilemma: should I delete observations with missing values, which would reduce my dataset from 29000 observations to around 10000 and risk introducing bias, or should I interpolate the missing values, which could introduce problems of its own?
I was considering removing all companies with missing values for my dependent variables, as these are crucial, and then dropping all companies that report fewer than 9 out of 10 years of values for the remaining independent and control variables. Afterward, I would interpolate the missing year for companies with 9 out of 10 years of data. This way I would be interpolating at most one year in ten for any given company and variable.
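The plan above could be sketched roughly as follows. This is only an illustration under assumptions: the variable names (`firm_id`, `year`, `roe`, `roa`, `tobinq`, `esg`) are hypothetical placeholders, and `esg` stands in for any one of the independent or control variables.

```stata
* Hypothetical variable names: firm_id, year, roe, roa, tobinq, esg

* Flag and drop every firm with a missing value in any dependent variable
egen byte miss_dv = max(missing(roe, roa, tobinq)), by(firm_id)
drop if miss_dv

* Count non-missing years of the regressor per firm; keep firms with at least 9
egen n_esg = count(esg), by(firm_id)
keep if n_esg >= 9

* Linearly interpolate the remaining gaps within each firm
bysort firm_id (year): ipolate esg year, generate(esg_ip)
```

Note that `ipolate` without the `epolate` option only fills gaps between observed years, so a value missing in the first or last year of a firm's series would stay missing in `esg_ip`.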
My second issue concerns return on equity and its negative values. Its distribution is not approximately normal, so I was considering a log transformation; however, this is not feasible because the logarithm is undefined for negative values.
Is there a way to address this? I am looking for any suggestions or helpful comments. Thank you.