
  • Dealing with missing data and negative data

    I would like to request assistance with my dataset, which consists of publicly traded companies from the US, UK, and Australia over the last 10 years. The raw data cover approximately 2,900 companies and 29,000 firm-year observations in a panel. Broadly, I am interested in studying the impact of ESG (Environmental, Social, and Governance) performance on financial performance.

    Firstly, my challenge involves handling missing data. I understand there are several methods to impute data in Stata, such as linear interpolation with the `ipolate` command (linearly interpolate or extrapolate values). Alternatively, one can delete companies with missing data. However, I have read that interpolating dependent variables, such as return on equity, return on assets, and Tobin's Q, can be particularly problematic. Thus, I face a dilemma: should I delete missing values, potentially reducing my dataset from 29,000 observations to around 10,000 and risking bias, or should I interpolate missing values, which could introduce problems of its own?
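    For concreteness, the interpolation I have in mind would look roughly like this; the variable names firm_id, year, and roe are placeholders, and the panel is assumed to be declared with xtset.

    Code:
    * linearly interpolate roe over year within each firm (placeholder names)
    bysort firm_id (year): ipolate roe year, gen(roe_ipol)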

    I was considering removing all companies with missing values on my dependent variables, as these are crucial, and then dropping all companies that have reported values for fewer than 9 of the 10 years on the remaining independent and control variables. Afterward, I would interpolate values for those with 9 of 10 years of data. In effect, I would be interpolating only about 10% of the firm-years.
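    The screening step I have in mind could be sketched as follows (variable names are placeholders):

    Code:
    * count nonmissing years per firm for a key variable (placeholder names)
    bysort firm_id: egen n_years = count(roe)
    * keep firms reporting at least 9 of the 10 years
    keep if n_years >= 9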

    My second issue concerns return on equity and its negative values. The distribution is far from normal, so I was considering a log transformation; however, this is not feasible because of the negative values.
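    To make the constraint concrete (roe is a placeholder name): log() returns missing for non-positive arguments, whereas the inverse hyperbolic sine, one transformation sometimes used when negative values are present, is defined everywhere and behaves like the log for large values. This is only an illustration of the problem, not a recommendation.

    Code:
    gen ln_roe = ln(roe)     // missing wherever roe <= 0
    gen ihs_roe = asinh(roe) // defined for negative values as well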

    Is there a way to address this? I am looking for any suggestions or helpful comments. Thank you.

  • #2
    Those are not your only alternatives for missing data: multiple imputation (MI) is another, and there are various weighting schemes as well, though I am more familiar with MI. See
    Code:
    h mi
    and if you click on the blue link near the top of the help file, you will reach the entire manual dealing with this.
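    A minimal MI sketch, with placeholder variable names and an imputation model you would need to justify for your own data, might look like:

    Code:
    mi set flong
    mi register imputed roe esg_score
    mi impute chained (regress) roe esg_score = size leverage, add(20) rseed(12345)
    mi estimate: regress roe esg_score size leverage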

    I don't understand your substantive situation well enough to offer guidance on the negative values, but I don't see why you think this is important. First, the normal distribution includes negative values; second, assumptions about normality are usually of little importance anyway (and you say nothing about your planned method of analysis, so I can't be more specific).



    • #3
      Adam:
      welcome to this forum.
      As an aside to Rich's helpful advice, please note that removing all observations with missing data (something Stata does anyway via listwise deletion) amounts to making untested assumptions about your data, since you have made no diagnosis of the mechanisms and patterns of missingness. You will therefore probably end up with a sub-sample that has, at best, a tenuous relationship with the original sample.
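      For diagnosing those mechanisms and patterns, Stata's misstable command may help; the variable names below are placeholders.

      Code:
      misstable summarize roe roa tobins_q esg_score
      misstable patterns roe roa tobins_q esg_score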
      Kind regards,
      Carlo
      (StataNow 18.5)



      • #4
        Interpolation was developed largely to go beyond printed tables and "read between the lines": to estimate, at values not tabulated, mathematically defined functions known to be smooth. Its use for filling in gaps in data series is much more problematic. That said, more is available than just the official command ipolate; some other possibilities are implemented in mipolate from SSC. Although I've contributed code in this area, I warn that using interpolation for anything but filling small gaps, where it seems clear what the right answer should be, is likely to cause as many problems as it solves, especially for financial time series, which are, to coin a phrase, highly volatile.
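        For those who want to explore the alternatives, mipolate can be installed from SSC, and its help file documents the available methods; the last line below is one possible call, with placeholder names.

        Code:
        ssc install mipolate
        help mipolate
        bysort firm_id (year): mipolate roe year, gen(roe_mip)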

