Stata: Handling Missing Data & Winsorizing Large Variables

Tina Ayo

Join Date: Mar 2025

Posts: 1
#1

Stata: Handling Missing Data & Winsorizing Large Variables

21 Mar 2025, 06:01

I have data on companies that report various employment and financial metrics. The dataset includes:
Staffing Information:
Total staff

Part-time staff

Full-time staff

Female staff

Financial Metrics:
Monthly revenue

Annual revenue

Annual profit

Revenue aspirations for the next year

Profit aspirations for the next year

Customer Information:
Current number of customers

Expected number of customers in the next year

Missing Data Challenge

As expected, some firms have missing values across these variables. For example, a firm might report total staff but not specify female staff. Similarly, financial and customer-related variables have missing entries. Imputation & Winsorization Approach

To handle missing values:
Imputation Using Sector-Specific Medians:
Instead of imputing using percentiles across all firms, I now replace missing values with the median within the same sector to ensure more realistic figures.

This approach significantly reduces unrealistic observations.

Winsorization for Large Values:
For large numerical variables (e.g., revenue, profit, customers), I winsorize at the 2.5% and 97.5% percentiles to cap extreme values while preserving overall trends.

This is done after imputation to avoid distorting the sector-based replacements.

Remaining Issues

Initially, I was imputing missing values using a random number between the 25th and 75th percentiles, ignoring sector differences. This resulted in many illogical observations (e.g., firms with unrealistically high or low staff/customer numbers).

After switching to sector-specific median imputation, the number of unrealistic observations dropped significantly, but I still have about 3 problematic cases per variable. These include:
Firms reporting more female staff than total staff

Unrealistic revenue-to-customer ratios

Firms with profit aspirations far below current profits

Question

Given my current approach (sector-specific median imputation before winsorizing), what additional steps can I take to resolve the remaining inconsistencies? Would a different method (e.g., regression imputation) help?

Would appreciate any insights from the community!
Tags: data, panel data
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#2

21 Mar 2025, 09:10

Well, you don't say how you plan to analyze this data once you have addressed these issues. But if you are going to be doing some kind of regression analyses, I think that the use of multiple imputation would be preferable to the single-imputation approaches you have mentioned. Moreover, given what these variables are, the missingness at random assumption underlying multiple imputation seems plausible to me.

I am not a fan of Winsorization. Large values are not necessarily problematic. Have you explored the data graphically to see if these large values really distort relationships among the variables? If they do, might log-transformation or some other such transform resolve the problem?
2 likes
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17672
#3

25 Mar 2025, 01:51

Tina:
welcome to this forum.
You should definitely go -mi-, provided that your data are MCAR or MAR (as Clyde suggests).
Establishing the mechanism underlying the missingness of your data is mandatory before challenging yourself with any procedure aimed at dealing with missing values: if data are MNAR, things get trickier.
I concurr with Clyde about winsorizing: "extreme" values may be totally justified by the data generating process (let's think about the annual profit of a successful microchips manufacturer that towers among its competitors).

Kind regards,
Carlo
(StataNow 18.5)
1 like
Comment

Announcement

Stata: Handling Missing Data & Winsorizing Large Variables

Comment

Comment