I have data on companies that report various employment and financial metrics. The dataset includes:
- Staffing Information:
  - Total staff
  - Part-time staff
  - Full-time staff
  - Female staff
- Financial Metrics:
  - Monthly revenue
  - Annual revenue
  - Annual profit
  - Revenue aspirations for the next year
  - Profit aspirations for the next year
- Customer Information:
  - Current number of customers
  - Expected number of customers in the next year
As expected, some firms have missing values across these variables. For example, a firm might report total staff but not specify female staff. Similarly, financial and customer-related variables have missing entries.

Imputation & Winsorization Approach
To handle missing values:
- Imputation Using Sector-Specific Medians:
  - Instead of imputing from percentiles computed across all firms, I now replace each missing value with the median for the same sector, which gives more realistic figures.
  - This approach significantly reduces unrealistic observations.
- Winsorization for Large Values:
  - For large numerical variables (e.g., revenue, profit, customers), I winsorize at the 2.5th and 97.5th percentiles to cap extreme values while preserving overall trends.
  - This is done after imputation to avoid distorting the sector-based replacements.
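A minimal Python sketch of the two steps above, for illustration only. It assumes each firm is a dict, missing values are stored as `None`, and the percentiles are computed over all firms pooled together; the field names `sector` and `annual_revenue` and both helper functions are hypothetical.

```python
from collections import defaultdict
from statistics import median, quantiles


def impute_sector_median(firms, field, sector_key="sector"):
    """Return copies of the records with missing `field` values (None)
    replaced by the median of the observed values in the same sector."""
    observed = defaultdict(list)
    for firm in firms:
        if firm[field] is not None:
            observed[firm[sector_key]].append(firm[field])
    sector_median = {s: median(v) for s, v in observed.items()}

    imputed = []
    for firm in firms:
        firm = dict(firm)  # copy so the input records are not mutated
        if firm[field] is None:
            # Stays None if no firm in the sector reports this field.
            firm[field] = sector_median.get(firm[sector_key])
        imputed.append(firm)
    return imputed


def winsorize(firms, field):
    """Cap `field` at its 2.5th and 97.5th percentiles across all firms."""
    values = [f[field] for f in firms if f[field] is not None]
    # n=40 gives cut points every 2.5%; the first and last are the bounds.
    cuts = quantiles(values, n=40, method="inclusive")
    low, high = cuts[0], cuts[-1]
    return [
        {**f, field: None if f[field] is None else min(max(f[field], low), high)}
        for f in firms
    ]


# Toy data: one missing revenue figure per sector.
firms = [
    {"sector": "retail", "annual_revenue": 100.0},
    {"sector": "retail", "annual_revenue": None},
    {"sector": "retail", "annual_revenue": 300.0},
    {"sector": "tech", "annual_revenue": 1000.0},
    {"sector": "tech", "annual_revenue": 5000.0},
    {"sector": "tech", "annual_revenue": None},
]
imputed = impute_sector_median(firms, "annual_revenue")  # impute first
capped = winsorize(imputed, "annual_revenue")            # then winsorize
```

Note that each variable is filled in independently here, so nothing ties an imputed female-staff figure to the same firm's reported total staff.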
Initially, I imputed each missing value with a random number drawn between the 25th and 75th percentiles of that variable across all firms, ignoring sector differences. This resulted in many illogical observations (e.g., firms with unrealistically high or low staff/customer numbers).
After switching to sector-specific median imputation, the number of unrealistic observations dropped significantly, but I still have about 3 problematic cases per variable. These include:
- Firms reporting more female staff than total staff
- Unrealistic revenue-to-customer ratios
- Firms with profit aspirations far below current profits
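The three kinds of problem cases can be flagged with simple rule checks. The sketch below is only an illustration: the field names, the function `find_inconsistencies`, and all thresholds (what counts as an "unrealistic" ratio or "far below") are hypothetical placeholders, not values from my data.

```python
def find_inconsistencies(firm, min_rev_per_customer=1.0,
                         max_rev_per_customer=10_000.0,
                         min_aspiration_ratio=0.5):
    """Return the list of consistency rules that a firm record violates.

    All thresholds are hypothetical and would need to be set per sector.
    """
    problems = []

    female, total = firm.get("female_staff"), firm.get("total_staff")
    if female is not None and total is not None and female > total:
        problems.append("female staff exceeds total staff")

    revenue, customers = firm.get("annual_revenue"), firm.get("customers")
    if revenue is not None and customers:  # skips None and zero customers
        ratio = revenue / customers
        if not (min_rev_per_customer <= ratio <= max_rev_per_customer):
            problems.append("unrealistic revenue per customer")

    profit, target = firm.get("annual_profit"), firm.get("profit_aspiration")
    if profit is not None and target is not None and profit > 0:
        if target < min_aspiration_ratio * profit:
            problems.append("profit aspiration far below current profit")

    return problems


firms = [
    {"id": "A", "total_staff": 10, "female_staff": 12,
     "annual_revenue": 50_000.0, "customers": 100,
     "annual_profit": 5_000.0, "profit_aspiration": 6_000.0},
    {"id": "B", "total_staff": 40, "female_staff": 15,
     "annual_revenue": 900_000.0, "customers": 3,
     "annual_profit": 80_000.0, "profit_aspiration": 10_000.0},
    {"id": "C", "total_staff": 25, "female_staff": 10,
     "annual_revenue": 200_000.0, "customers": 400,
     "annual_profit": 20_000.0, "profit_aspiration": 25_000.0},
]
flagged = {f["id"]: find_inconsistencies(f) for f in firms}
```

Running the checks on both the raw and the imputed data would show whether a flagged case was reported that way by the firm or created by the imputation.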
Given my current approach (sector-specific median imputation before winsorizing), what additional steps can I take to resolve the remaining inconsistencies? Would a different method (e.g., regression imputation) help?
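To make the regression option concrete, here is a rough sketch of what it could look like for one pair of variables: female staff is predicted from total staff with a one-predictor least-squares fit on the firms that report both. The field names and helper functions are hypothetical, and clamping the prediction to the range from 0 to total staff is an extra assumption added here, not part of plain regression imputation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_by_regression(firms, target="female_staff", predictor="total_staff"):
    """Fill missing `target` values with predictions from `predictor`."""
    complete = [f for f in firms
                if f[target] is not None and f[predictor] is not None]
    intercept, slope = fit_line([f[predictor] for f in complete],
                                [f[target] for f in complete])
    result = []
    for f in firms:
        f = dict(f)  # copy so the input records are not mutated
        if f[target] is None and f[predictor] is not None:
            predicted = intercept + slope * f[predictor]
            # Assumption: clamp so the imputed value never exceeds the total.
            f[target] = min(max(predicted, 0.0), f[predictor])
        result.append(f)
    return result


firms = [
    {"total_staff": 10, "female_staff": 4},
    {"total_staff": 20, "female_staff": 8},
    {"total_staff": 30, "female_staff": 12},
    {"total_staff": 50, "female_staff": None},
]
filled = impute_by_regression(firms)
```

Unlike a sector median, the imputed value here scales with the firm's own reported total staff.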
Would appreciate any insights from the community!