Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Stata: Handling Missing Data & Winsorizing Large Variables

    I have data on companies that report various employment and financial metrics. The dataset includes:
    • Staffing Information:
      • Total staff
      • Part-time staff
      • Full-time staff
      • Female staff
    • Financial Metrics:
      • Monthly revenue
      • Annual revenue
      • Annual profit
      • Revenue aspirations for the next year
      • Profit aspirations for the next year
    • Customer Information:
      • Current number of customers
      • Expected number of customers in the next year
    Missing Data Challenge

    As expected, some firms have missing values across these variables. For example, a firm might report total staff but not specify female staff. Similarly, financial and customer-related variables have missing entries. Imputation & Winsorization Approach

    To handle missing values:
    1. Imputation Using Sector-Specific Medians:
      • Instead of imputing using percentiles across all firms, I now replace missing values with the median within the same sector to ensure more realistic figures.
      • This approach significantly reduces unrealistic observations.
    2. Winsorization for Large Values:
      • For large numerical variables (e.g., revenue, profit, customers), I winsorize at the 2.5% and 97.5% percentiles to cap extreme values while preserving overall trends.
      • This is done after imputation to avoid distorting the sector-based replacements.
    Remaining Issues

    Initially, I was imputing missing values using a random number between the 25th and 75th percentiles, ignoring sector differences. This resulted in many illogical observations (e.g., firms with unrealistically high or low staff/customer numbers).

    After switching to sector-specific median imputation, the number of unrealistic observations dropped significantly, but I still have about 3 problematic cases per variable. These include:
    • Firms reporting more female staff than total staff
    • Unrealistic revenue-to-customer ratios
    • Firms with profit aspirations far below current profits
    Question

    Given my current approach (sector-specific median imputation before winsorizing), what additional steps can I take to resolve the remaining inconsistencies? Would a different method (e.g., regression imputation) help?

    Would appreciate any insights from the community!

  • #2
    Well, you don't say how you plan to analyze this data once you have addressed these issues. But if you are going to be doing some kind of regression analyses, I think that the use of multiple imputation would be preferable to the single-imputation approaches you have mentioned. Moreover, given what these variables are, the missingness at random assumption underlying multiple imputation seems plausible to me.

    I am not a fan of Winsorization. Large values are not necessarily problematic. Have you explored the data graphically to see if these large values really distort relationships among the variables? If they do, might log-transformation or some other such transform resolve the problem?

    Comment


    • #3
      Tina:
      welcome to this forum.
      You should definitely go -mi-, provided that your data are MCAR or MAR (as Clyde suggests).
      Establishing the mechanism underlying the missingness of your data is mandatory before challenging yourself with any procedure aimed at dealing with missing values: if data are MNAR, things get trickier.
      I concurr with Clyde about winsorizing: "extreme" values may be totally justified by the data generating process (let's think about the annual profit of a successful microchips manufacturer that towers among its competitors).
      Kind regards,
      Carlo
      (StataNow 18.5)

      Comment

      Working...
      X