  • How to interpolate data with a large share of severely and systematically missing values?

    Hey colleagues,

    I have been bothered by the following question:

    My unit of analysis is the country-year. My dataset covers 180 countries from 1946 to 2010. I want to predict Y from X, controlling for a set of standard covariates (C1+C2+…+Cn). The trouble is that X is severely missing, with a missing rate above 90%. For some countries the missingness is systematic: X has not a single observed value.

    My question is whether any method allows interpolating or filling in at least some of the missing values. For instance, research shows that X is theoretically and empirically determined by a number of other variables (D1+D2+...+Dn). Can I predict X from D1+D2+...+Dn and then use the predicted value (E) as a proxy for X?

    Statistical analysis in the social sciences is frequently hindered by missing values, often at high rates. If there is no statistical solution to this problem, we would have to wait until the data become available or do qualitative work instead, e.g. case studies.

    Thanks a lot, and I wish all of you a happy Teachers' Day (China)!

    Sincerely
    Raymon Lucas

  • #2
    There are some issues to consider:

    1. It is important to understand the mechanism(s) by which the missing data came to be missing. The causes of the missingness are important for understanding the validity of different approaches to handling it.
    2. With 90% of the values of X missing, I think that any analysis, no matter how sophisticated, is going to be pretty much skating on thin ice.
    3. Given that you have variables D* that you believe are useful for predicting the values of X, and provided the mechanisms of missingness do not introduce irretrievable bias (see 1. above), you might use multiple imputation. It is complicated to implement; you will need a lot of patience to get it working and to wait through what may be a very long runtime.
    4. Avoid the temptation to just use the D* variables to create a single predicted value E for each observation and use that in place of X. Even if the imputation method itself produces completely unbiased predictions, the predicted values carry no residual variation, so the resulting data set has reduced variance. That leads to underestimated standard errors, which in turn means that the confidence intervals are too narrow and the p-values too small.
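The variance shrinkage described in point 4 can be seen in a small simulation. This is only an illustrative sketch in Python (standard library only), not Stata code and not anything from the thread's data: the single predictor `d`, the true slope of 0.5, the 90% completely-at-random missingness, and the function names `fit_line` and `simulate` are all assumptions made for the example.

```python
import random
import statistics


def fit_line(ds, xs):
    """Ordinary least squares fit of x = a + b*d; returns (a, b)."""
    md, mx = statistics.fmean(ds), statistics.fmean(xs)
    sdd = sum((d - md) ** 2 for d in ds)
    b = sum((d - md) * (x - mx) for d, x in zip(ds, xs)) / sdd
    return mx - b * md, b


def simulate(n=2000, miss_rate=0.9, seed=1):
    """Compare the variance of the true X with X after single imputation."""
    rng = random.Random(seed)
    d = [rng.gauss(0, 1) for _ in range(n)]
    x = [0.5 * di + rng.gauss(0, 1) for di in d]  # hypothetical true X
    # Assumption: values go missing completely at random.
    observed = [rng.random() > miss_rate for _ in range(n)]
    a, b = fit_line(
        [di for di, o in zip(d, observed) if o],
        [xi for xi, o in zip(x, observed) if o],
    )
    # Single imputation: every missing X is replaced by its fitted value.
    x_filled = [xi if o else a + b * di for xi, di, o in zip(x, d, observed)]
    return statistics.variance(x), statistics.variance(x_filled)


if __name__ == "__main__":
    true_var, filled_var = simulate()
    print(f"variance of true X:    {true_var:.3f}")
    print(f"variance after filling: {filled_var:.3f}")
```

In this setup the filled-in variable has much less spread than the true X, because 90% of its values sit exactly on the fitted line. Any model that treats those values as observed data will report standard errors that are too small.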

    See https://statisticalhorizons.com/wp-c...aterials-1.pdf for a nice overview of methods of dealing with missing data.
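For readers unfamiliar with the mechanics of multiple imputation, here is a deliberately simplified sketch in Python (standard library only). It is not Stata's implementation and makes several assumptions: one predictor `d`, missingness completely at random, a bootstrap resample standing in for a proper posterior draw of the imputation model's parameters, and the mean of X as the quantity of interest. The names `make_data`, `impute_once`, and `pool_mean` are hypothetical. The pooling step follows Rubin's rules: total variance = average within-imputation variance + (1 + 1/m) × between-imputation variance.

```python
import random
import statistics


def fit_line(ds, xs):
    """OLS fit of x = a + b*d; returns (a, b, residual standard deviation)."""
    md, mx = statistics.fmean(ds), statistics.fmean(xs)
    sdd = sum((d - md) ** 2 for d in ds)
    b = sum((d - md) * (x - mx) for d, x in zip(ds, xs)) / sdd
    a = mx - b * md
    rss = sum((x - (a + b * d)) ** 2 for d, x in zip(ds, xs))
    return a, b, (rss / (len(ds) - 2)) ** 0.5


def make_data(n=1000, miss_rate=0.9, seed=42):
    """Toy data: X depends on d; about 90% of X is set to None at random."""
    rng = random.Random(seed)
    d = [rng.gauss(0, 1) for _ in range(n)]
    x = [2.0 + 0.5 * di + rng.gauss(0, 1) for di in d]
    x_obs = [xi if rng.random() > miss_rate else None for xi in x]
    return d, x_obs


def impute_once(d, x_obs, rng):
    """One completed data set: refit on a bootstrap sample, add residual noise."""
    pairs = [(di, xi) for di, xi in zip(d, x_obs) if xi is not None]
    boot = [rng.choice(pairs) for _ in pairs]  # reflects parameter uncertainty
    a, b, sd = fit_line([p[0] for p in boot], [p[1] for p in boot])
    return [
        xi if xi is not None else a + b * di + rng.gauss(0, sd)
        for di, xi in zip(d, x_obs)
    ]


def pool_mean(d, x_obs, m=20, seed=0):
    """Estimate mean(X) over m imputations and pool with Rubin's rules."""
    rng = random.Random(seed)
    estimates, within = [], []
    for _ in range(m):
        filled = impute_once(d, x_obs, rng)
        estimates.append(statistics.fmean(filled))
        within.append(statistics.variance(filled) / len(filled))
    q_bar = statistics.fmean(estimates)
    w = statistics.fmean(within)            # within-imputation variance
    b = statistics.variance(estimates)      # between-imputation variance
    total = w + (1 + 1 / m) * b
    return q_bar, total ** 0.5, w ** 0.5


if __name__ == "__main__":
    d, x_obs = make_data()
    q_bar, total_se, within_se = pool_mean(d, x_obs)
    print(f"pooled mean {q_bar:.3f}, pooled SE {total_se:.3f}, naive SE {within_se:.3f}")
```

The pooled standard error is larger than the naive one computed from any single completed data set. That extra between-imputation term is what a single predicted value E leaves out.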


    • #3
      Raymon:
      as an aside to Clyde's excellent reply, the main question concerns the aim of your research.
      You seem to be dealing with a very large (panel?) dataset. How can it be that the main independent variable is almost entirely missing? What results can be obtained from such a dataset? How could you disseminate them, let alone publish them?
      Kind regards,
      Carlo
      (Stata 19.0)


      • #4

        Thanks a lot, Clyde and Carlo!

        I will try multiple imputation so that I can generate some data to finish the research project at hand. I will take great care, as Clyde advised.

        X is the size of the civil service (the number of civil servants) across 202 countries from 1946 to 2010. The time range could be much shorter. The more recent the period, the more data points there are. The raw data can be downloaded from the International Labor Organization. Few countries provide the data, and most of those that do are established OECD countries. After 1980 or 1990, one or a few data points are available for many underdeveloped countries.

        I previously tried two approaches to the missing-value problem. First, I used public expenditure as % of GDP as a proxy for the civil service. This was why peer reviewers rejected my paper. Second, I attempted to take the difference between public wages and military wages and divide it by GDP. This indicator makes a lot of sense, but there seem to be no data for the latter.

        Your advice is really helpful! I will first impute X and run the models, and then replace X with a proxy as a robustness check, say, government consumption minus military expenditure as % of GDP, which is less satisfactory but makes some sense.
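Before imputing, it may help to quantify the coverage pattern described in this post: which decades have any observed X, and which countries have none at all. The following is a minimal sketch in Python (standard library only), assuming the data can be read as `(country, year, x)` rows with `x` set to `None` when missing; the function names and the toy rows are hypothetical.

```python
from collections import defaultdict


def coverage_by_decade(rows):
    """Share of country-years with an observed x, per decade."""
    seen = defaultdict(int)
    total = defaultdict(int)
    for _country, year, x in rows:
        decade = year // 10 * 10
        total[decade] += 1
        seen[decade] += x is not None
    return {dec: seen[dec] / total[dec] for dec in sorted(total)}


def countries_without_data(rows):
    """Countries for which x is never observed."""
    has_value = defaultdict(bool)
    for country, _year, x in rows:
        has_value[country] = has_value[country] or x is not None
    return sorted(c for c, ok in has_value.items() if not ok)


if __name__ == "__main__":
    # Toy rows for illustration only.
    rows = [
        ("A", 1975, None), ("A", 1985, 1.0),
        ("B", 1975, None), ("B", 1985, None),
    ]
    print(coverage_by_decade(rows))
    print(countries_without_data(rows))
```

Countries that never report X contribute nothing to the imputation model, so their imputed values rest entirely on relationships estimated from the reporting countries.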


        • #5
          Raymon:
          from a strategic viewpoint, please note that journals often require the cover letter to state explicitly why the submitted paper was previously rejected. It is therefore of paramount importance to highlight the difference(s) between the previous and the current research approach.
          Kind regards,
          Carlo
          (Stata 19.0)


          • #6
            Thank you for the further advice, Carlo!
