  • Compute mean and mode for missing data

    Hello,

    I'm somewhat new to Stata and have spent days looking for an appropriate solution. I hope you can help me with my question.

    My dataset contains some missing data. For illustrative purposes, I provide three (example!) variables and their type:

    1. Age: Continuous variable (integer)
    2. Level of education: Ordinal variable (1: low, 2: intermediary, 3: high)
    3. Gender: Nominal variable / dummy (1: male, 2: female)

    For Age, I want to compute the sample mean (but exclude missing values in the computation) and assign the computed sample mean only to the missing values.
    For Level of education, I want to compute the mode (value with highest frequency) of the sample and assign that mode to the missing values.
    For Gender, I want to compute the mode of the sample and assign that mode to the missing values.

    Furthermore, what is the best way to deal with multiple modes? Given that the ordinal and nominal variables have categorical values, taking the average of two modes is not going to work.

    Thank you!

    Best, Dave

  • #2
    The command egen has functions for mean and mode which you can apply groupwise.

    There is no easy answer when there are ties for mode. But the function just mentioned has various options that might appeal.
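A minimal sketch of that approach on made-up data, offered subject to the caveats raised in this thread. The variable names (age, educ, gender and the *_imp variables) are assumptions for illustration. minmode is one of the tie-breaking options of egen's mode() function (maxmode and nummode() are the others); without such an option, mode() returns missing when there are ties.

```stata
* Toy data (invented for illustration): some values are missing
clear
input age educ gender
24 1 1
42 2 2
30 2 2
.  3 .
55 . 1
.  3 2
end

* egen's mean() ignores missing values when computing the mean
egen age_mean = mean(age)
generate age_imp = cond(missing(age), age_mean, age)

* educ has two modes here (2 and 3); minmode picks the smaller one
egen educ_mode = mode(educ), minmode
generate educ_imp = cond(missing(educ), educ_mode, educ)

* gender has a single mode, so no tie option is needed
egen gender_mode = mode(gender)
generate gender_imp = cond(missing(gender), gender_mode, gender)

list
```

Keeping the originals and writing to new *_imp variables means the fill-in can be undone or compared against the observed data later.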

    This is a marker for all those discussions that might point out that these methods are widely considered long past their sell-by date for imputation. For example, assume two categories. Then assigning the mode to missing values will just bias estimation of the probability of the more frequent category unless you're certain that the missings all belong to the modal category. Concretely, imagine 5 females, 3 males, 2 missing. After imputation we are estimating pr(female) as 0.7 rather than 0.625. Naturally, this is a fairly cheap criticism and anything more elaborate is also much harder work and not white magic in any case.
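The arithmetic behind that example can be checked directly. This is only a sketch using the counts quoted above (5 females, 3 males, 2 missing).

```stata
* Share female among the 8 observed cases
display "observed share female:      " 5/(5+3)

* Share female after both missings are assigned the modal category
display "share after mode imputation: " (5+2)/(5+3+2)
```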


    • #3
      You are better off not replacing the missing values than replacing them with the mean/median/mode. The latter will tend to make things worse. Imagine a scatterplot for a bivariate regression, and where your "imputed" values end up in that scatterplot. If you really want to deal with the missing values you can look at help mi, but the default of ignoring cases with missing values tends to be fairly robust (compared to the alternatives). So my recommendation would generally be that, unless you are an expert, you are better off leaving the missing values alone and just focusing on the data you do have.
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------


      • #4
        Indeed. There is software that gives up if missing values are met anywhere and not explicitly excluded, but Stata's general convention is to ignore missing data unless you specify otherwise (and sometimes even then).
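As an illustration of that convention, here is a small sketch using the auto dataset shipped with Stata, in which rep78 is missing for 5 of the 74 cars. The choice of dataset and model is an assumption made purely for the example.

```stata
sysuse auto, clear

* How many observations have rep78 missing?
count if missing(rep78)

* summarize silently uses only the nonmissing values of rep78
summarize rep78

* regress drops any observation with a missing value in the model variables
regress price mpg rep78
display "observations used: " e(N)
```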


        • #5
          For continuous variable age
          Code:
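          * Start from the mean of the observed ages (egen skips missings)
          egen age_new = mean(age)
          * Restore the observed age wherever it is not missing
          replace age_new = age if !missing(age)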
          and similar logic for the ordinal variable.

          The syntax you used computes the mean of age limited to just those observations where age is missing, so for those observations age_new will be missing, and for observations where age is not missing, nothing is computed, so for those observations, age_new will also be missing.


          • #6
            William is right, but to make it concrete, imagine ages 24, 42, 124 and two missings. You're asking what's the mean of the missings and Stata can only return missing as a result.
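A sketch of that example, using the three ages above plus two missings. The if missing(age) qualifier is an assumption about what the problematic syntax looked like, and the variable names mean_wrong and mean_ok are invented for illustration.

```stata
clear
input age
24
42
124
.
.
end

* Restricting to the missings leaves nothing to average: every value is missing
egen mean_wrong = mean(age) if missing(age)

* Without the restriction, the mean of the three observed ages is used
egen mean_ok = mean(age)

list
```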


            • #7
              I am having trouble setting up a panel data set. I am using the European Social Survey Cumulative File, for those countries that had surveys in all 7 rounds. The data are categorized by country (numerical) and essround (numerical), and when I tabulate these variables they look right (the table looks like a balanced panel).
              But if I do xtset country it says it is unbalanced. Why?
              And if I try xtset country essround I get error 451, repeated time values within panel.

              I followed the posted fix by Nick Cox, but when I get to the remedy (duplicates tag....) it basically classifies the whole dataset as duplicates.
              I am stuck.


              • #8
                John:
                welcome to this forum.
                Please repost your query following the FAQ advice (mainly: do not append your query to an existing thread on a totally different subject; provide an excerpt/example of your data via -dataex-; post what you typed and what Stata gave you back within CODE delimiters). Thanks.
                With a bit of guess-work I would say that your panel dataset has missing values.
                Please note that Stata can easily handle both balanced and unbalanced panel datasets.
                Kind regards,
                Carlo
                (Stata 19.0)
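To illustrate the last point only, here is a sketch on invented data (the names id, year and y are assumptions) in which panel 2 lacks an observation for 2002. xtset accepts the data and simply reports the panel as unbalanced.

```stata
clear
input id year y
1 2001 3
1 2002 4
1 2003 5
2 2001 7
2 2003 9
end

* Declare the panel structure; one observation per id-year pair is required
xtset id year

* Show the participation pattern of each panel over time
xtdescribe
```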


                • #9
                  Dave:
                  In addition to the excellent advice already provided, you may want to take a look at https://missingdata.lshtm.ac.uk/, which debunks the black magic behind the tragically oversold naive methods for dealing with missing values (such as replacing missing data with the mean of the observed ones).
                  Kind regards,
                  Carlo
                  (Stata 19.0)


                  • #10
                    Carlo,

                    Thanks. I am not sure where to post the query, but it is not really about missing data. I cannot get the xtset command to work without errors, and none of Nick's fixes work. I only mentioned in passing that when I set up a one-way panel (no time variable), Stata reported that the panel was unbalanced. I know that Stata can handle such situations once I get in the door, but I cannot get in the door! Is that clearer?

                    john


                    • #11
                      john ferejohn

                      While you are reading this answer, scroll to the top of the page above the first post, locate the word General shown in the screenshot below, and click on it. In the page that opens, you will see a button labelled "+ New Topic". That is how you create a new topic as Carlo advised.

                      Before posting, you should also follow Carlo's advice and review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

                      The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

                      [Screenshot: interface.png]
