  • replacing missing data with mean or median

    Hi guys,
    I have a dataset (n=586) with 29 variables. Some of the variables have 2 to 3% missing values. What is the code to replace the missing data with the mean or median for all variables at once?
    Thanks in advance

    abdelilah
    Last edited by abdelilah arredouani; 04 Dec 2018, 02:14.

  • #2
    I am not supposed to show you stuff like this, because it is old syntax and we are not supposed to use it...

    But try

    . for var * : summ X \ replace X = r(mean) if missing(X)

    it will probably do the job.

    • #3
      I know that replacing missing data with the mean is not the right thing to do, but in some variables I have only 3 or so missing points. I guess that using the mean would not affect my analysis.
      What is the code if I want to use the foreach command, please?

      • #4
        I thought for a second there that -foreach- does not accept the wildcard *. But it does accept it when you specify that the loop is over a varlist.


        Code:
        . sysuse auto, clear
        (1978 Automobile Data)
        
        . drop make
        
        . foreach var of varlist * {
          2. summ `var'
          3. replace `var' = r(mean) if missing(`var')
          4. }
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
               price |         74    6165.257    2949.496       3291      15906
        (0 real changes made)
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
                 mpg |         74     21.2973    5.785503         12         41
        (0 real changes made)
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
               rep78 |         69    3.405797    .9899323          1          5
        variable rep78 was int now float
        (5 real changes made)
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
            headroom |         74    2.993243    .8459948        1.5          5
        (0 real changes made)
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
               trunk |         74    13.75676    4.277404          5         23
        (0 real changes made)
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
              weight |         74    3019.459    777.1936       1760       4840
        (0 real changes made)
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
              length |         74    187.9324    22.26634        142        233
        (0 real changes made)
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
                turn |         74    39.64865    4.399354         31         51
        (0 real changes made)
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
        displacement |         74    197.2973    91.83722         79        425
        (0 real changes made)
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
          gear_ratio |         74    3.014865    .4562871       2.19       3.89
        (0 real changes made)
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
             foreign |         74    .2972973    .4601885          0          1
        (0 real changes made)

        • #5
          Thanks, Joro.

          • #6
            Joro showed some useful technique -- in terms of giving you what you asked for -- but accidentally showed some of the problems in this territory. Note first that you really don't need to drop string variables for this purpose.

            Code:
            ds, has(type numeric) 
            
            foreach v in `r(varlist)' {

            gets you going.
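
            As a hedged sketch of how that opening might be completed, assuming the auto dataset from #4 and the same summarize/replace body used there; the -meanonly- option is an addition here to keep the output short, not something taken from the thread:

            ```stata
            sysuse auto, clear

            * list numeric variables only, so string variables such as make are skipped
            ds, has(type numeric)

            * `r(varlist)' is expanded once, before summarize overwrites r()
            foreach v in `r(varlist)' {
                summarize `v', meanonly
                replace `v' = r(mean) if missing(`v')
            }
            ```

            For the median, -summarize `v', detail- with r(p50) in place of r(mean) should do the same job.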

            Then, as it turns out, the only variable in the auto dataset with missing values is rep78 for which the mean is about 3.406, which is not even a possible value! So, watch out.

            In general, replacement by the mean or median is an old technique with possibly the only advantages that it's easy to explain and implement. At best, replacing a few values with mean or median gives you similar results to ignoring observations with missing values, but that is something you can and should check out. At worst, it's an indefensible method without independent grounds for thinking that the real values behind missing values should be close to typical. If anything, the complete opposite is often more plausible.
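
            One way to run that check, sketched on the auto dataset; the model (price on mpg and rep78) and the variable name rep78_imp are purely illustrative assumptions:

            ```stata
            sysuse auto, clear

            * complete-case analysis: regress drops observations with missing rep78
            regress price mpg rep78

            * mean-imputed copy, kept separate so the original variable is untouched
            generate rep78_imp = rep78
            summarize rep78, meanonly
            replace rep78_imp = r(mean) if missing(rep78_imp)

            * same model on the imputed variable, for side-by-side comparison
            regress price mpg rep78_imp
            ```

            If the two sets of coefficients and standard errors differ noticeably, the replacement is doing more than filling a few gaps.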

            • #7
              Abdelilah:
              as an aside to the previous positive advice, I do discourage replacing missing values with the mean or (even worse) the median of the observed data.
              Nick has already highlighted all the downsides of that procedure, which is, in addition, old-fashioned, untenable in an era of powerful desktops and laptops, and proven to give biased results.
              As Nick's reply reminds us, if your scant handful of missing data is MCAR, a passive attitude can be enough: https://statisticalhorizons.com/list...n-its-not-evil
              Kind regards,
              Carlo
              (StataNow 18.5)

              • #8
                Thank you Nick for the pointer to the -ds- command. I have encountered it in your writings, but my wrong impression was that it does not do anything more than -describe-, and apparently it does a lot more than -describe-. To contribute more to my confusion, in Stata 11 they say "ds continues to work but, as of Stata 9, is no longer an official part of Stata." But then in Stata 12 -ds- becomes official again.

                I did not know whether the original poster had a string variable, or other variables, to be excluded from the imputation procedure. Hence I just dropped make.

                Here is another solution for excluding variables from the procedure without dropping them. For the sake of argument, let's say that we want to exclude make and rep78. The backbone of this solution is to order the variables conveniently:

                Code:
                . sysuse auto, clear
                (1978 Automobile Data)
                
                . order make rep78, first
                options not allowed
                r(101);
                
                . order make rep78
                
                . foreach var of varlist price-foreign {
                  2. summ `var'
                  3. replace `var' = r(mean) if missing(`var')
                  4. }
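
                An alternative sketch that leaves the variable order alone, assuming -ds- with its -not- option to list every variable except the named ones (here the same make and rep78 as above):

                ```stata
                sysuse auto, clear

                * every variable except make and rep78; no reordering needed
                ds make rep78, not

                * the list is captured in `r(varlist)' before the loop body runs
                foreach var in `r(varlist)' {
                    summarize `var', meanonly
                    replace `var' = r(mean) if missing(`var')
                }
                ```

                If other string variables were present, they would need to be named in the exclusion list as well, since this sketch does not filter by type.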
                Last edited by Joro Kolev; 04 Dec 2018, 07:22.

                • #9
                  And we encountered a bug in Stata 15.

                  The command -order- is supposed to take options: in particular, the option -first- moves the listed variables to the beginning of the dataset, and the option -last- moves them to the end.

                  This is what the help file of Stata 15 says, and this is how -order- worked since at least Stata 11.

                  Suddenly in Stata 15 -order- refuses to accept options

                  Code:
                  . order make rep78, first
                  options not allowed
                  r(101);

                  • #10
                    Works for me. (And I tried rep78 too.)

                    Code:
                    . sysuse auto, clear
                    (1978 Automobile Data)
                    
                    . order make mpg, first
                    
                    . ds
                    make          rep78         weight        displacement
                    mpg           headroom      length        gear_ratio
                    price         trunk         turn          foreign
                    
                    . update query
                    (contacting http://www.stata.com)
                    
                    Update status
                        Last check for updates:  04 Dec 2018
                        New update available:    none         (as of 04 Dec 2018)
                        Current update level:    15 Oct 2018  (what's new)
                    
                    Possible actions
                    
                        Do nothing; all files are up to date.
