  • replace missing data with the mean of a subscale

    Hi there,

    I was using the syntax
    Code:
    foreach var of varlist * {
        summ `var'
        replace `var' = r(mean) if missing(`var')
    }
    but this seems to have replaced missing data with the complete mean rather than the subscale mean.

    Suppose I have items q6, q7, q8, q12, q18, q19 and item q8 is missing. How do I replace q8 with the subscale mean, i.e. can I compute (q6+q7+q12+q18+q19)/6 to replace q8?

    So, for example, if the item scores are:
    q6: 4
    q7: 3
    q8: . (missing)
    q12: 3
    q18: 1
    q19: 4

    Can I get Stata to sum the other subscale scores and divide by the total number of scores to leave the mean of 2.5 to replace q8 without me having to do this manually?

    Hope I explained that ok?
    Many thanks for your help in advance,
    Mary-Elaine.

  • #2
    How is your sub-scale defined? Please post a sample of your data using dataex program
    Code:
    ssc install dataex
    dataex
    Regards
    --------------------------------------------------
    Attaullah Shah, PhD.
    Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
    FinTechProfessor.com
    https://asdocx.com
    Check out my asdoc program, which sends outputs to MS Word.
    For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.



    • #3
      I hope I understood this correctly, maybe try something like this:

      Code:
      webuse automiss, clear
      egen m = rowmean(turn displacement gear_ratio)
      list turn displacement gear_ratio m in 1/10
      foreach VAR of varlist turn displacement gear_ratio {
          replace `VAR' = m if missing(`VAR')
      }
      list turn displacement gear_ratio m in 1/10
      However, if you just want to generate a scale, alpha ignores missing values when it computes each respondent's score, and its min() option sets how many nonmissing items a respondent needs in order to be included; see help alpha.
      Best wishes

      (Stata 16.1 MP)



      • #4
        Mary:
        notwithstanding its technical feasibility, replacing missing values with the mean of their observed counterparts is, in general, a very bad idea: see https://www.lshtm.ac.uk/media/37311 (Methods to avoid paragraph).
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          Originally posted by Attaullah Shah
          How is your sub-scale defined? Please post a sample of your data using dataex program
          Code:
          ssc install dataex
          dataex
          I tried to use dataex but the following error code came up: input statement exceeds linesize limit. Try specifying fewer variables
          r(1000);

          I have 112 participants, 11 participants have 1 item of missing data.

          For one participant, in the example above, the subscale (SWTD) is made up of 6 questions: q6, q7, q8, q12, q18, q19, but q8 is missing for that participant. So the values I have for that participant's questions are:
          q6: 4
          q7: 3
          q8: . (missing)
          q12: 3
          q18: 1
          q19: 4

          Can I get Stata to sum the other subscale scores and divide by the total number of scores to leave the mean of 2.5 to replace q8 for one participant without me having to do this manually? i.e. (q6+q7+q12+q18+q19)/6 = 2.5, and then have Stata automatically replace that one participant's missing response for q8 with 2.5?

          Many thanks for your time and guidance.
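
          A minimal sketch of the calculation asked about here, assuming the data are in wide format (one row per participant, one variable per item). The second participant's scores and the variable names swtd_total, mean_of_6, and mean_avail are made up for illustration; only the item names and the first participant's scores come from the post. Note that dividing by 6 counts the missing item as a zero, whereas rowmean() divides by the number of items actually answered.

          ```stata
          * Hypothetical wide-format data: one row per participant
          clear
          input id q6 q7 q8 q12 q18 q19
          1 4 3 . 3 1 4
          2 2 4 3 5 2 3
          end

          * Sum of the answered items, divided by all 6 items (missing counts as 0)
          egen double swtd_total = rowtotal(q6 q7 q8 q12 q18 q19)
          generate double mean_of_6 = swtd_total / 6

          * Mean of the answered items only (divides by 5 for participant 1)
          egen double mean_avail = rowmean(q6 q7 q8 q12 q18 q19)

          * Fill each missing item with the chosen row mean
          * (swap in mean_avail to use the mean of the available responses)
          foreach v of varlist q6 q7 q8 q12 q18 q19 {
              replace `v' = mean_of_6 if missing(`v')
          }
          list, noobs
          ```

          For participant 1 this gives mean_of_6 = 15/6 = 2.5 and mean_avail = 15/5 = 3, so the choice of denominator changes the imputed value.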



          • #6
            Originally posted by Carlo Lazzaro
            Mary:
            notwithstanding its technical feasibility, replacing missing values with the mean of their observed counterparts is, in general, a very bad idea: see https://www.lshtm.ac.uk/media/37311 (Methods to avoid paragraph).
            Thank you, I know; I am very conscious of this. I was advised that, as I only have 42 missing values out of a possible 2912 scores, it wouldn't make much of a difference. Perhaps I should rethink.



            • #7
              Mary:
              even though the share of your missing values is actually negligible, the issue is the mechanism underlying the missingness (ignorable or not).
              Kind regards,
              Carlo
              (Stata 19.0)
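
              How much is missing, and where, can be inspected before choosing a method. A sketch, assuming wide-format data with invented values and a hypothetical counter variable nmiss. These descriptives show where values are missing, not why, so they cannot by themselves establish whether the missingness is ignorable.

              ```stata
              * Hypothetical wide-format data: one row per participant
              clear
              input id q6 q7 q8 q12 q18 q19
              1 4 3 . 3 1 4
              2 2 4 3 5 2 3
              3 3 3 5 4 2 2
              4 5 2 4 . 3 4
              end

              * Count missing values per item
              misstable summarize q6 q7 q8 q12 q18 q19

              * Show which combinations of items are missing together
              misstable patterns q6 q7 q8 q12 q18 q19

              * Number of missing items per respondent
              egen nmiss = rowmiss(q6 q7 q8 q12 q18 q19)
              tabulate nmiss
              ```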



              • #8
                On generating the mean (#5), your denominator suggests that you are treating missing values as zeros. The calculated mean will therefore not reflect the mean of the available responses. If this is what you want, see wanted1. Otherwise, wanted2.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input float id str5 category float SWTD
                1 "q6"  4
                1 "q7"  3
                1 "q8"  .
                1 "q12" 3
                1 "q18" 1
                1 "q19" 4
                2 "q6"  2
                2 "q7"  4
                2 "q8"  .
                2 "q12" .
                2 "q18" 2
                2 "q19" 3
                end
                bys id: egen wanted1= mean(cond(missing(SWTD), 0, SWTD))
                bys id: egen wanted2= mean(SWTD)
                Followed by

                Code:
                * wanted1 counts missing items as zeros; use wanted2 for the mean of the available responses
                replace SWTD = wanted1 if missing(SWTD)
                Res.:

                Code:
                . l, sepby(id)
                
                     +-------------------------------------------+
                     | id   category   SWTD    wanted1   wanted2 |
                     |-------------------------------------------|
                  1. |  1         q6      4        2.5         3 |
                  2. |  1         q7      3        2.5         3 |
                  3. |  1         q8      .        2.5         3 |
                  4. |  1        q12      3        2.5         3 |
                  5. |  1        q18      1        2.5         3 |
                  6. |  1        q19      4        2.5         3 |
                     |-------------------------------------------|
                  7. |  2         q6      2   1.833333      2.75 |
                  8. |  2         q7      4   1.833333      2.75 |
                  9. |  2         q8      .   1.833333      2.75 |
                 10. |  2        q12      .   1.833333      2.75 |
                 11. |  2        q18      2   1.833333      2.75 |
                 12. |  2        q19      3   1.833333      2.75 |
                     +-------------------------------------------+



                • #9
                  Originally posted by Mary McCavert
                  I am very conscious of this. I was advised that as I only have 42 missing data out of a possible 2912 scores, it wouldn't make much of a difference. Perhaps I should rethink.
                  It's difficult to judge whether missingness is ignorable in any particular case, and from what I've seen with questionnaires, it's generally handled with either listwise deletion if the number of respondents with one or more missing item responses doesn't result in too much of a hit to power, or some method of imputing the missing values if there aren't too many that are missing in a given subscale for a respondent.

                  If you have relatively few missing data that are more or less unsystematically distributed throughout your dataset, then I've seen claims just as you have been advised, namely, that it doesn't actually matter how you impute them, and replacing with the subscalewise mean is easy and convenient. I think that this holds when your subscale score is a simple sumscore. Is that right in your case? I think that mean imputation will be a bit problematic if you have to do dipsy-doodles and to jump through hoops in order to compute the subscale’s score—for example, reverse the sense (high-to-low order) of responses on some items, recode responses on other items to quirky values, form subscale scores through weighted sums and so on, such as one has to do with SF-12 and SF-36. In that case, if you’re going to use mean imputation, then maybe use the average of that item across other respondents (say, of a similar demographic category or whatever representative criteria you have to match respondents to the one with the missing response) instead of average across the other items of the subscale within that respondent.
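
                  The item-across-respondents variant described above can be sketched as follows, assuming wide-format data and a hypothetical matching variable group (any demographic category would do). All values are invented. Without the by-group step, this is essentially what the loop in the opening post does with summarize and r(mean).

                  ```stata
                  * Hypothetical wide-format data with a matching variable
                  clear
                  input id group q6 q7 q8 q12 q18 q19
                  1 1 4 3 . 3 1 4
                  2 1 2 4 3 5 2 3
                  3 2 3 3 5 4 2 2
                  4 2 5 2 4 . 3 4
                  end

                  * For each item, average over the respondents in the same group
                  * who answered it, then fill that item's missing values
                  foreach v of varlist q6 q7 q8 q12 q18 q19 {
                      tempvar m
                      bysort group: egen double `m' = mean(`v')
                      replace `v' = `m' if missing(`v')
                      drop `m'
                  }
                  sort id
                  list, noobs
                  ```

                  Here participant 1's q8 is filled with 3 (the only other q8 response in group 1) and participant 4's q12 with 4.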

                  At least for the SF-12, whose subscales have a convoluted method of scoring, I’ve seen others advocate slightly more involved but still accessible methods of imputing, such as predictions from multiple regression (Liu et al., 2005) and expectation maximization (Wirtz et al., 2020). If your subscales' scores are anything more involved than sumscores, then you might want to consider one of these other approaches.

                  Honghu Liu, Ron D Hays, John L Adams, Wen-Pin Chen, Diana Tisnado, Carol M Mangione, Cheryl L Damberg, and Katherine L Kahn, Imputation of SF-12 Health Scores for Respondents with Partially Missing Data. _Health Serv Res_ 40:905–22, 2005.

                  Markus A Wirtz, Nicole Röttele, Matthias Morfeld, Elmar Brähler, Heide Glaesmer, Handling Missing Data in the Short Form-12 Health Survey (SF-12): Concordance of Real Patient Data and Data Estimated by Missing Data Imputation Procedures. _Assessment_ 00:1-14, 2020.
                  Last edited by Joseph Coveney; 30 Jan 2021, 22:34.

