  • Testing whether attrition is informative

    Hi,

    I have a panel dataset with 13 waves and my dataset involves questionnaire responses from individuals in the Netherlands.
    The dependent variable is binary, measuring an individual's ability to save (saving=1 if individual indicated an ability to save; 0 otherwise).
    Code:
    . xtdes
    
        hhid:  6, 21, ..., 89972                                 n =       2976
        year:  2004, 2005, ..., 2016                             T =         13
               Delta(year) = 1 unit
               Span(year)  = 13 periods
               (hhid*year uniquely identifies each observation)
    My regression is as follows (please note that incomescaled is income divided by 1000 because this makes AME interpretations at a later stage more meaningful):
    Code:
    . gen incomescaled = income/1000
    
    . xtprobit saving $xlist employed retired health incomescaled risk
    > selfcontrol child savingexp partner uni owner male c.age##c.age
    >  i.year, re vce(cluster hhid) nolog
    I would like to investigate missingness in my dataset. In particular my aim is to see whether the attrition in my dataset is random and informative - I would like to see if there are differences between the attriting and non-attriting samples.

    I think this can be done by conducting significance tests of missingness, so I have done the following for incomescaled, to see if there is a significant difference in income between the attrited and non-attrited samples (because, theoretically, more of the poorer households may have left the sample, which may then lead to sample bias due to under-representation of poor households).
    Code:
    . mdesc saving incomescaled
    
        Variable    |     Missing          Total     Percent Missing
    ----------------+-----------------------------------------------
             saving |         266         13,217           2.01
       incomescaled |       5,759         13,217          43.57
    ----------------+-----------------------------------------------
    
    .
    . gen incomescaled_m=1 if incomescaled==.
    (7,458 missing values generated)
    
    .
    . replace incomescaled_m=0 if incomescaled!=.
    (7,458 real changes made)
    
    .
    . tab incomescaled_m
    
    incomescale |
            d_m |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      7,458       56.43       56.43
              1 |      5,759       43.57      100.00
    ------------+-----------------------------------
          Total |     13,217      100.00
    
    .
    . sort incomescaled_m
    
    .
    . by incomescaled_m: su saving
    
    --------------------------------------------------------------------------------------------
    -> incomescaled_m = 0
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
          saving |      7,330    .4278308     .494798          0          1
    
    --------------------------------------------------------------------------------------------
    -> incomescaled_m = 1
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
          saving |      5,621    .3239637     .468028          0          1
    
    
    .
    . ttest saving, by(incomescaled_m)
    
    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
           0 |   7,330    .4278308    .0057793     .494798    .4165017    .4391599
           1 |   5,621    .3239637    .0062426     .468028    .3117258    .3362016
    ---------+--------------------------------------------------------------------
    combined |  12,951    .3827504    .0042712    .4860769    .3743781    .3911226
    ---------+--------------------------------------------------------------------
        diff |            .1038671    .0085697                .0870693     .120665
    ------------------------------------------------------------------------------
        diff = mean(0) - mean(1)                                      t =  12.1203
    Ho: diff = 0                                     degrees of freedom =    12949
    
        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
    From this could I conclude that there is a difference in the incomes of the attrited and non-attrited samples?
    Or could someone please suggest an alternative method if the above is incorrect?

    Please let me know if further clarification is required

    Thank you
    Last edited by Rose Simmons; 18 Apr 2017, 10:53.
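
    The gen/replace pair above can be written in one line with the missing() function. The following is only a minimal sketch on simulated data: the variable names mirror the post, but the values are made up for illustration. Note also that the ttest compares mean saving (not income) between observations with and without a recorded income.

    ```stata
    * Simulated stand-in for the panel; values are illustrative only
    clear
    set seed 12345
    set obs 1000
    generate double incomescaled = 30 + 10*rnormal()
    replace incomescaled = . if runiform() < 0.4
    generate byte saving = runiform() < 0.4

    * missing() returns 1 when the value is missing and 0 otherwise,
    * so one line replaces the gen/replace pair
    generate byte incomescaled_m = missing(incomescaled)
    tabulate incomescaled_m

    * Compare mean saving across the two groups, allowing unequal variances
    ttest saving, by(incomescaled_m) unequal
    ```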

  • #2
    Rose:
    from -ttest- results you can see that there's a systematic difference in saving between -incomescaled_m- groups.
    So, you have around (31,230/38,688)=81% missing values for -incomescaled-. A cautious approach would probably be to drop -incomescaled- from the set of predictors.
    Kind regards,
    Carlo
    (StataNow 18.5)

    • #3
      if saving were never missing (and you can clear up another issue discussed below), then this would be one part of a test of whether incomescaled were "missing completely at random" (MCAR); note that the "missing at random" (MAR) assumption is untestable

      I don't understand part of your output:
      Code:
       
      . gen incomescaled_m=1 if incomescaled==.
      (7,458 missing values generated)

      . replace incomescaled_m=0 if incomescaled!=.
      (7,458 real changes made)

      . tab incomescaled_m

      incomescale |
              d_m |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                0 |      7,458       56.43       56.43
                1 |      5,759       43.57      100.00
      ------------+-----------------------------------
            Total |     13,217      100.00
      why are there 5,759 "1's" given your -generate- results above the table?

      also, you say
      aim is to see whether the attrition in my dataset is random and informative
      I assume there is a typo here as if attrition is random then it is not informative; please clarify

      • #4
        Carlo Lazzaro thank you for your reply

        I have realised why -incomescaled- had so many missing values, it is because I had originally tried -tsfill, full- to balance the dataset so that I could tell which years were missing. Upon reflection, it is probably best not to do -tsfill, full- because this creates many missing values and so I think it could be misleading about the results.
        Apologies for this, I have since edited #1 to reflect the values without -tsfill, full-

        The p-value from the ttest remains at zero, so I think the conclusion will be that there is indeed a systematic difference in income between those who dropped out of the sample and those who remained. May I ask if there is a way to show whether those who dropped out of the sample had lower (or higher) incomes than those who remained?

        Thanks

        • #5
          Rich Goldstein thank you for your reply

          I have conducted a mcartest:
          Code:
          . mcartest $xlist employed retired health income risk selfcontrol child savingexp partner un
          > i owner male age, emoutput nolog
          
          Expectation-maximization estimation      Number obs           =      9867
                                                   Number missing       =      8944
                                                   Number patterns      =        59
          Prior: uniform                           Obs per pattern: min =         1
                                                                    avg =  167.2373
                                                                    max =      5241
          
          Observed log likelihood = -159901.91 at iteration 15
          
          -------------------------------------------------------------------------------------------
                       |      prec   purchase     retire    bequest    mediumh      longh   employed
          -------------+-----------------------------------------------------------------------------
          Coef         |                                                                            
                 _cons |  17.21164    9.64401   14.23288   9.876836   .4486985   .0356982   .5287321
          -------------+-----------------------------------------------------------------------------
          Sigma        |                                                                            
                  prec |  9.650364   2.653039   5.961269    2.35639   .1190314   .0226142  -.1061667
              purchase |  2.653039   11.29252   5.585159   3.287342   .1143339   .0413466   .3470621
                retire |  5.961269   5.585159   18.07993   4.558797   .2825322    .075291   .3273358
               bequest |   2.35639   3.287342   4.558797   25.73814   .2003085   .0279143  -.3021642
               mediumh |  .1190314   .1143339   .2825322   .2003085   .2474601  -.0160152   .0029871
                 longh |  .0226142   .0413466    .075291   .0279143  -.0160152   .0343796   .0032975
              employed | -.1061667   .3470621   .3273358  -.3021642   .0029871   .0032975   .2491745
               retired |  .0930456  -.3311124  -.3488665   .3765796   .0082096  -.0050864  -.1821921
                health | -.0170121   .1237921   .0137143   .0658642   .0154973  -.0013984   .0768141
                income | -1083.358   880.6893  -1872.866   4122.921   882.4843   52.91977   1295.523
                  risk | -3.786944   2.981907   -.474945   .4425276   .1815103    .015856   .4386603
           selfcontrol |  .3210033  -.2704047  -.0093713   .0753243   .0846079   .0082493  -.0792879
                 child | -.2769498   .2110997   .0052549   .7243602   .0010089   .0024441   .1923543
             savingexp |  .1074745    .122478    .121612   .1000051   .0330926   .0014483   .0211156
               partner | -.0793618  -.0944943  -.1150884   .4311682   .0168049  -.0019489   .0012033
                   uni | -.0362468   .0959123  -.0419606  -.0931491   .0117892   .0003633    .012833
                 owner |  .0323899   .0375088   .0250265   .4319967   .0377703  -.0000735   .0193809
                  male | -.1868004  -.1169552  -.1775345   .1917804   .0102924  -.0006911  -.0002442
                   age |  2.331756  -18.49912  -9.623477   11.99615   .3146326  -.1119548  -5.211748
          -------------------------------------------------------------------------------------------
          
          -------------------------------------------------------------------------------------------
                       |   retired     health     income       risk  selfcon~l      child  savingexp
          -------------+-----------------------------------------------------------------------------
          Coef         |                                                                            
                 _cons |   .344583   3.830952   31986.03   16.72017   5.231135   .5027871   .3757535
          -------------+-----------------------------------------------------------------------------
          Sigma        |                                                                            
                  prec |  .0930456  -.0170121  -1083.358  -3.786944   .3210033  -.2769498   .1074745
              purchase | -.3311124   .1237921   880.6893   2.981907  -.2704047   .2110997    .122478
                retire | -.3488665   .0137143  -1872.866   -.474945  -.0093713   .0052549    .121612
               bequest |  .3765796   .0658642   4122.921   .4425276   .0753243   .7243602   .1000051
               mediumh |  .0082096   .0154973   882.4843   .1815103   .0846079   .0010089   .0330926
                 longh | -.0050864  -.0013984   52.91977    .015856   .0082493   .0024441   .0014483
              employed | -.1821921   .0768141   1295.523   .4386603  -.0792879   .1923543   .0211156
               retired |  .2258455  -.0151247   25.85364  -.3103795   .1156065  -.1632184   .0029043
                health | -.0151247   .5255931   1635.571   .2837704   .1275956   .0647257   .0415905
                income |  25.85364   1635.571   1.08e+09   11156.27   2183.407   1240.501   2169.993
                  risk | -.3103795   .2837704   11156.27   39.16448  -.4310057   .6880169   .0762291
           selfcontrol |  .1156065   .1275956   2183.407  -.4310057   2.225804  -.1978589   .1807742
                 child | -.1632184   .0647257   1240.501   .6880169  -.1978589   .9109834  -.0256144
             savingexp |  .0029043   .0415905   2169.993   .0762291   .1807742  -.0256144   .2345818
               partner |  .0350431   .0500038   2579.413   .1363759   .0251232   .1274641   .0147169
                   uni | -.0083959   .0231324   1694.189   .2069189   .0331075  -.0126351   .0148898
                 owner |  .0128889   .0448441   2370.133   .2071253   .0903794   .0693949   .0334286
                  male |   .030571   .0265214   1655.534   .4622172   .0481309    .053058   .0189836
                   age |  5.120063   -1.36954   2753.934  -14.84941   3.565264  -5.365151  -.2163414
          -------------------------------------------------------------------------------------------
          
          ---------------------------------------------------------------------
                       |   partner        uni      owner       male        age
          -------------+-------------------------------------------------------
          Coef         |                                                      
                 _cons |  .6477146   .1609898    .701125   .7809871   56.21628
          -------------+-------------------------------------------------------
          Sigma        |                                                      
                  prec | -.0793618  -.0362468   .0323899  -.1868004   2.331756
              purchase | -.0944943   .0959123   .0375088  -.1169552  -18.49912
                retire | -.1150884  -.0419606   .0250265  -.1775345  -9.623477
               bequest |  .4311682  -.0931491   .4319967   .1917804   11.99615
               mediumh |  .0168049   .0117892   .0377703   .0102924   .3146326
                 longh | -.0019489   .0003633  -.0000735  -.0006911  -.1119548
              employed |  .0012033    .012833   .0193809  -.0002442  -5.211748
               retired |  .0350431  -.0083959   .0128889    .030571   5.120063
                health |  .0500038   .0231324   .0448441   .0265214   -1.36954
                income |  2579.413   1694.189   2370.133   1655.534   2753.934
                  risk |  .1363759   .2069189   .2071253   .4622172  -14.84941
           selfcontrol |  .0251232   .0331075   .0903794   .0481309   3.565264
                 child |  .1274641  -.0126351   .0693949    .053058  -5.365151
             savingexp |  .0147169   .0148898   .0334286   .0189836  -.2163414
               partner |  .2281804  -.0064747   .0714615   .1018254   .6214429
                   uni | -.0064747   .1350793   .0080146   .0000593  -.3404142
                 owner |  .0714615   .0080146   .2095487   .0440996   .4485456
                  male |  .1018254   .0000593   .0440996   .1710462   .7427155
                   age |  .6214429  -.3404142   .4485456   .7427155   223.4663
          ---------------------------------------------------------------------
          
          Little's MCAR test
          
          Number of obs       = 9867
          Chi-square distance = 5325.9430  
          Degrees of freedom  = 909
          Prob > chi-square   = 0.0000
          Question: From this can I conclude that the missing values in my dataset are not MCAR, so they might be MAR?

          For -incomescaled- there are 7,458 observations and 5,759 missing values. So for incomescaled_m, there will be 5,759 '1s' to represent these missing values.

          Code:
          -------------------------------------------------------------------------------------------
          incomescaled                                                                    (unlabeled)
          -------------------------------------------------------------------------------------------
          
                            type:  numeric (float)
          
                           range:  [0,1370.179]                 units:  .001
                   unique values:  2,348                    missing .:  5,759/13,217
          
                            mean:    32.699
                        std. dev:   32.8517
          
                     percentiles:        10%       25%       50%       75%       90%
                                          10      20.1        30    40.876        54


          I assume there is a typo here as if attrition is random then it is not informative; please clarify
          Yes sorry, I meant to say whether attrition is random or informative

          Thanks
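
          As a small worked check of the last lines of the output above, the reported p-value is the upper-tail probability of a chi-square distribution with 909 degrees of freedom evaluated at 5325.943. A sketch using Stata's chi2tail() function:

          ```stata
          * Upper-tail probability of a chi-square with 909 df at 5325.943
          display chi2tail(909, 5325.9430)
          ```

          The result is indistinguishable from zero at four decimals, which is why the table shows Prob > chi-square = 0.0000, i.e. the MCAR hypothesis is rejected.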

          • #6
            Rose:
            as Rich pointed out, attrition cannot be at the same time random and informative.
            As far as I can get your outcomes, it seems that you have missing values in both -saving- and -incomescaled-.
            It is impossible to state whether those who dropped out have higher/lower income than those who were retained in the dataset, as those data are missing (the same remark holds for -saving-).
            As Rich touched upon, if data are MAR, -mi- can produce plausible values that make, in turn, subsequent regressions (more) efficient.
            Conversely, if data are MNAR, things are trickier and more complex approaches are needed.
            Kind regards,
            Carlo
            (StataNow 18.5)

            • #7
              Rose:
              as MAR vs MNAR hypothesis cannot be tested (see Rich's reply at #3), you can only assume that data are MAR.
              Just out of curiosity: what does -savingexp- stand for?
              Last edited by Carlo Lazzaro; 18 Apr 2017, 12:01.
              Kind regards,
              Carlo
              (StataNow 18.5)

              • #8
                Originally posted by Carlo Lazzaro:
                It is impossible to state whether those who dropped out have higher/lower income than those who were retained in the dataset, as those data are missing (the same remark holds for -saving-).
                Ah I see that this would not be possible as the data is of course missing, thank you.

                I read this on page 17 and 23 of http://younglives.org.uk/files/YL-TN...-Attrition.pdf which is for a different survey to mine
                "In this section we investigate non-random attrition by searching for patterns in outcome variables and household characteristics of attriting households. First, we do so by tabulating attriting and non-attriting households over a number of important dimensions. Second, and more rigorously, we carry out statistical tests for the equality of means for a large range of predetermined and outcome variables."
                "Although modest, we find that attrition in the Young Lives sample is to some extent nonrandom. Even though not always significant, attriting households typically are located in urban areas, have a low wealth index, own fewer assets, are less educated,etc. Furthermore, we uncover substantial differences in household profiles across attrition categories"

                So in #1 I had tried to see whether there is a difference in the attriting and non-attriting samples.

                How may I test whether the attrition in my sample is random or non-random? As I would like to know whether there is possible attrition bias

                -savingexp- is measured in a similar way to -saving-, both are binary, but it measures whether the household anticipates an ability to save in the next year. I hypothesised that if an individual expects to be able to save next year, this may influence whether they saved this year

                Thanks

                • #9
                  I found this explanation online: "To test whether missingness is related to observed or unobserved causes, a dummy variable is created based on participants having complete or missing data on a variable or having left or remained in the study (e.g., 0, incomplete data/left the study data; 1, complete data/remained in the study). Then, a series of independent samples t-tests are used to indicate whether the dummy variable leads to significant average differences among other variables of interest."

                  So first I would like to create the dummy variable (0, incomplete data/left the study data; 1, complete data/remained in the study)

                  Code:
                  hhid    year
                  175    2016
                  184    2004
                  184    2005
                  184    2006
                  184    2007
                  184    2008
                  184    2009
                  194    2004
                  194    2005
                  194    2006
                  194    2007
                  228    2004
                  228    2005
                  228    2006
                  228    2007
                  228    2008
                  As the above shows, for example household number 184 left the study after 2009, so they were missing in 2010-2016.
                  How may I create a dummy variable to show that participants have left the study? E.g. a dummy which shows that they are missing in 2010

                  Thank you
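
                  One possible way to build such a flag, sketched on the hhid/year listing above. This assumes that 2016 is the final wave and that "left the study" means the household's last observed year is before 2016; the names lastyear, dropout and leaves_next are hypothetical. Households that skip a wave and later return are not treated as leavers here.

                  ```stata
                  * Rebuild the hhid/year listing from the post
                  clear
                  input long hhid int year
                  175 2016
                  184 2004
                  184 2005
                  184 2006
                  184 2007
                  184 2008
                  184 2009
                  194 2004
                  194 2005
                  194 2006
                  194 2007
                  228 2004
                  228 2005
                  228 2006
                  228 2007
                  228 2008
                  end

                  * Last wave in which each household appears
                  bysort hhid (year): egen int lastyear = max(year)

                  * dropout = 1 for households never seen in the final wave (2016)
                  generate byte dropout = lastyear < 2016

                  * leaves_next = 1 in a leaver's final observed wave, 0 otherwise
                  bysort hhid (year): generate byte leaves_next = (_n == _N) & dropout

                  list hhid year dropout leaves_next, sepby(hhid)
                  ```

                  With a flag like this, something such as -ttest incomescaled, by(dropout)- would compare income in the waves where it was observed between eventual leavers and stayers. It says nothing about income after a household has left.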

                  • #10
                    Rose:
                    there's no way to test whether missing data are MAR or MNAR (the difference between the two missing-data mechanisms rests on the missing data, which you cannot see because they are indeed missing!). The only formal test is the one that you already performed: MCAR vs MAR (and there was evidence that your missing data were not MCAR).
                    As per the excerpt of the link you quoted, you can only judge from the observed data whether those who have missing data on -income- or -saving- look different from the remaining observations (that is, whether the underlying missing mechanism is MAR or MNAR).
                    Last but not least: the explanation you provided to justify the inclusion of -saving- as a dependent variable and -savingexp- as predictor can possibly work the other way round, too: today's saving propensity can influence tomorrow's saving propensity.
                    As this is not my research field, I would recommend that you skim through the literature you're familiar with in order to avoid the risk of reverse causation (a form of endogeneity) between -saving- and -savingexp-.
                    Kind regards,
                    Carlo
                    (StataNow 18.5)

                    • #11
                      Rose:
                      the series of independent t-tests (which is not free from statistical problems; see Enders CK. Applied Missing Data Analysis. New York: The Guilford Press, 2010: pages 18-19) is for contrasting MCAR vs MAR mechanisms. There's no way to implement it for testing whether your missing data are MAR or MNAR (for the reasons reported at #10).
                      Kind regards,
                      Carlo
                      (StataNow 18.5)
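
                      For reference, the series of t-tests mentioned here could be run in a loop over the observed covariates. This is only a sketch on simulated data: the variable names follow the thread, the values are made up, and the missingness is deliberately built to depend on age. The caveats cited above still apply to this approach.

                      ```stata
                      * Simulated data; variable names follow the thread, values are made up
                      clear
                      set seed 2017
                      set obs 2000
                      generate double age = 25 + 50*runiform()
                      generate byte male = runiform() < 0.5
                      generate byte owner = runiform() < 0.7
                      generate double incomescaled = 30 + 10*rnormal()
                      * Make income more likely to be missing for older respondents
                      replace incomescaled = . if runiform() < invlogit(-3 + 0.05*age)
                      generate byte incomescaled_m = missing(incomescaled)

                      * One t-test per observed covariate, comparing the two missingness groups
                      foreach v of varlist age male owner {
                          display as text _newline "Variable: `v'"
                          ttest `v', by(incomescaled_m) unequal
                      }
                      ```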

                      • #12
                        Of course, sorry for being repetitive, I see your point about missingness!

                        I think perhaps I am getting confused between any missing values in the dataset for various variables e.g. saving, and missing values in terms of missing years.

                        That is a very good and valid point about reverse causation, I will definitely look into this further in the literature - thank you very much

                        • #13
                          Originally posted by Carlo Lazzaro:
                          Rose:
                          the series of independent t-tests (which is not free from statistical problems; see Enders CK. Applied Missing Data Analysis. New York: The Guilford Press, 2010: pages 18-19) is for contrasting MCAR vs MAR mechanisms. There's no way to implement it for testing whether your missing data are MAR or MNAR (for the reasons reported at #10).
                          Ah I see, thank you for clarifying that the t-tests I had mentioned were for MCAR vs MAR (and I have already established the data are MCAR using -mcartest- so there is no need for the t-tests) , and not for MAR vs MNAR which is what I would like to test but, as per #10, I now understand it is not possible to do

                          As always, you have been very helpful in clarifications. Thank you Carlo Lazzaro

                          • #14
                            Rose:
                            please note that, as per -mcartest- outcome, your data are not MCAR, but possibly (as you cannot test MAR vs MNAR) MAR.
                            Kind regards,
                            Carlo
                            (StataNow 18.5)

                            • #15
                              Thank you for the correction Carlo Lazzaro , my data are not MCAR.
                              Question 1: Based on my data being not MCAR, does this mean that the missingness is "informative"?
                              Could "informative" be interpreted as missingness is not independent of y and x variables?

                              As data are not MCAR, my understanding is that this missing data may lead to biased estimates (e.g. if say my data is MNAR and hypothetically people may be less inclined to report saving if they do not save, perhaps if they do not want to admit that they struggle to save).

                              I have read up on various ways to deal with missing data, but some of them also seem to lead to biased estimates - would you agree with the following?
                              - Listwise deletion will delete all individuals for which there is any missing data. As my data is not MCAR i.e. missingness is not independent, listwise deletion may yield biased estimates? E.g. if it is the case that more non-savers than savers do not respond to the -saving- question, then the missing values will lead to more non-savers being dropped, and the sample will become less representative of the population.
                              - Dummy variable adjustment whereby I could create a dummy variable if there is a missing value, and then impute these to a value such as the sample mean. Then include the missing dummy in the regression. But again this is not representative of the true values and hence may yield biased estimates.

                              Question 2: As the methods mentioned above may lead to serious bias, how would you recommend that I deal with missing data?
                              Can I proceed with my analyses and just acknowledge that there may be bias caused by missingness?
                              Or would a method such as multiple imputation be advisable?

                              Your insight would be much appreciated
                              Many thanks
                              Last edited by Rose Simmons; 19 Apr 2017, 03:43.
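
                              For orientation, the multiple-imputation route mentioned in #6 could look roughly like the sketch below. It runs on simulated data in which income is missing at random given age (an assumption built into the simulation, not a property of the real data). It also ignores the panel structure and uses a pooled probit with cluster-robust standard errors rather than the -xtprobit- model from #1. The imputation model, the number of imputations and the seeds are arbitrary choices for illustration.

                              ```stata
                              * Simulated data; names follow the thread, values are made up
                              clear
                              set seed 4321
                              set obs 2000
                              generate long hhid = ceil(_n/4)
                              generate double age = 25 + 50*runiform()
                              generate byte male = runiform() < 0.5
                              generate double incomescaled = 20 + 0.2*age + 10*rnormal()
                              generate byte saving = runiform() < normal(-1 + 0.03*incomescaled)
                              * Income goes missing with a probability that depends only on observed age
                              replace incomescaled = . if runiform() < invlogit(-3 + 0.05*age)

                              * Declare the mi structure and the variable to be imputed
                              mi set wide
                              mi register imputed incomescaled

                              * Impute income from fully observed variables, creating 5 completed datasets
                              mi impute regress incomescaled age male saving, add(5) rseed(1234)

                              * Fit the model in each completed dataset and combine the results
                              mi estimate: probit saving incomescaled age male, vce(cluster hhid)
                              ```

                              Whether a given estimation command can be used after -mi estimate- is listed in -help mi estimation-, so that would be the place to check before moving to a panel estimator.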
