Testing whether attrition is informative

Rose Simmons

Join Date: Feb 2017
Posts: 114

18 Apr 2017, 09:42

Hi,

I have a panel dataset with 13 waves and my dataset involves questionnaire responses from individuals in the Netherlands.
The dependent variable is binary, measuring an individual's ability to save (saving=1 if individual indicated an ability to save; 0 otherwise).

Code:

. xtdes

    hhid:  6, 21, ..., 89972                                 n =       2976
    year:  2004, 2005, ..., 2016                             T =         13
           Delta(year) = 1 unit
           Span(year)  = 13 periods
           (hhid*year uniquely identifies each observation)

My regression is as follows (please note that incomescaled is income divided by 1000 because this makes AME interpretations at a later stage more meaningful):

Code:

. gen incomescaled = income/1000

. xtprobit saving $xlist employed retired health incomescaled risk
> selfcontrol child savingexp partner uni owner male c.age##c.age
>  i.year, re vce(cluster hhid) nolog

I would like to investigate missingness in my dataset. In particular my aim is to see whether the attrition in my dataset is random and informative - I would like to see if there are differences between the attriting and non-attriting samples.

I think this can be done by conducting significance tests of missingness, so I have done the following for incomescaled, to see if there is a significant difference in income between the attrited and non-attrited samples (because, in theory, more of the poorer households may have left the sample, which could lead to sample bias through under-representation of poor households).

Code:

. mdesc saving incomescaled

    Variable    |     Missing          Total     Percent Missing
----------------+-----------------------------------------------
         saving |         266         13,217           2.01
   incomescaled |       5,759         13,217          43.57
----------------+-----------------------------------------------

.
. gen incomescaled_m=1 if incomescaled==.
(7,458 missing values generated)

.
. replace incomescaled_m=0 if incomescaled!=.
(7,458 real changes made)

.
. tab incomescaled_m

incomescale |
        d_m |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      7,458       56.43       56.43
          1 |      5,759       43.57      100.00
------------+-----------------------------------
      Total |     13,217      100.00

.
. sort incomescaled_m

.
. by incomescaled_m: su saving

--------------------------------------------------------------------------------------------
-> incomescaled_m = 0

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      saving |      7,330    .4278308     .494798          0          1

--------------------------------------------------------------------------------------------
-> incomescaled_m = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      saving |      5,621    .3239637     .468028          0          1


.
. ttest saving, by(incomescaled_m)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |   7,330    .4278308    .0057793     .494798    .4165017    .4391599
       1 |   5,621    .3239637    .0062426     .468028    .3117258    .3362016
---------+--------------------------------------------------------------------
combined |  12,951    .3827504    .0042712    .4860769    .3743781    .3911226
---------+--------------------------------------------------------------------
    diff |            .1038671    .0085697                .0870693     .120665
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  12.1203
Ho: diff = 0                                     degrees of freedom =    12949

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

From this could I conclude that there is a difference in the incomes of the attrited and non-attrited samples?
Or could someone please suggest an alternative method if the above is incorrect?
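One possible alternative, offered only as a sketch: the indicator can be built in a single step with missing(), and because each household contributes several waves, the observations are not independent in the way -ttest- assumes. Regressing -saving- on the indicator with household-clustered standard errors recovers the same difference in means while allowing for that within-household correlation. The simulated panel below is purely an illustrative assumption standing in for the real data; only the last two commands would be run on the actual dataset.

```stata
* Toy panel standing in for the real data (all values are simulated assumptions)
clear
set seed 42
set obs 100
generate hhid = _n
expand 5
bysort hhid: generate year = 2003 + _n
generate incomescaled = 30 + 10*rnormal()
replace incomescaled = . if runiform() < 0.4
generate byte saving = runiform() < 0.4

* 1 if incomescaled is missing, 0 otherwise, in one step
generate byte incomescaled_m = missing(incomescaled)

* Difference in mean saving between the two groups,
* with standard errors clustered on household
regress saving i.incomescaled_m, vce(cluster hhid)
```

The coefficient on 1.incomescaled_m is mean(1) - mean(0) for -saving-, so its sign is the reverse of the "diff" line reported by -ttest-.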

Please let me know if further clarification is required

Thank you

Last edited by Rose Simmons; 18 Apr 2017, 09:53.

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17656
#2

18 Apr 2017, 09:52

Rose:
from -ttest- results you can see that there's a systematic difference in saving between -incomescaled_m- groups.
So, you have around (31,230/38,688)=81% missing values for -incomescaled-. Probably a cautious approach would be to drop -incomescaled- from the set of predictors.

Kind regards,
Carlo
(StataNow 18.5)
Rich Goldstein

Join Date: Mar 2014

Posts: 4427
#3

18 Apr 2017, 09:53

if saving were never missing (and you can clear up another issue discussed below), then this would be one part of a test of whether incomescaled were "missing completely at random" (MCAR); note that the "missing at random" (MAR) assumption is untestable

I don't understand part of your output:

Code:

. gen incomescaled_m=1 if incomescaled==.
(7,458 missing values generated)

.
. replace incomescaled_m=0 if incomescaled!=.
(7,458 real changes made)

.
. tab incomescaled_m

incomescale |
        d_m |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      7,458       56.43       56.43
          1 |      5,759       43.57      100.00
------------+-----------------------------------
      Total |     13,217      100.00

why are there 5,759 "1's" given your -generate- results above the table?

also, you say

"aim is to see whether the attrition in my dataset is random and informative"

I assume there is a typo here as if attrition is random then it is not informative; please clarify
Rose Simmons

Join Date: Feb 2017

Posts: 114
#4

18 Apr 2017, 09:57

Carlo Lazzaro thank you for your reply

I have realised why -incomescaled- had so many missing values, it is because I had originally tried -tsfill, full- to balance the dataset so that I could tell which years were missing. Upon reflection, it is probably best not to do -tsfill, full- because this creates many missing values and so I think it could be misleading about the results.
Apologies for this, I have since edited #1 to reflect the values without -tsfill, full-

The p-value from the ttest remains at zero, so I think the conclusion will be that there is indeed a systematic difference in income between those who dropped out of the sample and those who remained. May I ask if there is a way to show whether those who dropped out of the sample had lower (or higher) incomes than those who remained?

Thanks

Rose Simmons

Join Date: Feb 2017
Posts: 114
#5

18 Apr 2017, 10:08

Rich Goldstein thank you for your reply

I have conducted a mcartest:

Code:

. mcartest $xlist employed retired health income risk selfcontrol child savingexp partner un
> i owner male age, emoutput nolog

Expectation-maximization estimation      Number obs           =      9867
                                         Number missing       =      8944
                                         Number patterns      =        59
Prior: uniform                           Obs per pattern: min =         1
                                                          avg =  167.2373
                                                          max =      5241

Observed log likelihood = -159901.91 at iteration 15

-------------------------------------------------------------------------------------------
             |      prec   purchase     retire    bequest    mediumh      longh   employed
-------------+-----------------------------------------------------------------------------
Coef         |                                                                            
       _cons |  17.21164    9.64401   14.23288   9.876836   .4486985   .0356982   .5287321
-------------+-----------------------------------------------------------------------------
Sigma        |                                                                            
        prec |  9.650364   2.653039   5.961269    2.35639   .1190314   .0226142  -.1061667
    purchase |  2.653039   11.29252   5.585159   3.287342   .1143339   .0413466   .3470621
      retire |  5.961269   5.585159   18.07993   4.558797   .2825322    .075291   .3273358
     bequest |   2.35639   3.287342   4.558797   25.73814   .2003085   .0279143  -.3021642
     mediumh |  .1190314   .1143339   .2825322   .2003085   .2474601  -.0160152   .0029871
       longh |  .0226142   .0413466    .075291   .0279143  -.0160152   .0343796   .0032975
    employed | -.1061667   .3470621   .3273358  -.3021642   .0029871   .0032975   .2491745
     retired |  .0930456  -.3311124  -.3488665   .3765796   .0082096  -.0050864  -.1821921
      health | -.0170121   .1237921   .0137143   .0658642   .0154973  -.0013984   .0768141
      income | -1083.358   880.6893  -1872.866   4122.921   882.4843   52.91977   1295.523
        risk | -3.786944   2.981907   -.474945   .4425276   .1815103    .015856   .4386603
 selfcontrol |  .3210033  -.2704047  -.0093713   .0753243   .0846079   .0082493  -.0792879
       child | -.2769498   .2110997   .0052549   .7243602   .0010089   .0024441   .1923543
   savingexp |  .1074745    .122478    .121612   .1000051   .0330926   .0014483   .0211156
     partner | -.0793618  -.0944943  -.1150884   .4311682   .0168049  -.0019489   .0012033
         uni | -.0362468   .0959123  -.0419606  -.0931491   .0117892   .0003633    .012833
       owner |  .0323899   .0375088   .0250265   .4319967   .0377703  -.0000735   .0193809
        male | -.1868004  -.1169552  -.1775345   .1917804   .0102924  -.0006911  -.0002442
         age |  2.331756  -18.49912  -9.623477   11.99615   .3146326  -.1119548  -5.211748
-------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------
             |   retired     health     income       risk  selfcon~l      child  savingexp
-------------+-----------------------------------------------------------------------------
Coef         |                                                                            
       _cons |   .344583   3.830952   31986.03   16.72017   5.231135   .5027871   .3757535
-------------+-----------------------------------------------------------------------------
Sigma        |                                                                            
        prec |  .0930456  -.0170121  -1083.358  -3.786944   .3210033  -.2769498   .1074745
    purchase | -.3311124   .1237921   880.6893   2.981907  -.2704047   .2110997    .122478
      retire | -.3488665   .0137143  -1872.866   -.474945  -.0093713   .0052549    .121612
     bequest |  .3765796   .0658642   4122.921   .4425276   .0753243   .7243602   .1000051
     mediumh |  .0082096   .0154973   882.4843   .1815103   .0846079   .0010089   .0330926
       longh | -.0050864  -.0013984   52.91977    .015856   .0082493   .0024441   .0014483
    employed | -.1821921   .0768141   1295.523   .4386603  -.0792879   .1923543   .0211156
     retired |  .2258455  -.0151247   25.85364  -.3103795   .1156065  -.1632184   .0029043
      health | -.0151247   .5255931   1635.571   .2837704   .1275956   .0647257   .0415905
      income |  25.85364   1635.571   1.08e+09   11156.27   2183.407   1240.501   2169.993
        risk | -.3103795   .2837704   11156.27   39.16448  -.4310057   .6880169   .0762291
 selfcontrol |  .1156065   .1275956   2183.407  -.4310057   2.225804  -.1978589   .1807742
       child | -.1632184   .0647257   1240.501   .6880169  -.1978589   .9109834  -.0256144
   savingexp |  .0029043   .0415905   2169.993   .0762291   .1807742  -.0256144   .2345818
     partner |  .0350431   .0500038   2579.413   .1363759   .0251232   .1274641   .0147169
         uni | -.0083959   .0231324   1694.189   .2069189   .0331075  -.0126351   .0148898
       owner |  .0128889   .0448441   2370.133   .2071253   .0903794   .0693949   .0334286
        male |   .030571   .0265214   1655.534   .4622172   .0481309    .053058   .0189836
         age |  5.120063   -1.36954   2753.934  -14.84941   3.565264  -5.365151  -.2163414
-------------------------------------------------------------------------------------------

---------------------------------------------------------------------
             |   partner        uni      owner       male        age
-------------+-------------------------------------------------------
Coef         |                                                      
       _cons |  .6477146   .1609898    .701125   .7809871   56.21628
-------------+-------------------------------------------------------
Sigma        |                                                      
        prec | -.0793618  -.0362468   .0323899  -.1868004   2.331756
    purchase | -.0944943   .0959123   .0375088  -.1169552  -18.49912
      retire | -.1150884  -.0419606   .0250265  -.1775345  -9.623477
     bequest |  .4311682  -.0931491   .4319967   .1917804   11.99615
     mediumh |  .0168049   .0117892   .0377703   .0102924   .3146326
       longh | -.0019489   .0003633  -.0000735  -.0006911  -.1119548
    employed |  .0012033    .012833   .0193809  -.0002442  -5.211748
     retired |  .0350431  -.0083959   .0128889    .030571   5.120063
      health |  .0500038   .0231324   .0448441   .0265214   -1.36954
      income |  2579.413   1694.189   2370.133   1655.534   2753.934
        risk |  .1363759   .2069189   .2071253   .4622172  -14.84941
 selfcontrol |  .0251232   .0331075   .0903794   .0481309   3.565264
       child |  .1274641  -.0126351   .0693949    .053058  -5.365151
   savingexp |  .0147169   .0148898   .0334286   .0189836  -.2163414
     partner |  .2281804  -.0064747   .0714615   .1018254   .6214429
         uni | -.0064747   .1350793   .0080146   .0000593  -.3404142
       owner |  .0714615   .0080146   .2095487   .0440996   .4485456
        male |  .1018254   .0000593   .0440996   .1710462   .7427155
         age |  .6214429  -.3404142   .4485456   .7427155   223.4663
---------------------------------------------------------------------

Little's MCAR test

Number of obs       = 9867
Chi-square distance = 5325.9430  
Degrees of freedom  = 909
Prob > chi-square   = 0.0000

Question: From this can I conclude that the missing values in my dataset are not MCAR, so they might be MAR?

For -incomescaled- there are 7,458 observations and 5,759 missing values. So for incomescaled_m, there will be 5,759 '1s' to represent these missing values.

Code:

-------------------------------------------------------------------------------------------
incomescaled                                                                    (unlabeled)
-------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [0,1370.179]                 units:  .001
         unique values:  2,348                    missing .:  5,759/13,217

                  mean:    32.699
              std. dev:   32.8517

           percentiles:        10%       25%       50%       75%       90%
                                10      20.1        30    40.876        54

"I assume there is a typo here as if attrition is random then it is not informative; please clarify"

Yes sorry, I meant to say whether attrition is random or informative

Thanks

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17656
#6

18 Apr 2017, 10:10

Rose:
as Rich pointed out, attrition cannot be at the same time random and informative.
As far as I can get your outcomes, it seems that you have missing values in both -saving- and -incomescaled-.
It is impossible to state whether those who dropped out have higher/lower income than those who were retained in the dataset, as those data are missing (the same remark holds for -saving-).
As Rich touched upon, if data are MAR, -mi- can produce plausible values that make, in turn, subsequent regressions (more) efficient.
Conversely, if data are MNAR, things are trickier and more complex approaches are needed.

Kind regards,
Carlo
(StataNow 18.5)
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17656
#7

18 Apr 2017, 10:13

Rose:
as MAR vs MNAR hypothesis cannot be tested (see Rich's reply at #3), you can only assume that data are MAR.
Just out of curiosity: what does -savingexp- stand for?

Last edited by Carlo Lazzaro; 18 Apr 2017, 11:01.

Kind regards,
Carlo
(StataNow 18.5)
Rose Simmons

Join Date: Feb 2017

Posts: 114
#8

18 Apr 2017, 10:34

Originally posted by Carlo Lazzaro:

It is impossible to state whether those who dropped out have higher/lower income than those who were retained in the dataset, as those data are missing (the same remark holds for -saving-).

Ah I see that this would not be possible as the data is of course missing, thank you.

I read this on pages 17 and 23 of http://younglives.org.uk/files/YL-TN...-Attrition.pdf, which is for a different survey from mine:
"In this section we investigate non-random attrition by searching for patterns in outcome variables and household characteristics of attriting households. First, we do so by tabulating attriting and non-attriting households over a number of important dimensions. Second, and more rigorously, we carry out statistical tests for the equality of means for a large range of predetermined and outcome variables."
"Although modest, we find that attrition in the Young Lives sample is to some extent nonrandom. Even though not always significant, attriting households typically are located in urban areas, have a low wealth index, own fewer assets, are less educated, etc. Furthermore, we uncover substantial differences in household profiles across attrition categories"

So in #1 I had tried to see whether there is a difference in the attriting and non-attriting samples.

How may I test whether the attrition in my sample is random or non-random? As I would like to know whether there is possible attrition bias
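In the spirit of the Young Lives excerpt, one way to probe this (a sketch, not a definitive test) is to compare characteristics that were observed before anyone dropped out, such as first-wave income, between households that later leave and those that stay. The variable -attrited- and all simulated values below are hypothetical stand-ins. A difference here would be evidence against attrition being completely random, but it cannot distinguish MAR from MNAR.

```stata
* Toy panel (simulated assumption): attrited = 1 for households that later leave
clear
set seed 12345
set obs 200
generate hhid = _n
generate byte attrited = runiform() < 0.4
expand 3
bysort hhid: generate year = 2003 + _n
generate incomescaled = 30 - 3*attrited + 5*rnormal()

* Keep each household's first observed wave, which exists for
* stayers and leavers alike
bysort hhid (year): keep if _n == 1

* Equality-of-means test on baseline income
ttest incomescaled, by(attrited) unequal

* Alternatively, model attrition as a function of baseline characteristics
probit attrited incomescaled
```

With one row per household after the -keep-, the independence assumption behind -ttest- is more plausible than it is on the stacked panel.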

-savingexp- is measured in a similar way to -saving-, both are binary, but it measures whether the household anticipates an ability to save in the next year. I hypothesised that if an individual expects to be able to save next year, this may influence whether they saved this year

Thanks
Rose Simmons

Join Date: Feb 2017

Posts: 114
#9

18 Apr 2017, 10:48

I found this explanation online: "To test whether missingness is related to observed or unobserved causes, a dummy variable is created based on participants having complete or missing data on a variable or having left or remained in the study (e.g., 0, incomplete data/left the study data; 1, complete data/remained in the study). Then, a series of independent samples t-tests are used to indicate whether the dummy variable leads to significant average differences among other variables of interest."

So first I would like to create the dummy variable (0, incomplete data/left the study data; 1, complete data/remained in the study)

Code:

hhid   year
 175   2016
 184   2004
 184   2005
 184   2006
 184   2007
 184   2008
 184   2009
 194   2004
 194   2005
 194   2006
 194   2007
 228   2004
 228   2005
 228   2006
 228   2007
 228   2008

As the above shows, for example household number 184 left the study after 2009, so they were missing in 2010-2016.
How may I create a dummy variable to show that participants have left the study? E.g. a dummy which shows that they are missing in 2010
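One possible way to build such an indicator, sketched on the rows listed above. It assumes that 2016 is the final wave of the panel (as in the -xtdes- output) and that "left the study" means the household's last observed year is before 2016; the names -lastyear-, -attrited- and -lastwave- are made up for illustration. Households with gaps that reappear in 2016 would not be flagged by this definition.

```stata
* Rows from the listing above; 2016 is taken to be the final wave
clear
input hhid year
175 2016
184 2004
184 2005
184 2006
184 2007
184 2008
184 2009
194 2004
194 2005
194 2006
194 2007
228 2004
228 2005
228 2006
228 2007
228 2008
end

* Last year in which each household is observed
bysort hhid: egen lastyear = max(year)

* 1 if the household is no longer observed by the final wave (2016)
generate byte attrited = lastyear < 2016

* 1 in the final observed wave of a household that then leaves the study
generate byte lastwave = (year == lastyear) & attrited

list hhid year lastyear attrited lastwave, sepby(hhid)
```

This avoids -tsfill, full-, so no artificial rows with missing values are created; for household 184, -lastwave- is 1 in 2009, which marks it as absent from 2010 onwards.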

Thank you
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17656
#10

18 Apr 2017, 10:51

Rose:
there's no way to test if missing data are MAR or MNAR (as the difference between the two missing data mechanisms rests on the missing data, that you cannot see because they're indeed missing!). The only formal test is the one that you already performed: MCAR vs MAR (and there was evidence that your missing data were not MCAR).
As per the excerpt of the link you quoted, you can only judge from the observed data whether those who have missing values for -income- or -saving- look different from the remaining observations (that is, whether the missingness looks completely at random or not).
Last but not least: the explanation you provided to justify the inclusion of -saving- as a dependent variable and -savingexp- as predictor can possibly work the other way round, too: today's saving propensity can influence tomorrow's saving propensity.
As this is not my research field, I would recommend you to skim through the literature you're familiar with in order to avoid the risk of reverse causation (a form of endogeneity) between -saving- and -savingexp-.

Kind regards,
Carlo
(StataNow 18.5)
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17656
#11

18 Apr 2017, 11:01

Rose:
the series of independent t-tests (which is not free from statistical problems; see Enders CK. Applied Missing Data Analysis. New York: The Guilford Press, 2010: pages 18-19) is for contrasting MCAR vs MAR mechanisms. There's no way to implement it for testing whether your missing data are MAR or MNAR (for the reasons reported at #10).

Kind regards,
Carlo
(StataNow 18.5)
Rose Simmons

Join Date: Feb 2017

Posts: 114
#12

18 Apr 2017, 11:05

Of course, sorry for being repetitive, I see your point about missingness!

I think perhaps I am getting confused between any missing values in the dataset for various variables e.g. saving, and missing values in terms of missing years.

That is a very good and valid point about reverse causation, I will definitely look into this further in the literature - thank you very much
Rose Simmons

Join Date: Feb 2017

Posts: 114
#13

18 Apr 2017, 11:11

Originally posted by Carlo Lazzaro:

Rose:
the series of independent t-tests (which is not free from statistical problems; see Enders CK. Applied Missing Data Analysis. New York: The Guilford Press, 2010: pages 18-19) is for contrasting MCAR vs MAR mechanisms. There's no way to implement it for testing whether your missing data are MAR or MNAR (for the reasons reported at #10).

Ah I see, thank you for clarifying that the t-tests I had mentioned were for MCAR vs MAR (and I have already established the data are MCAR using -mcartest- so there is no need for the t-tests), and not for MAR vs MNAR, which is what I would like to test but, as per #10, I now understand is not possible to do.

As always, you have been very helpful in clarifications. Thank you Carlo Lazzaro
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17656
#14

18 Apr 2017, 22:44

Rose:
please note that, as per -mcartest- outcome, your data are not MCAR, but possibly (as you cannot test MAR vs MNAR) MAR.

Kind regards,
Carlo
(StataNow 18.5)
Rose Simmons

Join Date: Feb 2017

Posts: 114
#15

19 Apr 2017, 02:27

Thank you for the correction Carlo Lazzaro , my data are not MCAR.
Question 1: Based on my data being not MCAR, does this mean that the missingness is "informative"?
Could "informative" be interpreted as missingness is not independent of y and x variables?

As data are not MCAR, my understanding is that this missing data may lead to biased estimates (e.g. if say my data is MNAR and hypothetically people may be less inclined to report saving if they do not save, perhaps if they do not want to admit that they struggle to save).

I have read up on various ways to deal with missing data, but some of them also seem to lead to biased estimates - would you agree with the following?
- Listwise deletion will delete all individuals for which there is any missing data. As my data is not MCAR, i.e. missingness is not independent, listwise deletion may yield biased estimates? E.g. if it is the case that more non-savers than savers do not respond to the -saving- question, then the missing values will lead to more non-savers being dropped, and the sample will become less representative of the population.
- Dummy variable adjustment whereby I could create a dummy variable if there is a missing value, and then impute these to a value such as the sample mean. Then include the missing dummy in the regression. But again this is not representative of the true values and hence may yield biased estimates.

Question 2: As the methods mentioned above may lead to serious bias, how would you recommend that I deal with missing data?
Can I proceed with my analyses and just acknowledge that there may be bias caused by missingness?
Or would a method such as multiple imputation be advisable?
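For reference, a minimal sketch of what a multiple-imputation workflow with -mi- could look like, assuming the data are MAR given the variables in the imputation model. Everything here is a simplifying assumption: the data are simulated, the imputation model is a single linear regression, and the panel structure is ignored, so this shows the sequence of commands rather than a recommended specification for the real 13-wave panel.

```stata
* Toy cross-section (simulated assumption) with income missing for some rows
clear
set seed 2017
set obs 500
generate age = 25 + int(50*runiform())
generate incomescaled = 20 + 0.3*age + 5*rnormal()
generate byte saving = (0.02*incomescaled - 1 + rnormal()) > 0
replace incomescaled = . if runiform() < 0.3

* Declare the mi style and the variable to be imputed
mi set mlong
mi register imputed incomescaled

* Impute income from age and the outcome; 20 imputations is an arbitrary choice
mi impute regress incomescaled age saving, add(20) rseed(2017)

* Fit the analysis model on each imputed dataset and combine the results
mi estimate: probit saving incomescaled age
```

The outcome -saving- is included in the imputation model on purpose, since leaving it out would tend to weaken the estimated association between income and saving.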

Your insight would be much appreciated
Many thanks

Last edited by Rose Simmons; 19 Apr 2017, 02:43.