I have a longitudinal dataset of individuals across three waves, which I use to examine the effect of unemployment on health for mothers. There is some evidence that the results may be biased by attrition, and thus that the effect of unemployment on health may be underestimated. To address this, I plan to use inverse probability weighting (IPW).
To determine whether mothers' baseline characteristics are associated with the probability of leaving the sample, I create a binary variable equal to one if a mother responded in wave 1 and in wave 2, wave 3, or both, and zero if she responded in wave 1 only. This gives a complete attrition rate of 44%.
Code:
. capture drop insampm

. generate insampm = 0

. recode insampm 0 = 1 if has_y0_questionnaire==1 & has_y5_questionnaire==1 | has_y0_questionnaire==1 & has_y10_questionnaire==1 | has_y0_questionnaire==1 & has_y5_questionnaire==1 & has_y10_questionnaire==1

. tab insampm

    insampm |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,464       44.28       44.28
          1 |      1,842       55.72      100.00
------------+-----------------------------------
      Total |      3,306      100.00
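The stayer definition above can also be written as a single logical expression. The following is a minimal do-file sketch, assuming the three `has_*` indicators are coded 0/1; the data are simulated and only the variable names come from the log. It checks that the compact form agrees with the `recode` version.

```stata
* Toy data: the has_* names match the post; the values are simulated.
clear
set seed 12345
set obs 200
generate byte has_y0_questionnaire  = 1
generate byte has_y5_questionnaire  = runiform() < 0.5
generate byte has_y10_questionnaire = runiform() < 0.4

* Indicator built as in the post (& binds more tightly than | in Stata).
generate insampm = 0
recode insampm 0 = 1 if has_y0_questionnaire==1 & has_y5_questionnaire==1 | ///
    has_y0_questionnaire==1 & has_y10_questionnaire==1 | ///
    has_y0_questionnaire==1 & has_y5_questionnaire==1 & has_y10_questionnaire==1

* Compact equivalent: in wave 1 and in at least one later wave.
generate byte insampm_alt = has_y0_questionnaire==1 & ///
    (has_y5_questionnaire==1 | has_y10_questionnaire==1)

* Stops with an error if the two definitions ever disagree.
assert insampm == insampm_alt
tabulate insampm_alt
```

The third condition in the original `recode` (all three waves) is already covered by the first two, so the compact form drops it.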
First, I estimate the probability of being a stayer using a binary logit model with baseline controls for education, marital status, receipt of social assistance, age and own employment. Standard errors are clustered at the mother's baseline location area. The specification below is set up in exactly the same manner as the core analysis in this paper.
Code:
. logit insampm i.cown_education_y0 i.cmaritalstatus_y0 i.cmedical_card_y0 i.cemployment_y0 i.cord_age_y0, cluster ( addres
> s_current_county_2002 )

note: 1.cown_education_y0 != 0 predicts failure perfectly
      1.cown_education_y0 dropped and 3 obs not used

note: 5.cemployment_y0 != 0 predicts failure perfectly
      5.cemployment_y0 dropped and 6 obs not used

note: 6.cown_education_y0 omitted because of collinearity
Iteration 0:   log pseudolikelihood = -1983.1518
Iteration 1:   log pseudolikelihood = -1839.8105
Iteration 2:   log pseudolikelihood = -1839.0974
Iteration 3:   log pseudolikelihood = -1839.0964
Iteration 4:   log pseudolikelihood = -1839.0964

Logistic regression                             Number of obs     =      2,919
                                                Wald chi2(18)     =    1417.40
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -1839.0964               Pseudo R2         =     0.0726

                                        (Std. Err. adjusted for 30 clusters in address_current_county_2002)
--------------------------------------------------------------------------------------------------------------------------
                                                         |               Robust
                                                 insampm |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------------------------------------------+----------------------------------------------------------------
                                       cown_education_y0 |
                                            No schooling |          0  (empty)
                                Primary school education |  -1.980647   .8746838    -2.26   0.024    -3.694996   -.2662986
                                   Some secondary school |  -.3748552   .2317244    -1.62   0.106    -.8290268    .0793163
                            Complete secondary education |  -.3227909   .1922295    -1.68   0.093    -.6995538     .053972
  Some third level education at college, university, RTC |  -.5120794   .2396614    -2.14   0.033    -.9818071   -.0423517
 Complete third level education at college, university.. |          0  (omitted)
                                                         |
                                       cmaritalstatus_y0 |
                                              Cohabiting |  -.3362515   .3143212    -1.07   0.285    -.9523098    .2798068
                                                Divorced |  -1.057085   .6657199    -1.59   0.112    -2.361872    .2477024
                                                 Widowed |  -1.542918   1.233189    -1.25   0.211    -3.959924    .8740873
                                    Single/Never married |  -.3289091    .257969    -1.27   0.202    -.8345191    .1767009
                                                         |
                                        cmedical_card_y0 |
                                                     Yes |  -.1179747   .1656617    -0.71   0.476    -.4426656    .2067163
                                                         |
                                          cemployment_y0 |
                                              Unemployed |    .075984   .3919981     0.19   0.846    -.6923183    .8442862
 Unable to work owing to permanent sickness or disabil.. |  -.4583487   .5561027    -0.82   0.410     -1.54829    .6315926
                                       At school/student |   .9783511   .3391637     2.88   0.004     .3136025      1.6431
                         Seeking work for the first time |          0  (empty)
                                                Employed |   .2686171   .1191097     2.26   0.024     .0351663    .5020679
                                           Self Employed |   .4014955    .419458     0.96   0.338     -.420627    1.223618
                                                         |
                                             cord_age_y0 |
                                                   20-23 |   .2899182   .2319089     1.25   0.211     -.164615    .7444513
                                                   24-27 |   .5287781    .307094     1.72   0.085    -.0731151    1.130671
                                                   28-32 |   1.025553    .339614     3.02   0.003     .3599222    1.691185
                                                    33 + |   1.257928   .3210913     3.92   0.000      .628601    1.887256
                                                         |
                                                   _cons |  -.3087511   .4024602    -0.77   0.443    -1.097559    .4800564
--------------------------------------------------------------------------------------------------------------------------

. predict p_insampm, pr
(387 missing values generated)
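In the log above, the 387 missing predicted probabilities equal the gap between the 3,306 mothers in the tabulation and the 2,919 used by the logit. A small sketch of how one might inspect which observations end up without a predicted probability follows. The data are simulated, and `stayer`, `educ`, `county` and `p_stay` are hypothetical stand-ins for the real variables.

```stata
* Simulated stand-in data; all variable names here are hypothetical.
clear
set seed 2718
set obs 1000
generate byte stayer = runiform() < 0.55
generate byte educ   = 1 + floor(4*runiform())
generate byte county = 1 + floor(30*runiform())

* Introduce some missing covariate values, as survey data often has.
replace educ = . if runiform() < 0.10

* Stayer model with cluster-robust standard errors.
logit stayer i.educ, vce(cluster county)

* Predict only within the estimation sample.
predict double p_stay if e(sample), pr

* Observations outside the estimation sample get no predicted probability.
count if missing(p_stay)
* Stayers counted here would later carry no weight.
count if missing(p_stay) & stayer == 1
tabulate stayer if missing(p_stay)
```

Any stayer counted in the second `count` would drop out of a weighted outcome model, since a missing weight excludes the observation.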
The inverse of this predicted probability is then used as a weight in the outcome analysis, so that mothers with a lower probability of being a stayer receive a higher weight, to compensate for similar mothers who are missing. This follows Wooldridge (2007), an archived Statalist post (https://www.stata.com/statalist/arch.../msg00999.html) and "12.2 Estimating IP weights via modeling", p. 12 of Causal Inference, Hernan and Robins, https://cdn1.sph.harvard.edu/wp-cont...s_v2.17.18.pdf (worked examples in Stata can be found here: https://www.hsph.harvard.edu/miguel-...nference-book/).
Code:
. gen w=.
(3,306 missing values generated)

.
. replace w=1/p_insampm if insampm==1
(1,701 real changes made)

.
. replace w=1/(1-p_insampm) if insampm==0
(1,218 real changes made)

.
. summarize w

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
           w |      2,919    1.998182    .7833788   1.096622   6.406883
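For comparison, here is a minimal sketch of a stayers-only attrition weight. It assumes, as one reading of Wooldridge (2007), that only retained observations enter the outcome model and therefore only they need the weight 1/p. Assigning 1/(1-p) to leavers, as in the log above, resembles the treatment-weight construction in Hernan and Robins rather than an attrition weight. The data are simulated, and `stayer`, `age`, `p_stay` and `w_stay` are hypothetical names.

```stata
* Simulated stand-in data; variable names are hypothetical.
clear
set seed 31415
set obs 1000
generate age = 18 + floor(20*runiform())
* Older mothers are made more likely to stay, purely for illustration.
generate byte stayer = runiform() < invlogit(-1 + 0.05*age)

* Model the probability of staying from baseline characteristics.
logit stayer age
predict double p_stay if e(sample), pr

* Attrition weight: defined for retained mothers only.
generate double w_stay = 1/p_stay if stayer == 1

* Leavers have no outcome data, so their weight is left missing.
summarize w_stay, detail
```

In the post's log, 1,701 of the 1,842 stayers received a weight, so under either construction 141 stayers would have a missing weight.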
My first question: is the above approach to building this weight correct?
Also, my main analysis uses random effects estimators of the effect of unemployment on health (as informed by Hausman tests and the literature) for mothers who appear in the first wave and at least one other wave. I intend to re-run this analysis with the above weights and compare the results once the estimates are re-weighted to better represent mothers who were more likely to leave. However, it seems almost impossible to estimate a random effects model with weights.
xtregre2 estimates a random effects model with weights. It is an update to Kevin McKinney's rfregk (https://ideas.repec.org/c/boc/bocode/s456514.html).
However, xtregre2 only accepts aweights, does not allow factor variables, and does not support the alternative variance estimators. I'm not sure why, but it also causes my number of observations to fall when I use it. Searching the archives, I found that someone also mentioned gllamm here: https://www.stata.com/statalist/arch.../msg00716.html.
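One alternative that may be worth considering is official Stata's `mixed` (Stata 13 or later), which fits a random-intercept model by maximum likelihood and accepts sampling weights. It is not the GLS estimator behind `xtreg, re`, so the estimates need not match exactly. The sketch below uses a simulated three-wave panel; `id`, `wave`, `unemp`, `health` and `w` are hypothetical names, and `w` stands in for a baseline weight that is constant within mother.

```stata
* Simulated three-wave panel; every name below is hypothetical.
clear
set seed 1618
set obs 300
generate id = _n
* Stand-in for a baseline IP weight, constant within mother.
generate double w = 1 + 2*runiform()
* Mother-specific random intercept.
generate double u = rnormal()

* Three waves per mother.
expand 3
bysort id: generate wave = _n
generate byte unemp = runiform() < 0.3
generate double health = 1 - 0.5*unemp + u + rnormal()

* Random-intercept model by maximum likelihood; the weight enters as a
* level-two (mother-level) pweight in the random-effects equation.
mixed health i.unemp || id:, pweight(w)
```

Because the weight does not vary across waves within a mother, it is passed through `pweight()` in the random-effects equation rather than as an observation-level `[pw=]`. With pweights, `mixed` reports robust standard errors.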
Can anyone please advise on whether my approach to building the weight is correct, whether my intended use of it is appropriate, and how exactly I can apply it in a random effects regression?
Best,
John
Wooldridge, Jeffrey M. "Inverse probability weighted estimation for general missing data problems." Journal of Econometrics 141.2 (2007): 1281-1301.