Dear Statalisters,
I have two distinct questions relating to the same model - apologies in advance if I should post them separately.
I am using a generalized Diff-in-Diff in Stata 15.1 on unbalanced data to estimate the impact of migration (post) on migrants (treatment group).
My model (a simplified version) looks as follows:
logit employed i.post i.treat i.year c.family_size c.family_size_age25 i.young_child i.muslim i.christian i.other_religion i.source, cluster(ident)
where:
- employed is a binary variable (1 for years in which respondent was employed; 0 otherwise)
- post is a post-migration indicator (switches to 1 for the treatment group; remains 0 for the control group)
- treat is 1 for the treatment group (migrants) and 0 for the control group (non-migrants). This is time-invariant.
- family_size is respondents' number of children in each year
- family_size_age25 is respondents' number of children when they were aged 25, which I am using as a proxy for family size before treatment. I assume this can affect whether someone migrates.
- young_child - a binary variable for whether respondents had a young child in each year
- muslim / christian / other_religion - binary variables for respondents' religion
- source - country of origin
Question no. 1 - relating to the lag: Is there a good justification for using only family size at age 25, i.e. excluding family_size from the model? I have reason to believe that having young children (young_child) is a more important predictor of respondents' employment outcomes than how many children they have, regardless of the children's age.
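To make the comparison concrete, here is a minimal sketch of the two specifications in question, using the variable names from the command above. Two things in it are assumptions made only for the sketch: other_religion is left out as the reference category, and vce(cluster ident) is written in place of cluster(ident). The stored-estimate names spec_a and spec_b are arbitrary.

```stata
* Specification A: current family size and family size at age 25
* (other_religion omitted as reference category: an assumption of this sketch)
logit employed i.post i.treat i.year c.family_size c.family_size_age25 ///
    i.young_child i.muslim i.christian i.source, vce(cluster ident)
estimates store spec_a

* Wald test of the current family size coefficient in specification A
test family_size

* Specification B: drop current family size, keep the age-25 proxy
logit employed i.post i.treat i.year c.family_size_age25 ///
    i.young_child i.muslim i.christian i.source, vce(cluster ident)
estimates store spec_b

* Coefficients and standard errors of both specifications side by side
estimates table spec_a spec_b, b(%9.3f) se(%9.3f)
```

A Wald test is used here rather than a likelihood-ratio test because the standard errors are clustered.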
Question no. 2 - relating to collinearity: I am controlling for respondents' religion (Muslim, Christian, other) in addition to their country of origin. My problem is that one of the three countries in the sample is predominantly Muslim (over 90% of respondents from this country said they were Muslim), whereas the share of Muslims in the remaining two countries is less than 5%. Consequently, one could argue that country of origin is tantamount to religion in my dataset. At the same time, post-estimation tests suggest a strong preference for the model with both religion and country of origin. Moreover, regressing the outcome of interest on each religion and on country of origin separately shows that each is highly significant and - importantly for my research question - they affect the dependent variable in opposite directions. If one stood in for the other, I would expect coming from a predominantly Muslim country to have the same impact on the outcome of interest as being Muslim, but that is not the case, which is why I would like to keep both. So the question is: is keeping both justifiable?
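The overlap and the joint tests described above could be checked roughly as follows. This is only a sketch under assumptions: other_religion is treated as the omitted reference category, vce(cluster ident) stands in for cluster(ident), and because the data are in person-year form the cross-tabulation counts person-years rather than respondents.

```stata
* How closely does religion track country of origin?
* (counts are person-years, since the data are a panel)
tabulate source muslim, row

* Full model with both religion and country of origin
* (other_religion omitted as reference category: an assumption of this sketch)
logit employed i.post i.treat i.year c.family_size c.family_size_age25 ///
    i.young_child i.muslim i.christian i.source, vce(cluster ident)

* Joint Wald test of the religion indicators, holding country of origin fixed
testparm i.muslim i.christian

* Joint Wald test of the country-of-origin indicators, holding religion fixed
testparm i.source
```

If both joint tests reject while the standard errors on the religion and country coefficients remain reasonable, that would suggest the two sets of variables carry separate information despite the overlap.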
Thank you very much in advance!