  • Using lagged independent variable in place of its current value & question of collinearity

    Dear Statalisters,

    I have two distinct questions relating to the same model - apologies in advance if I should post them separately.

    I am using a generalized Diff-in-Diff in Stata 15.1 on unbalanced data to estimate the impact of migration (post) on migrants (treatment group).

    My model (a simplified version) looks as follows:

    logit employed i.post i.treat i.year c.family_size c.family_size_age25 i.young_child i.muslim i.christian i.other_religion i.source, cluster(ident)
    where:
    - employed is a binary variable (1 for years in which respondent was employed; 0 otherwise)
    - post - migration indicator (switches to 1 for the treatment group after migration; remains 0 for the control group)
    - treat - 1 for treatment group (migrants); 0 for control (non-migrants). This is time-invariant.
    - family_size is respondents' number of children in each year
    - family_size_age25 is respondents' number of children when they were aged 25, which I am using as an approximation to control for the size of the family before treatment. I assume this can have an impact on whether someone migrates.
    - young_child - a binary variable for whether respondents had a young child in each year
    - muslim / christian / other_religion - binary variables for respondents' religion
    - source - country of origin

    Question no. 1 - relating to the lag: Is there a good justification for using only family size at age 25, i.e. excluding family_size from the model? I have reason to believe that having young children (young_child) is a more important predictor of respondents' employment outcomes than how many children (regardless of age) they have.

    Question no. 2 - relating to collinearity: I am controlling for respondents' religion (Muslim, Christian, other) in addition to their country of origin. My problem is that one of the three countries in the sample is predominantly Muslim (over 90% of respondents in this country said they were Muslim), whereas the share of Muslims in the remaining two countries is less than 5%. Consequently, one could argue that country of origin is tantamount to religion in my dataset. At the same time, post-estimation tests suggest a strong preference for the model with both religion and country of origin. Moreover, regressing the outcome of interest on each religion and on country of origin separately shows that each is highly significant and - importantly for my research question - that they affect the dependent variable in opposite directions. If one stood for the other, I would expect coming from a predominantly Muslim country to have the same impact on the outcome of interest as being Muslim, but that is not the case, which is why I would like to keep both. So the question is: is keeping both justifiable?

    Thank you very much in advance!
    Last edited by Justyna Hejman-Mancewicz; 13 Nov 2019, 06:54. Reason: Additional clarifications

  • #2
    Question 1 is neither a Stata question nor a statistics question: it is a question about the scientific substance of your problem. Perhaps there are Forum members who work in the study of immigration and labor economics, and if so, I hope one will comment on that. If you don't get a timely response, however, you would be better off seeking advice from a colleague in your field, or finding an online forum for people in this area.

    Your second question is a statistical one. You have two predictor variables that are highly correlated with each other. When you put them both in the model, their separate effects will be estimated with poor precision as a result, unless your sample is enormous. This is a consequence of linear algebra, and there is no way around it. But clearly these variables are very important in your model. You should keep them both in. But you will not be able to interpret them separately. If you are doing significance tests, a joint test of their effect will be useful. But you should not try to attach any importance to tests of either variable by itself.
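
    A minimal sketch of what such a joint test could look like, assuming the variable names and the model exactly as written in #1. The cross-tabulation is only there to show how strongly religion and country of origin overlap; it is an illustration, not a required step.

    ```stata
    * Sketch only: all variable names are taken from the model in #1.
    * How closely does country of origin track religion?
    tabulate source muslim, row

    * Fit the model with both sets of controls, as in #1
    logit employed i.post i.treat i.year c.family_size c.family_size_age25 ///
        i.young_child i.muslim i.christian i.other_religion i.source, cluster(ident)

    * Joint Wald test of all religion and country-of-origin terms together
    testparm i.muslim i.christian i.other_religion i.source
    ```

    The single test statistic reported by testparm is the one to read; the individual coefficients and p-values for religion and source in the logit output are the ones that the collinearity makes imprecise.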

    If either of these variables is a primary variable of interest in your study, then you have a serious problem here, and one that can only be resolved by gathering (a lot) more data. If, however, they are not the focus of your research goals but are being included only because you must adjust your analysis for their effects on the employment outcome, then this is not a problem at all and you just leave it alone and move on.


    • #3
      Many thanks Clyde, that really is helpful (thankfully, neither religion nor country of origin is my primary variable of interest).
