Merging datasets - Statalist

Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#16

25 Feb 2022, 15:42

Yes, I see, there are many inconsistencies here. In addition to year born, I see that gender often differs.

This is not a Stata problem, and there is no way to code your way out of it.

I think you need to look more carefully at the ESS documentation. In a few minutes on the ESS website (http://www.europeansocialsurvey.org/about/faq.html), I found this:

The ESS selects new sample members each round (cross-sectional sampling). To ensure comparability, all countries must use random probability sampling. This means that everyone (aged 15 and over, resident within private households) must have a chance to be selected, and that their chances of selection are known. Once selected, an individual cannot be replaced by anyone else, even if they cannot be contacted, are ill or refuse to take part. [emphasis added].

In other words, this isn't panel data, and the same people do not participate across waves. So your quest to find the same person in separate years is futile. If the same person does happen to be included in more than one year, it is a coincidence, and there will be very few such people.

It is still true that for purposes of analysis, you will append these data sets in the way suggested in #14. But you must analyze this data as cross-sectional. It just isn't longitudinal (panel) data.
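As a minimal sketch of what that stacking looks like, here is a toy example with -append-. The variable names (female, yrbrn, health), the values, and the tempfile name are invented for illustration and are not taken from the actual ESS files; the point is only that each round contributes its own rows, tagged with a year variable.

```stata
* Hypothetical toy data standing in for two ESS rounds (not real ESS values)
clear
input byte female int yrbrn byte health
1 1970 2
0 1985 1
end
generate int year = 2014
tempfile wave2014
save `wave2014'

clear
input byte female int yrbrn byte health
0 1962 3
1 1990 2
end
generate int year = 2016

* Stack the rounds: every row is a different respondent, so no matching key is needed
append using `wave2014'
tabulate year
```

Run this as a single do-file, since the tempfile only exists for the duration of that run. The result is four observations, two per year, with nothing linking a row in 2014 to a row in 2016.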
Chul Lee

Join Date: Apr 2019

Posts: 45
#17

25 Feb 2022, 15:55

Although having a master file in long format is ideal for Stata analysis, sometimes it is necessary to merge two surveys in wide format since surveys ask different questions, and researchers want to merge them together, like this ESS(?) survey.
However, I wonder whether @Taiba is using the publicly released data from this survey or a full 'limited' version of the survey data. The public version of a survey usually does not include an identifier, so matching respondents is hard (almost impossible), and that protects the privacy of survey participants.
However, I am not sure if the ESS survey really has two different versions. I tried to check on their website, but it was locked and I couldn't access it.
C
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#18

25 Feb 2022, 16:05

Although having a master file in long format is ideal for Stata analysis, sometimes it is necessary to merge two surveys in wide format since surveys ask different questions, and researchers want to merge them together

This is true. However, in the example data shown we can see that the variables are exactly the same in both data sets. So this is definitely a use case for -append-, not -merge-.

However, I wonder whether @Taiba is using the publicly released data from this survey or a full 'limited' version of the survey data. The public version of a survey usually does not include an identifier, so matching respondents is hard (almost impossible), and that protects the privacy of survey participants.

This may well be the case. I did a little exploration of the example data, and I could not find any variables that might be used as an identifier of the same person across waves. But that is a moot point, as the ESS website makes it clear that this is not a longitudinal survey; it is cross-sectional. There is no need for, nor even any use for, an identifier (other than a sequential observation number) in cross-sectional data.
Chul Lee

Join Date: Apr 2019

Posts: 45
#19

25 Feb 2022, 16:24

@Clyde, I now see that you already explored the ESS website in #16. I agree that survey participants will rarely be matched between the two surveys. It is the OP's turn to pursue this survey. C
Taiba Chau

Join Date: Feb 2022

Posts: 105
#20

26 Feb 2022, 13:56

I was wondering if you could still use a difference-in-differences approach to measure the differences between individuals in the two waves, if that makes sense. Say I am looking at the health of individuals in 2014 and then in 2016. Could I still run a regression to work out the differences between individuals while controlling for individual and time fixed effects? Thanks
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#21

26 Feb 2022, 14:14

Your question is based on a supposition that isn't true. Since you don't have panel data, no, you cannot use individual fixed effects. Don't even think about it for another second--it's off the table with this data.

But you don't need individual fixed effects to do a DID analysis. All you need is a variable distinguishing the two years (let's call it year), and a variable distinguishing the intervention and control groups (let's call it group, 1 for intervention, 0 for control). Then you just do:

Code:

regression_command outcome i.group##i.year perhaps_some_covariates_as_appropriate

For outcome substitute the actual name of your outcome variable. For regression_command you can use whatever Stata estimation command is appropriate to the type of outcome variable you are working with and is an acceptable model of the data generating process. The coefficient of 1.group#2016.year will be the DID estimate of the intervention effect.
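To make the template concrete, here is a hedged sketch on simulated data. Everything in it is an assumption for illustration: the outcome name health, the sample size, the seed, and the built-in effect of 0.5 are invented, and -regress- is used only because the simulated outcome is continuous. With real data you would substitute your own variables and whatever estimation command suits your outcome.

```stata
* Simulated repeated cross-sections: different people in each year (illustrative only)
clear
set seed 12345
set obs 2000
generate byte group = runiform() < 0.5
generate int year = cond(runiform() < 0.5, 2014, 2016)

* Outcome with a built-in DID effect of 0.5 for the intervention group in 2016
generate double health = 3 + 0.2*group + 0.1*(year == 2016) ///
    + 0.5*group*(year == 2016) + rnormal()

* The interaction term carries the DID estimate
regress health i.group##i.year, vce(robust)
lincom 1.group#2016.year
```

The -lincom- line just re-displays the coefficient of 1.group#2016.year with its confidence interval; in this simulation it should land in the neighborhood of the 0.5 that was built into the data, subject to sampling noise.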

I think the major difficulty you face here is that you only have one year of data before intervention, so you can't interrogate parallel trends, and you have only one year of data after intervention, which really isn't very much to go on. Similarly, with only two years of data, you can't really do any useful "placebo" tests or otherwise check the robustness of the analysis. But lack of individual fixed effects--not a problem. (It is true that an analysis of panel data with individual fixed effects would give a more precise and efficient estimate of the intervention effect. It would be better to use panel data were it available. But you can use serial cross-sections in this way--it's just a less powerful design.)