Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problems with Merge 1:1 using unique id number

    Good morning all,
    I am a PhD student and relative newcomer to Stata and was hoping you might be able to help me with a problem I am encountering when merging two datasets. The first dataset comprises Wave 1 of a survey (999 participants). The second dataset is Wave 2 of the same survey (plus some additional new questions) (815 participants) i.e. a follow-up survey which targeted the same sample just 3 months later. Each participant in both surveys has a unique numeric Prolific id (e.g. 503ld3858389289hk12). Responses from each of the two surveys have been uploaded separately to Stata, destrung and cleaned. I have checked the uploaded data against the original raw data and in both cases it all looks fine. The problem comes when I try to merge the two files in order to perform a panel data analysis. My process is as follows:
    • clear all, use "C:\Users\User\Downloads\Final Raw merged data file from LimeSurvey.dta" (MASTER FILE - ie RAW SURVEY 1 DATA)
    • I then run a duplicates list WhatisyourProlificid check and drop all duplicate ids (WhatisyourProlificid is the string id variable used in the data)
    • Then I run a whole load of rename, label var and encode [ ], gen[ ] commands to clean and destring the data, creating a destrung version of the WhatisyourProlificid var "id_d"
    • I then import the Wave 2 data (which has also been cleaned and destrung and checked for duplicate ids using merge 1:1 id_d using "C:\Users\User\OneDrive\Documents\STATA\finalcovid surveydestrungdataupdated03092020.dta" and drop the variables that don't match. Process seems to run fine and I am left with n=822 participants
    BUT when I order the new merged dataset and place [age] Wave 1 age var and [agec] Wave 2 age var side by side, it is obvious that the using data which has been imported has been changed. For example whereas in the faw files and in the stand alone data files age is 34 and agec is also 34, in the merged file age is still 34 but agec has changed to 69! The same goes for other variables such as education, parental status etc which should not have changed and which were fine in the stand alone data files.

    I am at a complete loss as to what is going on or what I could do to try and fix this issue so any help or advice you could offer would be hugely appreciated.

    Best wishes
    Diane

  • #2
    Usually combining different datasets that represent different waves of a panel calls for an append not a merge.

    Otherwise if I have to guess at what is going on I seize on the fact that you describe an identifier such as 503ld3858389289hk12 as "numeric". That kind of identifier could be held as a value label for a numeric variable but I fear that what you have done is encode and thereby accidentally messed up the identifiers.

    Given numeric identifiers merge matches on their numeric values, not their value labels.

    Here is a minimal example. In one dataset frog and toad as string identifiers are mapped to 1 and 2 by encode. In another dataset frog, newt and toad are mapped to 1, 2, 3 by encode. You can't expect one encode in one dataset to remember what was done in another dataset. (Well, you can, but you have to spell out the value label mapping to make that happen.)

    When you merge it's not so much a matter of Stata changing data as that merge gives the wrong information and so some data ends up in the wrong observations.

    Code:
    . clear
    
    . set obs 2
    number of observations (_N) was 0, now 2
    
    . gen strid = cond(_n == 1, "frog", "toad")
    
    . encode strid , gen(numid)
    
    . gen y = real(word("17 71", _n))
    
    . list 
    
         +--------------------+
         | strid   numid    y |
         |--------------------|
      1. |  frog    frog   17 |
      2. |  toad    toad   71 |
         +--------------------+
    
    . 
    . save first 
    file first.dta saved
    
    . 
    . clear
    
    . set obs 3
    number of observations (_N) was 0, now 3
    
    . gen strid = word("frog newt toad", _n)
    
    . encode strid , gen(numid)
    
    . gen x = real(word("42 666 3.14159", _n))
    
    . list
    
         +-------------------------+
         | strid   numid         x |
         |-------------------------|
      1. |  frog    frog        42 |
      2. |  newt    newt       666 |
      3. |  toad    toad   3.14159 |
         +-------------------------+
    
    . 
    . 
    . merge 1:1 numid using first 
    (label numid already defined)
    
        Result                           # of obs.
        -----------------------------------------
        not matched                             1
            from master                         1  (_merge==1)
            from using                          0  (_merge==2)
    
        matched                                 2  (_merge==3)
        -----------------------------------------
    
    . 
    . list 
    
         +------------------------------------------------+
         | strid   numid         x    y            _merge |
         |------------------------------------------------|
      1. |  frog    frog        42   17       matched (3) |
      2. |  newt    newt       666   71       matched (3) |
      3. |  toad    toad   3.14159    .   master only (1) |
         +------------------------------------------------+
    The long and short of this is to merge on string identifiers, but as said this sounds like an append problem any way.

    Comment

    Working...
    X