Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging two datasets with many of the same variables

    Hi,

    I want to merge two Stata datasets (both with >300 variables), of which ~200+ of the variables have the same name and the same meaning. I want the variables that have the same name and meaning to combine in the merged dataset. I have gone through and identified the different variables (the ones that don't have a corresponding variable in the other dataset) and added a prefix so that I will be able to identify them after the merge. In order for the duplicate variables to combine, do they need to be in the same format? In other words, do I have to go through and reformat all the variables I would like to combine? More specifically, would a variable that is formatted " byte %10.g" combine with a variable that is formatted "int %10g?" Is it just string and numeric that won't combine?

    Also, is there a fast way to output variable formats using "describe" to a Word or Excel document?? I have tried "asdoc describe" and it doesn't work (only outputs variables and variable labels).

    Thanks!
    Clare

  • #2
    Hello,

    If I understand it correctly, you have two separate datasets that contain a similar variable akin to a unique ID? Further, this unique ID is formatted as byte in the first dataset, and int in the second dataset? (Providing a minimum working example would be helpful). Contingent upon your datasets it shouldn't be a problem. Here's a minimum working example:

    Code:
    clear
    input byte id int(X1 X2 X3)
    1 1969    1971    1     
    2 1973    1980    1
    3 1950    1970    1
    4 1955    1970    1
    5 1982    1992    1
    end
    tempfile events
    save `events', replace
    
    clear
    input int id int(X4 X5)
    1 1971 3
    2 1971 4
    3 1972 9
    6 1973 0
    7 1970 2
    8 1971 0
    9 1963 8
    end
    
    merge m:1 id using `events', keep(match) nogenerate
    list
    Code:
         +-----------------------------------+
         | id     X4   X5     X1     X2   X3 |
         |-----------------------------------|
      1. |  1   1971    3   1969   1971    1 |
      2. |  2   1971    4   1973   1980    1 |
      3. |  3   1972    9   1950   1970    1 |
         +-----------------------------------+
    Assuming this is what you're trying to achieve, there is no need to manually sort through the data in advance to identify unique variables -- let Stata do the work for you! Simply use the keep() option.

    Comment


    • #3
      Clare McCarthy ,

      for describing to Excel you could use xlist.

      Merging data with different formats e.g. "%10.0f" vs "%6.4f" should not be a problem. But data types could be an issue (string vs number).

      You description is vague about what does it mean to "I want the variables that have the same name and meaning to combine in the merged dataset". Stata's merge has an UPDATE option, where the second dataset is used to patch values in the first one. If this is what you need, then no need for any suffixes - Stata will do it for you. If on the other hand you need e.g. to append the strings (such as lists of countries visited, books read, courses taken) then yes, use the indices and accumulate them later after the merge.

      Best, Sergiy Radyakin

      Comment


      • #4
        It is also possible that rather than merge the datasets, you want to append them. I think the lesson from this post and the two preceding it is that you need to do a better job of explaining your data. Even the best descriptions of data are no substitute for an actual example of the data.

        Be sure to use the dataex command to do this. If you are running version 15.1 or later, or a fully updated version 14.2, dataex is already part of your official Stata installation. If not, run ssc install dataex to get it. Either way, run help dataex and read the simple instructions for using it. dataex will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

        While you have hundreds of variables, the dataex command will accept a list of variable names, and will accept the if clause, so you can show a small sample of your data from each dataset, including the identifier (that you plan to merge by), two or three of the variables that are common to both datasets, and one that is not common, for a few values of the identifier, from each dataset.

        When asking for help with code, always show example data. When showing example data, always use dataex.

        Comment


        • #5
          Hi, thank you both for your responses. Sergiy Radyakin , one question --- if I have two string variables that have the exact same name in the two different datasets, will they combine? Or do I need to manually append them?

          Comment


          • #6
            I though I made it clear in the previous post:
            ...to append the strings (such as lists of countries visited, books read, courses taken) then ..., use the indices and accumulate them later after the merge.
            The following will not give you "Sergiy Radyakin" if this is what you expected:

            Code:
            clear all
            input int pid str40 name int age
            101 "Radyakin" 78
            end
            list
            sort pid
            tempfile tmp
            save `tmp'
            clear
            input int pid str40 name double income
            101 "Sergiy" 12345.67
            end
            list
            sort pid
            merge pid using `"`tmp'"'
            list
            Despite this, the files will be merged together.
            Whether this counts as 'combining' for you - don't know. You don't define 'combine' and it is not a Stata term. Merge, append, join, cross are not just poetic variations of the same action, but denote different, often opposite data transformations.

            Comment


            • #7
              Originally posted by Clare McCarthy View Post
              Hi, thank you both for your responses. Sergiy Radyakin , one question --- if I have two string variables that have the exact same name in the two different datasets, will they combine? Or do I need to manually append them?
              For matching on string variables, I would use matchit. Install via SSC,

              Code:
              ssc install matchit
              Note that you may also need to install freqindex in order for matchit to function.

              There are a couple helpful discussion posts here on how to use. See, for example, here and here.

              Comment

              Working...
              X