Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • merging two panel datasets on name and year using matchit or reclink

    Hi all

    I am trying to merge two panel data sets and therefore I am wondering in what way I could best use either reclink or matchit to merge them?

    The first dataset contains variables/information on assurance services and the second one contains financial data.
    I need to merge the data on company name and year. The two datasets differ with regards to the "name" variable but I want to merge and look into the companies that occur in both datasets. It is however so that the names of the companies are not written completely the same in both datasets (e.g. "ABN AMRO holding" in first set and "abn amro holding NV" in the second)For this variable I thus need a fuzzy match. The year variable, on the other hand, should be merged 1:1.

    A final important remark is that both datasets do not contain the same amount of observations. It could be that a company is found in the first dataset but not in the second and vice versa.

    Here’s an example of what the data looks like:

    Source A
    name
    AUDI AG
    AUDI AG
    AUDI AG
    AVINOR
    AVINOR
    AVINOR
    AVINOR
    AWISTA GmbH
    AWISTA GmbH
    AXA
    AXA
    AXA
    AXA
    AXA
    AXA
    year
    2013
    2015
    2017
    2014
    2015
    2016
    2017
    2015
    2016
    2012
    2013
    2014
    2015
    2016
    2017
    Source B
    Name
    AUDI AG
    AUDI AG
    AUDI AG
    AUDI AG
    AUDI AG
    AUDI AG
    AUDIKA GROUPE
    AUDIKA GROUPE
    AUDIKA GROUPE
    AUDINATE PTY LTD
    AUDINATE PTY LTD
    AUDINATE PTY LTD
    AUDINATE PTY LTD
    AUDIO PIXELS HOLDINGS LTD
    AUDIO PIXELS HOLDINGS LTD
    year
    2012
    2013
    2014
    2015
    2016
    2017
    2012
    2013
    2014
    2014
    2015
    2016
    2017
    2012
    2013
    I hope anyone has a good tip for me!

  • #2
    The need to match on a time variable as well as the name somewhat complicates matters. I think I would do this by first creating a crosswalk between the names in dataset A and dataset B, and then subsequently do a merge. So something like this:

    Code:
    use dataset_A
    keep name
    duplicates drop
    gen obs_no = _n
    preserve
    
    use dataset_B, clear
    keep name
    rename name name_b
    duplicates drop
    gen obs_no = _n
    tempfile b
    save `b'
    
    restore
    matchit obs_no name using `b', idusing(obs_no) txtusing(name_b)
    At this point you will have a data set in memory containing all of the names from both data sets and a score for how well they match. You then need to inspect this and decide what threshold on the similarity score gives you reasonable sensitivity and specificity for matching. Then just keep those observations that are above that score and save the data set in a crosswalk file.

    Now you can go back and use this to merge the data sets:

    Code:
    use dataset_A
    merge 1:m name using crosswalk_file, nogenerate
    rename name name_a
    rename name_b name
    merge m:1 name year using dataset_B
    rename name name_b
    Notes:

    1. None of this is tested, so you may have to work out some bugs. This is the gist of it.

    2. Your data examples are posted in some way that I have not encountered previously, so I was unable to find a way to import it to Stata short of typing it in manually, which I did not do. Please read the Forum FAQ for information about the helpful ways to show example data, Stata code, and Stata outputs here. In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    3. This code assumes that your panel data sets are properly structured and, specifically, that each firm has only a single observation in any given year.

    Comment


    • #3
      Dear Clyde, thank you very much. I am following up your tips, however, I get an error at the following point:


      Code:
      . save b, replace
      file b.dta saved
      
      . restore
      
      . matchit obs_no name using "b.dta", idusing(obs_no) txtusing(name_b)
      Matching current dataset with b.dta
       
      (!) Unsaved changes will be destroyed after running matching procedure.
          (note: use OVERRIDE option to bypass warning)
      no; data in memory would be lost
      r(4);
      How do I use the override option to work around this error?

      Comment


      • #4
        Sorry for not anticipating that. You just add it to the command, after the -txtusing()- option.

        So:
        Code:
        matchit obs_no name using "b.dta", idusing(obs_no) txtusing(name_b) override
        You did not identify yourself as a Stata beginner, and from your post I imagined that you have some experience using Stata. So I am surprised that you did not know how to add an option to the command. You are embarking on a project that is fairly advanced. I think you are going to encounter difficulties as you proceed if you do not stop and familiarize yourself with the fundamentals of Stata. From Stata's Help menu select PDF documentation. Read the Getting Started [GS] and User's Guide [U] volumes. It's a lengthy read, but it will expose you to the basics of Stata command syntax, Stata's overall data model and approach to management and analysis, and the commands that every Stata programmer needs to use on a regular basis. You won't remember everything, but you will know enough to be able understand code and make simple changes to it. You will also, in most situations, be able to recognize what commands are likely to be needed for a given task, and you can then get the details you don't remember from the help files or the PDF documentation.

        I'll also point out that had you posted your example data with -dataex-, as is requested in the Forum FAQ, in #1, I would have been able to do a test run with my code and I would have caught my error before posting my response. It really does make sense to show example data. I know you intended to do that. But you really need to use -dataex- so that good intentions lead to useful actions.




        Comment


        • #5
          Hi all,

          I'm working through a similar problem as the original post, but with a slightly additional layer of complexity of merging on whether a year falls within a particular range. More specifically, I wanted to add a variable to an existing dataset if there was a matching string and the year falls within a specified range. Minimal working example below. I wanted to post here for posterity in case others had a similar question. That said, although it appears that I achieved the desired output can I simplify my coding or perhaps use a more efficient method? Thanks!

          Code:
          clear
          input byte id1 str14 name1 int year int variable
          1 "Doe, Jon"           1980            1
          2 "Will Green"         1988              0
          3 "Jone, Emily"        1985             0
          4 "Sara Brown"         1980            1
          5 "Daniel Kennedy"     1992             0
          6 "Bobby Quest"         1995            1        
          end
          save file1.dta, replace
          
          clear
          input byte id2 str13 name2 int startyear    int endyear
          7 "John Doe"            1980    1982
          8 "William Green"        1989    1991            
          9 "Emily Jones"         1980    1989
          10 "Sarah Brown"          1980    1980
          11 "Dan Kennedy"           1990    1991
          12 "Sandra Lee"            1992    1998
          end
          save file2.dta, replace
          
          matchit id2 name2 using file1.dta, idusing(id1) txtusing(name1) threshold(.4)
          merge m:m id1 using file1.dta, nogenerate
          save crosswalk.dta, replace
          
          clear
          use file2.dta
          isid id2 , sort
          rangejoin year startyear endyear using "crosswalk.dta", by(id2)
          by id2 startyear endyear, sort: keep if _n == 1
          drop name2_U id1 name1 year similscore
          list

          Comment


          • #6
            @Clyde Schechter Hi Clyde, thank you for sharing your thoughts in this thread. Can I ask a quick question? In #2 post, there is no year variable in either dataset_A or crosswalk dataset. How to do the m:1 merge with dataset_B?

            Comment

            Working...
            X