Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge or Append?

    I have a two datasets both of which have a variable denoting state/county fips codes. I need to "I think" either merge or append data from a second dataset to my main data. I've attached a snippet of the main dataset below.


    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str28 countyname int year str3 stateabbr long fips double empgfunds
    "Acadia"   2014 "LA" 22001             32365
    "Acadia"   2015 "LA" 22001             29198
    "Acadia"   2016 "LA" 22001             29098
    "Acadia"   2017 "LA" 22001          29720.94
    "Acadia"   2018 "LA" 22001          29651.59
    "Acadia"   2019 "LA" 22001          29521.63
    "Acadia"   2020 "LA" 22001          29821.68
    "Accomack" 2014 "VA" 51001              7500
    "Accomack" 2015 "VA" 51001              7500
    "Accomack" 2016 "VA" 51001              7500
    "Accomack" 2017 "VA" 51001              7500
    "Accomack" 2018 "VA" 51001              7500
    "Accomack" 2019 "VA" 51001              7500
    "Accomack" 2020 "VA" 51001              7500
    "Ada"      2014 "ID" 16001         259958.99
    "Ada"      2015 "ID" 16001         252399.45
    "Ada"      2016 "ID" 16001         260049.12
    "Ada"      2017 "ID" 16001            257601
    "Ada"      2018 "ID" 16001          259547.5
    "Ada"      2019 "ID" 16001            269664
    "Ada"      2020 "ID" 16001         299311.41
    "Adams"    2014 "CO"  8001            135147
    "Adams"    2015 "CO"  8001            100000
    "Adams"    2016 "CO"  8001                 0
    "Adams"    2017 "CO"  8001             77000
    "Adams"    2018 "CO"  8001             77000
    "Adams"    2019 "CO"  8001             77000
    "Adams"    2020 "CO"  8001             77000
    "Adams"    2014 "ID" 16003            4221.1
    "Adams"    2015 "ID" 16003           5726.52
    "Adams"    2016 "ID" 16003                 0
    "Adams"    2017 "ID" 16003            1529.9
    "Adams"    2018 "ID" 16003              9510
    "Adams"    2019 "ID" 16003              9955
    "Adams"    2020 "ID" 16003          11058.76
    "Adams"    2014 "IL" 17001          28571.31
    "Adams"    2015 "IL" 17001          27869.37
    "Adams"    2016 "IL" 17001          34440.22
    "Adams"    2017 "IL" 17001          32741.75
    "Adams"    2018 "IL" 17001          34476.22
    "Adams"    2019 "IL" 17001          36777.78
    "Adams"    2020 "IL" 17001          33503.01
    "Adams"    2014 "IN" 18001           8485.29
    "Adams"    2015 "IN" 18001                 0
    "Adams"    2016 "IN" 18001             27760
    "Adams"    2017 "IN" 18001           6968.57
    "Adams"    2018 "IN" 18001 91923.98000000001
    "Adams"    2019 "IN" 18001          35264.06
    "Adams"    2020 "IN" 18001             30000
    "Adams"    2014 "MS" 28001             31055
    end
    The second dataset has two variables I need Risk Rating and Risk Score to be added to this data matching on the STCOFIPS to FIPS variable from the master. See the snippet below.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str21 COUNTY double RISK_SCORE str19 RISK_RATNG str4 STATEABBRV
    "Johnson"    9.281418624 "Relatively Low"      "KY"
    "Kenton"     10.44905673 "Relatively Low"      "KY"
    "Knott"      10.06839548 "Relatively Low"      "KY"
    "Knox"       11.85824507 "Relatively Low"      "KY"
    "Larue"      4.610899676 "Very Low"            "KY"
    "Laurel"      12.2632724 "Relatively Low"      "KY"
    "Lawrence"   6.767260626 "Very Low"            "KY"
    "Lee"        4.937158257 "Very Low"            "KY"
    "Leslie"     7.782817821 "Very Low"            "KY"
    "Letcher"    9.271908629 "Relatively Low"      "KY"
    "Lewis"       8.90908052 "Relatively Low"      "KY"
    "Lincoln"    7.242164057 "Very Low"            "KY"
    "Livingston" 9.999168796 "Relatively Low"      "KY"
    "Lyon"       6.622276403 "Very Low"            "KY"
    "McCracken"  19.12364162 "Relatively Moderate" "KY"
    "McCreary"   6.122567653 "Very Low"            "KY"
    "McLean"     7.142575815 "Very Low"            "KY"
    "Madison"    10.82923089 "Relatively Low"      "KY"
    "Logan"      9.500956351 "Relatively Low"      "KY"
    "Magoffin"    7.70513816 "Very Low"            "KY"
    "Marion"     6.039854946 "Very Low"            "KY"
    "Marshall"   11.23343249 "Relatively Low"      "KY"
    "Martin"     6.381700555 "Very Low"            "KY"
    "Mason"      7.529803222 "Very Low"            "KY"
    "Meade"      6.093386636 "Very Low"            "KY"
    "Menifee"    8.299519073 "Relatively Low"      "KY"
    "Mercer"     7.722162526 "Very Low"            "KY"
    "Metcalfe"   5.590253834 "Very Low"            "KY"
    "Monroe"     6.881790428 "Very Low"            "KY"
    "Montgomery"  6.86183341 "Very Low"            "KY"
    "Morgan"      5.66691992 "Very Low"            "KY"
    "Muhlenberg" 9.414436873 "Relatively Low"      "KY"
    "Nelson"     8.085131815 "Very Low"            "KY"
    "Nicholas"   5.891145218 "Very Low"            "KY"
    "Ohio"       9.307983344 "Relatively Low"      "KY"
    "Oldham"     2.109627275 "Very Low"            "KY"
    "Owen"       6.836766087 "Very Low"            "KY"
    "Owsley"     7.583154061 "Very Low"            "KY"
    "Pendleton"  7.446827729 "Very Low"            "KY"
    "Perry"      10.10004836 "Relatively Low"      "KY"
    "Pike"       12.62172927 "Relatively Low"      "KY"
    "Powell"     9.581312986 "Relatively Low"      "KY"
    "Pulaski"    12.00379862 "Relatively Low"      "KY"
    "Robertson"  4.197439657 "Very Low"            "KY"
    "Rockcastle" 7.058219568 "Very Low"            "KY"
    "Rowan"      8.083623212 "Very Low"            "KY"
    "Russell"    8.555008101 "Relatively Low"      "KY"
    "Scott"       5.94170199 "Very Low"            "KY"
    "Shelby"      9.22171892 "Relatively Low"      "KY"
    "Simpson"    8.437796527 "Relatively Low"      "KY"
    end

    I believe the correct code would be something like

    Code:
     rename fips STCOFIPS
    to rename the master dataset variable to the same name, then use
    Code:
    merge 1:1 STCOPFIPS RISK_SCORE RISK_RATING using NRI_TABLE_COUNTY.csv
    to bring in the two variables I need into the master.

    Is this correct or should I be appending the data instead? I've read the help file but I always get confused with merge vs. append.

  • #2
    When two data sets have the same variables but are observations of different units (or the same units at different times) then you -append- them. But if the data sets are observations of the same (or largely overlapping) units (and times, if applicable) and have different variables describing them, you -merge- them.

    But your -merge- command as written will not work. The second data set contains no fips code under any name, so you can't merge them on that. You will have to fall back on merging based on the name of the county and the state abbreviation. Moreover, your first data set observes each of its counties in multiple years, whereas the second data set has only a single observation per county. So this is not a 1:1 merge.

    Combining these will require some additional preparation. The first thing is that the name of the county variable needs to be the same in both data sets. The second thing, potentially harder, is that, in my experience with data sets of this nature, the county names may be spelled out differently in the two data sets. For example, one data set might have a "Saint Lawrence" county while the other shows it as "St. Lawrence." Those will not match with each other. Even if both use abbreviations, one might follow them with periods and the other not. Text data can be complicated!

    Anyway, the skeleton of this will be something like this:
    Code:
    use first_data_set, clear
    rename countyname COUNTY
    merge m:1 COUNTY using second_data_set
    If you do indeed have county names that are spelled out differently in the two data sets, this will show up as mismatches in the result (observations with _merge != 3). So you will want to look at those results to see what kind of changes need to be made to the data to get appropriate matching.
    Code:
    browse COUNTY _merge if _merge != 3
    will show them to you. From there you'll have to figure it out what is supposed to match with what but isn't. Then you can go back to the start and change the county names in one or both data sets so that things that should match will. Then re-run the code to get better results.

    Comment


    • #3
      Clyde Schechter gave great advice. You will need to match on state as well, I guess, as some counties in different states have the same name.

      Comment


      • #4
        Indeed. I live in Orange County, CA. But there is an Orange County in New York and also in Florida--and those are just the ones I can think of quickly!

        Comment


        • #5
          Numerous counties named for Washington, Franklin, Jefferson

          Comment


          • #6
            Thank you Clyde and Nick,

            One quick follow up question for the two of you. is it possible to add county and stateabbr to the m:1 merge so that Stata looks at both variables when attempting the merge? I know I have a few counties that have the same name, but as you rightly point out are actually in different states so I don’t want to add to much post-command work on correcting if I can avoid it.

            Comment


            • #7
              Code:
              use first_data_set, clear
              rename countyname COUNTY
              rename stateabbr STATEABBRV
              merge m:1 COUNTY STATEABBRV using second_data_set

              Comment

              Working...
              X