Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Consistent Formatting of String Data

    Hi all,

    I'm working with some administrative data in which the names are formatted slightly differently in each year. The goal is to format all names the same way so that I can ask Stata to identify the years in which a person is present in the dataset, using a command such as:

    Code:
    bysort name year: gen present=_n==1
    Below is an example of my data where the formatting issues in the "name" variable are clear. I'm using Stata 15.1 on a Mac. Solution?

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str20 name float year
    "SMITH JONES ,PAUL O"  2012
    "SMITH JONES ,PAUL O"  2012
    "SMITH JONES, PAUL, O" 2013
    "SMITHJONES,PAUL,O"    2014
    "SMITH, JONES, PAUL O" 2015
    "SMITH, JONES, PAUL O" 2015
    "SMITH JONES, PAUL, O" 2016
    end

  • #2
    I'm not sure if folks typically answer their own questions on this list, but after thinking about the problem a bit, I've come to a solution:

    Code:
    gen y=subinstr(name,",","",.)
    gen y1=subinstr(y," ","",.)
    The code removes spaces and commas and then...violá!

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str20 name float year str18 y str15 y1
    "SMITH JONES ,PAUL O"  2012 "SMITH JONES PAUL O" "SMITHJONESPAULO"
    "SMITH JONES, PAUL, O" 2013 "SMITH JONES PAUL O" "SMITHJONESPAULO"
    "SMITHJONES,PAUL,O"    2014 "SMITHJONESPAULO"    "SMITHJONESPAULO"
    "SMITH, JONES, PAUL O" 2015 "SMITH JONES PAUL O" "SMITHJONESPAULO"
    "SMITH JONES, PAUL, O" 2016 "SMITH JONES PAUL O" "SMITHJONESPAULO"
    end

    Comment


    • #3
      Yes, people do answer their own questions here, and we encourage it, so that others who come to the forum later searching for an answer to a similar question can learn from what you found. Thank you for contributing to the Forum!

      Comment


      • #4
        Another possibility is to use -split- and then concatenate the parts back together:
        Code:
        clear
        input str20 name float year
        "SMITH JONES ,PAUL O"  2012
        "SMITH JONES ,PAUL O"  2012
        "SMITH JONES, PAUL, O" 2013
        "SMITHJONES,PAUL,O"    2014
        "SMITH, JONES, PAUL O" 2015
        "SMITH, JONES, PAUL O" 2015
        "SMITH JONES, PAUL, O" 2016
        end
        replace name = subinstr(name, ",", " ",.)
        replace name = subinstr(name, "SMITHJONES", "SMITH JONES",.)
        split name
        gen new_name = name1 + " " + name2 + ", " + name3 + " " + name4
        drop name name?
        list

        Comment

        Working...
        X