  • Extracting certain names/initials out of string variables

    Hi Statalist, long time lurker, first time poster.

    I am currently trying to clean my data. I began with only full names, which were entered by the people themselves, so the input is not consistent: some used a full name, some used initials, and so on.

    I have gotten as far as what you see below using this code:
    Code:
    generate contact_peron_first = substr(contact_person, 1, strpos(contact_person, " ") - 1)  
    generate contact_person_last = substr(contact_person,strpos(contact_person, " ") + 1, .)
    replace contact_person_last = strtrim(contact_person_last)
    generate contact_person_last_1 = substr(contact_person_last, 1, strpos(contact_person_last, " ") - 1)  
    generate contact_person_last_2 = substr(contact_person_last,strpos(contact_person_last, " ") + 1, .)
    Code:
    input str45 contact_person str14 contact_peron_first str16 contact_person_last_1 str17 contact_person_last_2
    "K. V Sarathkumar"                "K."             "V"                "Sarathkumar"      
    "Katungal Padmanabhan Sasidharan" "Katungal"       "Padmanabhan"      "Sasidharan"      
    "Katungal Padmanabhan Sasidharan" "Katungal"       "Padmanabhan"      "Sasidharan"      
    "Katungal Padmanabhan Sasidharan" "Katungal"       "Padmanabhan"      "Sasidharan"      
    "Katungal Padmanabhan Sasidharan" "Katungal"       "Padmanabhan"      "Sasidharan"      
    "K S Shan"                        "K"              "S"                "Shan"            
    "K S Shan"                        "K"              "S"                "Shan"            
    "K S Shan"                        "K"              "S"                "Shan"            
    "Brij Mohan   Sharma"             "Brij"           "Mohan"            "Sharma"          
    "Rajeev K Sivadas"                "Rajeev"         "K"                "Sivadas"          
    "Mr. Tim Sunil"                   "Mr."            "Tim"              "Sunil"            
    "C. S. Suresh"                    "C."             "S."               "Suresh"    
    
    end

    What I really want is this:


    Code:
    input str45 contact_person str14 contact_peron_first str16 contact_person_last
    "K. V Sarathkumar"                "K.V"            "Sarathkumar"
    "Mr. Tim Sunil"                   "Tim"            "Sunil"
    "K S Shan"                        "KS"             "Shan"
    end

    The problem is that people did not use initials, honorifics, and so on in any consistent way. At the most basic level, I would just want the initials joined together as the first name and the rest as the last name. Any advice?
    Last edited by Eli Mogel; 13 Jul 2020, 14:10.

  • #2
    Many years ago Bill Gould, now President Emeritus of StataCorp, wrote a neat command called -extrname- which, I think, does most of what you want; use -search- to find and download it.


    • #3
      Originally posted by Rich Goldstein
      Many years ago Bill Gould, now President Emeritus of StataCorp, wrote a neat command called -extrname- which, I think, does most of what you want; use -search- to find and download it.


      Thanks, I'll look for it and try it out !


      EDIT: For anyone who finds this post with the same problem, I suggest downloading extrname and using it. It seems to fix 90% of my problems. Thanks!
      Last edited by Eli Mogel; 13 Jul 2020, 15:01.
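For names that extrname does not handle, the layout asked for in #1 can also be approximated by hand. The following is only a minimal sketch, not anything posted in this thread: it assumes the last word is the last name, that a single letter with an optional period is an initial, and that the honorific list shown is sufficient. The working variables clean, last, first, nw, tok, is_init and prev_init are hypothetical names.

```stata
* Sketch only: example rows are taken from #1; all new variable names are hypothetical
clear
input str45 contact_person
"K. V Sarathkumar"
"Mr. Tim Sunil"
"K S Shan"
"Brij Mohan   Sharma"
end

* remove extra blanks, then drop a leading honorific (this list is an assumption)
generate clean = stritrim(strtrim(contact_person))
replace clean = regexr(clean, "^(Mr|Mrs|Ms|Dr)\.? ", "")

* assumption: the final word is the last name
generate last = word(clean, -1)
generate nw = wordcount(clean)
generate first = ""
generate byte prev_init = 0

* loop over every word except the last one
summarize nw, meanonly
local stop = r(max) - 1
forvalues i = 1/`stop' {
    generate tok = word(clean, `i') if `i' < nw
    * an initial is one letter, optionally followed by a period
    generate byte is_init = regexm(tok, "^[A-Za-z]\.?$")
    * consecutive initials are joined directly; other words get a blank between them
    replace first = first + cond(first == "" | (is_init & prev_init), "", " ") + tok if tok != ""
    replace prev_init = is_init if tok != ""
    drop tok is_init
}

list contact_person first last
```

On the three wanted rows from #1 this should give K.V, Tim and KS as first names. Names with multi-word surnames or unlisted honorifics would still need checking by eye.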


      • #4
        Code:
        net stb 13 dm13
        will bring up a clickable link.

        On #1: split is not the answer here, but you could use it next time.
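To illustrate the remark about split: the substr()/strpos() steps in #1 break a name into words, which split does in one command. A minimal sketch, assuming a few rows from #1 as example data; name_clean and the part stub are hypothetical names.

```stata
* Sketch only: a few rows from #1 as example data
clear
input str45 contact_person
"K. V Sarathkumar"
"Brij Mohan   Sharma"
"C. S. Suresh"
end

* work on a copy with leading, trailing and repeated blanks removed
generate name_clean = stritrim(strtrim(contact_person))

* split on blanks: creates one variable per word (part1, part2, ...)
split name_clean, generate(part)

list contact_person part*
```

This only separates the words. Deciding which words are initials, honorifics or surnames is still a separate step.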


        • #5
          Dear Stata members,

          Sorry for revisiting this thread, but I have two questions:

          QUESTION ONE:

          The variable NAME in my dataset has the form LAST NAME, FIRST NAME, followed by a middle name that is not separated by a comma.

          How can I extract only the first two names so that they are consistent across my data, as in this example?
          NAME                      WANTED
          QAZI, MOHAMED             QAZI, MOHAMED
          QAZI, MOHAMED             QAZI, MOHAMED
          QAZI, MOHAMED ASHRAF      QAZI, MOHAMED
          RADOW, NORMAN             RADOW, NORMAN
          RADOW, NORMAN             RADOW, NORMAN
          RADOW, NORMAN             RADOW, NORMAN
          RADOW, NORMAN J           RADOW, NORMAN
          RAESE, JOHN               RAESE, JOHN
          RAESE, JOHN R MR          RAESE, JOHN

          QUESTION TWO:

          I want to merge two datasets. The second dataset holds the last name and the first name in separate variables, each with an upper-case first letter and the rest of the name in lower case.

          How can I combine the last name and the first name into an upper-case full name with a comma in between, as below?
          LAST NAME    FIRST NAME    WANTED WITH COMMA    WANTED WITHOUT COMMA
          Qazi         Mohamed       QAZI, MOHAMED        QAZI MOHAMED
          Radow        Norman        RADOW, NORMAN        RADOW NORMAN


          • #6
            #5: Asked and answered in #18 https://www.statalist.org/forums/for...-a-comma/page2. Please do not repeat the same question in multiple places.
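For readers who land here first, both questions in #5 can be sketched with standard string functions. This is only an illustration and not necessarily the answer given in the linked thread. It assumes last names are a single word followed by a comma; the variable names name, wanted, last_name, first_name, with_comma and without_comma are hypothetical.

```stata
* Question one: keep only the first two words of NAME
clear
input str30 name
"QAZI, MOHAMED ASHRAF"
"RADOW, NORMAN J"
"RAESE, JOHN R MR"
"RAESE, JOHN"
end

* word 1 is the last name with its comma, word 2 is the first name
generate wanted = word(name, 1) + " " + word(name, 2)
list

* Question two: build an upper-case full name from separate variables
clear
input str15 last_name str15 first_name
"Qazi"  "Mohamed"
"Radow" "Norman"
end

generate with_comma    = upper(last_name) + ", " + upper(first_name)
generate without_comma = upper(last_name) + " "  + upper(first_name)
list
```

A last name of two or more words would break the first rule, so it is worth checking for such cases before merging.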


            • #7
              I am sorry, I posted it instead of saving it.

              Many thanks for your response,
