  • Extract hyphenated name and date from string variable

    Dear statalisters,

    I have a string variable (stringvar) in the following format:

    joe bloggs 10/03/1987
    jamie-lee cyrus 2/12/1982
    cameron reece jones aka smith 03/02/1961
    michelle simone peters-smith 16/8/1952

    The first portion of the variable is the person’s name, and the second is their date of birth. I have successfully extracted the date of birth (dob) using the following code:

    gen dob = regexs(0) if(regexm(stringvar, "[0-9]*[/][0-9]*[/][0-9]*"))

    I would like to extract the person’s first name (retaining hyphenation), middle and surnames (also retaining hyphenation), and also identify words that come after “aka” as this denotes former (e.g. maiden) names.

    I can extract the first name using:

    gen firstname = regexs(0) if(regexm(stringvar, "([a-z]+)[ ]*"))

    but this doesn’t retain hyphenation – I only get the first part of a hyphenated name. Using the following code, e.g.

    gen fourthname = regexs(4) if(regexm(stringvar, "([a-z]+)[ ]*([a-z]+)[ ]*([a-z]+)[ ]*([a-z]+)"))

    returns fourthname as the final character of the last name for names with fewer than four words, e.g. fourthname == "s" for joe bloggs.

    I am using Stata SE 13.0 for Windows. Any help is much appreciated.

    Thank you,

    Claudia.

  • #2
    My recommended strategy for these problems is always to start with simple string functions and to proceed to regex if and only if you need it.

    There is more on neglected simple functions in http://www.stata-journal.com/article...article=dm0058 . To show that I am happy to use regex when it's the best tool, I cite http://www.stata-journal.com/sjpdf.h...iclenum=dm0054 and moss (SSC; joint work with Robert Picard).

    As I understand it from your examples:

    1. The date of birth is just the last word, so word(stringvar, -1) would work too.

    2. The first name is just the first word, so word(stringvar, 1) should work regardless of hyphenation.

    3. The surname is just the second last word in simple cases, so word(stringvar, -2) would work mostly.

    Words in Stata are just whatever spaces separate (modulo binding in double quotation or compound double quotation marks).
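
    As a quick check of points 1 to 3, word() can be tried on one of the example strings directly. This is only an illustrative sketch using a literal string rather than the real variable:

    ```stata
    * word(s, n) returns the nth space-delimited word; negative n counts from the end
    display word("jamie-lee cyrus 2/12/1982", 1)     // first name, hyphen retained
    display word("jamie-lee cyrus 2/12/1982", -1)    // last word: the date of birth
    display word("jamie-lee cyrus 2/12/1982", -2)    // second last word: the surname
    ```

    The hyphenated first name comes back whole because the hyphen is not a space, so it never splits a word.
    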

    However, here is a sequential strategy.

    1. Try out the last word as a daily date. If that works remove it.

    2. Look for " aka " as a substring. If you find it, remove it and what follows. N.B. search for " aka " with the surrounding spaces, not "aka", which could occur inside a name.

    3. The first name is the first word of what remains and the surname the last word of what remains.

    4. Remove them and other names are what remains.

    Here are steps 1 and 2. Look: no regex.

    Code:
    . clear 
    
    . input str40 stuff
    
                                            stuff
      1. "joe bloggs 10/03/1987"
      2. "jamie-lee cyrus 2/12/1982"
      3. "cameron reece jones aka smith 03/02/1961"
      4. "michelle simone peters-smith 16/8/1952"
      5. end 
    
    . compress 
    
    . gen bdate = date(word(stuff, -1), "DMY") 
    
    . format bdate %tdDD_Mon_YY 
    
    . replace stuff = trim(subinstr(stuff, word(stuff, -1), "", 1)) if bdate < . 
    (4 real changes made)
    
    . gen akapos = strpos(stuff, " aka ") 
    
    . gen aka = trim(substr(stuff, akapos, .)) if akapos 
    (3 missing values generated)
    
    . replace stuff = trim(subinstr(stuff, aka, "", .)) if akapos 
    (1 real change made)
    
    . list 
    
         +---------------------------------------------------------------+
         |                        stuff       bdate   akapos         aka |
         |---------------------------------------------------------------|
      1. |                   joe bloggs   10 Mar 87        0             |
      2. |              jamie-lee cyrus   02 Dec 82        0             |
      3. |          cameron reece jones   03 Feb 61       20   aka smith |
      4. | michelle simone peters-smith   16 Aug 52        0             |
         +---------------------------------------------------------------+
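
    Steps 3 and 4 are not shown above. One possible sketch follows, again with no regex. It assumes the data are in the state listed above (names only in stuff, with "aka smith" split off into aka) and re-enters them so that the block runs on its own. The variable names firstname, surname, othernames and formername are my own choices for illustration.

    ```stata
    clear
    input str40 stuff str20 aka
    "joe bloggs"                   ""
    "jamie-lee cyrus"              ""
    "cameron reece jones"          "aka smith"
    "michelle simone peters-smith" ""
    end

    * Step 3: first name is the first word, surname is the last word
    gen firstname = word(stuff, 1)
    gen surname = word(stuff, -1) if wordcount(stuff) > 1

    * Step 4: other names are whatever lies between first name and surname
    gen othernames = trim(substr(stuff, strlen(firstname) + 1, ///
        strlen(stuff) - strlen(firstname) - strlen(surname)))

    * Former name: drop the leading "aka " (4 characters)
    gen formername = substr(aka, 5, .)

    list firstname othernames surname formername
    ```

    Taking the middle portion by position, rather than deleting the first name and surname with subinstr(), avoids trouble when the same text appears more than once within a full name.
    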


    • #3
      Claudia,

      I don't disagree with Nick's advice about avoiding regular expressions, but just for general edification purposes, here is how you can fix the statement that generates firstname so that hyphens are included:

      Code:
      gen firstname = regexs(1) if regexm(stringvar, "^([a-z-]+)")
      Placing the hyphen last inside the square brackets makes it a literal hyphen rather than a range indicator. (A "/" is not an escape character in Stata's regular expressions; inside brackets it simply matches a literal slash.) Anchoring with "^" and using regexs(1) returns just the first name, without any trailing spaces. Also, if there is any chance that you will have upper case letters in names, you should do:

      Code:
      gen firstname = regexs(1) if regexm(stringvar, "^([A-Za-z-]+)")
      All that said, Nick's suggestion is better for picking up the different names when you have a variable number of them.
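
      If you want to check how a bracket expression treats the hyphen before running it on the whole variable, a small interactive test on a literal string may help. This is only a sketch; the pattern assumes lower case names at the start of the string, with the hyphen placed last inside the brackets so that it is read literally.

      ```stata
      * regexm() returns 1 on a match; regexs(1) then holds the first subexpression
      display regexm("jamie-lee cyrus 2/12/1982", "^([a-z-]+)")
      display regexs(1)
      ```
      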

      Regards,
      Joe
