Extract hyphenated name and date from string variable

Dr Claudia Pitts

Join Date: Apr 2014

Posts: 1
#1

15 Apr 2014, 00:19

Dear statalisters,

I have a string variable (stringvar) in the following format:

joe bloggs 10/03/1987
jamie-lee cyrus 2/12/1982
cameron reece jones aka smith 03/02/1961
michelle simone peters-smith 16/8/1952

The first portion of the variable is the person’s name, and the second is their date of birth. I have successfully extracted the date of birth (dob) using the following code:

gen dob = regexs(0) if(regexm(stringvar, "[0-9]*[/][0-9]*[/][0-9]*"))

I would like to extract the person’s first name (retaining hyphenation), middle and surnames (also retaining hyphenation), and also identify words that come after “aka” as this denotes former (e.g. maiden) names.

I can extract the first name using:

gen firstname = regexs(0) if(regexm(stringvar, "([a-z]+)[ ]*"))

but this doesn’t retain hyphenation – I only get the first part of a hyphenated name. Using the following code, e.g.

gen fourthname = regexs(4) if(regexm(stringvar, "([a-z]+)[ ]*([a-z]+)[ ]*([a-z]+)[ ]*([a-z]+)"))

returns fourthname as the final character of the last name for names with fewer than four words, e.g. fourthname == "s" for joe bloggs.

I am using Stata SE 13.0 for Windows. Any help is much appreciated.

Thank you,

Claudia.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

15 Apr 2014, 05:06

My recommended strategy for these problems is always to start with simple string functions and to proceed to regex if and only if you need it.

There is more on neglected simple functions in http://www.stata-journal.com/article...article=dm0058 To show I am happy to use regex when it's the best tool I cite http://www.stata-journal.com/sjpdf.h...iclenum=dm0054 and moss (SSC; joint work with Robert Picard).

As I understand it from your examples:

1. The date of birth is just the last word, so word(stringvar, -1) would work too.

2. The first name is just the first word, so word(stringvar, 1) should work regardless of hyphenation.

3. The surname is just the second last word in simple cases, so word(stringvar, -2) would work mostly.

Words in Stata are just whatever spaces separate (modulo binding in double quotation or compound double quotation marks).
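To illustrate points 1 to 3, here is a small sketch using one of the example strings from the original post. It only shows what word() returns for positive and negative word numbers; it is not a complete solution.

```stata
* word(s, n) returns the nth space-delimited word; negative n counts from the end.
display word("jamie-lee cyrus 2/12/1982", 1)    // jamie-lee
display word("jamie-lee cyrus 2/12/1982", -2)   // cyrus
display word("jamie-lee cyrus 2/12/1982", -1)   // 2/12/1982
```

Because the hyphen is not a space, "jamie-lee" stays together as a single word.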

However, here is a sequential strategy.

1. Try out the last word as a daily date. If that works remove it.

2. Look for " aka " as a substring. If you find it, remove it and what follows. N.B. search for " aka " with its surrounding spaces, not "aka", which could also occur inside a name.

3. The first name is the first word of what remains and the surname the last word of what remains.

4. Remove them and other names are what remains.

Here are steps 1 and 2. Look: no regex.

Code:

. clear

. input str40 stuff

                                        stuff
  1. "joe bloggs 10/03/1987"
  2. "jamie-lee cyrus 2/12/1982"
  3. "cameron reece jones aka smith 03/02/1961"
  4. "michelle simone peters-smith 16/8/1952"
  5. end

. compress

. gen bdate = date(word(stuff, -1), "DMY")

. format bdate %tdDD_Mon_YY

. replace stuff = trim(subinstr(stuff, word(stuff, -1), "", 1)) if bdate < .
(4 real changes made)

. gen akapos = strpos(stuff, " aka ")

. gen aka = trim(substr(stuff, akapos, .)) if akapos
(3 missing values generated)

. replace stuff = trim(subinstr(stuff, aka, "", .)) if akapos
(1 real change made)

. list

     +---------------------------------------------------------------+
     |                        stuff       bdate   akapos         aka |
     |---------------------------------------------------------------|
  1. |                   joe bloggs   10 Mar 87        0             |
  2. |              jamie-lee cyrus   02 Dec 82        0             |
  3. |          cameron reece jones   03 Feb 61       20   aka smith |
  4. | michelle simone peters-smith   16 Aug 52        0             |
     +---------------------------------------------------------------+
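Steps 3 and 4 are described but not coded in this post. A minimal sketch, assuming it is run directly after the commands for steps 1 and 2 (so that stuff holds only the names and aka holds any "aka ..." text), might look like this. The variable names firstname, surname, othernames and formername are illustrative assumptions, not part of the original reply.

```stata
* Continues from steps 1 and 2: stuff holds names only; aka holds "aka ..." if present.
* firstname, surname, othernames and formername are illustrative names.

* Step 3: the first word is the first name; the last word is the surname.
gen firstname = word(stuff, 1)
gen surname = word(stuff, -1) if wordcount(stuff) > 1

* Step 4: other names are whatever lies between the first and last words.
gen othernames = ""
replace othernames = trim(substr(stuff, strlen(firstname) + 1, ///
    strlen(stuff) - strlen(firstname) - strlen(surname))) if wordcount(stuff) > 2

* Former name(s): the text that follows the leading "aka".
gen formername = trim(substr(aka, 4, .)) if akapos

list firstname othernames surname formername
```

With the four example rows this should leave othernames as "reece" and "simone" for the third and fourth observations and formername as "smith" for the third. Hyphenated names are untouched because only spaces separate words.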
Joe Canner

Join Date: Mar 2014

Posts: 580
#3

15 Apr 2014, 08:30

Claudia,

I don't disagree with Nick's advice about avoiding regular expressions, but just for general edification purposes, here is how you can fix the statement that generates firstname so that hyphens are included:

Code:

gen firstname = regexs(0) if(regexm(stringvar, "([a-z-]+)[ ]*"))

The hyphen placed last inside the square brackets is read as a literal hyphen and not as an indicator of a range of characters. (A "/" inside the brackets would not act as an escape; it would simply add the slash to the set of characters matched.) Also, if there is any chance that you will have upper case letters in names, you should do:

Code:

gen firstname = regexs(0) if(regexm(stringvar, "([A-Za-z-]+)[ ]*"))

All that said, Nick's suggestion is better for picking up the different names when you have a variable number of them.

Regards,
Joe