I am working with a very large (millions of obs) dataset with a string variable that looks like the fake data I've made up below. That is, each observation of the string holds 1 name and/or a list of names, and each name is associated with an ID. I wish to create a second dataset, example below, with a variable that holds each name, row by row, only once.
A few points about the starting dataset are worth noting. (i) The number of listed names varies by obs, (ii) almost all names repeat across obs, and I only want each name ONCE in the ultimate dataset, (iii) generally names are followed by - PX where X is a 4 digit number, but sometimes they are followed by - Y where Y is a long number not beginning with P... however, I *think* names are always followed by a space and then a dash, (iv) generally middle initials exist but I have to assume that in a few cases they don't.
So my thought is that I'd like to extract names (for saving in a local perhaps as an intermediary step, or saving straight into this other dataset) based on two rules:
As I said, I've given an example of the ultimate dataset I'd like below... but I'm not sure how to write the code to make it, primarily because (a) I'm not sure how to do the extraction based on a mix of character and position, (b) I'm not sure how to add each new name only if it hasn't been extracted before.
I know this is an odd and finicky request, any help would be greatly appreciated!
FAKE DATA EXAMPLE:
I'D WANT THIS DATASET AS OUTPUT:
A few points about the starting dataset are worth noting. (i) The number of listed names varies by obs, (ii) almost all names repeat across obs, and I only want each name ONCE in the ultimate dataset, (iii) generally names are followed by - PX where X is a 4 digit number, but sometimes they are followed by - Y where Y is a long number not beginning with P... however, I *think* names are always followed by a space and then a dash, (iv) generally middle initials exist but I have to assume that in a few cases they don't.
So my thought is that I'd like to extract names (for saving in a local perhaps as an intermediary step, or saving straight into this other dataset) based on two rules:
- from character 1 to the character before " -"
- from the character after "0, " until the character before " -" OR from the character after "1, " until the character before " -", OR from the character after "2, " until the character before " -", ... OR from the character after "9, " until the character before " -".
As I said, I've given an example of the ultimate dataset I'd like below... but I'm not sure how to write the code to make it, primarily because (a) I'm not sure how to do the extraction based on a mix of character and position, (b) I'm not sure how to add each new name only if it hasn't been extracted before.
I know this is an odd and finicky request, any help would be greatly appreciated!
FAKE DATA EXAMPLE:
Code:
clear set obs 6 gen names="" replace names="Bevis, Leah E - P2951, Smith, Robert G - P1942, Elfing, Maurad F - P3204" if _n==1 replace names="Brooker, Harry T - P2409" if _n==2 replace names="Smith, Robert G - P1942, Hanger, Ryan - P2442" if _n==3 replace names="Greene, Kelly - P8419, Jacob, Miller M - P1982, Mchugh, Lindsey A - 2003413359, Miracle, Aaron - P2095" if _n==4 replace names="Elfing, Maurad F - P3204, Gran, Mike T - 200489023" if _n==5 replace names="Hanger, Ryan - P2442" if _n==6
Code:
clear set obs 10 gen names="" replace names="Bevis, Leah E" if _n==1 replace names="Smith, Robert G" if _n==2 replace names="Elfing, Maurad F" if _n==3 replace names="Brooker, Harry T" if _n==4 replace names="Hanger, Ryan" if _n==5 replace names="Greene, Kelly" if _n==6 replace names="Jacob, Miller M" if _n==7 replace names="Mchugh, Lindsey A" if _n==8 replace names="Miracle, Aaron" if _n==9 replace names="Gran, Mike T" if _n==10
Comment