Extracting substrings by position + character, with no repeat extractions

Leah Bevis

Join Date: Oct 2015

Posts: 125
#1

Extracting substrings by position + character, with no repeat extractions

13 Feb 2024, 08:39

I am working with a very large (millions of obs) dataset with a string variable that looks like the fake data I've made up below. That is, each observation of the string holds 1 name and/or a list of names, and each name is associated with an ID. I wish to create a second dataset, example below, with a variable that holds each name, row by row, only once.

A few points about the starting dataset are worth noting. (i) The number of listed names varies by obs, (ii) almost all names repeat across obs, and I only want each name ONCE in the ultimate dataset, (iii) generally names are followed by - PX where X is a 4 digit number, but sometimes they are followed by - Y where Y is a long number not beginning with P... however, I *think* names are always followed by a space and then a dash, (iv) generally middle initials exist but I have to assume that in a few cases they don't.

So my thought is that I'd like to extract names (for saving in a local perhaps as an intermediary step, or saving straight into this other dataset) based on two rules:
from character 1 to the character before " -"

from the character after "0, " until the character before " -" OR from the character after "1, " until the character before " -", OR from the character after "2, " until the character before " -", ... OR from the character after "9, " until the character before " -".

I'd like to add each extracted name to the local / new dataset, UNLESS that name has been extracted previously.

As I said, I've given an example of the ultimate dataset I'd like below... but I'm not sure how to write the code to make it, primarily because (a) I'm not sure how to do the extraction based on a mix of character and position, (b) I'm not sure how to add each new name only if it hasn't been extracted before.

I know this is an odd and finicky request, any help would be greatly appreciated!

FAKE DATA EXAMPLE:

Code:

clear set obs 6 gen names="" replace names="Bevis, Leah E - P2951, Smith, Robert G - P1942, Elfing, Maurad F - P3204" if _n==1 replace names="Brooker, Harry T - P2409" if _n==2 replace names="Smith, Robert G - P1942, Hanger, Ryan - P2442" if _n==3 replace names="Greene, Kelly - P8419, Jacob, Miller M - P1982, Mchugh, Lindsey A - 2003413359, Miracle, Aaron - P2095" if _n==4 replace names="Elfing, Maurad F - P3204, Gran, Mike T - 200489023" if _n==5 replace names="Hanger, Ryan - P2442" if _n==6

I'D WANT THIS DATASET AS OUTPUT:

Code:

clear set obs 10 gen names="" replace names="Bevis, Leah E" if _n==1 replace names="Smith, Robert G" if _n==2 replace names="Elfing, Maurad F" if _n==3 replace names="Brooker, Harry T" if _n==4 replace names="Hanger, Ryan" if _n==5 replace names="Greene, Kelly" if _n==6 replace names="Jacob, Miller M" if _n==7 replace names="Mchugh, Lindsey A" if _n==8 replace names="Miracle, Aaron" if _n==9 replace names="Gran, Mike T" if _n==10

Last edited by Leah Bevis; 13 Feb 2024, 09:34.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

13 Feb 2024, 10:16

Try this:

Code:

split names, parse("-") gen(token) drop names gen `c(obs_t)' obs_no = _n reshape long token, i(obs_no) drop _j replace token = ustrregexra(token, " *P*\d+,* *", "") drop if missing(token) drop obs_no duplicates drop

It produces the intended results for your example data. Since you are somewhat uncertain about the full extent of the regularities in your full data set, I can't guarantee it will work perfectly there. But I believe it will properly handle anything that conforms to the structure you have described.
1 like
Comment

Mike LeGower

Join Date: Aug 2021
Posts: 11

13 Feb 2024, 10:16

Code:

split names, p(" - ") gen(name)
reshape long name, i(names) j(num)
drop names num
replace name = regexr(name, "(^P[0-9]+(, )?)|[0-9]+(, )?", "")
drop if name==""
duplicates drop name, force
qui compress name

Should get you from input to output as originally posted.

Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10195

13 Feb 2024, 23:11

If the dataset is large, I would avoid using the split and reshape commands. Here is an alternative.

Code:

clear
set obs 6
gen names=""
replace names="Bevis, Leah E - P2951, Smith, Robert G - P1942, Elfing, Maurad F - P3204" if _n==1
replace names="Brooker, Harry T - P2409" if _n==2
replace names="Smith, Robert G - P1942, Hanger, Ryan - P2442" if _n==3
replace names="Greene, Kelly - P8419, Jacob, Miller M - P1982, Mchugh, Lindsey A - 2003413359, Miracle, Aaron - P2095" if _n==4
replace names="Elfing, Maurad F - P3204, Gran, Mike T - 200489023" if _n==5
replace names="Hanger, Ryan - P2442" if _n==6

replace names=ustrregexra(names, "(\-\s[P][\d]+)(\,|$)|(\-\s[\d]+)(\,|$)", ";")
gen howmany=length(names)-length(subinstr(names, ";", "", .))
expand howmany
bys names: gen wanted=trim(itrim(subinstr(word(subinstr(subinstr(trim(itrim(names)), " ", "_", .), ";", " ", .) , _n), "_", " ", .)))
contract wanted

Res.:

Code:

 l, sep(0)

     +---------------------------+
     |            wanted   _freq |
     |---------------------------|
  1. |     Bevis, Leah E       1 |
  2. |  Brooker, Harry T       1 |
  3. |  Elfing, Maurad F       2 |
  4. |      Gran, Mike T       1 |
  5. |     Greene, Kelly       1 |
  6. |      Hanger, Ryan       2 |
  7. |   Jacob, Miller M       1 |
  8. | Mchugh, Lindsey A       1 |
  9. |    Miracle, Aaron       1 |
 10. |   Smith, Robert G       2 |
     +---------------------------+

Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 784

15 Feb 2024, 08:22

#4 is interesting. Playing with the code I made som minor changes. To my suprice using regexreplaceall() instead of ustrregexra() increased running time. Also, using the howmany expression directly with expand was less efficient than first defining a variable.

Code:

local rx \s-\sP?\d+\,?

replace names =                /// string of words (space separated "names")
    subinstr(                  /// . replace name separator ";" with space
        subinstr(              /// .. replace space with replacement char "_"
            ustrregexra(       /// ... replace pattern with name separator ";"
                names,         /// ... separated by pattern
                "`rx'",        /// ... regex pattern \s-\sP?\d+\,?       
                ";"            /// ... name separator ";"
            ),                 /// ..
            " ",               /// .. replace space
            "_",               /// .. with replacement char "_"
            .                  /// ..
        ),                     /// ..
        ";",                   /// . replace name separator ";"
        " ",                   /// . with space
        .                      /// .
    )                          ///

gen howmany = length(names)-length(subinstr(names, ";", "", .))
expand howmany

bys names: gen wanted=trim(subinstr(word(names, _n), "_", " ", .))
contract wanted if wanted != ""

Comment

Bjarte Aagnes

Join Date: Apr 2014

Posts: 784
#6

23 Feb 2024, 04:36

Correction: in #5 above, the

Code:

gen howmany = length(names)-length(subinstr(names, ";", "", .)) expand howmany

shoud be

Code:

expand wordcount(names)
Comment

Announcement