Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extracting substrings by position + character, with no repeat extractions

    I am working with a very large (millions of obs) dataset with a string variable that looks like the fake data I've made up below. That is, each observation of the string holds 1 name and/or a list of names, and each name is associated with an ID. I wish to create a second dataset, example below, with a variable that holds each name, row by row, only once.

    A few points about the starting dataset are worth noting. (i) The number of listed names varies by obs, (ii) almost all names repeat across obs, and I only want each name ONCE in the ultimate dataset, (iii) generally names are followed by - PX where X is a 4 digit number, but sometimes they are followed by - Y where Y is a long number not beginning with P... however, I *think* names are always followed by a space and then a dash, (iv) generally middle initials exist but I have to assume that in a few cases they don't.

    So my thought is that I'd like to extract names (for saving in a local perhaps as an intermediary step, or saving straight into this other dataset) based on two rules:
    1. from character 1 to the character before " -"
    2. from the character after "0, " until the character before " -" OR from the character after "1, " until the character before " -", OR from the character after "2, " until the character before " -", ... OR from the character after "9, " until the character before " -".
    I'd like to add each extracted name to the local / new dataset, UNLESS that name has been extracted previously.

    As I said, I've given an example of the ultimate dataset I'd like below... but I'm not sure how to write the code to make it, primarily because (a) I'm not sure how to do the extraction based on a mix of character and position, (b) I'm not sure how to add each new name only if it hasn't been extracted before.

    I know this is an odd and finicky request, any help would be greatly appreciated!


    FAKE DATA EXAMPLE:
    Code:
    clear
    set obs 6
    gen names=""
    replace names="Bevis, Leah E - P2951, Smith, Robert G - P1942, Elfing, Maurad F - P3204" if _n==1
    replace names="Brooker, Harry T - P2409" if _n==2
    replace names="Smith, Robert G - P1942, Hanger, Ryan - P2442" if _n==3
    replace names="Greene, Kelly - P8419, Jacob, Miller M - P1982, Mchugh, Lindsey A - 2003413359, Miracle, Aaron - P2095" if _n==4
    replace names="Elfing, Maurad F - P3204, Gran, Mike T - 200489023" if _n==5
    replace names="Hanger, Ryan - P2442" if _n==6
    I'D WANT THIS DATASET AS OUTPUT:
    Code:
    clear
    set obs 10
    gen names=""
    replace names="Bevis, Leah E" if _n==1
    replace names="Smith, Robert G" if _n==2
    replace names="Elfing, Maurad F" if _n==3
    replace names="Brooker, Harry T" if _n==4
    replace names="Hanger, Ryan" if _n==5
    replace names="Greene, Kelly" if _n==6
    replace names="Jacob, Miller M" if _n==7
    replace names="Mchugh, Lindsey A" if _n==8
    replace names="Miracle, Aaron" if _n==9
    replace names="Gran, Mike T" if _n==10
    Last edited by Leah Bevis; 13 Feb 2024, 10:34.

  • #2
    Try this:
    Code:
    split names, parse("-") gen(token)
    drop names
    gen `c(obs_t)' obs_no = _n
    reshape long token, i(obs_no)
    drop _j
    replace token = ustrregexra(token, " *P*\d+,* *", "")
    drop if missing(token)
    drop obs_no
    duplicates drop
    It produces the intended results for your example data. Since you are somewhat uncertain about the full extent of the regularities in your full data set, I can't guarantee it will work perfectly there. But I believe it will properly handle anything that conforms to the structure you have described.

    Comment


    • #3
      Code:
      split names, p(" - ") gen(name)
      reshape long name, i(names) j(num)
      drop names num
      replace name = regexr(name, "(^P[0-9]+(, )?)|[0-9]+(, )?", "")
      drop if name==""
      duplicates drop name, force
      qui compress name
      Should get you from input to output as originally posted.

      Comment


      • #4
        If the dataset is large, I would avoid using the split and reshape commands. Here is an alternative.

        Code:
        clear
        set obs 6
        gen names=""
        replace names="Bevis, Leah E - P2951, Smith, Robert G - P1942, Elfing, Maurad F - P3204" if _n==1
        replace names="Brooker, Harry T - P2409" if _n==2
        replace names="Smith, Robert G - P1942, Hanger, Ryan - P2442" if _n==3
        replace names="Greene, Kelly - P8419, Jacob, Miller M - P1982, Mchugh, Lindsey A - 2003413359, Miracle, Aaron - P2095" if _n==4
        replace names="Elfing, Maurad F - P3204, Gran, Mike T - 200489023" if _n==5
        replace names="Hanger, Ryan - P2442" if _n==6
        
        replace names=ustrregexra(names, "(\-\s[P][\d]+)(\,|$)|(\-\s[\d]+)(\,|$)", ";")
        gen howmany=length(names)-length(subinstr(names, ";", "", .))
        expand howmany
        bys names: gen wanted=trim(itrim(subinstr(word(subinstr(subinstr(trim(itrim(names)), " ", "_", .), ";", " ", .) , _n), "_", " ", .)))
        contract wanted
        Res.:

        Code:
         l, sep(0)
        
             +---------------------------+
             |            wanted   _freq |
             |---------------------------|
          1. |     Bevis, Leah E       1 |
          2. |  Brooker, Harry T       1 |
          3. |  Elfing, Maurad F       2 |
          4. |      Gran, Mike T       1 |
          5. |     Greene, Kelly       1 |
          6. |      Hanger, Ryan       2 |
          7. |   Jacob, Miller M       1 |
          8. | Mchugh, Lindsey A       1 |
          9. |    Miracle, Aaron       1 |
         10. |   Smith, Robert G       2 |
             +---------------------------+

        Comment


        • #5
          #4 is interesting. Playing with the code I made som minor changes. To my suprice using regexreplaceall() instead of ustrregexra() increased running time. Also, using the howmany expression directly with expand was less efficient than first defining a variable.
          Code:
          local rx \s-\sP?\d+\,?
          
          replace names =                /// string of words (space separated "names")
              subinstr(                  /// . replace name separator ";" with space
                  subinstr(              /// .. replace space with replacement char "_"
                      ustrregexra(       /// ... replace pattern with name separator ";"
                          names,         /// ... separated by pattern
                          "`rx'",        /// ... regex pattern \s-\sP?\d+\,?       
                          ";"            /// ... name separator ";"
                      ),                 /// ..
                      " ",               /// .. replace space
                      "_",               /// .. with replacement char "_"
                      .                  /// ..
                  ),                     /// ..
                  ";",                   /// . replace name separator ";"
                  " ",                   /// . with space
                  .                      /// .
              )                          ///
          
          gen howmany = length(names)-length(subinstr(names, ";", "", .))
          expand howmany
          
          bys names: gen wanted=trim(subinstr(word(names, _n), "_", " ", .))
          contract wanted if wanted != ""

          Comment


          • #6
            Correction: in #5 above, the
            Code:
            gen howmany = length(names)-length(subinstr(names, ";", "", .))
            expand howmany
            shoud be
            Code:
            expand wordcount(names)

            Comment

            Working...
            X