Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Splitting a string variable into unique dummy variables when the string includes double digit numbers

    Hello,
    I have a string variable containing a mixture of single and double digits and I would like to use one of Stata's string functions to split them into unique dummy variables. There was a post in which Nick Cox addressed a similar issue involving single digit strings.(https://www.statalist.org/forums/for...ummy-variables). When the string involves a mixture of single digits and double digits, however, the solution provided does not produce the desired outcome. I have tried a number of string functions but the results are not what I expect. How should I go about this? I have posted an example dataset below.

    Code:
    clear
    input float hhid str35 iterm float pid
    2299 "6"         2
    2383 "19"        1
     777 "14"        1
    1858 "37"        2
      72 "19 22"     1
      55 "31"        2
     114 "6"         1
     495 "5 6 29"    1
     869 "19"        3
    1827 "29 43"     2
     784 "5"         2
    2107 "4"         1
    1081 "4"         1
    1001 "19"        3
    2132 "19"        2
    1198 "3"         2
     664 "23"        5
     537 "5 6 29"    1
     714 "23"        1
    1547 "21"        1
     472 "5 8 29"    1
     452 "19 40"     1
    1689 "19"        1
     568 "6"         1
     359 "19"        1
     830 "3"         3
     230 "6"         3
    1581 "6 29"      1
    1799 "19 44"     2
    1395 "14 23 27"  7
     657 "31"        2
     776 "5 29"      2
      48 "5 6 19 29" 1
    1758 "31"        2
     251 "5"         1
    1915 "5 8 29"    2
    1466 "31"        7
    2304 "14"        2
     273 "3"         2
     774 "29"        2
    1090 "6 29"      2
     366 "4"         2
      68 "19"        1
     293 "15"        4
    2424 "6 8 29"    1
    2441 "3 19"      2
     327 "5"         1
     976 "19"        1
     341 "5"         3
    1680 "44"        9
     831 "19"        4
    1435 "44"        2
    2540 "5 6"       3
    2198 "5 14 19"   2
     466 "3"         2
     527 "14"        2
    2452 "19"        1
     963 "14"        6
    1612 "22"        1
    1338 "19"        3
    1108 "6 7 25"    1
     724 "5 8 29 40" 2
    1869 "6"         2
    1411 "4"         1
      75 "6"         3
     637 "14"        4
    2375 "19"        5
     254 "27"        2
    1324 "3"         1
     363 "6"         1
    2122 "22"        6
    1118 "3 14"      3
    1365 "14"        3
     629 "19 40"     1
    1791 "6"         2
     433 "15"        3
    end

  • #2
    Code:
    gen `c(obs_t)' obs_no = _n
    split iterm, gen(token)
    reshape long token, i(obs_no)
    levelsof token, local(tokens)
    foreach t of local tokens {
        by obs_no (_j): egen var_`t' = max(token == `"`t'"')
    }
    reshape wide
    drop token* obs_no

    Comment


    • #3
      Many thanks, Clyde. Appreciated!

      Comment


      • #4
        Hi Clyde, I think there is a shortcut that does the same thing:

        Code:
        replace iterm = " " + iterm + " "
        
        forv itermNo = 1/44 {
            
          gen iterm`itermNo' = regexm(iterm , " `itermNo' " )
        }

        Comment


        • #5
          I don't know why you say the approach in my thread doesn't work.

          Here's some code. I presumed that indicators (you say dummy variables, but see Section 2 in https://journals.sagepub.com/doi/ful...36867X19830921)
          that are all 0 are not helpful, but you can simplify the code if that is wrong.

          Code:
          forval j = 1/44 { 
              gen wanted`j' = length(subinword(iterm, "`j'", "", .)) < length(iterm)
              su wanted`j', meanonly 
              if r(max) == 0 drop wanted`j'
          }
          
          l iterm wanted* in 1/10
          
              +--------------------------------------------------------------------------------------------------+
            1. |  iterm | wanted3  | wanted4  | wanted5  | wanted6  | wanted7  | wanted8  | wanted14  | wanted15  |
               |      6 |       0  |       0  |       0  |       1  |       0  |       0  |        0  |        0  |
               |--------------------------------------------------------------------------------------------------|
               | wanted19 | wanted21 | wanted22 | wanted23 | wanted25 | wanted27 | wanted29 | wanted31 | wanted37 |
               |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |
               |--------------------------------+--------------------------------+--------------------------------|
               |            wanted40            |            wanted43            |            wanted44            |
               |                   0            |                   0            |                   0            |
               +--------------------------------------------------------------------------------------------------+
          
               +--------------------------------------------------------------------------------------------------+
            2. |  iterm | wanted3  | wanted4  | wanted5  | wanted6  | wanted7  | wanted8  | wanted14  | wanted15  |
               |     19 |       0  |       0  |       0  |       0  |       0  |       0  |        0  |        0  |
               |--------------------------------------------------------------------------------------------------|
               | wanted19 | wanted21 | wanted22 | wanted23 | wanted25 | wanted27 | wanted29 | wanted31 | wanted37 |
               |        1 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |
               |--------------------------------+--------------------------------+--------------------------------|
               |            wanted40            |            wanted43            |            wanted44            |
               |                   0            |                   0            |                   0            |
               +--------------------------------------------------------------------------------------------------+
          
               +--------------------------------------------------------------------------------------------------+
            3. |  iterm | wanted3  | wanted4  | wanted5  | wanted6  | wanted7  | wanted8  | wanted14  | wanted15  |
               |     14 |       0  |       0  |       0  |       0  |       0  |       0  |        1  |        0  |
               |--------------------------------------------------------------------------------------------------|
               | wanted19 | wanted21 | wanted22 | wanted23 | wanted25 | wanted27 | wanted29 | wanted31 | wanted37 |
               |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |
               |--------------------------------+--------------------------------+--------------------------------|
               |            wanted40            |            wanted43            |            wanted44            |
               |                   0            |                   0            |                   0            |
               +--------------------------------------------------------------------------------------------------+
          
               +--------------------------------------------------------------------------------------------------+
            4. |  iterm | wanted3  | wanted4  | wanted5  | wanted6  | wanted7  | wanted8  | wanted14  | wanted15  |
               |     37 |       0  |       0  |       0  |       0  |       0  |       0  |        0  |        0  |
               |--------------------------------------------------------------------------------------------------|
               | wanted19 | wanted21 | wanted22 | wanted23 | wanted25 | wanted27 | wanted29 | wanted31 | wanted37 |
               |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        1 |
               |--------------------------------+--------------------------------+--------------------------------|
               |            wanted40            |            wanted43            |            wanted44            |
               |                   0            |                   0            |                   0            |
               +--------------------------------------------------------------------------------------------------+
          
               +--------------------------------------------------------------------------------------------------+
            5. |  iterm | wanted3  | wanted4  | wanted5  | wanted6  | wanted7  | wanted8  | wanted14  | wanted15  |
               |  19 22 |       0  |       0  |       0  |       0  |       0  |       0  |        0  |        0  |
               |--------------------------------------------------------------------------------------------------|
               | wanted19 | wanted21 | wanted22 | wanted23 | wanted25 | wanted27 | wanted29 | wanted31 | wanted37 |
               |        1 |        0 |        1 |        0 |        0 |        0 |        0 |        0 |        0 |
               |--------------------------------+--------------------------------+--------------------------------|
               |            wanted40            |            wanted43            |            wanted44            |
               |                   0            |                   0            |                   0            |
               +--------------------------------------------------------------------------------------------------+
          
               +--------------------------------------------------------------------------------------------------+
            6. |  iterm | wanted3  | wanted4  | wanted5  | wanted6  | wanted7  | wanted8  | wanted14  | wanted15  |
               |     31 |       0  |       0  |       0  |       0  |       0  |       0  |        0  |        0  |
               |--------------------------------------------------------------------------------------------------|
               | wanted19 | wanted21 | wanted22 | wanted23 | wanted25 | wanted27 | wanted29 | wanted31 | wanted37 |
               |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        1 |        0 |
               |--------------------------------+--------------------------------+--------------------------------|
               |            wanted40            |            wanted43            |            wanted44            |
               |                   0            |                   0            |                   0            |
               +--------------------------------------------------------------------------------------------------+
          
               +--------------------------------------------------------------------------------------------------+
            7. |  iterm | wanted3  | wanted4  | wanted5  | wanted6  | wanted7  | wanted8  | wanted14  | wanted15  |
               |      6 |       0  |       0  |       0  |       1  |       0  |       0  |        0  |        0  |
               |--------------------------------------------------------------------------------------------------|
               | wanted19 | wanted21 | wanted22 | wanted23 | wanted25 | wanted27 | wanted29 | wanted31 | wanted37 |
               |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |
               |--------------------------------+--------------------------------+--------------------------------|
               |            wanted40            |            wanted43            |            wanted44            |
               |                   0            |                   0            |                   0            |
               +--------------------------------------------------------------------------------------------------+
          
               +--------------------------------------------------------------------------------------------------+
            8. |  iterm | wanted3  | wanted4  | wanted5  | wanted6  | wanted7  | wanted8  | wanted14  | wanted15  |
               | 5 6 29 |       0  |       0  |       1  |       1  |       0  |       0  |        0  |        0  |
               |--------------------------------------------------------------------------------------------------|
               | wanted19 | wanted21 | wanted22 | wanted23 | wanted25 | wanted27 | wanted29 | wanted31 | wanted37 |
               |        0 |        0 |        0 |        0 |        0 |        0 |        1 |        0 |        0 |
               |--------------------------------+--------------------------------+--------------------------------|
               |            wanted40            |            wanted43            |            wanted44            |
               |                   0            |                   0            |                   0            |
               +--------------------------------------------------------------------------------------------------+
          
               +--------------------------------------------------------------------------------------------------+
            9. |  iterm | wanted3  | wanted4  | wanted5  | wanted6  | wanted7  | wanted8  | wanted14  | wanted15  |
               |     19 |       0  |       0  |       0  |       0  |       0  |       0  |        0  |        0  |
               |--------------------------------------------------------------------------------------------------|
               | wanted19 | wanted21 | wanted22 | wanted23 | wanted25 | wanted27 | wanted29 | wanted31 | wanted37 |
               |        1 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |
               |--------------------------------+--------------------------------+--------------------------------|
               |            wanted40            |            wanted43            |            wanted44            |
               |                   0            |                   0            |                   0            |
               +--------------------------------------------------------------------------------------------------+
          
               +--------------------------------------------------------------------------------------------------+
           10. |  iterm | wanted3  | wanted4  | wanted5  | wanted6  | wanted7  | wanted8  | wanted14  | wanted15  |
               |  29 43 |       0  |       0  |       0  |       0  |       0  |       0  |        0  |        0  |
               |--------------------------------------------------------------------------------------------------|
               | wanted19 | wanted21 | wanted22 | wanted23 | wanted25 | wanted27 | wanted29 | wanted31 | wanted37 |
               |        0 |        0 |        0 |        0 |        0 |        0 |        1 |        0 |        0 |
               |--------------------------------+--------------------------------+--------------------------------|
               |            wanted40            |            wanted43            |            wanted44            |
               |                   0            |                   1            |                   0            |
               +--------------------------------------------------------------------------------------------------+
          See also

          SJ-22-4 . . . . . . . . . . Stata tip 148: Searching for words within strings
          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
          Q4/22 SJ 22(4):998--1003 (no commands)
          tip on searching for words within strings
          https://journals.sagepub.com/doi/pdf...6867X221141068

          Comment


          • #6
            Re #4: The code you write will work providing you know that all of the values occurring in variable iterm are numbers between 1 and 44. The code offered in #2 requires no prior knowledge of the content of iterm and it will work with any alphanumeric values separated by spaces. (Actually, it will also allow underscore characters.)

            Added: The contrast between the two approaches illustrates a more general principle in programming: there is a trade-off between generality and efficiency.
            Last edited by Clyde Schechter; 06 Jul 2024, 09:33.

            Comment


            • #7
              Note on #5. I looked at the data example and chose 44 as the highest occurring value, thinking that the OP should know what it is any way.

              #4 seems to confirm that identification of 44.

              But it is entirely possible to look at the contents of iterm to get a comprehensive list of occurring values.

              Code:
              split iterm, destring 
              foreach v in `r(varlist)' { 
                  levelsof `v', local(this)
                  local all `all' `this' 
              
              }
              
              local all : list uniq all 
              local all : list sort all 
              
              foreach v of local all { 
                  gen wanted`v' = length(subinword(iterm, "`v'", "", .)) < length(iterm)
              }
              
              ds wanted*

              Comment


              • #8
                Thank you, Clyde and Nick, for these insights.

                Comment


                • #9
                  a variantion on #7:
                  Code:
                  levelsof iterm, clean local(values)
                  numlist "`:list uniq values'", sort
                  
                  foreach n of numlist `r(numlist)' {
                      
                      gen byte iterm_`n' = subinword(iterm, "`n'", "", .) != iterm
                  }

                  Comment

                  Working...
                  X