  • How to find the amount of gaps in a natural number sequence?

    Dear Stata community,

    I have some oddly formatted data, as in the following example dataset:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str22 myvar float goal
    "1:2:3:10:11:12:"        2
    "1:2:3:4:"               1
    "6:7:15:16:22:23:24:25:" 3
    "13:18:23:"              3
    end
    I have the myvar variable, and the goal variable is what I need. myvar is currently a string of numbers separated by ":". The goal variable is the number of gaps in the natural number sequence between the first and last number in the string, plus 1. Hence, "1:2:3:4:" returns 1 and "13:18:23:" returns 3.

    My problem is that I do not know how to have Stata compute the goal var for me, when I only have myvar. Could anyone please help me with this?

    Thanks,

    Patrick

  • #2
    Someone will probably have a more elegant solution, but this gives the desired result from your data example.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str22 myvar float goal
    "1:2:3:10:11:12:"        2
    "1:2:3:4:"               1
    "6:7:15:16:22:23:24:25:" 3
    "13:18:23:"              3
    end
    
    drop goal
    split myvar, parse(":") gen(s_)
    destring s_* , replace 
    
    * numeric j keeps positions in order even with 10 or more numbers per string
    reshape long s_, i(myvar) j(var)
    drop if mi(s_)
    
    bys myvar (var): gen flag = (s_ - s_[_n-1] == 1)
    keep if flag == 0 
    
    contract myvar, freq(goal)
    
    list, noobs 
    
      +-------------------------------+
      |                  myvar   goal |
      |-------------------------------|
      |              13:18:23:      3 |
      |        1:2:3:10:11:12:      2 |
      |               1:2:3:4:      1 |
      | 6:7:15:16:22:23:24:25:      3 |
      +-------------------------------+


    • #3
      What's a gap? I see no omitted values in 1 2 3 4.


      • #4
        The goal variable essentially indicates how many gaps there are in the natural number sequence between the first and last number in the string, plus 1.


        • #5
          Fair enough!


          • #6
            Originally posted by Justin Blasongame:
            Someone will probably have a more elegant solution, but this gives the desired result from your data example.
            This solution does work with the data example, but unfortunately it would not work when the dataset also contains other variables, as they would get dropped. Is there a way to prevent that from happening?

            Thanks anyway for the contribution, but I hope someone can find a solution that I can use in my actual dataset too.


            • #7
              This solution assumes that the numbers in each string are in increasing order. If they are not, sort them within each observation first, for example with rowsort:

              Code:
              search rowsort, sj
              Code:
              clear
              input str22 myvar float goal
              "1:2:3:10:11:12:"        2
              "1:2:3:4:"               1
              "6:7:15:16:22:23:24:25:" 3
              "13:18:23:"              3
              end
              
              split myvar, parse(:) destring
              local nvars : word count `r(varlist)'
              gen gaps = 1
              
              forval j = 2/`nvars' {
                 local i = `j' - 1
                 replace gaps = gaps + inrange(myvar`j' - myvar`i', 2, .)
              }
              
              list
              
                   +--------------------------------------------------------------------------------------------------------------+
                   |                  myvar   goal   myvar1   myvar2   myvar3   myvar4   myvar5   myvar6   myvar7   myvar8   gaps |
                   |--------------------------------------------------------------------------------------------------------------|
                1. |        1:2:3:10:11:12:      2        1        2        3       10       11       12        .        .      2 |
                2. |               1:2:3:4:      1        1        2        3        4        .        .        .        .      1 |
                3. | 6:7:15:16:22:23:24:25:      3        6        7       15       16       22       23       24       25      3 |
                4. |              13:18:23:      3       13       18       23        .        .        .        .        .      3 |
                   +--------------------------------------------------------------------------------------------------------------+
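
              If you would rather not install rowsort, unordered arguments could also be handled with built-in commands only. This is a minimal sketch, not the approach above: the example strings and the names id, pos, and n_ are made up for illustration, and it reshapes to long form so that the values of each observation can be sorted before the gaps are counted.

              ```stata
              clear
              input str22 myvar
              "3:1:2:12:10:11:"
              "23:13:18:"
              end

              * id marks the original observations (hypothetical name)
              generate id = _n
              split myvar, parse(:) generate(n_) destring
              reshape long n_, i(id) j(pos)
              drop if missing(n_)

              * sort the values within each id, then count jumps larger than 1;
              * the first value is compared with missing, which supplies the "plus 1"
              bysort id (n_): generate gaps = sum(n_ - n_[_n-1] > 1)
              bysort id (n_): keep if _n == _N
              keep id myvar gaps
              list, noobs
              ```

              Note that the final keep drops any other variables, and that an observation with empty myvar would be lost at the drop step, so both lines would need adapting for a real dataset.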


              • #8
                In what follows, the variable id is created to identify the original observations. If the actual dataset includes an identifier (I'm unwilling to assume the sequences you show are not duplicated within the actual dataset) then you can use that identifier rather than generate and later drop the variable id. If this example exposes further problems, then you should better describe your actual dataset.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str22 myvar float(goal another)
                "1:2:3:10:11:12:"        2 22
                "1:2:3:4:"               1 17
                "6:7:15:16:22:23:24:25:" 3  8
                "13:18:23:"              3  1
                end
                
                generate id = _n, before(myvar)
                
                split myvar, parse(":") gen(s_)
                destring s_* , replace 
                
                * numeric j keeps positions in order even with 10 or more numbers per string
                reshape long s_, i(id) j(var)
                drop if missing(s_)
                
                bysort id (var): gen isgoal = sum(s_ - s_[_n-1] > 1)
                bysort id (var): keep if _n==_N
                drop id var s_
                
                list, noobs
                Code:
                . list, noobs 
                
                  +--------------------------------------------------+
                  |                  myvar   goal   another   isgoal |
                  |--------------------------------------------------|
                  |        1:2:3:10:11:12:      2        22        2 |
                  |               1:2:3:4:      1        17        1 |
                  | 6:7:15:16:22:23:24:25:      3         8        3 |
                  |              13:18:23:      3         1        3 |
                  +--------------------------------------------------+


                • #9
                  Thank you very much Nick Cox and William Lisowski for your clever solutions.

                  I have tested both on my own data, and they both work.
                  However, for anyone who wants to use these solutions in the future, I would like to note that William's code drops observations for which myvar is missing. So, if myvar has missing values, use Nick's code.

                  Thanks again, this problem can now be regarded as solved.


                  • #10
                    My expectation is that anyone who wants to apply the solution in post #8 to their own problem in the future will not treat the code as some sort of talisman to be carefully copied and pasted into their program. Instead, they will take the time to study the code, learn from it, understand how it produces the results it does, and will then adapt it to their needs, making for example the simple changes needed to handle missing values to produce the result below. The result is not only a solution to their immediate problem, but an increase in their knowledge of Stata that will help in their future work.
                    Code:
                    . list, noobs
                    
                      +--------------------------------------------------+
                      |                  myvar   goal   another   isgoal |
                      |--------------------------------------------------|
                      |        1:2:3:10:11:12:      2        22        2 |
                      |               1:2:3:4:      1        17        1 |
                      |                             .        42        . |
                      | 6:7:15:16:22:23:24:25:      3         8        3 |
                      |              13:18:23:      3         1        3 |
                      +--------------------------------------------------+
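
                    For readers who want to compare notes after trying it themselves, here is one minimal sketch of such changes. It is an assumption about what they could look like, not necessarily the code that produced the listing above: position 1 of every id is kept so that an observation with empty myvar survives, and isgoal is then set to missing for it. It also uses a numeric j in -reshape- so that positions sort correctly when a string holds 10 or more numbers.

                    ```stata
                    clear
                    input str22 myvar float(goal another)
                    "1:2:3:10:11:12:"        2 22
                    "1:2:3:4:"               1 17
                    ""                       . 42
                    "6:7:15:16:22:23:24:25:" 3  8
                    "13:18:23:"              3  1
                    end

                    generate id = _n, before(myvar)
                    split myvar, parse(":") gen(s_)
                    destring s_*, replace
                    reshape long s_, i(id) j(var)

                    * keep position 1 for every id so rows with empty myvar are not lost
                    drop if missing(s_) & var > 1
                    bysort id (var): gen isgoal = sum(s_ - s_[_n-1] > 1)
                    bysort id (var): keep if _n == _N
                    * an empty string has no sequence, so report missing rather than 1
                    replace isgoal = . if missing(myvar)
                    drop id var s_

                    list, noobs
                    ```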
                    Last edited by William Lisowski; 09 Dec 2020, 07:12.


                    • #11
                      Any example data should be representative of the data to be used; if they are not, a realistic description of the data should follow the example. For longer sequences, possibly in combination with many more observations, one might have to change strategy to avoid creating a lot of variables (using -split- followed by repeated -replace-) and to avoid -reshape-.
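
                      As one illustration of such a change of strategy in plain Stata, the sketch below peels one number at a time off the front of a working copy of the string, so only a handful of variables exist at any moment. It is a hypothetical example, not the approach timed later in this post: the names rest, prev, cur, and ngaps are invented here, and it assumes the numbers are in increasing order.

                      ```stata
                      clear
                      input str22 myvar
                      "1:2:3:10:11:12:"
                      "1:2:3:4:"
                      "6:7:15:16:22:23:24:25:"
                      "13:18:23:"
                      end

                      * rest is a working copy that is consumed from the left
                      generate rest = myvar
                      * make sure every number, including the last, is followed by ":"
                      replace rest = rest + ":" if rest != "" & substr(rest, -1, 1) != ":"
                      generate double prev = .
                      generate ngaps = 1

                      count if rest != ""
                      while r(N) > 0 {
                          * current number = text before the first ":"
                          generate double cur = real(substr(rest, 1, strpos(rest, ":") - 1)) if rest != ""
                          replace ngaps = ngaps + 1 if cur - prev > 1 & !missing(cur, prev)
                          replace prev = cur if !missing(cur)
                          * remove the number just read, together with its ":"
                          replace rest = substr(rest, strpos(rest, ":") + 1, .) if rest != ""
                          drop cur
                          count if rest != ""
                      }
                      drop rest prev
                      list, noobs
                      ```

                      The loop runs once per position in the longest sequence, so whether it beats -split- on real data would have to be timed.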

                      To illustrate that solutions for example data may not carry over to other data, let's make another example dataset (N=23000) where about 43% of the observations have a longer sequence (100 integers) and about 43% have a sequence without any gaps.
                      Code:
                      ----------------------------------------------------------------------------
                                myvar                                            | Freq.  Percent  
                      -----------------------------------------------------------+----------------
                      Valid   13:18:23:                                          |  1000    4.35  
                              1:2:3:10:11:12:                                    |  1000    4.35  
                              1:2:3:4:                                           | 10000   43.48  
                              1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20 | 10000   43.48  
                              :21:22:23:24:25:26:27:28:29:30:31:32:33:34:35:36:3 |                
                              7:38:39:40:41:42:43:44:45:46:47:48:49:50:52:53:54: |                
                              55:56:57:58:59:60:61:62:63:64:65:66:67:68:69:70:71 |                
                              :72:73:74:75:76:77:78:79:80:81:82:83:84:85:86:87:8 |                
                              8:89:90:91:92:93:94:95:96:97:98:99:100:101:        |                
                              6:7:15:16:22:23:24:25:                             |  1000    4.35  
                              Total                                              | 23000  100.00  
                      ----------------------------------------------------------------------------
                      A possible strategy for longer sequences or large data: frames can be used to keep only the distinct sequences, and Mata can count the gaps (without generating one Stata variable per integer).
                      Code:
                      sort myvar 
                      frame put myvar if myvar[_n-1] != myvar , into(seq)
                      frame change seq
                      mata: gaps("myvar")
                      frame change default
                      frlink m:1 myvar, frame(seq)
                      frget ngaps = ngaps, from(seq)
                      frame drop seq
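
                      Frames were introduced in Stata 16. On an older Stata, the same idea of counting gaps once per distinct sequence could be sketched with -preserve-, a temporary file, and -merge-. This is an illustration under assumptions rather than part of the timings: the gap count uses the -split- loop from #7 in place of the Mata function, and the stub n and the tempfile name seq are made up.

                      ```stata
                      clear
                      input str22 myvar
                      "1:2:3:10:11:12:"
                      "1:2:3:4:"
                      "1:2:3:4:"
                      "13:18:23:"
                      end

                      preserve
                          * work on the distinct sequences only
                          keep myvar
                          duplicates drop
                          split myvar, parse(:) generate(n) destring
                          local nvars : word count `r(varlist)'
                          generate ngaps = 1
                          forvalues j = 2/`nvars' {
                              local i = `j' - 1
                              replace ngaps = ngaps + inrange(n`j' - n`i', 2, .)
                          }
                          keep myvar ngaps
                          tempfile seq
                          save `seq'
                      restore

                      * bring the counts back to the full dataset
                      merge m:1 myvar using `seq', nogenerate
                      list, noobs
                      ```

                      Because of the temporary file and the local macros, this is meant to be run as one do-file rather than line by line.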
                      Timing results:
                      Code:
                      . timer list
                         7:     78.13 /       10 =       7.8129       split + foreach
                         8:   1134.02 /       10 =     113.4021       split + reshape
                        99:      0.66 /       10 =       0.0659       frames/mata
                      Code used for the timings:
                      Code:
                      clear all 
                      input str500 myvar float(goal)
                      "1:2:3:10:11:12:"        2 
                      "1:2:3:4:"               1 
                      "6:7:15:16:22:23:24:25:" 3 
                      "13:18:23:"              3 
                      "1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:34:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:52:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:69:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:86:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:" 2
                      end
                      
                      compress
                      expand 10 if inlist( _n ,  2 , 5 ) 
                      expand 1000
                      
                      ********************************************************************************
                      mata : 
                      
                      void gaps( string scalar intvar )
                      
                      {    
                          real scalar r
                          real scalar i
                          real scalar j
                          real scalar ngaps
                          real rowvector ints 
                          string colvector seq
                      
                          seq = st_sdata(., intvar)
                      
                          st_addvar("int", "ngaps", 1 )
                      
                          for ( r=1; r<=length(seq); r++ ) { 
                      
                              ngaps = 0
                      
                              ints = strtoreal(ustrsplit(seq[r], ":")) 
                              
                              if ( ints != (range(1,length(ints)-1,1)) ) { 
                                  
                                  for ( i=2; i<=length(ints); i++ ) {
                                      
                                      j = i - 1
                      
                                      ngaps = ngaps + ( ints[i] - ints[j] > 1 )   
                                  }
                              }
                              
                              else {
                                  
                                  ngaps = ngaps + 1 
                              }
                              
                              st_store(r, "ngaps", ngaps)
                          }    
                      }
                      
                      end
                      ********************************************************************************
                      
                      qui forvalues i = 1/10 {
                      
                          keep myvar goal
                      
                          timer on 7
                      
                              split myvar, parse(:) destring
                              local nvars : word count `r(varlist)'
                              gen gaps = 1
                      
                              forval j = 2/`nvars' {
                                 local i = `j' - 1
                                 replace gaps = gaps + inrange(myvar`j' - myvar`i', 2, .)
                              }
                              
                          timer off 7
                      
                          keep myvar goal
                      
                          timer on 8
                      
                              generate id = _n, before(myvar)
                      
                              split myvar, parse(":") gen(s_)
                              destring s_* , replace 
                      
                              reshape long s, i(id) j(var)string
                              drop if missing(s)
                      
                              bysort id (var): gen isgoal = sum((s-s[_n-1] > 1))
                              bysort id (var): keep if _n==_N
                              drop id var s
                      
                          timer off 8
                          
                          timer on 99
                      
                              sort myvar 
                              frame put myvar if myvar[_n-1] != myvar , into(seq)
                              frame change seq
                              mata: gaps("myvar")
                              frame change default
                              frlink m:1 myvar, frame(seq)
                              frget ngaps = ngaps, from(seq)
                              frame drop seq
                      
                          timer off 99
                          
                          gen rand = runiform()
                          sort rand
                          drop rand
                      }
                      
                      timer list


                      • #12
                        Really smart code as in #11 is needed for very large datasets and/or doing the task again and again. The timer results not shown here are how long it took to write the code.
