How to find the amount of gaps in a natural number sequence?

Patrick Lambertus

Join Date: May 2020

Posts: 10
#1


07 Dec 2020, 09:37

Dear Stata community,

I have some oddly formatted data, as in the following example dataset:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str22 myvar float goal
"1:2:3:10:11:12:"        2
"1:2:3:4:"               1
"6:7:15:16:22:23:24:25:" 3
"13:18:23:"              3
end

I have the myvar variable, and the goal variable is what I need. myvar is currently a string containing numbers separated by ":". The goal variable essentially indicates how many gaps there are in the natural number sequence between the first and last number in the string, plus 1. Hence, "1:2:3:4:" returns 1 and "13:18:23:" returns 3.

My problem is that I do not know how to have Stata compute the goal var for me, when I only have myvar. Could anyone please help me with this?

Thanks,

Patrick

Justin Niakamal

Join Date: Aug 2017
Posts: 760
#2

07 Dec 2020, 10:06

Someone will probably have a more elegant solution, but this gives the desired result from your data example.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str22 myvar float goal
"1:2:3:10:11:12:"        2
"1:2:3:4:"               1
"6:7:15:16:22:23:24:25:" 3
"13:18:23:"              3
end

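drop goal
* split the string into one numeric variable per value
split myvar, parse(":") gen(s_)
destring s_* , replace

* numeric j keeps positions in order even with 10 or more values
reshape long s_, i(myvar) j(var)
drop if mi(s_)

* flag = 0 marks the first value of each run of consecutive integers
bys myvar (var): gen flag = (s_-s_[_n-1] == 1)
keep if flag == 0

* one row per sequence; goal = number of runs = gaps + 1
contract myvar, freq(goal)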

list, noobs 

  +-------------------------------+
  |                  myvar   goal |
  |-------------------------------|
  |              13:18:23:      3 |
  |        1:2:3:10:11:12:      2 |
  |               1:2:3:4:      1 |
  | 6:7:15:16:22:23:24:25:      3 |
  +-------------------------------+


Nick Cox

Join Date: Mar 2014

Posts: 35696
#3

07 Dec 2020, 10:09

What's a gap? I see no omitted values in 1 2 3 4.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

07 Dec 2020, 13:42

The goal variable essentially indicates how many gaps there are in the natural number sequence between the first and last number in the string, plus 1.
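
To make that definition concrete: goal equals the number of runs of consecutive integers in the string, which is the number of gaps plus 1. Below is a minimal sketch for a single sequence held in a local macro rather than in a variable. The macro names seq, vals, runs and prev are illustrative and do not come from the thread.

```stata
* count runs of consecutive integers in one colon-separated sequence
local seq "1:2:3:10:11:12:"

* turn the colons into spaces so the values can be looped over
local vals : subinstr local seq ":" " ", all

local runs = 0
local prev = .
foreach v of local vals {
    * a new run starts at the first value (the difference is missing)
    * and wherever the step from the previous value is not exactly 1
    if `v' - `prev' != 1 {
        local ++runs
    }
    local prev = `v'
}

display `runs'
```

For this sequence the sketch displays 2; with "13:18:23:" it would display 3. It assumes the values are in increasing order, as in the example data.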
Nick Cox

Join Date: Mar 2014

Posts: 35696
#5

07 Dec 2020, 17:56

Fair enough!
Patrick Lambertus

Join Date: May 2020

Posts: 10
#6

08 Dec 2020, 09:21

Originally posted by Justin Niakamal

Someone will probably have a more elegant solution, but this gives the desired result from your data example.

This solution indeed works with the data example, but unfortunately would not when the dataset also contains other variables as they would get dropped. Or is there a way to prevent that from happening?

Thanks anyways for the contribution, but I hope someone could find a solution that I could use in my actual dataset too.

Nick Cox

Join Date: Mar 2014
Posts: 35696
#7

08 Dec 2020, 09:44

This solution assumes that the values are in increasing order. If they are not, sort them within each row first; see

Code:

search rowsort, sj

Code:

clear
input str22 myvar float goal
"1:2:3:10:11:12:"        2
"1:2:3:4:"               1
"6:7:15:16:22:23:24:25:" 3
"13:18:23:"              3
end

split myvar, parse(:) destring
local nvars : word count `r(varlist)'
gen gaps = 1

forval j = 2/`nvars' {
   local i = `j' - 1
   replace gaps = gaps + inrange(myvar`j' - myvar`i', 2, .)
}

list

     +--------------------------------------------------------------------------------------------------------------+
     |                  myvar   goal   myvar1   myvar2   myvar3   myvar4   myvar5   myvar6   myvar7   myvar8   gaps |
     |--------------------------------------------------------------------------------------------------------------|
  1. |        1:2:3:10:11:12:      2        1        2        3       10       11       12        .        .      2 |
  2. |               1:2:3:4:      1        1        2        3        4        .        .        .        .      1 |
  3. | 6:7:15:16:22:23:24:25:      3        6        7       15       16       22       23       24       25      3 |
  4. |              13:18:23:      3       13       18       23        .        .        .        .        .      3 |
     +--------------------------------------------------------------------------------------------------------------+
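
The increasing-order assumption mentioned at the top of this post can be checked before counting. Below is a minimal sketch, assuming the same colon-separated layout. The stub v and the variable unsorted are illustrative names, and the second observation is made up purely to show a violation.

```stata
clear
input str22 myvar
"1:2:3:10:11:12:"
"3:1:2:"
end

split myvar, parse(:) destring generate(v)
local nvars : word count `r(varlist)'

* flag rows in which any value is smaller than its predecessor;
* trailing missings never trigger the flag because missing sorts high
generate byte unsorted = 0
forvalues j = 2/`nvars' {
    local i = `j' - 1
    replace unsorted = 1 if v`j' < v`i'
}

list myvar unsorted
```

Rows with unsorted equal to 1 would need their values sorted before the differences between neighbours mean anything.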


William Lisowski

Join Date: Dec 2014
Posts: 10150
#8

08 Dec 2020, 09:47

In what follows, the variable id is created to identify the original observations. If the actual dataset includes an identifier (I'm unwilling to assume the sequences you show are not duplicated within the actual dataset) then you can use that identifier rather than generate and later drop the variable id. If this example exposes further problems, then you should better describe your actual dataset.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str22 myvar float(goal another)
"1:2:3:10:11:12:"        2 22
"1:2:3:4:"               1 17
"6:7:15:16:22:23:24:25:" 3  8
"13:18:23:"              3  1
end

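* id identifies the original observations
generate id = _n, before(myvar)

split myvar, parse(":") gen(s_)
destring s_* , replace

* numeric j keeps positions in order even with 10 or more values
reshape long s_, i(id) j(var)
drop if missing(s_)

* the first difference is missing, which counts as > 1 and supplies the "plus 1"
bysort id (var): gen isgoal = sum((s_-s_[_n-1] > 1))
bysort id (var): keep if _n==_N
drop id var s_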

list, noobs

Code:

. list, noobs 

  +--------------------------------------------------+
  |                  myvar   goal   another   isgoal |
  |--------------------------------------------------|
  |        1:2:3:10:11:12:      2        22        2 |
  |               1:2:3:4:      1        17        1 |
  | 6:7:15:16:22:23:24:25:      3         8        3 |
  |              13:18:23:      3         1        3 |
  +--------------------------------------------------+


Patrick Lambertus

Join Date: May 2020

Posts: 10
#9

09 Dec 2020, 04:28

Thank you very much Nick Cox and William Lisowski for your clever solutions.

I have tested both on my own data, and they both work.
However, if anyone wants to use these solutions in the future, I would like to note that William's code will drop observations from the dataset which are missing for myvar. So, if myvar has missing observations, use Nick's code.

Thanks again, this problem can now be regarded as solved.
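
A likely reason Nick's loop copes with missing values is the behaviour of inrange(): with a missing upper bound it tests only the lower bound, and a missing first argument returns 0. A quick check of the three cases that arise in the loop:

```stata
* difference of 7 between neighbours: counted as a gap
display inrange(7, 2, .)

* consecutive values, difference of 1: not counted
display inrange(1, 2, .)

* missing difference (past the end of a shorter sequence): not counted
display inrange(., 2, .)
```

These display 1, 0 and 0. An observation with an empty myvar therefore keeps the starting value gaps = 1, which may or may not be the desired result for an empty sequence.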
William Lisowski

Join Date: Dec 2014

Posts: 10150
#10

09 Dec 2020, 06:58

My expectation is that anyone who wants to apply the solution in post #8 to their own problem in the future will not treat the code as some sort of talisman to be carefully copied and pasted into their program. Instead, they will take the time to study the code, learn from it, understand how it produces the results it does, and will then adapt it to their needs, making for example the simple changes needed to handle missing values to produce the result below. The result is not only a solution to their immediate problem, but an increase in their knowledge of Stata that will help in their future work.

Code:

. list, noobs

  +--------------------------------------------------+
  |                  myvar   goal   another   isgoal |
  |--------------------------------------------------|
  |        1:2:3:10:11:12:      2        22        2 |
  |               1:2:3:4:      1        17        1 |
  |                             .        42        . |
  | 6:7:15:16:22:23:24:25:      3         8        3 |
  |              13:18:23:      3         1        3 |
  +--------------------------------------------------+

Last edited by William Lisowski; 09 Dec 2020, 07:12.
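
One possible adaptation, offered as a sketch and not necessarily the change William had in mind: keep the first position of every sequence even when it is missing, so that an observation with an empty myvar survives the drop and ends up with a missing count. The row with another = 42 mirrors the listing above; the numeric j variable pos is an illustrative choice.

```stata
clear
input str22 myvar float(goal another)
"1:2:3:10:11:12:"        2 22
"1:2:3:4:"               1 17
""                       . 42
"6:7:15:16:22:23:24:25:" 3  8
"13:18:23:"              3  1
end

generate id = _n
split myvar, parse(":") generate(s_) destring
reshape long s_, i(id) j(pos)

* keep position 1 even when missing, so empty sequences are not lost
drop if missing(s_) & pos > 1

* running count of runs; left missing where there is no value at all
bysort id (pos): generate isgoal = sum(s_ - s_[_n-1] > 1) if !missing(s_)
bysort id (pos): keep if _n == _N
drop id pos s_

list, noobs
```

Whether an empty sequence should get a missing count or some other value depends on the actual dataset, so the condition in the drop line is the part to adapt.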

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

#11

11 Dec 2020, 10:45

Any example data should be representative of the data to be used; if they are not, a realistic description of the data should accompany the example. For longer sequences, possibly in combination with many more observations, one might have to change strategy to avoid creating a lot of variables (as -split- followed by repeated -replace- does) and to avoid -reshape-.

To illustrate that solutions developed on example data may not carry over to other data, let's make another example dataset (N = 23,000) in which about 43% of the observations have a longer sequence (100 integers) and about 43% have a sequence without any gaps.

Code:

----------------------------------------------------------------------------
          myvar                                            | Freq.  Percent  
-----------------------------------------------------------+----------------
Valid   13:18:23:                                          |  1000    4.35  
        1:2:3:10:11:12:                                    |  1000    4.35  
        1:2:3:4:                                           | 10000   43.48  
        1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20 | 10000   43.48  
        :21:22:23:24:25:26:27:28:29:30:31:32:33:34:35:36:3 |                
        7:38:39:40:41:42:43:44:45:46:47:48:49:50:52:53:54: |                
        55:56:57:58:59:60:61:62:63:64:65:66:67:68:69:70:71 |                
        :72:73:74:75:76:77:78:79:80:81:82:83:84:85:86:87:8 |                
        8:89:90:91:92:93:94:95:96:97:98:99:100:101:        |                
        6:7:15:16:22:23:24:25:                             |  1000    4.35  
        Total                                              | 23000  100.00  
----------------------------------------------------------------------------

A possible strategy for longer sequences or larger data: frames can be used to keep only the distinct sequences, and Mata can count the gaps (without generating one Stata variable per integer).

Code:

sort myvar 
frame put myvar if myvar[_n-1] != myvar , into(seq)
frame change seq
mata: gaps("myvar")
frame change default
frlink m:1 myvar, frame(seq)
frget ngaps = ngaps, from(seq)
frame drop seq

Timing results:

Code:

. timer list
   7:     78.13 /       10 =       7.8129       split + foreach
   8:   1134.02 /       10 =     113.4021       split + reshape
  99:      0.66 /       10 =       0.0659       frames/mata

Code used for the timings:

Code:

clear all 
input str500 myvar float(goal)
"1:2:3:10:11:12:"        2 
"1:2:3:4:"               1 
"6:7:15:16:22:23:24:25:" 3 
"13:18:23:"              3 
"1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:34:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:52:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:69:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:86:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:" 2
end

compress
expand 10 if inlist( _n ,  2 , 5 ) 
expand 1000

********************************************************************************
mata : 

void gaps( string scalar intvar )

{    
    real scalar r
    real scalar i
    real scalar j
    real scalar ngaps
    real rowvector ints 
    string colvector seq

    seq = st_sdata(., intvar)

    st_addvar("int", "ngaps", 1 )

    for ( r=1; r<=length(seq); r++ ) { 

        ngaps = 0

        ints = strtoreal(ustrsplit(seq[r], ":")) 
        
        if ( ints != (range(1,length(ints)-1,1)) ) { 
            
            for ( i=2; i<=length(ints); i++ ) {
                
                j = i - 1

                ngaps = ngaps + ( ints[i] - ints[j] > 1 )   
            }
        }
        
        else {
            
            ngaps = ngaps + 1 
        }
        
        st_store(r, "ngaps", ngaps)
    }    
}

end
********************************************************************************

qui forvalues i = 1/10 {

    keep myvar goal

    timer on 7

        split myvar, parse(:) destring
        local nvars : word count `r(varlist)'
        gen gaps = 1

        forval j = 2/`nvars' {
           local i = `j' - 1
           replace gaps = gaps + inrange(myvar`j' - myvar`i', 2, .)
        }
        
    timer off 7

    keep myvar goal

    timer on 8

        generate id = _n, before(myvar)

        split myvar, parse(":") gen(s_)
        destring s_* , replace 

        reshape long s, i(id) j(var)string
        drop if missing(s)

        bysort id (var): gen isgoal = sum((s-s[_n-1] > 1))
        bysort id (var): keep if _n==_N
        drop id var s

    timer off 8
    
    timer on 99

        sort myvar 
        frame put myvar if myvar[_n-1] != myvar , into(seq)
        frame change seq
        mata: gaps("myvar")
        frame change default
        frlink m:1 myvar, frame(seq)
        frget ngaps = ngaps, from(seq)
        frame drop seq

    timer off 99
    
    gen rand = runiform()
    sort rand
    drop rand
}

timer list


Nick Cox

Join Date: Mar 2014

Posts: 35696
#12

11 Dec 2020, 11:14

Really smart code as in #11 is needed for very large datasets and/or doing the task again and again. The timer results not shown here are how long it took to write the code.