Stacking String Long

Luis Pecht

Join Date: May 2017

Posts: 151
#1

Stacking String Long

26 Apr 2022, 10:52

Hi, I need a help here.

visit_history variable has the visits for each unique id and it might span to more than 100 events in this strL:

Code:

clear input float id strL visit_history 1 "2021-06-03,2020-12-07" 2 "2021-07-30,2020-12-28,2021-02-02" 3 "2020-09-12" 4 "2021-03-10,2021-02-02" end

datasets is big, about 2MM ids, and I am looking for an efficiently way to convert that to:

Code:

clear input float id strL visit_history 1 "2021-06-03" 1 "2020-12-07" 2 "2021-07-30" 2 "2020-12-28" 3 "2020-09-12" 4 "2021-03-10" 4 "2021-02-02" end

Mata coding is welcome.

Glad for any clue about this. thks.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

26 Apr 2022, 10:58

Code:

split visit_history, gen(vh) parse(",") reshape long vh, i(id) drop if missing(vh) drop visit_history _j rename vh visit_history

Note: This code assumes, but does not verify, that the dates within a given observation of visit_history are separated by commas, and that commas are not otherwise used in that variable.

Added: With 2 million IDs this code is going to be slow. I don't think there is any way to speed up the -split- command, but consider the user-written commands -tolong- or -greshape- instead of -reshape-. If I recall correctly, both are available from SSC, with -greshape- as part of the -gtools- package.

Last edited by Clyde Schechter; 26 Apr 2022, 11:02.
2 likes
Comment
Luis Pecht

Join Date: May 2017

Posts: 151
#3

26 Apr 2022, 13:30

Thanks, Clyde for your quick reply.

The code you suggested turned out not be scalable in my MP2, for 2MM dataset. The "split" chunk is running for +2 hours ...

Can you advise me any other approach ?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#4

26 Apr 2022, 13:59

I can think of other approaches that would be slower. I can't think of anything that would be faster. 2 hours is not a long time to process a data set of this size. Find something else to occupy yourself while it runs. Let it run overnight so you don't have to sit their getting impatient. I have run analyses that have taken several weeks to conclude. Big data sets are hard that way.
Comment
Luis Pecht

Join Date: May 2017

Posts: 151
#5

26 Apr 2022, 14:19

Yes, the point with the "wait till it finish" approach, is to have my stata+cpu busy that long, having other demands to solve. Maybe it time to buy another Stata license.

thanks again, Clyde.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10195

26 Apr 2022, 15:49

Try

Code:

clear
input float id strL visit_history
1 "2021-06-03,2020-12-07"
2 "2021-07-30,2020-12-28,2021-02-02"
3 "2020-09-12"          
4 "2021-03-10,2021-02-02"
end

gen howmany= 1+length(ustrregexra(visit_history, "[^\,]", ""))
expand howmany
bys id: replace visit_history= word(ustrregexra(visit_history, "[\,]", " "), _n)

I think creating extra variables using split and reshaping may not be very efficient. Here, sorting and perhaps the regular expressions may be time consuming, but maybe not 2+ hours.

Res.:

Code:

. l, sepby(id)

     +---------------------------+
     | id   visit_hi~y   howmany |
     |---------------------------|
  1. |  1   2021-06-03         2 |
  2. |  1   2020-12-07         2 |
     |---------------------------|
  3. |  2   2021-07-30         3 |
  4. |  2   2020-12-28         3 |
  5. |  2   2021-02-02         3 |
     |---------------------------|
  6. |  3   2020-09-12         1 |
     |---------------------------|
  7. |  4   2021-03-10         2 |
  8. |  4   2021-02-02         2 |
     +---------------------------+

Last edited by Andrew Musau; 26 Apr 2022, 16:22.

Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10195

26 Apr 2022, 16:10

-subinstr- should be much faster.

Code:

clear
input float id strL visit_history
1 "2021-06-03,2020-12-07"
2 "2021-07-30,2020-12-28,2021-02-02"
3 "2020-09-12"          
4 "2021-03-10,2021-02-02"
end

gen howmany= length(visit_history)- length(subinstr(visit_history, ",", "", .)) + 1
expand howmany
bys id: replace visit_history= word(subinstr(visit_history, ",", " ", .), _n)

Last edited by Andrew Musau; 26 Apr 2022, 16:21.

Comment

Luis Pecht

Join Date: May 2017
Posts: 151

26 Apr 2022, 17:31

thanks Andrew to jump in. I will give a try on your Regex approach.

Meanwhile, I did some research and came up with this Mata solution:

Code:

clear
input float id strL visit_history
1 "2021-06-03,2020-12-07"
2 "2021-07-30,2020-12-28,2021-02-02"
3 "2020-09-12"          
4 "2021-03-10,2021-02-02"
end
tostring id, replace
mata
x=st_sdata(.,("id","visit_history"))
mata stata clear
y=J(0,2,"")
t = tokeninit("",",")
for (i=1; i<=rows(x); i++){
tokenset(t, x[i,2])
while ( (token = tokenget( t)) != "") {
y=y\(x[i,1],token)
}
}

output:

Code:

: y
                 1            2
     +---------------------------+
   1 |           1   2021-06-03  |
   2 |           1            ,  |
   3 |           1   2020-12-07  |
   4 |           2   2021-07-30  |
   5 |           2            ,  |
   6 |           2   2020-12-28  |
   7 |           2            ,  |
   8 |           2   2021-02-02  |
   9 |           3   2020-09-12  |
  10 |           4   2021-03-10  |
  11 |           4            ,  |
  12 |           4   2021-02-02  |
     +---------------------------+

Andrew , as I have already witnessed your ability on Mata, any clue on how avoiding the parser "," while generating the y matrix?

thks, again.

Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10195

26 Apr 2022, 18:09

You can add a condition such as

Code:

clear
input float id strL visit_history
1 "2021-06-03,2020-12-07"
2 "2021-07-30,2020-12-28,2021-02-02"
3 "2020-09-12"          
4 "2021-03-10,2021-02-02"
end
tostring id, replace
mata
x=st_sdata(.,("id","visit_history"))
mata stata clear
y=J(0,2,"")
t = tokeninit("",",")
for (i=1; i<=rows(x); i++){
tokenset(t, x[i,2])
while ( (token = tokenget( t)) != "") {
y=y\(x[i,1],token)
y= select(y, y[,2] :!=",")
}
}

y
end

Res.:

Code:

: y
                1            2
    +---------------------------+
  1 |           1   2021-06-03  |
  2 |           1   2020-12-07  |
  3 |           2   2021-07-30  |
  4 |           2   2020-12-28  |
  5 |           2   2021-02-02  |
  6 |           3   2020-09-12  |
  7 |           4   2021-03-10  |
  8 |           4   2021-02-02  |
    +---------------------------+

Comment

Luis Pecht

Join Date: May 2017
Posts: 151

#10

26 Apr 2022, 18:26

Tried you subinstr approach in the 2MM dataset and found that my Mata code is not needed in here.

Code:

. desc

Contains data
 Observations:     2,446,852                  
    Variables:             2                  
--------------------------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------------------------------------------------------
id              float   %9.0g                 
visit_history   strL    %9s                   
--------------------------------------------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

timer clear
timer on 1
gen howmany= length(visit_history)- length(subinstr(visit_history, ",", "", .)) + 1
expand howmany
bys id: replace visit_history= word(subinstr(visit_history, ",", " ", .), _n)
timer off 1

. desc

Contains data
 Observations:     8,624,904                  
    Variables:             3                  
--------------------------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------------------------------------------------------
id              float   %9.0g                 
visit_history   strL    %9s                   
howmany         float   %9.0g                 
--------------------------------------------------------------------------------------------------------------------------------
Sorted by: id
     Note: Dataset has changed since last saved.


. head

    id   visit_hi~y   howmany  
     1   2022-03-22         4  
     1   2021-11-19         4  
     1   2021-10-01         4  
     1   2021-09-09         4  
     2   2022-03-09         1  
     3   2022-01-10         1  
     4   2022-04-02         1  
     5   2022-01-31         1  
     6   2022-02-14         2  
     6   2021-08-19         2  

. timer list
   1:     29.51 /        1 =      29.5120

subinstr did the trick on 29 seconds. hands down!

thks, Andrew.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#11

26 Apr 2022, 18:35

Code:

subinstr did the trick on 29 seconds. hands down!

Wow, I'm impressed! Thank you, Andrew Musau .
Comment

Announcement