Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Stacking String Long

    Hi, I need a help here.

    visit_history variable has the visits for each unique id and it might span to more than 100 events in this strL:

    Code:
    clear
    input float id strL visit_history
    1 "2021-06-03,2020-12-07"
    2 "2021-07-30,2020-12-28,2021-02-02"
    3 "2020-09-12"           
    4 "2021-03-10,2021-02-02"
    end
    datasets is big, about 2MM ids, and I am looking for an efficiently way to convert that to:

    Code:
    clear
    input float id strL visit_history
    1 "2021-06-03"
    1 "2020-12-07"
    2 "2021-07-30"
    2 "2020-12-28"
    3 "2020-09-12"
    4 "2021-03-10"
    4 "2021-02-02"
    end
    Mata coding is welcome.

    Glad for any clue about this. thks.


  • #2
    Code:
    split visit_history, gen(vh) parse(",")
    reshape long vh, i(id)
    drop if missing(vh)
    drop visit_history _j
    rename vh visit_history
    Note: This code assumes, but does not verify, that the dates within a given observation of visit_history are separated by commas, and that commas are not otherwise used in that variable.

    Added: With 2 million IDs this code is going to be slow. I don't think there is any way to speed up the -split- command, but consider the user-written commands -tolong- or -greshape- instead of -reshape-. If I recall correctly, both are available from SSC, with -greshape- as part of the -gtools- package.
    Last edited by Clyde Schechter; 26 Apr 2022, 12:02.

    Comment


    • #3
      Thanks, Clyde for your quick reply.

      The code you suggested turned out not be scalable in my MP2, for 2MM dataset. The "split" chunk is running for +2 hours ...

      Can you advise me any other approach ?

      Comment


      • #4
        I can think of other approaches that would be slower. I can't think of anything that would be faster. 2 hours is not a long time to process a data set of this size. Find something else to occupy yourself while it runs. Let it run overnight so you don't have to sit their getting impatient. I have run analyses that have taken several weeks to conclude. Big data sets are hard that way.

        Comment


        • #5
          Yes, the point with the "wait till it finish" approach, is to have my stata+cpu busy that long, having other demands to solve. Maybe it time to buy another Stata license.

          thanks again, Clyde.

          Comment


          • #6
            Try

            Code:
            clear
            input float id strL visit_history
            1 "2021-06-03,2020-12-07"
            2 "2021-07-30,2020-12-28,2021-02-02"
            3 "2020-09-12"          
            4 "2021-03-10,2021-02-02"
            end
            
            gen howmany= 1+length(ustrregexra(visit_history, "[^\,]", ""))
            expand howmany
            bys id: replace visit_history= word(ustrregexra(visit_history, "[\,]", " "), _n)

            I think creating extra variables using split and reshaping may not be very efficient. Here, sorting and perhaps the regular expressions may be time consuming, but maybe not 2+ hours.


            Res.:

            Code:
            . l, sepby(id)
            
                 +---------------------------+
                 | id   visit_hi~y   howmany |
                 |---------------------------|
              1. |  1   2021-06-03         2 |
              2. |  1   2020-12-07         2 |
                 |---------------------------|
              3. |  2   2021-07-30         3 |
              4. |  2   2020-12-28         3 |
              5. |  2   2021-02-02         3 |
                 |---------------------------|
              6. |  3   2020-09-12         1 |
                 |---------------------------|
              7. |  4   2021-03-10         2 |
              8. |  4   2021-02-02         2 |
                 +---------------------------+
            Last edited by Andrew Musau; 26 Apr 2022, 17:22.

            Comment


            • #7
              -subinstr- should be much faster.

              Code:
              clear
              input float id strL visit_history
              1 "2021-06-03,2020-12-07"
              2 "2021-07-30,2020-12-28,2021-02-02"
              3 "2020-09-12"          
              4 "2021-03-10,2021-02-02"
              end
              
              gen howmany= length(visit_history)- length(subinstr(visit_history, ",", "", .)) + 1
              expand howmany
              bys id: replace visit_history= word(subinstr(visit_history, ",", " ", .), _n)
              Last edited by Andrew Musau; 26 Apr 2022, 17:21.

              Comment


              • #8
                thanks Andrew to jump in. I will give a try on your Regex approach.

                Meanwhile, I did some research and came up with this Mata solution:


                Code:
                clear
                input float id strL visit_history
                1 "2021-06-03,2020-12-07"
                2 "2021-07-30,2020-12-28,2021-02-02"
                3 "2020-09-12"          
                4 "2021-03-10,2021-02-02"
                end
                tostring id, replace
                mata
                x=st_sdata(.,("id","visit_history"))
                mata stata clear
                y=J(0,2,"")
                t = tokeninit("",",")
                for (i=1; i<=rows(x); i++){
                tokenset(t, x[i,2])
                while ( (token = tokenget( t)) != "") {
                y=y\(x[i,1],token)
                }
                }
                output:

                Code:
                : y
                                 1            2
                     +---------------------------+
                   1 |           1   2021-06-03  |
                   2 |           1            ,  |
                   3 |           1   2020-12-07  |
                   4 |           2   2021-07-30  |
                   5 |           2            ,  |
                   6 |           2   2020-12-28  |
                   7 |           2            ,  |
                   8 |           2   2021-02-02  |
                   9 |           3   2020-09-12  |
                  10 |           4   2021-03-10  |
                  11 |           4            ,  |
                  12 |           4   2021-02-02  |
                     +---------------------------+
                Andrew , as I have already witnessed your ability on Mata, any clue on how avoiding the parser "," while generating the y matrix?

                thks, again.

                Comment


                • #9
                  You can add a condition such as

                  Code:
                  clear
                  input float id strL visit_history
                  1 "2021-06-03,2020-12-07"
                  2 "2021-07-30,2020-12-28,2021-02-02"
                  3 "2020-09-12"          
                  4 "2021-03-10,2021-02-02"
                  end
                  tostring id, replace
                  mata
                  x=st_sdata(.,("id","visit_history"))
                  mata stata clear
                  y=J(0,2,"")
                  t = tokeninit("",",")
                  for (i=1; i<=rows(x); i++){
                  tokenset(t, x[i,2])
                  while ( (token = tokenget( t)) != "") {
                  y=y\(x[i,1],token)
                  y= select(y, y[,2] :!=",")
                  }
                  }
                  
                  y
                  end
                  Res.:

                  Code:
                  : y
                                  1            2
                      +---------------------------+
                    1 |           1   2021-06-03  |
                    2 |           1   2020-12-07  |
                    3 |           2   2021-07-30  |
                    4 |           2   2020-12-28  |
                    5 |           2   2021-02-02  |
                    6 |           3   2020-09-12  |
                    7 |           4   2021-03-10  |
                    8 |           4   2021-02-02  |
                      +---------------------------+

                  Comment


                  • #10
                    Tried you subinstr approach in the 2MM dataset and found that my Mata code is not needed in here.

                    Code:
                    . desc
                    
                    Contains data
                     Observations:     2,446,852                  
                        Variables:             2                  
                    --------------------------------------------------------------------------------------------------------------------------------
                    Variable      Storage   Display    Value
                        name         type    format    label      Variable label
                    --------------------------------------------------------------------------------------------------------------------------------
                    id              float   %9.0g                 
                    visit_history   strL    %9s                   
                    --------------------------------------------------------------------------------------------------------------------------------
                    Sorted by: 
                         Note: Dataset has changed since last saved.
                    
                    timer clear
                    timer on 1
                    gen howmany= length(visit_history)- length(subinstr(visit_history, ",", "", .)) + 1
                    expand howmany
                    bys id: replace visit_history= word(subinstr(visit_history, ",", " ", .), _n)
                    timer off 1
                    
                    . desc
                    
                    Contains data
                     Observations:     8,624,904                  
                        Variables:             3                  
                    --------------------------------------------------------------------------------------------------------------------------------
                    Variable      Storage   Display    Value
                        name         type    format    label      Variable label
                    --------------------------------------------------------------------------------------------------------------------------------
                    id              float   %9.0g                 
                    visit_history   strL    %9s                   
                    howmany         float   %9.0g                 
                    --------------------------------------------------------------------------------------------------------------------------------
                    Sorted by: id
                         Note: Dataset has changed since last saved.
                    
                    
                    . head
                    
                        id   visit_hi~y   howmany  
                         1   2022-03-22         4  
                         1   2021-11-19         4  
                         1   2021-10-01         4  
                         1   2021-09-09         4  
                         2   2022-03-09         1  
                         3   2022-01-10         1  
                         4   2022-04-02         1  
                         5   2022-01-31         1  
                         6   2022-02-14         2  
                         6   2021-08-19         2  
                    
                    . timer list
                       1:     29.51 /        1 =      29.5120
                    subinstr did the trick on 29 seconds. hands down!

                    thks, Andrew.

                    Comment


                    • #11
                      Code:
                      subinstr did the trick on 29 seconds. hands down!
                      Wow, I'm impressed! Thank you, Andrew Musau .

                      Comment

                      Working...
                      X