Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Speed string versus numeric identifier

    Anybody has experience whether using a string versus numeric identifier matters for speed with large datasets for commands such as merge, egen = group(), by, collapse? And does it matter if the string is StrL or a fixed format?

  • #2
    Stata doesn't allow strLs as the merge key, so you don't have to worry about that. Your query piqued my curiosity, so I created the code below with which you could experiment, as I did. Some simple experiments showed me a few things:

    1) A -merge- with a string identifier is perhaps 30- 40% slower than with a numeric identifier of type long.
    2) The size of the string identifier doesn't matter much
    3) If there is a large number of variables in the using file, the type of the identifier becomes a relatively less important factor.

    Item 2) seemed somewhat odd, but then it occurred to me that -merge- may be doing something like converting any string identifier internally to a numeric identifier.

    Here's the code to play with:
    Code:
    clear
    set seed 7658
    // Parameters you can vary.
    local reps 10
    local N = 1000000
    local idlen = 32 // duplicate string IDs will occur for small idlen and large N
    local numdatavars = 1
    //
    // Make a master file
    set obs `N'
    gen str`idlen' sid = ""  // string id
    forval i = 1/`idlen' {  
       quiet replace sid = sid + char(33 + ceil(runiform() * 127-33 ))
    }
    gen long nid = _n  // numeric id
    // Random order so that -merge- can't take advantage of master or using file
    // being sorted by nid.
    gen rand = runiform()
    sort rand
    drop rand
    duplicates report sid  // check before starting
    //
    // Put some data into a using file
    forval i = 1/`numdatavars' {
        gen x`i' = runiform()
    }   
    tempfile using
    save `using'
    //  
    // Back to master file. Do merges with string vs. numeric id
    drop x* //  just sid and nid in the master.
    //
    timer clear
    timer on 1
    // string key
    forval i = 1/`reps' {
        quiet merge 1:1 sid using `using', nogen keepusing(x*)
        drop x*
    }
    timer off 1
    //
    timer on 2
    // numeric key
    forval i = 1/`reps' {
        quiet merge 1:1 nid using `using', nogen keepusing(x*)
        drop x1
    }
    timer off 2
    quiet timer list
    di "Time with string id = " r(t1)
    di "Time with numeric id = " r(t2)

    Comment


    • #3
      Amazing, thank you!

      Comment

      Working...
      X