Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Performance issues when merging data

    Until now I have never experienced performance issues using STATA on my computer but this time, it took about 1,5 hours to m:1 merge dataset A, which contains 96 variables and 12,300 observations with a dataset B that contains 2 variables (Datatype long) and around 11 million observations. Of course I wouldn’t consider these 11 million observations to be a small dataset but 1.5 hours?
    After these 1.5 hours, I wanted to delete unmatched observations (about 10 million). This process runs since an hour now and still isn’t finished.

    I am running Stata 17 on an i5 with 32 Gigs of RAM.
    If this time is considered to be normal, are there any ways to improve speed on this process?

    thanks a lot

  • #2
    You can speed things up a little by reversing the merge. And it might help to use the built-in option to keep only matched observations. (See below.)

    .ÿ
    .ÿversionÿ17.0

    .ÿ
    .ÿclearÿ*

    .ÿ
    .ÿseedem
    setÿseedÿ373669464

    .ÿ
    .ÿ/*ÿdatasetÿA,ÿwhichÿcontainsÿ96ÿvariablesÿandÿ12,300ÿ*/
    .ÿquietlyÿsetÿobsÿ12300

    .ÿgenerateÿlongÿridÿ=ÿ_n

    .ÿforvaluesÿiÿ=ÿ2/95ÿ{
    ÿÿ2.ÿÿÿÿÿÿÿÿÿgenerateÿfloatÿv`i'ÿ=ÿround(100ÿ*ÿruniform())
    ÿÿ3.ÿ}

    .ÿ
    .ÿgenerateÿdoubleÿranduÿ=ÿruniform()

    .ÿsortÿrandu

    .ÿ
    .ÿtempfileÿA

    .ÿquietlyÿsaveÿ`A'

    .ÿ
    .ÿ/*ÿÿdatasetÿBÿthatÿcontainsÿ2ÿvariablesÿ(Datatypeÿlong)ÿandÿaroundÿ11ÿmillionÿobservationsÿ*/
    .ÿdropÿv*

    .ÿ//ÿunmatchedÿobservationsÿ(aboutÿ10ÿmillion)
    .ÿquietlyÿexpandÿ`=round(1000000ÿ/ÿ_N)ÿ+ÿ1'

    .ÿquietlyÿsetÿobsÿ11000000

    .ÿ
    .ÿquietlyÿreplaceÿridÿ=ÿruniformint(12301,ÿ50000)ÿifÿmissing(rid)

    .ÿ
    .ÿquietlyÿreplaceÿranduÿ=ÿruniform()

    .ÿsortÿrandu

    .ÿ
    .ÿtempfileÿB

    .ÿquietlyÿsaveÿ`B'

    .ÿ
    .ÿtimerÿclear

    .ÿ
    .ÿuseÿ`A'

    .ÿtimerÿonÿ1

    .ÿmergeÿ1:mÿridÿusingÿ`B',ÿassert(matchÿusing)ÿkeep(match)ÿnogenerate

    ÿÿÿÿResultÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿNumberÿofÿobs
    ÿÿÿÿ-----------------------------------------
    ÿÿÿÿNotÿmatchedÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ0
    ÿÿÿÿMatchedÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ1,008,600ÿÿ
    ÿÿÿÿ-----------------------------------------

    .ÿtimerÿoffÿ1

    .ÿ
    .ÿuseÿ`B',ÿclear

    .ÿcount
    ÿÿ11,000,000

    .ÿtimerÿonÿ2

    .ÿmergeÿm:1ÿridÿusingÿ`A',ÿassert(matchÿmaster)ÿkeep(match)ÿnogenerate

    ÿÿÿÿResultÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿNumberÿofÿobs
    ÿÿÿÿ-----------------------------------------
    ÿÿÿÿNotÿmatchedÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ0
    ÿÿÿÿMatchedÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ1,008,600ÿÿ
    ÿÿÿÿ-----------------------------------------

    .ÿtimerÿoffÿ2

    .ÿ
    .ÿtimerÿlist
    ÿÿÿ1:ÿÿÿÿÿÿ3.92ÿ/ÿÿÿÿÿÿÿÿ1ÿ=ÿÿÿÿÿÿÿ3.9180
    ÿÿÿ2:ÿÿÿÿÿ22.02ÿ/ÿÿÿÿÿÿÿÿ1ÿ=ÿÿÿÿÿÿ22.0250

    .ÿ
    .ÿexit

    endÿofÿdo-file


    .


    But something just doesn't sound right. I've run that on an aging laptop and it's done both ways in less than half a minute.

    What else is your machine running in the background? What are the ninety-six variables in dataset A (strL?) and are they all needed?

    Comment


    • #3
      Thanks for your reply. I followed your advise and only keep matched results. I also did an extra step and dumped variables before merging. Stata did the merge in a few seconds.
      Thank you very much.

      Comment

      Working...
      X