Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Fuzzy Match One Variable in Same DataSet with 10,000+ Observations

    I have the following data set:

    CompanyName CompanyID
    Tyco 1235
    Tyco Inc. 1236
    TycoCo 4567
    Brown 1342
    Green Deal 1434
    Green Deals 1344
    Ferrari 1343
    Ferrari Inc 1359
    Ferraris US 7897

    I want the following data set:

    CompanyName CompanyID GroupID
    Tyco 1235 1
    Tyco Inc. 1236 1
    TycoCo 4567 1
    Brown 1342 2
    Green Deal 1434 3
    Green Deals 1344 3
    Ferrari 1343 4
    Ferrari Inc 1359 4
    Ferraris US 7897 4

    I tried using a "cross" function, but it blew up because of what I assume was the number of observations (10,000+).

    What would be the best way to accomplish what I am trying to get? I am using Stata 18.

    Thank you!


  • #2
    Originally posted by Derek Henrichs View Post
    I tried using a "cross" function, but it blew up
    At least you know the size of the explosion, there should be no surprises. With 10,000 observations, cross will create a dataset with 100 million observations (or 100002). If your computer struggles with this, have the full dataset in one frame and split the observations from this dataset into 10 frames, each containing 1,000 observations. Then perform 10 crosses and append the results.

    Comment


    • #3
      Thank you, Andrew!

      Can you provide some clarification or resource on how to split the observations into 10 frames and the append the results?

      Comment


      • #4
        you can use ssc matchit to do fuzzy match with strings. It has the advantage that it first indexes the observations to match such that the search space is reduced before analyzing the data.

        Comment


        • #5
          Originally posted by Derek Henrichs View Post
          Thank you, Andrew!

          Can you provide some clarification or resource on how to split the observations into 10 frames and the append the results?
          So let us take 60 observations of the variable "make" from the auto dataset and split them into 3 frames each containing 20 observations and then run cross and append the results. From

          Code:
          help limits
          you can have up to 100 frames.


          frames
          # of frames 100 100

          This allows you to even have 100 frames each with 1000 observations in case you are still struggling with 10. Note from below that this does not entail extra typing as you are just changing the highlighted numbers in the loops.



          Code:
          sysuse auto, clear
          keep make
          keep in 1/60
          save main, replace
          local delta=`=_N'/3
          local start= 1
          forval i=1/3{
              frame put make if inrange(_n,`start', `start'+`delta'-1), into(frame`i')
              frame frame`i'{
                  rename make make1
                  cross using main
               }
               local start =  `start'+`delta'
          }
          clear
          set obs 1
          tempfile holding
          forval i=1/3{
              frame frame`i': save `holding', replace
              append using `holding'
          }
          list in 1/30, sep(0)
          list in -30/l, sep(0)
          Res.:

          Code:
          . forval i=1/3{
            2.
          .     frame put make if inrange(_n,`start', `start'+`delta'-1), into(frame`i')
            3.
          .     frame frame`i'{
            4.
          .         rename make make1
            5.
          .         cross using main
            6.
          .      }
            7.
          .      local start =  `start'+`delta'
            8.
          . }
          
          .
          . clear
          
          .
          . set obs 1
          Number of observations (_N) was 0, now 1.
          
          .
          . tempfile holding
          
          .
          . forval i=1/3{
            2.
          .     frame frame`i': save `holding', replace
            3.
          .     append using `holding'
            4.
          . }
          
          .
          . list in 1/30, sep(0)
          
               +-----------------------------------+
               | make1               make          |
               |-----------------------------------|
            1. |                                   |
            2. | AMC Concord         AMC Concord   |
            3. | AMC Pacer           AMC Concord   |
            4. | AMC Spirit          AMC Concord   |
            5. | Buick Century       AMC Concord   |
            6. | Buick Electra       AMC Concord   |
            7. | Buick LeSabre       AMC Concord   |
            8. | Buick Opel          AMC Concord   |
            9. | Buick Regal         AMC Concord   |
           10. | Buick Riviera       AMC Concord   |
           11. | Buick Skylark       AMC Concord   |
           12. | Cad. Deville        AMC Concord   |
           13. | Cad. Eldorado       AMC Concord   |
           14. | Cad. Seville        AMC Concord   |
           15. | Chev. Chevette      AMC Concord   |
           16. | Chev. Impala        AMC Concord   |
           17. | Chev. Malibu        AMC Concord   |
           18. | Chev. Monte Carlo   AMC Concord   |
           19. | Chev. Monza         AMC Concord   |
           20. | Chev. Nova          AMC Concord   |
           21. | Dodge Colt          AMC Concord   |
           22. | AMC Concord         AMC Pacer     |
           23. | AMC Concord         AMC Spirit    |
           24. | AMC Concord         Buick Century |
           25. | AMC Concord         Buick Electra |
           26. | AMC Concord         Buick LeSabre |
           27. | AMC Concord         Buick Opel    |
           28. | AMC Concord         Buick Regal   |
           29. | AMC Concord         Buick Riviera |
           30. | AMC Concord         Buick Skylark |
               +-----------------------------------+
          
          .
          . list in -30/l, sep(0)
          
                +--------------------------------+
                | make1         make             |
                |--------------------------------|
          3572. | Fiat Strada   Merc. Marquis    |
          3573. | Fiat Strada   Merc. Monarch    |
          3574. | Fiat Strada   Merc. XR-7       |
          3575. | Fiat Strada   Merc. Zephyr     |
          3576. | Fiat Strada   Olds 98          |
          3577. | Fiat Strada   Olds Cutl Supr   |
          3578. | Fiat Strada   Olds Cutlass     |
          3579. | Fiat Strada   Olds Delta 88    |
          3580. | Fiat Strada   Olds Omega       |
          3581. | Fiat Strada   Olds Starfire    |
          3582. | Fiat Strada   Olds Toronado    |
          3583. | Fiat Strada   Plym. Arrow      |
          3584. | Fiat Strada   Plym. Champ      |
          3585. | Fiat Strada   Plym. Horizon    |
          3586. | Fiat Strada   Plym. Sapporo    |
          3587. | Fiat Strada   Plym. Volare     |
          3588. | Fiat Strada   Pont. Catalina   |
          3589. | Fiat Strada   Pont. Firebird   |
          3590. | Fiat Strada   Pont. Grand Prix |
          3591. | Fiat Strada   Pont. Le Mans    |
          3592. | Fiat Strada   Pont. Phoenix    |
          3593. | Fiat Strada   Pont. Sunbird    |
          3594. | Fiat Strada   Audi 5000        |
          3595. | Fiat Strada   Audi Fox         |
          3596. | Fiat Strada   BMW 320i         |
          3597. | Fiat Strada   Datsun 200       |
          3598. | Fiat Strada   Datsun 210       |
          3599. | Fiat Strada   Datsun 510       |
          3600. | Fiat Strada   Datsun 810       |
          3601. | Fiat Strada   Fiat Strada      |
                +--------------------------------+
          Last edited by Andrew Musau; 20 Mar 2024, 04:06.

          Comment

          Working...
          X