  • Difference between joinby and cross

    Dear Statalisters,

    I have used joinby a couple of times. I read the PDF manual entry for cross (http://www.stata.com/manuals13/dcross.pdf), which says it is rarely used, but it seems to function the same way as joinby.

    Could someone comment on when it is appropriate (or inappropriate) to use joinby?

    Thanks,
    Rochelle

  • #2
    help cross

    "Form every pairwise combination of two datasets"

    help joinby

    "Form all pairwise combinations within groups"

    Huge difference.


    • #3
      Thanks, Ben! I see that the wording is different, but without an example I did not get the point.

      I tried an example from the Stata 13 manual. It seems cross does not require common variables between the two datasets, whereas joinby does.


      • #4
        Try running this code to see the difference:

        Code:
        clear*
        
        set seed 1234
        
        // CREATE "MASTER" DATA SET
        set obs 6
        gen int id = ceil(_n/3)
        gen x = runiform()
        list, clean
        tempfile master
        save `master'
        
        // CREATE "USING" DATA SET
        clear
        set obs 6
        gen int id = ceil(_n/3)
        gen y = runiform()
        list, clean
        tempfile using
        save `using'
        
        use `master', clear
        joinby id using `using'
        count
        assert `r(N)' == 18
        list, clean
        
        use `master', clear
        rename id id_master
        cross using `using'
        count
        assert `r(N)' == 36
        list, clean

        The -joinby- command produces a data set in which each observation with id 1 in master is paired with each observation with id 1 in using, and each observation with id 2 in master is paired with each observation with id 2 in using. That gives a total of 3x3 + 3x3 = 18 observations.

        The -cross- command produces the full Cartesian product of the master and using data sets. Every observation in the master data set is paired with every observation in the using data set, regardless of whether their ids match. This gives 6x6 = 36 observations.
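
        As a cross-check on those counts, here is a minimal sketch (not part of the code above) that rebuilds the -joinby- result from -cross- by keeping only the pairs whose ids agree. It regenerates the same kind of toy data so that it runs on its own; the tempfile name using_data is chosen here purely for illustration.

        Code:
        clear*
        set seed 1234

        // Master: 6 observations in two id groups of 3
        set obs 6
        gen int id_master = ceil(_n/3)
        gen x = runiform()
        tempfile master
        save `master'

        // Using: same layout, with the id variable named id
        clear
        set obs 6
        gen int id = ceil(_n/3)
        gen y = runiform()
        tempfile using_data
        save `using_data'

        // Full Cartesian product: 6 x 6 = 36 pairs
        use `master', clear
        cross using `using_data'
        count
        assert r(N) == 36

        // Keep only the pairs whose ids match: 3x3 + 3x3 = 18
        keep if id_master == id
        count
        assert r(N) == 18

        This is only meant to show how the two results relate. On real data -joinby- is generally the better route to the matched pairs, because it forms combinations within groups and so never has to build the non-matching pairs first.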



        • #5
          Thanks Clyde for always being so helpful !!!


          Sincerely,
          Rochelle


          • #6
            Is there a more efficient alternative to the cross command? I need to create all possible pairwise combinations. Unfortunately, the joinby and merge commands do not achieve this.


            • #7
              -cross- does create all possible pairwise combinations of observations in the two data sets in the command. -joinby- does not: it creates the pairwise combinations of observations that agree on the designated variables only.

              If you need to create all possible pairwise combinations, then -cross- is what you want. If it is not doing what you want it to, then what you want is something other than all pairwise combinations, and you need to explain more carefully what you actually want.

              As for efficiency, -cross- can be both very slow and can blow through memory limits with relatively modest data sets. For example, if each of the data sets has 1 million observations, which in contemporary terms is by no means a large data set, when you -cross- them the result has 1 trillion observations--which is huge and typically not something that a desktop computer can accommodate. If you have a computer that can accommodate it, it may have to use virtual memory to do so, and that slows things to a crawl. But the problem is not really with -cross-; it's inherent to the gargantuan size of the pairwise combination problem itself.
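
              To put rough numbers on this, here is a small sketch of the arithmetic. The 16 bytes per observation is an assumed width chosen only for illustration; substitute the combined width of the variables in your own two data sets.

              Code:
              // Observation counts from the example above
              local n_master = 1000000
              local n_using  = 1000000

              // Rows produced by -cross-: the product of the two counts (1 trillion here)
              local n_cross = `n_master' * `n_using'
              display %20.0fc `n_cross'

              // Approximate size in gigabytes at an ASSUMED 16 bytes per observation
              local bytes_per_obs = 16
              display %12.0fc `n_cross' * `bytes_per_obs' / 1024^3

              Running the same calculation with your actual observation counts before calling -cross- is a cheap way to see whether the result has any chance of fitting in memory.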
