
  • Advantages of tempvar or tempfile

    For somebody writing do-files rather than programs, is there an advantage to using tempvars and tempfiles instead of normal variables or datasets saved under a name? I mean in terms of speed or memory.

  • #2
    Temporary variables are variables, so reside in memory, while temporary datasets reside on disk. Anything in memory will be faster than on disk, all else being equal. That said, whether it's a program or a do-file, one may be easier to work with than the other but it depends on the context of what you're doing.
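    To make the distinction concrete, here is a minimal sketch, not taken from the thread. The variable names id and x, the macro names total and groups, and the group structure are all made up for illustration. It only shows that a tempvar becomes a column of the dataset in memory, while a tempfile is a dataset written to disk and read back later.

```stata
* Sketch only: id, x, total and groups are hypothetical names.
clear
set obs 1000
generate int id = ceil(_n / 10)          // hypothetical group identifier (1..100)
generate byte x = runiform() < .5        // hypothetical dummy

* tempvar reserves a unique name; the variable exists once it is generated
tempvar total
bysort id: egen `total' = total(x)
summarize `total'

* tempfile reserves a unique path; the file exists once something is saved to it
tempfile groups
preserve
keep id
duplicates drop                          // one row per id
save `groups'
restore

* the temporary dataset is read back from disk when it is needed
merge m:1 id using `groups', nogenerate
```

    Which of the two is appropriate depends on whether the intermediate result is a column of the current dataset or a separate dataset.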


    • #3
      I cannot be sure, but I believe there is no advantage in terms of speed or memory. If anything, using temporary objects might use more memory and slow down execution, because Stata needs to keep track of the temporary objects and erase them when they are no longer needed. I doubt you would notice any difference.


      • #4
        Working out how to do this in a do-file could be a step towards doing it in a program. For quite a while after I started with Stata, programs looked more complicated than I needed, and I just wrote do-files. In due course, the advantages of programs led me to go further in many cases.

        So that’s a psychological or educational point that may apply to other people too.


        • #5
          I was just wondering because my organization's programming guidelines for working with big data say to use tempvars or tempfiles whenever possible. But maybe they just want to prevent people from forgetting to drop variables or erase files.
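          The housekeeping point can be illustrated with a small sketch, not taken from the thread. It uses the auto dataset shipped with Stata, and the names flag, n_flag, scratch and scratch_auto.dta are made up. With permanent names the cleanup is explicit; with temporary names Stata drops the variables and erases the file when the do-file or program concludes.

```stata
* Sketch only: flag, n_flag, scratch and scratch_auto.dta are hypothetical names.

* Permanent names: cleanup has to be done by hand
sysuse auto, clear
save "scratch_auto.dta", replace         // hypothetical scratch copy on disk
generate byte flag = price > 5000
egen n_flag = total(flag)
drop flag n_flag                         // easy to forget
erase "scratch_auto.dta"                 // easy to forget

* Temporary names: dropped and erased automatically when the do-file ends
sysuse auto, clear
tempfile scratch
tempvar flag n_flag
save `scratch'
generate byte `flag' = price > 5000
egen `n_flag' = total(`flag')
```

          Temporary names also cannot collide with variables or files that already exist, which may be another reason for such a guideline.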

          I did a very small test of creating a dummy and then an egen = total() from it with either two tempvars or two normal vars and the normal vars case took one third of the time!


          • #6
            Originally posted by Henry Strawforrd
            I did a very small test of creating a dummy and then an egen = total() from it with either two tempvars or two normal vars and the normal vars case took one third of the time!
            Could you provide more details? I cannot replicate:

            Code:
            . clear
            
            . set obs 10000000
            Number of observations (_N) was 0, now 10,000,000.
            
            . 
            . timer clear
            
            . 
            . timer on 1
            
            . generate byte x1 = runiform() < .5
            
            . egen t1 = total(x)
            
            . timer off 1
            
            . 
            . timer on 2
            
            . tempvar x1 t1
            
            . generate byte `x1' = runiform() < .5
            
            . egen `t1' = total(x)
            
            . timer off 2
            
            . 
            . timer list
               1:      0.74 /        1 =       0.7450
               2:      0.72 /        1 =       0.7170
            
            . 
            end of do-file
            Code:
            . about
            
            Stata/SE 17.0 for Windows (64-bit x86-64)
            Revision 06 Apr 2022
            Copyright 1985-2021 StataCorp LLC
            
            Total physical memory:       16.00 GB
            Available physical memory:    9.41 GB


            • #7
              I realized that I did not adjust the argument to egen's total(). The results do not change:

              Code:
              . clear
              
              . set obs 10000000
              Number of observations (_N) was 0, now 10,000,000.
              
              . timer clear
              
              . timer on 1
              
              . generate byte x1 = runiform() < .5
              
              . egen t1 = total(x1)
              
              . timer off 1
              
              . timer on 2
              
              . tempvar x1 t1
              
              . generate byte `x1' = runiform() < .5
              
              . egen `t1' = total(`x1')
              
              . timer off 2
              
              . timer list
                 1:      0.91 /        1 =       0.9140
                 2:      0.93 /        1 =       0.9260


              • #8
                Interesting! I did the same as you on my existing (confidential) data, except that the total was computed within groups.

                But, indeed, when I try it your way, the two are virtually the same. It is also interesting that this time the tempvar version is slightly faster, the opposite of your example.


                Code:
                . clear
                
                . set obs 10000000
                Number of observations (_N) was 0, now 10,000,000.
                
                .
                . gen int id = runiformint(0, 100)
                
                . gen int id2 = runiformint(0, 10000)
                
                .
                . timer clear
                
                . timer on 1
                
                . generate byte x = runiform() < .5
                
                . bysort id: egen t = total(x)
                
                . timer off 1
                
                .
                . sort id2
                
                .
                . timer on 2
                
                . tempvar x1 t1
                
                . generate byte `x1' = runiform() < .5
                
                . bysort id: egen `t1' = total(`x1')
                
                . timer off 2
                
                .
                . timer list
                   1:      4.88 /        1 =       4.8810
                   2:      4.74 /        1 =       4.7410
                Last edited by Henry Strawforrd; 17 Jan 2024, 03:53.


                • #9
                  Running my example repeatedly (but not systematically), I get the impression that the temporary variable approach is usually slightly faster. That might be different when you reverse the order of the timings, though, because the second time around the dataset is larger (it already has two variables in it). I would not make much of these differences unless they are (a) confirmed in a more serious simulation and (b) accompanied by a theoretical or technical explanation of why systematic differences arise.
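                  One possible design for such a simulation is sketched below. The design is an assumption, not something tested in this thread. It repeats both approaches, alternates which one runs first, and drops the new variables after each run so that the dataset has the same size every time. The seed, the number of repetitions and the number of observations are arbitrary choices.

```stata
* Sketch only: seed, reps and the number of observations are arbitrary.
clear
set seed 12345
set obs 1000000                          // smaller than the 10,000,000 used above
timer clear

local reps 20                            // hypothetical number of repetitions
forvalues r = 1/`reps' {
    * alternate which approach is timed first
    if mod(`r', 2) {
        local order "perm temp"
    }
    else {
        local order "temp perm"
    }

    foreach approach of local order {
        if "`approach'" == "perm" {
            timer on 1
            generate byte x = runiform() < .5
            egen t = total(x)
            timer off 1
            drop x t                     // restore the dataset size
        }
        else {
            timer on 2
            tempvar x t
            generate byte `x' = runiform() < .5
            egen `t' = total(`x')
            timer off 2
            drop `x' `t'                 // restore the dataset size
        }
    }
}

* timer 1 = permanent variables, timer 2 = temporary variables
timer list
```

                  Because each timer is switched on and off once per repetition, timer list reports the accumulated time, the number of runs and the average per run for each approach. This only addresses point (a); it offers no explanation for any difference it might show.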
