Advantages of tempvar or tempfile

Henry Strawforrd

Join Date: Sep 2021

Posts: 228
#1


16 Jan 2024, 07:48

For somebody writing do-files rather than programs, is there any advantage, in terms of speed or memory, to using tempvars and tempfiles instead of ordinary variables or datasets saved under a name?
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#2

16 Jan 2024, 07:59

Temporary variables are variables, so reside in memory, while temporary datasets reside on disk. Anything in memory will be faster than on disk, all else being equal. That said, whether it's a program or a do-file, one may be easier to work with than the other but it depends on the context of what you're doing.
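To make the distinction concrete, here is a minimal sketch, assuming the auto dataset shipped with Stata; the names hi_mpg and snapshot are made up for illustration. The temporary variable lives in the dataset in memory, while the temporary file is written to disk and read back.

```stata
* Sketch only: hi_mpg and snapshot are hypothetical names.
sysuse auto, clear

* A temporary variable: an ordinary variable in memory with a
* Stata-generated name, referred to through the local macro.
tempvar hi_mpg
generate byte `hi_mpg' = mpg > 25 if !missing(mpg)
count if `hi_mpg' == 1
drop `hi_mpg'

* A temporary file: a dataset written to disk under a
* Stata-generated filename, then read back into memory.
tempfile snapshot
save `snapshot'
use `snapshot', clear
```

Both objects are removed automatically when the do-file or program that created them finishes.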
daniel klein

Join Date: Mar 2014

Posts: 3850
#3

16 Jan 2024, 08:01

I cannot be sure but I believe there is no advantage in terms of speed or memory. If anything, using temporary objects might use more memory and slow down execution time because Stata needs to remember the temporary objects and needs to erase them when no longer needed. I doubt you would notice any differences.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

16 Jan 2024, 08:15

Working out how to do this in a do-file could be a step towards doing it in a program. For quite a while after I started with Stata, programs looked more complicated than I needed, so I just wrote do-files. In due course the advantages of programs led me to go further in many cases.

So that’s a psychological or educational point that may apply to other people too.
Henry Strawforrd

Join Date: Sep 2021

Posts: 228
#5

16 Jan 2024, 08:56

I was just wondering because the programming guidelines in my organization to work with big data say that whenever possible use tempvars or tempfiles. But maybe they just want to prevent people forgetting to drop or erase.

I did a very small test of creating a dummy and then an egen = total() from it with either two tempvars or two normal vars and the normal vars case took one third of the time!
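On the guess that the guideline is mainly about not forgetting to drop or erase: the sketch below shows the automatic cleanup for a temporary variable. It assumes the auto dataset shipped with Stata, and the program name demo_cleanup and the macro name flag are hypothetical.

```stata
* Sketch only: demo_cleanup and flag are hypothetical names.
capture program drop demo_cleanup
program define demo_cleanup
    tempvar flag
    generate byte `flag' = price > 5000
    count if `flag' == 1
    * no explicit drop: Stata removes the temporary variable
    * when the program ends
end

sysuse auto, clear
local k_before = c(k)          // number of variables before the call
demo_cleanup
assert c(k) == `k_before'      // no leftover variable afterwards
```

A named variable created the same way would remain in the dataset until it was dropped explicitly.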

daniel klein

Join Date: Mar 2014
Posts: 3850

16 Jan 2024, 09:07

Originally posted by Henry Strawforrd

I did a very small test of creating a dummy and then an egen = total() from it with either two tempvars or two normal vars and the normal vars case took one third of the time!

Could you provide more details? I cannot replicate:

Code:

. clear

. set obs 10000000
Number of observations (_N) was 0, now 10,000,000.

. 
. timer clear

. 
. timer on 1

. generate byte x1 = runiform() < .5

. egen t1 = total(x)

. timer off 1

. 
. timer on 2

. tempvar x1 t1

. generate byte `x1' = runiform() < .5

. egen `t1' = total(x)

. timer off 2

. 
. timer list
   1:      0.74 /        1 =       0.7450
   2:      0.72 /        1 =       0.7170

. 
end of do-file

Code:

. about

Stata/SE 17.0 for Windows (64-bit x86-64)
Revision 06 Apr 2022
Copyright 1985-2021 StataCorp LLC

Total physical memory:       16.00 GB
Available physical memory:    9.41 GB


daniel klein

Join Date: Mar 2014
Posts: 3850

17 Jan 2024, 01:31

I realized that I did not adjust the argument to egen's total() in the example above. The results do not change after correcting it:

Code:

. clear

. set obs 10000000
Number of observations (_N) was 0, now 10,000,000.

. timer clear

. timer on 1

. generate byte x1 = runiform() < .5

. egen t1 = total(x1)

. timer off 1

. timer on 2

. tempvar x1 t1

. generate byte `x1' = runiform() < .5

. egen `t1' = total(`x1')

. timer off 2

. timer list
   1:      0.91 /        1 =       0.9140
   2:      0.93 /        1 =       0.9260


Henry Strawforrd

Join Date: Sep 2021
Posts: 228

17 Jan 2024, 02:47

Interesting! I did the same as you in my existing (confidential) data, except that the total was computed within groups.

But, indeed, when I try it like you the two are virtually the same. Interesting also that this time tempvar is slightly faster, opposite of your example.

Code:

. clear

. set obs 10000000
Number of observations (_N) was 0, now 10,000,000.

.
. gen int id = runiformint(0, 100)

. gen int id2 = runiformint(0, 10000)

.
. timer clear

. timer on 1

. generate byte x = runiform() < .5

. bysort id: egen t = total(x)

. timer off 1

.
. sort id2

.
. timer on 2

. tempvar x1 t1

. generate byte `x1' = runiform() < .5

. bysort id: egen `t1' = total(`x1')

. timer off 2

.
. timer list
   1:      4.88 /        1 =       4.8810
   2:      4.74 /        1 =       4.7410

Last edited by Henry Strawforrd; 17 Jan 2024, 02:53.


daniel klein

Join Date: Mar 2014

Posts: 3850
#9

17 Jan 2024, 03:14

Running my example repeatedly (but not systematically), I get the impression that the temporary variable approach is usually slightly faster. That might be different when you reverse the order of the timings, though, because the second time around the dataset is larger (it already has two variables in it). I would not make much of these differences unless they are (a) confirmed in a more serious simulation and (b) accompanied by a theoretical/technical explanation of why systematic differences arise.
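One possible shape for such a simulation is sketched below. It is only a sketch under stated assumptions: the number of repetitions (20), the sample size, and the seed are arbitrary choices, and the variables are dropped after each timed run so that both approaches always start from a dataset of the same size. The order of the two approaches alternates between repetitions.

```stata
* Sketch only: repetitions, sample size, and seed are arbitrary.
clear
set obs 1000000
set seed 12345
timer clear

forvalues r = 1/20 {
    * alternate which approach is timed first
    local order = cond(mod(`r', 2), "named temp", "temp named")
    foreach approach of local order {
        if "`approach'" == "named" {
            timer on 1
            generate byte x1 = runiform() < .5
            egen t1 = total(x1)
            timer off 1
            drop x1 t1          // dropped outside the timed part
        }
        else {
            timer on 2
            tempvar x t
            generate byte `x' = runiform() < .5
            egen `t' = total(`x')
            timer off 2
            drop `x' `t'        // dropped outside the timed part
        }
    }
}

* timer 1 = named variables, timer 2 = temporary variables
timer list
```

timer list reports the accumulated time, the number of runs, and the mean per run for each timer, so the two means can be compared directly. Dropping the variables outside the timed part is a design choice; moving the drop inside would also measure the cost of removing them.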
