The technical notes on the scalar command in the [P] manual discuss two methods of naming and referring to scalars that avoid conflicts with the names of existing variables. In that discussion, the use of the pseudofunction scalar() is deprecated in favor of using tempname.
However, this advice is dangerous, because the method can produce unintended results. The manual argues that tempname is safe because Stata makes sure that the internal names it hands out are unique. This is not quite true if the dataset contains variables created by Stata's tempvar facility; see [P] macro. The example below demonstrates that a dataset may contain a variable named __000000 that was created earlier with tempvar and then saved. If you then create a scalar under a name obtained from tempname, Stata does not detect the name conflict. As a consequence, the internal name of the temporary scalar may also be __000000, so that a scalar and a variable share the same name.
But if a variable and a scalar have the same name, Stata will always use the variable, not the scalar (see the technical notes in [P] scalar and Kolev, 2006). Using tempname does not help in this instance, because the selection rule (a data variable takes precedence over a scalar) also applies to the internal variables and scalars to which the macro names created by tempvar or tempname refer.
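The selection rule can be seen in isolation with a few interactive commands. The following is a minimal sketch, assuming the auto dataset shipped with Stata; the scalar name price is chosen deliberately so that it collides with the variable price.

```stata
sysuse auto, clear
scalar price = 1           // a scalar named like the existing variable price
display price              // the variable wins: shows price[1], i.e. 4099
display scalar(price)      // scalar() forces the scalar: shows 1
scalar drop price          // clean up
```

The same ambiguity arises with __000000, except that there the colliding names are hidden behind the macros returned by tempvar and tempname.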
To summarize, I recommend always using the pseudofunction scalar() in programs that use temporary scalars. In addition, the manual entries should be revised to warn users of the possible side effects of not following this advice.
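Reduced to its core, the recommended pattern combines both safeguards: obtain the name from tempname and still wrap every read of the scalar in scalar(). The sketch below is only an illustration; the program name safemax and the returned result r(max) are hypothetical choices, not part of any official command.

```stata
capture program drop safemax
program define safemax, rclass
    version 11.2
    args foovar
    tempname m                         // name cannot clash with user-chosen names
    quietly summarize `foovar'
    scalar `m' = max(1, r(max))
    * scalar() guarantees that the scalar is read, even if a variable
    * with the same internal name happens to exist in the dataset
    return scalar max = scalar(`m')
end
```

For example, after sysuse auto, typing safemax rep78 leaves the result in r(max).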
One way to avoid the problem in the first place would be to change Stata's save (and saveold) command so that temporary variables are not saved automatically. At the same time, an option such as "keeptemp" could be added to let users save temporary variables if they really want to. In any case, the current manual entries [D] save and [P] macro should alert the reader to the possible side effects (as demonstrated in this post) of using and saving temporary variables.
I know that changing the behavior of save risks breaking existing programs, but on the other hand there may be programs in use that do not take the demonstrated side effects into account. The question, then, is which does more harm: breaking existing programs, or leaving programs in use that may fall into the trap of conflicting variable and scalar names.
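Until save itself behaves differently, a program can guard against the problem on its own. The following is a rough sketch of such a guard, not a tested replacement for save: the wrapper name savenotemp is hypothetical, and the pattern __0* assumes that leftover temporary variables carry names of the form __000000, __000001, and so on, as in this post. A user variable whose name happens to start with __0 would be dropped from the saved file as well.

```stata
capture program drop savenotemp
program define savenotemp
    version 11.2
    syntax anything(name=filename) [, replace]
    preserve                           // keep the data in memory untouched
    * collect variables that look like leftover temporary variables
    capture unab leftovers : __0*
    if (_rc == 0) drop `leftovers'
    save `filename', `replace'
    restore
end
```

In foo1, for instance, the line save `data'_new could be replaced by savenotemp `data'_new.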
For reference, the passage from the technical note in [P] scalar that is criticized here reads: "One solution—and not a good one—is to place the scalar() pseudofunction around the names of all your scalars when you use them. A much better solution is to obtain the names for your scalars from Stata’s tempname facility; see [P] macro."
The following demonstration defines and makes use of four programs:
- foo1: This program creates a temporary variable using tempvar. Because the temporary variable has not yet been dropped when the dataset is saved, the new dataset will contain the variable __000000. This can happen accidentally, because a cursory reader of [P] macro (section "The tempvar, tempname, and tempfile commands") may misread the statement "Another advantage of temporary variables is that you do not have to drop them. Stata will do that for you when your program terminates, regardless of the reason for the termination."
- foo2: This program demonstrates that Stata does not detect the name conflict when it creates a temporary scalar that is internally named __000000. As a consequence, when the program tries to use the temporary scalar as intended, Stata actually takes the first value of the variable __000000 instead of the scalar. Note that a second call of foo2 will not encounter this problem, because when foo2 finishes Stata drops not only the temporary scalar but also the variable __000000.
- foo3: This program is identical to foo2 except that it wraps every reference to the temporary scalar in the pseudofunction scalar(). It does so in addition to, not instead of, obtaining the scalar's name from Stata's tempname facility. This avoids the problem encountered with foo2 and shows that the user is ill advised to use tempname instead of the pseudofunction scalar().
- foo0: To shed light on the conditions under which Stata deletes the variable __000000 from the dataset, foo0 demonstrates that any command that creates a temporary object whose name conflicts with an existing variable (such as recode, which uses a temporary variable internally) will delete the conflicting variable. The user should therefore never try to save temporary variables, because anyone using a dataset that contains such variables risks losing them accidentally.
Code:
* Demonstration:
version 11.2
set more off
* -----------------------------------------------------------------------------
cap program drop foo1
* foo1 will "accidentally" save a temporary variable:
program define foo1
    args data foovar
    sysuse `data', clear
    tempvar newvar
    gen `newvar' = abs(`foovar')
    * ... // additional things
    save `data'_new // note: tempvar __000000 will be saved!
end
* -----------------------------------------------------------------------------
cap program drop foo2
/* If a (temporary) variable __000000 exists already in the dataset, -foo2-
   will (accidentally) use the first value of this variable instead of the
   value of the temporary scalar __000000 because when creating the scalar
   Stata will not recognize that the variable __000000 exists already - if
   there is a variable and a scalar using the same name Stata will always
   use the variable: */
program define foo2
    args foovar
    tempname fooval
    qui sum `foovar'
    sca `fooval' = max(1,r(max))
    * ... // additional things
    /* Now change foovar dependent on the value of `fooval' (why and how is
       not important for the argument): */
    local dec = ceil(log10(`fooval'))
    di _n as res "Check: dec = " `dec' ", fooval = " `fooval' _n
    if (round(10^`dec'-1) > `fooval') local maxmi : di round(10^`dec'-1)
    else local maxmi : di round(10^(`dec'+1)-1)
    recode `foovar' (. = `maxmi')
end
* -----------------------------------------------------------------------------
cap program drop foo3
/* -foo3- will avoid accidentally using a (temporary) variable __000000 (if
   it exists) instead of the temporary scalar because the pseudofunction
   scalar() avoids the conflict of names of the variable and the scalar: */
program define foo3 // always use -scalar()- !
    args foovar
    tempname fooval
    qui sum `foovar'
    sca `fooval' = max(1,r(max))
    * ... // additional things
    /* Now change foovar dependent on the value of `fooval' (why and how is
       not important for the argument): */
    local dec = ceil(log10(scalar(`fooval')))
    di _n as res "Check: dec = " `dec' ", fooval = " scalar(`fooval') _n
    if (round(10^`dec'-1) > scalar(`fooval')) local maxmi : di round(10^`dec'-1)
    else local maxmi : di round(10^(`dec'+1)-1)
    recode `foovar' (. = `maxmi')
end
* -----------------------------------------------------------------------------
cap program drop foo0
/* This program will do nothing to the data in memory but running it will
   remove the (temporary) variable __000000 (if it exists) because the
   command -recode- will also use a temporary variable and Stata will remove
   both, the variable __000000 and the temporary variable created by
   -recode-. However, -recode- will issue an error if a (temporary) variable
   __000000 exists already and no -tempname- or -tempvar- command has been
   used: */
program define foo0
    args foovar
    tempname fooval
    gen new = `foovar'
    recode new (1=0) (0=1) // works only if tempname or tempvar has been used
    drop new
    di _n "This program did nothing but nevertheless dropped variable __000000" _n
end
* =============================================================================
clear
foo1 auto price
use auto_new, clear
/* Note that the first case of price has the value 4099 which is identical to
   the first value of (tempvar) __000000: */
di "price[1] = " price[1] " = __000000[1] = " __000000[1]
* -----------------------------------------------------------------------------
clonevar rep78_1 = rep78
clonevar rep78_2 = rep78
tab rep78, mi
foo2 rep78_1 // "wrong" result because (tempvar) __000000 exists
tab rep78_1
foo2 rep78_2 // "correct" result because __000000 no longer exists
tab rep78_2
* -----------------------------------------------------------------------------
use auto_new, clear
di "price[1] = " price[1] " = __000000[1] = " __000000[1]
* -----------------------------------------------------------------------------
clonevar rep78_1 = rep78
clonevar rep78_2 = rep78
tab rep78, mi
foo3 rep78_1 // "correct" result even if (tempvar) __000000 exists
tab rep78_1
foo3 rep78_2 // "correct" result
tab rep78_2
* -----------------------------------------------------------------------------
use auto_new, clear
di "price[1] = " price[1] " = __000000[1] = " __000000[1]
* -----------------------------------------------------------------------------
foo0 foreign
* di "price[1] = " price[1] " = __000000[1] = " __000000[1]
* -----------------------------------------------------------------------------
erase auto_new.dta