impute pmm and seed

Hendri Adriaens

Join Date: Mar 2015

Posts: 44
#1

impute pmm and seed

10 Feb 2022, 10:33

I tried specifying the seed in two different ways but still the outcome is different every time I run the example below. Any ideas how to get reproducible output?

Code:

sysuse auto, clear sort mpg replace price = . if mpg == 21 mi set mlong set seed 1234 mi register imputed price mi impute pmm price mpg headroom, add(1) knn(10) rseed(1234) mi extract 1, clear
Tags: None
daniel klein

Join Date: Mar 2014

Posts: 3805
#2

10 Feb 2022, 10:54

EDIT: I can reproduce the problem now. Looking into it ...

EDIT 2: I have looked into it and confirm that the c(rngstate), i.e., the current state of the random number generator differs after each run. This is not the case if you do not sort on mpg before. I cannot make full sense of this but it is almost certainly a bug. Add the two lines

Code:

display r(rngstate) display c(rngstate)

to your code and report this issue to [email protected]

EDIT 3: I am using Stata 16.1.

EDIT 4:

Here is a slightly edited reproducible example that illustrates the problem:

Code:

forvalues run = 1/5 { sysuse auto, clear sort mpg replace price = . if mpg == 21 mi set flong mi register imputed price mi impute pmm price = mpg headroom, add(1) knn(1) rseed(1234) di r(rngstate) di c(rngstate) summarize price }

Note that even drawing from the 1 nearest neighbor results in different imputed values.

Last edited by daniel klein; 10 Feb 2022, 11:29.
Comment
Hendri Adriaens

Join Date: Mar 2015

Posts: 44
#3

10 Feb 2022, 12:36

Ok, thanks for testing and confirming the bug! I'm running stata 17 btw. Thanks for the edited example. I will submit it to tech support.
Comment
Hendri Adriaens

Join Date: Mar 2015

Posts: 44
#4

11 Feb 2022, 12:18

Tech support replied "The different results are produced by ties being randomly broken by -sort mpg-; therefore, -set sortseed- should be added before -sort mpg-." Setting the sortseed indeed solved the issue in the minimal example, and also in my actual code, even though I'm not doing any sorting in my actual code. So, my issue seems solved, but not yet sure why.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3805
#5

11 Feb 2022, 14:28

With all due respect, that is a terrible answer! Edit: At least it is insufficient.

First, it does in not explain why the imputed values of price depend on the (initial) sort order of mpg. I do not think they should. Second, it does not explain why the state of the random number generator after mi changes with every run, when the seed is set after sorting on mpg but before (or at the time of) mi impute. Third, the documentation of mi impute implies that setting the seed (not the sortseed) is sufficient to yield reproducible results. It clearly is not. Fourth, the sortseed should, in my view, never ever be needed outside of purely technical testing and certification; admittedly, this last point is rather subjective.

I would either reply to tech-support pointing them to this thread or open a new thread entitled: Bug in mi impute pmm?

Last edited by daniel klein; 11 Feb 2022, 14:51. Reason: Edit: some statements were not true; I will get back tomorrow. I still do not think that the imputed values should depend on the sort order of the data.
Comment
Hendri Adriaens

Join Date: Mar 2015

Posts: 44
#6

11 Feb 2022, 16:50

The answer was indeed not fully satisfying to me either. I have pointed tech support to this thread in the first email. According to tech support it isn't a bug. But as far as I can tell at this point, at least it is getting very close to an undocumented feature. I have no clue about the inner workings of impute pmm but it almost seems from this info as though it is doing some sorting internally (because my actual code doesn't do any sorting explicitly), which apparently depends on sortseed. That is the only way I can interpret the info that rseed or set seed isn't enough for reproducible results.

And on a side note, irrespective of whether sorting is actually necessary at all: why sort that which cannot be sorted? Why not leave the relative order of the same values untouched? And I'm also wondering why there seem to be multiple random number streams in Stata. That doesn't make reproducible results easier.

Anyway, so many questions. The only thing I can say now is that I'm happy I can continue with my work with the set sortseed fix/workaround.
Comment
Hendri Adriaens

Join Date: Mar 2015

Posts: 44
#7

11 Feb 2022, 17:13

New feedback from tech support: a merge very early in the program was the culprit. It does a sorting internally (and apparently in a random fashion) which then influences the imputation many many lines further in my code. Weird stuff especially as I don't see the need for pmm to use any random number stream in the first place.

At least we can say this is not fully documented.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3805
#8

12 Feb 2022, 02:02

Originally posted by Hendri Adriaens View Post

New feedback from tech support: a merge very early in the program was the culprit. It does a sorting internally (and apparently in a random fashion) which then influences the imputation many many lines further in my code. Weird stuff especially as I don't see the need for pmm to use any random number stream in the first place.

This answer is just as unsatisfactory as the initial answer. I am glad that you consider your problem solved. I am not convinced, yet.

Contrary to my earlier statement (which I have now edited out), the problem is not specific to pmm. We observe similar behavior using regress as an imputation method. Now, both do involve randomness when drawing the parameters \(\beta_\star\) and \(\sigma_\star^2\) from their posterior distribution. pmm involves an additional random step when one out of k nearest neighbors needs to be picked.

For simplicity, let's stick with the regression method. I do not understand why the imputed values depend on the sort order of the data. The imputation steps are essentially the following:

1. Run a regression on the observed data to obtain \(\hat{\beta}\) and \(\hat{\sigma}^2\)
2. Draw \(\beta_\star\) and \(\sigma_\star^2\) from the posterior of step 1.
3. Draw an imputed value from \(\mathcal{N}\left(\beta_\star, \sigma_\star^2\right)\)
4. Repeat steps 2 and 3 m times.

The observed data does not change from run to run of our example. Even if the sort order of the observed data changed, the parameters \(\hat{\beta}\) and \(\hat{\sigma}^2\) from the regression model in step 1 should still be the same in each run. In fact, I do not see how randomness is involved here at all. Given that the parameters from step 1 should be stable, the draws in step 2 should come from the same posterior in each run. These draws are (pseudo) random. However, I do not see why they depend on the sort order of the (observed) data. In my understanding, the draws should yield the same results given the same initial RNG state. I cannot see why step 3 depends on sort order either. I do see that the imputed values might be distributed across the missing observations according to sort order. However, that does not explain why we observe an entirely different set of imputed values for each run.

Here is the respective example code again:

Code:

clear sysuse auto quietly { keep price mpg headroom replace price = . if mpg == 21 sort mpg set seed 42 mi set wide mi register imputed price mi impute regress price = mpg headroom , add(1) double } list _1_price if missing(price)

Running the code repeatedly yields different imputed values. Note that it is not just the sort order of imputed values that differs across runs. Why?

Returning to pmm, step 3 of the imputation is more complex than in the simple regression method. I see how that might depend on sort order. We can demonstrate what I have speculated above. Try running this code repeatedly:

Code:

clear sysuse auto quietly { keep price mpg headroom replace price = . if mpg == 21 sort mpg set seed 42 mi set wide mi register imputed price mi impute pmm price = mpg headroom , add(1) knn(1) double } list _1_price if missing(price)

By drawing from only the 1 nearest neighbor, we do obtain the same set of imputed values, distributed randomly across observations. This is what I would expect for the regression-based method as well. Anyway, even if the imputed values' dependence on the sort order was statistically justified, it should technically not be up to the user to set sortseed. If you promise reproducible results by setting the seed, then the user expects reproducible results.

Another, more technical detail that I do still not understand is this: Change the above code to draw from, say, k=5 nearest neighbors. Look at the RNG state after the imputation. The RNG state differs from run to run. Why is that? I do not claim to fully understand how RNG states work, but from what I understand, the RNG state is supposed to change for each random draw. Moreover, the RNG state is not supposed to change after sorting the data. That is why there is a sortseed. I do not see why repeated runs of the same code should involve a different number of random draws to be performed.

The bottom line is that we are looking at least at misleading documentation. We might be looking at a bug. I would like StataCorp (or anyone else who is willing and able to) to address at least the following two queries:

1. Explain why the imputed values based on the regression approach depend on the sort order of the data.

2. Explain why the final RNG state differs given the same code and the same initial RNG state.

Last edited by daniel klein; 12 Feb 2022, 02:14. Reason: TeX formatting, spelling
Comment
Hendri Adriaens

Join Date: Mar 2015

Posts: 44
#9

12 Feb 2022, 02:54

Originally posted by daniel klein View Post

This answer is just as unsatisfactory as the initial answer. I am glad that you consider your problem solved. I am not convinced, yet.

Well, I'm not convinced at all yet, nor do I consider my problem solved I only got a workaround, as I use quite a lot of workarounds already to deal with problems and inconsistencies in Stata. And I've conveyed that also to tech support. I fully agree with the questions you raise here, and the fact that all of this is very curious and not documented. It's more of a time constraint that's holding me back from diving into this more I will send the link to this thread to tech support again and hope they will do something with the good questions you raised! Thanks a lot for looking into this!
Comment
Miguel Dorta (StataCorp)

StataCorp Employee

Join Date: Sep 2014

Posts: 8
#10

15 Feb 2022, 11:18

sort mpg is the culprit in obtaining different results with mi impute because duplicate values in mpg lead to different ordering of the observations before mi impute is called (see [D] sort). mi impute uses both the random-number generation and the observed data to compute imputed values. Setting the random-number seed ensures that the same sequence of random numbers is used before the command is run. But if the observations are supplied to the command in a different order, the same sequence of random numbers will potentially correspond to different observations during the computation of imputed values, which will lead to different imputed values. To relate this to the example, mpg has duplicate value of 12 in observations 1 and 2 after the sorting. However, headroom, which is also used in the imputation of price, may have values 2.5 and 3.5 in those first two observations or 3.5 and 2.5, depending on how sort breaks the ties in mpg. mi impute will use "the same" random numbers for the first two observations but these numbers may correspond to different values of headroom, which will thus produce different imputed values for price (see, e.g., step 3 in the Method and formulas of [MI] mi impute regress, where headroom values will be used to compute the mean of the normal distribution from which imputed values for price will be generated).

Below I run the same imputation code twice; In both cases the dataset is sorted by mpg; the only difference is that within group of mpg, the order of the observations has been changed. These are two possible sort orders you could get by typing just

Code:

. sort mpg

Case I:

Code:

quietly { clear sysuse auto gen int id = _n keep id price mpg headroom replace price = . if mpg == 12 sort mpg id mi set wide mi register imputed price mi impute regress price = mpg headroom, add(1) double rseed(123) } list id _1_price mpg headroom if missing(price)

Case II:

Code:

quietly { clear sysuse auto gen int id = _n replace id = 27 in 26 replace id = 26 in 27 keep id price mpg headroom replace price = . if mpg == 12 sort mpg id mi set wide mi register imputed price mi impute regress price = mpg headroom, add(1) double rseed(123) } list id _1_price mpg headroom if missing(price)

And below is the output:

Code:

. quietly { . list id _1_price mpg headroom if missing(price) +--------------------------------+ | id _1_price mpg headroom | |--------------------------------| 1. | 26 10203.1 12 3.5 | 2. | 27 7290.28 12 2.5 | +--------------------------------+ . quietly { . list id _1_price mpg headroom if missing(price) +--------------------------------+ | id _1_price mpg headroom | |--------------------------------| 1. | 26 9857.15 12 2.5 | 2. | 27 7636.27 12 3.5 | +--------------------------------+

To ensure reproducibility of the results in this case, you will need to ensure that the ordering of the data is also reproducible; see option stable in sort and Sorting with ties in [D] sort.

We updated our documentation of the rseed() option of mi impute to mention that the stable ordering of the data may also be needed to ensure reproducibility of the results. It is already in the update available today (15 Feb 2022).
2 likes
Comment
daniel klein

Join Date: Mar 2014

Posts: 3805
#11

15 Feb 2022, 14:42

Originally posted by Miguel Dorta (StataCorp) View Post

(see, e.g., step 3 in the Method and formulas of [MI] mi impute regress, where headroom values will be used to compute the mean of the normal distribution from which imputed values for price will be generated).

Indeed, this is the crucial detail that I have overlooked. Thank you very much for clarifying.
Comment
Hendri Adriaens

Join Date: Mar 2015

Posts: 44
#12

16 Feb 2022, 01:49

We don't need headroom or mpg. And we don't need to sort, as the data is sorted already. And we don't need randomized sort, as the make doesn't contain duplicates. So there is no need at all to tap into the sorting random number stream. And still uncommenting the line in the code below gives a different result.

Code:

sysuse auto, clear gen x = 1 keep make x save temp, replace sysuse auto, clear //merge 1:1 make using temp, nogenerate replace price = . if mpg == 21 mi set flong mi register imputed price mi impute pmm price, add(1) knn(1) rseed(1234) summarize price
Comment
Hendri Adriaens

Join Date: Mar 2015

Posts: 44
#13

16 Feb 2022, 02:09

Even better, try uncommenting the merge below. Note the added "set sortseed". Note again that the data is sorted already. With or without the merge, the data visually does not change ordering. Still the result of the imputation is different when we uncomment the merge, even inclusing resetting the sortseed. Considering that the imputation depends on the data ordering, which is the same throughout the code for the user, no idea what data and with which ordering impute pmm is working with.

Code:

sysuse auto, clear gen x = 1 keep make x save temp, replace sysuse auto, clear //merge 1:1 make using temp, nogenerate replace price = . if mpg == 21 mi set flong mi register imputed price set sortseed 1234 mi impute pmm price, add(1) knn(1) rseed(1234) summarize price
Comment

daniel klein

Join Date: Mar 2014
Posts: 3805

#14

16 Feb 2022, 02:57

I cannot replicate the problem. Using the code in #12 (changing to wide format as it makes listing the imputed values much easier), I get

Code:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         69        6013     2766.14       3291      14500

     +----------+
     | _1_price |
     |----------|
 16. |     4010 |
 26. |     4749 |
 31. |     6165 |
 33. |     3798 |
 46. |     4010 |
     +----------+

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         69        6013     2766.14       3291      14500

     +----------+
     | _1_price |
     |----------|
 16. |     4010 |
 26. |     4749 |
 31. |     6165 |
 33. |     3798 |
 46. |     4010 |
     +----------+

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         69        6013     2766.14       3291      14500

     +----------+
     | _1_price |
     |----------|
 16. |     4010 |
 26. |     4749 |
 31. |     6165 |
 33. |     3798 |
 46. |     4010 |
     +----------+

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         69        6013     2766.14       3291      14500

     +----------+
     | _1_price |
     |----------|
 16. |     4010 |
 26. |     4749 |
 31. |     6165 |
 33. |     3798 |
 46. |     4010 |
     +----------+

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         69        6013     2766.14       3291      14500

     +----------+
     | _1_price |
     |----------|
 16. |     4010 |
 26. |     4749 |
 31. |     6165 |
 33. |     3798 |
 46. |     4010 |
     +----------+

for 5 runs.

Edit: summarizing is pointless in wide format. Here are results for the original flong format but restricting the summary to the imputed dataset. Using

Code:

summarize price if _mi_m

I obtain

Code:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    5913.905    2704.963       3291      14500

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    5913.905    2704.963       3291      14500

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    5913.905    2704.963       3291      14500

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    5913.905    2704.963       3291      14500

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    5913.905    2704.963       3291      14500

Note that the results are stable across runs.

Now, merge internally sorts the data. Even if it did not, we could never guarantee that the sort order after merge corresponds to the sort order before merge. How could we? merge could add observations from the using-file. As I understand it, mi should yield reproducible results given the same initial sort order (as in the example above). Also, the way I understand it

Code:

set sortseed 1234
mi impute pmm price, add(1) knn(1) rseed(1234)

is supposed to be pointless. You will have to establish a stable sort order before entering the mi machinery. That is, you want something like

Code:

[...]
sort varlist , stable
mi impute ...

Last edited by daniel klein; 16 Feb 2022, 03:03.

Comment

Hendri Adriaens

Join Date: Mar 2015
Posts: 44

#15

16 Feb 2022, 03:48

Originally posted by daniel klein View Post

I cannot replicate the problem.

Please find the log of the following code attached.

Code:

log using "log.txt", text replace

sysuse auto, clear
gen x = 1
keep make x
save temp, replace

sysuse auto, clear
//merge 1:1 make using temp, nogenerate
replace price = . if mpg == 21
mi set flong
mi register imputed price
set sortseed 1234
mi impute pmm price, add(1) knn(1) rseed(1234)
summarize price

sysuse auto, clear
gen x = 1
keep make x
save temp, replace

sysuse auto, clear
merge 1:1 make using temp, nogenerate
replace price = . if mpg == 21
mi set flong
mi register imputed price
set sortseed 1234
mi impute pmm price, add(1) knn(1) rseed(1234)
summarize price

log close

Attached Files

log.txt (3.8 KB, 1 view)

Announcement