  • #31
Originally posted by John Eiler
    Sometimes real and valid data contains duplicates, and the only way to get reproducible results is to add an artificial, but unique variable just for sorting purposes.
    Could you provide an example?



    • #32
I might get back when John Eiler provides an example that demonstrates how results depend on the sort order of a dataset with duplicates (which, by the way, are always reducible; this is what fweights are for). That said, I feel this is getting a bit off-topic here.


      Otherwise, these are my final thoughts here.

      In a nutshell, the "problem" is this:

      Code:
      clear all
      
      set seed 42
      display rnormal(0, 1) // #1
      display rnormal(1, 1) // #2
      
      set seed 42
      display rnormal(1, 1) // #3
      display rnormal(0, 1) // #4
In the context of mi, one might expect that #1 == #4 and #2 == #3. Obviously, that does not make sense here; therefore, it should not make sense in mi either. Even though I overlooked it, this behavior is documented in the Methods and formulas section. And, arguably, we should understand the algorithms that we are using if we want to produce reproducible results.

Moreover, just as #1, #2, #3, and #4 are all equally valid draws from the respective normal distributions here, the different sets of imputed values are equally valid in mi. If, for whatever reason, you prefer #1 and #2 over #3 and #4 (or the other way round), then you must make sure to draw in the correct order. Because mi is more complicated than four lines of display rnormal(), the way to draw in a specific order should be better explained, which it now is, as far as I am concerned.
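For what it is worth, this order-dependence is not specific to Stata's rnormal(); any sequential random-number generator behaves the same way. A rough Python analogue (illustrative only, not Stata code):

```python
import random

# After a fixed seed, the n-th call consumes the n-th piece of the
# random stream, no matter which distribution parameters it is given.
random.seed(42)
a = random.normalvariate(0, 1)  # 1st draw
b = random.normalvariate(1, 1)  # 2nd draw

random.seed(42)
c = random.normalvariate(1, 1)  # 1st draw again, now with mean 1
d = random.normalvariate(0, 1)  # 2nd draw, now with mean 0

# The underlying standard-normal deviates line up by call position,
# not by the parameters: c reuses the deviate behind a, and d the
# one behind b.
print(c == a + 1)                # True
print(abs(d - (b - 1)) < 1e-12)  # True (up to rounding)
```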



      • #33
        Could you provide an example?
A survey asks: What is your age, sex, and who did you vote for in the 2020 U.S. presidential election?

You would have lots of duplicates unless you also make up a unique random identifier. I'm not here to argue; I'm just asking whether adding a unique random identifier is the difference between a valid and an invalid data set.

        To give a more realistic example, it is common to take subsets of larger data sets such that the subset might contain duplicates. Does the data set become invalid if you subset it from one with only unique rows to one that contains some duplicate rows?



        • #34
I mean, forget "valid" and "invalid". It is probably better to say that adding a unique identifier anytime the data lacks one makes the data easier to work with, and that you should use it in sorts to avoid ties.

          Not saying I totally agree with that overall philosophy (as opposed to having more defaults that favor reproducibility over speed), but that in fact appears to be Stata's position and they make the rules here.
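To make the tie-breaking point concrete, here is an illustrative sketch in Python rather than Stata (Python's sort is stable, so the shuffle stands in for the unpredictable incoming order that a prior merge, or Stata's non-stable default sort, can produce):

```python
import random

# six cars, three of which tie on mpg = 22 and two on mpg = 17
rows = [{"id": i, "mpg": m} for i, m in enumerate([22, 17, 22, 20, 17, 22])]

def order_after_sort(key):
    shuffled = rows[:]
    random.shuffle(shuffled)  # unknown incoming order
    shuffled.sort(key=key)
    return tuple(r["id"] for r in shuffled)

random.seed(1)

# Sorting on the non-unique key alone: tied rows keep their incoming
# order, so repeated runs can end up in different orders.
print(len({order_after_sort(lambda r: r["mpg"]) for _ in range(50)}) > 1)  # True

# Adding the unique id as a tiebreaker pins down a single order.
print(len({order_after_sort(lambda r: (r["mpg"], r["id"])) for _ in range(50)}))  # 1
```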



          • #35
            I think there is a misunderstanding here. You claimed that

Originally posted by John Eiler
            Sometimes real and valid data contains duplicates, and the only way to get reproducible results is to add an artificial, but unique variable just for sorting purposes
            I was not asking for examples of the first part, i.e., of how duplicates arise. I was asking for an example of how duplicates lead to irreproducible results without an artificial identifier.

            Last edited by daniel klein; 16 Feb 2022, 11:28.



            • #36
I believe this discussion has been very interesting, but it has also lost track of the original problem. I would like to tie up the loose ends.

Originally posted by John Eiler
              But for stuff like "impute pmm" that does sorts and merges inside a black box, users really depend on StataCorp for providing an option for easy reproducibility.
This statement implies that impute pmm changes the sort order internally, that users cannot know what happens, and that users cannot easily ensure reproducible results. I do not think this is true. Going back to the original example in #1, I add one line to ensure reproducible results.

              Code:
              sysuse auto, clear
              sort mpg headroom
              replace price = . if mpg == 21
              
              sort price mpg headroom // <- new line, ensures stable sort order
              
              mi set mlong
              set seed 1234
              mi register imputed price
              mi impute pmm price mpg headroom, add(1) knn(10) rseed(1234)
              mi extract 1, clear
The line that I have added merely sorts on the variables in the imputation model. Note that there is no need to sort, stable or to set sortseed. Obviously, the completed dataset(s) will not and usually cannot have the same sort order as the original dataset. However, the imputed values are reproducible. The Methods and formulas section has always explained how the imputed values depend on the sort order of the covariates, \(Z_m\). Thus, there is no black box here, either. There was a problem in that the documentation of the rseed() option was misleading. With the updated documentation (#10), this problem seems to be solved.

              The bottom line is, there is nothing wrong with mi impute. There seems to be some misconception and disagreement on how data can and should be sorted. But that is a different topic and should probably be discussed in a different thread.
              Last edited by daniel klein; 17 Feb 2022, 00:18.



              • #37
Conceptually, the reason Hendri is getting different results from mi impute pmm is the same as before: the data after merge are in a different order, and so the results with and without merge are different. However, the reason why mi impute pmm depends on the data ordering in this example is now different from the previous example because of the specifics of the PMM imputation.

                Similarly to regression imputation, PMM computes the linear prediction, but it doesn't simulate imputed values from the normal distribution with the linear prediction as the mean. Instead, it replaces missing values with randomly chosen observed price values that correspond to the smallest absolute difference (for knn=1) between the linear prediction for the incomplete observation and linear predictions for all complete observations; see Methods and formulas in [MI] mi impute pmm. Because the linear prediction is constant in this example, there are many observed values to choose from and the chosen values will depend on the order of price before mi impute pmm is called.
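A stripped-down sketch of that matching step may help. This is illustrative Python, not Stata's actual implementation, and the prices and linear predictions are made-up numbers:

```python
import random

def pmm_impute(yhat_miss, yhat_obs, y_obs, knn, rng):
    """Predictive mean matching for one missing value: pick a random
    donor among the knn complete cases whose linear predictions are
    closest to the incomplete case's linear prediction."""
    ranked = sorted(range(len(y_obs)),
                    key=lambda i: abs(yhat_obs[i] - yhat_miss))
    return y_obs[rng.choice(ranked[:knn])]

y_obs    = [4099, 4749, 3799, 9690, 6295]  # observed prices
yhat_obs = [4000, 4800, 3900, 9500, 6200]  # their linear predictions

# Normal case: the donor comes from the two nearest predictions.
rng = random.Random(1234)
print(pmm_impute(4500, yhat_obs, y_obs, knn=2, rng=rng))  # 4099 or 4749

# Degenerate case from this thread: a constant linear prediction makes
# every complete case equally close, so the chosen donor depends purely
# on the order of the data.
flat = [5000] * 5
print(pmm_impute(5000, flat, y_obs, knn=1, rng=random.Random(1)))        # 4099
print(pmm_impute(5000, flat, y_obs[::-1], knn=1, rng=random.Random(1)))  # 6295
```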

In general, the reasons why the ordering affects certain commands will be tied to their algorithms and how they compute the results, and thus will vary from one command to another. To match the results from the two examples (with and without merge), we can make sure that the data are in the original order after the merge. This can be done as follows:

                Code:
                sysuse auto, clear
                gen x = 1
                gen int id = _n
                keep make x id
                save temp, replace
                sysuse auto, clear
                merge 1:1 make using temp, nogenerate
                sort id
                replace price = . if mpg == 21
                mi set wide
                mi register imputed price
mi impute pmm price mpg headroom, add(1) knn(1) rseed(1234)
                list _1_price if missing(price)
You should now obtain the same results from the two examples. Sorting on the unique id after merge also ensures that you will obtain the same results every time you run your code. So, at least in this example, you will not need to specify sortseed.
