Difficulty Reproducing Imputed Data Despite Setting the Seed

Josh Gagne

Join Date: May 2015

Posts: 5
#1

14 May 2015, 16:51

Over a year ago, I used -mi impute mvn-, setting the seed using r(seed), to impute missing data. Now that my analysis is over, I am unable to reproduce the imputed data using the same commands and the same unimputed data that I used the first time around.

A little more info about the first time I did this:
This was the first time I had used mi impute, so I probably ran a few imputations that didn't work out before nailing down the code and getting the results I kept. This presents a slight obstacle: every time you use mi impute, the random number generator uses up some of the generated numbers, so the next time you use mi impute during that session the results will differ from those of the first run even though r(seed) was still set to the same number.

I figured this would be easy enough to overcome, just a little time consuming. However, I have run the mi impute code 40 times, each time producing different values in the imputed data, and still have not reproduced the original imputed data. I may have been much more green then, but I highly doubt I tried to impute this 40 times before getting it right.

Does anybody know why the r(seed) command isn't working to let me reproduce my imputed data? I never set c(seed), so starting a new session of Stata, or setting the seed to 123456789, should be all I need to make sure I begin from the same starting point, right?

I need to be able to reproduce this data precisely so that I can continue this project at a different institution that I am going to (the data is secure, so I can't just transport it), so any help on how I might overcome this issue is GREATLY appreciated. Thank you!
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#2

14 May 2015, 17:31

In principle this shouldn't be happening. I have two theories, though.

1. There is no r(seed) command. Hopefully you are referring to the rseed(#) option of -mi impute-. It is unlikely that misspecification of this option is the problem, though, because I think you would have gotten a syntax error rather than irreproducible results.

2. Does your do-file specify the version number? If not, and if you are now using version 14 but previously used an earlier version, then you are calling a different random number generator than was used previously, and you will not succeed in reproducing earlier results. If this is the case, you need to use version control (-help version-) to force the older random number generator to be used.
Josh Gagne

Join Date: May 2015

Posts: 5
#3

14 May 2015, 17:34

You are correct, I used rseed(#). And you are right, this is a different version number. Will give that a try. THANK YOU!
Josh Gagne

Join Date: May 2015

Posts: 5
#4

14 May 2015, 19:36

I'm not sure why but this has not been the solution I thought it would be.

Here is my code:
use "dataset"
mi set ...
mi register imputed ...
mi register regular ...
mi register passive ...
version #: mi impute mvn ..., add(#) rseed(#)

I also tried:
version #: set seed #
version #: mi impute mvn ...

Any ideas?
Richard Williams

Join Date: Apr 2014

Posts: 4941
#5

14 May 2015, 20:02

I would have expected your solutions to work. Are you sure the data set is exactly the same as it was a year ago?

Do you have Stata 13, or can you get access to it?

I am guessing something is not exactly the same as when you did it a year ago. I suppose it is also possible that there is a bug that keeps the seeds from working correctly.

EDIT: I just ran a few old mi commands under version control and they worked perfectly. Without version control the results were a little different.

Last edited by Richard Williams; 14 May 2015, 20:11.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Andrew Lover

Join Date: Apr 2014

Posts: 182
#6

14 May 2015, 20:04

I would try to put -version 13- (or 12, or whatever the old analysis was) at *the header of the .do file.* This way, everything that follows would be the 'correct' version.

Also, seed(#) only starts with your seed value; if you run multiple test iterations within the same .do file you'll get different results, as you're pulling from different parts of your random sequence; I've been burned on this issue before.

I'd suggest running this in the header:

Code:

version xx
clear all
macro drop _all

and then running the .do file from the start for each iteration.
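
To illustrate the point about the seed only fixing the starting point, here is a minimal sketch using plain -runiform()- draws rather than -mi impute- (an assumption, so that it runs without any data; the seed value and variable names are arbitrary). It is meant to be run as a do-file. A second draw made without resetting the seed continues along the random sequence, while resetting the seed reproduces the first draw.

```stata
clear
set obs 5

set seed 12345
generate double u1 = runiform()   // first draw after setting the seed

generate double u2 = runiform()   // seed not reset: continues along the stream

set seed 12345
generate double u3 = runiform()   // seed reset: starts over from the same point

assert u1 == u3                   // identical to the first draw
assert u1 != u2                   // the draw made without a reset differs
list u1 u2 u3
```

The same logic is why rerunning only part of a do-file after some test runs can give different numbers from a clean run from the top.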

Last edited by Andrew Lover; 14 May 2015, 20:16. Reason: Clarified some points.

____________________________________________________
Assistant Professor, Department of Biostatistics and Epidemiology
School of Public Health and Health Sciences
University of Massachusetts- Amherst
Richard Williams

Join Date: Apr 2014

Posts: 4941
#7

14 May 2015, 20:14

Quoting Josh Gagne (#1): "I need to be able to reproduce this data precisely so that I can continue this project at a different institution that I am going to (the data is secure, so I can't just transport it), so any help on how I might overcome this issue is GREATLY appreciated. Thank you!"

As a sidelight, it seems odd to me that you can't transfer the unimputed data but you can transfer the imputed data.

Do you have the actual log files from when you did it before? Again I continue to suspect at least some minor difference.

wbuchanan

Join Date: Mar 2014

Posts: 1361
#8

15 May 2015, 04:03

Not sure how much this could affect things, but are you doing this on the same underlying architecture? For example, maybe the work was originally done on a box with an x86 or SPARC chipset but you're now using a machine with an x86_64 chipset.
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#9

15 May 2015, 08:25

Your -version- command is incorrect. If you read the help for -version- it says that -version- by itself will reset the command interpreter, but does not change the random number generator. To change the random number generator, you need to specify the -user- option. Thus:

Code:

use "dataset"
mi set ...
mi register imputed ...
mi register regular ...
mi register passive ...
version #, user: mi impute mvn ..., add(#) rseed(#)

Sorry, I should have been clearer in my original post.
Josh Gagne

Join Date: May 2015

Posts: 5
#10

15 May 2015, 13:27

I want to give you an update and also clear up some things (below) in response to everyone's much-appreciated insights. The big news I've found so far is that my dataset is not sorted the same way it had been the first time around, and this evidently makes a difference. So I'm sorting by the _mi_id from the prior file. Unfortunately, still no solution.

Also, for some reason, the number of iterations I run is no longer impacting the results. I've started 2 sessions in Stata. In one session I run the code I showed above (the only difference being -version 12, user- instead of -version 12-). For the other session I copy the code used in the first session, paste it into a new .do file, and then insert (after use "....dta", clear) the following:

merge 1:1 .... using "prior data.dta", keepusing(_mi_id)
sort _mi_id
drop _mi_id

^This is the only thing I changed. The values are different between the 2 sessions, but more odd is that in the second session I get the same imputed values no matter how many times I run the code. Even if the code I run before it uses a different seed value, no seed value, or even one less variable to impute. I get new values each time I rerun the code in the first session (for the record, it doesn't matter whether the previous run(s) use different seeds or impute a different number of variables, it just matters which iteration it is).

---------------------

Richard:

Your point about the data set being different got me thinking: while all the variables are the same and they all take the same values for each individual, the individuals are not in the same order. Turns out that the sort of the data gets you different results and, even more strangely to me, different means for the imputed variables. It must generate the numbers by individual, not variable.

Regarding the transfer of data: I'm not transferring any data. I'm testing my log files to make sure they will correctly reproduce the data, as I will be taking only the log files with me to my new institution.

---------------------

Andrew:

If I specify -version 12, user- it doesn't make a difference whether I put it at the header or before the command. Though I'm no longer sure what exactly specifying -version 12- accomplishes, I do know that doing this before the command alters results (compared to not specifying) whereas putting it at the header gives me the same results as I get without specifying the version.

Your point about seed value is very true and it has been a major pain for testing which avenue to take since I need to start a fresh session of Stata each time.

---------------------

Clyde:

Thank you, I've corrected the -version- command. Somehow not including "user" still had an impact on the way the imputation ran. Not sure why...?
Richard Williams

Join Date: Apr 2014

Posts: 4941
#11

15 May 2015, 17:38

If the data are sorted differently then who knows what other changes have occurred in the past year. Maybe new variables were computed, some variables were recoded, or whatever. If you don't have the exact same file you had a year ago then it may be impossible to reproduce the exact same imputations.

Instead of trying to get the current version of the data sorted correctly, I think you should instead do

mi extract 0

This should get you the original data back. Then all these tricks you are trying will hopefully work right. i.e. if you preface your commands with -version 12- I think you should be able to reproduce the mi files you got before. (Assuming you have the exact same commands you did before.) I find the documentation for the user option a little confusing -- I didn't need it when I cloned old results -- but go ahead and use it if necessary.

If that does work -- you may still have a problem if you can't get the exact same input file at your new place. You may want to compare the old file you re-extracted with the file you currently have and see if there are other differences besides the sorting, e.g. has the number of cases changed, have new vars been created, do the descriptive statistics not match up? These would all be indications that something besides sorting got changed in the last year.
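
A rough sketch of that comparison, using the auto dataset shipped with Stata as a stand-in for the secure data (an assumption; the imputation model, the seed, and the sort key -make- are purely illustrative and not the original analysis). It imputes, pulls m=0 back out with -mi extract 0-, and uses -cf- to check the result against the file saved before imputation.

```stata
sysuse auto, clear
sort make                          // make uniquely identifies cars in auto
tempfile original
save `original'

mi set wide
mi register imputed rep78          // rep78 has missing values in auto
mi impute regress rep78 mpg weight, add(5) rseed(12345)

mi extract 0, clear                // recover the unimputed (m=0) data
sort make                          // same order as the saved file
cf _all using `original', verbose  // exits with an error if any variable differs
```

With real data, the temporary file would be replaced by the current copy of the input file, and the sort key by whatever uniquely identifies a case.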

If, no matter what you do, you can't reproduce exactly -- do you really really really need to? One set of imputations should be as good as another. If a different set of imputations gives you radically different results you may want to wonder if you have made other mistakes.

Anyway, start with mi extract 0 and then take it from there.

Andrew Lover

Join Date: Apr 2014

Posts: 182
#12

15 May 2015, 22:07

Out of curiosity, how large are the differences you're seeing?

Josh Gagne

Join Date: May 2015

Posts: 5
#13

17 May 2015, 20:41

The data sets are identical with regard to number of variables and their names, values, descriptives, and cases. In other words, doing -sum- on either data set provides identical results. The only differences are the order in which the variables are listed and, evidently, the sort of the observations. Given that the mi command provides an order to the variables being imputed and this order appears to be followed judging by the output, I don't think the variable list order should matter. However, Richard makes a good point that I need to make sure I can recreate the data with this code going from the original file as a way of cross-referencing the code. I will check that tomorrow.

The differences are minor, as one would expect. The back-up plan is to redo the analyses which is acceptable since the results will be similar, but as that is quite involved I want to do my best not to resort to that.

Thanks for the ideas!
Fabian Ochsenfeld

Join Date: Mar 2015

Posts: 2
#14

14 Nov 2015, 10:12

I think I encountered the same problem. Although I included the rseed() option in my mi impute command, running the exact same code on the exact same raw data produced different imputed values each time. The problem was resolved once I added a sort command immediately before the mi impute command. With the sort command, each run exactly reproduced the results of the previous runs.
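
One possible explanation, sketched below with plain -runiform()- draws instead of -mi impute- (an assumption, to keep it self-contained; the id variable and seed are hypothetical): random numbers are consumed in observation order, so the same seed applied to differently sorted data pairs different draws with each case. Sorting on a key that uniquely identifies observations, checked here with -isid-, pins the order down before the seed is used.

```stata
clear
set obs 5
generate id = _n

isid id                            // confirm id uniquely identifies observations
sort id
set seed 2015
generate double u_a = runiform()   // draws assigned in ascending id order

gsort -id                          // same data, reversed order
set seed 2015
generate double u_b = runiform()   // same seed, draws assigned in descending id order

sort id
list id u_a u_b
assert u_a[1] == u_b[5]            // the first draw went to id 1, then to id 5
assert u_a[5] == u_b[1]            // the fifth draw went to id 5, then to id 1
```

If the sort key has ties, -sort- may order the tied observations differently from run to run unless the -stable- option is given, so a unique key is the safer choice.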
Rajini Nagrani

Join Date: Jun 2018

Posts: 13
#15

12 Jul 2018, 04:03

Hello,

I have a rather unusual problem. I have a do-file with regression commands written for a complete-case analysis. I then ran multiple imputation with the same regression commands, but with the mi prefix. (I ran the complete-case analysis before the multiple imputation as part of a sensitivity analysis.) All this was running completely fine in Stata 14.

But now, when I run it in Stata 15, the complete-case commands run fine but the multiple imputation produces different results on each run. After reading the above posts, I used version control and also tried sort, but the multiple imputation still produced different results each time.

Finally, I used version control and removed the complete-case regression commands from the do-file, then ran the multiple imputation commands, and the results were reproducible.

I don't know how the presence of the complete-case regression commands was affecting the reproducibility of the multiple imputation.

Can someone please help me understand this problem?

Thanks in advance