CUSUM with seed giving slightly different answers each time

Jude Holmes

Join Date: Jul 2024

Posts: 2
#1

18 Jul 2024, 04:32

Hello Statalist,

I am new to Stata and am seeing unexpected changes in my results when I rerun code that sets a seed and then calls cusum.

I am trying to create a simple simulation in these steps:
1. create binary data with a probability using a seed
2. run the CUSUM function to see the result

Results should be the same on every run because of the seed, but they aren't. The CUSUM summary statistics change slightly each time. Standard deviations noted from successive runs are 3.5292196, 3.5285664, and 3.5287592: very small changes(!).

Code:

clear
graph drop _all

* set seed and obs
set seed 1236
set obs 4849

// RUN-IN: binary outcome generator
gen outcome1 = cond(runiform() < 0.0275, 1, 0)
gen time = runiform() * 60
sort time
replace time = round(time)

// CUSUM
cusum outcome1 time, generate(cs_temp1) nograph

* Calculate standard deviation of the CUSUM values
summarize cs_temp1
gen sigma = r(sd)
noi display sigma

I have confirmed that outcome1, time, and seed values remain consistent when I rerun this section of code multiple times.
Having reread the cusum and set seed documentation, I believe I am setting the seed correctly (the choice of seed might be improved!) and that cusum should be repeatable with a seed.

Question: Is this slight variation in the output expected for a reason I am unaware of, or am I using the seed wrongly with the cusum chart?

Note: the change in sd is small enough that I could "run" with this as is, but as I'm new to Stata, I wanted to make sure my code is doing what I think it is(!) and this is an unexpected outcome which I couldn't explain.
George Ford

Join Date: Aug 2014

Posts: 3121
#2

18 Jul 2024, 07:46

It's the rounding.
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#3

18 Jul 2024, 10:18

I don't think it's the rounding. There are many duplicate values of the variable time, and -sort time- is therefore an indeterminate command. As always, given an indeterminate -sort-, Stata randomizes the order within the sort key. In other words, while the data are always sorted on time, the order of observations having the same value of time is randomized. This results in the 0's and 1's of outcome1 being scattered in different orders each time the code is run. Important: setting the random number generator seed does not fix the way that this -sort- randomization is done. To do that you must also specify the value of the sort seed.

Code:

clear
graph drop _all

* set seed and obs
set seed 1236
set obs 4849

* stabilize indeterminate sorts
set sortseed 7890

// RUN-IN: binary outcome generator
gen outcome1 = cond(runiform() < 0.0275, 1, 0)
gen time = runiform() * 60
sort time
replace time = round(time)

// CUSUM
cusum outcome1 time, generate(cs_temp1) nograph

* Calculate standard deviation of the CUSUM values
summarize cs_temp1
gen sigma = r(sd)
noi display sigma

will produce deterministic results.

Of course, what you will have done in this way is to simply pick one arbitrary result from the true range of possibilities. The more fundamental problem here is in the suitability of this kind of data for this kind of analysis. The actual, theoretical results of -cusum- in the setting of an indeterminate sort order are inherently indeterminate. The real solution is to make the order deterministic in some way that reflects natural properties of the data: additional variable(s) other than time should be relied on to distinguish observations having the same value of time, rather than fixing an altogether arbitrary order.
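A minimal sketch of that last suggestion, not a definitive fix: keep a tiebreaker alongside the rounded time and hand -cusum- an ordering variable that has no ties. The names time_raw, obs_id, and order are hypothetical and do not appear in the original code. Here the unrounded draw merely stands in for the "additional variable"; in real data the tiebreaker should be something meaningful, such as a recorded sequence number.

```stata
clear
set seed 1236
set obs 4849

gen outcome1 = cond(runiform() < 0.0275, 1, 0)
gen time_raw = runiform() * 60     // hypothetical tiebreaker: the unrounded draw
gen time = round(time_raw)
gen long obs_id = _n               // creation order, unique by construction

* Sort on time, break ties by time_raw, then by obs_id, so no ties remain
sort time time_raw obs_id
gen long order = _n

* isid exits with an error if order does not uniquely identify observations
isid order

* With a tie-free x variable, the ordering used by cusum is fully determined
cusum outcome1 order, generate(cs_temp1) nograph
summarize cs_temp1
display r(sd)
```

Because order has no duplicate values, this should give the same standard deviation on every run without relying on the sort seed.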
2 likes
George Ford

Join Date: Aug 2014

Posts: 3121
#4

18 Jul 2024, 11:46

Clyde may be on to something, but I get the same answers every time if I drop the round.
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#5

18 Jul 2024, 13:34

Well, if we eliminate the rounding and instead create time as:

Code:

gen time = runiformint(0, 60)

then there is no rounding anywhere. I believe the reason the rounding seems relevant is that it is the rounding that is creating (most of) the duplicate values of time. If we take the original code and eliminate the rounding, then, as it turns out, there are only two observations bearing the same value of time, and, as it happens, these both have the same outcome value. Consequently the randomized order of those two observations does not change the sequence of values of cs_temp1. But when, as with the rounding, you have many duplicate values of time, some of those sets will have different values of outcome1, which then leads to a different cs_temp1 sequence and, hence, a different result. So, yes, the rounding is the problem, but it creates the problem by virtue of creating many duplicate values of time.
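To see how many tied groups could actually change the sequence, one can count the values of time whose observations include both a 0 and a 1. This is only an illustrative sketch: the variables mixed and one_per_time are hypothetical names introduced here, and the data are regenerated with the runiformint() version of time.

```stata
clear
set seed 1236
set obs 4849
gen outcome1 = cond(runiform() < 0.0275, 1, 0)
gen time = runiformint(0, 60)

* Within each value of time, order by outcome1 so that the first and last
* observations hold the group's minimum and maximum outcome
bysort time (outcome1): gen byte mixed = outcome1[1] != outcome1[_N]

* Tag one observation per distinct value of time, then count mixed groups
egen byte one_per_time = tag(time)
count if one_per_time & mixed
display "time values containing both outcomes: " r(N)
```

Only the groups flagged as mixed can produce a different cs_temp1 sequence when their internal order is shuffled; groups in which every outcome is the same cannot.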
1 like
Jude Holmes

Join Date: Jul 2024

Posts: 2
#6

19 Jul 2024, 04:36

Dear Clyde and George,

Thank you so much for your input. Adding the sortseed fixed my issue.

Code:

* stabilize indeterminate sorts
set sortseed 7890

And you are both correct: my choice to round time was causing a headache. I'll be updating how I generate time, using:

Code:

gen time = runiformint(0, 60)