  • CUSUM with seed giving slightly different answers each time

    Hello Statalist,

    I am new to Stata and am seeing unexpected changes in my results when I rerun code that sets a seed and then calls the cusum command.

    I am trying to create a simple simulation in these steps:
    1. create binary data with a probability using a seed
    2. run the CUSUM function to see the result

    Because of the seed, the results should be the same every time I run this, but they aren't. The CUSUM summary statistics change slightly each time. The standard deviations from three successive runs were 3.5292196, 3.5285664, and 3.5287592: very small changes(!).

    Code:
    clear
    graph drop _all
    
    * set seed and obs
    set seed 1236                                  
    set obs 4849
                                      
    // RUN-IN: binary outcome generator
    gen outcome1 = cond(runiform() < 0.0275, 1, 0)
    gen time = runiform() * 60
    sort time
    replace time =round(time)
    // CUSUM
    cusum outcome1 time, generate(cs_temp1) nograph
    
    * Calculate standard deviation of the CUSUM values
    summarize cs_temp1
    gen sigma = r(sd)
    noi display sigma
    I have confirmed that outcome1, time, and seed values remain consistent when I rerun this section of code multiple times.
    Having reread cusum and seed documentation, I believe I am implementing the seed correctly (choice of seed might be improved!) and cusum should be repeatable using a seed.


    Question: Is this an expected slight change in the code that I am unaware of, or am I using the seed wrongly with the cusum chart?

    Note: the change in sd is small enough that I could "run" with this as is, but as I'm new to Stata, I wanted to make sure my code is doing what I think it is(!) and this is an unexpected outcome which I couldn't explain.

  • #2
    it's the rounding.



    • #3
      I don't think it's the rounding. There are many duplicate values of the variable time, and -sort time- is therefore an indeterminate command. As always, given an indeterminate -sort-, Stata randomizes the order within the sort key. In other words, while the data are always sorted on time, the order of observations having the same value of time is randomized. This results in the 0's and 1's of outcome1 being scattered in different orders each time the code is run. Important: setting the random number generator seed does not fix the way that this -sort- randomization is done. To do that you must also specify the value of the sort seed.

      Code:
      clear
      graph drop _all
      
      * set seed and obs
      set seed 1236                                  
      set obs 4849
      
      * stabilize indeterminate sorts
      set sortseed 7890
                                        
      // RUN-IN: binary outcome generator
      gen outcome1 = cond(runiform() < 0.0275, 1, 0)
      gen time = runiform() * 60
      sort time
      replace time =round(time)
      // CUSUM
      cusum outcome1 time, generate(cs_temp1) nograph
      
      * Calculate standard deviation of the CUSUM values
      summarize cs_temp1
      gen sigma = r(sd)
      noi display sigma
      will produce deterministic results.
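
      One way to check that claim is to run the block twice and compare the two standard deviations. This is only a sketch: the scalar names sd1 and sd2 are hypothetical, and the final assert passes silently only if the two runs really do agree.

      ```stata
      forvalues run = 1/2 {
          clear
          set seed 1236
          set sortseed 7890
          set obs 4849

          gen outcome1 = cond(runiform() < 0.0275, 1, 0)
          gen time = runiform() * 60
          sort time
          replace time = round(time)

          cusum outcome1 time, generate(cs_temp1) nograph
          quietly summarize cs_temp1

          * keep this run's standard deviation in a scalar (hypothetical names sd1, sd2)
          scalar sd`run' = r(sd)
      }

      * stops with an error if the two runs differ
      assert scalar(sd1) == scalar(sd2)
      display scalar(sd1)
      ```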

      Of course, what you will have done in this way is to simply pick one arbitrary result from the true range of possibilities. The more fundamental problem here is in the suitability of this kind of data for this kind of analysis. The actual, theoretical results of -cusum- in the setting of an indeterminate sort order are inherently indeterminate. The real solution is to make the order deterministic in some way that reflects natural properties of the data: additional variable(s) other than time should be relied on to distinguish observations having the same value of time, rather than fixing an altogether arbitrary order.
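
      A minimal sketch of that idea follows. It assumes the order in which the observations were generated is an acceptable tie-breaker, which is purely illustrative for simulated data; obs_id and seq are hypothetical names, not part of the original code. Because -cusum- takes a single x variable, the fully determined order is encoded in one unique variable and passed in place of time.

      ```stata
      clear
      set seed 1236
      set obs 4849

      gen outcome1 = cond(runiform() < 0.0275, 1, 0)
      gen time = round(runiform() * 60)

      * obs_id: hypothetical tie-breaker (here, simply the order of generation)
      gen long obs_id = _n

      * time and obs_id together identify each row, so this sort has no ties
      isid time obs_id
      sort time obs_id

      * seq holds the fully determined order in a single unique variable
      gen long seq = _n

      cusum outcome1 seq, generate(cs_temp1) nograph
      summarize cs_temp1
      display r(sd)
      ```

      With real data, the tie-breaker would be whatever naturally orders events recorded at the same time, such as a case number or an admission sequence.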



      • #4
        Clyde may be on to something, but I get the same answers every time if I drop the round.



        • #5
          Well, if we eliminate the rounding and instead create time as:
          Code:
          gen time = runiformint(0, 60)
          then there is no rounding anywhere. I believe the reason the rounding seems relevant is that it is the rounding that creates (most of) the duplicate values of time. If we take the original code and eliminate the rounding, then, as it turns out, only two observations share the same value of time, and, as it happens, both have the same outcome value. Consequently, the randomized order of those two observations does not change the sequence of values of cs_temp1. But when, as with the rounding, there are many duplicate values of time, some of those sets of tied observations will contain different values of outcome1, which leads to a different cs_temp1 sequence and hence a different result. So, yes, the rounding is the problem, but it causes the problem by creating many duplicate values of time.
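
          The number of ties can be inspected directly. The sketch below reuses the original seed and draws; time_r is a hypothetical name for the rounded copy, so that the counts before and after rounding can be compared side by side.

          ```stata
          clear
          set seed 1236
          set obs 4849

          gen outcome1 = cond(runiform() < 0.0275, 1, 0)
          gen time = runiform() * 60
          gen time_r = round(time)   // hypothetical name for the rounded copy

          * how many observations share a value, before and after rounding
          duplicates report time
          duplicates report time_r

          * isid fails when time_r does not uniquely identify observations;
          * capture traps the error so the return code can be shown
          capture isid time_r
          display _rc
          ```

          A nonzero return code from -isid- means the variable has ties, so any sort on it alone is indeterminate.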




          • #6
            Dear Clyde and George,

            Thank you so much for your input. Adding the sortseed fixed my issue:
            Code:
             
             * stabilize indeterminate sorts
             set sortseed 7890
            And you are both correct: my choice to round time was causing the headache. I'll be updating how I generate time, using:

            Code:
             
             gen time = runiformint(0, 60)
