  • Random sample not reproducible despite set seed

    Dear everyone,

    I want to draw a random sample of 500 observations that should be reproducible when running the do-file again and again. Despite the set seed command, I always get slightly different results.

    Code:
    egen company_tag = tag(company_uuid) //pick one observation to represent each company
    set seed 77
    randomtag if company_tag, count(500) gen(t) //select a random sample from the tagged obs. this
    bysort company_uuid: egen select = total(t) //keep all observations from picked companies
    sum no_org if select == 1 //500 observations
    keep if select == 1
    drop count_investor count_investor_index
    by investor_uuid, sort: gen count_investor = 1 if _n == 1
    by investor_uuid, sort: gen count_investor_index = _n
    sum count_investor if ba == 1 //901 BAs
    sum count_investor if ba == 1 & investor_type == "individual" //810 BAs
    sum count_investor if vc == 1 //173 VCs
    sum no_org if tag == 2 //118 startups
    sum no_org if tag == 3 //382 startups
    Thanks a lot for helping me on that!

    Best,
    Rike

  • #2
    The problem arises because you have
    Code:
     egen company_tag = tag(company_uuid) //pick one observation to represent each company
    before you set the seed. The -egen, tag()- command picks one observation per company_uuid. It selects that one by sorting on company_uuid and then tagging the first. But since company_uuid does not identify observations uniquely (if it did, you would have no need of company_tag), the order of observations within each company_uuid is indeterminate and effectively random. So each time you run this, different observations are tagged, and everything thereafter is indeterminate.

    So you need to set the seed before you use -egen, tag()-. You also should check the code leading up to what you show us as it, too, may contain indeterminate (explicit or implicit) sorts that affect later reproducibility.
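
    A minimal sketch of one way to take the sort order out of the tagging step, using the grunfeld example data rather than the original poster's dataset. It assumes that a second variable (here year) identifies observations uniquely within each company; the variable to use in the real data is not known from this thread.

    Code:
    webuse grunfeld, clear
    * company and year together identify each observation,
    * so listing year in parentheses breaks every tie within company
    bysort company (year): gen byte company_tag = _n == 1
    * the tagged observation is now the earliest year of each company on every run
    list company year if company_tag
    Because the order within each company is fully specified, this tag does not depend on how ties happen to be resolved in a given run.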



    • #3
      Dear Clyde,

      thank you for your fast response! I put the set seed command before egen, and the results still change slightly. It might be due to the indeterminate sorts you mentioned, although I don't really know how to "spot" those.

      Especially since I have a long do-file creating my final dataset, which always ends up with the same number of observations (e.g., ventures, investors). Also, when I create count variables, I always sort the respective data beforehand, so they should always end up in the same order, shouldn't they?

      Thanks a lot for your help!
      Last edited by Rike Bruchmann; 29 Jun 2016, 10:20.



      • #4
        I agree with Clyde, this is most likely due to the egen, tag() picking different observations at each run. However, the set seed command does not alter how data is sorted. Check the sort order prior to picking the representative observation per company. See this FAQ for an explanation of what must be going on.
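
        As a rough illustration of how to spot a sort key that leaves ties, the sketch below checks whether the key identifies observations uniquely. It uses the grunfeld example data, not the dataset from the question, so the variable names are only stand-ins.

        Code:
        webuse grunfeld, clear
        * company alone does not identify observations, so -isid- exits with an error;
        * -capture- lets the do-file continue and leaves the return code in _rc
        capture isid company
        display "return code for company alone: " _rc
        * company and year together do identify observations, so this passes silently
        isid company year
        * shows how many observations share each value of the key
        duplicates report company
        A nonzero return code means that sorting on that key alone leaves the order within ties undetermined.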
        Last edited by Robert Picard; 29 Jun 2016, 10:24.



        • #5
          Here's a quick example of one way this could happen and how to spot it:

          Code:
          webuse grunfeld, clear
          gen long obs = _n
          
          bysort year: gen nobsperyear = _N
          bysort company: gen nobsperco = _N
          egen company_tag = tag(company)
          sum obs if company_tag
          A first run gives me
          Code:
          . sum obs if company_tag
          
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
                   obs |         10        95.9    56.75376         17        186
          but the next run is:
          Code:
          . sum obs if company_tag
          
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
                   obs |         10        99.3    61.24641          2        182



          • #6
            Dear Robert,

            I understand what you mean, thanks for showing this example! I guess there is no other way to randomly (but systematically) sample the same companies (and thereby all of their observations) when using set seed, right? For example, instead of a different observation being picked per company each time, could the first one always be picked?



            • #7
              No, randomtag (from SSC) will always pick exactly the same observations if the same seed is used. If you end up with different results downstream, then the state of the data was not the same when you called randomtag. There is a nifty Stata command (isid) to sort data and verify that the data is fully sorted. Adjusting the example in #5:

              Code:
              webuse grunfeld, clear
              gen long obs = _n
              
              bysort year: gen nobsperyear = _N
              bysort company: gen nobsperco = _N
              isid company year, sort
              egen company_tag = tag(company)
              sum obs if company_tag
              will always generate the same results. But even with fully sorted data, you may get different results if the number of observations has changed. To confirm that the data you feed randomtag changes at every run, put
              Code:
              save "test_order.dta"
              just before the randomtag statement. On the first run of the do-file, the state of the data will be saved to "test_order.dta". On subsequent runs, the execution will halt at the save statement because the file already exists. You can then compare the data in memory with the one on disk using:
              Code:
              cf _all using "test_order.dta", all
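
              For readers without randomtag installed, here is a hedged sketch of the same idea using only built-in commands. It is not how randomtag works internally; it only illustrates that a fully determined order plus a fixed seed gives the same companies on every run. The sample size of 5 and the variable names u, t and select are choices made for this example.

              Code:
              webuse grunfeld, clear
              isid company year, sort                    // unique key, so the order is fully determined
              by company: gen byte company_tag = _n == 1 // first year of each company
              set seed 77
              gen double u = runiform() if company_tag   // one random draw per company
              sort u company year                        // tagged obs first, in random order, no ties left
              gen byte t = _n <= 5                       // pick the first 5 companies
              bysort company: egen byte select = max(t)  // flag all observations of the picked companies
              count if select & company_tag
              assert r(N) == 5                           // exactly 5 companies were selected
              list company if select & company_tag
              Running this repeatedly should list the same companies each time, because every sort in it is on a key with no ties.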

