  • #16
    Thanks for providing the log. I am away from my computer now but will get back later. My guess is that the sort order before the merge is not identical to that after. As pointed out, that might be expected.

    • #17
      You're right, I didn't check it carefully enough, only the first few records. The order of the data is different after the merge. Still, I find the interaction with impute undesirable and poorly documented, to say the least. I would make sure that the imputation gives the same result when the seed is specified, and that it does not depend on a command that might or might not have been executed before. If the sort order is important for the algorithm, then make sure the sort order is set in a fixed way before running the algorithm, as rseed suggests reproducible results.

      • #18
        Originally posted by Hendri Adriaens
        Still I find the interaction with impute [...] poorly documented, to say the least.
        The documentation issue seems solved to me now.


        Originally posted by Hendri Adriaens
        If the sort order is important for the algorithm, then make sure the sort order is set in a fixed way before running the algorithm.
        That is debatable. Substantively, the sort order is irrelevant insofar as one set of imputed values is not better or worse than another set. Thus, from a statistical point of view, the results are valid either way. The sort order is relevant for reproducibility, but that is more of a technical issue. If mi were to fix the sort order, how should that behavior be implemented? Suppose the user of auto.dta has explicitly sorted the data on mpg. As demonstrated in the examples, that does not result in the unique identification of observations. Should mi overwrite the user's sort order?
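
        To make the auto.dta point concrete, here is a minimal sketch. It assumes only the auto.dta that ships with Stata and shows that mpg has tied values, so sorting on mpg alone leaves the order within each tied group undetermined.

        ```stata
        sysuse auto, clear
        sort mpg
        * isid exits with an error when mpg does not uniquely identify observations;
        * capture noisily displays the message without stopping a do-file
        capture noisily isid mpg
        * count how many observations share their mpg value with others
        duplicates report mpg
        ```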


        I would make sure that the imputation gives the same result when specifying the seed,
        That would actually be inconsistent with what the seed is supposed to do. Setting the seed is not supposed to guarantee reproducible results -- it is supposed to guarantee that the sequence of random numbers is reproducible. I totally get how this is frustrating but I also believe that Stata actually behaves pretty consistently here.
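
        A small sketch of that distinction, using made-up data (the variable names id, u1, u2, and u3 are only for illustration): the same seed reproduces the same sequence of draws, but which observation receives which draw still depends on the order of the data.

        ```stata
        clear
        set obs 5
        generate id = _n

        set seed 12345
        generate double u1 = runiform()

        * same seed, same order: identical draws
        set seed 12345
        generate double u2 = runiform()
        assert u1 == u2

        * same seed, reversed order: the same sequence lands on different observations
        gsort -id
        set seed 12345
        generate double u3 = runiform()
        sort id
        assert u3[1] == u1[5]
        assert u3[5] == u1[1]
        ```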

        Edit: By the way, I do not think that it is technically possible to see whether the seed has been set by the user. Put differently, other than specifying the rseed() option, it is difficult to see how setting the seed was related to mi. The seed is globally set and not specific to one do-file or one program. Thus, it is hard to see how mi's behavior should be made dependent on whether the seed was (explicitly) set.
        Last edited by daniel klein; 16 Feb 2022, 06:42.

        • #19
          Originally posted by daniel klein
          Setting the seed is not supposed to guarantee reproducible results -- it is supposed to guarantee that the sequence of random numbers is reproducible.
          From the help for "mi impute" (running the update of 15 Feb 2022): "rseed(#) sets the random-number seed. This option can be used to reproduce results." Your suggested wording indeed seems better, as it doesn't guarantee anything about reproducibility, especially since randomized sorting is also involved. The docs now contain a remark about stable sorting, but that doesn't help with the randomized sorting that is conducted by "merge", which can only be worked around using "set sortseed".
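
          A rough sketch of that workaround, assuming auto.dta and arbitrary seed values: "set sortseed" fixes how ties are broken when sorting, while "set seed" fixes the sequence of random numbers. Whether this covers every internal sort of a given command is not something this sketch establishes.

          ```stata
          sysuse auto, clear
          * fix the tie-breaking used when sorting on variables with tied values
          set sortseed 20220215
          * fix the sequence of random numbers
          set seed 12345
          * mpg has ties; the order within ties now depends on sortseed
          sort mpg
          list make mpg in 1/5
          ```

          Rerunning the block from the top should list the same cars in the same order.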

          Originally posted by daniel klein
          I totally get how this is frustrating but I also believe that Stata actually behaves pretty consistently here.
          One has to know the exact implementation of procedures to be able to get what is needed or wanted. I'm sure there are better ways to deal with this and provide more predictable behaviour. Randomized sorting, even as a default, is not something that would ever have crossed my mind.

          • #20
            Originally posted by Hendri Adriaens
            The docs now contain a remark about stable sorting, but that doesn't help with the randomized sorting that is conducted by "merge", which can only be worked around using "set sortseed".
            It does help. You need to sort (stable) after merge but before mi.
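
            A minimal sketch of that sequence. The split of auto.dta into two files, the tempfile name extra, and the choice of imputation model are all made up for illustration; the sketch also assumes that the key (make) uniquely identifies observations, so the sort leaves no ties.

            ```stata
            sysuse auto, clear

            * hypothetical setup: put some variables in a second file keyed on make
            preserve
            keep make weight length
            tempfile extra
            save `extra'
            restore
            keep make mpg rep78

            merge 1:1 make using `extra', nogenerate

            * pin the order after merge and before mi
            sort make, stable

            mi set wide
            mi register imputed rep78
            mi impute pmm rep78 mpg weight, add(5) knn(5) rseed(12345)
            ```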

            Originally posted by Hendri Adriaens
            Randomized sorting, even as default, not something that would ever cross my mind.
            Sorting is only randomized for tied values. Bill Gould had an excellent post, I believe on the old list-server but I could not track it down, explaining how this behavior can help discover bugs in algorithms that are supposed to yield statistically valid results but in fact rely on sort order -- something which should arguably never happen.

            • #21
              Originally posted by daniel klein
              Sorting is only randomized for tied values. Bill Gould had an excellent post, I believe on the old list-server but I could not track it down, explaining how this behavior can help discover bugs in algorithms that are supposed to yield statistically valid results but in fact rely on sort order -- something which should arguably never happen.
              Sure, it could help programmers in certain cases to test an algorithm. As could multiplying all data by 2. And for both cases: it shouldn't be done by default.

              • #22
                A little late here, and didn't read every single word but I believe the crux of the issue is that Stata's "sort" defaults to non-stable sorts, which many other languages also do, but I've never understood why fast sorts are the default instead of stable sorts (err... I understand why, but don't agree).

                Anyway, there are 2 standard solutions:
                1) Use a stable sort explicitly (instead of "sort", use "sort, stable"); note that "gsort" doesn't offer this.
                2) Add an extra variable for sorting that is unique, e.g. "gen id = _n". This ensures there are no ties.

                Of course, there are lots of cases where the sorts happen behind the scenes, so you can't do #1 (also, you might be using "gsort"), so the main workaround will be #2. Again, I didn't follow everything here word for word, but you might be able to fix it by adding a sort after the merge and/or making sure the merge is a true 1:1 merge.
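
                Both options can be sketched as follows, assuming auto.dta and using mpg as a sort key with ties; the variable name id is just the example name from #2.

                ```stata
                sysuse auto, clear
                * record the current order in a unique variable
                generate long id = _n

                * 1) stable sort: observations tied on mpg keep their current relative order
                sort mpg, stable

                * 2) unique tie-breaker: no ties are left for sort to break
                sort mpg id
                isid mpg id

                * the same idea works with gsort, here in descending order of mpg
                gsort -mpg id
                ```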

                FWIW, I believe Stata's philosophy here is that you ought to always organize your data so that there are no ties or duplicates when sorting and merging. I think that is not an unreasonable position, but you still end up with confusing situations like this because they won't do the checks behind the scenes and warn you about what the problem is. Stata is of course kinda well known for this basic approach to things.

                • #23
                  Originally posted by Hendri Adriaens
                  Sure, it could help programmers in certain cases to test an algorithm. As could multiplying all data by 2. And for both cases: it shouldn't be done by default.
                  I agree on not multiplying the data by 2. I do not agree so much on not breaking ties randomly. It does not only help programmers; it also helps users (me, for example) to discover that their results depend on sort order when they should not. Whether results should or should not depend on sort order is a substantive question. As pointed out, mi results are substantively valid. Leaving aside that we have not agreed on how to establish a stable sort order, what is your argument that a stable sort order would be better?

                  • #24
                    That would actually be inconsistent with what the seed is supposed to do. Setting the seed is not supposed to guarantee reproducible results -- it is supposed to guarantee that the sequence of random numbers is reproducible. I totally get how this is frustrating but I also believe that Stata actually behaves pretty consistently here.
                    I believe this is literally true but the big picture for science in general is reproducible results and there ought to always be some reasonably easy way to ensure that running the same do file gives the exact same answer EVERY SINGLE TIME. Right?

                    In the case of sorts, users can always ensure this in a reasonably straightforward way (after they learn what is happening behind the scenes and I can assure you that many people who have used Stata for years don't know that sorts are not stable by default). But for stuff like "impute pmm" that does sorts and merges inside a black box, users really depend on StataCorp for providing an option for easy reproducibility.

                    • #25
                      Thanks for your thoughts! Indeed, a stable sort isn't generally available, so your second suggestion will be added to my ever-growing list of workarounds, thanks!

                      Originally posted by John Eiler
                      but you still end up with confusing situations like this because they won't do the checks behind the scenes and warn you about what the problem is. Stata is of course kinda well known for this basic approach to things.
                      Indeed, unfortunately.

                      • #26
                        Originally posted by John Eiler
                        2) Add an extra variable for sorting that is unique, e.g. "gen id = _n". This ensures there are no ties.
                        From the pedants' corner: it should be

                        Code:
                        generate `c(obs_t)' id = _n

                        • #27
                          Originally posted by daniel klein
                          I agree on not multiplying the data by 2. I do not agree so much on not breaking ties randomly. It does not only help programmers; it also helps users (me, for example) to discover that their results depend on sort order when they should not. Whether results should or should not depend on sort order is a substantive question. As pointed out, mi results are substantively valid. Leaving aside that we have not agreed on how to establish a stable sort order, what is your argument that a stable sort order would be better?
                          Reproducibility, the reason I started this thread.

                          • #28
                            Originally posted by John Eiler
                            I believe this is literally true but the big picture for science in general is reproducible results
                            You could argue that the even bigger picture is reproducible valid results. If there is a (however slight) chance that stable sorts hide errors or produce (then reproducible) invalid results, then I would personally go for unstable sorts.

                            By the way, if the sort order is substantively relevant, then there cannot be ties. This is not about organizing data; it is about holding all relevant information in your data.

                            • #29
                              Originally posted by daniel klein
                              By the way, if the sort order is substantively relevant, then there cannot be ties. This is not about organizing data; it is about holding all relevant information in your data.
                              Sometimes real and valid data contains duplicates, and the only way to get reproducible results is to add an artificial, but unique variable just for sorting purposes -- unless an option to impose stable sorts exists, and in Stata's case the answer is "sometimes".

                              Does holding an artificial variable that you created just for sorting count as "holding all relevant information in your data"?

                              This is epistemology or something at some point, so maybe there is no correct answer. At least the documentation has been improved in this case, if nothing else.

                              • #30
                                Originally posted by John Eiler
                                Does holding an artificial variable that you created just for sorting count as "holding all relevant information in your data"?
                                I am honestly wondering about this. Is there some accepted definition of valid data that says each row must be unique, even if that just means adding a unique ID variable? Doesn't seem like a terrible definition actually.
