
  • Inconsistent (incorrect?) behavior of collapse with extended missing values?

    collapse can yield unexpected (and to my eyes, incorrect) results when all values in a group are missing:

    Code:
    clear
    set obs 3
    gen x = .
    replace x = .a in 1
    replace x = .z in 2
    replace x = .e in 3
    collapse (min) min_x = x (max) max_x = x
    The result is

    Code:
    . l
    
         +---------------+
         | min_x   max_x |
         |---------------|
      1. |    .e      .z |
         +---------------+
    The max is correct, but the min is not. The problem is that collapse sorts a temporary variable, "gen `revx' = -x", which is regular missing (.) whenever x is any missing value. Sorting by x correctly sorts extended missing values, but sorting by `revx' does not. If min/max are not intended to preserve extended missing values, then the result should be regular missing (.); if they are intended to preserve them, the result should be ".a".
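
    To see the effect described above in isolation, here is a minimal sketch. The variable name revx is only illustrative (collapse uses a temporary variable), and this does not reproduce collapse's internal code; it only shows what negation does to extended missing values.

    ```stata
    clear
    set obs 3
    gen x = .a in 1
    replace x = .z in 2
    replace x = .e in 3
    * Negating a missing value yields system missing (.), so the
    * distinction between .a, .e and .z is lost in revx
    gen revx = -x
    list x revx
    * Sorting on x itself keeps the extended codes in their order
    sort x
    list x
    ```

    Any ordering computed from revx therefore cannot tell the three observations apart, whereas sorting on x can.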

  • #2
    That's interesting. In my own work, I have never attempted to compare different missing values. While I sometimes have to distinguish different missing values when tabulating some variables, for analytic purposes, I have never had to consider whether one missing value was less than another or not. In fact, in my work, when I use extended missing values, they are assigned as values of a nominal variable, in the sense that the ordering is arbitrary and the particular extended missing value is usually chosen for mnemonic reasons. In addition, the way the -inrange()- function works, ordering among missing values is, in effect, undefined, as nothing, missing or otherwise, is ever between two missing values. So I didn't even realize that there is a sort order for missing values; I assumed that there isn't. And my first thought was to respond to you that this isn't a bug.

    But there is other evidence that Stata does have a sort order among missing values:

    Code:
    . assert .a < .e
    
    . assert .a > .e
    assertion is false
    r(9);
    Even though I never use the sort order of extended missing values myself, I can imagine some slightly far-fetched circumstances where it might be useful to do so, and perhaps in someone else's life they might not be so far-fetched. Moreover, if, as seen from above, there really is a sort order among missing values, then -collapse- ought to respect it. I've replicated your example on my own setup (Windows 7, Version 15.1 MP2).

    I recommend you report it to Stata Technical Support; I suspect they will agree that this is a bug.

    If, before they get around to fixing this, you need to do this for your work, I guess the obvious workaround is to -mvencode- these missing values as numbers in the sort order you want, and then proceed.
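
    A minimal sketch of that workaround, applied to the example from #1. The numeric codes 9000 to 9026 are hypothetical: they are chosen only to preserve the order . < .a < .e < .z, and they must be larger than every real value of x.

    ```stata
    clear
    set obs 3
    gen x = .a in 1
    replace x = .z in 2
    replace x = .e in 3
    * Replace each missing code with a hypothetical numeric sentinel
    mvencode x, mv(.=9000 \ .a=9001 \ .e=9005 \ .z=9026)
    collapse (min) min_x = x (max) max_x = x
    * Map the sentinels back to the missing codes they stand for
    mvdecode min_x max_x, mv(9000=. \ 9001=.a \ 9005=.e \ 9026=.z)
    list
    ```

    One caveat: in a group that mixes missing and nonmissing values, the encoded maximum would be a sentinel rather than the largest nonmissing value, so this sketch only suits groups in which every value is missing.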



    • #3
      I have never had to consider whether one missing value was less than another or not.
      Interesting. I came upon missing values because I needed to know how Stata codes them internally: they are just very large doubles (8.9e307 and so on). I have since found them useful, and "help missing" notes that "all nonmissing numbers < . < .a < .b < ... < .z", so I often do think about their order.

      In this case, I mainly wanted to preserve some data when aggregating. So something like

      Code:
      collapse a (min) min_a = a (max) max_a = a, by(group)
      replace a = cond(min_a == max_a, min_a, .) if mi(min_a)
      If every value in the group is missing, I assign a missing value; if all values in the group are the same extended missing value, I preserve the extended missing value. There are other ways to do this, but I thought the above was a nice and compact way to go about it. You get a lot of false positives, though, because of the bug.
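
      One of those other ways, sketched here with toy data, relies only on sorting the variable itself, which handles extended missing values correctly. The variable name a_code and the data are made up for illustration.

      ```stata
      clear
      input group a
      1 .a
      1 .a
      2 .a
      2 .b
      3 1
      3 .c
      end
      * Within each group, sorted by a, a[1] is the smallest value and
      * a[_N] the largest; missing values sort after all numbers, so
      * a[1] is missing only when every value in the group is missing
      bysort group (a): gen a_code = cond(a[1] == a[_N], a[1], .) if mi(a[1])
      list, sepby(group)
      ```

      Here a_code keeps the shared extended missing value when the whole group carries the same code, and is regular missing (.) otherwise, including for groups that contain any nonmissing value.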



      • #4
        Mauricio found a bug in collapse; we will have it fixed in a future update.



        • #5
          This bug has been fixed in today's update to Stata 15.
