  • min(), max() and missing values

    A small thread on X starting at https://twitter.com/SandroAmbuehl/st...10371435090086 raised puzzlement about results like these.

    Code:
    . di min(1, .)
    1
    
    . di max(1, .)
    1

    How do those results square with the idea that system missing . for numeric values is arbitrarily large?

    Personal rule: I look on X from time to time but I won't post there. That's personal, but here's much of why. If I posted on X, there might be some traffic as a result, with reasons varying from quite nice (hi Nick, as you're on Twitter please answer my Stata question) to not so nice (snark or flak or worse sent in my direction). To the former, my answer is easy: If you have a (serious) Stata question, please use Statalist. Given its focus, it's a much better place to ask a Stata question, including scope to show code, data and graphs as well. If you don't want to use Statalist, that's your choice. There are other forums (and I do watch Stack Overflow and Cross Validated too). (Also, life is short....)

    Missing values in any statistical software pose problems for researchers. Let me try some platitudes as a starting point:

    You want software to be able to cope with missing values as you wish, which means including them (for example, reporting them) when you want to include them and ignoring them (more diplomatic: omitting from calculation) when you want to ignore or omit them (noting that you may have little choice; if a value is missing, it can be hard to know (for example) where it should go on a graph).

    You want your software to have rules that you understand and that are consistent.

    So far, so good, I hope.

    Stata's choices on strings don't seem to puzzle people. Any empty string is called missing, and a missing string means an empty string.

    Stata's choices on numeric values often do puzzle people. Here and below I set aside extended missing values .a to .z as not posing extra problems of understanding. Once you know that Stata often treats system missing . as arbitrarily large, the idea that other missing values are even larger doesn't seem to impose much extra mental strain.

    What bites: Modest familiarity with other software underlines that different programs make different choices here. Or, even more simply, Stata's choices sometimes clash with what people expect or prefer. You're responsible for your own code when you want something different from Stata's default choices. But first you need to understand those default choices.

    A gut reaction that Stata, or its designers, is stupid or insane because you don't understand its or their choices is itself of limited value. That's not good enough for anyone else unless you can explain different maximally consistent rules that you want to apply. There can be serious arguments about the design of languages or software, but there are plenty of silly arguments too, on the level of banter about sports or product brands.

    My way of explaining to myself what Stata does with missing values starts with two over-arching rules.

    First rule: Ignore what can be ignored.

    What is the median value of 1, 2, 3, 4, 20, missing value? Or the mean?

    Two answers can be defended. One answer: We can't say because one value is missing. Some other software does this, at least some of the time, so that you have to be explicit that you want missing values to be ignored. Stata usually tries to do what it can when missing values are present.

    Another answer: The median is 3 and the mean is 6, where we ignore the missing value without information on what it is. (Tautology: We are calling it missing because we don't know what it is.)

    This principle underlies what Stata does generally. Examples abound from summarize through regress and beyond. Sometimes, Stata gives up if all relevant observations contain missing values, so you see no result or you may even get an error message of no observations (to do that with). There is a subtle difference between Stata refusing to try and Stata returning missing if that is the best it can do (characteristic for example of functions, here including egen functions).
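
    As a minimal sketch of that principle, here is the mean-and-median example above fed to summarize. The variable name y is made up for illustration, and the lines are meant to be run from a do-file (the // comments won't work interactively). The results in the comments follow from the five nonmissing values.

    Code:
    clear
    input y
    1
    2
    3
    4
    20
    .
    end
    * summarize uses only the 5 nonmissing values
    summarize y, detail
    display r(N)       // 5
    display r(mean)    // 6
    display r(p50)     // 3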

    This is the way that the opening example is to be understood.

    Code:
    . di min(1, .)
    1
    
    . di max(1, .)
    1

    Ignoring missing, just one argument is left. So the minimum of 1 and the maximum of 1 are both 1; the missing value is irrelevant. Know what is returned as min(.,.) or max(.,.)?
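
    The same rule extends to more than two arguments, as the help for min() and max() documents. A short sketch to run from a do-file, which also answers the question just posed: if every argument is missing, nothing is left after ignoring, and missing is what comes back.

    Code:
    display min(2, 10, ., 7)    // 2
    display max(2, 10, ., 7)    // 10
    display min(., .)           // .
    display max(., .)           // .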

    Second rule: If size matters, treat missing as arbitrarily large.

    William Gould, now President Emeritus, often introduced this principle by focusing on sorting a dataset on a numeric variable. Where are missing values to go? They don't belong somewhere in the middle and must be segregated somehow. There are two systematic ways to do that: regard them as arbitrarily large negative, so that after sorting they are listed at the beginning of a dataset, or as arbitrarily large positive, so that after sorting they are listed at the end of a dataset. Stata chose the second. Even "arbitrary" is in the mind of the beholder, as Stata does in fact implement missing values as very large positive numbers, relative to the allowed range for a given storage type, such as int, float or double. But it does take pains to see that they are ignored when they should be.
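
    A small sketch of the sorting behaviour, for anyone who wants to see it directly. The variable x and its values are invented for illustration; run it from a do-file.

    Code:
    clear
    input x
    3
    .
    1
    2
    end
    sort x
    list x               // 1, 2, 3, then . in the last observation
    display x[_N]        // .
    display . > 1e300    // 1: system missing exceeds any nonmissing number
    display .a > .       // 1: extended missings are larger still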

    If people do not know or forget this, then they get bitten when they overlook that the condition if foo >= 42 also selects any missing values of foo. Unwelcome though it may be, some things just have to be learned the hard way, like the variants of irregular verbs, the rules of just about any sport you care about, and the fact that hot surfaces can burn (the last being a personal formative experience hinting that I was not well suited for any experimental science).
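
    To make the bite concrete, here is a sketch with a toy variable foo (the four values are made up). The first count includes the missing value; the other two show common ways to exclude it. Run from a do-file.

    Code:
    clear
    input foo
    10
    42
    50
    .
    end
    count if foo >= 42                    // 3: the missing value is counted too
    count if foo >= 42 & !missing(foo)    // 2
    count if foo >= 42 & foo < .          // 2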

    Given these rules, the first always seems to over-ride the second if there is tension between them. (Examples to the contrary?)

    Three-way logic?

    Stata chose two-way logic in the sense that there are two, and only two, results of logical comparisons, namely true and false. So the comparison
    Code:
    42 < .
    (missing) yields true, and not itself missing (which is, agreed, a defensible answer in the sense that we don't know whether the missing value is below, equal to or above 42).
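
    A quick check of that, followed by one ordinary idiom for keeping "don't know" as a third outcome when you want it. The idiom is only a sketch of what you can do for yourself; it is not a reconstruction of any particular proposal for three-way logic. The names big2 and big3 are hypothetical. Run from a do-file.

    Code:
    display 42 < .     // 1 (true)
    display . == .     // 1 (true)

    clear
    input foo
    10
    50
    .
    end
    * two-way: a missing foo counts as greater than 42
    generate byte big2 = foo > 42
    * three outcomes: a missing foo leaves the indicator missing
    generate byte big3 = foo > 42 if !missing(foo)
    list               // big2 is 0, 1, 1; big3 is 0, 1, .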

    Other software made different choices, and there you go. Twice I have heard presentations of how Stata should implement three-way logic, or more modestly how you could implement three-way logic for yourself in Stata. The presentations were utterly different otherwise, but in each case as I recall the opinions expressed by the audience ran three ways:

    (1) Stata's two-way logic is what is there and familiar to experienced users and simpler than any alternative. Also, it's a bit late to change now.

    (2) We need a three-way logic, but not that in the presentation, which is itself illogical and/or unduly complicated.

    (3) We should adopt that (typically only the speaker).
    Last edited by Nick Cox; 22 Oct 2023, 08:13.