Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Collapse (sum) turns missings into zeroes

    Code:
    sysuse auto, clear
    replace turn = . if foreign == 1
    collapse (sum) turn, by(foreign)
    When specifying (sum) in collapse, any by-groups that have all missings end up with a zero value instead of missings. This is not mentioned anywhere in the manual and is actually very dangerous - as illustrated below.

    Code:
    quietly {
        sysuse xtline1.dta, clear
        gen week = week(day)
        replace calories = . if floor(week/4) == ceil(week/4)
    
        collapse (sum) calories, by(week person)
        collapse (mean) calories, by(person)
        noisily list calories
    
        sysuse xtline1.dta, clear
        gen week = week(day)
        replace calories = . if floor(week/4) == ceil(week/4)
        bysort person week: egen weeklycals = total(calories), m
        bysort person week: replace weeklycals = . if _n != 1
        collapse (mean) weeklycals, by(person)
        noisily list weeklycals
    }
    
    
         +----------+
         | calories |
         |----------|
      1. | 19187.54 |
      2. | 20048.13 |
      3. | 21470.63 |
         +----------+
    
         +----------+
         | weekly~s |
         |----------|
      1. | 25583.38 |
      2. | 26730.84 |
      3. |  28627.5 |
         +----------+
    Moreover, if you did not notice this happened and then do a regression, you will use a lot more data than you actually have - all those missings are now zeroes.

    EDIT: This also happens with fcollapse and gcollapse.
    Last edited by Jesse Wursten; 10 Nov 2017, 06:58.

  • #2
    For sum, I think this feature bites only if one is unable to distinguish whether a final sum of zero is due to all missing values or due to all zeros in the summed variable. I cannot think of any other scenario. If you want to disregard missing values, specify the option cw (casewise deletion) .

    Code:
     
     sysuse auto, clear replace turn = . if foreign == 1 collapse (sum) turn, by(foreign) cw

    Comment


    • #3
      This issue arises every now and again. For one example, the sum() function will produce a series of zeros for a series of missings and equivalently not change a sum when a new value is missing. For another example, a while back there was much fuss over what egen does.

      The Stata argument hinges on this principle. What is the sum of (say) two sums, one for

      . . . [three missings]

      and the other for

      42 666 [two non-missings].

      If the overall sum is missing + 708 then Stata's other rules imply that the result is missing. Hands up if that is what you want

      What Stata says is that the sum is 0 + 708, which is 708, Hands up if that is what you want.

      In short, the issue is how you want sums to behave when they are in turn summed (which is what cumulative or running sums are all about).

      Personally, I think the best fix would be an option to insist that missing sums are taken literally in collapse.

      That said, there is a fix

      If you count non-missings, then you can flip zeros to missing afterwards if and when appropriate:

      Code:
      sysuse auto, clear  
      replace turn = . if foreign == 1
      collapse (sum) turn (count) count=turn, by(foreign)
      
      list
      
      replace turn = . if count == 0
      
      list
      although clearly that fix does not appeal if you have to do it for several variables.

      Comment


      • #4
        I always thought this was the difference between manually summing and using a "summing function".

        As in
        Code:
        ** Warning, pseudo code **
        gen x = . + 708 = .
        gen x = sum(. 708) = 708
        egen x = rowtotal(. 708) = 708
        This is also what is stated in the documentation for egen. In the collapse documentation, the concept of missings is introduced only in the examples and only for means. Which option you want (. or 708) depends on the context. When you use the "+" operator, you will get feedback in the command window telling you how many missings were created. If you use rowtotal, the presence of a missing option (and the help text) alerts you that you have to be careful when it comes to missings.

        With collapse, you don't get any of this. Based on the last example of the help file you might even be fooled into thinking a situation of all missings is kept as missing, because (mean) does return a missing when all its relevant observations are missing. If you have a small dataset it is straightforward to check whether collapse did what you expected. If you are summing up 2000 subgroups with 3 observations each, this is less obvious.

        There may be a select number of cases where you want missings to sum up to zeroes, but this should always be a conscious decision ("I think the missings actually do represent information"). On the other hand, I find it strange that there are commands in Stata which take a situation with no information (all missing) and turn this into a situation with information (all zero). I just do not see how it makes sense that . plus . ever by default equals 0 . Again, there might be situations where you want . plus . to equal zero, but then summing them up is a very hacky solution and not what you expect to be default behaviour of your software program.

        In fact, I was surprised to see that sum() has the same behaviour. At least here, it is technically documented in the help file (though I wouldn't mind if it were more explicit). But still, why would you want a running sum of missings to equal zero? Sure, you can argue whether 708 +. should equal . or 708 - I have no issues with either choice (and often exploit that Stata can do both). But that . plus . equals 0 bothers me a lot. In fact, I wouldn't be surprised if I've made mistakes to this in the past without noticing this. And I am a "techy" Stata user. I dread to think how many junior financial analysts have created a cumulative return series of stocks they have no information on and used all the generated zeroes in their routine regressions.

        Am I missing something? (no pun intended)(okay maybe a little)

        TL;DR
        I believe the egen = rowtotal(), missing variant of dealing with missings should be the default for all Stata functions and commands. Currently, both sum() and collapse (sum) see a sum of missings as zero in a less than transparent way.

        PS: It also bothers me tremendously sum() is called sum() and not runningsum() or something more informative, especially because collapse (sum) is not a running sum. But that is not a battle I'm willing to fight.

        Comment


        • #5
          This is for StataCorp mostly, I think. Every now and again StataCorp tidy up inconsistencies in different corners of Stata, although users tend to prefer that the same code produces the same answers, an aim which here is in tension with consistency.

          There is one extra simple detail. In the 1980s when many of the early decisions were made function names were often short, partly because in some cases they had to be. In any case, short but cryptic names were just part of the culture. In languages that were smaller (Stata 1 or 2 or ...) it was a reasonable expectation that you could read all the documentation. Not so now, and people rightly expect self-explanatory names to the extent possible.
          Last edited by Nick Cox; 13 Nov 2017, 08:07.

          Comment

          Working...
          X