Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Handling of Missing Values by the COLLAPSE Command

    Hi, I'm using COLLAPSE to compute sums of variables by persons (who have unique ids) and by year. The data set is a network panel, in which some variables refer to 'neighbors' of persons in a given year. For this reason there can be multiple observations by person and year in the input file, in which each neighbor is a person with their own unique id. The observations are uniquely identified by ids for persons and their neighbors and by year. The observations can assume missing values (.) because data on persons and neighbors occur in spells of varying lengths over time.

    I'm assuming that COLLAPSE excludes missing values from the sums. Maybe. But here is another question: what happens when all the observations by person and year for a given variable are missing? Does COLLAPSE drop the observation completely, or does it insert a zero? If it inserts a zero, which my results argue that it does, then I have a problem, because sometime the sum is really zero and at other times the sum is truly missing.

    The crux of the identification problem is a requirement that missing values be distinguished from zeroes in the final file that is aggregated by person and year.

    I'm sorry to issue this post, but the COLLAPSE documentation does not answer this question.

  • #2
    Yes, the collapse command with the sum stat will return a zero for a given by() combination if all of the called values are missing.

    Comment


    • #3
      You may wish to take a look at this article.

      Best regards,

      Marcos

      Comment


      • #4
        Thanks Marcos and Dan for your replies, and I shall study the article by Robert Fornango. I think that the Stata documentation of COLLAPSE, excellent though it is, does not go far enough. it could be improved by discussing this issue and providing a fix for situations where missing values ans zeroes should be treated separately. How should I approach Stata on this?

        Comment


        • #5
          The following approach may be helpful.
          Code:
          cls
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float(id x y)
          101 0 5
          101 0 6
          101 . 7
          102 . .
          102 . 8
          102 . .
          end
          
          collapse (sum) x y (count) n_x=x n_y=y, by(id)
          list, noobs
          replace x = . if n_x==0
          replace y = . if n_y==0
          drop n_*
          list, noobs
          Code:
          . collapse (sum) x y (count) n_x=x n_y=y, by(id)
          
          . list, noobs
          
            +--------------------------+
            |  id   x    y   n_x   n_y |
            |--------------------------|
            | 101   0   18     2     3 |
            | 102   0    8     0     1 |
            +--------------------------+
          
          . replace x = . if n_x==0
          (1 real change made, 1 to missing)
          
          . replace y = . if n_y==0
          (0 real changes made)
          
          . drop n_*
          
          . list, noobs
          
            +--------------+
            |  id   x    y |
            |--------------|
            | 101   0   18 |
            | 102   .    8 |
            +--------------+

          Comment


          • #6
            Although I am not surprised that people get upset the first time they are surprised by this, there is a work-around.

            It seems to me that

            1. Zero is the correct sum for a numeric variable whenever there there are non-missings that sum to zero.

            2. Otherwise the way to find out which sums of zeros "should be" missing is to count non-missing values and to replace sums of zero by missing whenever the number of non-missing values is zero, collapse lets you do this.

            Note: I am not surprised that William Lisowski made the same point!

            Comment


            • #7
              Dear William and Nick, thanks for your messages, and the auxiliary counting of non-missings is a great idea. And yet, from a software user's point of view, I do think that this discussion belongs in the documentation for COLLAPSE. This problem of ambiguous zeroes is common in micro data that are not cleaned up by a government agency or data vendor. By including this discussion with the documentation, the ambiguity gets resolved once and for all and a lot of time is saved.

              It's not that we users are upset, it's that there are a lot of different uses for human time. There is competition for that time so I for one would be grateful for less coding/deciphering.

              Comment


              • #8
                I would argue that collapse could use a missing option akin to that on the egen total() function which causes the result to be missing if all the input observations are missing. For other egen fiunctions like mean() that is the default.
                Code:
                . bysort id: egen tot_x = total(x), missing
                (3 missing values generated)
                
                . bysort id: egen mean_x = mean(x)
                (3 missing values generated)
                
                . list, noobs sepby(id)
                
                  +-------------------------------+
                  |  id   x    y   tot_x   mean_x |
                  |-------------------------------|
                  | 101   0    5       0        0 |
                  | 101   0    6       0        0 |
                  | 101   .    7       0        0 |
                  |-------------------------------|
                  | 102   .   -1       .        . |
                  | 102   .    1       .        . |
                  | 102   .    .       .        . |
                  +-------------------------------+

                Comment


                • #9
                  I agree, I think this is a very useful extension that would increase the power of COLLAPSE.

                  Comment


                  • #10
                    Documenting this more fully and adding an option to collapse is certainly reasonable. I note that this was discussed in https://www.stata-journal.com/articl...article=dm0047 back in 2010.

                    Code:
                    search missing
                    points to it but not

                    Code:
                    search collapse

                    Comment


                    • #11
                      Hi

                      I agree with James about the collapse documentation: it is unprecise regarding missing values. Also, while the "sum" command treat missing as 0 the "mean" command treats missing as missing. Fine by me but make sure to have it documented somewhere statacorp!

                      Comment

                      Working...
                      X