
  • collapse (mean) with two by-variables results in different output from the same input

    I just observed what looks like a bug in >collapse (mean)< when using two by-variables. In hindsight, I realised that I should have specified collapse differently anyway, so I was able to proceed. Still, I would like to share my observation, as I have seen similar strange patterns in the past when using collapse (mean). I stripped down my data set (and the collapse line) to show the pattern as clearly as possible; see below.

    What did I do? The attached data set is the result of an m:1 merge that I ran before saving the sample file, so the duplicated values of the variable are exactly identical (which is easy to confirm by running duplicates drop on this variable alone). I then applied the following code:
    Code:
    use collapse_mean_statalist.dta, clear
    collapse (mean) values, by(dim year)
    keep year values
    duplicates drop

    This results in the following data set. For the years 2009, 2014, 2018 and 2019, collapse (mean) has produced two different results from exactly the same input values; they display identically below because the default format rounds them, but they must differ at full precision, since duplicates drop did not remove them (a sketch for making this visible follows the listing). For the other years, the result matches my expectations.

    year values
    2008 14415208
    2009 9642943.7
    2009 9642943.7
    2010 12497334
    2011 15016518
    2012 14250374
    2013 13446495
    2014 13554714
    2014 13554714
    2015 13854357
    2016 12972321
    2017 15584707
    2018 16447676
    2018 16447676
    2019 15607856
    2019 15607856
    2020 13316325
    2021 17885927
    2022 22952812

    The above code does not really make sense as shown here; it is just the stripped-down version of my code, reduced to expose the core problem. If you do it all at once (with only one by-variable), you get the expected result:
    Code:
    use collapse_mean_statalist.dta, clear
    collapse (mean) values, by(year)

    year values
    2008 14415208
    2009 9642943.7
    2010 12497334
    2011 15016518
    2012 14250374
    2013 13446495
    2014 13554714
    2015 13854357
    2016 12972321
    2017 15584707
    2018 16447676
    2019 15607856
    2020 13316325
    2021 17885927
    2022 22952812

    In summary, it seems to me that collapse (mean) causes a problem when the observations are grouped by more than one variable. I have observed this problem both with Stata 15.1 (on my work PC) and with Stata 17 (on our servers). I hope this helps in finding the bug.
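
    A self-contained sketch (not from the original post) that should exhibit the same effect with made-up data: one year, two dim groups of different sizes, and a value that is not exactly representable in binary. It assumes, as #3 explains below, that the means are accumulated in finite (double) precision:

    Code:
    clear
    set obs 10
    gen year = 2009
    gen dim  = cond(_n <= 6, 1, 2)   // dim 1 has 6 observations, dim 2 has 4
    gen double values = 9642943.7    // not exactly representable in binary
    collapse (mean) values, by(dim year)
    format values %21x               // inspect the full bit pattern of each mean
    list dim year values             // the two group means may differ in the last bit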

  • #2
    First, please read the FAQ - file attachments are strongly discouraged, and will not be opened by the users who would be most inclined to help you.

    Second, what you show looks exactly as I would expect if -dim- takes multiple values in the years 2009, 2014, 2018 and 2019. This should be easy to verify by listing -year-, -dim-, and -values- in the first example (a sketch of that check follows below).

    cheers,
    Jeph
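
    A minimal sketch of the check Jeph suggests, assuming the original (pre-collapse) file from #1 is in use:

    Code:
    use collapse_mean_statalist.dta, clear
    sort year dim
    list year dim values, sepby(year)   // see which dim values occur within each year
    tab year dim                        // group sizes per year/dim cell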

    • #3
      You are running into a floating-point precision issue. Note that computers have finite precision, so numerical computation is not the same as exact mathematics. For example, in the year-2009 case, with the number x = 9642943.7,

      (x+x+x+x+x+x) / 6 does not equal x

      To demonstrate:

      Code:
      . clear
      
      . set obs 6
      Number of observations (_N) was 0, now 6.
      
      . gen double x = 9642943.7
      
      . gen double y = sum(x)/_n
      
      . list
      
           +-----------------------+
           |         x           y |
           |-----------------------|
        1. | 9642943.7   9642943.7 |
        2. | 9642943.7   9642943.7 |
        3. | 9642943.7   9642943.7 |
        4. | 9642943.7   9642943.7 |
        5. | 9642943.7   9642943.7 |
           |-----------------------|
        6. | 9642943.7   9642943.7 |
           +-----------------------+
      
      . assert x==y
      1 contradiction in 6 observations
      assertion is false
      r(9);
      
      . list if x != y
      
           +-----------------------+
           |         x           y |
           |-----------------------|
        6. | 9642943.7   9642943.7 |
           +-----------------------+
      
      . format %21x x y
      
      . list if x != y
      
           +-----------------------------------------------+
           |                     x                       y |
           |-----------------------------------------------|
        6. | +1.26477f6666666X+017   +1.26477f6666667X+017 |
           +-----------------------------------------------+
      You can see that the two values differ in the last hexadecimal digit. See https://blog.stata.com/2012/04/02/th...-to-precision/ for a detailed discussion of precision.
      Last edited by Hua Peng (StataCorp); 22 Jun 2023, 16:21.

      • #4
        First, thank you for your responses!

        Second, I uploaded my data set to make my “bug report” fully reproducible. I have often found >collapse (mean)< unusable because of these precision issues. I was aware of them before posting (as mentioned in your response), but I could not see why they should matter here.

        After thinking about your example “(x+x+x+x+x+x) / 6 does not equal x”, I can now see why the problem arose here too. Although my panel is balanced within year, it is not balanced within dim, so the cell means are computed over different numbers of observations, and in floating-point arithmetic “(x+x+x+x+x+x) / 6 does not necessarily equal (x+x+x+x) / 4”. That is why the two-step approach (as described in my original post) failed. In the one-step approach of collapsing by year only (dropping dim immediately), “(x+…+x) / 15 = (x+…+x) / 15” holds trivially.
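
        A minimal sketch of this group-size effect, reusing the running-mean construction from #3; per the output there, the mean of four copies of x equals x while the mean of six copies does not:

        Code:
        clear
        set obs 6
        gen double x = 9642943.7
        gen double y = sum(x)/_n    // running mean: y[n] = (x+...+x)/n
        format %21x y
        list y in 4/6               // y[4] = (4x)/4 and y[6] = (6x)/6
        display y[4] == y[6]        // 0 per the output in #3: the two means differ in the last bit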

        Quite frankly, the fact that I regularly run into precision issues with >collapse (mean)< makes the mean option of collapse unusable for me. It is not really understandable to me why the tedious detour via "egen x = mean() ... duplicates drop" should be so much more accurate here. For me, this precision issue in “collapse (mean)” is therefore still a “bug”, since it can apparently be circumvented so easily…
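
        A sketch of the egen detour mentioned above, assuming the variable names from #1; note that the thread does not establish whether this route actually rounds differently from collapse (mean):

        Code:
        use collapse_mean_statalist.dta, clear
        bysort year: egen double m = mean(values)   // within-year mean, kept on every row
        keep year m
        duplicates drop                             // one row per year, if the means agree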

        Again, thank you for your responses!
