
  • collapse (mean) with two by-variables results in different output from the same input

    I just observed what looks like a bug in >collapse (mean)< when using two by-variables. In hindsight, I realised that I should have specified collapse differently anyway, so I was able to proceed. Still, I would like to share my observation, as I have seen similar strange patterns in the past when using collapse (mean). I stripped down my data set (and the collapse line) to show the pattern as clearly as possible; see below.

    What did I do? The attached data set is the result of an m:1 merge that I ran before saving the sample file, so the duplicated values of the variable are exactly identical (which is easy to confirm by running duplicates drop on this variable alone). I then applied the following code:
    Code:
    use collapse_mean_statalist.dta, clear
    collapse (mean) values, by(dim year)
    keep year values
    duplicates drop

    This results in the following data set. For the years 2009, 2014, 2018 and 2019, collapse (mean) has produced two different results from exactly the same input values; they display identically below because the default format rounds them, but they must differ at full precision, since duplicates drop did not remove them (a sketch for making this visible follows the listing). For the other years, the result matches my expectations.

    year values
    2008 14415208
    2009 9642943.7
    2009 9642943.7
    2010 12497334
    2011 15016518
    2012 14250374
    2013 13446495
    2014 13554714
    2014 13554714
    2015 13854357
    2016 12972321
    2017 15584707
    2018 16447676
    2018 16447676
    2019 15607856
    2019 15607856
    2020 13316325
    2021 17885927
    2022 22952812

    The above code does not really make sense as shown here; it is just the stripped-down version of my code, reduced to expose the core problem. If you do it all at once (with only one by-variable), you get the expected result:
    Code:
    use collapse_mean_statalist.dta, clear
    collapse (mean) values, by(year)

    year values
    2008 14415208
    2009 9642943.7
    2010 12497334
    2011 15016518
    2012 14250374
    2013 13446495
    2014 13554714
    2015 13854357
    2016 12972321
    2017 15584707
    2018 16447676
    2019 15607856
    2020 13316325
    2021 17885927
    2022 22952812

    In summary, it seems to me that collapse (mean) causes a problem when the observations are grouped by more than one variable. I have observed this problem both with Stata 15.1 (on my work PC) and with Stata 17 (on our servers). I hope this helps in finding the bug.
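
    A self-contained sketch (not from the original post) that should exhibit the same effect with made-up data: one year, two dim groups of different sizes, and a value that is not exactly representable in binary. It assumes, as #3 explains below, that the means are accumulated in finite (double) precision:

    Code:
    clear
    set obs 10
    gen year = 2009
    gen dim  = cond(_n <= 6, 1, 2)   // dim 1 has 6 observations, dim 2 has 4
    gen double values = 9642943.7    // not exactly representable in binary
    collapse (mean) values, by(dim year)
    format values %21x               // inspect the full bit pattern of each mean
    list dim year values             // the two group means may differ in the last bit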

  • #2
    First, please read the FAQ - file attachments are strongly discouraged, and will not be opened by the users who would be most inclined to help you.

    Second, what you show looks exactly as I would expect if -dim- takes multiple values in the years 2009, 2014, 2018 and 2019. This should be easy to verify by listing -year-, -dim-, and -values- in the first example (a sketch of that check follows below).

    cheers,
    Jeph
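
    A minimal sketch of the check Jeph suggests, assuming the original (pre-collapse) file from #1 is in use:

    Code:
    use collapse_mean_statalist.dta, clear
    sort year dim
    list year dim values, sepby(year)   // see which dim values occur within each year
    tab year dim                        // group sizes per year/dim cell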

    • #3
      You are running into a floating-point precision issue. Note that computers have finite precision, so numerical computation is not the same as exact mathematics. For example, in the year-2009 case, with the number x = 9642943.7,

      (x+x+x+x+x+x) / 6 does not equal x

      To demonstrate:

      Code:
      . clear
      
      . set obs 6
      Number of observations (_N) was 0, now 6.
      
      . gen double x = 9642943.7
      
      . gen double y = sum(x)/_n
      
      . list
      
           +-----------------------+
           |         x           y |
           |-----------------------|
        1. | 9642943.7   9642943.7 |
        2. | 9642943.7   9642943.7 |
        3. | 9642943.7   9642943.7 |
        4. | 9642943.7   9642943.7 |
        5. | 9642943.7   9642943.7 |
           |-----------------------|
        6. | 9642943.7   9642943.7 |
           +-----------------------+
      
      . assert x==y
      1 contradiction in 6 observations
      assertion is false
      r(9);
      
      . list if x != y
      
           +-----------------------+
           |         x           y |
           |-----------------------|
        6. | 9642943.7   9642943.7 |
           +-----------------------+
      
      . format %21x x y
      
      . list if x != y
      
           +-----------------------------------------------+
           |                     x                       y |
           |-----------------------------------------------|
        6. | +1.26477f6666666X+017   +1.26477f6666667X+017 |
           +-----------------------------------------------+
      You can see that the two values differ in the last hexadecimal digit. See https://blog.stata.com/2012/04/02/th...-to-precision/ for a detailed discussion of precision.
      Last edited by Hua Peng (StataCorp); 22 Jun 2023, 16:21.

      • #4
        First, thank you for your responses!

        Second, I uploaded my data set to make my “bug report” fully reproducible. I have often found >collapse (mean)< unusable because of these precision issues. I was aware of them before posting (as mentioned in your response), but I could not see why they should matter here.

        After thinking about your example “(x+x+x+x+x+x) / 6 does not equal x”, I can now see why the problem arose here too. Although my panel is balanced within year, it is not balanced within dim, so the cell means are computed over different numbers of observations, and in floating-point arithmetic “(x+x+x+x+x+x) / 6 does not necessarily equal (x+x+x+x) / 4”. That is why the two-step approach (as described in my original post) failed. In the one-step approach of collapsing by year only (dropping dim immediately), “(x+…+x) / 15 = (x+…+x) / 15” holds trivially.
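
        A minimal sketch of this group-size effect, reusing the running-mean construction from #3; per the output there, the mean of four copies of x equals x while the mean of six copies does not:

        Code:
        clear
        set obs 6
        gen double x = 9642943.7
        gen double y = sum(x)/_n    // running mean: y[n] = (x+...+x)/n
        format %21x y
        list y in 4/6               // y[4] = (4x)/4 and y[6] = (6x)/6
        display y[4] == y[6]        // 0 per the output in #3: the two means differ in the last bit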

        Quite frankly, the fact that I regularly run into precision issues with >collapse (mean)< makes the mean option of collapse unusable for me. It is not really understandable to me why the tedious detour via "egen x = mean() ... duplicates drop" should be so much more accurate here. For me, this precision issue in “collapse (mean)” is therefore still a “bug”, since it can apparently be circumvented so easily…
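
        A sketch of the egen detour mentioned above, assuming the variable names from #1; note that the thread does not establish whether this route actually rounds differently from collapse (mean):

        Code:
        use collapse_mean_statalist.dta, clear
        bysort year: egen double m = mean(values)   // within-year mean, kept on every row
        keep year m
        duplicates drop                             // one row per year, if the means agree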

        Again, thank you for your responses!
