I just ran into what looks like a bug in collapse (mean) when using two by-variables. In hindsight, I realised that I should have specified the collapse differently anyway, so I was able to proceed. Still, I would like to share my observation, as I have seen some strange patterns in the past when using collapse (mean). I reduced my data set (and the collapse line) to show the pattern more clearly; see below.
What did I do? The attached data set is the result of an m:1 merge that I ran before saving the sample file, so the duplicates in the variable values are exactly identical (this is easy to verify by running duplicates drop on this variable only). I then ran the following code:
> use collapse_mean_statalist.dta, clear
> collapse (mean) values, by(dim year)
> keep year values
> duplicates drop
This results in the following data set. For the years 2009, 2014, 2018 and 2019, collapse (mean) has produced two distinct results from exactly the same input values, even though both rows look identical at the displayed precision. For the other years, the result matches my expectations.
year values
2008 14415208
2009 9642943.7
2009 9642943.7
2010 12497334
2011 15016518
2012 14250374
2013 13446495
2014 13554714
2014 13554714
2015 13854357
2016 12972321
2017 15584707
2018 16447676
2018 16447676
2019 15607856
2019 15607856
2020 13316325
2021 17885927
2022 22952812
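Since the paired rows above survive duplicates drop but print identically, a quick check is to list the collapsed means at full precision. This is only a sketch: it uses the file and variable names from the post, and the display format %21.0g is my own choice, made to expose digits that the default format hides.

```stata
* Sketch: show the collapsed means at full precision for the affected years
use collapse_mean_statalist.dta, clear
collapse (mean) values, by(dim year)

* Order the rows so that both results for a year appear next to each other
sort year dim

* Widen the display format (an assumption for illustration, not part of the original code)
format values %21.0g

list dim year values if inlist(year, 2009, 2014, 2018, 2019), sepby(year)
```

If the paired rows differ only in their trailing digits, that would indicate where the two versions come from; this check alone does not settle whether the behaviour is a bug.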
The code above doesn't really make sense as shown here; it is just a stripped-down version of my actual code, reduced to show the core problem. If you collapse in one step (with only one by-variable), you get the expected result:
> use collapse_mean_statalist.dta, clear
> collapse (mean) values, by(year)
year values
2008 14415208
2009 9642943.7
2010 12497334
2011 15016518
2012 14250374
2013 13446495
2014 13554714
2015 13854357
2016 12972321
2017 15584707
2018 16447676
2019 15607856
2020 13316325
2021 17885927
2022 22952812
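The claim that the input values are identical within each year can also be checked directly before collapsing. The following is a minimal sketch, assuming that values is meant to be constant within year, as the description of the m:1 merge suggests; describe is included only to show the storage type of the variable.

```stata
* Sketch: confirm the input is constant within year before collapsing
use collapse_mean_statalist.dta, clear

* Show the storage type of the variable
describe values

* Within each year, sort by values and compare the smallest and largest entry;
* the assert stops with an error if any year contains differing values
bysort year (values): assert values[1] == values[_N]
```

If the assert passes, every difference between the collapsed rows must arise inside collapse (mean) itself rather than in the input data.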
In summary, it seems to me that collapse (mean) runs into a problem when the observations are grouped by more than one variable. I have observed this with both Stata 15.1 (on my work PC) and Stata 17 (on our servers). I hope this helps in tracking down the bug.