collapse VS duplicates. which one do you prefer?

Yao Zhao

Join Date: Feb 2017

Posts: 226
#1

10 Mar 2020, 16:36

There are many cases where we use egen to compute an average by group_id and then want the mean of those group averages. As far as I know, there are two ways to reduce the data to one observation per group.

Code:

* method 1:
duplicates drop id, force

* method 2:
collapse varlist, by(id)

Can anyone tell me which one is preferred?
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#2

10 Mar 2020, 16:59

The second code is safer: it loses information only to the extent necessary to calculate means. By contrast, the first one, I would never use. -force- options (in any of the commands that have them) are inherently dangerous. They must only be used when you are certain that they will not destroy needed information. If it is true that you have correctly calculated the group-level variables and there is no more individual level variation left in the data, then -duplicates drop- without the force option will work. But using -force- is gambling that you have not overlooked something. That's not a gamble that I ever take.

As between -duplicates drop- (without -force-) and -collapse, by(id)-, I think the former is probably faster in large data sets, though I have never tested this proposition.
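To make the risk concrete, here is a minimal sketch of the two methods side by side. It is only an illustration: the shipped auto.dta stands in for the real data, foreign plays the role of id, and mean_price is a made-up variable name.

```stata
* Sketch only: auto.dta, foreign (as the id), and mean_price are illustrative assumptions
sysuse auto, clear
bysort foreign: egen mean_price = mean(price)

* How many observations are surplus copies with respect to foreign alone?
duplicates report foreign

* Method 1: keeps the first observation in each group and silently discards the rest
preserve
duplicates drop foreign, force
* make and mpg here belong to whichever car happened to come first in each group
list foreign mean_price make mpg
restore

* Method 2: keeps only the requested statistic, so nothing misleading is left behind
preserve
collapse (mean) mean_price, by(foreign)
list
restore
```

After method 1 the data set still carries make, mpg, and every other car-level variable, which now look like group-level values but are not. After method 2 only foreign and mean_price remain.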
Yao Zhao

Join Date: Feb 2017

Posts: 226
#3

10 Mar 2020, 17:18

Thank you for your clarification. In fact, if you use -duplicates drop varlist-, then you have to add the -force- option. It is required whenever a varlist is specified.
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#4

10 Mar 2020, 17:30

Yes, that is true, and I should have said to use -duplicates drop- with no varlist (and no -force- option). If that does not end up reducing your data set to one observation per id, then it means that there are still other variables in the data set that vary within id. Any such variables should be explicitly removed with a -drop- (or the complementary -keep-) command before running -duplicates drop-. That way you can be sure that nothing has been overlooked. That is the safe way to do it.
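The safe sequence described above might look like the following sketch. Again, auto.dta, foreign (standing in for id), and mean_price are assumptions for illustration only. The -isid- check at the end is an extra safeguard, not something prescribed earlier in the thread.

```stata
* Sketch only: auto.dta, foreign (as the id), and mean_price are illustrative assumptions
sysuse auto, clear
bysort foreign: egen mean_price = mean(price)

* Explicitly keep only the identifier and the group-level variable(s)
keep foreign mean_price

* No varlist, no force: only observations identical on every remaining variable are dropped
duplicates drop

* Exits with an error if foreign does not uniquely identify the remaining observations
isid foreign

* Mean of the group averages, computed on one observation per group
summarize mean_price
```

If -isid- fails, some variable that was kept still varies within the group, which is exactly the situation that -force- would have hidden.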