
  • collapse vs. duplicates drop: which one do you prefer?

    There are many cases where we use egen to compute an average by group_id and then want the mean of those group averages. As far as I know, there are two methods.
    Code:
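    * Method 1: keep only the first observation for each id
    duplicates drop id, force
    * Method 2: replace the data with the mean of varlist for each id
    collapse varlist, by(id)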
    Can anyone tell me which one is preferred?

  • #2
    The second code is safer: it loses information only to the extent necessary to calculate means. By contrast, the first one, I would never use. -force- options (in any of the commands that have them) are inherently dangerous. They must only be used when you are certain that they will not destroy needed information. If it is true that you have correctly calculated the group-level variables and there is no more individual level variation left in the data, then -duplicates drop- without the force option will work. But using -force- is gambling that you have not overlooked something. That's not a gamble that I ever take.

    As between -duplicates drop id- (without force) and -collapse, by(id)-, I think the former is probably faster in large data sets, though I have never tested this proposition.
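
    One way to see what -force- can hide is to test for within-id variation before dropping anything. The following is a minimal sketch on made-up toy data; the variable names x, mean_x, and x_varies are hypothetical and not from the original question.

```stata
* Hypothetical toy data: two ids, with x varying inside each id
clear
input id x
1 10
1 20
2 30
2 50
end

* Group-level mean, constant within id by construction
egen mean_x = mean(x), by(id)

* Flag ids where x is not constant: sort x within id and
* compare the smallest and largest values
bysort id (x): gen byte x_varies = x[1] != x[_N]
count if x_varies

* A nonzero count means -duplicates drop id, force- would
* silently discard real variation in x

* The group mean, by contrast, passes the same check
bysort id (mean_x): assert mean_x[1] == mean_x[_N]
```

    If the count is nonzero for any variable you still care about, that variable needs to be summarized or deliberately dropped first.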



    • #3
      Thank you for the clarification. In fact, if you use -duplicates drop varlist-, you have to add the -force- option; Stata requires it.



      • #4
        Yes, that is true, and I should have said to use -duplicates drop- with no varlist and no -force- option. If that does not reduce your data set to one observation per id, then there are still other variables in the data set that vary within id. Any such variables should be explicitly removed with a -drop- (or complementary -keep-) command before running -duplicates drop-. That way you can be sure that nothing has been overlooked. That is the safe way to do it.
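
        The safe sequence described here might look like the sketch below. It uses made-up toy data, and the names x and mean_x are hypothetical; -isid- is added as an extra check and is not part of the original advice.

```stata
* Hypothetical toy data: two ids, with x varying inside each id
clear
input id x
1 10
1 20
2 30
2 50
end

* Group-level mean, constant within id
egen mean_x = mean(x), by(id)

* Explicitly keep only the id and the group-level variable
keep id mean_x

* No varlist, no force: only fully identical observations are dropped
duplicates drop

* Stops with an error unless id uniquely identifies observations
isid id

* Mean of the group means
summarize mean_x
```

        If -isid- fails, some variable that was kept still varies within id, which is exactly the situation -force- would have hidden.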

