Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Collapse 2 different ways yields different means

    My dataset has 120,000 observations of air pollution (variable name PM25) data recorded by 21 people (variable name IDnum) before and after an intervention (variable name pre_post), where location (variable name timeactivity) is recorded. We've also indicated when PM25 data was missing using the variable name missing. FixedPM is the outdoor PM25 amount.

    Here is my question - I am collapsing the data 2 different ways for 2 tables. Unfortunately my mean PM25 values for home (timeactivity==4) do not match on the 2 tables and I think this is due to a problem with my collapse command or possibly with missing data.

    PM25=6.3 at home before any collapse commands are used.

    PM25 at home is 6.3 with this very brief collapse command (it works).
    Code:
    collapse (mean) PM25  FixedPM  (sum) obs_min, by(timeactivity)
    But PM=6.9 when I use this collapse command
    Code:
    collapse (mean) PM25  FixedPM  (sum) obs_min, by(ID_Sess timeactivity)
    And PM25=6.9 when I run the collapse command I planned for the analysis:
    Code:
    collapse (mean) PM25  FixedPM  (sum) obs_min (sum) miss_min, by(IDnum ID_Sess pre_post timeactivity Mit_Cat missing)

    Thank you for your advice on this confusing issue! Any input on what may be going wrong or what to check would be appreciated. Happy to upload data here if that would be helpful as well.

  • #2
    Please don't upload data. The people most likely to be able to help you are also the least likely to risk downloading a file from a person they don't know. But you should show example data. In fact, I doubt anybody can help you with this without actually seeing your data, as there seem to be a lot of moving parts here that are not fully explained.

    The useful way to show example data, and the only way you should do it, is with the -dataex- command. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Please be sure that the example data you show exhibits the kind of problem you are having.

    Comment


    • #3
      Clyde thank you for your comment. I will paste representative data below.

      The crux of this problem is the question of why the mean value of a variable Iin this case PM25) is different before and after running a collapse command, when data has not been intentionally excluded during the collapse.


      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input float(IDnum timeactivity pre_post PM25 FixedPM) int obs_min
      1 4 0  7 12 1
      1 4 0 13 11 1
      1 4 0  5 13 1
      1 4 0 12 11 1
      1 4 0  6 10 1
      1 4 0  6 10 1
      1 4 0  6 12 1
      1 4 0 11 11 1
      1 4 0 13 14 1
      1 4 0 11 18 1
      1 4 0  5 13 1
      1 4 0  6 13 1
      1 4 0  6 12 1
      1 4 0 13 14 1
      1 4 0  7 12 1
      1 4 0  6 13 1
      1 4 0  5 14 1
      1 4 0  5 14 1
      1 4 0  5 10 1
      1 4 0 15 14 1
      1 4 0  5 14 1
      1 4 0  9 18 1
      1 4 0  5 13 1
      1 4 0  9 18 1
      1 4 0  6 14 1
      1 4 0 12 18 1
      1 4 0  6 11 1
      1 4 0 14 14 1
      1 4 0  8 12 1
      1 4 0 12 18 1
      1 4 0  6 13 1
      1 4 0 12 18 1
      1 4 0 12 11 1
      1 4 0  6 10 1
      1 4 0  6 11 1
      1 4 0 14 11 1
      1 4 0  6 14 1
      1 4 0 10 18 1
      1 4 0  9 18 1
      1 4 0  8 18 1
      1 4 0  7 12 1
      1 4 0  5 14 1
      1 4 0  5 13 1
      1 4 0  5 13 1
      1 4 0  5 13 1
      1 4 0  6 10 1
      1 4 0  8 18 1
      1 4 0  5 13 1
      1 4 0  5 14 1
      1 4 0  6 13 1
      1 4 0  5 13 1
      1 4 0  5 14 1
      1 4 0 12 14 1
      1 4 0  5 13 1
      1 4 0  7 13 1
      1 4 0  5 13 1
      1 4 0  6 10 1
      1 4 0  6 13 1
      1 4 0 10 18 1
      1 4 0  7 14 1
      1 4 0  5 13 1
      1 4 0  5 14 1
      1 4 0  7 13 1
      1 4 0 11 18 1
      1 4 0  6 10 1
      1 4 0  6 14 1
      1 4 0  7 12 1
      1 4 0  8 12 1
      1 4 0  6 13 1
      1 4 0  8 12 1
      1 4 0 11 18 1
      1 4 0  5 13 1
      1 4 0  6 14 1
      1 4 0  6 14 1
      1 4 0  6 14 1
      1 4 0  9 18 1
      1 4 0  7 12 1
      1 4 0  5 14 1
      1 4 0 13 11 1
      1 4 0  6 14 1
      1 4 0  5 13 1
      1 4 0  6 11 1
      1 4 0  6 13 1
      1 4 0  5 14 1
      1 4 0  8 11 1
      1 4 0  7 10 1
      1 4 0  5 14 1
      1 4 0  5 13 1
      1 4 0  8 12 1
      1 4 0  5 14 1
      1 4 0 13 14 1
      1 4 0  5 10 1
      1 4 0  5 14 1
      1 4 0  6 13 1
      1 4 0  7 11 1
      1 4 0 11 18 1
      1 4 0  7 13 1
      1 4 0 14 11 1
      1 4 0  6 13 1
      1 4 0  5 14 1
      end
      label values timeactivity tactive2
      label def tactive2 4 "Home", modify
      label values pre_post pre_post
      label def pre_post 0 "pre", modify

      Comment


      • #4
        Well, your example data don't exhibit the problem. Part of the reason is that in the example data, timeactivity and prepost are both constant (1 and 4) respectively, so when I -collapse, by()- either of them, I just get the means of the whole simple, which, I can verify, are correctly reproduced.

        In #1, you seem to be concerned about what happens when you collapse -by(IDsess timeactivity)-. But your example doesn't even have the IDsess variable. So there is no way for me to check out what is going on there. That said, there is no reason to expect that the results of -collapse, by(timeactivity)- will be the same as -collapse, by(IDsess timeactivity)- because, in the latter, the timeactivity groups will be broken up into subsets by IDsess, and those subsets will generally have different means from the whole timeactivity together. Moreover, you can't count on the mean of those subset means to equal the mean of the whole timeactivity group, unless each of these subsets has the same number of (non-missing) observations.

        Comment

        Working...
        X