Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calculating means/medians using an identified group as denominator in Panel data

    I had asked this follow up in a previous thread but thought I would post again in a new topic:
    I have panel data as below - Originally I wanted to create a variable that is the sum of PRICE - but with a separate TOTAL for each different group of DATE within the same CODE and num_id.
    which I was able to create with Nick Cox's code:
    Code:
     
     bysort DATE CODE num_id: egen total = total(PRICE)
    I had then created a variable that identifies the groups of dates within a CODE and assigned a specific number (group_id).
    Code:
    egen group_id = group(DATE CODE num_id)
    How can I get a mean/median for PRICE of a specific CODE using group_id as the denominator/group?

    I have tried:
    Code:
    bysort group_id: egen mean_group=mean(PRICE)
    but this just gives the overall mean for a given CODE
    And also tried:
    Code:
    univar PRICE if CODE==194, by(group_id)
    but this just lists individual means for each unique group_id, I want an over all mean of PRICE for a given CODE for using each associated group_id (i.e. with group_id as the denominator).
    If I were to tabulate CODE and the number of unique group_ids:
    CODE CODE (n) group_id (n) group_id mean PRICE
    475 30 15 ?
    476 40 25 ?
    and I want the mean PRICE associated to given CODE with respect to the group_id (n)


    Sample dataex with created variables total and group_id:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float num_id int(CODE DATE) double PRICE float(total group_id)
    97 475 20473     1248  58131.02  631
    97 475 20473     1441  58131.02  631
    97 475 20473     2275  58131.02  631
    97 475 20473     1361  58131.02  631
    97 475 20473     1932  58131.02  631
    97 475 20473     3058  58131.02  631
    97 475 20473    15324  58131.02  631
    97 475 20473    15131  58131.02  631
    97 475 20473     1411  58131.02  631
    97 475 20473   912.02  58131.02  631
    97 475 20473       10  58131.02  631
    97 475 20473       13  58131.02  631
    97 475 20473     1373  58131.02  631
    97 475 20473     1631  58131.02  631
    97 475 20473      741  58131.02  631
    97 475 20473     3819  58131.02  631
    97 475 20473     4177  58131.02  631
    97 475 20473     2274  58131.02  631
    97 863 20485      805     31089 1062
    97 863 20485      910     31089 1062
    97 863 20485    10216     31089 1062
    97 863 20485     3408     31089 1062
    97 863 20485     1631     31089 1062
    97 863 20485      304     31089 1062
    97 863 20485    10071     31089 1062
    97 863 20485       16     31089 1062
    97 863 20485     1404     31089 1062
    97 863 20485     2324     31089 1062
    95 193 20570     1843  131266.5 3880
    95 193 20570     1669  131266.5 3880
    95 193 20570      648  131266.5 3880
    95 193 20570     3126  131266.5 3880
    95 193 20570     1550  131266.5 3880
    95 193 20570     1373  131266.5 3880
    95 193 20570     6496  131266.5 3880
    95 193 20570      455  131266.5 3880
    95 193 20570      597  131266.5 3880
    95 193 20570      871  131266.5 3880
    95 193 20570      258  131266.5 3880
    95 193 20570    13231  131266.5 3880
    95 193 20570     1631  131266.5 3880
    95 193 20570     1884  131266.5 3880
    95 193 20570      435  131266.5 3880
    95 193 20570    14386  131266.5 3880
    95 193 20570     4174  131266.5 3880
    95 193 20570     2554  131266.5 3880
    95 193 20570     6014  131266.5 3880
    95 193 20570     4675  131266.5 3880
    95 193 20570     6132  131266.5 3880
    95 193 20570    568.5  131266.5 3880
    95 193 20570     1763  131266.5 3880
    95 193 20570     6134  131266.5 3880
    95 193 20570    16630  131266.5 3880
    95 193 20570      840  131266.5 3880
    95 193 20570     5243  131266.5 3880
    95 193 20570     4689  131266.5 3880
    95 193 20570    21397  131266.5 3880
    95 193 20632 18547.35 145158.81 5837
    95 193 20632   1032.2 145158.81 5837
    95 193 20632  8992.12 145158.81 5837
    95 193 20632   452.24 145158.81 5837
    95 193 20632  5674.88 145158.81 5837
    95 193 20632     1189 145158.81 5837
    95 193 20632 43607.46 145158.81 5837
    95 193 20632   100.54 145158.81 5837
    95 193 20632  2426.04 145158.81 5837
    95 193 20632  1627.19 145158.81 5837
    95 193 20632  4355.79 145158.81 5837
    95 193 20632 45322.45 145158.81 5837
    95 193 20632   242.74 145158.81 5837
    95 193 20632  4435.02 145158.81 5837
    95 193 20632  2130.08 145158.81 5837
    95 193 20632  5023.71 145158.81 5837
    96 470 20710      772  104797.9 8305
    96 470 20710      155  104797.9 8305
    96 470 20710     2004  104797.9 8305
    96 470 20710     2041  104797.9 8305
    96 470 20710    34746  104797.9 8305
    96 470 20710     7072  104797.9 8305
    96 470 20710      843  104797.9 8305
    96 470 20710       13  104797.9 8305
    96 470 20710     5800  104797.9 8305
    96 470 20710     5996  104797.9 8305
    96 470 20710     2212  104797.9 8305
    96 470 20710      999  104797.9 8305
    96 470 20710      382  104797.9 8305
    96 470 20710      570  104797.9 8305
    96 470 20710      660  104797.9 8305
    96 470 20710    15273  104797.9 8305
    96 470 20710  1143.88  104797.9 8305
    96 470 20710     3222  104797.9 8305
    96 470 20710     1548  104797.9 8305
    96 470 20710  2742.02  104797.9 8305
    96 470 20710     6622  104797.9 8305
    96 470 20710     1865  104797.9 8305
    96 470 20710     4952  104797.9 8305
    96 470 20710     3165  104797.9 8305
    end
    format %tdnn/dd/CCYY DATE

  • #2
    On a point of terminology: denominators are defined as what you divide by as in numerator/denominator =: quotient or ratio. The term is not best used as a synonym for group.

    Less pedantically: your code looks good to me. I think you're just being misled by the way the categories line up. Using groups (latest version downloadable from SSC or Stata 15 updated) I get this result for your sample data.

    Code:
    . bysort group_id: egen mean_group=mean(PRICE)
    
    . groups group_id num_id CODE DATE mean_group
    
      +-------------------------------------------------------------------+
      | group_id   num_id   CODE        DATE   mean_g~p   Freq.   Percent |
      |-------------------------------------------------------------------|
      |      631       97    475   1/20/2016   3229.501      18     18.56 |
      |     1062       97    863    2/1/2016     3108.9      10     10.31 |
      |     3880       95    193   4/26/2016   4526.431      29     29.90 |
      |     5837       95    193   6/27/2016   9072.426      16     16.49 |
      |     8305       96    470   9/13/2016   4366.579      24     24.74 |
      +-------------------------------------------------------------------+
    So the group identifier is separating different categories as instructed. The same should be true of your entire dataset, presumably much bigger.

    Note that creating a group identifier isn't needed at all, as

    Code:
    . bysort num_id CODE DATE: egen mean_group=mean(PRICE)
    would work fine. This point was also made in your earlier thread.

    Note: Clyde is singing the same song.
    Last edited by Nick Cox; 28 Sep 2017, 15:06.

    Comment


    • #3
      I can't replicate your problem. When I use your data and your code, I get correct results: a mean price within each combination of DATE, CODE, and num_id, and it is definitely not an overall mean for CODE because it gives different values for the same CODE depending on the DATE and num_id.

      Comment


      • #4
        I think I need to clarify: I want to use -table- (or some other similar command) to list specific CODEs and the associated PRICE per group_id (or some other way to get the PRICE of a CODE per group_id)

        Comment


        • #5
          Sorry, no clearer to me. I used one tabulation command in #2 and you can use any other you like. I don't see what the problem is.

          Comment


          • #6
            In the full data set there are ~7000 people (num_id), but with ~10,000 groups of CODE based on the different dates (group_id)
            Say I tabulated this form the dataex above:
            Code:
            table CODE if CODE==193, c(mean PRICE sd PRICE)
            it gives:
            Code:
                 CODE | mean(PRICE)    sd(PRICE)
            ----------+-------------------------
                  193 |   6142.7847     9761.177
            This is just the mean of PRICE by all occurrences of that CODE, I just want the mean PRICE per group_id
            Because the group_id tells me that the CODE occurred on a different date and this needs to be treated as a separate "event"

            or if I find the total:
            Code:
                 CODE | sum(PRICE)
            ----------+-----------
                  193 |   276425.3
            ----------------------
            The CODE may appear 400 times in the entire data set, but I want the mean price of that code per the number of group_ids that have that CODE.
            So the [total PRICE of CODE / number of group_id with that CODE] to calculate cost per group_id of that CODE and get error estimates...
            I am truly sorry if I am not articulating this properly - I am confusing myself trying to figure out exactly how to ask the question. I am thankful for all the suggestions so far (which have still been a help to me)

            Comment


            • #7
              I'm still not clear on what you want, but perhaps it is this:

              Code:
              tabstat price if CODE == 193, by(group_id) statistics(mean sd)
              Or if you strongly prefer the look of -table-'s output:

              Code:
              table group_id CODE if CODE == 193, c(mean PRICE sd PRICE)

              Comment


              • #8
                I basically want to be able to calculate for a given CODE: PRICE per group_id = (total PRICE of CODE ) / (number of group_id with that CODE)
                the total price associated with code 193 in the entire data set is:
                Code:
                table CODE if CODE==193, c(sum PRICE)
                
                -------------------------
                     CODE |   sum(PRICE)
                ----------+--------------
                      193 |       9461544
                -------------------------
                when I xtset to group_id, the number of group_id associated with that total is:
                Code:
                                 Overall             Between            Within
                     CODE |    Freq.  Percent      Freq.  Percent        Percent
                ----------+-----------------------------------------------------
                      193 |    2580    100.00       128    100.00         100.00
                ----------+-----------------------------------------------------
                    Total |    2580    100.00       128    100.00         100.00
                                               (n = 128)
                I want to be able to calculate the mean as [Price per group = 9461544 / 128] and an estimate of variation/error... but so far I have only been able to calculate overall mean of that price by the frequency of that code (i.e. 9461544 / 2580)

                Comment


                • #9
                  Graham: Working through http://www.stata-journal.com/sjpdf.h...iclenum=dm0055 may help.

                  Comment


                  • #10
                    I'll update since with all your help I was finally able to get the values I needed.
                    I ended up using the total calculated by group_id and then collapsing by group_id and CODE:
                    Code:
                    bysort DATE num_id CODE: egen total=total(PRICE)
                    
                    egen group_id = group(DATE CODE num_id)
                    
                    
                    sort group_id CODE
                    
                    collapse total, by(group_id CODE)
                    
                    
                    * tabulate mean/medians by group_ids
                    
                    table CODE if CODE==193, c(mean total sd total median total p25 total p75 total)
                    Code:
                    ---------------------------------------------------------------------------
                         CODE | mean(total)    sd(total)   med(total)   p25(total)   p75(total)
                    ----------+----------------------------------------------------------------
                          193 |    138212.7     9823.349     138212.7     131266.5     145158.8
                    ---------------------------------------------------------------------------
                    Thanks everyone for the guidance

                    Comment


                    • #11
                      I doubt that what's you want. When you collapse, by(group_id CODE) the only difference from collapse, by(group_id) is that the variable CODE will also end up in the collapsed dataset.

                      In that collapsed dataset, you will end with several observations for any distinct value of CODE because in general a given value of CODE will match more than one value of DATE and num_id.

                      When after the collapse you use table to take means, SD and quartiles over those values and look at a particular CODE, you've lost all information on relative numbers in the subcategories. It's like taking the auto dataset and collapsing by foreign and then taking the mean over those results for foreign and domestic cars. You have lost information on the numbers of foreign and domestic cars and the summary statistics won't have any useful meaning as being unweighted when only weighted makes sense.

                      I'd suggest code positively if I understood your problem, but I got lost before #5 and I haven't found my way yet.
                      But I think it's unlikely that you've got what you want..
                      Last edited by Nick Cox; 02 Oct 2017, 09:50.

                      Comment


                      • #12
                        Let me try asking this way:
                        As is, when I calculate mean PRICE (xtsum) for a CODE I am getting the mean based on the total frequency of that CODE in the data set. But since this is panel data, say I wanted to find the PRICE per person in this data set (the number in the table below for Between). i.e. if you xttab for CODE in the full data set, that is xtset on num_id you get:
                        Code:
                                          Overall             Between            Within
                             CODE |    Freq.  Percent      Freq.  Percent        Percent
                        ----------+-----------------------------------------------------
                              193 |    2580    100.00       121    100.00         100.00
                        ----------+-----------------------------------------------------
                            Total |    2580    100.00       121    100.00         100.00
                                                       (n = 121)
                        How can I get the PRICE per num_id associated with that CODE? And then I can apply that to the group_id, which when xtset for group_isd has:

                        Code:
                                          Overall             Between            Within
                             CODE |    Freq.  Percent      Freq.  Percent        Percent
                        ----------+-----------------------------------------------------
                              193 |    2580    100.00       128    100.00         100.00
                        ----------+-----------------------------------------------------
                            Total |    2580    100.00       128    100.00         100.00
                                                       (n = 128)

                        Comment


                        • #13
                          Sorry, but this style of explanation is the style you used in most previous posts -- and it fails with anyone who doesn't know your data and so has to read a word explanation and understand it all, yet cannot see the data you're using to reproduce an example, and in that way feel one's way to a code solution.

                          Clearly your dataset is far too big to reproduce here or attach -- that's understood -- but the implication is that you have to give us the simplest possible kind of example with the same form.

                          Here's an attempt to start at the other end. You have identifiers

                          Code:
                          date 
                          num_id 
                          code
                          and an outcome of interest

                          Code:
                          price
                          (Sorry, but the UPPERCASE variable names produce an synesthetic reaction in my case.)

                          Some silly numbers follow with the code to input them. Feel free to edit them if these data do not have the same logical structure as your data.

                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input float(code num_id price date)
                          1 1  2 1
                          1 2  3 2
                          1 3  5 3
                          2 1  7 4
                          2 2 11 5
                          2 3 13 6
                          3 1 17 7
                          3 2 19 8
                          3 3 23 9
                          end
                          
                          . list, sepby(code)
                          
                               +------------------------------+
                               | code   num_id   price   date |
                               |------------------------------|
                            1. |    1        1       2      1 |
                            2. |    1        2       3      2 |
                            3. |    1        3       5      3 |
                               |------------------------------|
                            4. |    2        1       7      4 |
                            5. |    2        2      11      5 |
                            6. |    2        3      13      6 |
                               |------------------------------|
                            7. |    3        1      17      7 |
                            8. |    3        2      19      8 |
                            9. |    3        3      23      9 |
                               +------------------------------+
                          Can you now explain with simple worked examples which results you want? . Code that gets you some of the way is naturally welcome.

                          Also, all of a sudden you are introducing xtsum and xttab but we have to reach back to #8 and #10 to recover the definitions, seemingly

                          Code:
                          egen group_id = group(num_id code date) 
                          xtset group_id
                          It seems pretty odd to define a panel identifier by group identifiers and date; how many observations does that give in each panel?

                          It may be that the xt framework is a distraction; I don't think anyone else can tell until they understand the problem.

                          Comment


                          • #14
                            so the data is more set up as such:
                            Code:
                            * Example generated by -dataex-. To install: ssc install dataex
                            
                            clear
                            
                            input float(code num_id price date)
                            
                            1 1  2 1
                            
                            1 1  4 1
                            
                            1 1  3 2
                            
                            1 1  2 2
                            
                            1 2  3 2
                            
                            1 3  5 3
                            
                            1 3  3 3
                            
                            1 3  4 1
                            
                            2 3  4 1
                            
                            2 3  3 1
                            
                            2 3  2 2
                            
                            2 2  2 1
                            
                            2 2  3 1
                            
                            2 2  1 2
                            
                            2 2  3 2
                            
                            2 1  7 4
                            
                            2 2 11 5
                            
                            2 3 13 6
                            
                            3 3  1 1
                            
                            3 1 17 1
                            
                            3 2 19 2
                            
                            3 2 3 2
                            
                            3 2 4 1
                            
                            3 2 5 1
                            
                            3 3 23 2
                            
                            end
                            and grouped as:
                            Code:
                            list, sepby(code)
                            to get:
                            Code:
                                 +------------------------------+
                                 | code   num_id   price   date |
                                 |------------------------------|
                              1. |    1        1       2      1 |
                              2. |    1        1       4      1 |
                              3. |    1        1       3      2 |
                              4. |    1        1       2      2 |
                              5. |    1        2       3      2 |
                              6. |    1        3       5      3 |
                              7. |    1        3       3      3 |
                              8. |    1        3       4      1 |
                                 |------------------------------|
                              9. |    2        3       4      1 |
                             10. |    2        3       3      1 |
                             11. |    2        3       2      2 |
                             12. |    2        2       2      1 |
                             13. |    2        2       3      1 |
                             14. |    2        2       1      2 |
                             15. |    2        2       3      2 |
                             16. |    2        1       7      4 |
                             17. |    2        2      11      5 |
                             18. |    2        3      13      6 |
                                 |------------------------------|
                             19. |    3        3       1      1 |
                             20. |    3        1      17      1 |
                             21. |    3        2      19      2 |
                             22. |    3        2       3      2 |
                             23. |    3        2       4      1 |
                             24. |    3        2       5      1 |
                             25. |    3        3      23      2 |
                                 +------------------------------+
                            so for instance code = 1
                            i) the mean price per code is total price / number of codes -> 26/8 = 3.52
                            ii) the mean price per num_id is total price / number of ids -> 26/3 = 8.6
                            iii) the mean price per num_id AND same date group is total price / group_id -> 26/5 = 5.2 (Because line 1&2 will be treated as 1 group, while line 3&4 will be treated as another group, line 5 one group, line 6&7 another group, and line 8 1 group, totaling 5 groups)

                            how can I code to get ii) and iii)?
                            Apologies for being unclear... and thank you for your time to help!

                            Comment


                            • #15
                              Thanks for using dataex. I reproduce your code here without the puzzling extra blank lines.

                              If I see "number of codes" I interpret that as "number of distinct (different) codes", not "number with each code". SImilarly with "number of ids".

                              As far as I can see, number of codes for you means number with each code but number of ids means number of different ids! That's inconsistent to say no more.

                              On i) using Stata I can't reproduce your calculation 26/8 = 3.52 as I get 3.25. I assume that's just a typo in your post. But I can suggest code for you:

                              Code:
                              * Example generated by -dataex-. To install: ssc install dataex
                              clear
                              input float(code num_id price date)
                              1 1  2 1
                              1 1  4 1
                              1 1  3 2
                              1 1  2 2
                              1 2  3 2
                              1 3  5 3
                              1 3  3 3
                              1 3  4 1
                              2 3  4 1
                              2 3  3 1
                              2 3  2 2
                              2 2  2 1
                              2 2  3 1
                              2 2  1 2
                              2 2  3 2
                              2 1  7 4
                              2 2 11 5
                              2 3 13 6
                              3 3  1 1
                              3 1 17 1
                              3 2 19 2
                              3 2  3 2
                              3 2  4 1
                              3 2  5 1
                              3 3 23 2
                              end
                              
                              egen mean_by_code = mean(price), by(code)
                               
                              list, sepby(code)
                              
                                   +-----------------------------------------+
                                   | code   num_id   price   date   mean_b~e |
                                   |-----------------------------------------|
                                1. |    1        1       2      1       3.25 |
                                2. |    1        1       4      1       3.25 |
                                3. |    1        1       3      2       3.25 |
                                4. |    1        1       2      2       3.25 |
                                5. |    1        2       3      2       3.25 |
                                6. |    1        3       5      3       3.25 |
                                7. |    1        3       3      3       3.25 |
                                8. |    1        3       4      1       3.25 |
                                   |-----------------------------------------|
                                9. |    2        3       4      1        4.9 |
                               10. |    2        3       3      1        4.9 |
                               11. |    2        3       2      2        4.9 |
                               12. |    2        2       2      1        4.9 |
                               13. |    2        2       3      1        4.9 |
                               14. |    2        2       1      2        4.9 |
                               15. |    2        2       3      2        4.9 |
                               16. |    2        1       7      4        4.9 |
                               17. |    2        2      11      5        4.9 |
                               18. |    2        3      13      6        4.9 |
                                   |-----------------------------------------|
                               19. |    3        3       1      1   10.28571 |
                               20. |    3        1      17      1   10.28571 |
                               21. |    3        2      19      2   10.28571 |
                               22. |    3        2       3      2   10.28571 |
                               23. |    3        2       4      1   10.28571 |
                               24. |    3        2       5      1   10.28571 |
                               25. |    3        3      23      2   10.28571 |
                                   +-----------------------------------------+
                              But on ii) I can't see any rationale for dividing the total price 26 for code 1 by the number of distinct codes 3. One sign that that is nonsensical is that the supposed mean isn't even within the range of the data. 26/3 is 8 and 2/3 but that's no kind of mean.

                              You might as well have heights for some Earth people, some Martians and some Venusians and then add them up and then divide by the number of groups to get an average or mean height.

                              I don't think that result has any meaning. It's certainly not a mean as defined in statistics.

                              Backing up: the key to getting means over groups of observations is just to use a by() option with one, or two or three variables specified, so far as this dataset is concerned. So, there are 2^3 = 8 possible levels of aggregation

                              Code:
                              (everything)
                              by(code)
                              by(date)
                              by(num_id)
                              by(code date)
                              by(date num_id)
                              by(code num_id)
                              by(code num_id date)
                              I bailed out at that point, but without looking at iii) closely I suspect the same fallacy as in ii). To repeat: for code 1 the prices range from 2 to 5, so whatever 5.2 is, it isn't any kind of mean.

                              Last edited by Nick Cox; 06 Oct 2017, 13:25.

                              Comment

                              Working...
                              X