Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating new observations by group

    Dear Statalist members,

    I am working with data about nutritional intake where participants describe what they eat during two consecutive days. Each food is entered as a new observation in the database and is categorised by a food group and subgroup, which appear as the variables "group" and "subgroup1". There are 18 groups, coded from 1 to 18. A quantity ("cons_qty") is also associated with every food consumed.

    I successfully used fillin as suggested by Clyde Schechter to create new observations for groups when these observations did not exist in the database (see:https://www.statalist.org/forums/for...-in-a-database). I would have another question related to this.

    Each group is further divided into subgroups. The different groups don't have the same number of subgroups, for example group 4 has 5 subgroups whereas group 7 has 8 subgroups. I want to create new observations for subgroups that were not consumed, in the same idea that for the groups.

    Here is an example of the dataset:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float id byte(int_num group subgroup1) float cons_qty
    1 21  1 .         0
    1 21  2 0       261
    1 21  2 1        32
    1 21  2 2  132.1875
    1 21  2 2      76.8
    1 21  2 2       111
    1 21  2 3      40.5
    1 21  3 .         0
    1 21  4 1       207
    1 21  4 1       138
    1 21  5 1    64.375
    1 21  5 1    64.375
    1 21  5 5      21.8
    1 21  5 5       125
    1 21  6 .         0
    1 21  7 .         0
    1 21  8 .         0
    1 21  9 .         0
    1 21 10 1      25.2
    1 21 11 2        10
    1 21 12 2        10
    1 21 13 3     187.5
    1 21 13 3     187.5
    1 21 13 4       500
    1 21 13 4       500
    1 21 13 4       300
    1 21 13 4       300
    1 21 14 .         0
    1 21 15 4        28
    1 21 15 4        .6
    1 21 16 .         0
    1 21 17 2         1
    1 21 17 2         1
    1 21 18 .         0
    1 22  1 .         0
    1 22  2 1 15.208208
    1 22  2 2 30.416416
    1 22  2 2 30.416416
    1 22  3 .         0
    1 22  4 1      54.6
    1 22  4 1     124.2
    1 22  4 1      57.6
    1 22  4 1       138
    1 22  4 1 36.399998
    1 22  5 1    64.375
    1 22  5 1    64.375
    1 22  5 1     57.68
    1 22  5 4       100
    1 22  6 3        28
    1 22  6 3     27.95
    1 22  6 3        40
    1 22  6 3      55.9
    1 22  7 4        90
    1 22  8 .         0
    1 22  9 .         0
    1 22 10 1 26.614365
    1 22 10 2       2.5
    1 22 11 1      5.25
    1 22 11 1         7
    1 22 12 .         0
    1 22 13 3     187.5
    1 22 13 3     187.5
    1 22 13 4       500
    1 22 13 4       300
    1 22 13 4       500
    1 22 14 .         0
    1 22 15 3  .1520821
    1 22 15 4  19.01026
    1 22 15 4  .3802052
    1 22 15 4  3.802052
    1 22 16 1     123.2
    1 22 17 2         1
    1 22 17 2         1
    1 22 18 .         0
    1 21  1 .         0
    2 21  2 1  24.14001
    2 21  2 2  48.28003
    2 21  2 6  48.28003
    2 21  3 .         0
    2 21  4 .         0
    2 21  5 1      37.5
    2 21  6 3      47.5
    2 21  6 3      47.5
    2 21  7 4       100
    2 21  7 4        90
    2 21  8 .         0
    2 21  9 .         0
    2 21 10 0  42.24502
    2 21 10 1       1.8
    2 21 11 1         8
    2 21 11 1         8
    2 21 11 1         8
    2 21 12 .         0
    2 21 13 3       150
    2 21 13 3     112.5
    2 21 13 3       150
    2 21 13 4       500
    2 21 13 4      1000
    2 21 13 4       500
    2 21 13 4       500
    end


    I tried using:

    Code:
    fillin id int_num group subgroup1
    replace cons_qty = 0 if _fillin
    but the problem is that it creates 8 subgroups for each group (8 is the highest number of subgroups inside a group in my dataset), independently of the group concerned.

    I saw that the option by() is not allowed for fillin and I also tried using if to distinguish between the different groups, it didn't work either. I checked mipolate but as I understood, it doesn't do what I am looking for.

    My solution would be to use fillin on subsets of my data where I keep only one group and append all the subsets afterwards. Yet, it doesn't seem like the most efficient way to do it. Would you have another approach to suggest?
    Last edited by Constance Legay; 02 Dec 2020, 07:19.

  • #2
    There are several things I don't understand here.

    1. How is the full extent of the subgroup variable to be ascertained for each group. For example, in your data, group 2 appears with subgroups 0, 1, 2, 3, and 6. Do you also want the end result to include group2 with subgroups 4 1n3 5? Or just with 0, 1, 2, 3, and 6?

    2. There are many observations with missing value for subgroup1. Do you want each group (or at least each group that appears with a missing value subgroup1 in the data) to be paired with missing value for every id in the final result?

    3. How do you want to handle groups that, at least in the example data, only occur with missing subgroup1 (i.e. groups 1, 3, 8, 9, 14, 18).

    Comment


    • #3
      My dataset only contains observations for foods that were consumed. If a food was not consumed, at the moment there is no observation in the database with an associated 0 value for "cons_qty", it doesn't exist. What I would like to do is create a new observation with a value of 0 for "cons_qty" for each group and subgroup if it was not consumed.

      Each group contains a certain number of subgroups, this is defined in a codebook, so I know how many subgroups a given group has. For example, group 1 would have subgroups 0,1,2, and 3, group 2 would have subgroups 0,1,2,3,4,5, and 6.

      The example data I showed in #1 is what I obtained after using the command below, so only taking into account the groups and not the subgroups. This is why all those observations have missing values for subgroup1.

      Code:
      fillin id int_num group replace cons_qty = 0 if _fillin
      So, for group 2, let's say there are 7 subgroups: 0,1,2,3,4,5 and 6. At the moment only observations for subgroups 0,1,2,3 and 6 exist, 4 and 5 were not consumed by this participant so they don't appear. I would like to create observations for group 2 subgroup 4 qty_cons=0 and group 2 subgroup 5 qty_cons=0. The goal is to do this for each group/subgroups, taking into account that different groups have a different number of subgroups. I hope this is clearer, please let me know if you need more information.

      Comment


      • #4
        That is clearer, but I do need more information. In the case of group 2, there are 7 subgroups numbered 0 through 6, and we need to fill in subgroups 4 and 5. That's clear enough. But what if there is a group for which the subgroups are also 0 through 6, and in your data set all and only 0 through 5 are instantiated. There is no apparent way to know from the data whether 0 through 5 is all there is, or whether we need to extend to 6, or maybe beyond that to 7, or 8, or who knows how many.

        So to do this, there needs to be a crosswalk that shows, for each group, what all the possible subgroups are. If the subgroups always start at 0 and run consecutively up to some maximum number, then just a list of the groups and the corresponding maximum subgroup numbers would be enough.

        Comment


        • #5
          Ok, I understand what you mean. Yes, subgroups start at 0 and run consecutively up to the maximum number. Here is the list of groups and their corresponding number of subgroups:

          Group Subgroup1
          1 0-3
          2 0-6
          3 0-1
          4 0-3
          5 0-7
          6 0-5
          7 0-6
          8 0-4
          9 0-1
          10 0-4
          11 0-3
          12 0-2
          13 0-4
          14 0-6
          15 0-4
          16 0-2
          17 0-2
          18 0-2

          Comment


          • #6
            I think this will do what you want:

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input byte group float upper_limit
             1 3
             2 6
             3 1
             4 3
             5 7
             6 5
             7 6
             8 4
             9 1
            10 4
            11 3
            12 2
            13 4
            14 6
            15 4
            16 2
            17 2
            18 2
            end
            expand upper_limit + 1
            by group, sort: gen subgroup1 = _n-1
            gen cons_qty = 0
            drop upper_limit
            tempfile subgroups
            save `subgroups'
            
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input float id byte(int_num group subgroup1) float cons_qty
            1 21  2 0       261
            1 21  2 1        32
            1 21  2 2  132.1875
            1 21  2 2      76.8
            1 21  2 2       111
            1 21  2 3      40.5
            1 21  4 1       207
            1 21  4 1       138
            1 21  5 1    64.375
            1 21  5 1    64.375
            1 21  5 5      21.8
            1 21  5 5       125
            1 21 10 1      25.2
            1 21 11 2        10
            1 21 12 2        10
            1 21 13 3     187.5
            1 21 13 3     187.5
            1 21 13 4       500
            1 21 13 4       500
            1 21 13 4       300
            1 21 13 4       300
            1 21 15 4        28
            1 21 15 4        .6
            1 21 17 2         1
            1 21 17 2         1
            1 22  2 1 15.208208
            1 22  2 2 30.416416
            1 22  2 2 30.416416
            1 22  4 1      54.6
            1 22  4 1     124.2
            1 22  4 1      57.6
            1 22  4 1       138
            1 22  4 1 36.399998
            1 22  5 1    64.375
            1 22  5 1    64.375
            1 22  5 1     57.68
            1 22  5 4       100
            1 22  6 3        28
            1 22  6 3     27.95
            1 22  6 3        40
            1 22  6 3      55.9
            1 22  7 4        90
            1 22 10 1 26.614365
            1 22 10 2       2.5
            1 22 11 1      5.25
            1 22 11 1         7
            1 22 13 3     187.5
            1 22 13 3     187.5
            1 22 13 4       500
            1 22 13 4       300
            1 22 13 4       500
            1 22 15 3  .1520821
            1 22 15 4  19.01026
            1 22 15 4  .3802052
            1 22 15 4  3.802052
            1 22 16 1     123.2
            1 22 17 2         1
            1 22 17 2         1
            2 21  2 1  24.14001
            2 21  2 2  48.28003
            2 21  2 6  48.28003
            2 21  5 1      37.5
            2 21  6 3      47.5
            2 21  6 3      47.5
            2 21  7 4       100
            2 21  7 4        90
            2 21 10 0  42.24502
            2 21 10 1       1.8
            2 21 11 1         8
            2 21 11 1         8
            2 21 11 1         8
            2 21 13 3       150
            2 21 13 3     112.5
            2 21 13 3       150
            2 21 13 4       500
            2 21 13 4      1000
            2 21 13 4       500
            2 21 13 4       500
            end
            tempfile main_data
            save `main_data'
            
            use id int_num using `main_data'
            duplicates drop
            fillin id int_num
            drop _fillin
            cross using `subgroups'
            merge 1:m id int_num group subgroup1 using `main_data', update replace
            This data set will contain at least one observation for every id at every int_num in every group and subgroup, with a zero filled in for quantity if the main data does not have a corresponding entry.

            Comment


            • #7
              Thank you for your help Clyde, your code worked really well!

              Comment

              Working...
              X