  • Creating new rows in a database

    Dear Statalist members,

    I am working with food consumption data. Each participant describes every food that he/she consumed during two consecutive days; "int_num" is coded 21 for day 1 and 22 for day 2.

    Each food is entered as a new row in the database and is categorised by a food group, which appears as the variable "group". There are 18 groups, coded from 1 to 18. A quantity ("cons_qty") is also associated with every food consumed.

    As of now, if a food group was not consumed, there is no row in the database with an associated 0 value for "cons_qty"; the row simply doesn't exist.

    Here is an example of my dataset; participant 1 is on lines 1-70 and participant 2 on lines 70-100:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(int_num tag_recall tag_people tag_number_recalls_consumed) float number_recalls_consumed byte group float(cons_qty qty_per_group_per_rec mean_qty_per_group_per_person)
    21 0 0 1 2  1       114       114     179.5
    22 0 0 0 2  1       245       245     179.5
    21 0 0 0 .  2        62         .         .
    22 0 0 0 .  2 23.179415         .         .
    21 0 0 1 2  2 24.886913  243.4164  321.4293
    22 0 0 0 .  2 .25754905         .         .
    22 0 0 0 .  2  42.92484         .         .
    21 0 0 0 .  2        40         .         .
    22 0 0 0 .  2        18         .         .
    22 0 0 0 .  2       200         .         .
    22 0 0 0 2  2  9.954765  399.4421  321.4293
    22 0 0 0 .  2   8.71042         .         .
    22 1 0 0 .  2  9.577605         .         .
    22 0 0 0 .  2 29.864294         .         .
    22 0 0 0 .  2 24.886913         .         .
    21 0 0 0 .  2  9.954765         .         .
    21 0 0 0 .  2        68         .         .
    21 0 0 0 .  2 29.864294         .         .
    21 1 1 0 .  2   8.71042         .         .
    22 0 0 0 .  2   9.65809         .         .
    22 0 0 0 .  2  22.42823         .         .
    22 0 0 0 2  4       117     373.5     373.5
    21 0 0 0 .  4     100.5         .         .
    22 0 0 0 .  4       156         .         .
    22 0 0 0 .  4     100.5         .         .
    21 0 0 0 .  4       156         .         .
    21 0 0 1 2  4       117     373.5     373.5
    22 0 0 0 .  5  68.67975         .         .
    21 0 0 0 .  5        30         .         .
    21 0 0 0 .  5       150         .         .
    22 0 0 0 .  5 14.487134         .         .
    22 0 0 0 .  5       150         .         .
    22 0 0 0 2  5       150  383.1669  356.5834
    21 0 0 1 2  5       150       330  356.5834
    22 0 0 0 .  6   9.65809         .         .
    22 0 0 0 .  6  98.72713         .         .
    22 0 0 0 .  6      20.5         .         .
    21 0 0 0 .  6        60         .         .
    21 0 0 0 .  6  49.77383         .         .
    22 0 0 0 2  6  49.77383 178.65906 154.46645
    21 0 0 1 2  6      20.5 130.27383 154.46645
    21 0 0 0 .  7 24.886913         .         .
    21 0 0 0 .  7        40         .         .
    22 0 0 0 .  7       150         .         .
    21 0 0 1 2  7        75  239.8869 235.28806
    22 0 0 0 .  7 24.886913         .         .
    22 0 0 0 2  7  55.80229  230.6892 235.28806
    21 0 0 0 .  7       100         .         .
    21 0 0 1 1  9     51.92     51.92     25.96
    22 0 0 0 . 10  3.219363         .         .
    22 0 0 0 2 10  6.438726  9.658089 15.089045
    21 0 0 1 2 10     20.52     20.52 15.089045
    21 0 0 1 1 12        75        75      37.5
    22 0 0 0 2 13      3200      3200      3250
    21 0 0 1 2 13      3300      3300      3250
    21 0 0 0 . 15        20         .         .
    22 0 0 0 . 15 .04292484         .         .
    21 0 0 0 . 15  5.972859         .         .
    22 0 0 0 . 15  5.972859         .         .
    22 0 0 0 . 15 .04292484         .         .
    21 0 0 1 2 15      4.75 167.92287  94.12877
    22 0 0 0 2 15 .53656054 20.334673  94.12877
    22 0 0 0 . 15      11.4         .         .
    21 0 0 0 . 15      13.2         .         .
    22 0 0 0 . 15  .8584968         .         .
    22 0 0 0 . 15 .53656054         .         .
    22 0 0 0 . 15 .04292484         .         .
    22 0 0 0 . 15  .8584968         .         .
    21 0 0 0 . 15       124         .         .
    22 0 0 0 . 15 .04292484         .         .
    22 0 0 1 1  1       160       160        80
    22 0 0 0 .  2  3.356522         .         .
    22 0 0 1 1  2 16.782608  20.13913 10.069565
    21 0 0 1 2  4        75        75     161.7
    22 0 0 0 2  4     248.4     248.4     161.7
    22 0 0 0 2  5  8.391304  8.391304  66.69565
    21 0 0 1 2  5       125       125  66.69565
    22 0 0 0 .  6    59.598         .         .
    22 0 0 0 2  6     58.05 182.26105 140.33052
    21 0 0 1 2  6      98.4      98.4 140.33052
    22 0 0 0 .  6 64.613045         .         .
    22 0 0 0 2  7  52.86522  52.86522  126.4326
    21 0 0 1 2  7       200       200  126.4326
    22 0 0 0 . 10 30.208696         .         .
    22 0 0 0 2 10      12.8   43.0087  22.25435
    21 0 0 1 2 10       1.5       1.5  22.25435
    22 0 0 0 . 11        40         .         .
    22 0 0 0 2 11        46        86        67
    21 0 0 1 2 11        48        48        67
    22 0 0 1 1 12        80        80        40
    22 0 0 0 2 13        75    2087.5    2037.5
    21 0 0 0 . 13       600         .         .
    22 0 0 0 . 13       600         .         .
    21 1 1 0 . 13       300         .         .
    21 0 0 0 . 13        75         .         .
    22 0 0 0 . 13       300         .         .
    21 0 0 0 . 13      37.5         .         .
    21 0 0 0 . 13       900         .         .
    22 0 0 0 . 13       300         .         .
    22 0 0 0 . 13       300         .         .
    end
    I want to describe the mean consumption of the different food groups in two subsets: 1) only the participants who consumed that food group, and 2) all participants, including those who did not consume it.

    I have generated new variables using egen's tag() function: "tag_people" uniquely identifies one participant, "tag_recall" one day, and "tag_group" one group (see the example dataset). I am working with the average consumption per food group per participant over the two days, the variable "mean_qty_per_group_per_person".
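    For reference, tag variables of this kind can be generated with egen's tag() function along these lines (assuming a participant identifier called person_id, which is not shown in the excerpt above):

    Code:
    * Sketch: one-per-person, one-per-person-day, and one-per-person-group markers
    egen tag_people = tag(person_id)
    egen tag_recall = tag(person_id int_num)
    egen tag_group  = tag(person_id group)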

    Since participants who did not consume a food group do not appear in the data, I have trouble generating mean consumptions over all participants. I've tried a loop where I manually enter the total number of participants to obtain the mean:

    Code:
    local i = 1
    foreach var of local food_pattern {

        capture drop `var'
        gen `var' = 0
        replace `var' = mean_qty_per_group_per_person if group == `i' & tag_number_recalls_consumed == 1

        capture drop meanqty_`var'
        * 123 is the total number of participants, entered manually
        egen meanqty_`var' = total(`var'/123)

        local i = `i' + 1
    }
    This works, but any change in the database could lead to errors. Plus, I will need data for participants who did not consume a food group for further statistical analysis.
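    One way to make the participant count robust to changes in the database would be to compute it rather than hard-code 123, for example (again assuming a participant identifier called person_id):

    Code:
    * Count participants dynamically instead of hard-coding 123
    egen byte _tagp = tag(person_id)
    count if _tagp
    local n_people = r(N)
    drop _tagp
    * then divide by `n_people' instead of 123 in the egen total() above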

    My idea was then to create a new row for each group that was not consumed and to allocate a value of 0 to "cons_qty". The conditions would be: for a given "patient_id", identify whether a number between 1 and 18 is missing for the variable "group". If a number is missing, create a new row with that patient_id, the missing number for the variable "group", and a value of 0 for the variable "cons_qty". For example, for participant 1 in the dataset above, group 3 was not consumed; I would like to create a new row with patient_id = 01, group = 3 and cons_qty = 0.

    I've checked Statalist and found this interesting post: https://www.statalist.org/forums/for...w-observations, but I did not manage to translate it to my case. I have also looked at the expand command.

    Would you have any suggestions on how to implement the above, or other approaches that would work?

    Thank you,
    Constance




  • #2
    If I understand you correctly, you want to have a data set in which there is an observation ("row") for each person on each day for each food group. And you want the cons_qty variable set to zero in those observations that, at present, simply don't exist in the data.

    That is not possible to do with the data you show in your examples, because there needs to be a variable that identifies which person each observation refers to. Your tag_people variable is presumably somehow related to that, but it does not provide the required information. I will assume that somewhere in your real data there is a variable that identifies individuals, and that it is called person_id. With that, you can do:

    Code:
    fillin person_id int_num group
    replace cons_qty = 0 if _fillin
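
    To make the behaviour concrete: fillin creates an observation for every combination of the listed variables that does not already exist, fills the other variables with missing, and marks the created observations with _fillin == 1, which is why the replace targets them. A toy sketch (hypothetical variable names):

    Code:
    * Two people, two days, groups 1-3, with most combinations absent
    clear
    input byte(person_id int_num group) float cons_qty
    1 21 1 100
    1 22 2  50
    2 21 3  75
    end
    fillin person_id int_num group
    replace cons_qty = 0 if _fillin
    * all 12 person x day x group combinations now exist; the 9 new ones have cons_qty == 0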



    • #3
      Dear Clyde, yes, I indeed have a variable id_num that allows me to identify individuals. Thank you, the code worked really well!



      • #4
        I have another question related to the dataset in #1. I hope it is okay to post it on the same thread, let me know if I should create a new thread instead.

        Each group is further divided into subgroups. The different groups don't have the same number of subgroups; for example, group 4 has 5 subgroups whereas group 7 has 8. I want to create new observations for subgroups that were not consumed, along the same lines as for the groups (see #1).

        I tried using:

        Code:
        fillin person_id int_num group subgroup
        replace cons_qty = 0 if _fillin
        but the problem is that it creates 8 subgroups for each group (8 being the highest number of subgroups within any group in my dataset), regardless of the group concerned.

        I saw that the option by() is not allowed with fillin, and I also tried using if to distinguish between the different groups, but it didn't work either. I checked mipolate but, as I understand it, it doesn't do what I am looking for.

        My fallback would be to use fillin on subsets of my data, keeping only one group at a time, and then append all the subsets afterwards. Yet this doesn't seem like the most efficient way to do it. Would you have another approach to suggest?
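
        For what it's worth, a sketch of that workaround, assuming variables person_id, group (coded 1-18, each appearing at least once in the data) and subgroup:

        Code:
        * fillin within each group separately, then stack the results
        tempfile base stacked
        save `base'
        clear
        save `stacked', emptyok
        forvalues g = 1/18 {
            use if group == `g' using `base', clear
            fillin person_id int_num subgroup
            replace group = `g' if _fillin
            replace cons_qty = 0 if _fillin
            append using `stacked'
            save `stacked', replace
        }
        use `stacked', clear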



        • #5
          Question in #4 was resolved here: https://www.statalist.org/forums/forum/general-stata-discussion/general/1584339-creating-new-observations-by-group

