  • Creating a group variable out of other categorical variables.

    Hi,

    I have some binary variables (TB, HIV, Diabetes) with yes/no answers. Is it possible to combine them into one variable called 'diseasegroups' that contains both the individual diseases and the combined groups? For example,

    tabulating diseasegroups would show the following categories:
    TB
    HIV
    Diabetes
    None
    All

    I tried the grouplabs method, but the result is different from what I am looking for.

    I can do it manually, but that gets complex when there are many variables to handle, so I am wondering whether there is a more efficient way of doing this:
    gen diseasegroups = "None"
    replace diseasegroups = "TB" if TB == 1
    replace diseasegroups = "HIV" if HIV == 1
    replace diseasegroups = "Diabetes" if Diabetes == 1
    replace diseasegroups = "All" if TB == 1 & HIV == 1 & Diabetes == 1


    Thanks in advance for your insight!

    Here is the dataex:

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(TB HIV HBP Diabetes)
    0 1 0 0
    1 1 0 0
    1 1 0 0
    1 1 0 0
    1 1 0 0
    0 1 0 0
    1 1 0 0
    1 0 0 0
    1 1 0 0
    0 0 0 0
    1 0 0 0
    1 0 0 0
    1 1 0 0
    1 1 0 0
    1 1 0 0
    1 1 0 0
    0 1 0 0
    1 1 0 1
    0 1 1 1
    0 0 0 0
    0 0 0 0
    0 0 0 0
    1 1 0 0
    1 1 0 0
    0 0 0 0
    1 1 0 0
    1 1 0 0
    1 1 0 0
    1 1 0 0
    1 1 0 0
    0 0 0 0
    0 1 0 1
    0 1 0 1
    1 0 0 0
    0 1 0 0
    1 1 0 0
    0 0 0 0
    1 1 0 0
    1 1 0 0
    1 1 0 0
    1 1 0 1
    1 1 0 0
    0 1 0 0
    0 0 0 0
    0 1 0 0
    0 1 0 0
    0 1 0 1
    1 1 1 0
    1 1 1 0
    1 1 1 0
    0 1 1 0
    1 1 0 0
    1 0 0 0
    0 1 1 1
    1 1 0 1
    1 1 0 0
    0 0 0 0
    0 0 0 0
    0 1 0 0
    0 1 0 0
    1 1 0 0
    1 1 0 0
    1 1 0 0
    1 1 0 0
    1 1 0 0
    1 1 0 0
    0 1 0 0
    1 1 0 0
    0 1 1 0
    0 1 0 0
    0 1 0 0
    1 1 0 0
    1 0 0 0
    0 0 0 0
    1 1 0 0
    0 0 0 0
    0 1 0 0
    0 1 0 0
    0 1 0 0
    1 1 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    0 1 0 0
    0 1 0 0
    0 0 0 0
    0 0 0 0
    1 0 0 0
    0 0 0 0
    1 0 0 0
    1 0 0 0
    0 0 0 0
    0 0 0 0
    0 0 0 0
    1 1 0 0
    0 0 0 0
    0 0 0 0
    1 1 0 0
    0 0 0 0
    end
    label values TB Diabetes
    label values HIV V121
    label def V121 0 "no", modify
    label def V121 1 "yes", modify
    label values HBP V122
    label def V122 0 "no", modify
    label def V122 1 "yes", modify
    label values Diabetes V123
    label def V123 0 "no", modify
    label def V123 1 "yes", modify
    Last edited by Sonnen Blume; 18 Jan 2024, 08:12.

  • #2
    Originally posted by Sonnen Blume
    I can do it manually, but that gets complex when there are many variables to handle, so I am wondering whether there is a more efficient way of doing this:
    gen diseasegroups = "None"
    replace diseasegroups = "TB" if TB == 1
    replace diseasegroups = "HIV" if HIV == 1
    replace diseasegroups = "Diabetes" if Diabetes == 1
    replace diseasegroups = "All" if TB == 1 & HIV == 1 & Diabetes == 1
    Hm, so diabetes "beats" TB and HIV? That is, if you have diabetes together with TB or HIV (but not both), you are placed in the Diabetes group and no other, because the later replace commands overwrite the earlier ones. That is what this code does. Is that really what you want?



    • #3
      For some technique, see:

      SJ-7-4 dm0034 . . . Stata tip 52: Generating composite categorical variables
      . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
      Q4/07 SJ 7(4):582--583 (no commands)
      tip on how to generate categorical variables using
      tostring and egen, group()
      The labelling of egen, group() would not work especially well for these data. For 3 binary variables, you can set up your own binary coding with distinct decimals 0 to 7 corresponding to binary 000 to binary 111.

      With that data example, and with labmask (Stata Journal) and groups (Stata Journal), I used this code:

      Code:
      gen group = 4 * HIV + 2 * HBP + Diabetes 
      
      gen text = cond(HIV, "HIV", "")
      
      foreach v in HBP Diabetes { 
          replace text = text + " `v'" if `v'
      }
      
      labmask group, values(text)
      
      groups group HIV HBP Diabetes, nolabel 
      
      groups group HIV HBP Diabetes
      with these results:


      Code:
      . groups group HIV HBP Diabetes, nolabel 
      
        +------------------------------------------------+
        | group   HIV   HBP   Diabetes   Freq.   Percent |
        |------------------------------------------------|
        |     0     0     0          0      34     34.00 |
        |     4     1     0          0      53     53.00 |
        |     5     1     0          1       6      6.00 |
        |     6     1     1          0       5      5.00 |
        |     7     1     1          1       2      2.00 |
        +------------------------------------------------+
      
      . 
      . groups group HIV HBP Diabetes 
      
        +-----------------------------------------------------------+
        |            group   HIV   HBP   Diabetes   Freq.   Percent |
        |-----------------------------------------------------------|
        |                     no    no         no      34     34.00 |
        |              HIV   yes    no         no      53     53.00 |
        |     HIV Diabetes   yes    no        yes       6      6.00 |
        |          HIV HBP   yes   yes         no       5      5.00 |
        | HIV HBP Diabetes   yes   yes        yes       2      2.00 |
        +-----------------------------------------------------------+
      Naturally, you could add an explicit value label such as "<none>" for value 0.
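The binary-coding idea in #3 extends to any number of indicators, and the "<none>" label can be attached in the same pass. The following is only a minimal sketch under stated assumptions, not the code posted above: it uses a four-row toy dataset rather than the thread's data, official commands only (no labmask), and treats the leftmost variable in the hypothetical local `vars` as the highest bit.

```stata
* Toy data (an assumption for illustration, not the example from #1)
clear
input byte(TB HIV HBP Diabetes)
0 0 0 0
1 0 0 0
0 1 0 1
1 1 1 1
end

local vars TB HIV HBP Diabetes      // assumed order: leftmost = highest bit

gen int group = 0
gen text = ""
foreach v of local vars {
    * shift the code one binary place left and add the next indicator
    replace group = 2 * group + (`v' == 1)
    * append the variable name to the label text where the indicator is 1
    replace text = trim(text + " `v'") if `v' == 1
}
replace text = "<none>" if group == 0

* define one value label per observed code
levelsof group, local(codes)
foreach c of local codes {
    levelsof text if group == `c', local(lab) clean
    label define group `c' "`lab'", modify
}
label values group group

tabulate group
```

With four indicators the codes run from 0 (binary 0000) to 15 (binary 1111). If labmask has already been used as in #3, it names the value label after the variable by default, so `label define group 0 "<none>", modify` on its own should be enough for the last step.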



      • #4
        Here is another brute-force approach, using tuples (from SSC), if you want all possible combinations:

        Code:
        generate diseasegroups:diseasegroups = 0
        
        label define diseasegroups 0 "None"
        
        tuples TB HIV HBP Diabetes
        
        forvalues i = `ntuples'(-1)1 {
            
            label define diseasegroups `i' "`tuple`i''" , add
            
            local exp : subinstr local tuple`i' " " "==1 & " , all
            
            replace diseasegroups = `i' if `exp' == 1 & !diseasegroups
            
        }
        
        label define diseasegroups `ntuples' "All" , modify
        
        tabulate diseasegroups
        resulting in

        Code:
        . tabulate diseasegroups
        
           diseasegroups |      Freq.     Percent        Cum.
        -----------------+-----------------------------------
                    None |         25       25.00       25.00
                     HIV |         17       17.00       42.00
                      TB |          9        9.00       51.00
            HIV Diabetes |          3        3.00       54.00
                 HIV HBP |          2        2.00       56.00
                  TB HIV |         36       36.00       92.00
        HIV HBP Diabetes |          2        2.00       94.00
         TB HIV Diabetes |          3        3.00       97.00
              TB HIV HBP |          3        3.00      100.00
        -----------------+-----------------------------------
                   Total |        100      100.00
        for the example data in #1
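The least obvious line in #4 is probably the one that builds `exp`. As a small sketch, assuming a hypothetical local `tuple` that stands in for one of the tuple macros left behind by tuples, the substitution can be inspected on its own:

```stata
* Stand-in for one of the tuple macros created by the tuples command
local tuple TB HIV Diabetes

* Replace every blank by "==1 & ", as in the loop in #4
local exp : subinstr local tuple " " "==1 & ", all

* The last variable gets its test from the "== 1" appended in the if qualifier
display "if `exp' == 1"
```

This displays `if TB==1 & HIV==1 & Diabetes == 1`, which is the condition the replace command in the loop would evaluate for that tuple (before the additional `& !diseasegroups`).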



        • #5
          Sonnen Blume is aware of this thread for visualising such data, but others may not be: https://www.statalist.org/forums/for...lable-from-ssc

          A paper on upsetplot and vennbar will appear in the Stata Journal in 2024.



          • #6
            Originally posted by daniel klein

            Hm, so diabetes "beats" TB and HIV? Meaning, if you have diabetes but not TB and also HIV, you will be placed into the diabetes group and none other. That is what this code does. Is that really what you want?
            Hi Daniel,
            Yes, that's the goal for the new variable. The objective is to get a result like this:

            dis.grp  |   Freq.   Percent     Cum.
            ---------+----------------------------
            TB       |   6,019     19.33   100.00
            Diabetes |   2,196      7.05    16.55
            HIV      |  11,710     37.62    54.16
            None     |   8,251     26.50    80.67
            All      |   2,955      9.49     9.49



            • #7
              Originally posted by Sonnen Blume
              Yes, that's the goal for the new variable.
              OK. Just to be sure, you do realize that the resulting distribution completely depends on the order in which variables, i.e., diseases, are processed?

              Still, brute-force, but there you go:

              Code:
              generate diseasegroups:diseasegroups = 0
              label define diseasegroups 0 "None"
              
              local sorted_varlist TB HIV Diabetes // <- changing the order, changes the result
              
              local i 0
              foreach var of local sorted_varlist {
                  
                  local ++i
                  
                  label define diseasegroups `i' "`var'" , add
                  
                  replace diseasegroups = `i' if (`var' == 1)
                  
              }
              
              local exp : subinstr local sorted_varlist " " "==1 & " , all
              
              local ++i
              replace diseasegroups = `i' if `exp' == 1
              label define diseasegroups `i' "All" , add
              which yields

              Code:
              . tabulate diseasegroups
              
              diseasegrou |
                       ps |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                     None |         25       25.00       25.00
                       TB |          9        9.00       34.00
                      HIV |         58       58.00       92.00
                 Diabetes |          5        5.00       97.00
                      All |          3        3.00      100.00
              ------------+-----------------------------------
                    Total |        100      100.00

              Edit: Because I still have a very hard time seeing why you would want that, here is the result using the variable list in the order HIV Diabetes TB:

              Code:
              . tabulate diseasegroups
              
              diseasegrou |
                       ps |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                     None |         25       25.00       25.00
                      HIV |         19       19.00       44.00
                 Diabetes |          5        5.00       49.00
                       TB |         48       48.00       97.00
                      All |          3        3.00      100.00
              ------------+-----------------------------------
                    Total |        100      100.00
              Seems to me that the group to which individuals belong depends more on the researchers' idea of ordered diseases than on the diseases themselves ... odd.
              Last edited by daniel klein; 18 Jan 2024, 10:24.
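The order dependence discussed in #7 can be checked observation by observation. This is a rough sketch on a five-row toy dataset (an assumption, not the data from #1); the names `order1`, `order2`, `grp1` and `grp2` are hypothetical and exist only for this illustration.

```stata
* Toy data (an assumption for illustration)
clear
input byte(TB HIV Diabetes)
0 0 0
1 0 0
1 1 0
0 1 1
1 1 1
end

* Two processing orders to compare
local order1 TB HIV Diabetes
local order2 HIV Diabetes TB

forvalues k = 1/2 {
    gen str8 grp`k' = "None"
    foreach var of local order`k' {
        * variables later in the list overwrite earlier ones
        replace grp`k' = "`var'" if `var' == 1
    }
    replace grp`k' = "All" if TB == 1 & HIV == 1 & Diabetes == 1
}

list, noobs
tabulate grp1 grp2
```

The third observation (TB and HIV, no diabetes) lands in HIV under the first order and in TB under the second, while every other row is classified identically. Off-diagonal cells in the cross-tabulation are the individuals whose group is decided by the ordering alone.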



              • #8
                Originally posted by daniel klein
                Thanks so much, Daniel!!! The process is a little obscure, but the result is sweet.



                • #9
                  Originally posted by Nick Cox
                  Thanks so much!
