Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Combining dichotomous variables to categorical ones


    Dear community, I am currently dealing with a somewhat problematic dataset (at least for a beginner like me).
    So there are many dichotomous variables in this dataset. I'll give you two examples: Example 1: The question in the questionnaire was: "Which rooms do you use?" and this is what I see in the dataset: V1: Kitchen (1: quoted, 0: not quoted) V2: WC (1: quoted, 0: not quoted) V3: living room (1: quoted, 0: not quoted) V4: corridor (1: quoted, 0: not quoted). Now I want to make out of these 4 variables (V1-V4) one categorical variable (let's call it V5 "used rooms") with the values ​​1 = kitchen, 2 = WC, 3 = living room and 4 = hallway. So that every "1" in one variable is a category in the new variable V5.
    So my first question is: With which command is it possible to merge these variables in the way I described above? Example 2: They asked in the questionnaire about the income in categories: "What is your daily wage?" V6: more than 150 (1: quoted, 0: not quoted) V7: 101-150 (1: quoted, 0: not quoted) V8: 51-100 (1: quoted, 0: not quoted) V9: 1-50 (1: quoted, 0: not quoted) V10: no payment (1: quoted, 0: not quoted) Now I need to combine these 5 variables (V6-V10) into one variable (V11: "Daily wage") in the same way like the example above. But now there is the problem that there are cases where people have selected multiple variables to this question. So my second question is: Is there a command to tell STATA, for example, to select the upper variable selected from the person and discard all the others (with the theoretical assumption that the first statement is closest to the true value)? So for example someone who choose "more than 150" and "101-150", so we assume that the "more than 150" is closer to the true value. Does anyone have a solution to this dilemma or can help me somehow? I am happy about any considerations! THANK YOU!!!

  • #2
    The following previous post addresses the problem you encounter in Example 1:

    https://www.statalist.org/forums/for...rical-variable

    Re-purposing Clyde Schechter's code for your situation yields the following code:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(v1 v2 v3 v4)
    1 0 0 0
    1 0 0 0
    0 0 1 0
    0 0 1 0
    1 0 0 0
    end
    
    //    VERIFY ONLY ONE OPTION SELECTED PER OBSERVATION
    egen n_count = rowtotal(v1-v4)
    assert n_count <= 1
    drop n_count
    
    //    COMBINE TO A SINGLE VARIABLE
    gen v5 = .
    local i = 1
    foreach v of varlist v1-v4 {
        replace v5 = `i' if `v' == 1
        local ++i
    }
    list, noobs clean

    Stata/MP 14.1 (64-bit x86-64)
    Revision 19 May 2016
    Win 8.1

    Comment


    • #3
      Actually, the above code works perfectly for your Example 2 as well because it loops through the variables and replaces earlier value with later values if there is more than one. You will need to remove the section to verify only one option selected.

      EXAMPLE 2
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(v6 v7 v8 v9)
      1 1 0 0
      1 0 0 0
      0 0 1 0
      0 0 1 0
      1 0 0 1
      end
      
      //    COMBINE TO A SINGLE VARIABLE
      gen v10 = .
      local i = 1
      foreach v of varlist v6-v9 {
          replace v10 = `i' if `v' == 1
          local ++i
      }
      list, noobs clean

      If you have a situation where you have multiple responses in which you DON'T want later values recorded, DO NOT run this code. For example, in Example 1, if the respondent could mention more than one room, then if they select both kitchen and hallway, you will only see hallway in the new variable.
      Stata/MP 14.1 (64-bit x86-64)
      Revision 19 May 2016
      Win 8.1

      Comment


      • #4
        Hi all, I have a similar problem but this code didn't work for me. I am trying to create a new variable using a set of 17 dichotomous variables. In my questionnaire, the respondent could select multiple options. In my final categorical variable, I don't mind if they chose more than one option and so it seems pretty straightforward but my code with gen (), replace is not giving me the accurate numbers in the end and I understand this has to do with folks choosing more than one option. Is there a way to create a new categorical variable that reflects the 1s from each of these variables? The following is what my dataex output looks like :

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input byte(YL_Q3_01 YL_Q3_02 YL_Q3_04 YL_Q3_05 YL_Q3_06 YL_Q3_07 YL_Q3_08 YL_Q3_09 YL_Q3_10 YL_Q3_11 YL_Q3_12 YL_Q3_13 YL_Q3_14 YL_Q3_15 YL_Q3_16 YL_Q3_17)
        1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
        1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
        0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
        1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
        1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        end
        label values YL_Q3_01 LABV
        label values YL_Q3_02 LABV
        label values YL_Q3_04 LABV
        label values YL_Q3_05 LABV
        label values YL_Q3_06 LABV
        label values YL_Q3_07 LABV
        label values YL_Q3_08 LABV
        label values YL_Q3_09 LABV
        label values YL_Q3_10 LABV
        label values YL_Q3_11 LABV
        label values YL_Q3_12 LABV
        label values YL_Q3_13 LABV
        label values YL_Q3_14 LABV
        label values YL_Q3_15 LABV
        label values YL_Q3_16 LABV
        label values YL_Q3_17 LABV
        label def LABV 0 "No", modify
        label def LABV 1 "Yes", modify

        Comment


        • #5
          #4 doesn't explain your rules for your composite variable, nor does it show any of your code attempts. If you're just looking for any variable being 1 across observations, you can use egen's rowmax() function. If that's not the answer please give us more detail.

          Comment


          • #6
            Thanks so much for your response! My question in the survey was what kind of internet user are you and the respondent could select multiple options. My current code below doesn't give me the right composite variable I'm looking for because the dichotomous variable for YL_Q1_1 is 15,334 yes/1. When I run the following code, I get 10,123 for YL_Q1_1. I'm assuming this is because I'm replacing respondents as I go along if they chose more than one option:


            gen engaging_online = .
            replace engaging_online = 1 if YL_Q1_1 == 1
            replace engaging_online = 2 if YL_Q3_02 == 1
            replace engaging_online = 3 if YL_Q3_03 == 1
            replace engaging_online = 4 if YL_Q3_04 == 1
            replace engaging_online = 5 if YL_Q3_05 == 1
            replace engaging_online = 6 if YL_Q3_06 == 1
            replace engaging_online = 7 if YL_Q3_07 == 1
            replace engaging_online = 8 if YL_Q3_08 == 1
            replace engaging_online = 9 if YL_Q3_09 == 1
            replace engaging_online = 10 if YL_Q3_10 == 1
            replace engaging_online = 11 if YL_Q3_11 == 1
            replace engaging_online = 12 if YL_Q3_12 == 1
            replace engaging_online = 13 if YL_Q3_13 == 1
            replace engaging_online = 14 if YL_Q3_14 == 1
            replace engaging_online = 15 if YL_Q3_15 == 1
            replace engaging_online = 16 if YL_Q3_16 == 1
            replace engaging_online = 17 if YL_Q3_17 == 1

            I'm looking for a way to code it so I don't replace those respondents who chose more than one option. Thanks!

            Comment


            • #7
              Note first that your code returns the identifier of the last variable met with a value of 1. If you wanted that, the repetitive code could be slimmed down to

              Code:
              gen engaging_online = .
              forval j = 1/17 {
                  local J : di %02.0f `j'
                  replace engaging_online = `j' if YL_Q3_`J' == 1
              }

              You don't want that, but please note for what follows that I am guessing that

              Code:
              YL_Q1_1
              is a typo for

              Code:
              YL_Q3_01
              If not, then your code needs to be more complicated than what follows.

              There's an entire (but short) paper on this topic.

              SJ-7-4 dm0034 . . . Stata tip 52: Generating composite categorical variables
              . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
              Q4/07 SJ 7(4):582--583 (no commands)
              tip on how to generate categorical variables using
              tostring and egen, group()


              https://journals.sagepub.com/doi/epd...867X0800700407

              The top recommendation from that is

              Code:
              egen wanted = group(YL_Q3_??), label
              where the varlist would need to include

              Code:
              YL_Q1_1
              if as said that really is a name you're using.

              EDIT

              Looking back at #4 I see that you really do have YL_Q3_01 but that you don't have YL_Q3_03 at least in your data example. So #4 and #6 are inconsistent in detail. No matter. I think you need something a little more complicated than just above.


              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input byte(YL_Q3_01 YL_Q3_02 YL_Q3_04 YL_Q3_05 YL_Q3_06 YL_Q3_07 YL_Q3_08 YL_Q3_09 YL_Q3_10 YL_Q3_11 YL_Q3_12 YL_Q3_13 YL_Q3_14 YL_Q3_15 YL_Q3_16 YL_Q3_17)
              1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
              0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
              1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
              0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
              0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
              0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
              1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
              1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
              1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
              1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
              end
              label values YL_Q3_01 LABV
              label values YL_Q3_02 LABV
              label values YL_Q3_04 LABV
              label values YL_Q3_05 LABV
              label values YL_Q3_06 LABV
              label values YL_Q3_07 LABV
              label values YL_Q3_08 LABV
              label values YL_Q3_09 LABV
              label values YL_Q3_10 LABV
              label values YL_Q3_11 LABV
              label values YL_Q3_12 LABV
              label values YL_Q3_13 LABV
              label values YL_Q3_14 LABV
              label values YL_Q3_15 LABV
              label values YL_Q3_16 LABV
              label values YL_Q3_17 LABV
              label def LABV 0 "No", modify
              label def LABV 1 "Yes", modify
              
              gen wanted = "" 
              
              forval j = 1/17 {
                  local J : di %02.0f `j'
                  capture replace wanted = wanted + " `j'" if YL_Q3_`J' == 1 
              }
              
              replace wanted = trim(wanted)
              
              egen group = group(wanted), label 
              
              tab group, nola 
              
              tab group
              The point is that all the information is carried by the categories with value 1.

              Results:

              Code:
              . tab group, nola 
              
              group(wante |
                       d) |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                        1 |          5       50.00       50.00
                        2 |          1       10.00       60.00
                        3 |          2       20.00       80.00
                        4 |          1       10.00       90.00
                        5 |          1       10.00      100.00
              ------------+-----------------------------------
                    Total |         10      100.00
              
              . 
              . tab group 
              
              group(wante |
                       d) |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                        1 |          5       50.00       50.00
                     1 14 |          1       10.00       60.00
                       17 |          2       20.00       80.00
                        2 |          1       10.00       90.00
                     6 15 |          1       10.00      100.00
              ------------+-----------------------------------
                    Total |         10      100.00
              Last edited by Nick Cox; 19 Jan 2025, 04:35.

              Comment


              • #8
                Following on #7 it is naturally possible that

                1. You would prefer to see something less cryptic than say 1 14 or 6 15. That might mean using variable labels, or shorter versions of them.

                2. There is a more natural or at least more helpful order to the composite variables, say from most popular on. If you want something different, again it's important to spell out what that would be.

                Comment

                Working...
                X