Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Convert dummy variables into a categorical variable

    Hello,

    I am new with Stata and I cannot find a solution to convert these variables into a unique categorical variable "Nationality" :
    Here is an example of the dataset:
    USA UK France Germany Spain Australia Canada Russia
    No No No Yes No No Yes No
    No Yes No No No No No No
    Yes No No No No No No No
    Yes No No No No No No No
    No Yes No No No No No No
    No No No No Yes No No No
    No No No No No Yes No No
    Does one of you have any suggestion on how to convert these variables into categorical variable as follow:

    Nationality
    1-USA
    2-UK
    3- France
    4- Germany
    5- Spain
    6- Australia
    7- Canada
    8- Russia


  • #2
    Not possible without additional information. Look at the first row of your table: both Germany and Canada are specified yes. In order to create a single nationality variable out of this, you would have to specify some sort of hierarchy among the different nations to determine which one to use in the case of multiple nationalities. Or you could allow a category "Multiple" nationalities in your variable. If you clarify how you want to handle this situation, coding will not be a problem.

    When posting back, please show an actual Stata data example, not an HTML table, by using the -dataex- command. Run -ssc install dataex- to get the command and then run -help dataex- to read the simple instructions for its use. By using -dataex- you will enable those who are inclined to help out to make a complete and faithful replica of your Stata example with a simple copy/paste operation. HTML tables of the kind you show are sometimes difficult to bring in to Stata. And even when they go in easily, they do not show, for an example relevant to your situation, whether the variables are string variables or value-labeled numeric variables. (The code could be different in these two cases.)

    Comment


    • #3
      Hi,

      Thank you for the advice.
      Here is the Stata data example using -dataex-.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int Numberofnights float(USA UK France Russia Australia Germany Spain Canada)
      5 0 0 0 0 1 0 0 0
      5 0 0 0 0 1 0 0 0
      2 0 0 0 0 0 0 1 0
      5 0 0 0 0 0 0 1 0
      4 0 0 0 0 1 0 0 0
      end
      label values USA USA2
      label def USA2 0 "No", modify
      label values UK UK2
      label def UK2 0 "No", modify
      label values France France2
      label def France2 0 "No", modify
      label values Russia Russia2
      label def Russia2 0 "No", modify
      label values Australia Australia2
      label def Australia2 0 "No", modify
      label def Australia2 1 "Yes", modify
      label values Germany Germany2
      label def Germany2 0 "No", modify
      label values Spain Spain2
      label def Spain2 0 "No", modify
      label def Spain2 1 "Yes", modify
      label values Canada Canada2
      label def Canada2 0 "No", modify

      Comment


      • #4
        This should do it:
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int Numberofnights float(USA UK France Russia Australia Germany Spain Canada)
        5 0 0 0 0 1 0 0 0
        5 0 0 0 0 1 0 0 0
        2 0 0 0 0 0 0 1 0
        5 0 0 0 0 0 0 1 0
        4 0 0 0 0 1 0 0 0
        end
        label values USA USA2
        label def USA2 0 "No", modify
        label values UK UK2
        label def UK2 0 "No", modify
        label values France France2
        label def France2 0 "No", modify
        label values Russia Russia2
        label def Russia2 0 "No", modify
        label values Australia Australia2
        label def Australia2 0 "No", modify
        label def Australia2 1 "Yes", modify
        label values Germany Germany2
        label def Germany2 0 "No", modify
        label values Spain Spain2
        label def Spain2 0 "No", modify
        label def Spain2 1 "Yes", modify
        label values Canada Canada2
        label def Canada2 0 "No", modify
        
        //    VERIFY ONLY ONE NATIONALITY SELECTED PER OBSERVATION
        egen n_count = rowtotal(USA-Canada)
        assert n_count <= 1
        drop n_count
        
        //    COMBINE TO A SINGLE NATIONALITY VARIABLE
        gen nationality = .
        local i = 1
        foreach v of varlist USA-Canada {
            replace nationality = `i' if `v' == 1
            label define nationality `i' "`v'", add
            local ++i
        }
        label values nationality nationality
        list, noobs clean

        Comment


        • #5
          Thank you so much. It was exactly what I was looking for !

          Comment


          • #6
            Dear community,
            I am currently dealing with a somewhat problematic dataset (at least for a beginner like me). And one of my problems is exactly the one that Lorenna had, but somehow the command does not help in the right way.

            So in my example they asked in the questionnaire about the income in categories: "What is your daily wage?"

            V6: more than 150 (1: quoted, 0: not quoted)
            V7: 101-150 (1: quoted, 0: not quoted)
            V8: 51-100 (1: quoted, 0: not quoted)
            V9: 1-50 (1: quoted, 0: not quoted)
            V10: no payment (1: quoted, 0: not quoted)

            Now I need to combine these 5 variables (V6-V10) into one variable (V11: "Daily wage") in the same way like the example above. Now there are cases where people have selected multiple variables to this question. So I want to select the upper variable selected from the person and discard all the others (with the theoretical assumption that the first statement is closest to the true value)? So for example someone who choose "more than 150" and "101-150", so we assume that the "more than 150" is closer to the true value.
            When I examine the command from Clyde Schechter the number of obs go down from 234 (in the old variables V6-V10) to 159 in the new variable (V11).
            Also the people, who choose the 1 (quoted) in V6 for example should be the same amount of people like the first category of V11, but that isn't like that also.
            What could be the mistake I made?
            This is what I let STATA do:

            gen V11 = .
            local i = 1
            foreach v of varlist V6-V10 {
            replace V11 = `i' if `v' == 1
            label define V11 `i' "`v'", add
            local ++i
            }
            label values V11 V11
            list, noobs clean

            The second problem I have is that I need to combine dichotomous variables to one categorical variable where a multiple responses was useful.
            Example: The question in the questionnaire was: "Which rooms do you use?" and this is what I see in the dataset:
            V1: Kitchen (1: quoted, 0: not quoted)
            V2: WC (1: quoted, 0: not quoted)
            V3: living room (1: quoted, 0: not quoted)
            V4: corridor (1: quoted, 0: not quoted).

            Now I want to make out of these 4 variables (V1-V4) one categorical variable (let's call it V5 "used rooms") with the values ​​1 = kitchen, 2 = WC, 3 = living room and 4 = hallway. So that every "1" in one variable is a category in the new variable V5. But I don't want to overwrite the responses the person made and pick just one of them. I want to have all the responses in the categories. Is this even possible??

            Does anyone have any suggestions? I would be very happy about any help!

            Comment


            • #7
              Oh, I'm sorry. I solved the first problem (about the difference in the number of obs) myself. It comes from the merging of the variables, because STATA only considers one of the responses and "delete" the others. But now it takes the last made response - how can I manage it so STATA takes the first made response?

              Comment


              • #8
                So if someone said "kitchen" (V1=1) and "living room" (V3=1), what do you want V5 to look like?
                Stata/MP 14.1 (64-bit x86-64)
                Revision 19 May 2016
                Win 8.1

                Comment


                • #9
                  So I want, that someone who says "kitchen" (V1=1) is counted by kitchen but also by living room.
                  So that I have one variable V5 with 4 categories.

                  Comment


                  • #10
                    I’m just not understanding this. A single observation has v1=1 & v3=1. You want V5 that takes values 1,2,3, or 4. So what value should this observation have for v5? Or do you want v5 to look like this “1010”.
                    Stata/MP 14.1 (64-bit x86-64)
                    Revision 19 May 2016
                    Win 8.1

                    Comment


                    • #11
                      I'm giving this excellent post a bump. To ask: is there an instal that can do this automatically? It is a common task. And, it is essentially reversing the generate option following a tabulate command.

                      Thoughts?


                      Originally posted by Clyde Schechter View Post
                      This should do it:
                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input int Numberofnights float(USA UK France Russia Australia Germany Spain Canada)
                      5 0 0 0 0 1 0 0 0
                      5 0 0 0 0 1 0 0 0
                      2 0 0 0 0 0 0 1 0
                      5 0 0 0 0 0 0 1 0
                      4 0 0 0 0 1 0 0 0
                      end
                      label values USA USA2
                      label def USA2 0 "No", modify
                      label values UK UK2
                      label def UK2 0 "No", modify
                      label values France France2
                      label def France2 0 "No", modify
                      label values Russia Russia2
                      label def Russia2 0 "No", modify
                      label values Australia Australia2
                      label def Australia2 0 "No", modify
                      label def Australia2 1 "Yes", modify
                      label values Germany Germany2
                      label def Germany2 0 "No", modify
                      label values Spain Spain2
                      label def Spain2 0 "No", modify
                      label def Spain2 1 "Yes", modify
                      label values Canada Canada2
                      label def Canada2 0 "No", modify
                      
                      // VERIFY ONLY ONE NATIONALITY SELECTED PER OBSERVATION
                      egen n_count = rowtotal(USA-Canada)
                      assert n_count <= 1
                      drop n_count
                      
                      // COMBINE TO A SINGLE NATIONALITY VARIABLE
                      gen nationality = .
                      local i = 1
                      foreach v of varlist USA-Canada {
                      replace nationality = `i' if `v' == 1
                      label define nationality `i' "`v'", add
                      local ++i
                      }
                      label values nationality nationality
                      list, noobs clean

                      Comment


                      • #12
                        Another techniqe that may be useful on this (from Stata tech support folks)...

                        - The egen group() command can reverse this operation but it will not check for
                        rows with all zeros or rows with multiple entries:
                        - Note, it will assign new values to the categories so that they will not match
                        the original values.


                        Code:
                        . sysuse auto, clear
                        (1978 automobile data)
                        
                        . keep rep
                        
                        . tab rep78, gen(reps)
                        
                             Repair |
                        record 1978 |      Freq.     Percent        Cum.
                        ------------+-----------------------------------
                                  1 |          2        2.90        2.90
                                  2 |          8       11.59       14.49
                                  3 |         30       43.48       57.97
                                  4 |         18       26.09       84.06
                                  5 |         11       15.94      100.00
                        ------------+-----------------------------------
                              Total |         69      100.00
                        
                        . egen x = group(reps*)
                        (5 missing values generated)
                        
                        . list in 1/10 , clean
                        
                               rep78   reps1   reps2   reps3   reps4   reps5   x 
                          1.       3       0       0       1       0       0   3 
                          2.       3       0       0       1       0       0   3 
                          3.       .       .       .       .       .       .   . 
                          4.       3       0       0       1       0       0   3 
                          5.       4       0       0       0       1       0   2 
                          6.       3       0       0       1       0       0   3 
                          7.       .       .       .       .       .       .   . 
                          8.       3       0       0       1       0

                        Comment

                        Working...
                        X