  • Help with Stata Categorization of Factor Variables: Data Management Help

    I'm not very familiar with Stata's data management commands, and this problem has been really tough to figure out from the manual. Below I have included a data set of responses that college students submitted to a questionnaire. You'll see various meta columns denoting how they took the survey, hours of computer use, etc. I've managed pretty well so far, but I'm having trouble with the Undergraduate Major component. This was maybe a flaw in my survey design, but I let the participants type in text to denote their Major. So you'll see numerous different ways students wrote "Political Science", for example POSCI or POSC, and similarly students abbreviated the titles of almost all the majors. That makes it a problem to simply run encode major, gen(newvar), which is how I would naturally want to go about it. What is the easiest way to fix closely related words? Something like find and replace would almost work, but is there something more effective?

    My second problem comes after I fix the names and have all majors unified and spelled consistently: I need to group them further. My sample size isn't nearly large enough to test the significance of individual majors against each other, so I want to group them into larger categories, for example "Humanities" to include the following majors: "etc, etc". Or better yet, is there a way to have Stata examine which groups might belong together based on the outcome variable? I'm thinking of something between an ANOVA and a factor analysis to give a more objective idea of which majors "move together".

    Terribly sorry about such a convoluted question but any help is very much appreciated.

    Kind regards,
    Ali





    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double osdi int byr str53 mjr str13 browser str15 os str9 BrowserMetaInfoScreenResolut str3 idrps str11 cpu float age long female byte col_age
    11.3636363636363   50 "Theater "                                      "Safari iPhone" "iPhone"          "375x667"   ""    "6 - 8 hours" 50 1 0
    14.5833333333333 1986 "Political Science"                             "Chrome"        "Windows NT 10.0" "1707x960"  ""    "6 - 8 hours" 31 1 0
    4.16666666666666 1990 "Biomedical engineering "                       "Chrome"        "Android 7.0"     "360x740"   ""    "6 - 8 hours" 27 0 1
    45.8333333333333 1983 "music"                                         "Chrome"        "Macintosh"       "1280x800"  ""    "6 - 8 hours" 34 1 0
    60.4166666666666 1990 "Animation and Digital Arts"                    "Chrome"        "Macintosh"       "2560x1440" ""    "6 - 8 hours" 27 1 1
    6.81818181818181 1991 "Environmental Studies"                         "Chrome"        "Windows NT 10.0" "1536x864"  ""    "4 - 6 hours" 26 1 1
                   0 1989 "Engineering"                                   "Chrome"        "Macintosh"       "1680x1050" ""    "6 - 8 hours" 28 1 1
    14.5833333333333   44 "Political Science"                             "Safari"        "Macintosh"       "1440x900"  ""    "4 - 6 hours" 44 1 0
    4.16666666666666 1987 "Theatre"                                       "Safari iPhone" "iPhone"          "414x736"   ""    "6 - 8 hours" 30 1 1
    27.7777777777777 1990 "Molecular biology "                            "Safari iPhone" "iPhone"          "320x568"   ""    "6 - 8 hours" 27 1 1
                62.5 1990 "Communications "                               "Safari iPhone" "iPhone"          "414x736"   ""    "6 - 8 hours" 27 1 1
                6.25 1980 "Computer science "                             "Safari iPhone" "iPhone"          "375x667"   "No"  "4 - 6 hours" 37 1 0
                  20 1990 "Anthropology and Spanish"                      "Chrome"        "Windows NT 10.0" "1920x1080" "No"  "8 - 9 hours" 27 1 1
                6.25 1995 "Accounting"                                    "Safari iPhone" "iPhone"          "320x568"   "No"  "8 - 9 hours" 22 1 1
    27.7777777777777 1996 "Human Biology"                                 "Chrome iPhone" "iPhone"          "375x667"   "Yes" "4 - 6 hours" 21 1 1
                  20 1999 "Psychology/Spanish"                            "Safari iPhone" "iPhone"          "375x667"   "Yes" "2 - 3 hours" 18 0 1
    6.81818181818181 1996 "Sociology"                                     "Chrome iPhone" "iPhone"          "375x667"   "No"  "4 - 6 hours" 21 0 1
    13.6363636363636 1996 "Human Biology and Spanish"                     "Chrome"        "Macintosh"       "1280x800"  "No"  "2 - 3 hours" 21 1 1
    20.4545454545454 1985 "Media Studies"                                 "Firefox"       "Macintosh"       "1440x900"  "No"  "6 - 8 hours" 32 1 0
    4.16666666666666 1996 "POSC, LHC, SPAN"                               "Chrome"        "Windows NT 10.0" "1366x768"  "Yes" "2 - 3 hours" 21 1 1
    29.5454545454545 1996 "Law, History, and Culture; Spanish"            "Safari iPhone" "iPhone"          "375x667"   "No"  "4 - 6 hours" 21 1 1
    38.6363636363636 1997 "International Relations and Spanish "          "Chrome"        "Windows NT 10.0" "1366x768"  "Yes" "6 - 8 hours" 20 1 1
    11.3636363636363 1996 "POSC"                                          "Safari iPhone" "iPhone"          "320x568"   "Yes" "8 - 9 hours" 21 1 1
                  45 1998 "Anthropology"                                  "Safari iPhone" "iPhone"          "320x568"   "Yes" "4 - 6 hours" 19 1 1
    4.16666666666666 1995 "human biology"                                 "Chrome"        "Macintosh"       "1280x800"  "No"  "4 - 6 hours" 22 0 1
    10.4166666666666 1991 "Political Science & Communication "            "Safari iPhone" "iPhone"          "414x736"   "Yes" "10 hours"    26 0 1
    29.5454545454545 1996 "Philosophy"                                    "Chrome"        "Android 6.0"     "360x640"   "Yes" "6 - 8 hours" 21 0 1
    10.4166666666666 1994 "Anthropology"                                  "Safari iPhone" "iPhone"          "375x667"   "No"  "10 hours"    23 1 1
    83.3333333333333 1997 "English and Anthro"                            "Safari iPhone" "iPhone"          "375x667"   "Yes" "6 - 8 hours" 20 1 1
    8.33333333333333 1986 "sociology "                                    "Safari iPhone" "iPhone"          "320x568"   "Yes" "8 - 9 hours" 31 1 0
                6.25 1993 "Electronics and communications"                "Chrome iPhone" "iPhone"          "375x667"   "Yes" "10 hours"    24 1 1
                6.25 1982 "Music composition"                             "Safari iPad"   "iPad"            "768x1024"  "No"  "4 - 6 hours" 35 1 0
    16.6666666666666 1995 "Sociology"                                     "MSIE"          "Windows NT 10.0" "1680x1050" "Yes" "8 - 9 hours" 22 1 1
                12.5 1981 "Sociology"                                     "Chrome"        "Macintosh"       "1920x1080" "Yes" "10 hours"    36 1 0
    13.6363636363636 1996 "Political science"                             "Safari iPhone" "iPhone"          "375x667"   "No"  "4 - 6 hours" 21 1 1
                  30 1987 "political science"                             "Chrome"        "Windows NT 10.0" "1536x864"  "Yes" "10 hours"    30 1 1
                37.5 1996 "Sociology/Social Psychology"                   "Safari"        "Macintosh"       "1440x900"  "Yes" "6 - 8 hours" 21 1 1
               18.75 1994 "human biology"                                 "Safari"        "Macintosh"       "1280x800"  "No"  "8 - 9 hours" 23 0 1
    91.6666666666666 1996 "Anthropology"                                  "Safari iPhone" "iPhone"          "375x667"   "Yes" "10 hours"    21 1 1
                  15 1983 "International Development Studies"             "Chrome"        "Macintosh"       "1280x800"  "No"  "4 - 6 hours" 34 0 0
                  25 1986 "Women's Studies"                               "Chrome"        "Macintosh"       "1366x768"  "Yes" "6 - 8 hours" 31 1 0
                77.5 1992 "Communication"                                 "Chrome"        "Macintosh"       "1280x800"  "Yes" "6 - 8 hours" 25 0 1
    14.5833333333333 1996 "Sociology"                                     "Chrome"        "Macintosh"       "1920x1080" "No"  "4 - 6 hours" 21 1 1
    33.3333333333333 1995 "Dramatic Arts (Acting)"                        "Safari iPhone" "iPhone"          "375x667"   "Yes" "8 - 9 hours" 22 0 1
    20.4545454545454 1995 "Psychology / neuroscience"                     "Safari iPhone" "iPhone"          "414x736"   "Yes" "10 hours"    22 1 1
    8.33333333333333 1980 "Psychology "                                   "Safari iPhone" "iPhone"          "375x667"   "No"  "4 - 6 hours" 37 1 0
                6.25 1985 "Creative writing "                             "Safari iPhone" "iPhone"          "320x568"   "No"  "4 - 6 hours" 32 1 0
                  50 1992 "Biology"                                       "Chrome"        "Macintosh"       "1280x800"  "Yes" "2 - 3 hours" 25 1 1
               18.75 1996 "BFA Acting"                                    "Safari iPhone" "iPhone"          "375x667"   "Yes" "4 - 6 hours" 21 0 1
               18.75 1980 "engineering"                                   "Chrome"        "Windows NT 10.0" "1280x720"  "Yes" "8 - 9 hours" 37 1 0
    8.33333333333333 1988 "Neuroscience "                                 "Safari iPhone" "iPhone"          "375x667"   "Yes" "4 - 6 hours" 29 1 1
                  25 1996 "dramatic arts"                                 "Safari iPhone" "iPhone"          "375x667"   "Yes" "6 - 8 hours" 21 1 1
    16.6666666666666 1988 "Computer science"                              "Chrome"        "Windows NT 6.1"  "1920x1080" "Yes" "10 hours"    29 0 1
    
    end
    label values female female
    label def female 0 "Male", modify
    label def female 1 "Female", modify

  • #2
    "This was maybe a flaw in my survey design, but I let the participants type in text to denote their Major."
    Yes, that was a serious mistake and it will cost you hours or days of work now to deal with it.

    If this is your entire data set, then you may want to handle it with a long series of -replace mjr = "x" if mjr == "y"- commands. But if your data set is appreciably larger, that will be impractical. (It's already at least borderline impractical at this size.) Assuming that the number of distinct majors is somewhat limited, say no more than 150, I would do this:

    Code:
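    * Save your full data set first: -keep- discards every other variable in memory.
    keep mjr
    * Reduce to one row per distinct spelling, then sort for easier editing.
    duplicates drop
    sort mjr
    save majors, replace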
    This creates a new data file with nothing in it but the listed majors. Now, open that data set and open Stata's Data Editor. In the second column, start typing in the "correct" name of the major--the name you want to use for it. For example you might want to go with Sociology for both Sociology and Sociology/Social Psychology. Pick either the British or American spelling for Theater/Theatre and put it in the second column next to each of those. Go through the data set one-by-one until every value of major is associated with a (possibly) new value in the second column. Rename that second column to mjr_cleaned and save the majors data set, overwriting the original majors.dta.
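
    A minimal sketch, not part of the recipe above: many rows in such a list differ only by capitalization or stray blanks, so it may help to see how far a mechanical cleanup gets before typing anything by hand. The six values are taken from the example data in #1; mjr_norm is a hypothetical variable name.

    ```stata
    * Run as one block in a do-file.
    clear
    input str30 mjr
    "Sociology"
    "sociology "
    "Political science"
    "political science"
    "Theater "
    "Theatre"
    end

    * Trim leading/trailing blanks, collapse internal runs of blanks, lowercase.
    * (Older Stata releases spell these trim(), itrim(), and lower().)
    generate mjr_norm = strlower(stritrim(strtrim(mjr)))

    * Compare the distinct raw spellings with the distinct normalized ones.
    tabulate mjr
    tabulate mjr_norm
    ```

    Here the six raw spellings collapse to four; "theater" and "theatre" still need a human decision. If you adopt this, build the list of majors from the normalized variable and merge on that same variable.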

    Now re-load your original data set into memory and run:

    Code:
    merge m:1 mjr using majors, assert(match) nogenerate
    Now you can use your mjr_cleaned variable for any purpose you would have wanted to use mjr for.
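
    The whole round trip can be sketched on a toy example. This is an illustration under assumptions, not the actual files: the lookup table is typed with -input- instead of the Data Editor, the variables id and major are made up, and a tempfile stands in for majors.dta. Note that mjr_cleaned may repeat as often as you like; only mjr has to be unique in the lookup file.

    ```stata
    * Run as one block in a do-file so the tempfile macro stays defined.

    * 1. Toy lookup table: one row per raw spelling, cleaned name alongside.
    clear
    input str30 mjr str20 mjr_cleaned
    "POSC"              "Political Science"
    "Political Science" "Political Science"
    "political science" "Political Science"
    "Theater "          "Theater"
    "Theatre"           "Theater"
    end
    isid mjr                      // errors out if any raw spelling appears twice
    tempfile crosswalk
    save `crosswalk'

    * 2. Toy survey data: raw spellings may repeat freely here.
    clear
    input str30 mjr
    "POSC"
    "Theatre"
    "Political Science"
    "Political Science"
    "Theater "
    "political science"
    end
    generate id = _n

    * 3. Attach the cleaned name to every respondent.
    merge m:1 mjr using `crosswalk', assert(match) nogenerate

    * 4. The cleaned variable can now be encoded as originally intended.
    encode mjr_cleaned, generate(major)
    tabulate major
    ```

    -assert(match)- stops with an error if any spelling in either file lacks a partner, which is a cheap check that the lookup covers every value.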


    • #3
      I attempted it, but it gives me the error "variable mjr does not uniquely identify observations in the using data".


      • #4
        Never mind, I was actually clearing the corrected majors.dta. When I used append followed by the command above, it seems to have solved it. Thanks again for the help.


        • #5
          I don't see how -append- could have solved your problem. It won't give you that error message, but it also won't produce useful results.

          The error message "variable mjr does not uniquely identify observations in the using data" means that you have more than one observation in majors.dta with the same value for variable mjr. That means you did something wrong in creating it. Did you forget the -keep mjr- or -duplicates drop-? Or did you perhaps mistakenly change a value of mjr instead of creating a value in the second column while editing?
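
          One way to find the offending rows, sketched on a few illustrative rows rather than the real majors.dta (dup is a hypothetical variable name):

          ```stata
          clear
          input str30 mjr str15 mjr_cleaned
          "Accounting"    "Biz"
          "Accounting"    "Humanities"
          "Anthropology"  "Humanities"
          "Anthropology " "Humanities"
          "Biology"       "Science"
          end

          * Summarize how many copies of each value of mjr exist.
          duplicates report mjr

          * Flag and list the values of mjr that occur more than once.
          duplicates tag mjr, generate(dup)
          list mjr mjr_cleaned if dup > 0, sepby(mjr)
          ```

          Only the two "Accounting" rows are flagged. "Anthropology" and "Anthropology " differ by a trailing blank, so Stata treats them as distinct values, and repeated values in the second column are harmless. With the real file, start from -use majors, clear- instead of the -input- block.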


          • #6
            Yeah, it actually didn't. It turned out I had doubles in the data set, so when I ran it again it gave me the same error as above. In creating the second column of majors, there were a lot of similar majors that I grouped into broad categories such as "Humanities". So mjr_cleaned has a lot of repeated cells of "Humanities", where I put almost all the liberal arts. Theater, for example, I grouped as a performance art, as with dramatic arts. So am I not supposed to repeat the same name in the second column more than once?


            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str53 mjr str11 mjr_cleaned
            " Cinematic Arts Film and Television Production"        "Media"      
            "Accounting"                                            "Biz"        
            "Accounting"                                            "Performance"
            "Accounting"                                            "Humanities"
            "Acting "                                               "Media"      
            "African-Ethnic Studies"                                "Humanities"
            "Animation and Digital Arts"                            "Humanities"
            "Anthropology"                                          "Double"    
            "Anthropology"                                          "Math"      
            "Anthropology "                                         "Art"        
            "Anthropology "                                         "Performance"
            "Anthropology and Spanish"                              "Performance"
            "Applied Mathematics "                                  "Art"        
            "Applied mathematics"                                   "Engineer"  
            "Art & Design"                                          "Media"      
            "Art History"                                           "Science"    
            "Art History"                                           "Biz"        
            "Art history "                                          "Science"    
            "BA theater"                                            "Science"    
            "BFA Acting"                                            "Science"    
            "BFA Design"                                            "Science"    
            "BME"                                                   "Science"    
            "BRDJ"                                                  "Science"    
            "BS Physical Sciences"                                  "Science"    
            "BUAD"                                                  "Media"      
            "Biochem"                                               "Biz"        
            "Biochemistry"                                          "Double"    
            "Biochemistry"                                          "Biz"        
            "Biochemistry "                                         "Biz"        
            "Biological Science"                                    "CS"        
            "Biological science "                                   "Science"    
            "Biology"                                               "Humanities"
            "Biomedical Engineering"                                "Media"      
            "Biomedical engineering "                               "Media"      
            "Biotechnology"                                         "Media"      
            "Broadcast & Digital Journalism/Cinema & Media Studies" "Media"      
            "Business"                                              "Media"      
            "Business"                                              "Science"    
            "Business"                                              "Humanities"
            "Business & Chinese"                                    "Humanities"
            "Business Administration"                               "Double"    
            "Business Administration"                               "Humanities"
            "Business Administration "                              "Humanities"
            "Business Administtration"                              "Humanities"
            "Business Economics"                                    "CS"        
            "Business Economics with Accounting emphasis"           "CS"        
            "Business administration "                              "CS"        
            "CS"                                                    "CS"        
            "Chemistry"                                             "Humanities"
            "Chinese Literature"                                    "Humanities"
            "Cinema"                                                "Art"        
            "Cinema Studies"                                        "Art"        
            "Cinema and Media Studies"                              "Humanities"
            "Cinema and Media Studies "                             "Humanities"
            "Cinema and media studies"                              "Double"    
            "Cognitive Science"                                     "Double"    
            "Communication"                                         "Humanities"
            "Communication"                                         "Double"    
            "Communication "                                        "Double"    
            "Communication Studies"                                 "Double"    
            "Communication Studies "                                "Double"    
            "Communication and English"                             "Science"    
            "Communications"                                        "Science"    
            "Communications "                                       "Humanities"
            "Comparative literature"                                "Humanities"
            "Comparative literature"                                "Double"    
            "Computer Engineering"                                  "Double"    
            "Computer Science"                                      "Humanities"
            "Computer science"                                      "Double"    
            "Computer science "                                     "Humanities"
            "Creative Writing"                                      "Media"      
            "Creative Writing"                                      "Performance"
            "Creative writing "                                     "Art"        
            "Cultural Anthropology"                                 "Art"        
            "Design"                                                "Science"    
            "Dramatic Arts (Acting)"                                "Science"    
            "East Asian Languages and Cultures "                    "Science"    
            "East Asian Studies"                                    "Humanities"
            "East Asian Studies "                                   "Humanities"
            "Econ/Math"                                             "Double"    
            "Econ/math & English "                                  "Science"    
            "Economics"                                             "Double"    
            "Economics"                                             "Science"    
            "Economics "                                            "Science"    
            "Economics, Mathematics"                                "Humanities"
            "Economics/Mathematics"                                 "Science"    
            "Economics/mathematics"                                 "Biz"        
            "Electronics and communications"                        "Humanities"
            "Engineering"                                           "Humanities"
            "Engineering "                                          "Double"    
            "English"                                               "Humanities"
            "English "                                              "Double"    
            "English "                                              "Humanities"
            "English - American Studies"                            "Humanities"
            "English and Anthro"                                    "Science"    
            "English and History"                                   "Humanities"
            "English literature "                                   "Humanities"
            "English/Molecular Biology"                             "Humanities"
            "Environmental Studies"                                 "Double"    
            "Environmental Studies"                                 "Humanities"
            end
            Last edited by Ali Sheikhpour; 14 Aug 2017, 22:28. Reason: I ran the "drop duplicates" then sort. Then created the new column saving the data set again over the majors.dta


            • #7
              You were right. It seems that on the original drop of variables something strange happened and mismatched the variables. I redid the process and the command worked as expected. That is definitely the last time I make that questionnaire mistake. Thanks again.
