Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging/Collapsing Multiple Variables of Similar Type

    Hello,
    I am working with a dataset from a survey. In the survey, participants were asked to answer questions based on the number of dogs in their household.
    The study was conducted online and the dataset output that I received lists for example, the sex of each of the respondents' dogs in vertical tables.

    e.g.
    Respondent number Sexdog1 Sexdog2 Sexdog3 Sexdog4
    1 male female male male
    2 Female male . .
    There are 2006 respondents, and each has the ability to list many characteristics of the dog (including sex), for up to a maximum 10 dogs;
    So, there are 2006 rows, and 10 columns across for one question in one table, for example.

    Is there any way to merge/collapse the data in STATA14 wherein I can see by respondent number, what was the most commonly listed sex of dog, in all of their dogs listed?
    So, merge Sexdog1, Sexdog2, Sexdog3 etc. into one variable for this particular respondent to show "the most common type of sex of the dog in respondent household 1 was ... "
    ?
    I have both continuous and categorical data (just thought I would show the categorical as the example)

    Thanks so much in advance!
    Danielle

  • #2
    What do you want the result to be for Respondent 2? There is one male and one female: which is the "most common?"

    Comment


    • #3
      Yes, very good question Clyde, thank you for responding!
      Some of the respondents, only have 1 dog, some have 2, some have 5 and very, very few have 10 dogs.

      I suppose the question is, on that basis, is it even useful for me to merge the variables??

      Maybe I have to leave them all as is, and use the <tab> or <summ> command to view details for descriptive stats?

      I am new to STATA so sorry for the overly simple questions,

      Danielle

      Comment


      • #4
        Well, my question didn't really raise a Stata issue. If you were working in another package, my counter-question would still apply. This comes up all the time on the forum, where somebody wants to calculate a variable that tells them "the most frequent value" of some variable (or something related to that) and has not specified what to do if there is a tie.

        Perhaps the most sensible thing in this case is to simply assign each respondent a count of males and a count of females? Or perhaps a three-way variable: more males than females, equal numbers, more females than males?

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte respondent_number str6(sexdog1 sexdog2) str4(sexdog3 sexdog4)
        1 "male"   "female" "male" "male"
        2 "female" "male"   ""    ""  
        end
        
        reshape long sexdog, i(respondent_number) j(dog_number)
        collapse (count) n_dogs_ = dog_number, by(respondent_number sexdog)
        drop if missing(sexdog)
        reshape wide n_dogs_, i(respondent_number) j(sexdog) string
        gen byte threeway = sign(n_dogs_male - n_dogs_female)
        label define threeway    1    "More Males" 0 "Equal" -1 "More Females"
        label values threeway threeway
        The code above begins by inputting a cleaned up version of your example data (and, in the future, please use -dataex- to post examples, as I have done), and then calculates for you the counts of males and females, and a three-way variable showing which, if either, is the majority. The original detail data on the dogs has been obliterated in the process. But if you want it back, you can always -merge- these results with your original data set.
        Last edited by Clyde Schechter; 20 Jun 2016, 12:38. Reason: Correct error in code: had the signs wrong in the label define command.

        Comment


        • #5
          I have a very similar problem where I have not found the solution yet.

          In my data (example from other post) I also have multiple entries for the same population ("entity") but also for different years ("year").

          1. With "egen tag" and "egen ndistinct" I can find out how many different "firms" each combination of "entity" and "year" (group?) has.
          ( I will have to do this not only for "firm" but also for "value" and other variables)

          2. Now I would like to know the frequencies for each value in "ndistinct" over all the combinations of "entity" and "year".
          (To get an idea of the variation within different variables and think about how to merge them in the most sensible way).

          3. Last I need to make sure there is only one observation per combinations of "entity" and "year" and therefore combine (merge, collapse, append...?) the values of other variables. If there is a most frequent value, choose that one, if there is not, choose the highest or first, or in some other cases the mean or sum (to be defined for each variable).

          I really struggle to find a way how to do that. I hope my example is clear.

          Original data
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input byte entity int year str1 firm float value
          1 2010 "A" 15
          1 2010 "A"  8
          1 2010 "B" 12
          1 2011 "B" 25
          1 2012 "B"  8
          2 2010 "A"  7
          2 2011 "A"  5
          2 2011 "A" 12
          2 2011 "C" 13
          2 2012 "A" 19
          2 2012 "B" 25
          2 2011 "B" 14
          2 2012 "C" 18
          2 2012 "D" 16
          end
          
          sort entity year
          list, sepby(entity year)
          1. Question
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input byte entity int year str1 firm float value
          1 2010 "A" 15
          1 2010 "A"  8
          1 2010 "B" 12
          1 2011 "B" 25
          1 2012 "B"  8
          2 2010 "A"  7
          2 2011 "A"  5
          2 2011 "A" 12
          2 2011 "C" 13
          2 2012 "A" 19
          2 2012 "B" 25
          2 2011 "B" 14
          2 2012 "C" 18
          2 2012 "D" 16
          end
          
          egen tag = tag(firm entity year)
          egen ndistinct = total(tag), by(entity year)
          
          sort entity year
          list, sepby(entity year)
          2. Question -> This is the table I would like to obtain
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input byte ndistinct freq
          1 3
          2 1
          3 1
          4 1
          end
          sort ndistinct freq
          list, sepby(ndistinct freq)
          3. Question?

          See also:
          https://www.statalist.org/forums/for...r-each-subject
          https://www.statalist.org/forums/for...iven-condition
          https://www.statalist.org/forums/for...cases-by-group

          Comment


          • #6
            It appears that this same question has been posted on 6 different threads. It is answered at https://www.statalist.org/forums/for...r-each-subject.

            Seriously? 6 threads? That's abusive. Please stop it. Find one appropriate thread for your topic and post it there.

            Comment

            Working...
            X