Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Collapse across duplicates of strings

    With example data below, what is the most efficient way to produce a dataframe that contains just one row per unique entry of the four "place" variables? I have only used collapse with numeric variables before, so I'm not sure if this is even possible. Otherwise perhaps there is some way to flag duplicates and then drop based on that new flag variable?


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(id survey_wave) str9 place1 str13 place_clean1 str10 place2 str15 place_clean2
    1 1 "harvard"   "Harvard"       ""           ""               
    1 2 "harvard"   "Harvard"       ""           ""               
    1 3 "harvard"   "Harvard"       ""           ""               
    1 4 ""          ""              ""           ""               
    1 5 "aiello"    "Aiello's"      "ming"       "Ming Restaurant"
    2 1 "uchicago"  "U of Chicago"  ""           ""               
    2 3 "uchicago"  "U of Chicago"  ""           ""               
    2 4 "uchicago"  "U of Chicago"  ""           ""               
    2 4 "uchicago"  "U of Chicago"  ""           ""               
    2 5 "uchicago"  "U of Chicago"  ""           ""               
    3 1 "aiellos"   "Aiello's"      ""           ""               
    3 1 "aiello"    "Aiello's"      ""           ""               
    3 2 "none"      ""              ""           ""               
    3 2 "our place" "Our Place"     "sheetz"     "Sheetz"         
    3 3 "our place" "Our Place"     "sheetz"     "Sheetz"         
    3 5 "our place" "Our Place"     ""           ""               
    4 1 "uchicago"  "U of Chicago"  "jiffy lube" "Jiffy Lube"     
    4 1 "aiello"    "Aiello's"      "jiffy lube" "Jiffy Lube"     
    4 2 "truck co"  "Truck Company" "jiffy lube" "Jiffy Lube"     
    4 2 "truck co"  "Truck Company" "jiffy lube" "Jiffy Lube"     
    4 3 "truck co"  "Truck Company" ""           ""               
    end

  • #2
    Even if you look to remove duplicates across place*, each observation is still unique (literally; also distinct) because of the first two identifier variables.

    So, I can't follow what you want here. Perhaps you can give a data example of what the result should be.

    Comment


    • #3
      Sorry, my use of the word "across" conveyed something different than what I intended.

      I actually suppose I mean within each variable, not across. So within place1, the resulting unique values would be: harvard, aiello, uchicago, aiellos, none, our place, truck co. And repeat, separately, for each place* variable.

      Comment


      • #4
        Thanks for the extra detail. Sorry, but I still have no clear idea of what you want.

        Comment


        • #5
          Something like?

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input byte(id survey_wave) str9 place1 str13 place_clean1 str10 place2 str15 place_clean2
          1 1 "harvard"   "Harvard"       ""           ""              
          1 2 "harvard"   "Harvard"       ""           ""              
          1 3 "harvard"   "Harvard"       ""           ""              
          1 4 ""          ""              ""           ""              
          1 5 "aiello"    "Aiello's"      "ming"       "Ming Restaurant"
          2 1 "uchicago"  "U of Chicago"  ""           ""              
          2 3 "uchicago"  "U of Chicago"  ""           ""              
          2 4 "uchicago"  "U of Chicago"  ""           ""              
          2 4 "uchicago"  "U of Chicago"  ""           ""              
          2 5 "uchicago"  "U of Chicago"  ""           ""              
          3 1 "aiellos"   "Aiello's"      ""           ""              
          3 1 "aiello"    "Aiello's"      ""           ""              
          3 2 "none"      ""              ""           ""              
          3 2 "our place" "Our Place"     "sheetz"     "Sheetz"        
          3 3 "our place" "Our Place"     "sheetz"     "Sheetz"        
          3 5 "our place" "Our Place"     ""           ""              
          4 1 "uchicago"  "U of Chicago"  "jiffy lube" "Jiffy Lube"    
          4 1 "aiello"    "Aiello's"      "jiffy lube" "Jiffy Lube"    
          4 2 "truck co"  "Truck Company" "jiffy lube" "Jiffy Lube"    
          4 2 "truck co"  "Truck Company" "jiffy lube" "Jiffy Lube"    
          4 3 "truck co"  "Truck Company" ""           ""              
          end
          
          frame put place?, into(wanted)
          frame wanted{
              gen obsno=_n
              reshape long place, i(obsno) j(which)
              bys place(obsno): replace obsno= obsno[1]
              bys which obsno: keep if _n==1 & !missing(place)
              bys which (obsno): replace obsno=_n
              reshape wide place, i(obsno) j(which)
              l, sep(0)
          }
          frame drop wanted
          Res.:

          Code:
          .     l, sep(0)
          
               +--------------------------------+
               | obsno      place1       place2 |
               |--------------------------------|
            1. |     1     harvard         ming |
            2. |     2      aiello       sheetz |
            3. |     3    uchicago   jiffy lube |
            4. |     4     aiellos              |
            5. |     5        none              |
            6. |     6   our place              |
            7. |     7    truck co              |
               +--------------------------------+

          Comment


          • #6
            Nick Cox, sorry for confusion. If we imagine there's just one variable in the dataframe, place_clean1, then I just want to collapse that column to produce a list of the unique values of place_clean1. And then because there are multiple variables, I'd just want to repeat that in a way that allows for a final dataframe (or even just one long column) that contains all unique values in the original dataframe. One option I'm currently working on might actually be to just do a simple detection of unique values within each column, and output them sequentially to an Excel file.

            It looks like Andrew Musau's code produced what I was hoping for (though only for 2 of the 4 variables). I thought it might be fairly simple but I guess not!
            Last edited by Anne Todd; 12 Mar 2024, 11:06.

            Comment


            • #7
              In case anyone comes across this later, here's what I ended up doing:

              Code:
              preserve
              keep place1
              by place1, sort: gen uniquevalues=_n==1
              drop if uniquevalues!=1
              export excel using 'file', sheet("New sheet name here") 
              restore
              And then simply repeated that for each variable, while inserting the new list into a new sheet of the same Excel file. While that means I'll have to copy and paste to get them all together in the same Excel sheet, this seems like the most transparent and simplest solution (for me).

              Comment


              • #8
                No one else is chiming in here which is some comfort as I can't wrap my head around the entirety of a word description. As in #2 a data example of how you want the data to end up might crystallize the desire for anyone interested.

                Otherwise here are some commands I wrote earlier

                Code:
                help duplicates 
                
                search distinct, sj
                that may help one way or another.

                Comment


                • #9
                  Originally posted by Anne Todd View Post
                  It looks like Andrew Musau's code produced what I was hoping for (though only for 2 of the 4 variables). I thought it might be fairly simple but I guess not!
                  If you want all the variables, you will need to change the reshape command and the variables that you export to the frame. The essence of what you want, as I understand it, is achieved by having the data in a long layout and then using contract.

                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input byte(id survey_wave) str9 place1 str13 place_clean1 str10 place2 str15 place_clean2
                  1 1 "harvard"   "Harvard"       ""           ""              
                  1 2 "harvard"   "Harvard"       ""           ""              
                  1 3 "harvard"   "Harvard"       ""           ""              
                  1 4 ""          ""              ""           ""              
                  1 5 "aiello"    "Aiello's"      "ming"       "Ming Restaurant"
                  2 1 "uchicago"  "U of Chicago"  ""           ""              
                  2 3 "uchicago"  "U of Chicago"  ""           ""              
                  2 4 "uchicago"  "U of Chicago"  ""           ""              
                  2 4 "uchicago"  "U of Chicago"  ""           ""              
                  2 5 "uchicago"  "U of Chicago"  ""           ""              
                  3 1 "aiellos"   "Aiello's"      ""           ""              
                  3 1 "aiello"    "Aiello's"      ""           ""              
                  3 2 "none"      ""              ""           ""              
                  3 2 "our place" "Our Place"     "sheetz"     "Sheetz"        
                  3 3 "our place" "Our Place"     "sheetz"     "Sheetz"        
                  3 5 "our place" "Our Place"     ""           ""              
                  4 1 "uchicago"  "U of Chicago"  "jiffy lube" "Jiffy Lube"    
                  4 1 "aiello"    "Aiello's"      "jiffy lube" "Jiffy Lube"    
                  4 2 "truck co"  "Truck Company" "jiffy lube" "Jiffy Lube"    
                  4 2 "truck co"  "Truck Company" "jiffy lube" "Jiffy Lube"    
                  4 3 "truck co"  "Truck Company" ""           ""              
                  end
                  
                  frame put place*, into(wanted)
                  frame wanted{
                      gen obsno=_n
                      reshape long place place_clean, i(obsno) j(which)
                      bys place(obsno): replace obsno= obsno[1]
                      bys which obsno: keep if _n==1 & !missing(place)
                      bys which (obsno): replace obsno=_n
                      reshape wide place place_clean, i(obsno) j(which)
                      l, sep(0)
                  }
                  frame drop wanted

                  Res.:

                  Code:
                  .     l, sep(0)
                  
                       +------------------------------------------------------------------+
                       | obsno      place1    place_clean1       place2      place_clean2 |
                       |------------------------------------------------------------------|
                    1. |     1     harvard         Harvard         ming   Ming Restaurant |
                    2. |     2      aiello        Aiello's       sheetz            Sheetz |
                    3. |     3    uchicago    U of Chicago   jiffy lube        Jiffy Lube |
                    4. |     4     aiellos        Aiello's                                |
                    5. |     5        none                                                |
                    6. |     6   our place       Our Place                                |
                    7. |     7    truck co   Truck Company                                |
                       +------------------------------------------------------------------+

                  Comment

                  Working...
                  X