Collapse across duplicates of strings

Anne Todd

Join Date: Dec 2018
Posts: 157

Collapse across duplicates of strings

11 Mar 2024, 19:30

With example data below, what is the most efficient way to produce a dataframe that contains just one row per unique entry of the four "place" variables? I have only used collapse with numeric variables before, so I'm not sure if this is even possible. Otherwise perhaps there is some way to flag duplicates and then drop based on that new flag variable?

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(id survey_wave) str9 place1 str13 place_clean1 str10 place2 str15 place_clean2
1 1 "harvard"   "Harvard"       ""           ""               
1 2 "harvard"   "Harvard"       ""           ""               
1 3 "harvard"   "Harvard"       ""           ""               
1 4 ""          ""              ""           ""               
1 5 "aiello"    "Aiello's"      "ming"       "Ming Restaurant"
2 1 "uchicago"  "U of Chicago"  ""           ""               
2 3 "uchicago"  "U of Chicago"  ""           ""               
2 4 "uchicago"  "U of Chicago"  ""           ""               
2 4 "uchicago"  "U of Chicago"  ""           ""               
2 5 "uchicago"  "U of Chicago"  ""           ""               
3 1 "aiellos"   "Aiello's"      ""           ""               
3 1 "aiello"    "Aiello's"      ""           ""               
3 2 "none"      ""              ""           ""               
3 2 "our place" "Our Place"     "sheetz"     "Sheetz"         
3 3 "our place" "Our Place"     "sheetz"     "Sheetz"         
3 5 "our place" "Our Place"     ""           ""               
4 1 "uchicago"  "U of Chicago"  "jiffy lube" "Jiffy Lube"     
4 1 "aiello"    "Aiello's"      "jiffy lube" "Jiffy Lube"     
4 2 "truck co"  "Truck Company" "jiffy lube" "Jiffy Lube"     
4 2 "truck co"  "Truck Company" "jiffy lube" "Jiffy Lube"     
4 3 "truck co"  "Truck Company" ""           ""               
end

Tags: None

Nick Cox

Join Date: Mar 2014

Posts: 35210
#2

11 Mar 2024, 21:00

Even if you look to remove duplicates across place*, each observation is still unique (literally; also distinct) because of the first two identifier variables.

So, I can't follow what you want here. Perhaps you can give a data example of what the result should be.
Comment
Anne Todd

Join Date: Dec 2018

Posts: 157
#3

11 Mar 2024, 21:06

Sorry, my use of the word "across" conveyed something different than what I intended.

I actually suppose I mean within each variable, not across. So within place1, the resulting unique values would be: harvard, aiello, uchicago, aiellos, none, our place, truck co. And repeat, separately, for each place* variable.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35210
#4

12 Mar 2024, 09:19

Thanks for the extra detail. Sorry, but I still have no clear idea of what you want.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9945

12 Mar 2024, 10:49

Something like?

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(id survey_wave) str9 place1 str13 place_clean1 str10 place2 str15 place_clean2
1 1 "harvard"   "Harvard"       ""           ""              
1 2 "harvard"   "Harvard"       ""           ""              
1 3 "harvard"   "Harvard"       ""           ""              
1 4 ""          ""              ""           ""              
1 5 "aiello"    "Aiello's"      "ming"       "Ming Restaurant"
2 1 "uchicago"  "U of Chicago"  ""           ""              
2 3 "uchicago"  "U of Chicago"  ""           ""              
2 4 "uchicago"  "U of Chicago"  ""           ""              
2 4 "uchicago"  "U of Chicago"  ""           ""              
2 5 "uchicago"  "U of Chicago"  ""           ""              
3 1 "aiellos"   "Aiello's"      ""           ""              
3 1 "aiello"    "Aiello's"      ""           ""              
3 2 "none"      ""              ""           ""              
3 2 "our place" "Our Place"     "sheetz"     "Sheetz"        
3 3 "our place" "Our Place"     "sheetz"     "Sheetz"        
3 5 "our place" "Our Place"     ""           ""              
4 1 "uchicago"  "U of Chicago"  "jiffy lube" "Jiffy Lube"    
4 1 "aiello"    "Aiello's"      "jiffy lube" "Jiffy Lube"    
4 2 "truck co"  "Truck Company" "jiffy lube" "Jiffy Lube"    
4 2 "truck co"  "Truck Company" "jiffy lube" "Jiffy Lube"    
4 3 "truck co"  "Truck Company" ""           ""              
end

frame put place?, into(wanted)
frame wanted{
    gen obsno=_n
    reshape long place, i(obsno) j(which)
    bys place(obsno): replace obsno= obsno[1]
    bys which obsno: keep if _n==1 & !missing(place)
    bys which (obsno): replace obsno=_n
    reshape wide place, i(obsno) j(which)
    l, sep(0)
}
frame drop wanted

Res.:

Code:

.     l, sep(0)

     +--------------------------------+
     | obsno      place1       place2 |
     |--------------------------------|
  1. |     1     harvard         ming |
  2. |     2      aiello       sheetz |
  3. |     3    uchicago   jiffy lube |
  4. |     4     aiellos              |
  5. |     5        none              |
  6. |     6   our place              |
  7. |     7    truck co              |
     +--------------------------------+

Comment

Anne Todd

Join Date: Dec 2018

Posts: 157
#6

12 Mar 2024, 10:53

Nick Cox, sorry for confusion. If we imagine there's just one variable in the dataframe, place_clean1, then I just want to collapse that column to produce a list of the unique values of place_clean1. And then because there are multiple variables, I'd just want to repeat that in a way that allows for a final dataframe (or even just one long column) that contains all unique values in the original dataframe. One option I'm currently working on might actually be to just do a simple detection of unique values within each column, and output them sequentially to an Excel file.

It looks like Andrew Musau's code produced what I was hoping for (though only for 2 of the 4 variables). I thought it might be fairly simple but I guess not!

Last edited by Anne Todd; 12 Mar 2024, 11:06.
Comment
Anne Todd

Join Date: Dec 2018

Posts: 157
#7

12 Mar 2024, 11:33

In case anyone comes across this later, here's what I ended up doing:

Code:

preserve keep place1 by place1, sort: gen uniquevalues=_n==1 drop if uniquevalues!=1 export excel using 'file', sheet("New sheet name here") restore

And then simply repeated that for each variable, while inserting the new list into a new sheet of the same Excel file. While that means I'll have to copy and paste to get them all together in the same Excel sheet, this seems like the most transparent and simplest solution (for me).
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35210
#8

12 Mar 2024, 11:38

No one else is chiming in here which is some comfort as I can't wrap my head around the entirety of a word description. As in #2 a data example of how you want the data to end up might crystallize the desire for anyone interested.

Otherwise here are some commands I wrote earlier

Code:

help duplicates search distinct, sj

that may help one way or another.
1 like
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9945

12 Mar 2024, 12:17

Originally posted by Anne Todd View Post

It looks like Andrew Musau's code produced what I was hoping for (though only for 2 of the 4 variables). I thought it might be fairly simple but I guess not!

If you want all the variables, you will need to change the reshape command and the variables that you export to the frame. The essence of what you want, as I understand it, is achieved by having the data in a long layout and then using contract.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(id survey_wave) str9 place1 str13 place_clean1 str10 place2 str15 place_clean2
1 1 "harvard"   "Harvard"       ""           ""              
1 2 "harvard"   "Harvard"       ""           ""              
1 3 "harvard"   "Harvard"       ""           ""              
1 4 ""          ""              ""           ""              
1 5 "aiello"    "Aiello's"      "ming"       "Ming Restaurant"
2 1 "uchicago"  "U of Chicago"  ""           ""              
2 3 "uchicago"  "U of Chicago"  ""           ""              
2 4 "uchicago"  "U of Chicago"  ""           ""              
2 4 "uchicago"  "U of Chicago"  ""           ""              
2 5 "uchicago"  "U of Chicago"  ""           ""              
3 1 "aiellos"   "Aiello's"      ""           ""              
3 1 "aiello"    "Aiello's"      ""           ""              
3 2 "none"      ""              ""           ""              
3 2 "our place" "Our Place"     "sheetz"     "Sheetz"        
3 3 "our place" "Our Place"     "sheetz"     "Sheetz"        
3 5 "our place" "Our Place"     ""           ""              
4 1 "uchicago"  "U of Chicago"  "jiffy lube" "Jiffy Lube"    
4 1 "aiello"    "Aiello's"      "jiffy lube" "Jiffy Lube"    
4 2 "truck co"  "Truck Company" "jiffy lube" "Jiffy Lube"    
4 2 "truck co"  "Truck Company" "jiffy lube" "Jiffy Lube"    
4 3 "truck co"  "Truck Company" ""           ""              
end

frame put place*, into(wanted)
frame wanted{
    gen obsno=_n
    reshape long place place_clean, i(obsno) j(which)
    bys place(obsno): replace obsno= obsno[1]
    bys which obsno: keep if _n==1 & !missing(place)
    bys which (obsno): replace obsno=_n
    reshape wide place place_clean, i(obsno) j(which)
    l, sep(0)
}
frame drop wanted

Res.:

Code:

.     l, sep(0)

     +------------------------------------------------------------------+
     | obsno      place1    place_clean1       place2      place_clean2 |
     |------------------------------------------------------------------|
  1. |     1     harvard         Harvard         ming   Ming Restaurant |
  2. |     2      aiello        Aiello's       sheetz            Sheetz |
  3. |     3    uchicago    U of Chicago   jiffy lube        Jiffy Lube |
  4. |     4     aiellos        Aiello's                                |
  5. |     5        none                                                |
  6. |     6   our place       Our Place                                |
  7. |     7    truck co   Truck Company                                |
     +------------------------------------------------------------------+

Announcement

Collapse across duplicates of strings

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment