Deleting duplicates, but first copying some information from the duplicate

Emilie Gronsund

Join Date: Apr 2014

Posts: 26
#1

Deleting duplicates, but first copying some information from the duplicate

24 Apr 2014, 04:03

Hi,

Could someone help me with a problem, I am stuck!
So basically, my professor told me to do an append instead of a merge in stata (don't really agree, but might make sense if I get the dataset organised).

After an append, my observations were just added at the bottom. I sorted them so that I can see the observations more clearly.
I.e, if you see where I have marked blue in the picture? They are duplicates based on most variables, however for example, the dummies called seller and customer are different. What I basically want to do is to add(or copy) the information from the other duplicate, and then drop the duplicates. I.e for the company with gvkey=1161 and fyear =1995, I just want one observation, however the information about customer and seller to be on one line, so that it would look something like this:
gvkey fyear seller customer
1161 1995 1 1
Is this possible? There are some other variables as well that I want to add variables from, so it would have to work with copying information from several variables.

Hope someone can help! I am stuck!

Best,
Emilie

1 Photo
Tags: None
Scott Merryman

Join Date: Mar 2014

Posts: 895
#2

24 Apr 2014, 04:32

Code:

collapse (sum) seller customer, by(gvkey fyear ...)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35211
#3

24 Apr 2014, 04:37

Output and data listings are easier to read if formatted as code. See my posting in http://www.statalist.org/forums/foru...f-collinearity

Your full real name is better as an identifier.

Scott Merryman's code is excellent and may be exactly what you want. Technique for paired items follows which may also be of value.

If you have duplicates on gvkey fyear then

Code:

bysort gvkey fyear : replace seller = seller[3 - _n] if missing(seller) & _N == 2 bysort gvkey fyear : replace customer = customer[3 - _n] if missing(customer) & _N == 2

copies information between them as 3 - _n is 2 if _n is 1 and vice versa.

See http://www.stata-journal.com/sjpdf.h...iclenum=dm0043 for a discussion.

Last edited by Nick Cox; 24 Apr 2014, 04:50.
Comment
Emilie Gronsund

Join Date: Apr 2014

Posts: 26
#4

24 Apr 2014, 05:14

Thanks, Nick and Scott! Really helpful!
Best,
Emilie Gronsund
Comment
Emilie Gronsund

Join Date: Apr 2014

Posts: 26
#5

24 Apr 2014, 07:37

So I tried both ways. The first way by Scott did not work in my dataset. Your method, Nick, worked almost perfectly. However, what to do with non-numerics. I.e one variable has the same problem, where there is information on one of the duplicates, but not the other. Your method only works with numbers. These are i.e company names. Is there another code for that?
Best,
Emilie Gronsund
Comment
Carlos Avellaneda

Join Date: Mar 2014

Posts: 23
#6

24 Apr 2014, 08:37

You should change your user ID in order to be your real name, as it is stated in Statalist's FAQ. Click on the 'Contact Us' link at the far bottom right of any forum page in order to contact the administrators and change it. That aside, Nick's solution works perfectly fine with both numeric and string variables, so you should tell us the exact code you wrote, an example of what you got and an example of what you want.
Best regards,
Carlos
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35211
#7

24 Apr 2014, 08:38

"did not work" is not helpful as an error report.

It's not correct that my method works only with numeric variables. There is no assumption there that any of the variables are numeric.

Otherwise put, I think you need to give a more detailed report of what didn't work for you. If it's that company names are missing, you would need to fix that before sorting as missing values on company names are just sorted together. Perhaps the company name should be same as the previous non-missing company name, but you would need to copy company names downwards in the dataset before sorting.
Comment
Bill McCumber

Join Date: Apr 2014

Posts: 1
#8

24 Apr 2014, 09:11

This was a maddening issue when I was first learning Stata. See http://www.ats.ucla.edu/stat/stata/f...ta_faq_dup.htm for a walk through on how to identify perfect duplicates (across all variables), duplicates by some variables, tag and/or drop duplicates, etc. Good luck!
Comment
Emilie Gronsund

Join Date: Apr 2014

Posts: 26
#9

24 Apr 2014, 14:44

Hi,
Thank you guys, again.
I have sent an email asking to change my id to Emilie Gronsund, so hopefully this will be up soon. (sorry, I didn't know it was preferred)

So, to start from the beginning, I want to append two datasets, where most of the variables are identical. However there are a few which as of now I want to keep even though they are only in one of the data sets.

When appending them, there are a lot of duplicates.

I first attempted the method suggested by Scott, and utilised the following code:
collapse (sum) seller customer cust_gvkey cust_name crsp_cust_name cust_permno salecs, by(gvkey fyear conm at dlc dltt ebit ebitda gdwl intan ni sale xsga mkvalt lpermno lpermco compustat_name crsp_name permno)
Got the following error: type mismatch.
Looked it up, and it is because sum assumes numerical values, and as you can see 'cust_name' and 'crsp_cust_name' are names of the firms. So just for observational purposes I attempted to perform the same but excluding the non-numeric variables:
collapse (sum) seller customer cust_gvkey cust_permno, by(gvkey fyear conm at dlc dltt ebit ebitda gdwl intan ni sale xsga mkvalt lpermno lpermco compustat_name crsp_name permno).
So, doing this didn't get the results I wanted. It basically just replaced the missing values (.) with 0. Please see before and after photos (the after photo is when the variables interested in summing is at the end). I want the info from the duplicates to be added to the same line.

However, I made too quick of an assumption that your code (Nick) didn't work with non-numericals. Tried it now and it worked! Thank you

Thanks everyone for the replies,
I am waiting for my name to show up on my posts, but until then, my name at the bottom of the posts is all I have to offer

Best,
Emilie Gronsund

2 Photos
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35211
#10

24 Apr 2014, 14:58

Thanks for the closure and attention to the detail on names. Emilie Gronsund is much more memorable than s135183. But (still) please see my posting earlier explaining how to get more readable listings. You would need to use list rather than take a screenshot of the Data Editor.
Comment

Announcement

Deleting duplicates, but first copying some information from the duplicate

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment