Cleaning dataset by finding duplicates among different variables

Fred Fennek

Join Date: Jul 2017
Posts: 2

Cleaning dataset by finding duplicates among different variables

23 Jul 2017, 05:23

Hello,

I hope this is not a redundant question (I'm rather new to Stata) but I haven't yet found an attractive solution to my problem.

I need to identify (and later drop) observations that have a value in one variable (ID_1) which is equal to another observation's value of a different variable (ID_2) under the condition that both observations have the same date and the matched observation has a specific value in an indicator variable (either C or W). I would like to generate a dummy variable that ticks the corresponding observations first before I drop them.

Here, the simplified data:

Firm	ID_1	ID_2	Date	Indicator
1	28896		07. Feb 03	T
2	29105		07. Feb 03	T
1	29106	28896	07. Feb 03	W
3	29126		07. Feb 03	T
3	29254		07. Feb 03	N
4	29283		07. Feb 03	T
3	29288	29254	07. Feb 03	C

For example, I would like to the dummy variable to mark the first observation with ID_1=28896 because the third line item's ID_2 matches that ID_1 while it has the same date and also the Indicator equals "W" instead of "T". The same applies to the fifth observation which should be marked due to the last line item.

I am sorry, I cannot provide a code, I have tried different approaches using forvalue and duplicates but nothing worked properly.

Thanks and best regards,
Fred

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 29846
#2

23 Jul 2017, 12:38

This is a good task for Robert Picard's -rangejoin-, which you can get from SSC. In order to use -rangejoin- you must also install -rangestat- (Also by Robert Picard, with Nick Cox & Roberto Ferrer) from SSC.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte firm int(id_1 id_2) float date str1 indicator 1 28896 . 15743 "T" 2 29105 . 15743 "T" 1 29106 28896 15743 "W" 3 29126 . 15743 "T" 3 29254 . 15743 "N" 4 29283 . 15743 "T" 3 29288 29254 15743 "C" end format %td date preserve keep if !missing(id_2) & inlist(indicator, "C", "W") rename id_2 matching_id tempfile id2s save `id2s' restore rangejoin matching_id id_1 id_1 using `id2s', by(date) gen byte has_match = !missing(indicator_U)

In the future, please use -dataex- to give example data, as I have done here. Your example HTML table was cumbersome to import into Stata because it required converting the date variable to a Stata internal format date variable. Also, it is impossible to know from your table whether indicator is a string variable or a numeric variable with a value label. I guessed it's a string. If it's not, you'll have to modify the code above to treat it as a numeric variable. When you use -dataex- (also by Robert Picard and also available from SSC) you enable those who want to help you to instantly create a completely faithful replica of your Stata example, that includes all of the subtleties of storage types and labeling, with a simple copy/paste operation. Please use it every time you post example data in the future.
Comment
Fred Fennek

Join Date: Jul 2017

Posts: 2
#3

24 Jul 2017, 05:15

Hello Clyde,

thanks a lot, this works perfectly! The indicator is indeed a string (even the IDs were but I destringed those).

Next time, I will use dataex. I even tried it this time before posting but my inexperience led me to believe that it would be arranged more clearly like that, now I know.

Best
Fred
Comment

Announcement

Cleaning dataset by finding duplicates among different variables

Comment

Comment