  • How to drop duplicate ID observation series with different variable values

    Hi all,

    I have a panel dataset of firms and stock returns between 2005 and 2015. However, for some observations (sorted by FirmID and Date) I have duplicates with differing stock prices.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(DailyObservation FirmID) long Date double ClosingPrice float dup_obs
    675 23 16439 7.46 1
      1 23 16439 2.95 2
      2 23 16440 2.96 1
    676 23 16440 7.59 2
      3 23 16441 2.95 1
    677 23 16441 7.58 2
      4 23 16442 2.96 1
    678 23 16442 7.75 2
    679 23 16443 7.76 1
      5 23 16443 3.25 2
    680 23 16446 7.84 0
    681 23 16447    8 1
      6 23 16447 2.95 2
    682 23 16448  7.8 0
    683 23 16449 7.72 1
      7 23 16449 2.95 2
    684 23 16450 7.84 0
    685 23 16453 7.93 1
      8 23 16453 2.95 2
      9 23 16454 2.95 1
    686 23 16454 7.83 2
     10 23 16455 2.95 1
    687 23 16455 7.85 2
     11 23 16456 2.95 1
    688 23 16456 7.79 2
     12 23 16457 2.95 1
    689 23 16457 7.78 2
     13 23 16460 3.02 1
    690 23 16460 7.68 2
    691 23 16461 7.68 1
     14 23 16461 2.95 2
    692 23 16462 7.77 1
     15 23 16462 2.95 2
     16 23 16463 2.95 1
    693 23 16463 7.85 2
    694 23 16464 7.84 1
     17 23 16464 2.95 2
    695 23 16467  7.9 1
     18 23 16467 2.95 2
    696 23 16468 8.09 1
     19 23 16468 2.93 2
    697 23 16469 8.07 1
     20 23 16469  3.1 2
     21 23 16470    3 1
    698 23 16470 7.96 2
     22 23 16471 3.01 1
    699 23 16471 8.02 2
    700 23 16474 8.28 0
    701 23 16475 8.36 1
     23 23 16475 2.94 2
     24 23 16476 3.31 1
    702 23 16476 8.85 2
     25 23 16477 3.35 1
    703 23 16477 8.76 2
    704 23 16478 8.78 1
     26 23 16478  3.3 2
     27 23 16481 3.03 1
    705 23 16481 8.59 2
     28 23 16482 3.29 1
    706 23 16482 8.71 2
     29 23 16483 3.02 1
    707 23 16483 8.74 2
     30 23 16484 3.03 1
    708 23 16484  8.8 2
    709 23 16485 8.57 1
     31 23 16485  3.2 2
    710 23 16488 8.51 1
     32 23 16488 3.02 2
     33 23 16489 3.02 1
    711 23 16489 8.36 2
     34 23 16490 3.02 1
    712 23 16490 8.26 2
    713 23 16491 8.23 0
     35 23 16492 3.02 1
    714 23 16492 8.38 2
    715 23 16495 8.46 0
    716 23 16496 8.27 1
     36 23 16496 3.02 2
    717 23 16497 8.39 1
     37 23 16497 3.03 2
    718 23 16498 8.23 1
     38 23 16498 3.02 2
     39 23 16499 3.05 1
    719 23 16499 8.31 2
     40 23 16502 3.02 1
    720 23 16502 8.41 2
     41 23 16503 3.02 1
    721 23 16503 8.51 2
     42 23 16504 2.93 1
    722 23 16504 8.52 2
     43 23 16505 2.95 1
    723 23 16505 8.42 2
     44 23 16506  3.2 1
    724 23 16506 8.62 2
    725 23 16509  8.4 1
     45 23 16509  3.2 2
    726 23 16510 8.04 1
     46 23 16510  3.3 2
    727 23 16511 7.96 0
     47 23 16512 3.35 1
    end
    format %d Date
    dup_obs was derived using a user-written ado-file and the code:
    Code:
    dup FirmID Date
    I would like to drop one duplicate time series set, either the series with the lower closing prices or the series that does not start with DailyObservation #1. I am having trouble coming up with the code to drop one of the series. I have tried:
    Code:
    dup FirmID Date, drop
    However, this drops all observations where dup_obs is not equal to 0, which causes parts of both duplicate series to be dropped.
    Using code such as
    Code:
    drop if dup_obs!=0 & DailyObservation>2865
    also does not help because, since the panel is unbalanced, not all duplicate series run up to such a high observation number.

  • #2
    I couldn't find the command -dup- that you use, but I think the problem can be solved with -duplicates- as well: sort on the variable you use to select a series (DailyObservation or ClosingPrice) and then specify -duplicates drop-, which keeps only the first observation of every duplicate group.

    Code:
    sort FirmID Date DailyObservation
    list in 1/15, noobs
      +------------------------------------------------+
      | DailyO~n   FirmID    Date   Closin~e   dup_obs |
      |------------------------------------------------|
      |        1       23   16439       2.95         2 |
      |      675       23   16439       7.46         1 |
      |        2       23   16440       2.96         1 |
      |      676       23   16440       7.59         2 |
      |        3       23   16441       2.95         1 |
      |------------------------------------------------|
      |      677       23   16441       7.58         2 |
      |        4       23   16442       2.96         1 |
      |      678       23   16442       7.75         2 |
      |        5       23   16443       3.25         2 |
      |      679       23   16443       7.76         1 |
      |------------------------------------------------|
      |      680       23   16446       7.84         0 |
      |        6       23   16447       2.95         2 |
      |      681       23   16447          8         1 |
      |      682       23   16448        7.8         0 |
      |        7       23   16449       2.95         2 |
      +------------------------------------------------+
    
    duplicates drop FirmID Date, force
    list in 1/9, noobs
    
      +------------------------------------------------+
      | DailyO~n   FirmID    Date   Closin~e   dup_obs |
      |------------------------------------------------|
      |        1       23   16439       2.95         2 |
      |        2       23   16440       2.96         1 |
      |        3       23   16441       2.95         1 |
      |        4       23   16442       2.96         1 |
      |        5       23   16443       3.25         2 |
      |------------------------------------------------|
      |      680       23   16446       7.84         0 |
      |        6       23   16447       2.95         2 |
      |      682       23   16448        7.8         0 |
      |        7       23   16449       2.95         2 |
      +------------------------------------------------+
    The non-duplicates seem to belong to the dropped series as well, so you may want to drop them with
    Code:
    drop if dup_obs==0
    Similarly, if you want to drop the duplicates with the lower closing price:
    Code:
    gsort FirmID Date -ClosingPrice
    duplicates drop FirmID Date, force
    list in 1/10, noobs
      +------------------------------------------------+
      | DailyO~n   FirmID    Date   Closin~e   dup_obs |
      |------------------------------------------------|
      |      675       23   16439       7.46         1 |
      |      676       23   16440       7.59         2 |
      |      677       23   16441       7.58         2 |
      |      678       23   16442       7.75         2 |
      |      679       23   16443       7.76         1 |
      |------------------------------------------------|
      |      680       23   16446       7.84         0 |
      |      681       23   16447          8         1 |
      |      682       23   16448        7.8         0 |
      |      683       23   16449       7.72         1 |
      |      684       23   16450       7.84         0 |
      +------------------------------------------------+
    You need the -force- option because the observations may differ on variables other than FirmID and Date, which is the case here.
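    As a side note, a sketch of an alternative idiom that avoids -force- entirely: keep, within each FirmID/Date group, the observation with the highest ClosingPrice directly. The variable in parentheses sets the sort order within groups, so _N indexes the highest-priced observation:
    Code:
    bysort FirmID Date (ClosingPrice): keep if _n == _N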



    • #3
      Dear Wouter,
      thanks for your reply!

      I have used your code to delete the lower-priced time series. However, I realised that the origin of the problem is something else!
      Taking a step back, I saw that for various FirmIDs the return panels are duplicated, and in the subsequent duplicate panels the prices are way off what they should be.
      What is the code to keep only the first panel for each ID? Below is an excerpt from where the panels repeat. I am only interested in keeping the upper panel.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float FirmID long Date str88 Name double ClosingPrice
      7 17639 "ABN-AMRO HOLDINGS NV" 38.51
      7 17640 "ABN-AMRO HOLDINGS NV" 38.51
      7 17643 "ABN-AMRO HOLDINGS NV" 37.75
      7 17644 "ABN-AMRO HOLDINGS NV" 37.67
      7 17645 "ABN-AMRO HOLDINGS NV" 37.51
      7 17646 "ABN-AMRO HOLDINGS NV" 36.95
      7 16440 "ABN-AMRO HOLDINGS NV" 22.96
      7 16442 "ABN-AMRO HOLDINGS NV"  27.7
      7 16443 "ABN-AMRO HOLDINGS NV" 27.69
      7 16446 "ABN-AMRO HOLDINGS NV" 27.73
      7 16449 "ABN-AMRO HOLDINGS NV"    25
      7 16450 "ABN-AMRO HOLDINGS NV" 27.69
      end
      format %d Date
      Thanks!



      • #4
        Hi Robin,

        I'm afraid I don't understand you when you say you want to keep the first panel of each ID. A panel is a dataset with N panel members and T time observations. Do you want to keep only certain dates? Maybe if you explain more generally what you want to do, I or someone else will be better able to help you.
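        If what you mean is that the whole series for a firm is repeated below its first copy, as in your excerpt where Date jumps back from 17646 to 16440, here is a sketch, assuming the duplicate block always starts at such a backward jump in Date and that the data are still in the original file order:
        Code:
        gen long obsno = _n
        gen byte jump = FirmID == FirmID[_n-1] & Date < Date[_n-1]
        bysort FirmID (obsno): gen dupblock = sum(jump)
        drop if dupblock > 0
        drop obsno jump dupblock
        This flags every row where Date decreases within a firm and drops everything from the first such jump onward, keeping only the upper panel.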
        Last edited by Wouter Wakker; 29 Jun 2019, 13:00.



        • #5
          Another issue I am having is that some dates are still duplicated, but with differing prices and TradingVolumes (e.g. 16dec2005), even after using this
          Code:
           gsort FirmID Date -ClosingPrice
           duplicates drop FirmID Date, force
          For duplicate observations on any given Date I would like to drop the one with the lower TradingVolume, regardless of ObservationSeries as stated in the initial post. How do I go about doing this?

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float FirmID long Date str88 Name double(TradingVolume ClosingPrice)
          7 16785 "ABN-AMRO HOLDINGS NV"  7793783 21.85
          7 16786 "ABN-AMRO HOLDINGS NV" 17663990 22.09
          7 16786 "ABN-AMRO HOLDINGS NV"       20 24.75
          7 16789 "ABN-AMRO HOLDINGS NV"  7206420 21.97
          7 16790 "ABN-AMRO HOLDINGS NV"  7182666 22.06
          7 16791 "ABN-AMRO HOLDINGS NV"  7335014  22.2
          7 16792 "ABN-AMRO HOLDINGS NV"  6461439  22.2
          7 16793 "ABN-AMRO HOLDINGS NV"  3786545 22.21
          7 16797 "ABN-AMRO HOLDINGS NV"  1914285 22.29
          7 16798 "ABN-AMRO HOLDINGS NV"  3087064 22.29
          7 16798 "ABN-AMRO HOLDINGS NV"       30 24.75
          7 16799 "ABN-AMRO HOLDINGS NV"  2640609 22.34
          7 16800 "ABN-AMRO HOLDINGS NV"  4109429 22.09
          7 16800 "ABN-AMRO HOLDINGS NV"       70 23.55
          7 16803 "ABN-AMRO HOLDINGS NV"  2679901 22.23
          7 16804 "ABN-AMRO HOLDINGS NV"  7932026 22.44
          7 16804 "ABN-AMRO HOLDINGS NV"      550 24.31
          7 16805 "ABN-AMRO HOLDINGS NV"  8435475  22.6
          7 16806 "ABN-AMRO HOLDINGS NV"  6875973 22.58
          7 16806 "ABN-AMRO HOLDINGS NV"       50    24
          7 16807 "ABN-AMRO HOLDINGS NV"  7042624 22.58
          7 16807 "ABN-AMRO HOLDINGS NV"      685 24.31
          7 16810 "ABN-AMRO HOLDINGS NV"  6377258  22.6
          7 16810 "ABN-AMRO HOLDINGS NV"     1080 24.75
          7 16811 "ABN-AMRO HOLDINGS NV"  6014526 22.36
          7 16811 "ABN-AMRO HOLDINGS NV"      500 24.75
          7 16812 "ABN-AMRO HOLDINGS NV"  6858210 22.56
          end
          format %d Date



          • #6
            Wouter, I know it is a bit confusing. You would think that a paid data service such as Compustat would have data with correct variable values given the correct conditional statements.

            Ultimately, what I would like to do is something along the lines of: if, among multiple observations, FirmID and Date match (i.e. duplicates) but ClosingPrice and TradingVolume differ, then among those duplicates drop the one with the lower/higher TradingVolume/ClosingPrice.

            Is there a way to properly code this?



            • #7
              For duplicate observations on any given Date I would like to drop the one with the lower TradingVolume, regardless of ObservationSeries as stated in the initial post. How do I go about doing this?
              Code:
              gsort FirmID Date -TradingVolume 
              duplicates drop FirmID Date, force
              For dropping the highest closing price instead of the lowest, just remove the - before ClosingPrice in the code already provided.

              Another issue I am having is that some dates are still duplicated but with differing prices and TradingVolumes, e.g. 16dec2005, regardless of using -gsort FirmID Date -ClosingPrice- followed by -duplicates drop FirmID Date, force-
              For me it works on your data samples. Please provide me with a data sample and the code you use when this happens if you want me to check.




              • #8
                Hi Wouter, thanks for your answer. The code you provided is helpful, but it applies to the whole dataset. For each FirmID I need to check manually whether I should drop the higher/lower TradingVolume/ClosingPrice. Thus, is it also possible to apply this code with an -if- qualifier? For example
                Code:
                gsort FirmID Date -TradingVolume,
                duplicates drop FirmID Date, if FirmID==7 , force
                where I swap -TradingVolume for ClosingPrice and change the FirmID for each firm? Also, instead of writing this code for each FirmID, is it possible to create a foreach loop, so that I can combine the FirmIDs for which the lower/higher Volumes/Prices need to be dropped?

                Sorry for all the questions, I am still very much a beginner in Stata.



                • #9
                  Yes, your code is almost okay; it just has one comma too many. It should be
                  Code:
                  duplicates drop FirmID Date if FirmID==7, force
                  Also, the comma after your first line is not necessary.

                  To loop over firms:
                  Code:
                  * Keep highest trading volume duplicate for firms 1 3 5 and 7
                  gsort FirmID Date -TradingVolume
                  foreach num of numlist 1 3 5 7 {
                      duplicates drop FirmID Date if FirmID==`num', force
                  }

                  * Keep lowest closing price for firms 2 4 and 6
                  gsort FirmID Date ClosingPrice
                  foreach num of numlist 2 4 6 {
                      duplicates drop FirmID Date if FirmID==`num', force
                  }
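                  As a side note, the same per-firm logic can be sketched without -duplicates- at all, by dropping all but the highest-volume observation within each FirmID/Date group for the listed firms:
                  Code:
                  * keep only the highest trading volume per FirmID/Date for firms 1 3 5 and 7
                  bysort FirmID Date (TradingVolume): drop if _n < _N & inlist(FirmID, 1, 3, 5, 7)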



                  • #10
                    Yes that is what I am looking for! Thank you so much Wouter!

