  • Remove duplicates in panel data

    Hello,

I'm trying to run a regression with xtreg, but before doing so I wanted to declare the panel structure with xtset:

    Code:
    xtset globalcompanykey datayearfiscal
    repeated time values within panel

    I went to check for duplicates and I got the following response:

    Code:
    duplicates report globalcompanykey datayearfiscal

    Duplicates in terms of globalcompanykey datayearfiscal

    --------------------------------------
       Copies | Observations       Surplus
    ----------+---------------------------
            1 |        99109             0
            2 |        26408         13204
    --------------------------------------

    Within these duplicate pairs, some rows have more missing data than their counterparts. I want to remove the duplicate row(s) with the least data and keep the most complete one. How can I best tackle this?

    Edit 1:
    My dataset contains the following variables:

    Code:
    ds
    globalcomp~y population~e accountspa~e employees stockholde~l gdpgrowth ROA pct70 CCC_mature~m
    datadate dataformat assetstotal inventorie~l activeinac~r DIO MB maturefirm~y
    datayearfi~l tickersymbol costofgood~d receivable~e marketvalu~l DSO size growthfirm~y
    industryfo~t companyname totaldebti~t revenuetotal standardin~o DPO leverage SME
    levelofcon~a isocurrenc~e earningsbe~s salesturno~t salesgrowth CCC pct30 CCC_growth~m

  • #2
    Following up on my own question: I used

    Code:
    bysort globalcompanykey datayearfiscal : keep if _n == _N
    (13,204 observations deleted)

    I guess it worked out fine!
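    For the record, a variant that keeps, within each duplicate group, the row with the fewest missing values (closer to what was asked in #1) could look something like this. It is only a sketch; n_missing is just an illustrative name:

    Code:
    ds globalcompanykey datayearfiscal, not
    * count missing values across all non-key variables for each row
    egen n_missing = rowmiss(`r(varlist)')
    * within each company-year, keep the row with the fewest missing values
    bysort globalcompanykey datayearfiscal (n_missing): keep if _n == 1
    drop n_missing

    Ties are still broken arbitrarily, so this does not remove the need to check for contradictory values across duplicates.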



    • #3
      I guess it worked out fine!
      Well, maybe, and I hope so. But not so fast.

      First, whenever you have duplicate observations for what should be a single one, you need to check whether the observations contradict each other. If one has a missing value where the other has a non-missing one, that's not a problem. But if one says the value of a variable is 3 and the other says it's 5, then just picking one of the observations arbitrarily creates a garbage data set. You need to resolve contradictions if there are any. The following code flags them:
      Code:
      * list every variable except the panel identifiers
      ds globalcompanykey datayearfiscal, not
      gen byte is_missing = .
      foreach v of varlist `r(varlist)' {
          replace is_missing = missing(`v')
          * within each company-year, sort the non-missing values of `v';
          * if the first and last differ, the observations are contradictory
          by globalcompanykey datayearfiscal is_missing (`v'), sort: ///
              gen byte problem_`v' = ((`v'[1] != `v'[_N]) & !is_missing)
      }
      If any of the problem_* variables this creates contains non-zero values, that indicates contradictory values for the corresponding original variable in those observations. In that situation, your first responsibility is to figure out what the correct value should be and replace the incorrect ones. If it isn't possible to get correct values, then the conflicting values should be reset to missing.
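      One quick way to see whether anything was flagged is to take the row-wise maximum of the problem_* indicators and tabulate it. This is just a sketch; any_problem is an illustrative name, not part of the code above:

      Code:
      * 1 if this observation's company-year shows a contradiction in any variable
      egen byte any_problem = rowmax(problem_*)
      tab any_problem
      * list the offending company-years for inspection
      list globalcompanykey datayearfiscal if any_problem, sepby(globalcompanykey)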

      If you don't have any contradictions, then you can still improve on what you did. Suppose your data set has 10 variables besides globalcompanykey and datayearfiscal. Say there are two observations, one of which is missing 4 of the variables, and the other of which is missing the other 6. Between those two observations you have complete data, yet you will only keep one or the other of them, depending on which happens to appear later in the data set. So you're throwing away half the data. Better than picking one, when there are no contradictions, is to combine them:
      Code:
      ds globalcompanykey datayearfiscal, not
      * for each company-year, keep the first non-missing value of every other variable
      collapse (firstnm) `r(varlist)', by(globalcompanykey datayearfiscal)
      This will be the most complete possible data set you can create from a data set with duplicate but non-contradictory observations.
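      After combining, a quick check that the duplicates are gone and that the panel declaration from #1 now goes through (nothing here beyond commands already used in this thread):

      Code:
      duplicates report globalcompanykey datayearfiscal
      xtset globalcompanykey datayearfiscal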
