How to detect duplicates across two variables for longitudinal data?

Catherine Pellegrini

Join Date: Nov 2024

Posts: 1
#1


12 Nov 2024, 14:32

I'm looking for help with duplicates defined by only two variables.
I started with duplicates list and then duplicates drop to remove duplicates, but when I tried to reshape the longitudinal data from long to wide form, Stata gave me this error message:

values of variable year not unique within id
Your data are currently long. You are performing a reshape wide. You specified i(id) and j(year). There are observations within
i(id) with the same value of j(year). In the long data, variables i() and j() together must uniquely identify the observations.

I ran reshape error, which produced a long list of observations duplicated on id and year alone, but I can't figure out how to remove them. The dataset is large, so the list is truncated and I can't drop the duplicates individually. I would also like to inspect these duplicates to see where they differ on the other variables, since the original duplicates drop command did not remove them.

Any suggestions?
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#2

12 Nov 2024, 15:27

Getting unique combinations of id and year is easy enough, but it may mask important details in your data. Why do you have duplicates? What do observations in your dataset represent? If the command

Code:

duplicates drop *, force

did not eliminate all duplicates, then at least two observations share the same id and year but differ in some other variable. The best-case scenario is that missing values in id or year are causing this issue. This would be apparent if running the command

Code:

isid id year

produces an error stating that id or year should never be missing. Otherwise, you must identify the reason for the duplicates. That said, the quickest way to resolve this is by running

Code:

bysort id year: keep if _n==1

but I do not recommend doing this before determining why you have duplicates, because which observation is kept within each id-year pair is arbitrary.
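
To look at the duplicates before dropping anything, something along these lines may help. It is only a sketch on made-up toy data: the variables x and y and the generated names dup and differs_* are hypothetical stand-ins for whatever your dataset contains, so adapt the variable lists accordingly.

Code:

```stata
* Toy data standing in for the real panel (all values are made up)
clear
input id year x y
1 2001 10 5
1 2001 10 7
1 2002 11 6
2 2001 20 1
2 2001 20 1
2 2002 21 .
end

* Drop observations that are identical on every variable
duplicates drop

* Check whether the identifiers themselves contain missing values
count if missing(id, year)

* Tag id-year duplicates: dup is 0 for unique pairs, and otherwise
* the number of surplus observations sharing that id and year
duplicates tag id year, generate(dup)

* Show only the conflicting observations, grouped by id and year
sort id year
list id year x y if dup > 0, sepby(id year)

* For each remaining variable, flag id-year groups whose values disagree
* (sorting on the variable within group makes first != last a valid check)
foreach v of varlist x y {
    bysort id year (`v'): generate byte differs_`v' = `v'[1] != `v'[_N]
}
list id year differs_* if dup > 0, sepby(id year)
```

In the toy data, the two observations for id 1 in 2001 agree on x but not on y, so differs_y flags them. Seeing which variables disagree usually suggests whether the extra rows are data-entry errors, separate records that should be aggregated, or something else entirely.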