  • #31
    Perhaps somebody here is familiar enough with the particular survey data you are working with to advise you. But most of us here are not. So if you don't get a response from somebody within, say, 12 hours, I suggest you post back and use the -dataex- command to post example data from the two data sets you are trying to -merge-. The problem is that your data do not conform to the expectations of the code you have written. But without knowing what is actually in the data, it is impossible to advise.

    If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.
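For example, a minimal sketch using the auto dataset that ships with Stata (substitute your own variables and observations):

Code:
* install -dataex- only if it is not already part of your Stata
capture which dataex
if _rc ssc install dataex

* post a small, faithful slice of your data
sysuse auto, clear
dataex make price mpg foreign in 1/10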



    • #32
Hi everyone, I'm trying to estimate a spatial panel. I have two datasets: the first contains the variables _ID, _CY and _CX, with 853 observations, one for each municipality in the state; the second contains my panel data, with 11,089 observations from 2008 to 2020. The code I used is:
Code:
spshape2dta mg.shp, replace
use mg
spset
merge m:1 _ID using "mg.dta"

The result is 11,089 matched observations. But when I try to create the contiguity matrix using the command:

Code:
spmatrix create contiguity W

Stata returns error r(459): variable _ID does not uniquely identify observations in the master data. Could somebody help me with this problem? What am I doing wrong?




      • #33
        Dear all,
I want to merge two data sets (DTA1 with DTA2). The data set DTA1 contains information on bilateral trade flows between country_k and country_j in year t, whereas DTA2 is a gravity dataset obtained from CEPII
(http://www.cepii.fr/CEPII/en/bdd_mod..._item.asp?id=8) which contains information on bilateral and unilateral variables between k and j in year t.
Therefore I wrote the following code
Code:
use DTA1, clear
merge m:1 year country_k country_j using DTA2
I get an error message: variables year country_k country_j do not uniquely identify observations in the using data.
Then I checked for duplicates in the using data by running

Code:
use DTA2, clear
sort year country_k country_j
quietly by year country_k country_j: gen dup = cond(_N==1,0,_n)
Then I dropped the duplicates, keeping the first occurrence, by running
Code:
drop if dup > 1
This worked fine. But I need to drop the duplicates more carefully, not exactly in this manner.

I am posting a -dataex- example here to explain my point.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str3(country_k country_j) int year double x1 byte x2 double x3 byte(x4 y1 y2) double(y3 f1 f2 f3 f4) byte dup
        "AAT" "BJS" 1991    23 43    45.01 . 1 0  5785.35    12     .    36    45 0
        "ABW" "ANT" 1990 21.36  .   567002 0 . . 2487.632 11.35 36.27     .    24 1
        "ABW" "ANT" 1990     .  .   567002 . . .        . 11.35     .     .     . 2
        "ABW" "ANT" 1990     .  .        . . . . 2487.632     .     . 12.56    24 3
        "ABW" "ANT" 1990     .  .        . . . .        .     .     .     .     . 4
        "STU" "BWF" 1993     .  .    37248 . 0 .        .     .     .     . 17.27 1
        "STU" "BWF" 1993  30.7  0    37248 . 0 1  7623.23     . 56.12 23.67 17.27 2
        "STU" "BWF" 1993     .  .        . . . .  7623.23     .     . 23.67     . 3
        "JKS" "FPW" 2010     .  . 4589.765 . . 1        .     .     . 23.14     . 1
        "JKS" "FPW" 2010 76.45  1 4589.765 0 . 1 8767.346     .     . 23.14 32.87 2
        end
In this example the first row of the triplet ABW-ANT-1990 contains the maximum information on the variables x1-f4, while the other duplicates in this triplet have some missing values, so drop if dup>1 will retain the row containing the most information across all the columns. However, for STU-BWF-1993 the second row contains data on most of the variables and the first row is mostly missing, so here drop if dup>1 will drop the row which is more informative. Similarly, in JKS-FPW-2010 the second row will be deleted, even though it contains more information than the first row in this triplet.

I want to ask: how should I drop duplicates and retain a unique occurrence in such a manner that the rows (duplicates) containing less information are dropped, while the occurrence containing information on most of the variables (columns) is retained within each triplet? That is,
for ABW-ANT-1990, drop the 2nd, 3rd, and 4th rows (duplicates);
for STU-BWF-1993, drop the 1st and 3rd rows;
for JKS-FPW-2010, drop the 1st row.

Please get back to me; I shall be very thankful.

        (Ridwan)






        • #34
          First, in order for this approach to be viable and meaningful, it is crucial that where two observations for the same country pair and year have non-missing values for a variable, they must have the same value. Otherwise the data are contradictory, which is a much bigger problem and requires a wholly different approach. Therefore this code begins by verifying that the non-missing data values are always consistent, before then retaining one of the observations (randomly and irreproducibly selected) that has the fewest missing values. (There could be two or more observations tied for the lowest number of missing values.)
Code:
* identify all variables other than the group identifiers and dup
ds country_k country_j year dup, not
local vbles `r(varlist)'

* verify that, within each country-pair-year group, the non-missing
* values of every variable are consistent
foreach v of varlist `vbles' {
    by country_k country_j year (`v'), sort: assert `v' == `v'[1] | missing(`v')
}

* count missing values per observation, then keep one observation with
* the fewest missing values in each group
egen mcount = rowmiss(`vbles')
by country_k country_j year (mcount), sort: keep if _n == 1
That said, I think that rather than simply choosing one of the observations with the lowest number of missing values, it would be better to combine all of the observations for a country pair and year into a single observation that picks up as many non-missing values from among them as possible. So what I would do instead is:
Code:
ds country_k country_j year dup, not
local vbles `r(varlist)'

* same consistency check as above
foreach v of varlist `vbles' {
    by country_k country_j year (`v'), sort: assert `v' == `v'[1] | missing(`v')
}

* combine each group into a single observation, taking the first
* non-missing value of each variable
collapse (firstnm) `vbles', by(country_k country_j year)
With this approach, if a variable has a non-missing value in any of the observations for a country pair and year, you will retain it, even if the only non-missing value happens to fall in an observation that has a lot of other variables with missing values.
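For instance, applied to the -dataex- example in #33, the combined ABW-ANT-1990 observation ends up with both x1 = 21.36 (from the first row of that triplet) and f3 = 12.56 (from its third row), which no single original row contains. A hypothetical check:

Code:
* hypothetical check, run after the -collapse- on the #33 example data
list country_k country_j year x1 f3 if country_k == "ABW"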
          Last edited by Clyde Schechter; 21 Mar 2023, 10:33.



          • #35
            Thank you Clyde Schechter for explaining it so well. This was really helpful.

Until now I was using a more ad-hoc approach to address the issue at hand, running the following code

Code:
sort year country_k country_j
quietly by year country_k country_j: gen dup = cond(_N==1,0,_n)

egen m1 = rowmiss(x1 x2 x3 y1 y2 y3 f1 f2 f3 f4)

bysort year country_k country_j: egen m2 = min(m1)
sort year country_k country_j dup
keep if m1==m2
This dropped the duplicates by retaining the rows with the minimum number of missing values on the variables x1-f4. However, duplicates still existed for some country pairs in some years, with missing values for all the variables/columns. Therefore, I had to drop those duplicates again by running
Code:
drop m1 m2 dup
sort year country_k country_j
quietly by year country_k country_j: gen dup = cond(_N==1,0,_n)

drop if dup>1

The last code you sent achieved the same thing (as the ad-hoc approach) in a more systematic and clean way. It saved me a lot of time. I am very thankful to you.



            Thanks and regards,
            (Ridwan)



            • #36
              Hi Ridwan,

The Stata message "variables year country_k country_j do not uniquely identify observations in the using data" commonly happens when your data are not uniquely identified because of duplicates. You should drop the duplicates to proceed with the merge.

Try the steps below to find the duplicates in your data set:

Code:
duplicates examples year country_k country_j
duplicates list year country_k country_j
duplicates tag year country_k country_j, gen(duple)   // generates the new variable "duple"
tab duple
These commands analyse the duplicates within your data set (on the variables you specify).

Then use this command to solve the problem:

Code:
duplicates drop year country_k country_j, force   // deletes the duplicate values, so observations drop

Now you can attempt to merge m:1 or 1:m.



              • #37
                #36 gives good advice, until it reaches -duplicates drop year country_k country_j, force-. There are two reasons you shouldn't use the -force- option, at least not yet:
                1. There may be observations having the same values of country_k country_j and year that have inconsistent, conflicting values for other variables. If you use -force-, Stata will pick one of those observations (at random, and not reproducibly) to keep and discard the rest. But you need to keep the one that is correct! (If there is one. Maybe you need to combine them in some way. Or maybe all of them are just wrong and then you need to delete all of them without leaving one behind.)
2. Even if all of the duplicate observations on country_k country_j and year are entirely consistent on all variables, you have to wonder why they are there in the first place. It usually means that there was an error in the creation of the data set. So you should trace back the creation of the data set and find the error that led to this condition. Where there is one mistake, you may find others as well. There is no point in analyzing incorrect data.
                So, after you have carefully inspected the offending observations and corrected the data set so that each combination of country_k country_j and year is associated with a single correct set of values for all variables, you will have no need to drop anything. In short, Stata is telling you that your data set is incorrect. Fix the data: don't just bludgeon it and blunder on.
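One way to look at the offending observations before deciding what to do (a minimal sketch, reusing the duple variable from #36):

Code:
* flag and inspect the groups that are duplicated on the key variables
duplicates tag year country_k country_j, gen(duple)
sort year country_k country_j
browse if duple > 0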



                • #38
                  Dear Clyde Schechter
I have a similar problem with another dataset, where I need to drop duplicates and retain a unique occurrence in such a manner that the rows (duplicates) containing less information are dropped, while the occurrence containing information on most of the variables (columns) is retained within each triplet.

As you suggested above, when I run

                  Code:
                  sort year iso_i iso_j  
                  quietly by year iso_i iso_j:  gen dup = cond(_N==1,0,_n)
                  
                  ds year iso_i iso_j dup, not
                  local vbles `r(varlist)'
                  foreach v of varlist `vbles' {
                      by year iso_i iso_j (`v'), sort: assert `v' == `v'[1] | missing(`v')
                  }
                  
                  collapse (firstnm) `vbles', by(year iso_i iso_j)
It produces an error just before the collapse:

                  30,800 contradictions in 2,032,128 observations
                  assertion is false

It does not drop the duplicates as I wish them to be dropped (as stated above: keep the informative observations and drop the duplicate ones).


Your help in this regard will be highly appreciated.

                  Thanks,
                  (Ridwan)



                  • #39
So, given groups of observations identical on year iso_i iso_j, you have at least one other variable with different non-missing values.

                    That doesn't seem to contradict anything else you've said. If it's a problem, all I can suggest is looking at the observations in question.
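For instance, a minimal sketch (assuming the key variables from #38):

Code:
* list some examples of the groups that are duplicated on the keys
sort year iso_i iso_j
duplicates examples year iso_i iso_j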



                    • #40
Thanks Nick Cox for replying; I highly appreciate it.
It is a little difficult to look at all the observations with at least one other variable having different non-missing values. Instead, I used a more ad-hoc approach and ran the following code.

Code:
sort year country_k country_j
quietly by year country_k country_j: gen dup = cond(_N==1,0,_n)

egen m1 = rowmiss(x1 x2 x3 x4 y1 y2 y3 f1 f2 f3 f4)
bysort year country_k country_j: egen m2 = min(m1)

sort year country_k country_j dup

keep if m1==m2
For the dataex above (#33), it gives me the required results (it drops the duplicate observations while retaining the ones with more information across the variables).

However, I was more interested in the type of solution provided by Clyde Schechter, but it did not work in my original data, so I had to follow the more ad-hoc approach shown here.

                      Thanks,
                      (Ridwan)



                      • #41
Ridwan, you are sleepwalking into danger here. The error message you report in #38 says that your data are not what you think they are. The code in #38 is designed for use when the observations in a group all contain the same single non-missing value of each variable, or the variable is missing. The error message is telling you that there is some combination of year, iso_i, and iso_j values, and some other variable (let's call it `v'), such that the group of observations with those values of year, iso_i, and iso_j contains two or more different, conflicting values of `v'. In this situation, you should not write quick code to reduce to a single observation: you have to step back, identify where these conflicting values are in the data, and then resolve the conflict between them. Otherwise you are just making an arbitrary selection of which values to keep, and you are likely to find that if you rerun the same code on the same data later, you will get different results.

By the way, the code in #40 contains an additional pitfall. If there are two or more observations for the same year, iso_i, and iso_j that happen to both (all) contain the minimum number of missing values among the other variables, both (all) of those will be retained, not just one.
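If a single, arbitrarily chosen survivor per group were truly acceptable, a hypothetical extra step after keep if m1==m2 might look like the line below; but, as above, that would only paper over the underlying problem:

Code:
* hypothetical tie-breaker: keep one arbitrary observation per group
by year country_k country_j (dup), sort: keep if _n == 1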

                        I urge you to go back and diagnose and fix the problems in your data. Here's an approach:
Code:
* identify all variables other than the group identifiers
ds year iso_i iso_j, not
local vbles `r(varlist)'

* flag each variable whose non-missing values conflict within a group
foreach v of varlist `vbles' {
    by year iso_i iso_j (`v'), sort: egen `v'_flag = max(`v' != `v'[1] & !missing(`v'))
}
egen any_flag = rowmax(*_flag)
browse if any_flag
This will show you the year iso_i iso_j groups that contain conflicting values on some variable(s). You will then have to study those and decide how this happened and how to fix it. It may be that one of the conflicting values of a variable is correct and the others are wrong, or you may find that none is correct. Most likely, this reflects an error in the data management leading up to the creation of your data set, and you should review that carefully and fix any errors you uncover along the way.
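As a possible follow-up, one could count how many observations each variable flags, to see which variables drive the conflicts (a sketch, assuming the *_flag variables created above are still in memory):

Code:
* count flagged observations per variable
foreach v of varlist `vbles' {
    quietly count if `v'_flag
    if r(N) > 0 display "`v': " r(N) " observations in conflicting groups"
}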



                        • #42
Thank you very much Clyde Schechter. I have been carefully looking at the data to understand the possible reasons for the conflicting observations. After assistance from the data organization team managing the database, we have been able to understand the reasons. The code in #41 was very useful for flagging such observations.

                          Thank you,
                          (Ridwan)

