Identifying non-unique entries in panel data

Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#1

14 Aug 2014, 16:41

Hello all,

I have the following problem. I am using a panel dataset in which individuals are asked, among other things, about their gender. After running some models, it appears that some of them did not report the right gender (maybe by mistake). My panel is unbalanced: some people appear only once, and some up to eight (8) times.

Now, for the sake of an example assume that we have only three people and only three years. We have the following variables:

year id gender gender_id
t1 1 M 1M
t2 1 F 1F
t3 1 F 1F
----------------------------------------------
t1 2 M 2M
t2 2 M 2M
t3 2 M 2M
----------------------------------------------
t1 3 F 3F
t2 3 F 3F
t3 3 F 3F

Above, the variable gender_id is the concatenation of "id" and "gender". As can be seen from this example, the first individual has falsely reported her gender in t1.

I am asking if there is a way to identify these individuals, for example via a dummy variable, so that I can act accordingly when I estimate my models. That is, how can I create a dummy that takes the value 1 for every observation with id = 1 in the above example?

Thank you in advance.
Tags: unique_entries
Andrew Maurer

Join Date: Apr 2014

Posts: 28
#2

14 Aug 2014, 17:31

I can't think of a more efficient way than the following (untested):

Code:

bysort id (year): gen byte false_reporter = sum(gender != gender[_n-1] & _n > 1) != 0
by id: replace false_reporter = false_reporter[_N]

Richard Williams

Join Date: Apr 2014
Posts: 4888
#3

14 Aug 2014, 17:31

If you create a numeric variable for gender you can do this:

Code:

list, sepby(id)
egen min = min(gender), by(id)
egen max = max(gender), by(id)
gen problem = min != max
list if problem, sepby(id)

Here is the output for your example:

Code:

. list, sepby(id)

     +-------------+
     | id   gender |
     |-------------|
  1. |  1        1 |
  2. |  1        2 |
  3. |  1        2 |
     |-------------|
  4. |  2        1 |
  5. |  2        1 |
  6. |  2        1 |
     |-------------|
  7. |  3        2 |
  8. |  3        2 |
  9. |  3        2 |
     +-------------+

. 
. egen min = min(gender), by(id)
. egen max = max(gender), by(id)
. gen problem = min != max
. list if problem, sepby(id)

     +-----------------------------------+
     | id   gender   min   max   problem |
     |-----------------------------------|
  1. |  1        1     1     2         1 |
  2. |  1        2     1     2         1 |
  3. |  1        2     1     2         1 |
     +-----------------------------------+

Especially when there are only 2 records though, how do you know which record is the one with the error?

My code is assuming no missing data. If missing data is an issue, further tweaks are needed.
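
One possible tweak, offered as an untested sketch: egen's min() and max() ignore missing values, so an id with gender values 1, 1 and . would not be flagged by the code above. The example below also counts missing values within each id and flags those cases. The variable names nmiss and problem2 and the toy data are invented for illustration, and gender is assumed to be numeric.

```stata
clear
input id gender
1 1
1 2
1 2
2 1
2 1
2 .
3 2
3 2
3 2
end

* min() and max() skip missing values within each id
egen min = min(gender), by(id)
egen max = max(gender), by(id)

* count missing gender values within each id (hypothetical helper variable)
egen nmiss = total(missing(gender)), by(id)

* flag ids with conflicting values or with any missing value
gen byte problem2 = (min != max) | (nmiss > 0)
list, sepby(id)
```

Here id 1 is flagged because of conflicting values, id 2 because of the missing value, and id 3 is left unflagged.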

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 18.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam


Robert Picard

Join Date: Mar 2014

Posts: 1536
#4

14 Aug 2014, 17:42

This can be done in a single line:

Code:

bysort id (gender): gen check = gender[1] != gender[_N]
Nick Cox

Join Date: Mar 2014

Posts: 35193
#5

14 Aug 2014, 17:45

I tend to use Robert's approach. The one-liner is spelled out at excruciating length in an FAQ, http://www.stata.com/support/faqs/da...ions-in-group/
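
For anyone who wants to see the one-liner at work, here is a minimal sketch on the toy data from #1, with gender kept as a string. The idea is that once the data are sorted by gender within id, all values within an id are identical exactly when the first and last values agree. The dataset is typed in by hand here and the variable name check follows Robert's post; nothing else is assumed.

```stata
clear
input str2 year id str1 gender
"t1" 1 "M"
"t2" 1 "F"
"t3" 1 "F"
"t1" 2 "M"
"t2" 2 "M"
"t3" 2 "M"
"t1" 3 "F"
"t2" 3 "F"
"t3" 3 "F"
end

* sort by id and by gender within id, then compare the first
* and last gender value of each id
bysort id (gender): gen byte check = gender[1] != gender[_N]

* restore the panel order for inspection
sort id year
list, sepby(id)
```

Every observation of id 1 gets check equal to 1, so the flag marks the person rather than the suspect record.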
Richard Williams

Join Date: Apr 2014

Posts: 4888
#6

14 Aug 2014, 17:54

Robert's code is simpler than mine and does not require creating a numeric gender variable. You still need extra steps if missing data (MD) on gender is an issue.

Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#7

14 Aug 2014, 18:10

Thank you all for the time that you spent on this matter!
Richard Williams

Join Date: Apr 2014

Posts: 4888
#8

14 Aug 2014, 18:15

Here is a slight tweak to Robert's code that would also flag any cases with missing data on gender:

Code:

bysort id (gender): gen check = gender[1] != gender[_N] | gender[_N] >= .
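
Note that the line above relies on numeric missing values sorting last. If gender is a string, missing is the empty string, which sorts first, and comparing a string with . is a type mismatch. A hedged, untested sketch that should cover both storage types uses missing(), which accepts numeric and string arguments. The name check2 and the toy data are made up for illustration.

```stata
clear
input id str1 gender
1 "M"
1 "F"
2 "M"
2 ""
3 "F"
3 "F"
4 ""
end

* after sorting within id, a missing value sits at one of the two ends:
* "" sorts first for strings, . sorts last for numerics
bysort id (gender): gen byte check2 = gender[1] != gender[_N] | missing(gender[1], gender[_N])
list, sepby(id)
```

In this example ids 1, 2 and 4 are flagged and id 3 is not. The missing() term matters mainly when every value for an id is missing, as for id 4; a mix of missing and nonmissing values is already caught by the inequality.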

