How to create a variable that will identify the incorrect sequence number?

Shanaz Sadique

Join Date: Apr 2020
Posts: 10

14 Apr 2022, 04:51

Hello, I have a patient-level dataset that looks something like the following. There are some data entry errors in the country variable.

Patient ID	Sequence	Country
A	1	USA
A	2	UK
B	1	ALGERIA
B	2	ALBANIA
B	3	ALGERIA
C	1	BANGLADESH
C	2	BANGLADESH
C	3	BANGLADESH
C	4	BULGARIA
D	1	USA
D	2	USA
D	3	UK

The assumption is that the country entered in sequence 1 is correct. I want to create a variable that identifies the sequence number with the incorrect entry, something like the following:

Patient ID	Sequence	Country	Error sequence
A	1	USA	2
A	2	UK	2
B	1	ALGERIA	2
B	2	ALBANIA	2
B	3	ALGERIA	2
C	1	BANGLADESH	4
C	2	BANGLADESH	4
C	3	BANGLADESH	4
C	4	BULGARIA	4
D	1	USA	3
D	2	USA	3
D	3	UK	3

Any idea how to create this? I know there is an indirect way of doing it, such as the following, but I have quite a few variables like "country" that have data entry errors, and I was wondering whether there is an easier way.

gen country_x=country if sequence==1
bysort patient_id (country_x): replace country_x= country_x[_N] if country_x==""
gen errorsequence=sequence if country~=country_x
bysort patient_id (errorsequence): replace errorsequence = errorsequence[1] if errorsequence==.

I appreciate your support.

Nick Cox

Join Date: Mar 2014
Posts: 35696

14 Apr 2022, 05:30

Please use dataex for data examples. https://www.statalist.org/forums/help#stata explains.

You were asked to do that previously: https://www.statalist.org/forums/for...within-a-group

Your examples are limited to sequences with just one "incorrect" entry, so say nothing about what you want if

1. There is no incorrect entry.

2. There are two or more incorrect entries.

This works for what you show.

Code:

clear
input str1 patientid byte sequence str10 country
"A" 1 "USA"      
"A" 2 "UK"        
"B" 1 "ALGERIA"  
"B" 2 "ALBANIA"  
"B" 3 "ALGERIA"  
"C" 1 "BANGLADESH"
"C" 2 "BANGLADESH"
"C" 3 "BANGLADESH"
"C" 4 "BULGARIA"  
"D" 1 "USA"      
"D" 2 "USA"      
"D" 3 "UK"        
end

bysort patientid (sequence) : gen first = country[1]
egen wanted = total(sequence * (country != first)), by(patientid)

list, sepby(patientid)



     +--------------------------------------------------------+
     | patien~d   sequence      country        first   wanted |
     |--------------------------------------------------------|
  1. |        A          1          USA          USA        2 |
  2. |        A          2           UK          USA        2 |
     |--------------------------------------------------------|
  3. |        B          1      ALGERIA      ALGERIA        2 |
  4. |        B          2      ALBANIA      ALGERIA        2 |
  5. |        B          3      ALGERIA      ALGERIA        2 |
     |--------------------------------------------------------|
  6. |        C          1   BANGLADESH   BANGLADESH        4 |
  7. |        C          2   BANGLADESH   BANGLADESH        4 |
  8. |        C          3   BANGLADESH   BANGLADESH        4 |
  9. |        C          4     BULGARIA   BANGLADESH        4 |
     |--------------------------------------------------------|
 10. |        D          1          USA          USA        3 |
 11. |        D          2          USA          USA        3 |
 12. |        D          3           UK          USA        3 |
     +--------------------------------------------------------+
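To make the two open cases listed above concrete, here is a small sketch with two made-up patients: E has no incorrect entry and F has two. The patients and the variable name first_error are illustrative assumptions, not part of the original data. It shows what total() returns in each case, alongside one possible alternative that records the sequence number of the first mismatch only.

Code:

clear
input str1 patientid byte sequence str10 country
"E" 1 "USA"
"E" 2 "USA"
"F" 1 "USA"
"F" 2 "UK"
"F" 3 "UK"
end

bysort patientid (sequence) : gen first = country[1]

* total() adds up the sequence numbers of all mismatches:
* 0 for E (no mismatch) and 2 + 3 = 5 for F (two mismatches)
egen wanted = total(sequence * (country != first)), by(patientid)

* possible alternative: sequence number of the first mismatch,
* missing if there is no mismatch (2 for F, missing for E)
egen first_error = min(cond(country != first, sequence, .)), by(patientid)

list, sepby(patientid)

Whether 0, a sum, the first mismatch or something else is the right answer in these cases depends on what is wanted for the real data.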

A more general way to flag whatever differs from the first value is

Code:

bysort patientid (sequence) : gen different_whatever = whatever != whatever[1]
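
Since the question mentions several variables with the same kind of error, the flag above could be wrapped in a loop. This is only a sketch: it assumes the example data above are in memory, and the new variable names different_* and error_* are made up for illustration. Only country exists in the example, so other variable names would need to be added to the list.

Code:

* add further variable names after country as needed
foreach v of varlist country {
    * 1 if the value differs from the first value within patient, 0 otherwise
    bysort patientid (sequence) : gen byte different_`v' = `v' != `v'[1]
    * sequence number of the first differing entry; missing if none differs
    egen error_`v' = min(cond(different_`v', sequence, .)), by(patientid)
}

The comparison != works for string and numeric variables alike, so the loop does not need to treat them separately.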

Last edited by Nick Cox; 14 Apr 2022, 05:35.

Shanaz Sadique

Join Date: Apr 2020

Posts: 10
#3

25 Apr 2022, 01:15

Hi Nick, thank you for your response.

I find dataex a little challenging. I have referred to the help guide, but I would appreciate it if you could point me to additional material that would help.

Also, I didn't quite understand what you meant here:

Your examples are limited to sequences with just one "incorrect" entry, so say nothing about what you want if

1. There is no incorrect entry.

2. There are two or more incorrect entries.
Nick Cox

Join Date: Mar 2014

Posts: 35696
#4

25 Apr 2022, 04:21

There is no material on dataex beyond its help. I suggest that you run the examples in its help.
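
As a minimal sketch of what running dataex looks like, the following uses the auto dataset that ships with Stata. It assumes dataex is available, which is the case in Stata 15.1 or later and in fully updated 14.2; in older versions it can be installed with ssc install dataex.

Code:

sysuse auto, clear

* write out the first five observations of three variables as input code
dataex make price mpg in 1/5

* for the data in this thread, the equivalent would be something like
* dataex patientid sequence country in 1/12

The output is a block of Stata code that can be copied and pasted into a post between CODE delimiters, so that others can recreate the data exactly.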

Somewhat similarly, I don't know how to explain my questions differently. Does the code I suggest do what you want for your real data? If not, you need to explain what you want that is different.