Hello, I have a patient-level dataset that looks something like the first table below. There are some data entry errors in the country variable.
The assumption is that the country entered in sequence 1 is correct. I want to create a variable that identifies the sequence number with the incorrect entry, something like the "Error sequence" column in the second table below.
Any idea how to create this? I know there is an indirect way of doing this, such as the following, but I have quite a few variables like "country" that have data entry errors, and I was wondering if there is an easier way of doing this.
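* Take the country recorded at sequence 1 as the reference value
gen country_x = country if sequence == 1
* Copy the reference to all rows of the patient (empty strings sort first, so [_N] is non-empty)
bysort patient_id (country_x): replace country_x = country_x[_N] if country_x == ""
gen errorsequence = sequence if country != country_x
* Spread the lowest error sequence to all rows of the patient (missing values sort last)
bysort patient_id (errorsequence): replace errorsequence = errorsequence[1] if errorsequence == .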
I appreciate your support.
Patient ID | Sequence | Country |
A | 1 | USA |
A | 2 | UK |
B | 1 | ALGERIA |
B | 2 | ALBANIA |
B | 3 | ALGERIA |
C | 1 | BANGLADESH |
C | 2 | BANGLADESH |
C | 3 | BANGLADESH |
C | 4 | BULGARIA |
D | 1 | USA |
D | 2 | USA |
D | 3 | UK |
The desired result, with the new variable added, would be something like the following:
Patient ID | Sequence | Country | Error sequence |
A | 1 | USA | 2 |
A | 2 | UK | 2 |
B | 1 | ALGERIA | 2 |
B | 2 | ALBANIA | 2 |
B | 3 | ALGERIA | 2 |
C | 1 | BANGLADESH | 4 |
C | 2 | BANGLADESH | 4 |
C | 3 | BANGLADESH | 4 |
C | 4 | BULGARIA | 4 |
D | 1 | USA | 3 |
D | 2 | USA | 3 |
D | 3 | UK | 3 |
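
One possible way to handle several variables at once is to loop over them and compare each value directly with the patient's first-sequence value, which avoids creating a helper copy such as country_x. The following is only a minimal sketch under some assumptions: the variable names patient_id, sequence and country match the example tables, the names errseq_* and bad_* and the local macro checkvars are made up for illustration, and the value at the lowest sequence within each patient is taken as correct. Patients with no mismatch get a missing value.

```stata
clear
input str1 patient_id byte sequence str10 country
"A" 1 "USA"
"A" 2 "UK"
"B" 1 "ALGERIA"
"B" 2 "ALBANIA"
"B" 3 "ALGERIA"
"C" 1 "BANGLADESH"
"C" 2 "BANGLADESH"
"C" 3 "BANGLADESH"
"C" 4 "BULGARIA"
"D" 1 "USA"
"D" 2 "USA"
"D" 3 "UK"
end

* Assumption: every variable listed here should be constant within patient,
* and its value at the lowest sequence is the correct one.
* Add further variable names to this (hypothetical) list as needed.
local checkvars country

foreach v of local checkvars {
    * Flag rows whose value differs from the patient's first-sequence value
    bysort patient_id (sequence): gen byte bad_`v' = `v' != `v'[1]
    * Lowest sequence with a mismatch, copied to every row of the patient
    egen errseq_`v' = min(cond(bad_`v', sequence, .)), by(patient_id)
    drop bad_`v'
}

list, sepby(patient_id)
```

If a patient has more than one incorrect entry, this records only the first one, which is also what the indirect approach in the question does.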