  • How to create a variable that will identify the incorrect sequence number?

    Hello, I have a patient-level dataset that looks something like the following. There are some data entry errors in the country variable.
    Patient ID Sequence Country
    A 1 USA
    A 2 UK
    B 1 ALGERIA
    B 2 ALBANIA
    B 3 ALGERIA
    C 1 BANGLADESH
    C 2 BANGLADESH
    C 3 BANGLADESH
    C 4 BULGARIA
    D 1 USA
    D 2 USA
    D 3 UK

    The assumption is that the country entered in sequence 1 is correct. I want to create a variable that identifies the sequence number with the incorrect entry, something like the following:

    Patient ID Sequence Country Error sequence
    A 1 USA 2
    A 2 UK 2
    B 1 ALGERIA 2
    B 2 ALBANIA 2
    B 3 ALGERIA 2
    C 1 BANGLADESH 4
    C 2 BANGLADESH 4
    C 3 BANGLADESH 4
    C 4 BULGARIA 4
    D 1 USA 3
    D 2 USA 3
    D 3 UK 3
    Any idea how to create this? I know there is an indirect way of doing this, such as the following, but I have quite a few variables like "country" that have data entry errors, and I was wondering whether there is an easier way.

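    * copy the sequence-1 country, then spread it to every row of the patient
    gen country_x = country if sequence == 1
    bysort patient_id (country_x): replace country_x = country_x[_N] if country_x == ""
    * record the sequence of the mismatch, then spread it within the patient
    gen errorsequence = sequence if country != country_x
    bysort patient_id (errorsequence): replace errorsequence = errorsequence[1] if errorsequence == .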

    I appreciate your support.

  • #2
    Please use dataex for data examples. https://www.statalist.org/forums/help#stata explains.

    You were asked to do that previously: https://www.statalist.org/forums/for...within-a-group

    Your examples are limited to sequences with just one "incorrect" entry, so say nothing about what you want if

    1. There is no incorrect entry.

    2. There are two or more incorrect entries.

    This works for what you show.


    Code:
    clear
    input str1 patientid byte sequence str10 country
    "A" 1 "USA"      
    "A" 2 "UK"        
    "B" 1 "ALGERIA"  
    "B" 2 "ALBANIA"  
    "B" 3 "ALGERIA"  
    "C" 1 "BANGLADESH"
    "C" 2 "BANGLADESH"
    "C" 3 "BANGLADESH"
    "C" 4 "BULGARIA"  
    "D" 1 "USA"      
    "D" 2 "USA"      
    "D" 3 "UK"        
    end
    
    bysort patientid (sequence) : gen first = country[1]
    egen wanted = total(sequence * (country != first)), by(patientid)
    
    list, sepby(patientid)
    
    
    
         +--------------------------------------------------------+
         | patien~d   sequence      country        first   wanted |
         |--------------------------------------------------------|
      1. |        A          1          USA          USA        2 |
      2. |        A          2           UK          USA        2 |
         |--------------------------------------------------------|
      3. |        B          1      ALGERIA      ALGERIA        2 |
      4. |        B          2      ALBANIA      ALGERIA        2 |
      5. |        B          3      ALGERIA      ALGERIA        2 |
         |--------------------------------------------------------|
      6. |        C          1   BANGLADESH   BANGLADESH        4 |
      7. |        C          2   BANGLADESH   BANGLADESH        4 |
      8. |        C          3   BANGLADESH   BANGLADESH        4 |
      9. |        C          4     BULGARIA   BANGLADESH        4 |
         |--------------------------------------------------------|
     10. |        D          1          USA          USA        3 |
     11. |        D          2          USA          USA        3 |
     12. |        D          3           UK          USA        3 |
         +--------------------------------------------------------+
    A more general way to flag whatever differs from the first value is

    Code:
    bysort patientid (sequence) : gen different_whatever = whatever != whatever[1]
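
    Since the question mentions several variables with the same kind of error, the two steps can be wrapped in a loop. This is only a sketch: the names diff_* and errseq_* are made up for illustration, and only country appears in the example data, so any other variable names would have to be added to the list. The total() trick still assumes at most one mismatch per patient.

    ```stata
    * add further variable names after country as needed
    foreach v of varlist country {
        * 1 where the value differs from the patient's first value, else 0
        bysort patientid (sequence) : gen byte diff_`v' = `v' != `v'[1]
        * sequence number of the single mismatch (0 if there is none)
        egen errseq_`v' = total(sequence * diff_`v'), by(patientid)
    }
    ```

    The comparison with `v'[1] works for string and numeric variables alike.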
    Last edited by Nick Cox; 14 Apr 2022, 06:35.


    • #3
      Hi Nick, thank you for your response.

      I find dataex a little challenging. I have consulted the help guide, but I would appreciate it if you could point me to additional material that would help.

      Also, I didn't quite understand what you meant here:
      Your examples are limited to sequences with just one "incorrect" entry, so say nothing about what you want if

      1. There is no incorrect entry.

      2. There are two or more incorrect entries.


      • #4
        There is no material on dataex beyond its help. I suggest that you run the examples in its help.
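
        As a minimal sketch of what running dataex looks like, assuming it is already installed (otherwise ssc install dataex should fetch it), the lines below read in the first patient from the example in #2 and ask dataex to print it as copy-and-paste input code.

        ```stata
        clear
        input str1 patientid byte sequence str10 country
        "A" 1 "USA"
        "A" 2 "UK"
        end

        * print the listed variables and observations as code for a forum post
        dataex patientid sequence country in 1/2
        ```

        The output shown in the Results window is what should be pasted into the post.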

        Somewhat similarly, I don't know how to explain my questions differently. Does the code I suggest do what you want for your real data? If not, you need to explain what you want that is different.
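
        To make the two cases concrete, here is a sketch with two hypothetical patients, E and F, that are not in the original data. E has no entry that differs from the first one; F has two. The variables first_bad and n_bad are illustrative names, and taking the first mismatching sequence is only one possible rule, not necessarily what is wanted.

        ```stata
        clear
        input str1 patientid byte sequence str10 country
        "E" 1 "USA"
        "E" 2 "USA"
        "F" 1 "USA"
        "F" 2 "UK"
        "F" 3 "FRANCE"
        end

        bysort patientid (sequence) : gen first = country[1]

        * the code from #2: sums the sequence numbers of all mismatches
        egen wanted = total(sequence * (country != first)), by(patientid)

        * one alternative: the first mismatching sequence, missing if none
        egen first_bad = min(cond(country != first, sequence, .)), by(patientid)

        * number of mismatching entries per patient
        egen n_bad = total(country != first), by(patientid)

        list, sepby(patientid)
        ```

        For E, wanted is 0 and first_bad is missing. For F, wanted is 5 (2 + 3), which is not a sequence number at all, while first_bad is 2 and n_bad is 2.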
