  • How can I specify duplicated cases between two groups?

    The data set below is an example that I made. The original data has more cases than shown below, and I cannot upload it for confidentiality reasons.
    id DOB year
    1 901020 2014
    2 981020 2014
    3 841028 2014
    4 790707 2014
    5 800301 2014
    6 900103 2014
    7 990328 2014
    8 780812 2014
    9 890302 2014
    10 900101 2014
    11 901020 2015
    12 981020 2015
    13 901020 2015
    14 770707 2015
    15 981020 2015
    16 981020 2015
    17 990328 2015
    18 740403 2015
    19 730203 2015
    20 980202 2015
    DOB: date of birth; year: the year of the survey

    I would like to tag duplicated cases, that is, cases that have the same DOB in two different year groups.
    Could you show me code for handling this?
    Last edited by Ho Sung; 04 Mar 2020, 07:02.

  • #2
    Code:
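    * Group by DOB, sort by year within each group, and tag groups that span more than one year
    bys DOB (year): gen tag = _N>1 & year[1]!=year[_N]
    * Show the tagged cases, with a separator line between DOB groups
    list if tag, sepby(DOB)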
    Last edited by Andrew Musau; 04 Mar 2020, 07:30.

    • #3
      It works! Thank you so much!

      • #4
        Dear all

        Hope you don't mind my chiming in. Thanks Andrew - I too regularly encounter data processing tasks like Ho's example.

        Can I please take this opportunity to check my understanding of your code? My understanding of your train of thought is as follows:

        1. Sort the data first by DOB and, within each DOB, by year.
        In this way, we can compare whether the first and last values within each group differ.
        Code:
         bysort DOB (year):
        2. Check that each DOB group contains at least 2 observations, i.e. we don't want to set tag == 1 for single-observation groups. Hence the condition is
        Code:
         _N > 1
        3. Compare the first & last values in each (sorted) group.
        Code:
         year[1] != year[_N]
        Putting everything together, we have your code above
        Code:
         bysort DOB (year): generate tag = (_N > 1) & year[1] != year[_N]
        Is my interpretation above broadly correct? (I'm trying to improve my skills in these data processing tasks) Many thanks.
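One way to check each step is to store the intermediate quantities in their own variables. The following is only a minimal sketch: the variable names nobs, first_year, and last_year are made up for illustration, and the rows are a subset of the example data in #1.

```stata
clear
input id DOB year
1  901020 2014
3  841028 2014
7  990328 2014
11 901020 2015
13 901020 2015
17 990328 2015
end

* Step 1: groups are defined by DOB; within each group, rows are sorted by year
* Step 2: _N is the number of observations in the current DOB group
bysort DOB (year): generate nobs = _N

* Step 3: after sorting, year[1] is the earliest and year[_N] the latest year in the group
by DOB (year): generate first_year = year[1]
by DOB (year): generate last_year  = year[_N]

* Combine the two conditions exactly as in the one-line version
generate tag = (nobs > 1) & first_year != last_year

list, sepby(DOB)
```

In this subset, tag is 1 for the 901020 and 990328 groups, which appear in both years, and 0 for 841028, which appears only once.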

        • #5
          1. Sort the data first by DOB and, within each DOB, by year.
          In this way, we can compare whether the first and last values within each group differ.
          Code:
          bysort DOB (year):
          Yes, but including the parentheses tells Stata not to consider year as a grouping variable (only a sorting variable). Thus, groups are defined by DOB. Excluding the parentheses will imply groups are defined by both DOB and year.
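The effect of the parentheses can be seen by running the same expression both ways. This is a small sketch using three rows from the example data in #1; the variable names tag_paren and tag_noparen are invented for the comparison.

```stata
clear
input id DOB year
1  901020 2014
11 901020 2015
13 901020 2015
end

* year in parentheses: one group per DOB, sorted by year within the group
bysort DOB (year): generate tag_paren = year[1] != year[_N]

* no parentheses: one group per DOB-year pair, so year is constant within each group
bysort DOB year: generate tag_noparen = year[1] != year[_N]

list, sepby(DOB)
```

With the parentheses, all three rows are tagged, because the single DOB group spans 2014 and 2015. Without them, year[1] and year[_N] are always equal inside a DOB-year group, so tag_noparen is 0 everywhere.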

          2. Check that each DOB group contains at least 2 observations, i.e. we don't want to set tag == 1 for single-observation groups. Hence the condition is
          Code:
          _N > 1
          3. Compare the first & last values in each (sorted) group.
          Code:
          year[1] != year[_N]
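          Both interpretations are correct, but notice that the first condition is redundant: in a group with a single observation, year[1] and year[_N] refer to the same observation, so the second condition is already false. It was sloppiness on my part, because my initial code was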

          Code:
          bys DOB (year): gen tag = _N>1
          before I noticed that there were duplicates of DOB and year in the dataset. Absent such duplicates, that initial code would have been sufficient, because any group with more than 1 observation would have included multiple years. Therefore, to simplify, the following suffices:

          Code:
          bys DOB (year): gen tag = year[1]!=year[_N]
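For anyone who wants to confirm this on the example data from #1, the sketch below compares the two-condition and one-condition versions. The variable names tag_full, tag_simple, tag_n, and tag_year are invented here, and the 21st row is a hypothetical addition that is not part of the original data; it is included only to show a DOB group with two observations in the same year.

```stata
clear
input id DOB year
1  901020 2014
2  981020 2014
3  841028 2014
4  790707 2014
5  800301 2014
6  900103 2014
7  990328 2014
8  780812 2014
9  890302 2014
10 900101 2014
11 901020 2015
12 981020 2015
13 901020 2015
14 770707 2015
15 981020 2015
16 981020 2015
17 990328 2015
18 740403 2015
19 730203 2015
20 980202 2015
end

* The version with _N > 1 and the simplified version give identical tags
bysort DOB (year): generate tag_full   = _N > 1 & year[1] != year[_N]
by DOB (year):     generate tag_simple = year[1] != year[_N]
assert tag_full == tag_simple
count if tag_simple

* Hypothetical extra row (NOT in the original example): a second 770707 case in 2015
set obs 21
replace id   = 21     in 21
replace DOB  = 770707 in 21
replace year = 2015   in 21

* _N > 1 alone now tags the 770707 group; the year comparison does not
bysort DOB (year): generate tag_n    = _N > 1
by DOB (year):     generate tag_year = year[1] != year[_N]
list id DOB year tag_n tag_year if tag_n != tag_year
```

On the original 20 rows, count reports 9 tagged observations (the 901020, 981020, and 990328 groups). After the hypothetical row is added, the final list shows ids 14 and 21, which is the case where counting observations alone would wrongly flag a DOB seen in only one year.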

          • #6
            Thank you very much indeed Andrew.
            This is very helpful. Much appreciated.
