  • Identifying non-unique entries in panel data

    Hello all,

    I have the following problem. I am using a panel database where individuals, among other things, are asked about their gender. After running some models it appears to me that some of them did not report the right gender (maybe by mistake). My database is unbalanced, some people appear only once, and some up to eight (8) times.

    Now, for the sake of an example assume that we have only three people and only three years. We have the following variables:

    year   id   gender   gender_id
    t1     1    M        1M
    t2     1    F        1F
    t3     1    F        1F
    ----------------------------------------------
    t1     2    M        2M
    t2     2    M        2M
    t3     2    M        2M
    ----------------------------------------------
    t1     3    F        3F
    t2     3    F        3F
    t3     3    F        3F

    Above, the variable gender_id is the concatenation of "id" and "gender". As can be seen from this example, the first individual reported a different gender in t1 than in t2 and t3.

    Is there a way to identify these individuals, for example via a dummy variable, so that I can act accordingly when I estimate my models? That is, how do I create a dummy that takes the value 1 for every observation of an inconsistent individual (id = 1 in the above example)?

    Thank you in advance.

  • #2
    I can't think of a more efficient way than the following (untested):

    Code:
    bysort id (year): gen byte false_reporter = sum(gender != gender[_n-1] & _n > 1) != 0
    by id: replace false_reporter = false_reporter[_N]


    • #3
      If you create a numeric variable for gender you can do this:

      Code:
      list, sepby(id)
      egen min = min(gender), by(id)
      egen max = max(gender), by(id)
      gen problem = min != max
      list if problem, sepby(id)
      Here is the output for your example:

      Code:
      . list, sepby(id)
      
           +-------------+
           | id   gender |
           |-------------|
        1. |  1        1 |
        2. |  1        2 |
        3. |  1        2 |
           |-------------|
        4. |  2        1 |
        5. |  2        1 |
        6. |  2        1 |
           |-------------|
        7. |  3        2 |
        8. |  3        2 |
        9. |  3        2 |
           +-------------+
      
      . 
      . egen min = min(gender), by(id)
      . egen max = max(gender), by(id)
      . gen problem = min != max
      . list if problem, sepby(id)
      
           +-----------------------------------+
           | id   gender   min   max   problem |
           |-----------------------------------|
        1. |  1        1     1     2         1 |
        2. |  1        2     1     2         1 |
        3. |  1        2     1     2         1 |
           +-----------------------------------+
      Especially when there are only 2 records though, how do you know which record is the one with the error?

      My code is assuming no missing data. If missing data is an issue, further tweaks are needed.
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      StataNow Version: 18.5 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam
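The numeric gender variable that #3 relies on is not shown in the thread. A minimal sketch of one way to build it, assuming gender is stored as the strings "M" and "F" as in the original example, is to use encode. The variable name gender_num and the typed-in data are illustrative only. Note that encode numbers the categories alphabetically (F = 1, M = 2), which differs from the coding in the listing above but does not affect the comparison.

```stata
clear
input str2 year id str1 gender
"t1" 1 "M"
"t2" 1 "F"
"t3" 1 "F"
"t1" 2 "M"
"t2" 2 "M"
"t3" 2 "M"
"t1" 3 "F"
"t2" 3 "F"
"t3" 3 "F"
end

* gender_num is a hypothetical name; encode assigns integer codes in alphabetical order
encode gender, gen(gender_num)

* The within-id minimum and maximum differ only when gender is not constant
egen min = min(gender_num), by(id)
egen max = max(gender_num), by(id)
gen byte problem = min != max
list if problem, sepby(id)
```

Because egen's min() and max() ignore missing values, an id whose nonmissing records all agree is not flagged by this sketch even if some of its records are missing.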


      • #4
        This can be done in a single line:

        Code:
        bysort id (gender): gen check = gender[1] != gender[_N]


        • #5
          I tend to use Robert's approach. The one-liner is spelled out at excruciating length in an FAQ, http://www.stata.com/support/faqs/da...ions-in-group/


          • #6
            Robert's code is simpler than mine and does not require creating a numeric gender variable. You still need extra steps if missing data (MD) on gender is an issue.
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology
            StataNow Version: 18.5 MP (2 processor)

            EMAIL: [email protected]
            WWW: https://www3.nd.edu/~rwilliam


            • #7
              Thank you all for the time you spent on this matter!


              • #8
                Here is a slight tweak to Robert's code that would also flag any cases with missing data (MD) on gender. It assumes gender is numeric, so that missing values sort last within each id:

                Code:
                bysort id (gender): gen check = gender[1] != gender[_N] | gender[_N] >= .
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                StataNow Version: 18.5 MP (2 processor)

                EMAIL: [email protected]
                WWW: https://www3.nd.edu/~rwilliam
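The tweak in #8 compares against the numeric missing value, so it does not carry over directly to a string gender variable like the one in the original question. A hedged sketch of a string version follows, assuming missing gender is stored as the empty string. The data, including ids 3 and 4, are made up for illustration. Empty strings sort first rather than last, so the missing check looks at the first record of each id instead of the last.

```stata
clear
input id str1 gender
1 "M"
1 "F"
1 "F"
2 "M"
2 "M"
2 "M"
3 ""
3 ""
4 "F"
4 ""
end

* Within each id, sort on gender: "" comes before "F" and "M",
* so any missing value ends up in position 1 of the group
bysort id (gender): gen byte check = gender[1] != gender[_N] | missing(gender[1])
list, sepby(id)
```

Here id 1 is flagged for inconsistent values, ids 3 and 4 for missing values, and id 2 is left unflagged.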
