Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Match one column to another

    I have a dataset with two variables which look something like this.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str15(prim_key spouse_id)
    "101000100040101" "101000100040102"
    "101000100040102" "101000100040101"
    "101000100130101" "101000100130102"
    "101000100130102" "101000100130101"
    "101000100130109" ""              
    "101000100250106" ""              
    "101000100320101" "101000100320102"
    "101000100320102" "101000100320101"
    "101000100320109" ""              
    "101000100370102" "101000100370101"
    "101000100370111" ""              
    "101000100590101" "101000100590102"
    "101000100590102" "101000100590101"
    "101000100760101" "101000100760102"
    "101000100760102" "101000100760101"
    "101000101040101" "101000101040102"
    "101000101040102" "101000101040101"
    "101000101330101" ""              
    "101000101720101" ""              
    "101000101890107" ""              
    "101000102490101" "101000102490102"
    "101000102490102" "101000102490101"
    "101000103350101" "101000103350102"
    "101000103350102" "101000103350101"
    "101000103550101" ""              
    "101000103550102" ""              
    "101000103690101" "101000103690102"
    "101000103690102" "101000103690101"
    "101000104030101" "101000104030102"
    "101000104030102" "101000104030101"
    "101000104240101" "101000104240102"
    "101000104240102" "101000104240101"
    "101000200050101" "101000200050102"
    "101000200050102" "101000200050101"
    "101000200160109" ""              
    "101000200210101" "101000200210102"
    "101000200210102" "101000200210101"
    "101000200210108" ""              
    "101000200270101" ""              
    "101000200290108" ""              
    "101000200290109" ""              
    "101000200340101" "101000200340102"
    "101000200340102" "101000200340101"
    "101000200470101" "101000200470102"
    "101000200470102" "101000200470101"
    "101000200740101" ""              
    "101000200860102" "101000200860101"
    "101000201190106" ""              
    "101000201360101" "101000201360102"
    "101000201360102" "101000201360101"
    "101000201360103" ""              
    "101000201560108" ""              
    "101000201670110" ""              
    "101000202130102" "101000202130101"
    "101000202440101" "101000202440102"
    "101000202440102" "101000202440101"
    "101000203010101" "101000203010102"
    "101000203010102" "101000203010101"
    "101000203460101" "101000203460102"
    "101000203460102" "101000203460101"
    "101000300030201" "101000300030202"
    "101000300030202" "101000300030201"
    "101000300070105" ""              
    "101000300140106" ""              
    "101000300220101" ""              
    "101000300260101" "101000300260102"
    "101000300260102" "101000300260101"
    "101000300380205" ""              
    "101000300550105" ""              
    "101000300920101" "101000300920102"
    "101000300920102" "101000300920101"
    "101000301080101" "101000301080102"
    "101000301080102" "101000301080101"
    "101000301450201" "101000301450202"
    "101000301450202" "101000301450201"
    "101000301720106" ""              
    "101000301840101" "101000301840102"
    "101000301840102" "101000301840101"
    "101000302140101" "101000302140102"
    "101000302140102" "101000302140101"
    "101000302250106" ""              
    "101000302720104" ""              
    "101000303240201" ""              
    "101000400040101" "101000400040102"
    "101000400040102" "101000400040101"
    "101000400070107" "101000400070108"
    "101000400070108" "101000400070107"
    "101000400120101" "101000400120102"
    "101000400120102" "101000400120101"
    "101000400120109" ""              
    "101000400140101" "101000400140102"
    "101000400140102" "101000400140101"
    "101000400230102" "101000400230101"
    "101000400230111" ""              
    "101000400430201" "101000400430202"
    "101000400430202" "101000400430201"
    "101000400610101" "101000400610102"
    "101000400610102" "101000400610101"
    "101000400930101" "101000400930102"
    "101000400930102" "101000400930101"
    end
    I want to flag those observations which are not consistent. Eg. If prim_key 101000100040101 has a spouse_id 101000100040102, then prim_key 101000100040102 should have a spouse_id 101000100040101. If the prim_key 101000100040102 has any other spouse_id, then I want it flagged. The problem I have is that duplicates are also possible. So I could have the same spouse_id for two (or more) prim_keys, and similarly I could have two(or more) spouse_id for the same prim_key. This prevents me from making two separate files and using merge on them.
    Even then, if a prim_key has a spouse_id, then that prim_key should show up as a spouse_id when there is a prim_key with the value of the original spouse_id. Or to put it more clearly, If prim_key 101000100040101 has a spouse_id 101000100040102, then prim_key 101000100040102 should have a spouse_id 101000100040101 if they occur only once. But if prim_key 101000100040101 occurs twice, with different spouse_ids (say A and B), and both A and B are also some values of the prim_key, then both A and B should have a spouse_id equal to 101000100040101

    Is there any way to do what I want?
    Last edited by Rajdeep Chaudhuri; 02 Apr 2025, 00:26.

  • #2
    Sorting within pairs ensures the creation of a unique identifier for couples. You can exploit the fact that your identifiers are numbers and do the sorting in wide layout. Otherwise, temporarily reshaping long the data so that the primary keys and spouse identifiers are one variable is an alternative.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str15(prim_key spouse_id)
    "101000100040101" "101000100040102"
    "101000100040102" "101000100040101"
    "101000100130101" "101000100130102"
    "101000100130102" "101000100130101"
    "101000100130109" ""              
    "101000100250106" ""              
    "101000100320101" "101000100320102"
    "101000100320102" "101000100320101"
    "101000100320109" ""              
    "101000100370102" "101000100370101"
    "101000100370111" ""              
    "101000100590101" "101000100590102"
    "101000100590102" "101000100590101"
    "101000100760101" "101000100760102"
    "101000100760102" "101000100760101"
    "101000101040101" "101000101040102"
    "101000101040102" "101000101040101"
    "101000101330101" ""              
    "101000101720101" ""              
    "101000101890107" ""              
    "101000102490101" "101000102490102"
    "101000102490102" "101000102490101"
    "101000103350101" "101000103350102"
    "101000103350102" "101000103350101"
    "101000103550101" ""              
    "101000103550102" ""              
    "101000103690101" "101000103690102"
    "101000103690102" "101000103690101"
    "101000104030101" "101000104030102"
    "101000104030102" "101000104030101"
    "101000104240101" "101000104240102"
    "101000104240102" "101000104240101"
    "101000200050101" "101000200050102"
    "101000200050102" "101000200050101"
    "101000200160109" ""              
    "101000200210101" "101000200210102"
    "101000200210102" "101000200210101"
    "101000200210108" ""              
    "101000200270101" ""              
    "101000200290108" ""              
    "101000200290109" ""              
    "101000200340101" "101000200340102"
    "101000200340102" "101000200340101"
    "101000200470101" "101000200470102"
    "101000200470102" "101000200470101"
    "101000200740101" ""              
    "101000200860102" "101000200860101"
    "101000201190106" ""              
    "101000201360101" "101000201360102"
    "101000201360102" "101000201360101"
    "101000201360103" ""              
    "101000201560108" ""              
    "101000201670110" ""              
    "101000202130102" "101000202130101"
    "101000202440101" "101000202440102"
    "101000202440102" "101000202440101"
    "101000203010101" "101000203010102"
    "101000203010102" "101000203010101"
    "101000203460101" "101000203460102"
    "101000203460102" "101000203460101"
    "101000300030201" "101000300030202"
    "101000300030202" "101000300030201"
    "101000300070105" ""              
    "101000300140106" ""              
    "101000300220101" ""              
    "101000300260101" "101000300260102"
    "101000300260102" "101000300260101"
    "101000300380205" ""              
    "101000300550105" ""              
    "101000300920101" "101000300920102"
    "101000300920102" "101000300920101"
    "101000301080101" "101000301080102"
    "101000301080102" "101000301080101"
    "101000301450201" "101000301450202"
    "101000301450202" "101000301450201"
    "101000301720106" ""              
    "101000301840101" "101000301840102"
    "101000301840102" "101000301840101"
    "101000302140101" "101000302140102"
    "101000302140102" "101000302140101"
    "101000302250106" ""              
    "101000302720104" ""              
    "101000303240201" ""              
    "101000400040101" "101000400040102"
    "101000400040102" "101000400040101"
    "101000400070107" "101000400070108"
    "101000400070108" "101000400070107"
    "101000400120101" "101000400120102"
    "101000400120102" "101000400120101"
    "101000400120109" ""              
    "101000400140101" "101000400140102"
    "101000400140102" "101000400140101"
    "101000400230102" "101000400230101"
    "101000400230111" ""              
    "101000400430201" "101000400430202"
    "101000400430202" "101000400430201"
    "101000400610101" "101000400610102"
    "101000400610102" "101000400610101"
    "101000400930101" "101000400930102"
    "101000400930102" "101000400930101"
    end
    
    gen couple_id= string(min(real(prim_key), real(spouse_id)), "%18.0g") ///
    + " " + string(max(real(prim_key), real(spouse_id)), "%18.0g") ///
    if !missing(prim_key) & !missing(spouse_id)
    
    bys prim_key (couple_id): gen flag= couple_id[1]!= couple_id[_N]
    Res.:

    Code:
    . l, sepby(couple_id)
    
         +----------------------------------------------------------------------------+
         |        prim_key         spouse_id                         couple_id   flag |
         |----------------------------------------------------------------------------|
      1. | 101000100040101   101000100040102   101000100040101 101000100040102      0 |
      2. | 101000100040102   101000100040101   101000100040101 101000100040102      0 |
         |----------------------------------------------------------------------------|
      3. | 101000100130101   101000100130102   101000100130101 101000100130102      0 |
      4. | 101000100130102   101000100130101   101000100130101 101000100130102      0 |
         |----------------------------------------------------------------------------|
      5. | 101000100130109                                                          0 |
      6. | 101000100250106                                                          0 |
         |----------------------------------------------------------------------------|
      7. | 101000100320101   101000100320102   101000100320101 101000100320102      0 |
      8. | 101000100320102   101000100320101   101000100320101 101000100320102      0 |
         |----------------------------------------------------------------------------|
      9. | 101000100320109                                                          0 |
         |----------------------------------------------------------------------------|
     10. | 101000100370102   101000100370101   101000100370101 101000100370102      0 |
         |----------------------------------------------------------------------------|
     11. | 101000100370111                                                          0 |
         |----------------------------------------------------------------------------|
     12. | 101000100590101   101000100590102   101000100590101 101000100590102      0 |
     13. | 101000100590102   101000100590101   101000100590101 101000100590102      0 |
         |----------------------------------------------------------------------------|
     14. | 101000100760101   101000100760102   101000100760101 101000100760102      0 |
     15. | 101000100760102   101000100760101   101000100760101 101000100760102      0 |
         |----------------------------------------------------------------------------|
     16. | 101000101040101   101000101040102   101000101040101 101000101040102      0 |
     17. | 101000101040102   101000101040101   101000101040101 101000101040102      0 |
         |----------------------------------------------------------------------------|
     18. | 101000101330101                                                          0 |
     19. | 101000101720101                                                          0 |
     20. | 101000101890107                                                          0 |
         |----------------------------------------------------------------------------|
     21. | 101000102490101   101000102490102   101000102490101 101000102490102      0 |
     22. | 101000102490102   101000102490101   101000102490101 101000102490102      0 |
         |----------------------------------------------------------------------------|
     23. | 101000103350101   101000103350102   101000103350101 101000103350102      0 |
     24. | 101000103350102   101000103350101   101000103350101 101000103350102      0 |
         |----------------------------------------------------------------------------|
     25. | 101000103550101                                                          0 |
     26. | 101000103550102                                                          0 |
         |----------------------------------------------------------------------------|
     27. | 101000103690101   101000103690102   101000103690101 101000103690102      0 |
     28. | 101000103690102   101000103690101   101000103690101 101000103690102      0 |
         |----------------------------------------------------------------------------|
     29. | 101000104030101   101000104030102   101000104030101 101000104030102      0 |
     30. | 101000104030102   101000104030101   101000104030101 101000104030102      0 |
         |----------------------------------------------------------------------------|
     31. | 101000104240101   101000104240102   101000104240101 101000104240102      0 |
     32. | 101000104240102   101000104240101   101000104240101 101000104240102      0 |
         |----------------------------------------------------------------------------|
     33. | 101000200050101   101000200050102   101000200050101 101000200050102      0 |
     34. | 101000200050102   101000200050101   101000200050101 101000200050102      0 |
         |----------------------------------------------------------------------------|
     35. | 101000200160109                                                          0 |
         |----------------------------------------------------------------------------|
     36. | 101000200210101   101000200210102   101000200210101 101000200210102      0 |
     37. | 101000200210102   101000200210101   101000200210101 101000200210102      0 |
         |----------------------------------------------------------------------------|
     38. | 101000200210108                                                          0 |
     39. | 101000200270101                                                          0 |
     40. | 101000200290108                                                          0 |
     41. | 101000200290109                                                          0 |
         |----------------------------------------------------------------------------|
     42. | 101000200340101   101000200340102   101000200340101 101000200340102      0 |
     43. | 101000200340102   101000200340101   101000200340101 101000200340102      0 |
         |----------------------------------------------------------------------------|
     44. | 101000200470101   101000200470102   101000200470101 101000200470102      0 |
     45. | 101000200470102   101000200470101   101000200470101 101000200470102      0 |
         |----------------------------------------------------------------------------|
     46. | 101000200740101                                                          0 |
         |----------------------------------------------------------------------------|
     47. | 101000200860102   101000200860101   101000200860101 101000200860102      0 |
         |----------------------------------------------------------------------------|
     48. | 101000201190106                                                          0 |
         |----------------------------------------------------------------------------|
     49. | 101000201360101   101000201360102   101000201360101 101000201360102      0 |
     50. | 101000201360102   101000201360101   101000201360101 101000201360102      0 |
         |----------------------------------------------------------------------------|
     51. | 101000201360103                                                          0 |
     52. | 101000201560108                                                          0 |
     53. | 101000201670110                                                          0 |
         |----------------------------------------------------------------------------|
     54. | 101000202130102   101000202130101   101000202130101 101000202130102      0 |
         |----------------------------------------------------------------------------|
     55. | 101000202440101   101000202440102   101000202440101 101000202440102      0 |
     56. | 101000202440102   101000202440101   101000202440101 101000202440102      0 |
         |----------------------------------------------------------------------------|
     57. | 101000203010101   101000203010102   101000203010101 101000203010102      0 |
     58. | 101000203010102   101000203010101   101000203010101 101000203010102      0 |
         |----------------------------------------------------------------------------|
     59. | 101000203460101   101000203460102   101000203460101 101000203460102      0 |
     60. | 101000203460102   101000203460101   101000203460101 101000203460102      0 |
         |----------------------------------------------------------------------------|
     61. | 101000300030201   101000300030202   101000300030201 101000300030202      0 |
     62. | 101000300030202   101000300030201   101000300030201 101000300030202      0 |
         |----------------------------------------------------------------------------|
     63. | 101000300070105                                                          0 |
     64. | 101000300140106                                                          0 |
     65. | 101000300220101                                                          0 |
         |----------------------------------------------------------------------------|
     66. | 101000300260101   101000300260102   101000300260101 101000300260102      0 |
     67. | 101000300260102   101000300260101   101000300260101 101000300260102      0 |
         |----------------------------------------------------------------------------|
     68. | 101000300380205                                                          0 |
     69. | 101000300550105                                                          0 |
         |----------------------------------------------------------------------------|
     70. | 101000300920101   101000300920102   101000300920101 101000300920102      0 |
     71. | 101000300920102   101000300920101   101000300920101 101000300920102      0 |
         |----------------------------------------------------------------------------|
     72. | 101000301080101   101000301080102   101000301080101 101000301080102      0 |
     73. | 101000301080102   101000301080101   101000301080101 101000301080102      0 |
         |----------------------------------------------------------------------------|
     74. | 101000301450201   101000301450202   101000301450201 101000301450202      0 |
     75. | 101000301450202   101000301450201   101000301450201 101000301450202      0 |
         |----------------------------------------------------------------------------|
     76. | 101000301720106                                                          0 |
         |----------------------------------------------------------------------------|
     77. | 101000301840101   101000301840102   101000301840101 101000301840102      0 |
     78. | 101000301840102   101000301840101   101000301840101 101000301840102      0 |
         |----------------------------------------------------------------------------|
     79. | 101000302140101   101000302140102   101000302140101 101000302140102      0 |
     80. | 101000302140102   101000302140101   101000302140101 101000302140102      0 |
         |----------------------------------------------------------------------------|
     81. | 101000302250106                                                          0 |
     82. | 101000302720104                                                          0 |
     83. | 101000303240201                                                          0 |
         |----------------------------------------------------------------------------|
     84. | 101000400040101   101000400040102   101000400040101 101000400040102      0 |
     85. | 101000400040102   101000400040101   101000400040101 101000400040102      0 |
         |----------------------------------------------------------------------------|
     86. | 101000400070107   101000400070108   101000400070107 101000400070108      0 |
     87. | 101000400070108   101000400070107   101000400070107 101000400070108      0 |
         |----------------------------------------------------------------------------|
     88. | 101000400120101   101000400120102   101000400120101 101000400120102      0 |
     89. | 101000400120102   101000400120101   101000400120101 101000400120102      0 |
         |----------------------------------------------------------------------------|
     90. | 101000400120109                                                          0 |
         |----------------------------------------------------------------------------|
     91. | 101000400140101   101000400140102   101000400140101 101000400140102      0 |
     92. | 101000400140102   101000400140101   101000400140101 101000400140102      0 |
         |----------------------------------------------------------------------------|
     93. | 101000400230102   101000400230101   101000400230101 101000400230102      0 |
         |----------------------------------------------------------------------------|
     94. | 101000400230111                                                          0 |
         |----------------------------------------------------------------------------|
     95. | 101000400430201   101000400430202   101000400430201 101000400430202      0 |
     96. | 101000400430202   101000400430201   101000400430201 101000400430202      0 |
         |----------------------------------------------------------------------------|
     97. | 101000400610101   101000400610102   101000400610101 101000400610102      0 |
     98. | 101000400610102   101000400610101   101000400610101 101000400610102      0 |
         |----------------------------------------------------------------------------|
     99. | 101000400930101   101000400930102   101000400930101 101000400930102      0 |
    100. | 101000400930102   101000400930101   101000400930101 101000400930102      0 |
         +----------------------------------------------------------------------------+
    
    .

    Comment


    • #3
      I'm not sure the code in #2 does the job. Since the data example did not seem to flag any instances, I had to modify it to create some pathological situations. In the code below, I have a manually input variable called problem, which I think are the pathological cases that need flagging. I retain Andrew's code to create flag, but then I construct my own variable called wanted. Note that my code does not need the couple_id or the flag variable created by Andrew's portion.

      Code:
      * MODIFIED DATA EXAMPLE
      clear
      input str15(prim_key spouse_id) byte problem
      "101000100040101" "101000100040102" 0
      "101000100040101" "101000100040103" 0
      "101000100040102" "101000100040101" 0
      "101000100040103" "101000100040101" 0
      "101000100130101" "101000100130102" 0
      "101000100130101" "101000100130103" 1
      "101000100130102" "101000100130101" 0
      "101000100130103" "101000100130104" 1
      "101000100130109" ""                0
      "101000100250106" ""                0
      "101000100320101" "101000100320102" 0
      "101000100320102" "101000100320101" 0
      "101000100320102" "101000100320103" 0
      "101000100320103" "101000100320102" 0
      "101000100320109" ""                0
      "101000100370102" "101000100370101" 1
      "101000100370101" ""                1
      "101000100370111" ""                0
      "101000100380101" "101000100380102" 1 
      "101000100590101" "101000100590102" 0
      "101000100590102" "101000100590101" 0
      "101000100760101" "101000100760102" 0
      "101000100760102" "101000100760101" 0
      "101000101040101" "101000101040102" 1 
      "101000101040102" "101000101040103" 1
      end
      
      * CODE FROM ANDREW IN #2
      gen couple_id= string(min(real(prim_key), real(spouse_id)), "%18.0g") ///
      + " " + string(max(real(prim_key), real(spouse_id)), "%18.0g") ///
      if !missing(prim_key) & !missing(spouse_id)
      
      bys prim_key (couple_id): gen flag= couple_id[1]!= couple_id[_N]
      
      * MY CODE STARTS HERE
      
      tempfile primary
      save `primary'
      
      tempfile opposite spouses
      rename (prim_key spouse_id) (spouse_id prim_key)
      save `opposite'
      
      keep prim_key
      duplicates drop
      save `spouses'
      
      use `primary', clear
      joinby prim_key spouse_id using `opposite', unmatched(master) _merge(_join) update
      merge m:1 prim_key using `spouses', keep(master match)
      
      gen byte wanted = cond(missing(spouse_id), (_merge == 3), (_join == 1))
      drop _merge _join couple_id
      This results in:

      Code:
      . list, noobs sep(0)
      
        +-------------------------------------------------------------+
        |        prim_key         spouse_id   problem   flag   wanted |
        |-------------------------------------------------------------|
        | 101000100040101   101000100040102         0      1        0 |
        | 101000100040101   101000100040103         0      1        0 |
        | 101000100040102   101000100040101         0      0        0 |
        | 101000100040103   101000100040101         0      0        0 |
        | 101000100130101   101000100130102         0      1        0 |
        | 101000100130101   101000100130103         1      1        1 |
        | 101000100130102   101000100130101         0      0        0 |
        | 101000100130103   101000100130104         1      0        1 |
        | 101000100130109                           0      0        0 |
        | 101000100250106                           0      0        0 |
        | 101000100320101   101000100320102         0      0        0 |
        | 101000100320102   101000100320101         0      1        0 |
        | 101000100320102   101000100320103         0      1        0 |
        | 101000100320103   101000100320102         0      0        0 |
        | 101000100320109                           0      0        0 |
        | 101000100370101                           1      0        1 |
        | 101000100370102   101000100370101         1      0        1 |
        | 101000100370111                           0      0        0 |
        | 101000100380101   101000100380102         1      0        1 |
        | 101000100590101   101000100590102         0      0        0 |
        | 101000100590102   101000100590101         0      0        0 |
        | 101000100760101   101000100760102         0      0        0 |
        | 101000100760102   101000100760101         0      0        0 |
        | 101000101040101   101000101040102         1      0        1 |
        | 101000101040102   101000101040103         1      0        1 |
        +-------------------------------------------------------------+

      Comment


      • #4
        Thanks Hemanshu, I think Andrews code simply flags those where the couple uniqueness gets broken. Your wanted variable however, is what I required. Thanks a ton.

        Comment


        • #5
          We may have celebrated too early. I noticed a couple situations in which my code didn't give the desired output. In my data example above, for instance, the pair "101000100380101 101000100380102" should have been coded 0 for problem & wanted. Here is some code with a different approach:

          Code:
          * a FURTHER MODIFIED data example
          clear
          input str15(prim_key spouse_id) byte problem
          "101000100040101" "101000100040102" 0
          "101000100040101" "101000100040103" 0
          "101000100040102" "101000100040101" 0
          "101000100040103" "101000100040101" 0
          "101000100130101" "101000100130102" 0
          "101000100130101" "101000100130103" 1
          "101000100130102" "101000100130101" 0
          "101000100130103" "101000100130104" 1
          "101000100130109" ""                0
          "101000100250106" ""                0
          "101000100320101" "101000100320102" 0
          "101000100320102" "101000100320101" 0
          "101000100320102" "101000100320103" 0
          "101000100320103" "101000100320102" 0
          "101000100320109" ""                0
          "101000100370102" "101000100370101" 1
          "101000100370101" ""                1
          "101000100370111" ""                0
          "101000100380101" "101000100380102" 0
          "101000100590101" "101000100590102" 0
          "101000100590102" "101000100590101" 0
          "101000100760101" "101000100760102" 0
          "101000100760102" "101000100760101" 0
          "101000101040101" "101000101040102" 0
          "101000101040102" "101000101040101" 0
          "101000101040102" "101000101040103" 0
          end
          
          * ANDREW'S CODE from #2
          gen couple_id= string(min(real(prim_key), real(spouse_id)), "%18.0g") ///
          + " " + string(max(real(prim_key), real(spouse_id)), "%18.0g") ///
          if !missing(prim_key) & !missing(spouse_id)
          
          bys prim_key (couple_id): gen flag= couple_id[1]!= couple_id[_N]
          
          *********************
          * NEW CODE from me
          *********************
          gen person_1 = string(min(real(prim_key), real(spouse_id)), "%18.0g")
          gen person_2 = string(max(real(prim_key), real(spouse_id)), "%18.0g") if !missing(spouse_id)
          
          egen pair_id = group(person_1 person_2)
          
          * "complete pairs" are those for which both observations exist
          bys pair_id (prim_key): gen byte complete_pair = (_N == 2)
          
          tempfile primary
          save `primary'
          
          * now to deal with incomplete pairs
          * -- first, we identify all the individuals involved in incomplete pairs
          * -- and count the number of incomplete pairs in which they occur
          keep if !complete_pair
          keep person_1 person_2
          gen x = _n
          reshape long person_, i(x) j(num)
          drop if missing(person_)
          contract person_, freq(n_occur)
          tempfile occurs
          save `occurs'
          
          use `primary', clear
          rename prim_key person_
          merge m:1 person_ using `occurs', keep(master match) nogen keepusing(n_occur)
          rename (person_ n_occur spouse_id) (prim_key n_occur_prim person_)
          merge m:1 person_ using `occurs', keep(master match) nogen keepusing(n_occur)
          rename (person_ n_occur) (spouse_id n_occur_spouse)
          
          * we flag as follows:
          * complete pairs are never flagged
          * incomplete pairs are flagged only if either spouse occurs in more than one incomplete pair
          gen byte wanted = cond(complete_pair, 0, max(n_occur_prim, n_occur_spouse) != 1)
          
          keep prim_key spouse_id problem flag wanted
          sort prim_key spouse_id
          This produces:

          Code:
          . list, noobs sep(0)
          
            +-------------------------------------------------------------+
            |        prim_key         spouse_id   problem   flag   wanted |
            |-------------------------------------------------------------|
            | 101000100040101   101000100040102         0      1        0 |
            | 101000100040101   101000100040103         0      1        0 |
            | 101000100040102   101000100040101         0      0        0 |
            | 101000100040103   101000100040101         0      0        0 |
            | 101000100130101   101000100130102         0      1        0 |
            | 101000100130101   101000100130103         1      1        1 |
            | 101000100130102   101000100130101         0      0        0 |
            | 101000100130103   101000100130104         1      0        1 |
            | 101000100130109                           0      0        0 |
            | 101000100250106                           0      0        0 |
            | 101000100320101   101000100320102         0      0        0 |
            | 101000100320102   101000100320101         0      1        0 |
            | 101000100320102   101000100320103         0      1        0 |
            | 101000100320103   101000100320102         0      0        0 |
            | 101000100320109                           0      0        0 |
            | 101000100370101                           1      0        1 |
            | 101000100370102   101000100370101         1      0        1 |
            | 101000100370111                           0      0        0 |
            | 101000100380101   101000100380102         0      0        0 |
            | 101000100590101   101000100590102         0      0        0 |
            | 101000100590102   101000100590101         0      0        0 |
            | 101000100760101   101000100760102         0      0        0 |
            | 101000100760102   101000100760101         0      0        0 |
            | 101000101040101   101000101040102         0      0        0 |
            | 101000101040102   101000101040101         0      1        0 |
            | 101000101040102   101000101040103         0      1        0 |
            +-------------------------------------------------------------+
          I would still advise thinking this through and double checking that it correctly identifies all sorts of pathological cases.
          Last edited by Hemanshu Kumar; 02 Apr 2025, 08:53.

          Comment


          • #6

            Sorry, I wanted to edit #4 but I could not due to some reason.

            While I think Hemanshu's code works better in terms of what I need, what that code does is if it doesn't find a matching spouse_id in the prim_key column, it takes wanted to be 1. I dont want those cases to be flagged. What I want is that if there is inconsistency, then the flag should be there.
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str15(prim_key spouse_id) byte wanted float flag
            "101000100370102" "101000100370101" 1 0
            "101000200860102" "101000200860101" 1 0
            "101000202130102" "101000202130101" 1 0
            "101000400230102" "101000400230101" 1 0
            "101000401090102" "101000401090101" 1 0
            "101000401710102" "101000401710101" 1 0
            "101000403670102" "101000403670101" 1 0
            "101000600140102" "101000600140101" 1 0
            "101000600180102" "101000600180101" 1 0
            "101000600670102" "101000600670101" 1 0
            "101000600830102" "101000600830101" 1 0
            "101000601300102" "101000601300101" 1 0
            "101000700850102" "101000700850101" 1 0
            "101000702290206" "101000702290207" 1 0
            "101000800440102" "101000800440101" 1 0
            "101000802500102" "101000802500101" 1 0
            "101000802870102" "101000802870101" 1 0
            "101000802980102" "101000802980101" 1 0
            "101000901410101" "101000901410102" 1 0
            "101000902020102" "101000902020101" 1 0
            "101000902140102" "101000902140101" 1 0
            "101000902370101" "101000902370102" 1 0
            "101000902370102" "101000902370103" 1 0
            "101001003900102" "101001003900101" 1 0
            "101001004550102" "101001004550101" 1 0
            "101001004980102" "101001004980101" 1 0
            "101001202080102" "101001202080101" 1 0
            "101001300210101" "101001300210102" 1 0
            "101001300300101" "101001300300103" 1 0
            "101001300300102" "101001300300101" 1 0
            "101001301950101" "101001301950102" 1 0
            "101001304460202" "101001304460201" 1 0
            "101001400840102" "101001400840101" 1 0
            "101001402200102" "101001402200101" 1 0
            "101001500240102" "101001500240103" 1 0
            "101001502560101" "101001502560102" 1 0
            "101001600770101" "101001600770102" 1 0
            "101001600770102" "101001600770103" 1 0
            "101001701240102" "101001701240101" 1 0
            "101001701400101" "101001701400102" 1 0
            "101001701460102" "101001701460101" 1 0
            "101001800480102" "101001800480101" 1 0
            "101001900620102" "101001900620101" 1 0
            "101001902020107" "101001902020106" 1 0
            "101001902500102" "101001902500101" 1 0
            "101001903120102" "101001903120101" 1 0
            "101002000320104" "101002000320103" 1 0
            "101002000400102" "101002000400101" 1 0
            "101002001250102" "101002001250101" 1 0
            "101002002710102" "101002002710101" 1 0
            "101002003810102" "101002003810101" 1 0
            "101002100480102" "101002100480101" 1 0
            "101002100540102" "101002100540101" 1 0
            "101002101400102" "101002101400101" 1 0
            "101002101630101" "101002101630102" 1 0
            "101002101800102" "101002101800101" 1 0
            "101002101970102" "101002101970101" 1 0
            "101002200190102" "101002200190101" 1 0
            "101002200670102" "101002200670101" 1 0
            "101002200760101" "101002200760102" 1 0
            "101002202160202" "101002202160201" 1 0
            "101002202330102" "101002202330101" 1 0
            "101002203480102" "101002203480101" 1 0
            "101002301730102" "101002301730101" 1 0
            "101002400360103" "101002400360102" 1 0
            "101002401870101" "101002401870103" 1 0
            "101002401870102" "101002401870101" 1 0
            "101002403440101" "101002403440102" 1 0
            "101002404740101" "101002404740102" 1 0
            "101002500630102" "101002500630101" 1 0
            "101002503270102" "101002503270101" 1 0
            "101002601330102" "101002601330101" 1 0
            "101002602140102" "101002602140101" 1 0
            "101002603190101" "101002603190102" 1 0
            "101002603460201" "101002603460202" 1 0
            "101002800320101" "101002800320102" 1 0
            "101002800320102" ""                1 0
            "101002802040101" "101002802040102" 1 0
            "101002802190102" "101002802190101" 1 0
            "101002802940102" "101002802940101" 1 0
            "101002803500102" "101002803500101" 1 0
            "101002900220102" "101002900220101" 1 0
            "101002900330101" "101002900330102" 1 0
            "101002900980102" "101002900980101" 1 0
            "101002901190101" "101002901190102" 1 0
            "101002903670102" "101002903670101" 1 0
            "101003000730102" "101003000730101" 1 0
            "101003001560102" "101003001560101" 1 0
            "101003101620102" "101003101620101" 1 0
            "101003102430101" "101003102430102" 1 0
            "101003201180102" "101003201180101" 1 0
            "101003201970102" "101003201970101" 1 0
            "101003202620102" "101003202620101" 1 0
            "101003300050101" "101003300050102" 1 0
            "101003302320101" "101003302320102" 1 0
            "101003500680102" "101003500680101" 1 0
            "101003503940102" "101003503940101" 1 0
            "101003504820101" "101003504820102" 1 0
            "101003600800102" "101003600800101" 1 0
            "101003700130102" "101003700130101" 1 0
            end

            If you look at the prim_key = "101001600770102" has a spouse_id "101001600770103" but the spouse_id = "101001600770102" is mapped to prim_key = "101001600770101". Flag (Andew's code) does not flag it. And Hemanshu's code, while flagging it, also flags anything in spouse_id that does not exist in prim_key.
            What ideally I would want is to have all the others unflagged, keeping only this "101001600770102" and "101001600770101" flagged.

            Thanks again

            Comment


            • #7
              Rajdeep Chaudhuri can you check the modified code in #5 and let me know if it does what you need?

              Comment


              • #8
                Thanks Hemanshu for the updated code. It does overcome the earlier limitations but there are certain cases where it still flags. For instance:



                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input str15(prim_key spouse_id) byte wanted
                "101000902370101" "101000902370102" 1
                "101000902370102" "101000902370103" 1
                "101001300300101" "101001300300103" 1
                "101001300300102" "101001300300101" 1
                "101001600770101" "101001600770102" 1
                "101001600770102" "101001600770103" 1
                "101002401870101" "101002401870103" 1
                "101002401870102" "101002401870101" 1
                "101002800320101" "101002800320102" 1
                "102005201100101" "102005201100103" 1
                "102005201100102" "102005201100101" 1
                "102005801420103" "102005801420104" 1
                "102005801420104" "102005801420101" 1
                "102006400390101" "102006400390102" 1
                "102006400390102" "102006400390104" 1
                "102007700330107" "102007700330102" 1
                "102007700330108" "102007700330107" 1
                "102008001600105" "102008001600101" 1
                "102008001600106" "102008001600105" 1
                "102008600170102" "102008600170101" 1
                "102008901450106" "102008901450101" 1
                "102008901450107" "102008901450101" 1
                "103010801850101" "103010801850103" 1
                "103010801850102" "103010801850101" 1
                "103010802940101" "103010802940102" 1
                "103010904820101" "103010904820102" 1
                "103010904820102" "103010904820103" 1
                "103014401650102" "103014401650101" 1
                "103014401650108" "103014401650101" 1
                "103015402080102" "103015402080101" 1
                "103015402080103" "103015402080102" 1
                "103016901270102" "103016901270101" 1
                "104018100080101" "104018100080102" 1
                "104018300130102" "104018300130101" 1
                "104018400310101" "104018400310102" 1
                "104018500100102" "104018500100101" 1
                "104019000040301" "104019000040303" 1
                "104019000040302" "104019000040301" 1
                "104019700040303" "104019700040304" 1
                "104019700040304" "104019700040301" 1
                "104021400100101" "104021400100102" 1
                "104021400100102" "104021400100103" 1
                "104021600400206" "104021600400207" 1
                "104021600400207" "104021600400201" 1
                "105023202520101" "105023202520102" 1
                "105023202520102" "105023202520105" 1
                "105023600280101" "105023600280102" 1
                "105023600280102" "105023600280106" 1
                "105023703690101" "105023703690102" 1
                "105023703690102" "105023703690103" 1
                "105025100300102" "105025100300101" 1
                "105025100300103" "105025100300102" 1
                "105026900450101" "105026900450103" 1
                "105026900450102" "105026900450101" 1
                "106027502530201" "106027502530202" 1
                "106027502530202" "106027502530203" 1
                "106028401500106" "106028401500107" 1
                "106028401500107" "106028401500101" 1
                "106029801310102" "106029801310101" 1
                "107035200080201" "107035200080203" 1
                "107035200080202" "107035200080201" 1
                "107036400550101" "107036400550102" 1
                "107036400550101" "107036400550102" 1
                "107036400550102" "107036400550101" 1
                "107036400550103" "107036400550101" 1
                "107036401130801" "107036401130802" 1
                "107036401130802" "107036401130803" 1
                "107036701550301" "107036701550302" 1
                "107036900030601" "107036900030603" 1
                "107036900030602" "107036900030601" 1
                "107037200260102" "107037200260101" 1
                "107037300770103" "107037300770101" 1
                "107037300770104" "107037300770103" 1
                "108040600260101" "108040600260102" 1
                "108040600260102" "108040600260103" 1
                "108041902870107" "108041902870108" 1
                "108041902870108" "108041902870101" 1
                "108045400690101" "108045400690103" 1
                "108045400690102" "108045400690101" 1
                "108045701810201" "108045701810211" 1
                "108045701810202" "108045701810201" 1
                "109046601060101" "109046601060102" 1
                "109046601060102" "109046601060103" 1
                "109046900660101" "109046900660103" 1
                "109046900660102" "109046900660101" 1
                "109046901520101" "109046901520102" 1
                "109046901520102" "109046901520103" 1
                "109047400890207" "109047400890208" 1
                "109047400890208" "109047400890201" 1
                "109048100010101" "109048100010103" 1
                "109048100010102" "109048100010101" 1
                "109049800210102" "109049800210101" 1
                "109049803280101" "109049803280102" 1
                "109051005160101" "109051005160102" 1
                "109051005160101" "109051005160102" 1
                "109051005160102" "109051005160101" 1
                "109051005160105" "109051005160101" 1
                "109051800170101" "109051800170102" 1
                "109052202420101" "109052202420102" 1
                "109052202420102" "109052202420103" 1
                end
                If you look at 102008901450106 and 102008901450107 prim_keys, they are mapped to a spouse_id 102008901450101. Now 102008901450101 does not exist as a prim_key at all, so there is no way to check for consistency. I would not want that flagged. In other words, we CAN have multiple spouse_ids for the same prim_key(or vice versa, as in this case) , provided that when those spouse_ids appear as prim_keys, they need to have the same spouse_id. I am flagging those cases where this is not the case. Here we have multiple prim_keys mapped to a spouse_id, and that is acceptable, since the spouse_id never occurs as a prim_key. If the spouse_id did occur as a prim_key, and then it's corresponding spouse_id was NOT any of the prim_keys, then I want it flagged. So if 102008901450101 occurred as a prim_key and it's spouse_id was not either of 10200890145010106 or 102008901450107, then I should flag it.

                Comment


                • #9
                  Okay, here's another shot at it. I've amended my data example to include the case you mentioned.

                  Code:
                  * YET FURTHER MODIFIED data example
                  clear
                  input str15(prim_key spouse_id) byte problem
                  "101000100040101" "101000100040102" 0
                  "101000100040101" "101000100040103" 0
                  "101000100040102" "101000100040101" 0
                  "101000100040103" "101000100040101" 0
                  "101000100130101" "101000100130102" 0
                  "101000100130101" "101000100130103" 1
                  "101000100130102" "101000100130101" 0
                  "101000100130103" "101000100130104" 1
                  "101000100130109" ""                0
                  "101000100250106" ""                0
                  "101000100320101" "101000100320102" 0
                  "101000100320102" "101000100320101" 0
                  "101000100320102" "101000100320103" 0
                  "101000100320103" "101000100320102" 0
                  "101000100320109" ""                0
                  "101000100370102" "101000100370101" 1
                  "101000100370101" ""                1
                  "101000100370111" ""                0
                  "101000100380101" "101000100380102" 0
                  "101000100590101" "101000100590102" 0
                  "101000100590102" "101000100590101" 0
                  "101000100760101" "101000100760102" 0
                  "101000100760102" "101000100760101" 0
                  "101000101040101" "101000101040102" 0
                  "101000101040102" "101000101040101" 0
                  "101000101040102" "101000101040103" 0
                  "102008901450106" "102008901450101" 0
                  "102008901450107" "102008901450101" 0
                  end
                  
                  
                  *********************
                  * EVEN NEWER CODE from me
                  *********************
                  rename (prim_key spouse_id) (person_1 person_2)
                  gen p_low = string(min(real(person_1), real(person_2)), "%18.0g")
                  gen p_high = string(max(real(person_1), real(person_2)), "%18.0g") if !missing(person_2)
                  
                  egen pair_id = group(p_low p_high)
                  drop p_low p_high
                  
                  * "complete pairs" are those for which both observations exist
                  bys pair_id (person_1): gen byte complete_pair = (_N == 2)
                  drop pair_id
                  
                  tempfile primary
                  save `primary'
                  
                  * now to deal with incomplete pairs
                  * -- first, we identify all the individuals involved in incomplete pairs
                  * -- and count the number of incomplete pairs in which they occur
                  * -- we count their occurrences as primary and as spouses, separately
                  keep if !complete_pair
                  keep person_1 person_2
                  gen x = _n
                  reshape long person_, i(x) j(num)
                  drop if missing(person_)
                  contract person_ num, freq(n_occur)
                  reshape wide n_occur, i(person_) j(num)
                  foreach var in n_occur1 n_occur2 {
                      replace `var' = 0 if missing(`var')
                  }
                  rename (n_occur1 n_occur2) (n_occur_as_prim n_occur_as_spouse)
                  tempfile occurs
                  save `occurs'
                  
                  
                  use `primary', clear
                  rename person_1 person_
                  merge m:1 person_ using `occurs', keep(master match) nogen keepusing(n_occur_as_spouse)
                  rename (person_ n_occur_as_spouse person_2) (prim_key n_prim_occur_as_spouse person_)
                  merge m:1 person_ using `occurs', keep(master match) nogen keepusing(n_occur_as_prim)
                  rename (person_ n_occur_as_prim) (spouse_id n_spouse_occur_as_prim)
                  
                  * we flag as follows:
                  * complete pairs are never flagged
                  * incomplete pairs are flagged only if either spouse occurs in at least one incomplete pair in the opposite role
                  gen byte wanted = cond(complete_pair, 0, ///
                                      max(n_spouse_occur_as_prim, n_prim_occur_as_spouse) >= 1 & ///
                                      !(missing(n_spouse_occur_as_prim) & missing(n_prim_occur_as_spouse)))
                  keep prim_key spouse_id problem wanted
                  sort prim_key spouse_id
                  This produces:
                  Code:
                  . list, noobs sep(0)
                  
                    +------------------------------------------------------+
                    |        prim_key         spouse_id   problem   wanted |
                    |------------------------------------------------------|
                    | 101000100040101   101000100040102         0        0 |
                    | 101000100040101   101000100040103         0        0 |
                    | 101000100040102   101000100040101         0        0 |
                    | 101000100040103   101000100040101         0        0 |
                    | 101000100130101   101000100130102         0        0 |
                    | 101000100130101   101000100130103         1        1 |
                    | 101000100130102   101000100130101         0        0 |
                    | 101000100130103   101000100130104         1        1 |
                    | 101000100130109                           0        0 |
                    | 101000100250106                           0        0 |
                    | 101000100320101   101000100320102         0        0 |
                    | 101000100320102   101000100320101         0        0 |
                    | 101000100320102   101000100320103         0        0 |
                    | 101000100320103   101000100320102         0        0 |
                    | 101000100320109                           0        0 |
                    | 101000100370101                           1        1 |
                    | 101000100370102   101000100370101         1        1 |
                    | 101000100370111                           0        0 |
                    | 101000100380101   101000100380102         0        0 |
                    | 101000100590101   101000100590102         0        0 |
                    | 101000100590102   101000100590101         0        0 |
                    | 101000100760101   101000100760102         0        0 |
                    | 101000100760102   101000100760101         0        0 |
                    | 101000101040101   101000101040102         0        0 |
                    | 101000101040102   101000101040101         0        0 |
                    | 101000101040102   101000101040103         0        0 |
                    | 102008901450106   102008901450101         0        0 |
                    | 102008901450107   102008901450101         0        0 |
                    +------------------------------------------------------+
                  Last edited by Hemanshu Kumar; 03 Apr 2025, 02:17.

                  Comment

                  Working...
                  X