Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to merge two datasets (string and integer values in ID)

    Dear all,
    I have two datasets: household and individual data. Usually, it is quite straightforward to use 1:m or m:1 merge in this case. However, in my datasets the personal ID contains integer values, while the household ID contains string values with more than one character relative to the personal ID. It seems that I need to harmonize the two IDs before merging, any help would be much appreciated. Thank you.

    Household data
    Code:
    clear
    input str12 id float(x1 x2)
    "        1 10" 3   .06849
    "        1 20" 3   .34714
    "        1 30" 4   .80035
    "        1 40" 2   -.7121
    "        1 51" 2  -.29017
    "        1 52" 1 -1.16279
    "        1 60" 3  -.05045
    "        1 70" 3  -.18113
    "        1 80" 3   .14243
    "        1 90" 2  -.63508
    "        1100" 2  -.62762
    "        1110" 3   .02875
    "        1120" 3   .11791
    "        2 20" 2  -.75315
    "        2 30" 3   .28492
    "        2 40" 2  -.67028
    "        2 50" 2  -.71762
    "        2 60" 1 -1.11277
    "        2 70" 1  -.99784
    "        2 80" 3   .61741
    "        2 90" 1  -.99544
    "        2100" 3   .45869
    "        2110" 2   -.4203
    "        2120" 4  1.00988
    "        2130" 4   1.0954
    "        3 10" 2   -.5373
    "        3 20" 4   .66405
    "        3 30" 2  -.36415
    "        3 40" 3   .19945
    "        3 50" 1 -1.08678
    "        3 60" 4   .73621
    "        3 70" 4   .74809
    "        3 80" 2  -.45847
    "        3 90" 3   .63303
    "        3100" 2  -.54443
    "        3110" 1  -1.3781
    "        3120" 3  -.09137
    "        4 10" 3  -.20053
    "        4 20" 4   .77128
    "        4 30" 2  -.90075
    "        4 40" 5   1.4545
    "        4 50" 3    .5492
    "        4 60" 3   .34798
    "        4 70" 3   .57425
    "        4 88" 4   .72483
    "        4 90" 4  1.13641
    "        4100" 4    .7428
    "        4110" 3   .07936
    "        4120" 5   1.4374
    "        4130" 4   .70773
    "        5 10" 3   .28812
    "        5 20" 5   1.2672
    "        5 30" 3   .15984
    "        5 40" 4  1.06975
    "        5 50" 4  1.13389
    "        5 60" 3   .20208
    "        5 70" 3   .56203
    "        5 80" 5  1.38634
    "        5 90" 4   .64855
    "        5100" 5  1.86468
    "        5110" 5  1.53115
    "        5120" 4  1.06683
    "        6 10" 5  1.33402
    "        6 30" 4   .97834
    "        6 40" 4   .82034
    "        6 50" 4  1.09458
    "        6 60" 5  1.48871
    "        6 70" 5   1.3944
    "        6 80" 5  1.38047
    "        6 90" 3   .44144
    "        6100" 2  -.26236
    "        6110" 2  -.69594
    "        6120" 3  -.02173
    "        7 10" 5   1.4374
    "        7 20" 3  -.17112
    "        7 30" 3   .00473
    "        7 40" 3  -.20517
    "        7 50" 3   .29125
    "        7 60" 2  -.57303
    "        7 80" 3  -.08485
    "        7 90" 3  -.11895
    "        7101" 2  -.60675
    "        7102" 2  -.47832
    "        7110" 2  -.61103
    "        7120" 2  -.89282
    "        7130" 5  1.46305
    "        8 10" 5  1.42465
    "        8 20" 4  1.18588
    "        8 30" 5  1.55226
    "        8 40" 5  1.29114
    "        8 50" 4  1.14672
    "        8 60" 4   1.0954
    "        8 70" 5  1.30324
    "        8 80" 5   1.3014
    "        8 90" 4   .79918
    "        8100" 2  -.39919
    "        8110" 3   .26246
    "        9 10" 5   1.2672
    "        9 20" 5  1.67923
    "        9 30" 4  1.12106
    end
    Individual data
    Code:
    clear
    input float pid
     10
     10
     20
     20
     30
     30
     40
     40
     40
     40
     40
     40
     40
     40
     40
     40
     40
     40
     51
     51
     51
     51
     52
     60
     60
     60
     60
     60
     60
     70
     80
     80
     80
     80
     80
     90
     90
     90
     90
     90
     90
    100
    100
    100
    100
    100
    110
    110
    110
    110
    110
    110
    110
    110
    120
    120
    120
    120
    120
    120
    120
    120
    120
     20
     20
     20
     20
     20
     20
     20
     20
     20
     30
     30
     30
     30
     30
     30
     30
     40
     40
     40
     40
     40
     40
     40
     40
     40
     50
     50
     60
     60
     60
     60
     60
     60
     60
     60
     60
     70
    end

  • #2
    Assuming that the observations are ordered, you may be able to extract the identifier. If everything is mixed up, then there is no obvious way to distinguish between 8112= 8 122 and 8122 = 81 22. With ordering, it may be possible to exploit the fact that a new group starts with "(number)space(number)" and an old group ends with "number".

    Code:
    clear
    input str12 id float(x1 x2)
    "        1 10" 3   .06849
    "        1 20" 3   .34714
    "        1 30" 4   .80035
    "        1 40" 2   -.7121
    "        1 51" 2  -.29017
    "        1 52" 1 -1.16279
    "        1 60" 3  -.05045
    "        1 70" 3  -.18113
    "        1 80" 3   .14243
    "        1 90" 2  -.63508
    "        1100" 2  -.62762
    "        1110" 3   .02875
    "        1120" 3   .11791
    "        2 20" 2  -.75315
    "        2 30" 3   .28492
    "        2 40" 2  -.67028
    "        2 50" 2  -.71762
    "        2 60" 1 -1.11277
    "        2 70" 1  -.99784
    "        2 80" 3   .61741
    "        2 90" 1  -.99544
    "        2100" 3   .45869
    "        2110" 2   -.4203
    "        2120" 4  1.00988
    "        2130" 4   1.0954
    "        3 10" 2   -.5373
    "        3 20" 4   .66405
    "        3 30" 2  -.36415
    "        3 40" 3   .19945
    "        3 50" 1 -1.08678
    "        3 60" 4   .73621
    "        3 70" 4   .74809
    "        3 80" 2  -.45847
    "        3 90" 3   .63303
    "        3100" 2  -.54443
    "        3110" 1  -1.3781
    "        3120" 3  -.09137
    "        4 10" 3  -.20053
    "        4 20" 4   .77128
    "        4 30" 2  -.90075
    "        4 40" 5   1.4545
    "        4 50" 3    .5492
    "        4 60" 3   .34798
    "        4 70" 3   .57425
    "        4 88" 4   .72483
    "        4 90" 4  1.13641
    "        4100" 4    .7428
    "        4110" 3   .07936
    "        4120" 5   1.4374
    "        4130" 4   .70773
    "        5 10" 3   .28812
    "        5 20" 5   1.2672
    "        5 30" 3   .15984
    "        5 40" 4  1.06975
    "        5 50" 4  1.13389
    "        5 60" 3   .20208
    "        5 70" 3   .56203
    "        5 80" 5  1.38634
    "        5 90" 4   .64855
    "        5100" 5  1.86468
    "        5110" 5  1.53115
    "        5120" 4  1.06683
    "        6 10" 5  1.33402
    "        6 30" 4   .97834
    "        6 40" 4   .82034
    "        6 50" 4  1.09458
    "        6 60" 5  1.48871
    "        6 70" 5   1.3944
    "        6 80" 5  1.38047
    "        6 90" 3   .44144
    "        6100" 2  -.26236
    "        6110" 2  -.69594
    "        6120" 3  -.02173
    "        7 10" 5   1.4374
    "        7 20" 3  -.17112
    "        7 30" 3   .00473
    "        7 40" 3  -.20517
    "        7 50" 3   .29125
    "        7 60" 2  -.57303
    "        7 80" 3  -.08485
    "        7 90" 3  -.11895
    "        7101" 2  -.60675
    "        7102" 2  -.47832
    "        7110" 2  -.61103
    "        7120" 2  -.89282
    "        7130" 5  1.46305
    "        8 10" 5  1.42465
    "        8 20" 4  1.18588
    "        8 30" 5  1.55226
    "        8 40" 5  1.29114
    "        8 50" 4  1.14672
    "        8 60" 4   1.0954
    "        8 70" 5  1.30324
    "        8 80" 5   1.3014
    "        8 90" 4   .79918
    "        8100" 2  -.39919
    "        8110" 3   .26246
    "        9 10" 5   1.2672
    "        9 20" 5  1.67923
    "        9 30" 4  1.12106
    end
    
    gen tag= cond(_n==1, 1, ustrregexm(trim(id), "\d+\s+\d+") & ustrregexm(trim(id[_n-1]), "\d+"))
    gen group= sum(tag & !tag[_n-1]) +1
    gen long obsno=_n
    bys group (obsno): gen wanted= real(ustrregexra(trim(id), "(\d+)\s+\d+", "$1"))
    by group: replace wanted=wanted[1]
    Note that the identification hinges on the condition I have specified above being met across all groups. If spaces are not present starting at "10 00", such that this is written as "1000", the code will fail.

    Res.:

    Code:
    . l, sepby(wanted)
    
         +-------------------------------------------------------------+
         |           id   x1         x2   tag   group   obsno   wanted |
         |-------------------------------------------------------------|
      1. |         1 10    3     .06849     1       1       1        1 |
      2. |         1 20    3     .34714     1       1       2        1 |
      3. |         1 30    4     .80035     1       1       3        1 |
      4. |         1 40    2     -.7121     1       1       4        1 |
      5. |         1 51    2    -.29017     1       1       5        1 |
      6. |         1 52    1   -1.16279     1       1       6        1 |
      7. |         1 60    3    -.05045     1       1       7        1 |
      8. |         1 70    3    -.18113     1       1       8        1 |
      9. |         1 80    3     .14243     1       1       9        1 |
     10. |         1 90    2    -.63508     1       1      10        1 |
     11. |         1100    2    -.62762     0       1      11        1 |
     12. |         1110    3     .02875     0       1      12        1 |
     13. |         1120    3     .11791     0       1      13        1 |
         |-------------------------------------------------------------|
     14. |         2 20    2    -.75315     1       2      14        2 |
     15. |         2 30    3     .28492     1       2      15        2 |
     16. |         2 40    2    -.67028     1       2      16        2 |
     17. |         2 50    2    -.71762     1       2      17        2 |
     18. |         2 60    1   -1.11277     1       2      18        2 |
     19. |         2 70    1    -.99784     1       2      19        2 |
     20. |         2 80    3     .61741     1       2      20        2 |
     21. |         2 90    1    -.99544     1       2      21        2 |
     22. |         2100    3     .45869     0       2      22        2 |
     23. |         2110    2     -.4203     0       2      23        2 |
     24. |         2120    4    1.00988     0       2      24        2 |
     25. |         2130    4     1.0954     0       2      25        2 |
         |-------------------------------------------------------------|
     26. |         3 10    2     -.5373     1       3      26        3 |
     27. |         3 20    4     .66405     1       3      27        3 |
     28. |         3 30    2    -.36415     1       3      28        3 |
     29. |         3 40    3     .19945     1       3      29        3 |
     30. |         3 50    1   -1.08678     1       3      30        3 |
     31. |         3 60    4     .73621     1       3      31        3 |
     32. |         3 70    4     .74809     1       3      32        3 |
     33. |         3 80    2    -.45847     1       3      33        3 |
     34. |         3 90    3     .63303     1       3      34        3 |
     35. |         3100    2    -.54443     0       3      35        3 |
     36. |         3110    1    -1.3781     0       3      36        3 |
     37. |         3120    3    -.09137     0       3      37        3 |
         |-------------------------------------------------------------|
     38. |         4 10    3    -.20053     1       4      38        4 |
     39. |         4 20    4     .77128     1       4      39        4 |
     40. |         4 30    2    -.90075     1       4      40        4 |
     41. |         4 40    5     1.4545     1       4      41        4 |
     42. |         4 50    3      .5492     1       4      42        4 |
     43. |         4 60    3     .34798     1       4      43        4 |
     44. |         4 70    3     .57425     1       4      44        4 |
     45. |         4 88    4     .72483     1       4      45        4 |
     46. |         4 90    4    1.13641     1       4      46        4 |
     47. |         4100    4      .7428     0       4      47        4 |
     48. |         4110    3     .07936     0       4      48        4 |
     49. |         4120    5     1.4374     0       4      49        4 |
     50. |         4130    4     .70773     0       4      50        4 |
         |-------------------------------------------------------------|
     51. |         5 10    3     .28812     1       5      51        5 |
     52. |         5 20    5     1.2672     1       5      52        5 |
     53. |         5 30    3     .15984     1       5      53        5 |
     54. |         5 40    4    1.06975     1       5      54        5 |
     55. |         5 50    4    1.13389     1       5      55        5 |
     56. |         5 60    3     .20208     1       5      56        5 |
     57. |         5 70    3     .56203     1       5      57        5 |
     58. |         5 80    5    1.38634     1       5      58        5 |
     59. |         5 90    4     .64855     1       5      59        5 |
     60. |         5100    5    1.86468     0       5      60        5 |
     61. |         5110    5    1.53115     0       5      61        5 |
     62. |         5120    4    1.06683     0       5      62        5 |
         |-------------------------------------------------------------|
     63. |         6 10    5    1.33402     1       6      63        6 |
     64. |         6 30    4     .97834     1       6      64        6 |
     65. |         6 40    4     .82034     1       6      65        6 |
     66. |         6 50    4    1.09458     1       6      66        6 |
     67. |         6 60    5    1.48871     1       6      67        6 |
     68. |         6 70    5     1.3944     1       6      68        6 |
     69. |         6 80    5    1.38047     1       6      69        6 |
     70. |         6 90    3     .44144     1       6      70        6 |
     71. |         6100    2    -.26236     0       6      71        6 |
     72. |         6110    2    -.69594     0       6      72        6 |
     73. |         6120    3    -.02173     0       6      73        6 |
         |-------------------------------------------------------------|
     74. |         7 10    5     1.4374     1       7      74        7 |
     75. |         7 20    3    -.17112     1       7      75        7 |
     76. |         7 30    3     .00473     1       7      76        7 |
     77. |         7 40    3    -.20517     1       7      77        7 |
     78. |         7 50    3     .29125     1       7      78        7 |
     79. |         7 60    2    -.57303     1       7      79        7 |
     80. |         7 80    3    -.08485     1       7      80        7 |
     81. |         7 90    3    -.11895     1       7      81        7 |
     82. |         7101    2    -.60675     0       7      82        7 |
     83. |         7102    2    -.47832     0       7      83        7 |
     84. |         7110    2    -.61103     0       7      84        7 |
     85. |         7120    2    -.89282     0       7      85        7 |
     86. |         7130    5    1.46305     0       7      86        7 |
         |-------------------------------------------------------------|
     87. |         8 10    5    1.42465     1       8      87        8 |
     88. |         8 20    4    1.18588     1       8      88        8 |
     89. |         8 30    5    1.55226     1       8      89        8 |
     90. |         8 40    5    1.29114     1       8      90        8 |
     91. |         8 50    4    1.14672     1       8      91        8 |
     92. |         8 60    4     1.0954     1       8      92        8 |
     93. |         8 70    5    1.30324     1       8      93        8 |
     94. |         8 80    5     1.3014     1       8      94        8 |
     95. |         8 90    4     .79918     1       8      95        8 |
     96. |         8100    2    -.39919     0       8      96        8 |
     97. |         8110    3     .26246     0       8      97        8 |
         |-------------------------------------------------------------|
     98. |         9 10    5     1.2672     1       9      98        9 |
     99. |         9 20    5    1.67923     1       9      99        9 |
    100. |         9 30    4    1.12106     1       9     100        9 |
         +-------------------------------------------------------------+
    
    .
    Last edited by Andrew Musau; 17 Mar 2023, 06:32.

    Comment

    Working...
    X