Hi,
even though my problem will be easy to solve I couldn't wrap my head around it:
I have a list of IDs where every Acq_PermID should only occur once. Whenever a Acq_PermID occurs more than once I would like to keep the observation which provides more information, meaning the observation that has fewer missing values in terms of the other variables (Acq_CIK, Acq_ISIN and Acq_LEI).
I tried the following code to sort my data in a way that the Acq_PermID with the most information (fewer missing values) will be first.
Than I would use
to tag the first observation and drop those that were tagged with a 0.
The first double entry of Acq_PermID is observation 23 and 24 where my code produces the desired order: The observation more missing values is sorted last.
However, looking at observation 55 and 56, it can be seen that the order is wrong, meaning that the observation that provides more information is actually sorted last. I am pretty sure I got the gsort command wrong.
Any help is much appreciated.
even though my problem will be easy to solve I couldn't wrap my head around it:
I have a list of IDs where every Acq_PermID should only occur once. Whenever a Acq_PermID occurs more than once I would like to keep the observation which provides more information, meaning the observation that has fewer missing values in terms of the other variables (Acq_CIK, Acq_ISIN and Acq_LEI).
Code:
* Example generated by -dataex-. For more info, type help dataex clear input double Acq_PermID long Acq_CIK str12 Acq_ISIN str20 Acq_LEI 4295236745 . "" "" 4295413777 1092914 "US0493921037" "6A3JIQGY1H24GWRW7H68" 4295492441 . "" "" 4295533401 1351285 "US7777801074" "5299001UYF7L66NG5049" 4295641240 1606757 "US49428J1097" "549300COBYN2GZCBIU87" 4295856698 928366 "AU000000SGM7" "5493002YM7ZHVDMFC916" 4295857066 1314102 "US30233G2093" "549300QE1RU34T50MR69" 4295858063 . "AU000000ALU8" "5299003MG7UFQSIFH576" 4295858704 1400561 "AU000000BLY8" "549300GTMW4DNIWH3044" 4295858798 1389072 "US4223681002" "54930046NXJLV1CZN080" 4295859577 1311370 "BMG540501027" "254900RIBCDJSUFG1A11" 4295859620 1003390 "BMG813651012" "" 4295860320 1124827 "VGG872101032" "" 4295860321 923168 "US90476B1017" "" 4295860391 1350282 "CA00847V1085" "" 4295860444 1375205 "CA91688R1082" "5493004ZEMNWEC7T6X93" 4295860458 1412067 "US14740B6065" "549300UJFWSN15PSBY77" 4295860526 . "CA3060711015" "213800KQY87Z1KNPIM76" 4295860614 316218 "CA4589781034" "" 4295860709 1389415 "US45685V2079" "" 4295860710 1405286 "US88338E1073" "" 4295860727 29590 "" "" 4295860746 903571 "CA38119T8077" "549300NSWNLPBZ2XTO69" 4295860746 800544 "" "" 4295860752 879933 "CA4626221010" "" 4295860898 1394114 "CA7430751031" "" 4295860975 1022705 "CA88157K1012" "5493008TXYN3II3PU369" 4295860976 1076930 "CA67000B1040" "" 4295861065 . "CA46612F4069" "" 4295861092 1016888 "CA5899751013" "" 4295861160 1075124 "CA8849037095" "549300561UZND4C7B569" 4295861494 1178818 "CA8621681011" "" 4295861545 . "CA97653G1000" "" 4295861554 1081831 "" "" 4295861628 1088162 "CA5561621056" "" 4295861639 1365295 "US44109A1051" "" 4295861681 1095102 "CA1409143007" "" 4295861921 . "CA2679202051" "" 4295861924 1023947 "CA0452972072" "" 4295861978 929351 "CA5359194019" "5493000JSMHPHZURLM72" 4295862042 . "CA42226R1001" "" 4295862055 1022282 "CA0018591077" "" 4295862063 1024520 "US9809054003" "" 4295862093 . "" "549300QSB0I5SHJS3O56" 4295862123 1178832 "CA8629261022" "" 4295862202 1419465 "CA86388A1084" "5493003YVQPIJS9TV046" 4295862405 1363851 "CA1488871023" "549300512MQVIWLOO713" 4295862489 1417821 "CA05500N6081" "" 4295862622 1385849 "CA2926717083" "529900P0GCATXMGK8I62" 4295862697 . "CA7659071004" "BVAEBO64S71F02FD6O44" 4295862700 1227767 "CA9292671026" "" 4295862709 1409916 "CA65500B1031" "" 4295862774 1323770 "US6942282061" "" 4295862776 1381531 "US2575592033" "XYSIJOU3HNVROZWNA008" 4295862791 1039519 "" "" 4295862791 823230 "CA4457371090" "549300YJYU2BNA3BLQ43" 4295862809 1041548 "CA04957F1018" "" 4295862809 810922 "" "" 4295862821 1365917 "CA4598751002" "" 4295862823 1177845 "US9839507004" "" end format Acq_PermID %15.0g
Code:
gsort +Acq_PermID -Acq_CIK -Acq_ISIN -Acq_LEI
Code:
egen DUPLI = tag(Acq_PermID)
The first double entry of Acq_PermID is observation 23 and 24 where my code produces the desired order: The observation more missing values is sorted last.
However, looking at observation 55 and 56, it can be seen that the order is wrong, meaning that the observation that provides more information is actually sorted last. I am pretty sure I got the gsort command wrong.
Any help is much appreciated.
Comment