Hello. I have a string variable and I need a way to find pairs of similar but not identical values, for example by flagging them with some kind of identifier or categorical variable. In the example below, the "similar" variable indicates that observations 1 and 9 are quite similar, as are observations 5 and 6.
My dataset has hundreds of thousands of observations, so I obviously can't do this by eye. I don't need a perfect method, just something that gives an approximate first result that I can then review visually. Any ideas? Thanks
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float id str18 var float similar
 1 "JUAN PEREZ"         1
 2 "JUAN GONZALEZ"      .
 3 "ARTURO PRAT"        .
 4 "ARTURO VIDAL"       .
 5 "SALVADOR ALLENDE"   2
 6 "SALVADOR ALLENDE G" 2
 7 "DON MATIAS"         .
 8 "DIEGO SOTO"         .
 9 "JUAN A PEREZ"       1
10 "JUAN PABLO SASA"    .
end