Hi all,
I'm creating linkages between two different datasets hosted by two organisations, essentially linking different aspects of firm information, in Stata 13. Although the same firms appear in each dataset (firm x will exist in both, for example), the data have been anonymised differently by the two organisations, making direct matching impossible (unfortunately each dataset carries completely different information, so an approximate match based on other variables is also impossible). In order to end up with an overall match I have been provided with a concordance file that essentially lists how firm x in the first dataset is linked to firm x in the second. Unfortunately this is not a simple 1:1 concordance, as the datasets carry slightly different information (so it is not, strictly speaking, the same firm classification level). This has led to an awkward scenario where I actually have an m:m concordance file, and so combining the firm data is proving near impossible.
In order to tackle this I have tried creating pseudo-firms from the concordance file, so in a simple example if firms x, y and z are associated with firm 1 in the other dataset I denote all of these as firm A. I do this for all types of concordance, 1:1, m:1, 1:m and m:m - I borrow heavily from the application of the Pierce and Schott algorithm that Van Beveren et al use for CN codes in order to do so. While I think I've done so correctly, the end result is a single pseudo-firm with a ludicrous number of employees, in excess of 2M, so clearly something has gone wrong. I currently believe that this is down to the file that I have been given being wrong, but I felt I could try posting on here to see if the approach I have taken is sensible or not and to be absolutely sure that the problem is on their end.
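To make the grouping step concrete, here is a minimal sketch in Python (not my actual Stata code; the function names and the toy firm IDs are made up purely for illustration). It treats each row of the concordance as a link between an ID in the first dataset and an ID in the second, and labels every connected set of IDs as one pseudo-firm, which is my understanding of what this kind of grouping amounts to.

```python
from collections import defaultdict


def build_pseudo_firms(links):
    """Map every firm ID to a pseudo-firm label (a connected component).

    `links` is an iterable of (id_a, id_b) pairs, one per concordance row.
    IDs are tagged with "A" or "B" so the two anonymised ID spaces never clash.
    """
    parent = {}

    def find(node):
        # Locate the representative of node's group, compressing the path.
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    for id_a, id_b in links:
        root_a, root_b = find(("A", id_a)), find(("B", id_b))
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    return {node: find(node) for node in list(parent)}


def group_sizes(links):
    """Number of IDs in each pseudo-firm, largest first."""
    sizes = defaultdict(int)
    for root in build_pseudo_firms(links).values():
        sizes[root] += 1
    return sorted(sizes.values(), reverse=True)


# Hypothetical toy concordance: x, y, z map to firm 1 (an m:1 case),
# plus two unrelated 1:1 links.
links = [("x", "1"), ("y", "1"), ("z", "1"), ("p", "2"), ("q", "3")]
sizes_before = group_sizes(links)

# Two extra rows that bridge otherwise separate groups.
bridged = links + [("z", "2"), ("q", "2")]
sizes_after = group_sizes(bridged)

print(sizes_before, sizes_after)
```

In this toy data, two bridging rows are enough to collapse three separate pseudo-firms into one. If that is what is happening in my file, then a small number of spurious or overly broad links could plausibly chain everything into the single huge pseudo-firm I am seeing, so inspecting the distribution of group sizes seems like a reasonable first check.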
I would try to provide a working example, but the only example I have been allowed to obtain is of my problematic firm, which covers a total of 42,000 relationships between the first and second classification (and I'm not sure scaling an example like that down is possible given the nature of an m:m relationship; however, I do have this as a dataset if useful). Has anyone had any experience with problems like these, and is there a way to go about it that isn't based on the Pierce and Schott algorithm (which is ideally suited to product classification concordances and maybe not such a general application)? I appreciate this is a little general, so thanks for any time you spend on it!
Many thanks,
Alex