Dear all,
I have a dataset of firms and the individuals connected with them, plus several descriptive variables. The full dataset has about 1,700,000 observations and 405,000 firms, but I am first testing the code on a small sample of 30-40 observations. For each firm, I want to construct all possible pairs of individuals connected with that firm (think of it as a business network). I read the post on the Stata site titled "How do I produce a dataset based on all possible pairs of identifiers within each group" (link posted), and it seems to do exactly that.
My issue is that the values of the constructed variable (id2 in the post, ID_pair in my data) differ slightly from the original (id and ID, respectively). For example, instead of 81427461, ID_pair takes the value 81427464.
Here is a data sample:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 DM_id_number long(ID firm_id) float(gfreq numid2 ID_pair)
"P081427461"  81427461 1 6 1  39590112
"P081427461"  81427461 1 6 2  74116440
"P081427461"  81427461 1 6 3  81427464
"P081427461"  81427461 1 6 4 206375584
"P081427461"  81427461 1 6 5 223103504
"P081427461"  81427461 1 6 6 247809792
"P223103507" 223103507 1 6 1  39590112
"P223103507" 223103507 1 6 2  74116440
"P223103507" 223103507 1 6 3  81427464
"P223103507" 223103507 1 6 4 206375584
"P223103507" 223103507 1 6 5 223103504
"P223103507" 223103507 1 6 6 247809792
"P247809795" 247809795 1 6 1  39590112
"P247809795" 247809795 1 6 2  74116440
"P247809795" 247809795 1 6 3  81427464
"P247809795" 247809795 1 6 4 206375584
"P247809795" 247809795 1 6 5 223103504
"P247809795" 247809795 1 6 6 247809792
end
label values firm_id firm_id
label def firm_id 1 "DE2010198197", modify
and here is my code:
Code:
sort firm_id
by firm_id: gen gfreq = _N
expand gfreq
sort firm_id ID
by firm_id ID: gen numid2=_n
by firm_id: gen ID_pair = ID[gfreq*numid2]
drop if ID == ID_pair
drop if ID > ID_pair
So, the questions are:
- Does anyone know why this happens and how it can be fixed?
- Can this work with factor variables as well? (I tried, but got mostly missing values.)
I need them because, once I have constructed the person-to-person pairs, I want to aggregate them to firm-to-firm pairs and match the postcodes for each firm. I should mention that I also use the same code to construct the firm-to-firm pairs directly, but I need to do it this way as well, because the direct firm-to-firm pairs also include firms owned by the same person, and I want to avoid that.
Thanks in advance, and thank you for maintaining this forum, which is super useful.
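A possible explanation for the first question, offered as a sketch and not a confirmed diagnosis: the -dataex- header lists ID_pair as float, which is the default storage type when -generate- is given no type. A float holds only about seven significant digits exactly, so a nine-digit ID such as 81427461 is rounded to the nearest representable value, 81427464. The example below assumes that this is the cause, rebuilds the pairs from the three IDs in the data sample for a single firm (the six-person group in the sample is cut down to three here for brevity), and declares ID_pair as long so that the digits are kept.

```stata
* Sketch under the assumption that the changed digits come from float storage.
* Only the three IDs visible in the data sample are used, all in firm 1.
clear
input long ID byte firm_id
 81427461 1
223103507 1
247809795 1
end

* float() rounds its argument to float precision; this displays 81427464
display %12.0f float(81427461)

* Same steps as in the post
sort firm_id
by firm_id: gen gfreq = _N
expand gfreq
sort firm_id ID
by firm_id ID: gen numid2 = _n

* Only change: an explicit storage type (long; double would also work)
by firm_id: gen long ID_pair = ID[gfreq*numid2]

* Keep each unordered pair once and drop self-pairs
drop if ID >= ID_pair
list firm_id ID ID_pair, noobs
```

With three individuals this leaves three pairs, and ID_pair now holds the original nine-digit IDs. The indexing assumes, as the code in the post does, that each ID appears only once per firm before -expand-.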