Hello everyone,
I am a new member of the forum, although I have used it quite a few times to solve issues I ran into while struggling with Stata. I have not found any post related to my problem, but if there is one, please excuse me and point me to it.
The basic questions are: why does clustering in Stata depend on the order of the observations? All the observations are analysed, so if an observation fits better in one group than in another, it should be assigned there regardless of the sort order, shouldn't it? Is there a way to avoid this problem?
If you need more information, here you go.
My task is to find a pattern in my data, which consist of patents and contain two variables: the patent ID (a number) and an IPC code (a string like A01B), for roughly 14,000 observations. Each observation shows the ID of a patent and one IPC code that describes its content. If a patent has multiple IPC codes, it appears more than once, one observation per code. For instance, if patent 1234 has IPC codes G06F and H04L, it appears twice, each observation carrying one of those codes. The pattern I have to find concerns the features of the patents, represented by the IPC codes. Thus, I have to cluster my observations on that variable.
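For illustration, the long layout described above might look like the following sketch. Patent 1234 and its two codes come from the example in the text; patent 5678 and its code are made up, and the variable names ID and ipc are simply the ones used in my code further down.

```stata
* toy version of the long dataset: one row per patent-code pair
clear
input long ID str4 ipc
1234 "G06F"
1234 "H04L"
5678 "A01G"
end
list, sepby(ID)
```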
To solve the problem, I thought of reshaping the dataset so that each observation represents exactly one patent together with all the IPC codes that define it. Then I can simply cluster the observations: patents with similar IPC codes should end up in the same group, while patents that are highly dissimilar should be put in different groups.
To put all the IPC codes that define one patent into a single observation, I create a dummy variable for each IPC code: it is 1 if the observation has that IPC code and 0 otherwise. With the IPC codes of each patent coded as 0 or 1, I then use the collapse command to combine all the information on a patent into a single observation. The code I wrote so far to perform this modification is (approximately):
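* one 0/1 indicator per IPC code (ipc is the string variable holding the code)
gen byte nA01G = (ipc == "A01G")
gen byte nA01K = (ipc == "A01K")
...
gen byte nH05K = (ipc == "H05K")
* one row per patent; each n* becomes the number of times that code occurs
collapse (sum) n*, by(ID)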
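The indicator step can also be written without typing each code by hand. This is only a sketch: it assumes that every distinct value of ipc should get its own dummy and that each code is a valid suffix for a Stata variable name (true for codes like A01G). It would be run on the long dataset, in place of the individual gen lines, before the collapse.

```stata
* collect the distinct IPC codes, then build one 0/1 indicator per code
levelsof ipc, local(codes)
foreach c of local codes {
    gen byte n`c' = (ipc == "`c'")
}
* one row per patent, summing the indicators within ID
collapse (sum) n*, by(ID)
```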
The final result is a set of 6,666 observations (sounds evil) and 125 variables: one identifying the patent ID and 124 describing its features (IPC codes). Since for some reason the same IPC code can appear more than once for one patent, and since I summed all the dummy variables n*, the resulting 124 variables contain nonnegative integers ranging from 0 to 13. However, most of the values for any patent are 0, since a patent has only between 1 and 5 or 6 IPC codes.
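If only the presence or absence of a code matters, the counts could be turned back into 0/1 indicators after the collapse. This is a sketch under that assumption; whether the repeated codes carry information worth keeping is an open question.

```stata
* recode each count to 1 if the patent has the code at least once, else 0
foreach v of varlist n* {
    replace `v' = (`v' > 0)
}
summarize n*
```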
At this point, following Example 2 at http://www.stata.com/manuals13/mvclu...plesex2_cllink, I clustered the dataset:
cluster wardslinkage n*
I got results that are not that satisfactory, but whatever: either I take another approach or I try to extract something from what the clustering gave me.
But... when I ran the code again a few more times, I usually got different results for the same dataset and the same clustering command. I figured out that this was due to the sorting of the dataset: sometimes it was sorted by ID, other times by IPC code. Why is that so? It does not make any sense to me. I asked some friends and they were also perplexed. Does anyone have a clue about what is going on? And how can I solve it?
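To make the problem concrete, here is a minimal sketch of how the order dependence can be checked on the collapsed data. The names w_byid, w_rand, g_byid, g_rand and u, the seed, and the choice of 5 groups are arbitrary placeholders. One possible explanation, which is an assumption and not something verified here, is that many patents have identical or equally distant code patterns, so that ties in the dissimilarities get broken according to the order of the observations; duplicates report gives a rough idea of how many identical patterns there are.

```stata
* ID must identify the observations uniquely after the collapse
isid ID
* how many patents share exactly the same set of codes
duplicates report n*

* first run: fix the order by ID before clustering
sort ID
cluster wardslinkage n*, name(w_byid)
cluster generate g_byid = groups(5), name(w_byid)

* second run: same data, rows shuffled reproducibly
set seed 12345
gen double u = runiform()
sort u
cluster wardslinkage n*, name(w_rand)
cluster generate g_rand = groups(5), name(w_rand)

* cross-tabulate the two groupings to see whether they agree
tabulate g_byid g_rand
```

Fixing the sort order (for example, sort ID right before cluster) should at least make the results reproducible from run to run, though it does not remove the dependence on order itself.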
Thanks in advance and have a nice day