
  • Why does clustering consider the order of the observations?

    Hello everyone,

    I am a new member of the forum, although I have used it quite a few times to solve issues I ran into while struggling with Stata. I have not found any post related to my problem, but if there is one, please excuse me and link me to it.

    The basic question is: why does clustering in Stata take the order of the observations into account? All the observations are analysed, so whether an observation fits better in one group or another should be determined regardless of their order, shouldn't it? Is there a way to avoid this problem?

    If you need more information, here you go.
    My task is to find a pattern in my data, which consist of patents and contain two variables: patent ID (a number) and IPC code (a string like A01B), for roughly 14,000 observations. Each observation shows the ID of a patent and one IPC code describing its content; if a patent is described by multiple IPC codes, it appears once per code. For instance, if patent 1234 has IPC codes G06F and H04L, it appears twice, once with each code. The pattern I have to find concerns the features of the patents, represented by the IPC codes, so I have to cluster my observations according to that variable.
    To solve the problem, I decided to reshape the dataset so that each observation represents one patent together with all the IPC codes that define it. I can then simply cluster the observations: patents with similar IPC codes will be gathered into the same group, while highly dissimilar patents will end up in different groups.
    To put all the IPC codes that define one patent into a single observation, I create a dummy variable for each IPC code: 1 if the observation has that code, 0 otherwise. I then use the collapse command to combine all the information about each patent into one observation. The code I wrote so far to perform this reshaping is (approximately):

    foreach i of varlist ipc {
        gen nA01G = (`i'=="A01G")
        gen nA01K = (`i'=="A01K")
        ...
        gen nH05K = (`i'=="H05K")
    }

    collapse (sum) n*, by(ID)
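    Since there are over a hundred codes, hard-coding one gen line per code is tedious; the same dummies can be built in a loop over the observed codes (a sketch, assuming the variables are named ipc and ID as above):

    ```stata
    * Sketch: one 0/1 dummy per observed IPC code, without hard-coding.
    * Assumes a string variable ipc and a patent identifier ID.
    levelsof ipc, local(codes)
    foreach c of local codes {
        gen n`c' = (ipc == "`c'")
    }
    collapse (sum) n*, by(ID)
    ```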

    The final result is a set of 6666 observations (sounds evil) and 125 variables: one identifying the patent ID and the other 124 describing its features (IPC codes). Since for some reason the same IPC code can appear more than once for a patent, and since I summed the dummy variables n*, the 124 resulting variables contain nonnegative integers ranging from 0 to 13. Most of them are 0 for any given patent, however, since the number of IPC codes per patent ranges from 1 to 5 or 6.
    At this point, following example 2 at http://www.stata.com/manuals13/mvclu...plesex2_cllink, I clustered the dataset:

    cluster wardslinkage n*

    The results are not that satisfactory, but never mind: either I take another approach or I try to extract something from what the clustering gave me.

    But... when I ran the code a few more times, I usually got different results for the same dataset and the same clustering command. I figured out that this was due to the sort order of the dataset: sometimes it was sorted by ID, sometimes by IPC code. Why is that? It does not make any sense to me; I asked some friends and they were also perplexed. Does anyone have a clue about what is going on, and how can I solve it?

    Thanks in advance and have a nice day

  • #2
    You should know that kmeans cluster analysis depends on the means of the starting groups. Without a start() option, cluster kmeans picks the starting groups at random, so you can get different results every time you run the code.

    Code:
    webuse labtech, clear
    set seed 452
    * These may yield different results
    cluster kmeans x1 x2 x3 x4, k(8)
    cluster kmeans x1 x2 x3 x4, k(8)
    * Different results: off diagonal terms
    tab _clus_1 _clus_2
    You can get consistent results by using the start() option of cluster kmeans. See help cluster kmeans

    Code:
    * These will yield the same result
    cluster kmeans x1 x2 x3 x4, k(8) start(firstk)
    cluster kmeans x1 x2 x3 x4, k(8) start(firstk)
    tab _clus_3 _clus_4
    Jorge Eduardo Pérez Pérez
    www.jorgeperezperez.com



    • #3
      Okay, but that holds only for the cluster kmeans command, doesn't it? If I need to perform hierarchical clustering through completelinkage or wardslinkage, I do not have that option.



      • #4
        Yes, there does not seem to be a comparable option for the linkage commands. I am not familiar with this method, but from what I gather it should not depend on the order of the observations.

        Jorge Eduardo Pérez Pérez
        www.jorgeperezperez.com



        • #5
          I think so too; that is why I am so puzzled, because apparently the order does play a role somehow. I checked whether I get the same problem with other datasets, and the result does not change: oddly, Stata takes the order of the observations into account when performing hierarchical clustering. Again, any clarification would be really welcome.



          • #6
            Here's a reproducible example that shows that the order of observations has an impact. Note that I am using stable sorts.

            Code:
            clear all
            * Load example dataset from cluster
            webuse homework, clear
            * We'll do the cluster analysis many times, sorting on a different variable each time
            * Note sorts are stable
            forv i=1(1)60 {
                sort a`i', stable
                * Do analysis and save results to a temporary file
                preserve
                cluster wardslinkage a1-a60, measure(matching) name(wardlink)
                * Keep results
                sort wardlink_hgt, stable
                keep wardlink_hgt
                ren wardlink_hgt wardlink_hgt`i'
                tempfile a`i'
                save `a`i''
                restore
            }
            
            * Merge the results and check whether they differ each time: if there are differences, the assertion fails
            clear
            use `a1'
            forv i=2(1)60 {
                merge 1:1 _n using `a`i'', nogen norep
                cap noi assert wardlink_hgt`i'== wardlink_hgt`=`i'-1'
            }
            Jorge Eduardo Pérez Pérez
            www.jorgeperezperez.com



            • #7
              Comparing the _hgt variables from different runs with different sort orders is not what you want to do. The _hgt variable is meaningless without the corresponding _ord and _id variables created along with it (and the _id variable itself reflects the sort order of the data).

              Now to the substance of your question. The first difference you will see when you run an agglomerative hierarchical cluster analysis with the data in a different sort order is merely a difference in labeling. Generate an ID variable (for the webuse homework data you could gen id = _n). If you then run cluster wardslinkage ... followed by cluster dendrogram with the labels(id) option for two runs with the data in different orders (but keeping the original id variable the same), the leaves of the tree will be in a different order, but the substantive meaning will be the same (except for what I will talk about next). For instance, with the homework dataset and id set as I have shown, ids 5 and 28 cluster together early on (as do ids 20 and 23; 2 and 30; 7 and 9; and so forth).
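              One way to make that comparison concrete is to compare group memberships rather than the _hgt variables. A sketch (the 4-group cut, the sort variable, and the run names are arbitrary choices of mine):

              ```stata
              * Sketch: compare two runs by substantive group membership, not by _hgt.
              webuse homework, clear
              gen id = _n                              // stable identifier across sorts
              cluster wardslinkage a1-a60, measure(matching) name(run1)
              sort a5, stable                          // change the sort order between runs
              cluster wardslinkage a1-a60, measure(matching) name(run2)
              cluster generate g1 = groups(4), name(run1)
              cluster generate g2 = groups(4), name(run2)
              sort id
              tab g1 g2    // a merely relabeled partition has one nonzero cell per row
              ```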

              Now, after accounting for that, you may (depending on your data and the (dis)similarity measure you pick) still find some substantive differences. Why? Think of ties in the (dis)similarities. Agglomerative hierarchical clustering starts with each observation as its own cluster (let's say 30 observations, so 30 clusters, as in the homework dataset). The first step is to find the two closest observations and combine them into one cluster (so we now have 29 clusters: 28 singletons and a newly formed cluster of 2 observations). The next step is to find the two closest groups and combine them, and so on. What if there are ties among the (dis)similarities? That would mean that at some steps there is more than one way to pick the two closest groups. Stata picks the first one it finds (it checks them all, but a new winner is declared only if it beats the previous winner). When you change the sort order, you change the order in which the possible pairs of groups are seen.

              Binary data and the (dis)similarity measures used on them are very prone to having ties. I am betting that this is what you are experiencing with your dataset.
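              A back-of-the-envelope check of why ties are so likely here: with p binary variables, the matching measure between two observations is m/p for some integer m between 0 and p, so it can take at most p + 1 distinct values. In the homework data (30 observations, 60 binary variables), 435 pairwise values are squeezed into at most 61 possible values, so ties are guaranteed by the pigeonhole principle:

              ```stata
              * Sketch: pigeonhole argument for ties with binary data.
              webuse homework, clear
              display 30*29/2    // 435 pairs, but the matching measure on 60 binary
                                 // variables can take at most 61 distinct values
              ```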



              • #8
                Great explanation, thank you.
                Jorge Eduardo Pérez Pérez
                www.jorgeperezperez.com
