Issue with Clustering - Distance Measures (centroid, ward)

Sebastian Geiger

14 Nov 2019, 03:36

Hello everyone,

I'm struggling with a problem that is probably quite easy to resolve, but I haven't been able to solve it so far. I'm trying to understand how Stata calculates the distance measure between clusters for centroid and Ward's linkage clustering. To compute those, I use the following commands:

Code:

cluster wardslinkage score, name(ward) measure(L2)
cluster centroidlinkage score, name(center) measure(L2)

This results in the following (test) dataset (I display single linkage as well for possible reference):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(id score) int(center_hgt ward_hgt single_hgt)
1 200 100 100 100
2 300 225 300 200
3 500   .   .   .
end

For centroid clustering: In the first iteration, unit 1 is merged with unit 2 at a distance of 100 (=300-200). I would then expect the centroid of those two observations to be calculated as (200+300)/2 = 250, so the distance between this centroid and the remaining observation 3 should be 250 (=500-250). However, Stata reports a distance of 225. I cannot explain how Stata comes up with this number (the same applies to Ward's linkage, which should be quite similar in this simple case).
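The arithmetic above can be written out as a small check. The sketch below first computes the expected distance to the geometric centroid. It then evaluates one possible explanation, which is an assumption and not confirmed anywhere in this thread: that the heights come from applying the Lance-Williams update formula directly to the unsquared L2 distances. All scalar names are hypothetical and chosen only for illustration.

```stata
* Pairwise L2 (absolute) distances between the three scores
scalar d12 = abs(300 - 200)
scalar d13 = abs(500 - 200)
scalar d23 = abs(500 - 300)

* Expected value: distance from observation 3 to the geometric centroid of {1, 2}
scalar centroid12 = (200 + 300) / 2
scalar d_expected = abs(500 - centroid12)

* ASSUMPTION: Lance-Williams update applied to the unsquared distances.
* Centroid linkage, merging two singletons: weights 1/2, 1/2 and -1/4
scalar d_centroid = 0.5*d13 + 0.5*d23 - 0.25*d12
* Ward's linkage, three singletons: weights 2/3, 2/3 and -1/3
scalar d_ward = (2*d13 + 2*d23 - d12) / 3
* Single linkage: minimum of the two distances
scalar d_single = min(d13, d23)

* Same centroid update on squared distances, then square root
scalar d_centroid_sq = sqrt(0.5*d13^2 + 0.5*d23^2 - 0.25*d12^2)

display "expected (geometric centroid):      " d_expected
display "Lance-Williams centroid, L2:        " d_centroid
display "Lance-Williams Ward, L2:            " d_ward
display "single linkage:                     " d_single
display "Lance-Williams centroid, squared L2: " d_centroid_sq
```

Under this assumption the unsquared update gives 225 for centroid linkage and 300 for Ward's linkage, matching center_hgt and ward_hgt in the listing, while the squared version gives 250, the geometric expectation. If that reading is right, rerunning the commands with measure(L2squared) would be one way to test it.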

Does anyone know how Stata calculates the centroid and/or the distance measure at the second iteration? Thank you in advance!

Best wishes,
Sebastian