Ward cluster analysis after sequence data OM (sqom): Problems generating specified number of groups

Bernd Bender



06 Jul 2018, 07:12

Hello Statalisters,

I'm using Stata IC 15.1 and have sequence data in the following long format:

Code:

ID   episode   element
 1      1         1
 1      4         2
 1      3         3
 2      7         1
 2      7         2
 2      5         3

ID identifies people, episode identifies the episode and element is the position in the sequence
(e.g. the second position is episode 4 for the first person). That is, each row corresponds to an episode
in a person's sequence.
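
For anyone who wants to reproduce the layout, here is a minimal sketch that enters the six example rows shown above by hand. It is only an illustration of the long format; the real dataset is much larger, and clustering six rows will not reproduce the output further down.

```stata
* Sketch only: recreates the six illustrative rows, not the real data.
clear
input ID episode element
1 1 1
1 4 2
1 3 3
2 7 1
2 7 2
2 5 3
end

* One block per person, ordered by position within the sequence
sort ID element
list, sepby(ID) noobs
```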

I'm using the sq package, installed by ssc install sq.

First, I run:

sqset episode ID element

to designate the dataset as sequence data.

Now I'm trying to run a Ward cluster analysis in order to group episodes based on a full distance matrix.

The commands I run are:

sqom, full

sqclusterdat

clustermat wardslinkage SQdist, name(wards) add

cluster tree wards, cutnumber(20)

sqclusterdat, return

Based on the dendrogram, I'm trying to create grouping variables for, say, three clusters. As this fails because of ties, I run:

cluster gen group3 = gr(3), name(wards) ties(more)

My understanding is that this should create more than three groups because of ties. However, running

tab group3

produces the following output:

Code:

===============================
group3   Freq.  Percent    Cum.
-------------------------------
1       10,474    82.54   82.54
3        2,207    17.39   99.94
4            8     0.06  100.00
-------------------------------
Total   12,689   100.00
===============================

I.e. Stata has created only three groups. But looking at the numbers assigned to them, they are groups 1, 3 and 4.

I don't really understand what's going on here. Has Stata created four groups (as I would expect) but not assigned any episodes to group 2?
Is it possible for Ward's linkage to produce empty clusters? If so, how would I adequately deal with this situation?
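
To check whether the gap in the numbering is only a matter of labels, something like the following could be run. This is a minimal sketch, assuming group3 exists as created above; group3_seq is a hypothetical variable name used for illustration. It only counts and relabels the codes in use and does not explain why code 2 is skipped.

```stata
* Sketch only: assumes group3 was created by the cluster generate call above.
* group3_seq is a hypothetical name chosen for illustration.

* How many observations received no group code at all?
count if missing(group3)

* Which distinct group codes are in use, and how many are there?
levelsof group3, local(codes)
local ncodes : word count `codes'
display "codes in use: `codes' (`ncodes' distinct)"

* Relabel the codes in use as consecutive integers 1, 2, 3, ...
egen group3_seq = group(group3)
tabulate group3 group3_seq
```

The cross-tabulation shows which original code maps to which consecutive number, so nothing is merged or lost by the relabelling.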

Running:

cluster gen group4 = gr(4), name(wards) ties(more)

tab group4

produces the exact same results as the command for three groups.

Running

cluster gen group5 = gr(5), name(wards) ties(more)

tab group5

produces

Code:

================================================
     sq_gr5 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      4,132       32.56       32.56
          3 |      6,342       49.98       82.54
          5 |      2,207       17.39       99.94
          6 |          8        0.06      100.00
------------+-----------------------------------
      Total |     12,689      100.00
================================================

I.e. there are four groups, whereas I expected at least six. Does that imply that groups 2 and 4 don't contain any episodes?

Any assistance is highly appreciated!

Best regards,
Bernd