Hello Statalisters,
I'm using Stata IC 15.1 and have sequence data in the following long format:
Code:
ID episode element
1 1 1
1 4 2
1 3 3
2 7 1
2 7 2
2 5 3
ID identifies the person, episode identifies the episode, and element is the position in the sequence
(e.g. the second position for the first person is episode 4). That is, each row corresponds to one episode
in a person's sequence.
I'm using the sq package, installed by ssc install sq.
First, I run:
sqset episode ID element
to designate the dataset as sequence data.
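To make this reproducible, the toy data above can be entered as follows. This is only a sketch of the setup step, and it assumes the sq package is already installed:

```stata
* Enter the six example rows shown above
clear
input ID episode element
1 1 1
1 4 2
1 3 3
2 7 1
2 7 2
2 5 3
end

* Inspect the data, one block per person
list, sepby(ID)

* Declare sequence data: state variable, person identifier, position variable
sqset episode ID element
```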
Now I'm trying to run a Ward cluster analysis in order to group episodes based on a full distance matrix.
The commands I run are:
sqom, full
sqclusterdat
clustermat wardslinkage SQdist, name(wards) add
cluster tree wards, cutnumber(20)
sqclusterdat, return
Based on the dendrogram, I'm trying to create grouping variables for, say, three clusters. As this fails because of ties, I run:
cluster gen group3 = gr(3), name(wards) ties(more)
My understanding is that this should create more than three groups because of ties. However, running
tab group3
produces the following output:
Code:
================================================
     group3 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     10,474       82.54       82.54
          3 |      2,207       17.39       99.94
          4 |          8        0.06      100.00
------------+-----------------------------------
      Total |     12,689      100.00
================================================
That is, Stata has created only three groups, but they are numbered 1, 3 and 4.
I don't really understand what's going on here. Has Stata created four groups (as I would expect) but not assigned any episodes to group 2?
Is it possible for Ward's linkage to produce empty clusters? If so, how would I adequately deal with this situation?
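For what it's worth, here is a minimal sketch of how I would check whether the gaps are just unused codes and, if so, recode the groups to consecutive numbers. It runs on made-up values rather than my data, group3c is a hypothetical variable name, and it says nothing about why the numbering skips:

```stata
* Made-up group codes with a gap at 2, mimicking the pattern in my output
clear
input group3
1
1
3
4
3
end

* tabulate stores the number of distinct codes actually present in r(r)
quietly tabulate group3
display "distinct groups present: " r(r)

* Map the codes that occur onto 1, 2, 3, ... without changing membership
egen group3c = group(group3)

* Cross-tabulate old against new codes to confirm the mapping
tabulate group3 group3c
```

With this, codes 1, 3 and 4 become 1, 2 and 3, so any gap in the original numbering disappears while each observation stays in the same group.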
Running:
cluster gen group4 = gr(4), name(wards) ties(more)
tab group4
produces the exact same results as the command for three groups.
Running
cluster gen group5 = gr(5), name(wards) ties(more)
tab group5
produces:
Code:
================================================
     sq_gr5 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      4,132       32.56       32.56
          3 |      6,342       49.98       82.54
          5 |      2,207       17.39       99.94
          6 |          8        0.06      100.00
------------+-----------------------------------
      Total |     12,689      100.00
================================================
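That is, there are four groups, whereas I expected at least six. Does that imply that groups 2 and 4 don't contain any episodes?
Any assistance is highly appreciated!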
Best regards,
Bernd