ANOVA Differences cluster

Timo Leise

Join Date: Jan 2021

Posts: 20
#1

27 May 2021, 11:39

Dear all,

I have a dataset and divided my observations into 5 clusters (based on 8 criteria) with the following command:

Code:

egen cluster = group(criteria1 criteria2 criteria3 criteria4 criteria5 criteria6 criteria7 criteria8)
tab cluster
tabstat criteria1 criteria2 criteria3 criteria4 criteria5 criteria6 criteria7 criteria8, by(cluster)

Out of the many resulting groups, I chose the five that contain the most observations.
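For readers who want to reproduce this selection step, here is a minimal sketch on simulated data. The variable names criteria1-criteria8 and the helper variables n_cluster, first and top5 are assumptions made for illustration, not names from the original dataset, and this is only one of several ways to flag the largest groups.

```stata
* Simulated stand-in data: 8 binary criteria (names are assumed)
clear
set seed 2021
set obs 1000
forvalues j = 1/8 {
    generate byte criteria`j' = runiform() < 0.3
}

* One group id per observed combination of the 8 criteria
egen cluster = group(criteria1-criteria8)

* Size of each group, and one tagged row per group
bysort cluster: generate long n_cluster = _N
egen byte first = tag(cluster)

* Tagged rows first, largest groups on top; flag the first five of them
gsort -first -n_cluster cluster
generate byte top5 = (first == 1) & (_n <= 5)

* Spread the flag to every observation in the flagged groups
bysort cluster (top5): replace top5 = top5[_N]

tabulate cluster if top5
```

Ties in group size are broken here by the group id, which is an arbitrary choice.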

With an ANOVA I would like to test whether firm profit differs between the clusters, that is, whether the variance in profit between clusters is larger than the variance within clusters.

Is the following command appropriate in your opinion?

Code:

anova Profit 1.cluster 2.cluster 3.cluster 4.cluster 5.cluster

Is ANOVA the right method or should I choose a regression model?
Tags: anova, cluster, regression
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#2

27 May 2021, 22:46

Timo:
1) there's nothing linear that OLS cannot do better than ANOVA;
2) you should group all the clusters together in a 5-level categorical variable and adopt the -long- format (see -reshape-).
After converting your dataset from -wide- to -long-, your code should be something like:

Code:

regress profit i.cluster

Your 5 clusters are too few to invoke clustered standard errors.

Kind regards,
Carlo
(Stata 19.0)
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#3

27 May 2021, 23:37

Wouldn't it be more informative to assess the association between the criteria, themselves, and profit?

Code:

anova Profit criteria? // or regress Profit i.criteria?

You could (1) use all of your data and (2) take advantage of a postestimation command such as -lincom- to examine whether profit systematically differs between any arbitrary constellation of criteria and any other (or within any desired set of multiple constellations of criteria), not limited to the five most frequent.
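A minimal sketch of what this could look like, on simulated data. The variable names criteria1-criteria8, the data-generating process, and the two constellations compared are all assumptions for illustration; the model below has main effects only, so it assumes the criteria act additively on profit.

```stata
* Simulated stand-in data (all names and effect sizes are assumed)
clear
set seed 12345
set obs 500
forvalues j = 1/8 {
    generate byte criteria`j' = runiform() < 0.5
}
generate Profit = 10 + 5*criteria1 - 3*criteria2 + rnormal(0, 4)

* Main-effects model using every observation, not only the five largest groups
regress Profit i.(criteria1-criteria8)

* Constellation A: criteria1 and criteria2 present, all others absent
* Constellation B: only criteria3 present
* Estimated difference in expected profit, A minus B
lincom 1.criteria1 + 1.criteria2 - 1.criteria3
```

-lincom- reports the estimated difference with its standard error and confidence interval, so any pair of constellations can be compared without creating a cluster variable first.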
Timo Leise

Join Date: Jan 2021

Posts: 20
#4

28 May 2021, 00:06

Thank you so much Carlo Lazzaro

Just to make sure that I understand correctly whether I really have to reshape: currently my data looks like this:

ID   Year   Profit   Cluster
1    2018   100      1
2    2018   2        2
3    2018   1        5
4    2018   12       4
5    2018   12       3
6    2018   200      1
7    2018   3        2
8    2018   15       3
...  ...    ...      ...

Is it really necessary to reshape or can I not just directly do the regression?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#5

28 May 2021, 00:15

Timo:
thanks for sharing your data structure.
No, you do not have to -reshape-, as your dataset is already in -long- format.
I thought reshaping was needed because your previous code included 5 categorical variables concerning clusters.
That said, you can safely go:

Code:

regress profit i.cluster

Obviously, Joseph's wise advice is relevant here, because your current code includes -i.cluster- as the only predictor; simple regression models are rarely informative.

Kind regards,
Carlo
(Stata 19.0)
Timo Leise

Join Date: Jan 2021

Posts: 20
#6

28 May 2021, 00:33

Thanks Carlo Lazzaro and Joseph Coveney !

Just to make clear what my research goal is (sorry I did not properly do that before):
- I have 8 binary criteria (e.g. value "1" if the company is a non-profit organization and "0" otherwise). These are criteria 1-8 mentioned in post #1.
- With these 8 criteria I am generating clusters (see #1). I am only taking the largest 5 because the other clusters contain very few companies, e.g. just one.
- What I want to find out is which combination of criteria works best. That's why I thought comparing the clusters could answer my question.

Having stated this: Would you examine all criteria separately or go for the clusters?

One additional question just since I am interested: How would you proceed with an ANOVA?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#7

28 May 2021, 01:10

Timo:
I would go with -clusters-.
As far as anova is concerned:

Code:

oneway profit cluster, bonferroni

Last edited by Carlo Lazzaro; 28 May 2021, 01:24.

Kind regards,
Carlo
(Stata 19.0)
Timo Leise

Join Date: Jan 2021

Posts: 20
#8

28 May 2021, 01:17

Carlo Lazzaro, I think you accidentally copied the previous answer, right?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#9

28 May 2021, 01:24

Timo:
yes, now edited.

Kind regards,
Carlo
(Stata 19.0)
Timo Leise

Join Date: Jan 2021

Posts: 20
#10

28 May 2021, 02:31

Thank you so much Carlo Lazzaro

I tried the command

Code:

regress profit ibn.cluster, noconst

I added "ibn" since otherwise one cluster would be used as the reference group (which is not what I want to do)

All of the clusters are significant, but their coefficients differ slightly.
How can I test whether the clusters differ significantly from each other with regard to their impact on profit?
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#11

28 May 2021, 02:59

Originally posted by Timo Leise

I added "ibn" since otherwise one cluster would be used as the reference group (which is not what I want to do)
How can I test if there are significant differences between the clusters with regards to the impact on profit?

Code:

regress profit ibn.cluster, noconstant
testparm i.cluster, equal

// or

regress profit i.cluster
testparm i.cluster

// or

anova profit cluster

You can verify that they give you the same test results.
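One way to check this yourself is on simulated data, as in the sketch below. The data-generating process and the scalar names F_a, F_b and F_c are assumptions for illustration. The last two lines add Bonferroni-adjusted pairwise comparisons, which is one possible follow-up if the overall test rejects equality.

```stata
* Simulated stand-in data: 5 clusters with different mean profit (assumed)
clear
set seed 42
set obs 300
generate byte cluster = ceil(5 * runiform())
generate profit = 10 + 2*cluster + rnormal(0, 6)

* (a) Cell-means parameterization, then test that all five means are equal
regress profit ibn.cluster, noconstant
testparm i.cluster, equal
scalar F_a = r(F)

* (b) Reference-category parameterization, joint test of the four contrasts
regress profit i.cluster
testparm i.cluster
scalar F_b = r(F)

* (c) One-way ANOVA; with a single factor the model F is the cluster F
anova profit cluster
scalar F_c = e(F)

* The three F statistics should agree up to rounding
display F_a, F_b, F_c

* Which pairs of clusters differ? Bonferroni-adjusted pairwise comparisons
quietly regress profit i.cluster
pwcompare cluster, mcompare(bonferroni) effects
```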
Timo Leise

Join Date: Jan 2021

Posts: 20
#12

28 May 2021, 03:04

Thanks Joseph Coveney !

After the -testparm- command: if the reported p-value is smaller than 0.05, does that mean that the groups are significantly different from each other with regard to profit?
Same question for the ANOVA: how can I interpret a significant Prob > F?

Many many thanks for all your help!
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#13

28 May 2021, 17:26

Originally posted by Timo Leise

How can I interpret a significant Prob > F?

I think that it would be nigh on impossible to interpret cleanly.

You've first semi-arbitrarily grouped firms into collections such that they're different on the basis of (possibly happenstance) combinations of eight selected characteristics. Then you further select a subset of these collections (the five most numerous) and test to see whether they're different on the basis of profit, as well. Lo and behold, they are. I don't know what your research question is, but if it has to do with exploring associations between these eight firm characteristics and profit, then the result of this exercise strikes me as conceptually vapid.

If your research question is as implied by the dataset you describe, then wouldn't you be better off fitting a regression model that more directly relates all eight firm characteristics to profit and doing your exploring postestimation, as I described above in #3?
Timo Leise

Join Date: Jan 2021

Posts: 20
#14

30 May 2021, 04:20

Thank you Joseph Coveney

Originally posted by Joseph Coveney

You've first semi-arbitrarily grouped firms into collections such that they're different on the basis of (possibly happenstance) combinations of eight selected characteristics. Then you further select a subset of these collections (the five most numerous) and test to see whether they're different on the basis of profit, as well. Lo and behold, they are. I don't know what your research question is, but if it has to do with exploring associations between these eight firm characteristics and profit, then the result of this exercise strikes me as conceptually vapid.

The eight characteristics are based on previous research and were not chosen at random.
My research question is which combinations of criteria are actually used by companies (that's why I only chose the most common ones and not combinations that are used by only one company) and which combinations work best (that's why I look at the clusters).

Originally posted by Joseph Coveney

If your research question is as implied by the dataset you describe, then wouldn't you be better off fitting a regression model that more directly relates all eight firm characteristics to profit and doing your exploring postestimation, as I described above in #3?

Thanks for the idea. Could you maybe add some more information on how exactly you would test all possible combinations of characteristics? I thought that to test this I would need interaction terms.
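For what it is worth, here is a minimal sketch of the interaction-term approach on simulated data, restricted to three criteria to keep the output readable. The variable names and the data-generating process are assumptions. With all eight binary criteria a full factorial has 256 cells, many of which would likely be empty or nearly empty in a real dataset, so a full eight-way model may not be estimable.

```stata
* Simulated stand-in data with three binary criteria (all names assumed)
clear
set seed 2021
set obs 800
forvalues j = 1/3 {
    generate byte criteria`j' = runiform() < 0.5
}
generate Profit = 10 + 4*criteria1 + 2*criteria2 ///
    + 3*criteria1*criteria2 + rnormal(0, 5)

* Full factorial: main effects plus all two- and three-way interactions
regress Profit i.criteria1##i.criteria2##i.criteria3

* Predicted mean profit for each of the 8 combinations
margins criteria1#criteria2#criteria3

* All pairwise differences between combinations, Bonferroni-adjusted
pwcompare criteria1#criteria2#criteria3, mcompare(bonferroni)
```

The -margins- output gives the expected profit per combination, and -pwcompare- indicates which combinations differ from one another after adjusting for the number of comparisons.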
Timo Leise

Join Date: Jan 2021

Posts: 20
#15

03 Jun 2021, 23:29

Joseph Coveney Do you have an idea for the questions above? Many thanks in advance