
  • ANOVA Differences cluster

    Dear all,

    I have a dataset and divided my observations into 5 clusters (based on 8 criteria) with the following command:
    Code:
    egen cluster = group(criteria1 criteria2 criteria3 criteria4 criteria5 criteria6 criteria7 criteria8)
    tab cluster
    tabstat criteria1 criteria2 criteria3 criteria4 criteria5 criteria6 criteria7 criteria8, by(cluster)

    Out of the many resulting groups, I chose the five that contain the most observations.
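One way to keep only the five largest groups is sketched below. It is a minimal illustration on synthetic data: the three criteria variables, the seed, and the helper variables groupsize, tag and rank are assumptions made for the example, not part of the original dataset.

```stata
* Synthetic data for illustration only (three binary criteria instead of eight)
clear
set seed 2021
set obs 400
forvalues j = 1/3 {
    generate byte criteria`j' = runiform() < 0.3
}
egen cluster = group(criteria1 criteria2 criteria3)

* Number of observations in each group
bysort cluster: generate long groupsize = _N

* Flag one observation per group, then rank groups from largest to smallest
egen byte tag = tag(cluster)
gsort -tag -groupsize cluster
generate rank = _n if tag

* Copy each group's rank to all of its observations (missing sorts last)
bysort cluster (rank): replace rank = rank[1]

* Keep the five largest groups; ties are broken by the cluster number
keep if rank <= 5
tabulate cluster
```

Whether dropping the smaller groups is appropriate depends on the research question; the sketch only shows the mechanics.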

    With an ANOVA I would like to test whether firm profit differs between the clusters, that is, whether the variance in profit between clusters is larger than the variance within clusters.


    Is the following command appropriate in your opinion?
    Code:
    anova Profit 1.cluster 2.cluster 3.cluster 4.cluster 5.cluster
    Is ANOVA the right method or should I choose a regression model?

  • #2
    Timo:
    1) there's nothing linear that OLS cannot do better than ANOVA;
    2) you should group all the clusters together in a 5-level categorical variable and adopt the -long- format (see -reshape-).
    After converting your dataset from -wide- to -long-, your code should be something like:
    Code:
    regress profit i.cluster
    Your 5 clusters are too few to invoke clustered standard errors.
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Wouldn't it be more informative to assess the association between the criteria, themselves, and profit?
      Code:
      anova Profit criteria?
      // or
      regress Profit i.criteria?
      You could (1) use all of your data and (2) take advantage of a postestimation command such as -lincom- to examine whether profit systematically differs between any arbitrary constellation of criteria and any other (or within any desired set of multiple constellations of criteria), not limited to the five most frequent.
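A minimal sketch of the -lincom- idea, using synthetic data: the three criteria variables, the coefficients used to generate profit, and the seed are all made up for illustration. It assumes a main-effects (additive) model, which is only one possible specification.

```stata
* Synthetic data for illustration only; names and effect sizes are hypothetical
clear
set seed 12345
set obs 500
forvalues j = 1/3 {
    generate byte criteria`j' = runiform() < 0.5
}
generate profit = 10 + 5*criteria1 + 3*criteria2 - 2*criteria3 + rnormal(0, 4)

* Main-effects model relating every criterion to profit
regress profit i.(criteria1 criteria2 criteria3)

* Compare the constellation (1,1,0) with the constellation (0,0,1).
* Under the additive model the difference in expected profit is
* (_cons + b1 + b2) - (_cons + b3) = b1 + b2 - b3
lincom 1.criteria1 + 1.criteria2 - 1.criteria3
```

If the criteria are suspected to interact, the same approach could be applied to a model with interaction terms (for example criteria1##criteria2##criteria3), at the cost of more parameters and sparser cells.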

      • #4
        Thank you so much Carlo Lazzaro

        Just to make sure that I get it right and I really have to reshape. Currently my data looks like this:
        ID Year Profit Cluster
        1 2018 100 1
        2 2018 2 2
        3 2018 1 5
        4 2018 12 4
        5 2018 12 3
        6 2018 200 1
        7 2018 3 2
        8 2018 15 3
        ... ... ... ...

        Is it really necessary to reshape or can I not just directly do the regression?

        • #5
          Timo:
          thanks for sharing your data structure.
          No, you do not have to -reshape-, as your dataset is already in -long- format.
          I thought reshaping was needed because your previous code included 5 categorical variables concerning clusters.
          That said, you can safely go:
          Code:
          regress profit i.cluster
          Obviously, Joseph's wise advice is relevant here, because your current code includes -i.cluster- as its only predictor; simple regression models are rarely informative.

          Kind regards,
          Carlo
          (Stata 19.0)

          • #6
            Thanks Carlo Lazzaro and Joseph Coveney !

            Just to make clear what my research goal is (sorry I did not properly do that before):
            - I have 8 binary criteria (e.g. value "1" if the company is a non-profit organization and "0" otherwise). These are criteria 1-8 mentioned in post #1.
            - With these 8 criteria I am generating clusters (see #1). I am taking only the largest 5 because the other clusters are very small, e.g. with just one company in them.
            - What I want to find out is which combination of criteria works best. That's why I thought comparing the clusters could answer my question.

            Having stated this: Would you examine all criteria separately or go for the clusters?

            One additional question, just out of interest: how would you proceed with an ANOVA?

            • #7
              Timo:
              I would go with -clusters-.
              As far as anova is concerned:
              Code:
              oneway profit cluster, bonferroni
              Last edited by Carlo Lazzaro; 28 May 2021, 01:24.
              Kind regards,
              Carlo
              (Stata 19.0)

              • #8
                Carlo Lazzaro I think you accidentally copied the previous answer, right?

                • #9
                  Timo:
                  yes, now edited.
                  Kind regards,
                  Carlo
                  (Stata 19.0)

                  • #10
                    Thank you so much Carlo Lazzaro

                    I tried the command

                    Code:
                    regress profit ibn.cluster, noconst
                    I added "ibn" since otherwise one cluster would be used as the reference group (which is not what I want to do)

                    All of the cluster coefficients are significant, but they differ slightly from one another.
                    How can I test whether the clusters differ significantly with regard to their impact on profit?

                    • #11
                      Originally posted by Timo Leise
                      I added "ibn" since otherwise one cluster would be used as the reference group (which is not what I want to do)
                      How can I test if there are significant differences between the clusters with regards to the impact on profit?
                      Code:
                      regress profit ibn.cluster, noconstant
                      testparm i.cluster, equal
                      // or
                      regress profit i.cluster
                      testparm i.cluster
                      // or
                      anova profit cluster
                      You can verify that they give you the same test results.
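The equivalence can be checked on a shipped example dataset. The sketch below uses -sysuse auto-, with rep78 (five levels) standing in for the cluster variable and price standing in for profit; these stand-ins and the scalar names are assumptions for illustration only.

```stata
sysuse auto, clear
* rep78 (5 levels) stands in for cluster, price stands in for profit

* Cell-means parameterization: test that all group means are equal
quietly regress price ibn.rep78, noconstant
testparm i.rep78, equal
scalar F_equal = r(F)

* Reference-group parameterization: test that all contrasts are jointly zero
quietly regress price i.rep78
testparm i.rep78
scalar F_joint = r(F)

* One-way ANOVA: overall model F
quietly anova price rep78
scalar F_anova = e(F)

display "F statistics: " F_equal "  " F_joint "  " F_anova
```

All three approaches test the same null hypothesis of equal group means, so the three F statistics should agree up to rounding.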

                      • #12
                        Thanks Joseph Coveney !

                        After the -testparm- command: if the Prob > chi2 is smaller than 0.05, does that mean that the groups differ significantly from each other with regard to profit?
                        Same question for the ANOVA: how can I interpret a significant Prob > F?

                        Many many thanks for all your help!

                        • #13
                          Originally posted by Timo Leise
                          How can I interpret a significant Prob > F?
                          I think that it would be nigh on impossible to interpret cleanly.

                          You've first semi-arbitrarily grouped firms into collections such that they're different on the basis of (possibly happenstance) combinations of eight selected characteristics. Then you further select a subset of these collections (the five most numerous) and test to see whether they're different on the basis of profit, as well. Lo and behold, they are. I don't know what your research question is, but if it has to do with exploring associations between these eight firm characteristics and profit, then the result of this exercise strikes me as conceptually vapid.

                          If your research question is as implied by the dataset you describe, then wouldn't you be better off fitting a regression model that more directly relates all eight firm characteristics to profit and doing your exploring postestimation, as I described above in #3?

                          • #14
                            Thank you Joseph Coveney

                            Originally posted by Joseph Coveney
                            You've first semi-arbitrarily grouped firms into collections such that they're different on the basis of (possibly happenstance) combinations of eight selected characteristics. Then you further select a subset of these collections (the five most numerous) and test to see whether they're different on the basis of profit, as well. Lo and behold, they are. I don't know what your research question is, but if it has to do with exploring associations between these eight firm characteristics and profit, then the result of this exercise strikes me as conceptually vapid.
                            The eight characteristics are based on previous research and were not chosen at random.
                            My research question is which combinations of criteria are actually used by companies (that's why I chose only the most common ones and not combinations that are used by just one company) and which combinations work best (that's why I look at the clusters).

                            Originally posted by Joseph Coveney
                            If your research question is as implied by the dataset you describe, then wouldn't you be better off fitting a regression model that more directly relates all eight firm characteristics to profit and doing your exploring postestimation, as I described above in #3?
                            Thanks for the idea. Could you add some more information on exactly how you would test all possible combinations of characteristics? I thought that I would need interaction terms to test this.

                            • #15
                              Joseph Coveney Do you have an idea for the questions above? Many thanks in advance
