
  • What test to choose?

    I believe I tried every test on both SPSS and Stata, but I keep on encountering problems.
    Maybe you can help me to think from scratch.

    The dependent variable is logo (0 or 1 whether there is a logo on the label or not)
    The independent variable is price (can be in euros or in 3 ordinal price categories)

    There are 5 control variables
    4 are either in dummy or categorical: Region(44 different), Country(9 different), Store(5 different), Colour(2 different).
    1 is the number of hectares but can be transformed into categories as well.
    N=161

    Problem:
    some Regions occur only once and are therefore perfect predictors (collinearity error).
    I tried to categorize the regions, but the results don't seem to be right.

    What test should I pick?
    And what code should I use?




  • #2
    Hello Maarten. Do you really have 44 regions? Or was that a typo? I ask, because if you are including a variable with 44 levels, you are over-fitting your model. (See this nice article by Mike Babyak for more info on over-fitting.) It's also not clear to me whether you have clustering that needs to be taken into account. E.g., are the regions clustered within countries? HTH.
    --
    Bruce Weaver
    Email: [email protected]
    Version: Stata/MP 18.5 (Windows)



    • #3
      Maarten:
      posting what you typed and what Stata gave you back can help (as usual).
      That said, have you considered interactions (say between Store and Price)?
      As an aside, please note that categorizing continuous predictors is, in general, a bad idea: http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf

      PS: crossed in the cyberspace with Bruce's helpful advice.
      Kind regards,
      Carlo
      (StataNow 18.5)



      • #4
        Hello Bruce and Carlo,

        I have indeed 44 regions.(the data is about wine labels, and the wines come from 44 different regions)

        I am not sure what you mean with clustering.
        Each region belongs to a certain country. but I have a variable for both.
        How can I check this?

        Regarding interactions, that is a good idea. You mean by entering # between variables, right?

        As a solution to over-fitting, I was thinking about combining the regions into only 4 types of regions, i.e. Large unknown regions, Specified and expensive regions.

        Do you think Probit or regression is suited, or should I think about different tests?



        • #5
          Maarten:
          - yes, I mean something along the lines you sketch in your reply (please see -fvvarlist- for further details, especially for the difference between -#- and -##-);
          - Bruce is correct in suspecting that clustering may be an issue with your data, as regions that belong to the same country are probably more similar than regions belonging to different countries. By the way, having regions nested within countries can be enough to consider a hierarchical model (see -melogit-);
          - gathering Regions together according to a given set of criteria may be a wise approach (although I would not take it for granted that over-fitting will disappear);
          - finally, I'm not clear what you mean by test (tests are inferential procedures to detect differences and/or statistical significance): I think that you're seeking advice about a regression model instead. That said, if your dependent variable is categorical (yes/no), there's no room for linear regression (-regress-) and you should consider -logit- or -probit-, which give similar results.
          Kind regards,
          Carlo
          (StataNow 18.5)
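
The points in #5 can be sketched in Stata roughly as follows. This is a minimal sketch, not the poster's actual code: the variable names Vegan, Price, Store, Country and ColourD are taken from later posts in this thread, and Region (the original 44-level variable) is an assumed name. The tagRC and nCountries helpers are hypothetical names introduced only for the nesting check.

```stata
* Check whether regions are nested within countries:
* tag one observation per Region-Country pair, then count pairs per Region.
* If nCountries equals 1 for every observation, each region sits in one country.
egen byte tagRC = tag(Region Country)
egen nCountries = total(tagRC), by(Region)
tabulate nCountries

* Binary outcome: probit and logit usually lead to similar conclusions
probit Vegan c.Price i.Store i.ColourD
logit  Vegan c.Price i.Store i.ColourD

* Interactions: ## gives main effects plus the interaction, # the interaction only
probit Vegan c.Price##i.Store i.ColourD

* Hierarchical alternative: random intercept for country
melogit Vegan c.Price i.Store i.ColourD || Country:
```

With N=161 and only 9 countries, the interaction and multilevel models may be hard to estimate reliably, so these lines are best read as syntax illustrations rather than as a recommended specification.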



          • #6
            Thank you for that advice.

            Very good to know that regression is not possible!

            I have combined the 44 regions into 4 categories
            When I run it as 4 dummies, the results are as follows.


                         |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
                   Price |   .3477105   .0984887     3.53   0.000     .1546762    .5407448
                Region1D |   3.760862   352.7587     0.01   0.991    -687.6335    695.1552
                Region2D |   3.752317   352.7583     0.01   0.992    -687.6413    695.1459
                Region3D |   3.692831   352.7584     0.01   0.992     -687.701    695.0867


            I have also placed them under 1 categorical variable (instead of 4 dummies);
            these are the results.
            I do not understand the empty categories, and what should I do with the omitted ones?
            Is this output interpretable, or should I make more changes?


                         |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
                   Price |   .3470497   .0986867     3.52   0.000     .1536273     .540472
                         |
                 Region1 |
                      1  |          0  (empty)
                      2  |  -.0086384   .4393754    -0.02   0.984    -.8697984    .8525216
                      3  |  -.0565939   .5758633    -0.10   0.922    -1.185265    1.072077
                      4  |          0  (omitted)
                         |
                   Store |
                      2  |  -1.124323   .4174382    -2.69   0.007    -1.942487   -.3061589
                      3  |  -1.644975   .6311591    -2.61   0.009    -2.882024   -.4079256
                      4  |  -.9489177   .5176832    -1.83   0.067    -1.963558    .0657228
                      5  |  -1.153552   .4774939    -2.42   0.016    -2.089422   -.2176809
                         |
                 Country |
                      2  |          0  (empty)
                      3  |   1.030602   .6420614     1.61   0.108    -.2278153    2.289019
                      4  |  -.3744293   .6855527    -0.55   0.585    -1.718088    .9692293
                      5  |          0  (empty)
                      6  |   1.426111   .7480216     1.91   0.057     -.039984    2.892207
                      7  |          0  (empty)
                      8  |          0  (empty)
                      9  |   .5941595   .6767971     0.88   0.380    -.7323384    1.920657
                         |
               1.ColourD |   .0873519   .2673539     0.33   0.744     -.436652    .6113559
                   _cons |  -2.418361   .9259913    -2.61   0.009    -4.233271   -.6034517




            • #7
              Clyde Schechter has discussed this problem at some length in a separate topic started by Maarten.

              https://www.statalist.org/forums/for...ummy-in-probit



              • #8
                Maarten:
                you did not post your regression code, hence it is difficult to comment on your results helpfully.
                If you have used factor-variable notation (see -fvvarlist-), one of the levels of each categorical variable is omitted automatically by Stata to protect you against the dummy trap (see: https://en.wikipedia.org/wiki/Dummy_...le_(statistics)).
                Kind regards,
                Carlo
                (StataNow 18.5)
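
One way to investigate the (empty) rows in the output of #6 is to look at the cell counts directly. The following is a sketch under assumptions: it uses the variable names that appear in the output and in the later posts (Vegan, Price, Region1, Store, Country, ColourD), and it assumes that the (empty) label marks levels with no observations left in the estimation sample.

```stata
* Cross-tabulate each categorical control against the outcome;
* levels where Vegan is always 0 or always 1 predict the outcome perfectly
tabulate Country Vegan
tabulate Region1 Vegan

* Refit, then see which levels are still present in the estimation sample
probit Vegan Price i.Region1 i.Store i.Country i.ColourD
tabulate Country if e(sample)

* The base (omitted) level can be chosen explicitly, e.g. level 2 of Store
probit Vegan Price i.Region1 ib2.Store i.Country i.ColourD
```

Comparing the first two tables with the last one should show whether the (empty) levels are the ones with very few observations or with no variation in Vegan.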



                • #9
                  the code was
                  probit Vegan Price i.Region1 i.Store i.Country i.ColourD
                  Does this help?



                  • #10
                    Maarten:
                    please use CODE delimiters to share what you typed and what Stata gave you back (see the FAQ on this).
                    That said, by including both regions and countries, you surely experience multicollinearity problems.
                    Kind regards,
                    Carlo
                    (StataNow 18.5)



                    • #11
                      That is important advice, thank you.
                      I clustered the country and regions.
                      1 dummy for country (Old vs New world countries, [That is a classification in the wine world, comes roughly down to European vs other wine countries])
                      And 2 dummies for 3 region categories based on characteristics.

                      Code:
                       probit Vegan Price CountryNewOld ColourD i.RegionCat i.Store
                      I received the following output,
                      It seems good to me.
                      [Image: Knipsel Stata.PNG (screenshot of the Stata output)]

                      What do you think?
                      Can I include this in my report?


                      Why is it that the results differ so much when I change the order of the categories?
                      E.g. instead of 1 Germany, 2 Bordeaux, 3 Marlborough --> 1 Bordeaux, 2 Germany, 3 Marlborough.
                      This changes all the results, including the significance of the independent variable.
                      I understand now that one category is taken out because of the dummy trap, but why do the results vary so strongly?
                      Which order should I pick?

                      Thank you so much



                      • #12
                        Maarten:
                        actually, you did not -cluster- countries and regions, you simply adjusted for their effects.
                        Clustering concerns the standard errors, not the point estimates (ie, the coefficients).
                        The fact that results change (and possibly their signs flip) when you change the reference category (ie, the level of the categorical variable that is automatically omitted by Stata to protect you from the dummy trap) is pretty normal. However, given that most of your coefficients are not statistically significant, the change in their signs is practically immaterial.
                        The main drivers of your results seem to be -price- and one typology of store (which may end up so due to a pretty different number of observations vs the remaining types of stores): check if this outcome is in line with the literature of your research field and/or other reference standard.
                        I would also check the joint significance of regions and country via -testparm-.
                        Kind regards,
                        Carlo
                        (StataNow 18.5)
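
The distinction drawn in #12 can be illustrated with the model from #11. This is a sketch, not a recommended final specification: ib2.RegionCat assumes that level 2 of RegionCat exists, and clustering on Country assumes the original 9-level Country variable is still in the dataset. The names base1 and base2 are arbitrary labels for the stored estimates.

```stata
* Same model with two different reference categories for RegionCat
probit Vegan Price CountryNewOld ColourD i.RegionCat i.Store
estimates store base1
probit Vegan Price CountryNewOld ColourD ib2.RegionCat i.Store
estimates store base2

* Compare side by side: the RegionCat coefficients and _cons differ,
* while the log likelihood and N are the same
estimates table base1 base2, stats(ll N)

* Clustering changes the standard errors, not the coefficients
probit Vegan Price CountryNewOld ColourD i.RegionCat i.Store, vce(cluster Country)
```

Changing only the base level is a reparameterization of the same model, so the coefficient on Price should be identical in base1 and base2. If it is not, something other than the reference category has changed, for example how the region categories were coded. Cluster-robust standard errors based on only 9 countries should be read with caution.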



                        • #13
                          Hello Carlo,
                          Yes, you are right, I 'gathered them' as you suggested at #5:
                          ''gathering Regions together according to a given set of criteria may be a wise approach (although I would not take it for granted that over-fitting will disappear)''
                          Is it allowed how I did it or should I combine the findings differently?

                          I checked the store significance and there is no real explanation for it, unfortunately.
                          I also do not see a resemblance between the 2 stores.


                          Code:
                          testparm i.RegionCat CountryNewOld
                          And I got this outcome on the testparm you advised.

                          ( 1) [Vegan]CountryNewOld = 0
                          ( 2) [Vegan]2.RegionCat = 0
                          ( 3) [Vegan]3.RegionCat = 0

                          chi2( 3) = 5.90
                          Prob > chi2 = 0.1163

                          What is the cut point to know whether there is a joint significance?



                          • #14
                            Maarten:
                            the usual, arbitrarily chosen, cut point is 0.05.
                            However, the lack of statistical significance may mean two things, mainly:
                            - there's a difference in regions, but your sample is not large enough to show it;
                            - there's actually no difference in regions (when adjusted for the remaining predictors). Hence, the question should be: is it good or bad? Is it good that all over the country customers receive the same product/service? Or not? Or else?
                            That said, I would test the joint significance of the levels of the same categorical variables, that is:
                            Code:
                            testparm i.RegionCat
                            Kind regards,
                            Carlo
                            (StataNow 18.5)
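
As a sketch of how the joint tests discussed here could be run after the probit from #11 (variable names as in that post), the Wald test from -testparm- can be complemented by a likelihood-ratio test. The labels full and reduced are arbitrary names for the stored estimates, and using -lrtest- here is an addition for illustration, not something suggested in the thread.

```stata
* Full model, then Wald tests on each set of categorical terms
probit Vegan Price CountryNewOld ColourD i.RegionCat i.Store
testparm i.RegionCat
testparm i.Store
estimates store full

* Likelihood-ratio alternative: drop RegionCat, keep the same observations
probit Vegan Price CountryNewOld ColourD i.Store if e(sample)
estimates store reduced
lrtest full reduced
```

In a sample of 161 observations the two tests can give somewhat different p-values, so a result close to 0.05 is better treated as inconclusive than as firm evidence either way.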



                            • #15
                              Code:
                              testparm i.RegionCat
                              ( 1) [Vegan]2.RegionCat = 0
                              ( 2) [Vegan]3.RegionCat = 0

                              chi2( 2) = 5.26
                              Prob > chi2 = 0.0722

                              Following your 0.05 cut point, I suppose this is ''good'',
                              meaning that they are not jointly significant,
                              meaning that the control variable ''Region categories'' does not significantly affect the relationship between price and use of vegan logos,
                              meaning that the origin of the wine is not an explanation for the use of a vegan logo?
                              Is that the right conclusion?

                              I highly appreciate your constant help, I really do!
