What test to choose?

Maarten Vander

Join Date: Dec 2018

Posts: 14
#1

What test to choose?

20 Dec 2018, 09:58

I believe I tried every test on both SPSS and Stata, but I keep on encountering problems.
Maybe you can help me to think from scratch.

The dependent variable is logo (0 or 1, i.e. whether or not there is a logo on the label).
The independent variable is price (either in euros or in 3 ordinal price categories).

There are 5 control variables.
4 are dummy or categorical: Region (44 different), Country (9 different), Store (5 different), Colour (2 different).
1 is the number of hectares, which can be transformed into categories as well.
N=161

Problem:
some Regions occur only once and are therefore perfect predictors (collinearity error).
I tried to categorize the regions, but the results don't seem to be right.

What test should I pick?
And what code should I use?
Bruce Weaver

Join Date: May 2014

Posts: 1133
#2

20 Dec 2018, 10:04

Hello Maarten. Do you really have 44 regions? Or was that a typo? I ask, because if you are including a variable with 44 levels, you are over-fitting your model. (See this nice article by Mike Babyak for more info on over-fitting.) It's also not clear to me whether you have clustering that needs to be taken into account. E.g., are the regions clustered within countries? HTH.
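
A minimal sketch of how the two points above (sparse regions and nesting) could be checked in Stata. The variable names Vegan (the 0/1 logo outcome), Region and Country are assumptions taken from the names used elsewhere in this thread; n_region and sd_vegan are hypothetical helper variables.

```stata
* How many labels carry the logo? The smaller outcome group limits how many
* parameters the model can reasonably support.
tabulate Vegan

* Flag regions that occur only once or show no variation in the outcome
* (the sd of a single observation is missing, so those are flagged as well)
bysort Region: egen n_region = count(Vegan)
bysort Region: egen sd_vegan = sd(Vegan)
tabulate Region if n_region == 1 | sd_vegan == 0 | missing(sd_vegan)

* Check nesting: the assert fails if any region appears in more than one country
bysort Region (Country): assert Country[1] == Country[_N]
```

If the assert passes silently, every region belongs to exactly one country, which is the nested structure asked about above.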

--
Bruce Weaver
Version: Stata/MP 19.5 (Windows)
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#3

20 Dec 2018, 10:04

Maarten:
posting what you typed and what Stata gave you back can help (as usual).
That said, have you considered interactions (say between Store and Price)?
As an aside, please note that categorizing continuous predictors is, in general, a bad idea: http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf

PS: crossed in the cyberspace with Bruce's helpful advice.

Kind regards,
Carlo
(Stata 19.0)
Maarten Vander

Join Date: Dec 2018

Posts: 14
#4

20 Dec 2018, 10:22

Hello Bruce and Carlo,

I do indeed have 44 regions (the data is about wine labels, and the wines come from 44 different regions).

I am not sure what you mean by clustering.
Each region belongs to a certain country, but I have a variable for both.
How can I check this?

Regarding interactions, that is a good idea. You mean by entering # between variables, right?

As a solution to over-fitting, I was thinking about combining the regions into only 4 types of regions, e.g. large unknown regions, or specified and expensive regions.

Do you think probit or regression is suitable, or should I think about different tests?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#5

20 Dec 2018, 11:17

Maarten:
- yes, I mean something along the lines you sketch in your reply (please see -fvvarlist- for further details, especially for the difference between -#- and -##-);
- Bruce is correct in suspecting that clustering may be an issue with your data, as regions that belong to the same country are probably more similar to each other than regions belonging to different countries. By the way, having regions nested within countries can be enough to consider a hierarchical model (see -melogit-);
- gathering Regions together according to a given set of criteria may be a wise approach (although I would not take it for granted that over-fitting will disappear);
- finally, I'm not clear what you mean by test (tests are inferential procedures to detect differences and/or statistical significance): I think that you're seeking advice about a regression model instead. That said, if your dependent variable is categorical (yes/no), there's no room for linear regression (-regress-) and you should consider -logit- or -probit-, which give similar results.
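
The points above can be sketched in Stata roughly as follows. This is only an illustration: the variable names Vegan, Price, Store, Country and Region are assumptions based on the names used elsewhere in this thread, and with N=161 and 44 regions the multilevel model may well fail to converge.

```stata
* -#- enters only the interaction terms (one Price slope per store)
probit Vegan c.Price#i.Store

* -##- is shorthand for the main effects plus the interaction,
* i.e. c.Price i.Store c.Price#i.Store
probit Vegan c.Price##i.Store

* Joint test of the interaction terms after the -##- model
testparm c.Price#i.Store

* Sketch of a hierarchical model with regions nested within countries
* (higher level first); Price and Store enter as fixed effects
melogit Vegan Price i.Store || Country: || Region:
```

Note that -melogit- fits a logit rather than a probit; -meprobit- is the probit counterpart with the same syntax.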

Kind regards,
Carlo
(Stata 19.0)
Maarten Vander

Join Date: Dec 2018

Posts: 14
#6

20 Dec 2018, 11:33

Thank you for that advice.

Very good to know that regression is not possible!

I have combined the 44 regions into 4 categories
When I run it as 4 dummies, the results are as follows.

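       Vegan |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       Price |   .3477105   .0984887     3.53   0.000     .1546762    .5407448
    Region1D |   3.760862   352.7587     0.01   0.991    -687.6335    695.1552
    Region2D |   3.752317   352.7583     0.01   0.992    -687.6413    695.1459
    Region3D |   3.692831   352.7584     0.01   0.992     -687.701    695.0867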

I have also placed them under 1 categorical variable (instead of 4 dummies)
these are the results.
I do not understand the (empty) levels, and what should I do with the (omitted) ones?
Is this output interpretable, or should I make more changes?

       Vegan |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       Price |   .3470497   .0986867     3.52   0.000     .1536273     .540472
             |
     Region1 |
          1  |          0  (empty)
          2  |  -.0086384   .4393754    -0.02   0.984    -.8697984    .8525216
          3  |  -.0565939   .5758633    -0.10   0.922    -1.185265    1.072077
          4  |          0  (omitted)
             |
       Store |
          2  |  -1.124323   .4174382    -2.69   0.007    -1.942487   -.3061589
          3  |  -1.644975   .6311591    -2.61   0.009    -2.882024   -.4079256
          4  |  -.9489177   .5176832    -1.83   0.067    -1.963558    .0657228
          5  |  -1.153552   .4774939    -2.42   0.016    -2.089422   -.2176809
             |
     Country |
          2  |          0  (empty)
          3  |   1.030602   .6420614     1.61   0.108    -.2278153    2.289019
          4  |  -.3744293   .6855527    -0.55   0.585    -1.718088    .9692293
          5  |          0  (empty)
          6  |   1.426111   .7480216     1.91   0.057     -.039984    2.892207
          7  |          0  (empty)
          8  |          0  (empty)
          9  |   .5941595   .6767971     0.88   0.380    -.7323384    1.920657
             |
   1.ColourD |   .0873519   .2673539     0.33   0.744     -.436652    .6113559
       _cons |  -2.418361   .9259913    -2.61   0.009    -4.233271   -.6034517
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

20 Dec 2018, 11:42

Clyde Schechter has discussed this problem at some length in a separate topic started by Maarten.

https://www.statalist.org/forums/for...ummy-in-probit
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#8

20 Dec 2018, 11:47

Maarten:
you did not post your regression code, hence it is difficult to comment on your results helpfully.
If you have used -fvvarlist- notation, one of the levels of each categorical variable is omitted automatically by Stata to protect you against the dummy-variable trap (see: https://en.wikipedia.org/wiki/Dummy_...le_(statistics)).

Kind regards,
Carlo
(Stata 19.0)
Maarten Vander

Join Date: Dec 2018

Posts: 14
#9

20 Dec 2018, 11:59

the code was
probit Vegan Price i.Region1 i.Store i.Country i.ColourD
Does this help?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#10

21 Dec 2018, 00:37

Maarten:
please use CODE delimiters to share what you typed and what Stata gave you back (see the FAQ on this).
That said, by including both regions and countries, you surely experience multicollinearity problems.
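
One way to look into the (empty) and (omitted) entries from #6 is sketched below. It assumes the model and variable names from #9 (with the colour dummy named ColourD, as in the output), and that Stata marks a level as (empty) when no observations for it remain in the estimation sample, for instance after perfectly predicted observations have been dropped.

```stata
* Refit the model from #9; the notes printed above the coefficient table
* report any levels that predict the outcome perfectly and were dropped
probit Vegan Price i.Region1 i.Store i.Country i.ColourD

* Levels with no observations left in e(sample) would be the (empty) ones
tabulate Country Vegan if e(sample)
tabulate Region1 Vegan if e(sample)

* Cross-tabulate the two geographic variables to see how much they overlap
tabulate Region1 Country
```

If a region category is made up entirely of one or two countries, the corresponding dummies carry almost the same information, which is the multicollinearity mentioned above.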

Kind regards,
Carlo
(Stata 19.0)
Maarten Vander

Join Date: Dec 2018

Posts: 14
#11

21 Dec 2018, 06:53

That is important advice, thank you.
I clustered the country and regions.
1 dummy for country (Old vs New world countries, [That is a classification in the wine world, comes roughly down to European vs other wine countries])
And 2 dummies for 3 region categories based on characteristics.

Code:

probit Vegan Price CountryNewOld ColourD i.RegionCat i.Store

I received the following output,
It seems good to me.

What do you think?
Can I include this in my report?

Why is it that when I change the order of the categories, the results differ so much?
E.g. instead of 1 Germany, 2 Bordeaux, 3 Marlborough --> 1 Bordeaux, 2 Germany, 3 Marlborough.
This changes all the results, including the significance of the independent variable.
I understand now that one category is taken out because of the dummy trap, but why do the results vary so strongly?
Which order should I pick?

Thank you so much
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#12

21 Dec 2018, 08:04

Maarten:
actually, you did not -cluster- countries and regions, you simply adjusted for their effects.
Clustering concerns the standard errors, not the point estimates (ie, the coefficients).
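
As a hedged illustration of that distinction, the sketch below refits the model from #11 with cluster-robust standard errors. Clustering on Region is an assumption made here for the example (and Region is assumed to be a numeric identifier); with few clusters such standard errors can be unreliable.

```stata
* Model from #11 with default standard errors
probit Vegan Price CountryNewOld ColourD i.RegionCat i.Store

* Same model with standard errors clustered on Region:
* the coefficients are identical, only the standard errors
* (and hence z, p-values and confidence intervals) change
probit Vegan Price CountryNewOld ColourD i.RegionCat i.Store, vce(cluster Region)
```
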
The fact that results change (and possibly that signs flip) when you change the reference category (i.e., the level of the categorical variable that is automatically omitted by Stata to protect you from the dummy trap) is pretty normal. However, given that most of your coefficients are not statistically significant, the change in their signs is practically immaterial.
The main drivers of your results seem to be -price- and one typology of store (which may end up so due to a pretty different number of observations vs the remaining types of stores): check if this outcome is in line with the literature of your research field and/or another reference standard.
I would also check the joint significance of regions and country via -testparm-.
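
A minimal sketch of what changing the reference category does, assuming the model from #11 and that RegionCat is coded 1, 2, 3. The stored-estimate names base1 and base2 are hypothetical.

```stata
* Reference category = level 1 of RegionCat
probit Vegan Price CountryNewOld ColourD ib1.RegionCat i.Store
estimates store base1
testparm i.RegionCat

* Reference category = level 2 of RegionCat
probit Vegan Price CountryNewOld ColourD ib2.RegionCat i.Store
estimates store base2
testparm i.RegionCat

* Side-by-side comparison: the RegionCat coefficients and _cons change,
* since they are now contrasts against a different level
estimates table base1 base2, b(%9.4f) se stats(N ll)
```

When only the base level changes, the log likelihood, the other coefficients and the joint -testparm- result should be the same in both fits, since it is the same model written with different contrasts.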

Kind regards,
Carlo
(Stata 19.0)
Maarten Vander

Join Date: Dec 2018

Posts: 14
#13

21 Dec 2018, 09:15

Hello Carlo,
Yes, you are right, I 'gathered' them as you suggested in #5:
"gathering Regions together according to a given set of criteria may be a wise approach (although I would not take it for granted that over-fitting will disappear)"
Is the way I did it acceptable, or should I combine them differently?

I checked the store significance and there is no real explanation for it, unfortunately.
I also do not see a resemblance between the 2 stores.

Code:

testparm i.RegionCat CountryNewOld

And I got this outcome on the testparm you advised.

( 1) [Vegan]CountryNewOld = 0
( 2) [Vegan]2.RegionCat = 0
( 3) [Vegan]3.RegionCat = 0

chi2( 3) = 5.90
Prob > chi2 = 0.1163

What is the cut point to know whether there is a joint significance?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#14

21 Dec 2018, 09:43

Maarten:
the usual, arbitrarily chosen cut point is 0.05.
However, the lack of statistical significance may mean two things, mainly:
- there's a difference in regions, but your sample is not large enough to show it;
- there's actually no difference in regions (when adjusted for the remaining predictors). Hence, the question should be: is it good or bad? Is it good that all over the country customers receive the same product/service? Or not? Or else?
That said, I would test the joint significance of the levels of the same categorical variables, that is:

Code:

testparm i.RegionCat
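
For reference, a short sketch of how the reported p-value relates to the test statistic and to the cut point. It assumes the probit model from #11 is the active estimation result and that the variable is named RegionCat, as in the output in #13.

```stata
* Joint Wald test of the RegionCat levels
testparm i.RegionCat

* -testparm- leaves the statistic, its degrees of freedom and the p-value in r()
display "chi2 = " r(chi2) ", df = " r(df) ", p = " r(p)

* The p-value is the upper tail of the chi-squared distribution
display chi2tail(r(df), r(chi2))

* Compare with the conventional 0.05 cut point
display cond(r(p) < 0.05, "jointly significant at 0.05", "not jointly significant at 0.05")
```

The 0.05 threshold is a convention, so a p-value slightly above it is better read as weak evidence than as proof of no difference.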

Kind regards,
Carlo
(Stata 19.0)
Maarten Vander

Join Date: Dec 2018

Posts: 14
#15

21 Dec 2018, 10:08

Code:

testparm i.RegionCat

( 1) [Vegan]2.RegionCat = 0
( 2) [Vegan]3.RegionCat = 0

chi2( 2) = 5.26
Prob > chi2 = 0.0722

Following your 0.05 cut point, I suppose this is "good",
meaning that they are not jointly significant,
meaning that the control variable "Region categories" does not significantly affect the relationship between price and the use of vegan logos,
meaning that the origin of the wine is not an explanation for the use of a vegan logo?
Is that the right conclusion?

I highly appreciate your constant help, I really do!