Using count data regression model with non-count data

Johannes Zeiher

Join Date: Aug 2014

Posts: 18
#1

20 Apr 2015, 06:30

Dear Statalist Members,

I am trying to analyse which variables predict the degree of nicotine dependence of smokers in a population-based survey sample (n=1172), using the Fagerström Test of Nicotine Dependence (FTND) as the dependent variable and multiple variables (e.g. education, income, age when the first cigarette was consumed, alcohol intake) as independent variables.
The FTND consists of six questions, and the answers are summed into a score ranging from 0 to 10, allowing only integers. The distribution of the FTND score in the sample is shown below.

As I'm searching for a regression model to analyse these data, I was wondering whether one can use a regression model for count data, such as negative binomial regression.
Due to the high number of zeros, linear regression does not seem appropriate.
I used the -countfit- command by Long & Freese (Stata 13.1 SE) and it showed good results for the negative binomial regression model.

My question is: Is it possible to use regression models for count data with non-count data if the distribution resembles that of count data? I have been searching for an answer for quite a while and found some examples of index variables being analysed with count data regression models, but no clear statement that the condition of having "real" count data can be violated in some cases.

I hope this was clear. If anyone could help, it would be very much appreciated.
Regards,
Johannes

Attached Files
Tags: count data, nbreg, survey data
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

20 Apr 2015, 07:48

What you have is, I believe, an ordinal outcome. In Stata, -search ordinal- reveals a wealth of commands for analyzing ordinal outcomes. The -ologit- command might be a good place to start.

Other readers may have better, more well-informed advice for you.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#3

20 Apr 2015, 09:06

Johannes:
As an aside, I would also pay attention to the non-independent structure of your observations, as the test items are measured on the same individuals. For continuous dependent variables, -help manova- would be helpful. In your case, I suppose that you should at least cluster the standard errors of your regression coefficients on patientid (or the like).

Kind regards,
Carlo
(Stata 19.0)
Johannes Zeiher

Join Date: Aug 2014

Posts: 18
#4

21 Apr 2015, 01:53

Hey William & Carlo, thanks for your prompt and helpful advice!
daniel klein

Join Date: Mar 2014

Posts: 3850
#5

21 Apr 2015, 02:40

In the social sciences it would not be uncommon to treat such an outcome as "quasi-interval" data and fit a linear model. Note that, contrary to what some believe, OLS (which we use as an estimator for linear models) does not make any assumptions about the distribution of the dependent variable in the sample. It relies on some assumptions about the errors, which turn out to be violated more often in the case of skewed dependent variables, so I would check the residuals to get a feeling for how well the model fits.

In my experience the ordered logit model does not work well with more than 4 to 5 categories at most. The parallel odds (proportional odds) assumption is almost always violated, and a generalization of the model often leads to results that are inconvenient to interpret. Whether you want to try this depends on how much "truth" you are willing to sacrifice for a simple model (as a model's purpose always is to simplify reality).

I do not really see why you would want clustered standard errors in this scenario. Using an index (i.e. a combination of more than one item) as a variable in a regression framework does not, in my view, require corrected standard errors, as the units of observation are still independent (if this was true before creating the index).

Fitting "count" models is also fine with such data. Depending on the audience, you might want to state that you are estimating generalized linear models (GLM) instead of using the names negative binomial or Poisson models, as these names are closely linked to count data, which is the very reason for your question.

You might find Austin Nichols' talk very interesting. I have learned a lot from it.

Best
Daniel

Last edited by daniel klein; 21 Apr 2015, 02:42.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2167
#6

21 Apr 2015, 14:12

In my view this is a perfect application for binomial regression. I don't view this as ordinal data. Your dependent variable is the number of "successes" (not literally, of course) out of 10 trials -- and so can be any integer from 0 to 10. A sensible model for the conditional mean is E(y|x) = G(x*b)*10, where G(.) is typically a cumulative distribution function, so it is bounded between zero and one. Probably a logistic function is fine for G(.). One of the nice things about the binomial quasi-MLE is that it is fully robust to other distributional misspecification. So, if the answers to the questions are not independent -- as they almost certainly are not -- it does not cause inconsistency in the binomial QMLE. The binomial QMLE also allows for overdispersion that can come from individual heterogeneity. I discuss these issues in J.M. Wooldridge (2010), Econometric Analysis of Cross Section and Panel Data, 2e, MIT Press. See Section 18.3.2.

The parameters are easily estimated using the Stata GLM command. Robust standard errors and inference should be used.

Code:

glm y x1 x2 ... xK, fam(bin 10) link(logit) robust

You may get answers similar to Poisson regression or Negative Binomial, but neither of these enforces the upper limit of 10 on the conditional mean.
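
A minimal sketch of that comparison, using simulated data rather than the FTND sample. The variable names (x, y, mu_bin, mu_pois), the seed, and the coefficients -1 and 0.8 are made up for illustration. It fits the binomial QMLE and a Poisson model to a 0-10 score and stores the fitted conditional means, so one can check that the binomial means stay within [0, 10], whereas the exponential mean of the Poisson model has no such bound.

```stata
* Simulated illustration only (hypothetical data, not the FTND sample)
clear
set seed 12345
set obs 1000
generate x = rnormal()

* Score from 0 to 10 whose conditional mean is 10*invlogit(-1 + 0.8*x)
generate y = rbinomial(10, invlogit(-1 + 0.8*x))

* Binomial quasi-MLE with 10 "trials" and robust standard errors
glm y x, family(binomial 10) link(logit) vce(robust)
predict mu_bin, mu

* Poisson regression for comparison; fitted mean is exp(xb)
poisson y x, vce(robust)
predict mu_pois, n

* Compare the range of the fitted means from the two models
summarize mu_bin mu_pois
```

By construction the binomial fitted means cannot leave the interval from 0 to 10; whether the Poisson fitted means exceed 10 in a given sample depends on the data.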

The margins command works. You will essentially get average marginal effects that are those of a binary logit model multiplied by 10.

Code:

margins, dydx(*)
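
To see why the average marginal effects are those of a binary logit multiplied by 10, here is a short derivation for a continuous regressor x_j, assuming the logistic choice for G(.) suggested above. The symbol \Lambda for the logistic function and the hats for estimates are notation introduced here for illustration.

```latex
\begin{aligned}
E(y \mid x) &= 10\,\Lambda(xb), \qquad \Lambda(z) = \frac{e^{z}}{1+e^{z}},\\[4pt]
\frac{\partial E(y \mid x)}{\partial x_j} &= 10\,\Lambda(xb)\,\bigl[1-\Lambda(xb)\bigr]\,b_j,\\[4pt]
\widehat{\mathrm{AME}}_j &= \frac{1}{n}\sum_{i=1}^{n} 10\,\Lambda(x_i\hat b)\,\bigl[1-\Lambda(x_i\hat b)\bigr]\,\hat b_j .
\end{aligned}
```

Without the factor 10, the last line is the usual average marginal effect from a binary logit with the same index x*b. For a factor variable, -margins- reports a discrete change in 10*\Lambda(.) instead of a derivative.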
Nick Cox

Join Date: Mar 2014

Posts: 35698
#7

22 Apr 2015, 08:45

This is an interesting thread, not least because of the quite different advice. I'd note that treating the scale as a counted fraction (in the useful terminology of J.W. Tukey) is being very trusting of the measurement scale, namely in trusting that one point given anywhere for any question is exactly equivalent to one point given anywhere else. But even regarding the scale as ordinal is also trusting. In practice, people who analyse data like this have little choice or scope to question the measurement scale.

If this were my problem I would go with Jeff's advice and note further that something like the binomial is qualitatively right as the variance must approach 0 as the mean approaches 0 or 10. Jeff probably spells this out in his book; at present my copy is in another office.
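
The variance statement can be written out as follows, assuming the nominal binomial variance with 10 trials and success probability p(x). This is only the variance implied by the binomial family; as noted above, the quasi-MLE does not require it to hold exactly. The symbols \mu(x) and p(x) are notation introduced here.

```latex
\mu(x) = E(y \mid x) = 10\,p(x), \qquad
\mathrm{Var}(y \mid x) = 10\,p(x)\,\bigl[1-p(x)\bigr]
                       = \mu(x)\left[1-\frac{\mu(x)}{10}\right]
```

The right-hand side goes to 0 as \mu(x) approaches 0 or 10 and is largest at \mu(x) = 5. The Poisson variance, \mu(x), does not shrink near the upper limit.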
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

22 Apr 2015, 12:20

I shall defer to those with more expertise and experience than I possess. I only mention that the construction of the FTND as the sum of two 0/1/2/3 measures and four 0/1 measures in the cited paper is what led me to think of it as more of an ordered score rather than a binomial count. Jeff's discussion seems to ameliorate that concern.
Johannes Zeiher

Join Date: Aug 2014

Posts: 18
#9

28 Apr 2015, 06:53

Dear all,
thank you so much for your helpful comments. I performed sensitivity analyses with different models, and the binomial quasi-MLE worked out very well, even though the results are close to those of the Poisson and negative binomial regression models (as predicted by Jeff Wooldridge).

Johannes Zeiher

Join Date: Aug 2014
Posts: 18

#10

19 May 2015, 06:28

Sorry for reactivating this thread.
As I continue with my research (as described above), I'm having some trouble interpreting the results of the -margins- command.

Code:

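clear
sysuse auto
* Split price into four quantile-based groups, labelled by their lower bounds
egen price4cat = cut(price), group(4) label
glm rep78 weight length i.price4cat i.foreign, fam(bin 5) link(logit)
* Average marginal effects for all regressors (produces the output below)
margins, dydx(*)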

Code:

Average marginal effects                          Number of obs   =         69
Model VCE    : OIM

Expression   : Predicted mean rep78, predict()
dy/dx w.r.t. : weight length 1.price4cat 2.price4cat 3.price4cat 1.foreign

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0002863    .000538    -0.53   0.595    -.0013408    .0007682
      length |   .0031839    .017416     0.18   0.855    -.0309508    .0373186
             |
   price4cat |
      4195-  |   .0190345   .3538854     0.05   0.957    -.6745681    .7126371
    5006.5-  |   .3581804    .392638     0.91   0.362     -.411376    1.127737
      6342-  |   .3192328   .4632299     0.69   0.491    -.5886812    1.227147
             |
     foreign |
    Foreign  |   1.043873   .3795595     2.75   0.006     .2999504    1.787796
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

As I'm not used to working with -margins-, could anyone give me a hint on how to interpret the effect, for example, of "Foreign" against "Domestic"?
Trying -help margins-, I did not find a solution for this special case (binomial quasi-MLE with multiple "trials").
Thanks again!
Johannes


Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#11

19 May 2015, 09:46

So the -margins- output for Foreign means that compared to Domestic vehicles, the expected value of rep78 is 1.04... greater. After -glm-, -margins-, by default, shows levels of and marginal effects for the actual dependent variable in the model.
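
A minimal sketch of where that number comes from, reusing the auto-data example from post #10 purely to illustrate the mechanics. The matrix and scalar names (m, d, diff, dy) are made up here. It computes the average predicted rep78 with every car treated as Domestic and then as Foreign, and compares the difference with the discrete change reported by -margins, dydx(foreign)-.

```stata
sysuse auto, clear
egen price4cat = cut(price), group(4) label
glm rep78 weight length i.price4cat i.foreign, family(binomial 5) link(logit)

* Average predicted rep78 with every car set to Domestic, then to Foreign
margins foreign
matrix m = r(b)
scalar diff = m[1,2] - m[1,1]

* Discrete change from the base level, as reported for a factor variable
margins, dydx(foreign)
matrix d = r(b)
scalar dy = d[1, colsof(d)]

display "difference in predictive margins = " diff
display "dydx for Foreign                 = " dy
```

The two displayed numbers should agree, because the discrete change for a factor level is the difference between the two predictive margins, expressed in units of the dependent variable.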
Killian Mullan

Join Date: Aug 2014

Posts: 7
#12

27 May 2015, 05:27

I'll just throw this into the mix:

http://blog.stata.com/2011/08/22/use...tell-a-friend/
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2167
#13

27 May 2015, 07:38

Johannes: Unlike your response variable, rep78 is properly viewed as an ordered response. It is not a number of "successes" out of 5 trials. I wouldn't use the binomial GLM in this case. An ordered logit or probit is appropriate.

Your variable is the number of yes responses out of 10 questions (trials). So the margins command will give the effect on the expected number of "yes" answers of a one-unit increase in the explanatory variable. It's the same interpretation as if you do a Poisson and use margins.
Johannes Zeiher

Join Date: Aug 2014

Posts: 18
#14

06 Jul 2015, 04:33

Checking my Statalist profile, I just realised that I didn't reply to your helpful posts. Sorry for that, and thank you all for your support!
Johannes