  • Model for dependent variable with integer values of certain magnitude

    Dear Statalist members,

    I would like to run a regression in which the dependent variable is not continuous and takes only certain values (specifically, integer values from 0 to 40).
    Because of the nature of the variable I was advised not to treat such a case with the usual estimation methods (i.e., OLS). However, I am not sure which model can address this issue (e.g., rank-ordered logistic regression or something similar?).
    Could you please advise me on that?

    Thank you in advance

  • #2
    Kleopatra:
    you may want to consider -poisson-.
    Kind regards,
    Carlo
    (StataNow 18.5)
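
    For concreteness, a minimal sketch of the -poisson- suggestion (the outcome y and covariates x1 and x2 are placeholders, not from the original post):

    Code:
    poisson y x1 x2, vce(robust)
    * or, if incidence-rate ratios are preferred
    poisson y x1 x2, vce(robust) irr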

    • #3
      You don't really tell us enough; you might want to take a look at the help for -truncreg- as well as the help for -poisson- to see which is closer to your situation.
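
      If it helps to see the syntax, a minimal sketch of -truncreg- with placeholder names (y, x1, x2); it would only be appropriate if part of the range genuinely cannot appear in the sample:

      Code:
      * truncated regression, e.g. if observations with y <= 0 were never sampled
      truncreg y x1 x2, ll(0) vce(robust)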

      • #4
        I'd consider dividing by 40 and using -fracreg-.
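
        A minimal sketch of that route (Stata 14 or later; y and the covariates are placeholders):

        Code:
        * rescale the 0-40 count to a 0-1 fraction
        generate double y_frac = y/40
        * fractional logit with robust standard errors
        fracreg logit y_frac x1 x2, vce(robust)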

        • #5
          To provide some additional info as requested, the dependent variable has a minimum integer value of "0" and a maximum value of "40".
          In addition, if I am not mistaken, the Poisson model assumes that the mean equals the variance, but this does not hold in my case.

          Dr. Cox, unfortunately I believe that -fracreg- is available only in Stata 14 (I use Stata 13).

          • #6
            Indeed, which is why the FAQ Advice asks that you tell us that:

            11. What should I say about the version of Stata I use?

            The current version of Stata is 14.2. Please specify if you are using an earlier version; otherwise, the answer to your question may refer to commands or features unavailable to you. Moreover, as bug fixes and new features are issued frequently by StataCorp, make sure that you update your Stata before posting a query, as your problem may already have been solved.
            But not to worry. Advice such as that within http://www.stata-journal.com/sjpdf.h...iclenum=st0147 still applies to you. You can fire up -glm-.
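
            A minimal sketch of that -glm- route, which needs nothing beyond Stata 13 (variable names are placeholders):

            Code:
            * rescale the 0-40 count to a 0-1 fraction
            generate double y_frac = y/40
            * fractional logit fitted by quasi-maximum likelihood
            glm y_frac x1 x2, family(binomial) link(logit) vce(robust)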

            • #7
              I suggest using the binomial distribution with an upper bound of 40. I discuss this in my book "Econometric Analysis of Cross Section and Panel Data." But you should use robust inference.

              Code:
              glm y x1 ... xk, fam(bin 40) link(logit) robust
              JW
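
              If useful, a small follow-up sketch (same placeholder names as above) for getting results back on the original 0-40 scale after that fit:

              Code:
              * fitted means, which lie on the 0-40 scale of the response
              predict yhat, mu
              * average marginal effects on that same scale
              margins, dydx(*)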

              • #8
                Nick's idea is good, too, but doesn't exploit the integer nature of the response.

                • #9
                  Jeff: Naturally I like the principle of respecting that the data are integers, but what difference will it make? An example is not an argument, but I fired up the auto data and pretended that rep78 is a count. Stata doesn't know otherwise.

                  I suppose, in some circles, a report that you scaled the integers to fractions might cause puzzlement or discomfort, as showing a lack of respect for the data or taking an unjustified extra step, but that's a matter of public relations.

                  (In this example, it's not a good model either way; my suggestion is that it's the same model.)

                  Code:
                  .   sysuse auto, clear
                  (1978 Automobile Data)
                  
                  . glm rep78 mpg weight, link(logit) f(binomial 5) vce(robust)
                  
                  Iteration 0:   log pseudolikelihood = -90.880341  
                  Iteration 1:   log pseudolikelihood = -90.622171  
                  Iteration 2:   log pseudolikelihood = -90.622036  
                  Iteration 3:   log pseudolikelihood = -90.622036  
                  
                  Generalized linear models                         No. of obs      =         69
                  Optimization     : ML                             Residual df     =         66
                                                                    Scale parameter =          1
                  Deviance         =   64.7931372                   (1/df) Deviance =   .9817142
                  Pearson          =  54.26662699                   (1/df) Pearson  =   .8222216
                  
                  Variance function: V(u) = u*(1-u/5)               [Binomial]
                  Link function    : g(u) = ln(u/(5-u))             [Logit]
                  
                                                                    AIC             =   2.713682
                  Log pseudolikelihood =  -90.6220359               BIC             =  -214.6579
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                         rep78 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                           mpg |   .0438609   .0371556     1.18   0.238    -.0289627    .1166845
                        weight |  -.0002177    .000224    -0.97   0.331    -.0006567    .0002213
                         _cons |   .5158199   1.427755     0.36   0.718    -2.282528    3.314168
                  ------------------------------------------------------------------------------
                  
                  . gen rep78_2 = rep78/5
                  (5 missing values generated)
                  
                  . fracreg logit rep78_2 mpg weight, vce(robust)
                  
                  Iteration 0:   log pseudolikelihood = -42.816676  
                  Iteration 1:   log pseudolikelihood =  -42.06548  
                  Iteration 2:   log pseudolikelihood = -42.061805  
                  Iteration 3:   log pseudolikelihood = -42.061805  
                  
                  Fractional logistic regression                  Number of obs     =         69
                                                                  Wald chi2(2)      =      16.96
                                                                  Prob > chi2       =     0.0002
                  Log pseudolikelihood = -42.061805               Pseudo R2         =     0.0262
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                       rep78_2 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                           mpg |   .0438609   .0371556     1.18   0.238    -.0289627    .1166845
                        weight |  -.0002177    .000224    -0.97   0.331    -.0006567    .0002213
                         _cons |     .51582   1.427755     0.36   0.718    -2.282528    3.314168
                  ------------------------------------------------------------------------------

                  • #10
                    Good point, Nick! It's the same when the upper bound is the same for all observations. I once knew that ...

                    Cheers,
                    Jeff

                    • #11
                      Jeff Wooldridge Thanks in turn for tactfully underlining the difference between constant and variable upper bounds.

                      This illuminates a current project with a student: her dataset includes a response which is the # of days in a month with a particular hydrological condition, and the upper bound is thus 28 to 31, depending on the month in question. While 14/28 (for example) is the same sample proportion as 15/30, its probability is not the same under any particular binomial model, and standard errors will vary too. I must check how much difference it makes to ignore month length and to take it into consideration: at a guess, not much.

                      But I also guess that with, say,

                      # people in a family with university degrees

                      variations in family size would be more important.
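
                      For what it's worth, a sketch of how a varying upper bound can be supplied directly to -glm- (all variable names here are hypothetical):

                      Code:
                      * wetdays = days in the month with the condition; ndays = length of that month (28-31)
                      glm wetdays x1 x2, family(binomial ndays) link(logit) vce(robust)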

                      • #12
                        I would like to thank you all for your prompt and always to-the-point responses.
