Can I fit a Poisson model for amount data?

Chen Samulsion

Join Date: Jan 2018

Posts: 865
#1


30 Mar 2025, 21:59

Dear Stata users,

I have a dataset that records how many types of social insurance each respondent has. The data come from a questionnaire that listed five different types of social insurance and asked respondents whether or not they have each one. For example, Q1: Do you have basic pension insurance? The answer is Yes or No (the same applies below). Q2: Do you have basic medical insurance? Q3: Do you have work injury insurance? Q4: Do you have maternity insurance? Q5: Do you have unemployment insurance? So each respondent's number of social insurance types ranges from zero to five. Some people have only basic pension insurance, some have pension and medical insurance, some have all five, and so on for the other combinations.
My question is: can I use a Poisson model to fit this count variable? Can I treat holding each type of insurance as events that occur independently, so that the count follows a Poisson distribution?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17671
#2

31 Mar 2025, 00:58

Chen:
I would use the -group()- function of -egen- to classify the different combinations of social insurance.
Then, if -social insurance- is the dependent variable, I would go with -mlogit-.

Kind regards,
Carlo
(StataNow 18.5)
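A minimal sketch of this suggestion, assuming the five yes/no answers are stored as 0/1 variables named insurance1-insurance5 (following the naming used later in the thread), that they are adjacent in the dataset so the varlist range works, and that age, gender, and education are the covariates. The variable name combo is hypothetical.

```stata
* One category per observed combination of the five 0/1 indicators
egen combo = group(insurance1-insurance5), label

* Inspect how many combinations actually occur and how often
tabulate combo, sort

* Multinomial logit with the combination as the dependent variable
mlogit combo age gender education
```

Only combinations that appear in the data get a category, so the number of outcomes can be smaller than the 32 that are theoretically possible.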
Chen Samulsion

Join Date: Jan 2018

Posts: 865
#3

31 Mar 2025, 02:11

Dear Carlo Lazzaro, thank you very much! I do suspect that a Poisson model does not fit my case. However, if I use an mlogit model, how can I handle so many combinations of these different types of insurance? Five types of insurance produce 2^5 = 32 combinations. And if the number of types grows to 19, the number of possible combinations grows to 2^19 = 524,288.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17671
#4

31 Mar 2025, 02:18

Chen:
you raised the real issue: too many combinations!
The usual recipe is to group the low-frequency combinations together, or to keep only the high-frequency combinations as separate categories and group the rest in an -Other- category.
I would take a look at the literature in your research field about the acceptability of this approach.

Kind regards,
Carlo
(StataNow 18.5)
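One way to sketch this recipe, under the same assumptions about variable names (insurance1-insurance5, age, gender, education). The names combo, combo_n, and combo_grouped are hypothetical, and the cutoff of 100 observations is arbitrary and only for illustration.

```stata
* One category per observed combination of the five 0/1 indicators
egen combo = group(insurance1-insurance5)

* Frequency of each combination
bysort combo: generate combo_n = _N

* Keep combinations with at least 100 observations; pool the rest into 0
generate combo_grouped = cond(combo_n >= 100, combo, 0) if !missing(combo)
label define combo_grouped 0 "Other"
label values combo_grouped combo_grouped

tabulate combo_grouped
mlogit combo_grouped age gender education
```

Where to draw the cutoff is a judgment call, which is why checking what the field's literature accepts matters.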
Chen Samulsion

Join Date: Jan 2018

Posts: 865
#5

31 Mar 2025, 02:58

Dear Carlo Lazzaro, thank you so much, professor. I once thought of first using latent class analysis to discover patterns of social insurance combinations, and then fitting a multinomial logit model to the latent classes. However, I doubt that either the LCA-plus-mlogit approach or the Poisson model has an adequate theoretical basis.
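For concreteness, the two-step idea described here could be sketched as follows. It assumes 0/1 indicators named insurance1-insurance5 and covariates age, gender, and education. The choice of three classes is purely illustrative, and cpost*, cpost_max, and lclass are hypothetical names. Assigning each respondent to the most likely class ignores classification uncertainty, so this is a rough sketch rather than a recommended procedure.

```stata
* Latent class model with three classes for five binary indicators
gsem (insurance1 insurance2 insurance3 insurance4 insurance5 <- ), ///
    logit lclass(C 3)

* Estimated class shares and item probabilities within each class
estat lcprob
estat lcmean

* Posterior class probabilities, then modal class assignment
predict cpost*, classposteriorpr
egen double cpost_max = rowmax(cpost1 cpost2 cpost3)
generate byte lclass = .
forvalues k = 1/3 {
    replace lclass = `k' if cpost`k' == cpost_max ///
        & missing(lclass) & !missing(cpost_max)
}

* Second step: multinomial logit on the assigned class
mlogit lclass age gender education
```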
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17671
#6

31 Mar 2025, 03:03

Chen:
1) Carlo is enough. Thanks;
2) Skim through the literature of your research field. What did others do when facing the very same research question?

Kind regards,
Carlo
(StataNow 18.5)
Chen Samulsion

Join Date: Jan 2018

Posts: 865
#7

31 Mar 2025, 03:32

I'm not quite sure about the literature, since I think it is scarce. There is some literature applying Poisson models to insurance claim data; however, that is a different case from mine. The number and kinds of social insurance, or more generally social welfare, that people have are determined mainly by their occupation and the organization they work in. In some countries, people working in state-owned enterprises have diverse social insurance and a lot of social welfare, while the self-employed have insufficient insurance and minimal social welfare.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2120
#8

31 Mar 2025, 06:28

If you simply want to model the count of social insurance types then you can use binomial regression with glm and an upper bound of 5. With robust standard errors you don’t have to assume independence of the selected options — which is too strong. Because all combos are possible (I think) you can model each as logit or probit. GEE could be used, or estimate each separately.
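A minimal sketch of the two options mentioned here, assuming the five answers are stored as 0/1 variables insurance1-insurance5 and the covariates are age, gender, and education. The name ins_count is hypothetical. Note that rowtotal() treats missing answers as zero, which may or may not be appropriate for a given survey.

```stata
* Count of insurance types held (0 to 5)
egen ins_count = rowtotal(insurance1-insurance5)

* Binomial regression with a known upper bound of 5; robust standard
* errors so inference does not rely on the binomial variance being correct
glm ins_count age gender education, family(binomial 5) link(logit) vce(robust)

* Alternatively, model each insurance type separately as a logit
foreach v of varlist insurance1-insurance5 {
    logit `v' age gender education, vce(robust)
}
```

The first model answers "how many types", while the separate logits answer "which types", so the choice depends on the research question.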

Chen Samulsion

Join Date: Jan 2018
Posts: 865
#9

31 Mar 2025, 07:01

Dear Professor Wooldridge, thanks so much for your reply. Do you mean using glm with family(binomial)? And what does an upper bound of 5 mean? Thank you.

Code:

codebook insurance4, compact

Variable     Obs Unique      Mean  Min  Max  Label
-----------------------------------------------------------------------------------------------------------------
insurance4  3673      2  .9234958    0    1  Medical insurance    
-----------------------------------------------------------------------------------------------------------------

egen insurance=anycount(insurance1-insurance20), values(1)
label variable insurance "how many social insurance/company benefits do you have?"
global covarlist age gender education

summarize insurance, detail

      how many social insurance/company benefits do you
                            have?
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs               5,979
25%            1              0       Sum of Wgt.       5,979

50%            7                      Mean           6.997658
                        Largest       Std. Dev.      5.903285
75%           12             19
90%           15             19       Variance       34.84877
95%           18             19       Skewness       .3445001
99%           19             19       Kurtosis       1.933804

glm insurance $covarlist, family(binomial) //??
insurance > 1 in some cases
r(499);


Jeff Wooldridge

Join Date: Apr 2014

Posts: 2120
#10

31 Mar 2025, 20:27

The count is bounded below by zero and above by five. I believe the syntax is family(bin 5).
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2120
#11

31 Mar 2025, 20:29

Wait, now it looks like you have 20 different possibilities? You should add them up and then 20 replaces 5.

Chen Samulsion

Join Date: Jan 2018
Posts: 865

#12

31 Mar 2025, 21:18

Thank you very much, Professor Jeff Wooldridge! Would you please point me to a textbook, paper, or presentation on this use of the model? Does it belong to binomial response models? If so, should I use the binomial-logit family or the general binomial family? I checked George H. Dunteman's An Introduction to Generalized Linear Models, but found nothing relevant.

Code:

. glm insurance $covarlist, family(binomial 20)

Iteration 0:   log likelihood = -31624.615  
Iteration 1:   log likelihood = -30571.207  
Iteration 2:   log likelihood = -30565.906  
Iteration 3:   log likelihood = -30565.905  

Generalized linear models                         Number of obs   =      5,977
Optimization     : ML                             Residual df     =      5,973
                                                  Scale parameter =          1
Deviance         =  47118.25865                   (1/df) Deviance =   7.888542
Pearson          =  40567.03046                   (1/df) Pearson  =   6.791735

Variance function: V(u) = u*(1-u/20)              [Binomial]
Link function    : g(u) = ln(u/(20-u))            [Logit]

                                                  AIC             =   10.22918
Log likelihood   = -30565.90505                   BIC             =  -4821.002


Jeff Wooldridge

Join Date: Apr 2014

Posts: 2120
#13

01 Apr 2025, 07:44

You should add the vce(robust) option to allow the binomial distribution to be incorrect. See Section 18.3 of my 2010 MIT Press book.
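Applied to the command posted in #12, this advice would look roughly like the sketch below. The (1/df) Pearson value of about 6.8 in that output is well above 1, which suggests the binomial variance is too small for these data and is consistent with using robust standard errors. The margins step is an optional addition, not part of the advice itself.

```stata
global covarlist age gender education

* Same model as in #12, now with standard errors that remain valid
* even if the binomial variance assumption fails
glm insurance $covarlist, family(binomial 20) link(logit) vce(robust)

* Average partial effects on the predicted mean (glm's default prediction)
margins, dydx(*)
```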
John Mullahy

Join Date: Dec 2016

Posts: 742
#14

01 Apr 2025, 07:53

In addition to what Carlo and Jeff have written you might also take a look at this paper https://pubmed.ncbi.nlm.nih.gov/38598916/
Chen Samulsion

Join Date: Jan 2018

Posts: 865
#15

01 Apr 2025, 09:16

Thank you, Jeff Wooldridge and John Mullahy. In Econometric Analysis of Cross Section and Panel Data you wrote:

Sometimes we wish to analyze count data conditional on a known upper bound. For example, Thomas, Strauss, and Henriques (1990) study child mortality within families conditional on number of children ever born. Another example takes the dependent variable, y_i, to be the number of adult children in family i who are high school graduates; the known upper bound, n_i, is the number of children in family i. By conditioning on n_i we are, presumably, treating it as exogenous. A natural starting point is to assume that y_i given (n_i, x_i) has a binomial distribution, denoted Binomial [n_i, p(x_i, β)], where p(x_i, β) is a function bounded between zero and one. In this setup, usually, y_i is viewed as the sum of n_i independent Bernoulli (zero-one) random variables...

I think I understand it, although not very deeply. And Professor Mullahy, sorry, I don't have access to your 2024 paper. However, I read its abstract, which is very inspiring.
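Written out, the binomial setup in the quoted passage implies the following mean and variance. The logit form for p is an assumption here (one common choice, and the link used in the glm output in #12); the passage itself only requires p to lie between zero and one.

```latex
\[
y_i \mid n_i, x_i \sim \mathrm{Binomial}\bigl[n_i,\; p(x_i, \beta)\bigr],
\qquad
p(x_i, \beta) = \frac{\exp(x_i \beta)}{1 + \exp(x_i \beta)}
\]
\[
\mathrm{E}(y_i \mid n_i, x_i) = n_i\, p(x_i, \beta),
\qquad
\mathrm{Var}(y_i \mid n_i, x_i) = n_i\, p(x_i, \beta)\bigl[1 - p(x_i, \beta)\bigr]
\]
```

With n_i = 20 for everyone and u = 20 p, the variance becomes u(1 - u/20), which matches the variance function reported by glm. The variance formula depends on the independent Bernoulli assumption, while the mean can be correctly specified without it, which is the reason for the vce(robust) advice.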
