  • Can I fit poisson model for amount data

    Dear Stata users,

    I have a dataset that records how many types of social insurance each respondent has. The data come from a questionnaire that listed five types of social insurance and asked respondents whether or not they have each one. For example, Q1: Do you have basic pension insurance? The answer is Yes or No (the same for the questions below). Q2: Do you have basic medical insurance? Q3: Do you have work injury insurance? Q4: Do you have maternity insurance? Q5: Do you have unemployment insurance? So for each respondent, the number of social insurance types held ranges from one to five. Some people have only basic pension insurance, some have pension and medical insurance, some have all five, and so on for the other combinations.
    My question is: can I use a Poisson model to fit this count variable? Can I treat having each type of insurance as events that occur independently and thus follow a Poisson distribution?

  • #2
    Chen:
    I would use the -group()- function of -egen- to classify the different combinations of social insurance.
    Then, if -social insurance- is the dependent variable, I would go with -mlogit-.
    Kind regards,
    Carlo
    (StataNow 18.5)
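
A minimal sketch of the suggestion in #2, not code posted by Carlo. It assumes five 0/1 indicators named insurance1-insurance5 and covariates age, gender and education (names borrowed from later posts in the thread); the variable name combo is hypothetical.

```stata
* Sketch only: one category per observed yes/no pattern.
* insurance1-insurance5 are assumed to be 0/1 indicators.
egen combo = group(insurance1-insurance5), label

* See how many combinations actually occur and how often.
tabulate combo, sort

* Multinomial logit with the combination as the outcome.
mlogit combo age gender education
```

With five indicators there can be up to 32 outcome categories, so the frequency table is worth checking before fitting the model.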



    • #3
      Dear Carlo Lazzaro, thank you very much! I do suspect that the Poisson model does not fit my case. However, if I use the mlogit model, how can I handle so many combinations of these different types of insurance? Five kinds of insurance produce 32 combinations (2^5). And if the number of kinds grows to 19, the number of combinations grows to hundreds of thousands (2^19 = 524,288).
      [Attached image: Graph.png]



      • #4
        Chen:
        You raised the real issue: too many combinations!
        The usual recipe is either to group together the low-frequency combinations, or to keep only the high-frequency combinations as separate categories and group the rest into an -Other- category.
        I would take a look at the literature in your research field about the acceptability of this approach.
        Kind regards,
        Carlo
        (StataNow 18.5)
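
One way the lumping described in #4 might look, as a sketch only. It assumes a variable combo created beforehand with -egen, group()-; the cutoff of 100 respondents and the names n_combo, combo_lumped and lumped are hypothetical choices, not recommendations from the thread.

```stata
* Sketch only: count respondents in each combination.
bysort combo: generate n_combo = _N

* Keep frequent combinations as they are; send the rest to 0 ("Other").
* The cutoff of 100 is arbitrary and should be justified from the literature.
generate combo_lumped = cond(n_combo >= 100, combo, 0) if !missing(combo)
label define lumped 0 "Other"
label values combo_lumped lumped

tabulate combo_lumped
mlogit combo_lumped age gender education
```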



        • #5
          Dear Carlo Lazzaro, thank you so much, professor. I once thought of first using latent class analysis to discover patterns of social insurance combinations, and then fitting a multinomial logit model to the latent classes. However, I doubt that either the LCA plus mlogit approach or the Poisson model has an adequate theoretical basis.



          • #6
            Chen:
            1) Carlo is enough. Thanks;
            2) Skim through the literature of your research field. What did others do when facing the very same research question?
            Kind regards,
            Carlo
            (StataNow 18.5)



            • #7
              I'm not quite sure about the literature, since I think it is rare. There is some literature applying the Poisson model to insurance claim data; however, that is a different case from mine. The number or kinds of social insurance, or more generally of social welfare, that people have are determined mainly by their occupation and the organization they work in. In some countries, people working in state-owned enterprises have diverse social insurance and a lot of social welfare, while self-employed people have insufficient insurance and a minimum of social welfare.



              • #8
                If you simply want to model the count of social insurance types then you can use binomial regression with glm and an upper bound of 5. With robust standard errors you don’t have to assume independence of the selected options — which is too strong. Because all combos are possible (I think) you can model each as logit or probit. GEE could be used, or estimate each separately.
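
A hedged sketch of two of the options mentioned in #8, assuming five 0/1 indicators named insurance1-insurance5 and the covariates age, gender and education that appear later in the thread. The name n_ins is hypothetical, and this is an illustration rather than code from the original post.

```stata
* Sketch only: count of "yes" answers across the five indicators.
* Note that -anycount()- treats missing answers as not equal to 1.
egen n_ins = anycount(insurance1-insurance5), values(1)

* Binomial GLM with a known upper bound of 5 and robust standard errors,
* so the indicators need not be independent within a respondent.
glm n_ins age gender education, family(binomial 5) link(logit) vce(robust)

* Alternatively, estimate one logit per insurance type.
foreach v of varlist insurance1-insurance5 {
    logit `v' age gender education, vce(robust)
}
```

The first model summarizes how covariates shift the expected number of insurance types; the separate logits show which particular types drive that shift.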



                • #9
                  Dear Professor Wooldridge, thanks so much for your reply. Do you mean using glm, family(binomial)? And what does an upper bound of 5 mean? Thank you.
                  Code:
                  codebook insurance4, compact
                  
                  Variable     Obs Unique      Mean  Min  Max  Label
                  -----------------------------------------------------------------------------------------------------------------
                  insurance4  3673      2  .9234958    0    1  Medical insurance    
                  -----------------------------------------------------------------------------------------------------------------
                  
                  egen insurance=anycount(insurance1-insurance20), values(1)
                  label variable insurance "how many social insurance/company benefits do you have?"
                  global covarlist age gender education
                  
                  sumdetail insurance
                  
                        how many social insurance/company benefits do you
                                              have?
                  -------------------------------------------------------------
                        Percentiles      Smallest
                   1%            0              0
                   5%            0              0
                  10%            0              0       Obs               5,979
                  25%            1              0       Sum of Wgt.       5,979
                  
                  50%            7                      Mean           6.997658
                                          Largest       Std. Dev.      5.903285
                  75%           12             19
                  90%           15             19       Variance       34.84877
                  95%           18             19       Skewness       .3445001
                  99%           19             19       Kurtosis       1.933804
                  
                  glm insurance $covarlist, family(binomial) //??
                  insurance > 1 in some cases
                  r(499);



                  • #10
                    The count is bounded below by zero and above by five. I believe the syntax is family(bin 5).



                    • #11
                      Wait, now it looks like you have 20 different possibilities? You should add them up and then 20 replaces 5.



                      • #12
                        Thank you very much, Professor Jeff Wooldridge! Could you please point me to a textbook, paper, or presentation on this use of the model? Does it belong to binomial response models? If so, should I use the binomial-logit family or the general binomial family? I checked George H. Dunteman's An Introduction to Generalized Linear Models, but found nothing relevant.

                        Code:
                        . glm insurance $covarlist, family(binomial 20)
                        
                        Iteration 0:   log likelihood = -31624.615  
                        Iteration 1:   log likelihood = -30571.207  
                        Iteration 2:   log likelihood = -30565.906  
                        Iteration 3:   log likelihood = -30565.905  
                        
                        Generalized linear models                         Number of obs   =      5,977
                        Optimization     : ML                             Residual df     =      5,973
                                                                          Scale parameter =          1
                        Deviance         =  47118.25865                   (1/df) Deviance =   7.888542
                        Pearson          =  40567.03046                   (1/df) Pearson  =   6.791735
                        
                        Variance function: V(u) = u*(1-u/20)              [Binomial]
                        Link function    : g(u) = ln(u/(20-u))            [Logit]
                        
                                                                          AIC             =   10.22918
                        Log likelihood   = -30565.90505                   BIC             =  -4821.002



                        • #13
                          You should add the vce(robust) option to allow the binomial distribution to be incorrect. See Section 18.3 of my 2010 MIT Press book.
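
As a sketch of what #13 suggests, the model from #12 can be refit with a robust (sandwich) variance estimator. The -margins- step is an addition for illustration and was not part of the original advice; it assumes $covarlist is defined as in #9.

```stata
* Sketch only: same mean model as in #12, but standard errors remain
* valid even if the binomial variance is wrong (for example, overdispersion).
glm insurance $covarlist, family(binomial 20) link(logit) vce(robust)

* Average partial effects on the expected count E(insurance | x).
margins, dydx(*)
```

The point estimates should be the same as in #12; only the standard errors change.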



                          • #14
                            In addition to what Carlo and Jeff have written, you might also take a look at this paper: https://pubmed.ncbi.nlm.nih.gov/38598916/



                            • #15
                              Thank you, Jeff Wooldridge and John Mullahy. In Econometric Analysis of Cross Section and Panel Data you wrote:
                              Sometimes we wish to analyze count data conditional on a known upper bound. For example, Thomas, Strauss, and Henriques (1990) study child mortality within families conditional on number of children ever born. Another example takes the dependent variable, yi, to be the number of adult children in family i who are high school graduates; the known upper bound, ni, is the number of children in family i. By conditioning on ni we are, presumably, treating it as exogenous. A natural starting point is to assume that yi given (ni, xi) has a binomial distribution, denoted Binomial [ni, p(xi, β)], where p(xi, β) is a function bounded between zero and one. In this setup, usually, yi is viewed as the sum of ni independent Bernoulli (zero-one) random variables...
                              I think I understand it, although not very deeply. And Professor Mullahy, sorry, I don't have access to your 2024 paper. However, I read its abstract, which is very inspiring.
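
For readers following the quoted passage, the binomial setup can be written out as below. This is a sketch in the notation of the quote, with the logit form of p assumed to match the glm output in #12 (where n_i = 20 for every respondent); the symbol \mu_i for the conditional mean is introduced here for illustration.

```latex
\begin{align*}
  \mu_i \equiv E(y_i \mid x_i, n_i) &= n_i \, p(x_i, \beta),
  \qquad p(x_i, \beta) = \frac{\exp(x_i \beta)}{1 + \exp(x_i \beta)}, \\
  \operatorname{Var}(y_i \mid x_i, n_i) &= n_i \, p(x_i, \beta)\,\bigl[1 - p(x_i, \beta)\bigr]
  = \mu_i \left(1 - \frac{\mu_i}{n_i}\right), \\
  \ln\!\left(\frac{\mu_i}{n_i - \mu_i}\right) &= x_i \beta .
\end{align*}
```

The second line holds only if the n_i Bernoulli outcomes are independent with a common probability. With vce(robust), inference relies on the mean in the first line being correctly specified, not on that variance.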

