
  • Interpretation of log-likelihood value

    I am using the gllamm command and doing a sensitivity analysis. I am choosing between 4 models; what I am changing between the models is whether age and income enter as categorical or continuous, so my models are: both continuous, only age categorical, only income categorical, and both categorical, along with some unchanged controls. My final model, which uses both as categorical, has the highest log-likelihood value; however, the income categories are all insignificant, whereas 2 of the 3 age categories are significant. Is this enough justification to say that this is my preferred model? Any advice would be greatly appreciated.

    Thanks
    Last edited by Martin Orr; 16 Oct 2018, 10:56.

  • #2
    gllamm is from SSC. You are asked to specify the source of user-written commands.

    what I am changing between the models are using age and income as either categorical or continuous
    You do not need a model selection criterion for this. If you have continuous age and income variables, you should use them, because categorizing them throws away information. Sometimes the individuals who conduct a survey frame their question in such a way that the responses are categorical. In that case, you have no choice but to use the variable as is. But in your case, can you come up with any good reason for wanting to categorize your variables? For anyone following this thread, the standard model selection criteria, i.e., AIC and BIC, are available after running gllamm. Just type

    Code:
    estat ic
    lrtest works as well, and one can also compute McFadden's pseudo R-squared, although for nested models a direct comparison of the log-likelihoods provides the same information as a comparison of pseudo R-squared statistics.
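    For instance, a likelihood-ratio test between two nested gllamm specifications can be run along these lines (the outcome y, predictors x1 x2, and cluster identifier id are made-up names for illustration):

    Code:
    gllamm y x1 x2, i(id)
    estimates store full
    gllamm y x1, i(id)
    estimates store restricted
    * likelihood-ratio test of the restricted against the full model
    lrtest full restricted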


    Comment


    • #3
      Originally posted by Andrew Musau

      My original model was continuous, but I wanted to test whether there was a non-linear relationship; that is why I switched to categorical. Income didn't become significant, but I also want to try another type of sensitivity analysis, which would involve restricting the sample size. Do you think I should return to the continuous measure for this test? Also, what do you mean by throwing away information?

      Also, when I type estat ic I get an error message reading "type mismatch"; that's why I only looked at the log-likelihood value.
      Last edited by Martin Orr; 16 Oct 2018, 12:00.

      Comment


      • #4
        Also what do you mean throwing away information?
        Assume that John is in my sample and has an annual income of $126,453. If I categorize income into $20,000-wide bands starting from $0, I will place John in the $120,000-$140,000 category. But wait: I know that John has an annual income of $126,453, and by categorizing income, I just threw away that information.

        My original model was continuous but I wanted to test if there was a non linear relationship
        If you want to test whether the income effect is non-linear, you add a quadratic term, i.e., income squared, to the regression, or some additional higher-order terms. You cannot use factor variables with gllamm, so you have to generate the variable yourself.

        Code:
        gen income2 = income*income
        Categorizing the variable is not the way to check for non-linearity.
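        As a sketch, with a hypothetical outcome y and cluster identifier surveycode:

        Code:
        gen income2 = income*income
        gllamm y income income2, i(surveycode)
        * Wald test of the quadratic term as a check for curvature
        test income2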

        restricting the sample size
        You need to have some basis for this; say, you want to examine whether individuals aged 50 and over behave the same way as the other individuals in the sample. But you can do this directly with interactions.
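        For example, assuming a hypothetical outcome y and cluster identifier surveycode, and remembering that gllamm does not accept factor-variable notation:

        Code:
        gen over50 = (age >= 50) if !missing(age)
        gen over50_inc = over50*income
        gllamm y income over50 over50_inc, i(surveycode)
        * test whether the income effect differs for the 50+ group
        test over50_inc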

        Also when I type estat ic I get an error message reading type mismatch that's why I only looked to the log likelihood value
        I have used estat ic after gllamm and it works fine. Please review the FAQ advice on posting and make sure that you show all the commands that you entered and the resulting Stata output, including errors. Use code delimiters when you post the Stata output.
        Last edited by Andrew Musau; 16 Oct 2018, 12:53.

        Comment


        • #5
          Originally posted by Andrew Musau
          When I say non-linear: I justified the categories by saying that different income groups have a different response to my dependent variable. I conducted a linktest to see whether any of my variables would require a squared term, and the result was that they didn't. Does this mean my categorisation also isn't valid?

          I am restricting the sample based on distance to work to fit one of my specification assumptions; it's a bit long-winded to explain here. It isn't based on age or income.

          Code:
          gllamm (dependent variable) (independent variables), i(surveycode) link(soprobit) constr(1) s(het) thresh(thresh) init trace
          
          estat ic
          That is the exact code I entered.

          Also, I was mistaken: my data came from a survey in which income was coded 1-8 as ranges. I created a continuous variable by taking the midpoint of each range, and also a categorical version by creating dummy variables. Age, however, was originally continuous, which I then made into categories.

          Last edited by Martin Orr; 16 Oct 2018, 13:12.

          Comment


          • #6
            I am restricting the sample size based on distance to work to fit one of my specification assumptions its a bit long winded to mention on here, it isn't based on age or income
            Fine, this is a reason, as long as it makes sense within your research area.

            Also I was mistaken my data came from a survey that was coded 1-8 with ranges of income, I created a continuous variable from taking the midpoint
            There is no point in doing this. You can never recover the true income values, and the results may mislead your readers, because they will interpret income as continuous whereas it really is not.

            When I say non linear I justified the categories by saying that different income groups have a different response to my dependent variable.
            This is a valid hypothesis. What you could do, and what is usually done, is to run a model with the full sample. Then, to test your hypothesis, run regressions over the income sub-samples and compare your results to the full sample and across sub-samples. It is also possible to run one regression interacting your variables with the different groups to test your hypothesis. To be honest, I do not see any reason for you to compare models using the various information criteria, because you have the same set of variables. These comparisons are useful only when you have additional variables and want to establish whether a model with more variables or a more parsimonious one is preferred. Bottom line, my advice is to keep continuous variables continuous (e.g., age in your model) and not to make a continuous variable from a categorical variable, as this does not add anything useful.

            Comment


            • #7
              Originally posted by Andrew Musau
              I explained in my data section that this is one way I will be specifying income. I like the sub-sample idea you suggested; however, 2 of the income categories contain about 50% of the data, and 3 contain about 70%. Whichever way I split it, I will end up with groups of completely different sizes, with the larger groups having almost no variance due to the categorical nature. Do you think that, in light of this, my previous method of simply running one regression with dummy variables, with a few of the categories grouped into bands, justifies this choice? I feel like this might be useful to bring up in the limitations as a result.

              Also, slightly off topic: aside from the ordered probit regression diagnostics, do you know of any diagnostics to apply to the CHOPIT model run by gllamm? I have read that there are no formal ways to test the two additional assumptions in the CHOPIT model, vignette equivalence and response consistency.

              Comment


              • #8
                I explained in my data section that this is one way I will be specifying income. I like the sub-sample idea you suggested; however, 2 of the income categories contain about 50% of the data, and 3 contain about 70%. Whichever way I split it, I will end up with groups of completely different sizes, with the larger groups having almost no variance due to the categorical nature. Do you think that, in light of this, my previous method of simply running one regression with dummy variables, with a few of the categories grouped into bands, justifies this choice? I feel like this might be useful to bring up in the limitations as a result.
                As I understood it, your income variable is categorical, so you will have the group dummies in the regression. My comment was that taking the mean values across groups and treating the result as continuous adds no value and may mislead your readers. So there is no issue here. Different sample sizes across groups are to be expected, and unless the observations are too few (e.g., fewer than 30), it should be OK to run the sub-sample regressions.

                Also, slightly off topic: aside from the ordered probit regression diagnostics, do you know of any diagnostics to apply to the CHOPIT model run by gllamm? I have read that there are no formal ways to test the two additional assumptions in the CHOPIT model, vignette equivalence and response consistency.
                I am more familiar with logit, ordered logit, and multinomial logit, but not with your current model. Sorry, I cannot help here.

                Comment


                • #9
                  Originally posted by Andrew Musau
                  How would I run the sub-sample regression? There would be barely any variation in the group.

                  Comment


                  • #10
                    The variation comes from your other independent variables, not income. From #5,

                    I justified the categories by saying that different income groups have a different response to my dependent variable.
                    I interpret this as you believing that individuals in the different income categories behave differently. How you establish this is through examining variation in other non-income variables that predict your dependent variable between individuals in the different income categories. Am I missing something?

                    Comment


                    • #11
                      Originally posted by Andrew Musau
                      I see what you're talking about now. Do you think both methods are valid? The sub-sample and the dummy variable.
                      Last edited by Martin Orr; 17 Oct 2018, 10:38.

                      Comment


                      • #12
                        Here is an illustration using the Stata data set nlswork. Here, I specify a logistic model predicting union membership and assuming that I have a categorical wage variable with 3 categories.

                        Code:
                        webuse nlswork
                        sum ln_wage,d
                        *GENERATE CATEGORICAL VARIABLE FOR WAGE (3 CATEGORIES)
                        gen hiwage=1
                        replace hiwage =2 if inrange(ln_wage, 1.361496, 1.964083)
                        replace hiwage =3 if ln_wage> 1.964083
                        
                        *ODDS OF UNION MEMBERSHIP (WAGE DUMMIES)
                        logistic union age i.race tenure hours i.hiwage, nolog
                        
                        *LOGISTIC REGRESSION ACROSS WAGE CATEGORIES
                        logistic union age i.race tenure hours if 1.hiwage, nolog
                        logistic union age i.race tenure hours if 2.hiwage, nolog
                        logistic union age i.race tenure hours if 3.hiwage, nolog
                        
                        *NOTE THAT SUB-SAMPLE REGRESSIONS ARE EQUIVALENT TO ONE
                        *REGRESSION WITH GROUP INTERACTIONS
                        
                        logistic union (c.age i.race c.tenure c.hours)#i.hiwage, nolog
                        1. The regression with wage dummies

                        Code:
                        . logistic union age i.race tenure hours i.hiwage, nolog
                        
                        Logistic regression                             Number of obs     =     18,976
                                                                        LR chi2(7)        =    1408.71
                                                                        Prob > chi2       =     0.0000
                        Log likelihood = -9647.7396                     Pseudo R2         =     0.0680
                        
                        ------------------------------------------------------------------------------
                               union | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                                 age |   .9847804   .0031682    -4.77   0.000     .9785904    .9910095
                                     |
                                race |
                              black  |   1.981473   .0769879    17.60   0.000     1.836182    2.138261
                              other  |   .9456696   .1661107    -0.32   0.750     .6702278    1.334309
                                     |
                              tenure |   1.054628   .0047789    11.74   0.000     1.045303    1.064036
                               hours |   1.011964   .0021389     5.63   0.000     1.007781    1.016165
                                     |
                              hiwage |
                                  2  |   2.458343   .1501493    14.73   0.000     2.180988     2.77097
                                  3  |   4.629003    .297622    23.83   0.000     4.080933     5.25068
                                     |
                               _cons |   .0759769   .0103453   -18.93   0.000     .0581806    .0992166
                        ------------------------------------------------------------------------------
                        Note: _cons estimates baseline odds.
                        From this regression, the odds ratio for the second wage group (2.458343) is the ratio of the odds of union membership in the second wage group to the odds of union membership in the first (reference) wage group. The regression with wage dummies tells you whether the odds that your dependent variable equals 1 differ between a given category of the independent variable and the reference category.
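                        One can also test whether the wage-group effects differ from each other, and not only from the reference group, with a Wald test after the regression:

                        Code:
                        * test equality of the second and third wage-group coefficients
                        test 2.hiwage = 3.hiwage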

                        2. The sub-samples regression (all together using a group interaction)

                        Code:
                        . logistic union (c.age i.race c.tenure c.hours)#i.hiwage, nolog
                        
                        Logistic regression                             Number of obs     =     18,976
                                                                        LR chi2(17)       =    1445.72
                                                                        Prob > chi2       =     0.0000
                        Log likelihood =  -9629.234                     Pseudo R2         =     0.0698
                        
                        ---------------------------------------------------------------------------------
                                  union | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
                        ----------------+----------------------------------------------------------------
                           hiwage#c.age |
                                     1  |   .9980041   .0088175    -0.23   0.821      .980871    1.015437
                                     2  |   .9891414   .0043976    -2.46   0.014     .9805597    .9977981
                                     3  |   .9732133   .0053996    -4.89   0.000     .9626877    .9838541
                                        |
                            race#hiwage |
                               white#2  |   2.506535   .9620272     2.39   0.017     1.181344    5.318283
                               white#3  |   12.19546   4.836788     6.31   0.000       5.6054    26.53319
                               black#1  |   1.406716    .154683     3.10   0.002     1.133987    1.745038
                               black#2  |   4.774363   1.843987     4.05   0.000     2.239539    10.17823
                               black#3  |   28.94366   11.60697     8.39   0.000      13.1888    63.51871
                               other#1  |   1.779644   .8781077     1.17   0.243     .6766051    4.680917
                               other#2  |   2.313195   1.131764     1.71   0.087     .8866446    6.034966
                               other#3  |   10.72816     4.9266     5.17   0.000     4.361494    26.38853
                                        |
                        hiwage#c.tenure |
                                     1  |   1.044426   .0195413     2.32   0.020      1.00682    1.083437
                                     2  |   1.054399    .007735     7.22   0.000     1.039347    1.069669
                                     3  |   1.060023   .0065722     9.40   0.000      1.04722    1.072983
                                        |
                         hiwage#c.hours |
                                     1  |   1.016313   .0047067     3.49   0.000      1.00713     1.02558
                                     2  |   1.017895   .0033604     5.37   0.000      1.01133    1.024502
                                     3  |   1.004221   .0034116     1.24   0.215     .9975565     1.01093
                                        |
                                  _cons |   .0528797   .0177209    -8.77   0.000     .0274181    .1019861
                        ---------------------------------------------------------------------------------
                        Note: _cons estimates baseline odds.

                        This regression allows you to test, for a given independent variable, whether its coefficient differs across income groups: for example, whether the coefficient of age differs between the highest and lowest wage groups above. This example is based on logit, but the idea holds generally.
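                        For that age comparison, a Wald test on the interaction coefficients does the job:

                        Code:
                        * is the age coefficient equal in the lowest and highest wage groups?
                        test 1.hiwage#c.age = 3.hiwage#c.age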

                        Comment


                        • #13
                          Originally posted by Andrew Musau
                          Do you think there is anything similar to this I could report, given that my results on income are insignificant?

                          Comment


                          • #14
                            Only if you think there is some useful comparison between one or more of the independent variables and the dependent variable across income groups. This, combined with tests, will show that explicitly. Otherwise, if you think that the comparisons are not necessary, just run the main regression and forget about this. Of course, you can add interactions of variables in your main regression too.

                            Comment


                            • #15
                              Originally posted by Andrew Musau
                              Well, maybe not similar to this exactly, but is there any other way I could report results, as my results section is quite small?

                              Comment
