
  • Interpretation of log-likelihood value

    I am using the gllamm command and doing a sensitivity analysis. I am choosing between 4 models; what I am changing between the models is whether age and income enter as categorical or continuous, so my models are: both continuous, only age categorical, only income categorical, and both categorical, along with some unchanged controls. My final model, which uses both as categorical, has the highest log-likelihood value; however, the income categories are all insignificant, whereas 2 of the 3 age categories are significant. Is this enough justification to say that this is my preferred model? Any advice would be greatly appreciated.

    Thanks
    Last edited by Martin Orr; 16 Oct 2018, 10:56.

  • #2
    gllamm is from SSC. You are asked to specify the source of user-written commands.

    what I am changing between the models are using age and income as either categorical or continuous
    You do not need a model selection criterion for this. If you have continuous age and income variables, you should use them, because categorizing them throws away information. Sometimes the individuals who conduct a survey frame their question in such a way that the responses are categorical. In that case, you have no choice but to use the variable as is. But in your case, can you come up with any good reason for wanting to categorize your variables? For anyone following this thread, the standard model selection criteria, i.e., AIC and BIC, are available after running gllamm. Just type

    Code:
    estat ic
    lrtest works as well, and one can also compute McFadden's pseudo R-squared, although for nested models a direct comparison of the log-likelihoods provides the same information as a comparison of pseudo R-squared statistics.
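    For instance, a likelihood-ratio test between two nested gllamm specifications can be run along these lines (the outcome y, predictors x1 x2, and cluster identifier id are made-up names for illustration):

    Code:
    gllamm y x1 x2, i(id)
    estimates store full
    gllamm y x1, i(id)
    estimates store restricted
    * likelihood-ratio test of the restricted against the full model
    lrtest full restricted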


    Comment


    • #3
      Originally posted by Andrew Musau

      My original model was continuous, but I wanted to test whether there was a non-linear relationship; that is why I switched to categorical. Income didn't become significant, but I also want to try another type of sensitivity analysis, which would involve restricting the sample size. Do you think I should return to the continuous measure for this test? Also, what do you mean by throwing away information?

      Also, when I type estat ic I get an error message reading "type mismatch"; that's why I only looked at the log-likelihood value.
      Last edited by Martin Orr; 16 Oct 2018, 12:00.

      Comment


      • #4
        Also what do you mean throwing away information?
        Assume that John is in my sample and has an annual income of $126,453. If I categorize income into $20,000-wide bands starting from $0, I will place John in the $120,000-$140,000 category. But wait: I know that John has an annual income of $126,453, and by categorizing income, I just threw away that information.

        My original model was continuous but I wanted to test if there was a non linear relationship
        If you want to test whether the income effect is non-linear, you add a quadratic term, i.e., income squared, to the regression, or some additional higher-order terms. You cannot use factor variables with gllamm, so you have to generate the variable yourself.

        Code:
        gen income2 = income*income
        Categorizing the variable is not the way to check for non-linearity.
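        As a sketch, with a hypothetical outcome y and cluster identifier surveycode:

        Code:
        gen income2 = income*income
        gllamm y income income2, i(surveycode)
        * Wald test of the quadratic term as a check for curvature
        test income2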

        restricting the sample size
        You need to have some basis for this; say, you want to examine whether individuals aged 50 and over behave the same way as the other individuals in the sample. But you can do this directly with interactions.
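        For example, assuming a hypothetical outcome y and cluster identifier surveycode, and remembering that gllamm does not accept factor-variable notation:

        Code:
        gen over50 = (age >= 50) if !missing(age)
        gen over50_inc = over50*income
        gllamm y income over50 over50_inc, i(surveycode)
        * test whether the income effect differs for the 50+ group
        test over50_inc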

        Also when I type estat ic I get an error message reading type mismatch that's why I only looked to the log likelihood value
        I have used estat ic after gllamm and it works fine. Please review the FAQ advice on posting and make sure that you show all the commands that you entered and the resulting Stata output, including errors. Use code delimiters when you post the Stata output.
        Last edited by Andrew Musau; 16 Oct 2018, 12:53.

        Comment


        • #5
          Originally posted by Andrew Musau
          When I say non-linear: I justified the categories by saying that different income groups have a different response to my dependent variable. I conducted a linktest to see whether any of my variables would require a squared term, and the result was that they didn't. Does this mean my categorisation also isn't valid?

          I am restricting the sample based on distance to work to fit one of my specification assumptions; it's a bit long-winded to explain here. It isn't based on age or income.

          Code:
          gllamm (dependent variable) (independent variables), i(surveycode) link(soprobit) constr(1) s(het) thresh(thresh) init trace
          
          estat ic
          That is the exact code I entered.

          Also, I was mistaken: my data came from a survey in which income was coded 1-8 as ranges. I created a continuous variable by taking the midpoint of each range, and also a categorical version by creating dummy variables. Age, however, was originally continuous, which I then made into categories.

          Last edited by Martin Orr; 16 Oct 2018, 13:12.

          Comment


          • #6
            I am restricting the sample size based on distance to work to fit one of my specification assumptions its a bit long winded to mention on here, it isn't based on age or income
            Fine, this is a reason, as long as it makes sense within your research area.

            Also I was mistaken my data came from a survey that was coded 1-8 with ranges of income, I created a continuous variable from taking the midpoint
            There is no point in doing this. You can never recover the true income values, and the results may mislead your readers, because they will interpret income as continuous whereas it really is not.

            When I say non linear I justified the categories by saying that different income groups have a different response to my dependent variable.
            This is a valid hypothesis. What you could do, and what is usually done, is to run a model with the full sample. Then, to test your hypothesis, run regressions over the income sub-samples and compare your results to the full sample and across sub-samples. It is also possible to run one regression interacting your variables with the different groups to test your hypothesis. To be honest, I do not see any reason for you to compare models using the various information criteria, because you have the same set of variables. These comparisons are useful only when you have additional variables and want to establish whether a model with more variables or a more parsimonious one is preferred. Bottom line, my advice is to keep continuous variables continuous (e.g., age in your model) and not to make a continuous variable from a categorical variable, as this does not add anything useful.

            Comment


            • #7
              Originally posted by Andrew Musau
              I explained in my data section that this is one way I will be specifying income. I like the sub-sample idea you suggested; however, 2 of the income categories contain about 50% of the data, and 3 contain about 70%. Whichever way I split it, I will end up with groups of completely different sizes, with the larger groups having almost no variance due to the categorical nature. Do you think that, in light of this, my previous method of simply running one regression with dummy variables, with a few of the categories grouped into bands, justifies this choice? I feel like this might be useful to bring up in the limitations as a result.

              Also, slightly off topic: aside from the ordered probit regression diagnostics, do you know of any diagnostics to apply to the CHOPIT model run by gllamm? I have read that there are no formal ways to test the two additional assumptions in the CHOPIT model, vignette equivalence and response consistency.

              Comment


              • #8
                I explained in my data section that this is one way I will be specifying income. I like the sub-sample idea you suggested; however, 2 of the income categories contain about 50% of the data, and 3 contain about 70%. Whichever way I split it, I will end up with groups of completely different sizes, with the larger groups having almost no variance due to the categorical nature. Do you think that, in light of this, my previous method of simply running one regression with dummy variables, with a few of the categories grouped into bands, justifies this choice? I feel like this might be useful to bring up in the limitations as a result.
                As I understood it, your income variable is categorical, so you will have the group dummies in the regression. My comment was that taking the mean values across groups and treating the result as continuous adds no value and may mislead your readers. So there is no issue here. Different sample sizes across groups are to be expected, and unless the observations are too few (e.g., fewer than 30), it should be OK to run the sub-sample regressions.

                Also, slightly off topic: aside from the ordered probit regression diagnostics, do you know of any diagnostics to apply to the CHOPIT model run by gllamm? I have read that there are no formal ways to test the two additional assumptions in the CHOPIT model, vignette equivalence and response consistency.
                I am more familiar with logit, ordered logit, and multinomial logit, but not with your current model. Sorry, I cannot help here.

                Comment


                • #9
                  Originally posted by Andrew Musau
                  How would I run the sub-sample regression? There would be barely any variation in the group.

                  Comment


                  • #10
                    The variation comes from your other independent variables, not income. From #5,

                    I justified the categories by saying that different income groups have a different response to my dependent variable.
                    I interpret this as you believing that individuals in the different income categories behave differently. How you establish this is through examining variation in other non-income variables that predict your dependent variable between individuals in the different income categories. Am I missing something?

                    Comment


                    • #11
                      Originally posted by Andrew Musau
                      I see what you're talking about now. Do you think both methods are valid? The sub-sample and the dummy variable.
                      Last edited by Martin Orr; 17 Oct 2018, 10:38.

                      Comment


                      • #12
                        Here is an illustration using the Stata data set nlswork. Here, I specify a logistic model predicting union membership and assuming that I have a categorical wage variable with 3 categories.

                        Code:
                        webuse nlswork
                        sum ln_wage,d
                        *GENERATE CATEGORICAL VARIABLE FOR WAGE (3 CATEGORIES)
                        gen hiwage=1
                        replace hiwage =2 if inrange(ln_wage, 1.361496, 1.964083)
                        replace hiwage =3 if ln_wage> 1.964083
                        
                        *ODDS OF UNION MEMBERSHIP (WAGE DUMMIES)
                        logistic union age i.race tenure hours i.hiwage, nolog
                        
                        *LOGISTIC REGRESSION ACROSS WAGE CATEGORIES
                        logistic union age i.race tenure hours if 1.hiwage, nolog
                        logistic union age i.race tenure hours if 2.hiwage, nolog
                        logistic union age i.race tenure hours if 3.hiwage, nolog
                        
                        *NOTE THAT SUB-SAMPLE REGRESSIONS ARE EQUIVALENT TO ONE
                        *REGRESSION WITH GROUP INTERACTIONS
                        
                        logistic union (c.age i.race c.tenure c.hours)#i.hiwage, nolog
                        1. The regression with wage dummies

                        Code:
                        . logistic union age i.race tenure hours i.hiwage, nolog
                        
                        Logistic regression                             Number of obs     =     18,976
                                                                        LR chi2(7)        =    1408.71
                                                                        Prob > chi2       =     0.0000
                        Log likelihood = -9647.7396                     Pseudo R2         =     0.0680
                        
                        ------------------------------------------------------------------------------
                               union | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                                 age |   .9847804   .0031682    -4.77   0.000     .9785904    .9910095
                                     |
                                race |
                              black  |   1.981473   .0769879    17.60   0.000     1.836182    2.138261
                              other  |   .9456696   .1661107    -0.32   0.750     .6702278    1.334309
                                     |
                              tenure |   1.054628   .0047789    11.74   0.000     1.045303    1.064036
                               hours |   1.011964   .0021389     5.63   0.000     1.007781    1.016165
                                     |
                              hiwage |
                                  2  |   2.458343   .1501493    14.73   0.000     2.180988     2.77097
                                  3  |   4.629003    .297622    23.83   0.000     4.080933     5.25068
                                     |
                               _cons |   .0759769   .0103453   -18.93   0.000     .0581806    .0992166
                        ------------------------------------------------------------------------------
                        Note: _cons estimates baseline odds.
                        From this regression, the odds ratio for the second wage group (2.458343) is the ratio of the odds of union membership in the second wage group to the odds of union membership in the first (reference) wage group. The regression with wage dummies tells you whether the odds that your dependent variable equals 1 differ between a given category of the independent variable and the reference category.
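                        One can also test whether the wage-group effects differ from each other, and not only from the reference group, with a Wald test after the regression:

                        Code:
                        * test equality of the second and third wage-group coefficients
                        test 2.hiwage = 3.hiwage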

                        2. The sub-samples regression (all together using a group interaction)

                        Code:
                        . logistic union (c.age i.race c.tenure c.hours)#i.hiwage, nolog
                        
                        Logistic regression                             Number of obs     =     18,976
                                                                        LR chi2(17)       =    1445.72
                                                                        Prob > chi2       =     0.0000
                        Log likelihood =  -9629.234                     Pseudo R2         =     0.0698
                        
                        ---------------------------------------------------------------------------------
                                  union | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
                        ----------------+----------------------------------------------------------------
                           hiwage#c.age |
                                     1  |   .9980041   .0088175    -0.23   0.821      .980871    1.015437
                                     2  |   .9891414   .0043976    -2.46   0.014     .9805597    .9977981
                                     3  |   .9732133   .0053996    -4.89   0.000     .9626877    .9838541
                                        |
                            race#hiwage |
                               white#2  |   2.506535   .9620272     2.39   0.017     1.181344    5.318283
                               white#3  |   12.19546   4.836788     6.31   0.000       5.6054    26.53319
                               black#1  |   1.406716    .154683     3.10   0.002     1.133987    1.745038
                               black#2  |   4.774363   1.843987     4.05   0.000     2.239539    10.17823
                               black#3  |   28.94366   11.60697     8.39   0.000      13.1888    63.51871
                               other#1  |   1.779644   .8781077     1.17   0.243     .6766051    4.680917
                               other#2  |   2.313195   1.131764     1.71   0.087     .8866446    6.034966
                               other#3  |   10.72816     4.9266     5.17   0.000     4.361494    26.38853
                                        |
                        hiwage#c.tenure |
                                     1  |   1.044426   .0195413     2.32   0.020      1.00682    1.083437
                                     2  |   1.054399    .007735     7.22   0.000     1.039347    1.069669
                                     3  |   1.060023   .0065722     9.40   0.000      1.04722    1.072983
                                        |
                         hiwage#c.hours |
                                     1  |   1.016313   .0047067     3.49   0.000      1.00713     1.02558
                                     2  |   1.017895   .0033604     5.37   0.000      1.01133    1.024502
                                     3  |   1.004221   .0034116     1.24   0.215     .9975565     1.01093
                                        |
                                  _cons |   .0528797   .0177209    -8.77   0.000     .0274181    .1019861
                        ---------------------------------------------------------------------------------
                        Note: _cons estimates baseline odds.

                        This regression allows you to test, for a given independent variable, whether its coefficient differs across income groups: for example, whether the coefficient of age differs between the highest and lowest wage groups above. This example is based on logit, but the idea holds generally.
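                        For that age comparison, a Wald test on the interaction coefficients does the job:

                        Code:
                        * is the age coefficient equal in the lowest and highest wage groups?
                        test 1.hiwage#c.age = 3.hiwage#c.age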

                        Comment


                        • #13
                          Originally posted by Andrew Musau
                          Do you think there is anything similar to this I could report, given that my results on income are insignificant?

                          Comment


                          • #14
                            Only if you think there is some useful comparison between one or more of the independent variables and the dependent variable across income groups. This, combined with tests, will show that explicitly. Otherwise, if you think that the comparisons are not necessary, just run the main regression and forget about this. Of course, you can add interactions of variables in your main regression too.

                            Comment


                            • #15
                              Originally posted by Andrew Musau
                              Well, maybe not similar to this exactly, but is there any other way I could report results, as my results section is quite small?

                              Comment
