  • Question on Multinomial logit model

    Hello everybody!

    I would have a question on the multinomial logit model.

    I have a dataset of 25,000 observations (individuals), and I am using a multinomial logit regression to test the effect of some control variables (age, income, etc.) on a dependent variable recording which of 5 heating systems each individual chose (gas, electric, wood, carbon, other). The problem is that some individuals chose more than one category (e.g., gas and wood; or gas, carbon and other), so I was wondering how to treat these special cases.

    According to the theory behind the multinomial logit model (if I'm not wrong), the choices in the dependent variable are assumed to be independent of each other.

    To cope with this issue, I thought about creating additional choice categories that cover the multiple responses (e.g., gas and electric; gas, carbon and other, ...), but if I proceed this way the overall number of choices increases remarkably! Luckily, there are few cases in which a person selects more than one alternative, but I still think this constitutes a problem.

    Alternatively, I thought about adding a dummy variable to the controls to capture individuals who selected more than one alternative (although I am not sure whether this is correct).

    If you could provide me with some help on this, I would be extremely grateful.

    Thank you very much.

    With best regards,

    Kodi

  • #2
    There is no entirely satisfactory way to handle this situation. Every approach has its advantages and drawbacks, and you need to choose the approach which is least damaging in your context.

    1. Creating comprehensive choice categories including multiple responses. The number of response categories becomes unwieldy: with 5 options there are 2^5 = 32 combinations in total, or perhaps only 31 if people choosing no system are excluded from the analysis. Not only will this make estimation of the model very slow, but, since the number of people choosing more than one system is small, each combination will be present in only a handful of observations, and the corresponding parameter estimates will be so imprecise as to be useless. However, this approach has the advantage that it is a faithful representation of the reality you are trying to model, so no bias is introduced into the analysis by doing this. In short, this approach is unbiased but highly inefficient.

    2. Including an indicator variable (dummy) for multiple choices strikes me as problematic. If a person chose both coal and wood, this indicator will be set to 1, but what value will you assign to the choice variable? Clearly whether you consider it as coal + multiple selection or wood + multiple selection affects your results. If you decide to create two such observations, one each for coal and wood, then you are giving greater weight to those who chose multiple systems. Either way, you are biasing the analysis in a potentially severe way. And I don't see anything on the favorable side to recommend this way. At best one can say that if there are only a small number of such cases, the damage may not be too severe.
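    To make the count in approach 1 concrete, here is a minimal Python sketch that enumerates every possible combination of the five systems. The list SYSTEMS and the helper all_combinations are names made up for this illustration, and "coal" stands in for the category the original post calls "carbon".

```python
from itertools import combinations

# Hypothetical labels for the five heating systems in the question.
SYSTEMS = ["gas", "electric", "wood", "coal", "other"]


def all_combinations(systems, include_empty=False):
    """Return every subset of `systems` as a tuple, smallest subsets first."""
    start = 0 if include_empty else 1
    return [
        combo
        for size in range(start, len(systems) + 1)
        for combo in combinations(systems, size)
    ]


with_empty = all_combinations(SYSTEMS, include_empty=True)
without_empty = all_combinations(SYSTEMS)
# Outcome categories that exist only because of multiple selections.
multi_only = [combo for combo in without_empty if len(combo) > 1]

print(len(with_empty), len(without_empty), len(multi_only))
```

    All but five of the non-empty combinations are multi-system categories, and those are the ones that only a small share of the sample can populate.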

    There are other possible approaches, as well.

    3. You could create a separate response category called "multiple selections" so that you now have 6 levels of outcome instead of 5. This is not a completely faithful representation of reality because it treats coal + wood as being equivalent to gas + electric. So you lose the ability to distinguish the different combinations, and the coefficients for this "multiple selection" category are difficult to interpret. But it will not bias the other results, nor decrease their precision.

    4. You could simply exclude those who selected multiple responses from the analysis. This will bias your sample, but if there aren't many of these, the damage may not be too severe.

    5. You could treat those who selected multiple responses as if they had selected "other." Since "other" is a non-specific category that typically contains a mix of unrelated things, adding in multiple responses simply enlarges a category whose inclusion in the modeling is regrettable to begin with. It makes the interpretation of the "other" parameters even more muddled than it would have been in the first place, but it does no harm to the analysis for the other categories.
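    As a rough illustration, approaches 3 to 5 differ only in how a multi-system respondent is recoded before the model is fit. The Python sketch below shows that recoding step alone, not the estimation. The function recode_choice, the strategy names, and the four sample respondents are all invented for this example.

```python
def recode_choice(choices, strategy):
    """Collapse one respondent's chosen systems into a single outcome label.

    Returns None when the respondent should be excluded from the analysis.
    """
    if strategy not in ("multiple", "drop", "other"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    chosen = sorted(set(choices))
    if not chosen:
        raise ValueError("respondent chose no heating system")
    if len(chosen) == 1:
        return chosen[0]      # single choices are left untouched
    if strategy == "multiple":
        return "multiple"     # approach 3: a sixth outcome level
    if strategy == "drop":
        return None           # approach 4: exclude from the sample
    return "other"            # approach 5: fold into "other"


# Made-up respondents, for illustration only.
sample = [["gas"], ["coal", "wood"], ["gas", "electric"], ["other"]]

for strategy in ("multiple", "drop", "other"):
    recoded = [recode_choice(c, strategy) for c in sample]
    kept = [label for label in recoded if label is not None]
    print(strategy, kept)
```

    Under every strategy the coal + wood and gas + electric respondents end up with the same label (or are both dropped), which is the loss of information described above.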

    In choosing among these approaches (and there may be others I haven't thought of) you have to trade off the pluses and minuses of each. The relative importance of bias and efficiency (precision) depends on your particular context and research goals, so only you can make these value judgments.

    Finally, you are probably not the first person in your field to encounter this difficulty, and there may be a generally accepted way of dealing with this in your field. If that is the case, your audience will expect you to follow that established convention and may find your results confusing or unacceptable if you do not. So check the literature in your field.

    • #3
      Thank you Clyde!
      K
