  • Marginal Effects after Logistic Regression

    Dear community members,

    I am currently struggling with marginal effects (ME) after my logistic regression. My framework looks as follows: I am regressing Age (values 1, 2, 3, 4, 5), Gender (value 1 for both male and female, 0 for male only), House (values 1, 0), and so on against the variable car ownership. Put simply, I am trying to find out which variables have an effect on the likelihood that people purchase a car, based on what they already possess and how old they are, and which do not.

    Question 1) Which logistic regression is more correct, and how does each differ with regard to marginal-effects postestimation?

    My logistic regression looks like this: logistic Car age gender house (1)
    The literature also mentions the following with regard to ME analysis: logistic car age i.gender i.house (2)
    The "i." prefix tells Stata that the covariate is not continuous, since house and gender are binary in nature.

    Now I have two versions of ME in place.

    Version one follows my initial logit regression, logistic Car age gender house (1):

    1) margins, dydx(house): this command gives me the average marginal effect, i.e. the likely effect that possession versus non-possession of a house has on the probability of purchasing a car.
    2) margins house: this command produces the error "house not found in list of covariates". I assume this error is connected to the missing "i." in front of the discrete variables in the preceding logit regression. In essence, I simply want to see the probability at house = 0 and at house = 1. To my understanding, the difference between the two must be the same as margins, dydx(house), or am I mistaken?
    3) If I use margins, at(house = (1 0)), the difference between house = 0 and house = 1 does not, however, equal the value from margins, dydx(house).

    In version two, logit regression (2) logistic car age i.gender i.house, both margins, dydx(house) and margins house work well, and the difference between house = 0 and house = 1 after margins house exactly equals the value from margins, dydx(house).

    So my questions are: which logistic regression is "more" correct (using the "i." or leaving it out), and must the difference between covariate = 1 and covariate = 0 always equal the value of margins, dydx(covariate)?

    Thank you very much in advance!!!

  • #2
    I can't reproduce the problem you are having. I tried something analogous using the Stata auto.dta. For a dependent variable, I simply created something called "expensive" defined as price exceeding the mean price. I then did logistic regression of that against mpg (continuous) and foreign (dichotomous). I did both using i.foreign factor-variable notation, and just plain foreign. As you can see, the results I get are the same either way.

    Code:
    . sysuse auto,
    (1978 Automobile Data)
    
    .
    . summ price
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
           price |         74    6165.257    2949.496       3291      15906
    
    . gen byte expensive = (price > `r(mean)')
    
    .
    . logistic expensive mpg i.foreign
    
    Logistic regression                             Number of obs     =         74
                                                    LR chi2(2)        =      21.24
                                                    Prob > chi2       =     0.0000
    Log likelihood = -34.413711                     Pseudo R2         =     0.2358
    
    ------------------------------------------------------------------------------
       expensive | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             mpg |   .7356844   .0665681    -3.39   0.001     .6161281    .8784401
                 |
         foreign |
        Foreign  |   10.12642   8.229433     2.85   0.004     2.059256    49.79681
           _cons |   95.83255   155.6475     2.81   0.005     3.972141    2312.073
    ------------------------------------------------------------------------------
    
    . margins foreign
    
    Predictive margins                              Number of obs     =         74
    Model VCE    : OIM
    
    Expression   : Pr(expensive), predict()
    
    ------------------------------------------------------------------------------
                 |            Delta-method
                 |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         foreign |
       Domestic  |   .2107033   .0468777     4.49   0.000     .1188246     .302582
        Foreign  |   .5816868   .0894399     6.50   0.000     .4063879    .7569857
    ------------------------------------------------------------------------------
    
    . margins, dydx(foreign)
    
    Average marginal effects                        Number of obs     =         74
    Model VCE    : OIM
    
    Expression   : Pr(expensive), predict()
    dy/dx w.r.t. : 1.foreign
    
    ------------------------------------------------------------------------------
                 |            Delta-method
                 |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         foreign |
        Foreign  |   .3709835   .1020291     3.64   0.000     .1710101    .5709569
    ------------------------------------------------------------------------------
    Note: dy/dx for factor levels is the discrete change from the base level.
    
    .
    . logistic expensive mpg foreign
    
    Logistic regression                             Number of obs     =         74
                                                    LR chi2(2)        =      21.24
                                                    Prob > chi2       =     0.0000
    Log likelihood = -34.413711                     Pseudo R2         =     0.2358
    
    ------------------------------------------------------------------------------
       expensive | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             mpg |   .7356844   .0665681    -3.39   0.001     .6161281    .8784401
         foreign |   10.12642   8.229433     2.85   0.004     2.059256    49.79681
           _cons |   95.83255   155.6475     2.81   0.005     3.972141    2312.073
    ------------------------------------------------------------------------------
    
    . margins, at(foreign = (0 1))
    
    Predictive margins                              Number of obs     =         74
    Model VCE    : OIM
    
    Expression   : Pr(expensive), predict()
    
    1._at        : foreign         =           0
    
    2._at        : foreign         =           1
    
    ------------------------------------------------------------------------------
                 |            Delta-method
                 |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             _at |
              1  |   .2107033   .0468777     4.49   0.000     .1188246     .302582
              2  |   .5816868   .0894399     6.50   0.000     .4063879    .7569857
    ------------------------------------------------------------------------------
    Perhaps there is some other difference in the regressions you tried. It would be best if you posted (in a code block) your exact commands and Stata's output, pasted directly from the Results window.

    • #3
      Hmm. I wrote a response about 5 minutes ago, and it doesn't seem to be here now, even though the forum page says it should be.

      Anyway, in that post I said that I couldn't reproduce your problem in the auto.dta set. But I realize that I wasn't doing exactly what you did.

      Whether you use i.house or plain house, you will get the same logistic regression coefficient for that variable, and the results of -margins house- in the factor-variable version will be the same as the results of -margins, at(house = (0 1))- in the non-factor-variable version.

      But the results of -margins, dydx(house)- will differ between the two. The explanation is in the Methods and formulas section of the [R] manual entry for -margins- (p. 1396).

      When you use factor-variable notation, -margins- "knows" that house is a discrete variable, and it calculates the marginal effect simply as the difference in predicted probability between house = 0 and house = 1. When you don't use factor-variable notation, -margins- does not know what house is and treats it as a continuous variable. In that case, it estimates the partial derivative of the inverse logistic function of the linear prediction (xb) with respect to house, evaluated at the observed covariate values and averaged over the sample. Given the non-linearity of the logistic function, those are two different things.
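
      To see the two formulas in action, here is a minimal sketch reusing the auto.dta example from #2; the variable names p, deriv, xb0, and dchange are mine, purely for illustration.

      Code:
      sysuse auto, clear
      quietly summarize price
      generate byte expensive = (price > r(mean))

      * the coefficients are identical with or without i.foreign, so one fit suffices
      quietly logistic expensive mpg foreign

      * continuous treatment: average of b*p*(1-p) at the observed covariate values
      predict double p, pr
      generate double deriv = _b[foreign]*p*(1 - p)
      summarize deriv      // mean = margins, dydx(foreign) without i.

      * factor-variable treatment: average discrete change in Pr(expensive)
      generate double xb0 = _b[_cons] + _b[mpg]*mpg    // linear index at foreign = 0
      generate double dchange = invlogit(xb0 + _b[foreign]) - invlogit(xb0)
      summarize dchange    // mean = margins, dydx(foreign) with i.foreign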

      Which one you want is up to you. When dealing with a truly dichotomous variable, as is the case here, people normally think of the marginal effect as the effect of a 1-unit change in the variable--which is what the i.house version gives you. But sometimes economists really think of marginal effects as the derivative of an output with respect to an input--which would be the version without factor variables. I think the former makes more sense when the variable is, as here, genuinely discrete--but you need to consider who you are doing this analysis for and what they will assume is meant by the term "marginal effect."

      All of that said, in my opinion, the introduction of factor variables and the -margins- command was one of the great leaps forward in Stata's history. Using them makes life after regression so much simpler. And if you consistently use factor-variable notation wherever it is applicable, you will at the very least get sensible results all the time, even if there are circumstances where, as here, another interpretation is possible. In my experience (as an epidemiologist), the result from factor variables with -margins- has always been more sensible than the other possibilities.


      • #4
        Dear Clyde, thank you very much for your outstanding help with my issue. I share your opinion about the factor variables; this makes more sense to me since, for example, ownership of a car can only be 0 or 1 and nothing in between. Nonetheless, I've uploaded my dataset together with the ME commands I use, one with the "i." in front of the covariate and one without. Maybe you can reproduce the problem I describe above?

        Which version of ME do you prefer in general: holding all other covariates at their means, or leaving them at the values they actually take? I find the "atmeans" option somewhat odd, since car ownership, for example, cannot be 0.47, right?

        • #5
          Yes, I can reproduce your findings. The error message means what it says: you can't invoke -margins Cov1- unless Cov1 is a factor variable in your model.

          And, yes, the result you get from -margins, dydx(Cov1)- differs when Cov1 is a factor variable from when it is not. That's what I was pointing out in #3.

          With regard to making adjustments with all covariates at their means or at their observed values, I don't have a general preference. It depends on what I'm trying to use the results for, which in turn depends on the substantive aspects of the problem. Is it more useful, in a pragmatic sense, from the perspective of whoever will be seeing and using your results, to deal with results based on a hypothetical average case that may or may not exist in reality but that abstracts away from other sources of variation (the -atmeans- version)? Or is it more useful to look at the population-averaged effect that will be observed overall in the population (the calculation using observed values of all covariates)? [Note that in a linear model it wouldn't matter: you'd get the same result.] The answer will be different depending on the subject matter, and may even differ among users of the same data.

          For example, if Cov1 is a variable that can be changed independently of the other covariates, and if its effect on your dependent variable is causal, then the change in the dependent variable that an entity (person/firm/country/whatever these are) with average values of the other variables would experience from changing the value of Cov1 is given by the -atmeans- version. On the other hand, if we are contemplating the impact of a population-wide intervention that changes the value of Cov1 in every observed entity, then the resulting effect is given by the version using observed values of all covariates. These are clearly different things, and one may be more important than the other depending on what the question is about and who is asking it for what reason.
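
          For concreteness, here is a small sketch of the two options, again using auto.dta rather than your data; the setup lines just recreate the example from #2.

          Code:
          sysuse auto, clear
          quietly summarize price
          generate byte expensive = (price > r(mean))
          quietly logistic expensive mpg i.foreign

          margins, dydx(foreign)           // averaged over observed covariate values
          margins, dydx(foreign) atmeans   // other covariates fixed at their means
          In a linear model these two would coincide; in a logistic model they generally will not. Note also that with -atmeans- any other binary covariates are fixed at their sample proportions, which is exactly the seemingly odd non-integer value (like the 0.47 mentioned in #4) that appears in the output legend.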

          • #6
            Dear Clyde Schechter

            I am having a problem interpreting the results of average marginal effects. After -logit-, I ran the following command:
            margins, dydx(*)
            The results for my main independent variable are as follows:
            Code:
            ------------------------------------------------------------------------------
                         |            Delta-method
                         |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
              fraud_msa3 |   .0702511   .0161183     4.36   0.000     .0386598    .1018423
            ------------------------------------------------------------------------------
            The dependent variable is a fraud dummy, and the main explanatory variable is the ratio of frauds to total observations in an area.

            Can anyone help me interpret these results? I assume that a one-unit change in frauds in an area increases the probability of fraud for the firm by 7 percentage points.
            Also, how can I interpret this in terms of standard-deviation changes? That is, how much change in the dependent variable does a one-standard-deviation increase in the independent variable bring?
            Last edited by Asad Rind; 28 Mar 2019, 02:46.

            • #7
              Can anyone help me interpret these results? I assume that a one-unit change in frauds in an area increases the probability of fraud for the firm by 7 percentage points.
              This is more or less correct, but I would avoid causal language and rephrase it: a one-unit difference in the ratio of frauds to total observations in an area is associated with a 7 percentage-point difference in the probability of fraud.

              Also, how can I interpret this in terms of standard-deviation changes? That is, how much change in the dependent variable does a one-standard-deviation increase in the independent variable bring?
              There is no simple or direct way to do this. You would be best off creating a standardized variable and then re-running the regression and -margins- with that variable replacing fraud_msa3.
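
              In case it is useful, here is a minimal sketch of that approach; z_fraud is a name I made up, and you would add back the other covariates from your original model.

              Code:
              egen double z_fraud = std(fraud_msa3)   // standardized version of fraud_msa3
              logit fraud z_fraud                     // plus your other covariates
              margins, dydx(z_fraud)                  // dy/dx per 1-SD difference in fraud_msa3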

              BUT, given that nobody but you will have a clue what the standard deviation of fraud_msa3 is, and given that fraud_msa3 itself is a clearly defined variable that is easy to understand, why on earth would you want to do this? All it would do is obfuscate your results.

              • #8
                Thanks Clyde Schechter
