  • Using a count model to estimate a binary response model

    Hello, statalisters.
    I apologize in advance for this question because it might be too naive.
    I would like to know whether, statistically speaking, I can estimate a binary response model (usually estimated via probit or logit) using a count model instead.
    I ask because I'm struggling with the running time of -xtlogit- with FE, and I was wondering if I can use -ppmlhdfe- instead, which is much faster.
    I know the nature of the dependent variable is different, but perhaps it is an "acceptable" trick to deal with the computation time.
    For context, I have a binary response model with geographic (over 1,000 places) and time (10 years) fixed effects, plus other controls.
    Thank you for your advice.

  • #2
    https://stats.stackexchange.com/questions/18595/poisson-regression-to-estimate-relative-risk-for-binary-outcomes



    • #3
      Originally posted by George Ford View Post
      https://stats.stackexchange.com/questions/18595/poisson-regression-to-estimate-relative-risk-for-binary-outcomes
      Thank you George Ford. I had not seen that post before. I appreciate it.



      • #4
        My view is that yes, it is appropriate to use Poisson regression on binary outcomes. I think so because a binary outcome is a special case of a count outcome, where the count can only be 0 or 1.

        We have experts on this class of models here ( Jeff Wooldridge and Joao Santos Silva ), and like the OP, I would be interested to hear the experts' views on whether the Poisson model is appropriate for binary outcomes.
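        As a minimal sketch of the idea, using Stata's bundled auto data with foreign recast as a 0/1 outcome purely for illustration (none of these variables come from the thread), one can fit a Poisson regression with robust standard errors to a binary outcome and compare it with a logit:

        ```stata
        * Hypothetical illustration: Poisson on a 0/1 outcome.
        * With vce(robust), the coefficients are interpretable as
        * log relative risks; robust SEs guard against the
        * misspecified Poisson variance.
        sysuse auto, clear
        generate y = foreign              // 0/1 outcome, for illustration only
        poisson y mpg weight, vce(robust)
        estimates store pois
        logit y mpg weight
        estimates store lgt
        estimates table pois lgt, b se
        ```

        The two sets of coefficients will differ (log relative risks vs. log odds ratios), but when the outcome probabilities are small the fitted probabilities should be close.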



        • #5
          Like Joro, I would be interested to hear what Jeff thinks about this but, since I was tagged, here are my two cents.

          Before addressing the question, I would say that estimating a FE logit with 10 periods is always going to take a long time, so my first suggestion is that you simply let it run overnight and see if you get results. There is also another potential problem with Stata's implementation of this estimator: as far as I know (but I may be wrong!), Stata does not check for the existence of "separation" or perfect predictors in this estimator, and if this is an issue in your sample the estimator may never converge. Maybe you can try to estimate the model using just pairs of years in your sample (this should be quick) and see if those estimations converge; if they do, you can then use the estimates as starting values for the estimation with the full sample; hopefully that will speed things up.

          Moving now to the ppmlhdfe command: at the very least you can use it to investigate the separation problem above, because it will give you information about separated observations, and it can provide starting values.

          Whether the PPML estimates can be used for more than that will depend on your data. The obvious problem of using PPML in this context is that it assumes an exponential conditional expectation that will not be valid for binary data. However, if the probabilities you are estimating are all sufficiently close to zero, the exponential function will be a good approximation to the logistic and therefore the use of PPML can be justified on those grounds. So, perhaps you can start by computing the average of the dependent variable for each unit and see what that looks like. If you have a decent proportion of units with averages around 0.5 or above, I would not advise using PPML in your case.
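          The suggested diagnostic can be run in a few lines; a sketch with hypothetical variable names (a unit identifier `id` and a 0/1 outcome `y`, neither taken from the thread):

          ```stata
          * Hypothetical check: average of the 0/1 outcome by unit.
          * If most unit means are well below 0.5, the exponential
          * approximation to the logistic is more defensible.
          bysort id: egen unit_mean = mean(y)
          summarize unit_mean, detail
          count if unit_mean >= 0.5 & !missing(unit_mean)
          ```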

          Best wishes,

          Joao



          • #6
            Originally posted by Joao Santos Silva View Post
            Thank you, Dr. Santos Silva and Dr. Kolev. I ran -xtlogit- and it converged during the night. I am also checking the means for each unit, and it seems most of them are below 0.1.
            Regarding your other recommendation, is it possible to set initial values in xtlogit? Thank you.



            • #7
              I believe you can use the option from() to set the starting values; please check the maximization options.
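              A hedged sketch of that workflow, with hypothetical variable and year names (see [R] maximize for the from() syntax):

              ```stata
              * Hypothetical: fit on a two-year subsample, then reuse the
              * coefficient vector as starting values for the full panel.
              xtlogit y x1 x2 if inlist(year, 2010, 2011), fe
              matrix b0 = e(b)
              * skip lets Stata match coefficients by name and ignore
              * any that do not appear in the full model.
              xtlogit y x1 x2, fe from(b0, skip)
              ```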



              • #8
                I face very similar challenges. I am trying to estimate a gravity-style event-study regression with a binary dependent variable and a binary treatment variable. I have a panel covering trade in a certain product code between 324 regions over 10 years. The dependent variable, y_ijt, takes the value 1 if region i exports the product to region j at time t and 0 otherwise. The treatment variable is a policy change that affects a subset of the region pairs at t=4 (all in the same year). I want to see whether the policy change affected the probability that regions trade that product code. I have tried to run the following code:

                ppmlhdfe y c.treated#c.t_1 c.treated#c.t_2 …. c.treated#c.t_10, absorb(i.exporter_t i.importer_t i.exporter_importer) vce(cluster exporter_importer)

                (treated takes the value 1 if the exporter-importer pair was affected by the policy, t_n is a dummy variable that takes the value 1 if time equals n; I omit c.treated#c.t_4 to avoid collinearity)

                My concerns are the following:
                1. While I obtain output, I get the following message: “ReLU separation check: maximum number of iterations reached; aborting”. I get similar estimates when running the same specification with ppml_panel_sg, without any error messages.

                2. I worry that the probabilities I estimate are not close enough to zero for the exponential function to be a good approximation to the logistic function. The average probability of observing trade in this product between two regions is 0.25.

                3. I have tried to estimate the model using xtlogit, but the estimator does not converge. I have also tried using only pairs of years, without convergence either.

                I wonder if perhaps Joao Santos Silva or Sergio Correia may have some useful answers. I would be very thankful for any input.



                • #9
                  Dear Bengt Soderlund,

                  It is a bit early for me, so my brain may not be fully caffeinated, but I am afraid I do not see a good solution for this. From the description of your data, it looks as if a binary model would be preferable, but I do not think a logit would be consistent with the three sets of fixed effects. So, you are left with the PPML results, which will be somewhat unreliable because the exponential functional form may not be suitable. As for the message you are getting about the ReLU method, I think it is safe to ignore it, but you can try different flavours of the separation() option.

                  An alternative would be to model the volume of trade and compare results for the full sample and for the sub-sample where trade was positive before the policy change. This would allow you to use PPML and may allow you to see if an increase in trade volumes can be fully explained by the extensive margin.
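                  A hedged sketch of that comparison, with hypothetical variable names (`volume` for trade values, `pair` for the exporter-importer identifier, `t` for time; none are from the thread):

                  ```stata
                  * Hypothetical: PPML on trade volumes, full sample vs. the
                  * pairs that already traded before the policy change at t = 4.
                  bysort pair: egen traded_pre = max(volume > 0 & t < 4)
                  ppmlhdfe volume c.treated#i.t, absorb(exporter_t importer_t pair) vce(cluster pair)
                  ppmlhdfe volume c.treated#i.t if traded_pre, absorb(exporter_t importer_t pair) vce(cluster pair)
                  ```

                  If the full-sample effect is much larger than the effect among pre-policy traders, the difference is suggestive of an extensive-margin response.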

                  Best wishes,

                  Joao



                  • #10
                    Dear Joao Santos Silva ,

                    Thank you so much for taking the time to write this very valuable response. I did not know much about the various ways to handle separation. Also, comparing results between always-traders and the full sample to check the validity of PPML is a great suggestion.

                    Best,
                    Bengt
