  • HDFE logit model

    Dear Statalist,

    I am trying to estimate an HDFE logit model with millions of individuals and millions of firms. I have read all the posts in the forum, and it seems that as of Nov 2021 there is no equivalent of the user-written -reghdfe- for logit models.
    There is, however, a command to estimate pseudo-Poisson models, -ppmlhdfe-. The authors note that "Gourieroux et al.’s results greatly extend the realm of application of Poisson regression because there is no need to specify a distributional assumption for the dependent variable and, therefore, application is no longer restricted to count data".
    Does that mean that this model (and the related command) is appropriate for binary outcomes, and potentially preferable to a linear probability model (since, if I understand correctly, we would not incur the problem of negative predicted values)? A sketch of the call I have in mind is at the end of this post.
    Alternatively, is there a way that I am not aware of to estimate an HDFE logit model?
    Thank you for your clarifications.

    References:
    Gourieroux, C., A. Monfort, and A. Trognon. 1984. Pseudo Maximum Likelihood Methods: Theory. Econometrica 52(3): 681–700.
    Correia, S., P. Guimarães, and T. Zylkin. 2019. ppmlhdfe: Fast Poisson estimation with high-dimensional fixed effects. arXiv.org.
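
    The sketch mentioned above (the variable names outcome, indepvar, firmid, and year are purely illustrative; whether the pseudo-likelihood argument really carries over to a 0/1 outcome is exactly the question):
    Code:
    * pseudo-Poisson with high-dimensional fixed effects, applied to a binary outcome
    ppmlhdfe outcome indepvar, absorb(firmid year) vce(cluster firmid)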

  • #2
    Dear Valentina Rutigliano,

    I do not think that ppmlhdfe would work in your case (e.g., it can deliver predictions above 1). Also, ppmlhdfe works because Poisson regression has rather special properties, so it is not clear that an equivalent logit command can be useful. Maybe there are ways of doing what you want to do, but you need to provide more information on your data.

    Best wishes,

    Joao

    • #3
      Hi Joao,

      Thanks for your answer.
      This is a minimal example of what the data looks like:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input float firmid str1 worker float(year outcome indepvar)
      1 "A" 2000 1  .8
      1 "B" 2000 0  .8
      1 "C" 2000 0  .8
      1 "D" 2000 0  .8
      1 "E" 2000 0  .8
      1 "A" 2001 1  .5
      1 "B" 2001 0  .5
      1 "D" 2001 0  .5
      1 "E" 2001 0  .5
      1 "F" 2001 1  .5
      2 "F" 2000 1 .33
      2 "G" 2000 0 .33
      2 "H" 2000 1 .33
      2 "G" 2001 0   1
      2 "H" 2001 1   1
      end
      As you can see, the outcome is at the worker level, while the independent variable is at the firm-year level.
      For a linear probability model I would run
      Code:
      reghdfe outcome indepvar, absorb(firmid year)
      and I am looking to run a logit model instead.

      • #4
        Dear Valentina Rutigliano,

        Can you please also give us information on the sample size along the dimensions of the fixed effects you want to include?

        Best wishes,

        Joao

        • #5
          Sure, the dataset has approximately 300 million worker-firm-year observations, so N=300,000,000. It has around 2 million unique values of firmid and T=15 years.

          • #6
            Dear Valentina Rutigliano,

            In that case I would use -xtlogit, fe- to absorb the fixed effects corresponding to the largest dimension, and use dummies for the other fixed effects (see the sketch below). This, however, may take a very long time and may not be feasible if your computer is not powerful enough.

            Best wishes,

            Joao
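
            A minimal sketch of this suggestion, assuming -firmid- is the largest dimension (about 2 million groups, per #5) and borrowing the variable names from the example in #3:
            Code:
            * declare the panel dimension whose fixed effects are to be absorbed
            xtset firmid
            * conditional fixed-effects logit; the 15 years enter as dummies
            xtlogit outcome indepvar i.year, fe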

            • #7
              As Joao suggested, -xtlogit- is a wise choice because logit is one of the few models that can accommodate individual fixed effects without suffering from the incidental parameters problem. If -xtlogit- takes too long, you may try the correlated random effects (CRE) logit model, which adds the within-group means of all time-varying covariates to a regular logit model (sketched below). My colleague spent 23 days on a few logit regressions with millions of fixed effects, and the CRE version took her only 18 hours -- still too long, but much improved.
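
              A minimal sketch of the CRE idea, again borrowing the variable names from the example in #3; the clustering choice is illustrative rather than something prescribed in this thread:
              Code:
              * within-firm means of the time-varying covariate(s)
              egen mean_indepvar = mean(indepvar), by(firmid)
              * CRE (Mundlak-style) logit: a regular logit with the group means added
              logit outcome indepvar mean_indepvar i.year, vce(cluster firmid)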

              • #8
                From this thread https://www.statalist.org/forums/for...g-fixed-ffects, -xtlogit- (or -clogit-) already fails (in Stata MP) when the number of groups is 9,000. With 2 million groups, I do not see a solution in Stata as of today. Someone may come up with an efficient algorithm for this at some point.

                • #9
                  Thank you Joao Santos Silva, Fei Wang, and Andrew Musau. I will let -xtlogit- and -clogit- run over the weekend and update this thread for future reference, even though, as Andrew says, the earlier threads make me suspect they might fail.

                  • #10
                    I have the same problem now. I'm working with pooled cross-sectional data, so xtlogit is not an option. I need to control for fixed effects such as survey year, state, and their interaction. When I tried i.states alone it was fine, but once the high-dimensional fixed effect (the state-by-year interaction) comes in, things go wrong: Stata reports that the estimation cannot be completed because there are too many dummy variables. I did try to find something like reghdfe for logit regression, but I haven't found anything.
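
                    One possible direction, sketched here only as an illustration (it is not suggested elsewhere in this thread and may still hit the scale limits noted in #8): -clogit- conditions the group effects out instead of estimating dummies, so the state-by-year interaction can be absorbed as a single group variable. The variable names below are hypothetical:
                    Code:
                    * combine state and survey year into a single fixed-effect group
                    egen stateyear = group(state year)
                    * conditional logit absorbs the state-by-year effects
                    * (groups in which the outcome does not vary are dropped)
                    clogit outcome indepvar, group(stateyear) vce(cluster state)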

                    • #11
                      I ran into the same issue with -mlogit-. Using i.group variables such as city or school fixed effects takes a very, very long time. I have found fast estimation for Poisson regression, but nothing comparable for -mlogit-.
