  • HDFE logit model

    Dear Statalist,

    I am trying to estimate an HDFE logit model with millions of individuals and millions of firms. I have read all the posts in the forum, and it seems that as of Nov 2021 there is no equivalent of the user-written -reghdfe- for logit models.
    There is, however, a command to estimate pseudo-Poisson models, -ppmlhdfe-. The authors note that "Gourieroux et al.’s results greatly extend the realm of application of Poisson regression because there is no need to specify a distributional assumption for the dependent variable and, therefore, application is no longer restricted to count data".
    Does that mean that this model (and the related command) is appropriate for binary outcomes, and potentially preferable to a linear probability model (since, if I understand correctly, we would not incur the problem of negative predicted values)? A sketch of the call I have in mind is at the end of this post.
    Alternatively, is there a way that I am not aware of to estimate an HDFE logit model?
    Thank you for your clarifications.

    References:
    Gourieroux, C., A. Monfort, and A. Trognon. 1984. Pseudo Maximum Likelihood Methods: Theory. Econometrica 52(3): 681–700.
    Correia, S., P. Guimarães, and T. Zylkin. 2019. ppmlhdfe: Fast Poisson estimation with high-dimensional fixed effects. arXiv.org.
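
    The sketch mentioned above (the variable names outcome, indepvar, firmid, and year are purely illustrative; whether the pseudo-likelihood argument really carries over to a 0/1 outcome is exactly the question):
    Code:
    * pseudo-Poisson with high-dimensional fixed effects, applied to a binary outcome
    ppmlhdfe outcome indepvar, absorb(firmid year) vce(cluster firmid)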

  • #2
    Dear Valentina Rutigliano,

    I do not think that ppmlhdfe would work in your case (e.g., it can deliver predictions above 1). Also, ppmlhdfe works because Poisson regression has rather special properties, so it is not clear that an equivalent logit command can be useful. Maybe there are ways of doing what you want to do, but you need to provide more information on your data.

    Best wishes,

    Joao

    • #3
      Hi Joao,

      Thanks for your answer.
      This is a minimal example of what the data looks like:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input float firmid str1 worker float(year outcome indepvar)
      1 "A" 2000 1  .8
      1 "B" 2000 0  .8
      1 "C" 2000 0  .8
      1 "D" 2000 0  .8
      1 "E" 2000 0  .8
      1 "A" 2001 1  .5
      1 "B" 2001 0  .5
      1 "D" 2001 0  .5
      1 "E" 2001 0  .5
      1 "F" 2001 1  .5
      2 "F" 2000 1 .33
      2 "G" 2000 0 .33
      2 "H" 2000 1 .33
      2 "G" 2001 0   1
      2 "H" 2001 1   1
      end
      As you can see, the outcome is at the worker level, while the independent variable is at the firm-year level.
      For a linear probability model I would run
      Code:
      reghdfe outcome indepvar, absorb(firmid year)
      and I am looking to run a logit model instead.

      • #4
        Dear Valentina Rutigliano,

        Can you please also give us information on the sample size along the dimensions of the fixed effects you want to include?

        Best wishes,

        Joao

        • #5
          Sure, the dataset has approximately 300 million worker-firm-year observations, so N=300,000,000. It has around 2 million unique values of firmid and T=15 years.

          • #6
            Dear Valentina Rutigliano,

            In that case I would use -xtlogit, fe- to absorb the fixed effects corresponding to the largest dimension, and use dummies for the other fixed effects (see the sketch below). This, however, may take a very long time and may not be feasible if your computer is not powerful enough.

            Best wishes,

            Joao
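
            A minimal sketch of this suggestion, assuming -firmid- is the largest dimension (about 2 million groups, per #5) and borrowing the variable names from the example in #3:
            Code:
            * declare the panel dimension whose fixed effects are to be absorbed
            xtset firmid
            * conditional fixed-effects logit; the 15 years enter as dummies
            xtlogit outcome indepvar i.year, fe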

            • #7
              As Joao suggested, -xtlogit- is a wise choice because logit is one of the few models that can accommodate individual fixed effects without suffering from the incidental parameters problem. If -xtlogit- takes too long, you may try the correlated random effects (CRE) logit model, which adds the within-group means of all time-varying covariates to a regular logit model (sketched below). My colleague spent 23 days on a few logit regressions with millions of fixed effects, and the CRE version took her only 18 hours -- still too long, but much improved.
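
              A minimal sketch of the CRE idea, again borrowing the variable names from the example in #3; the clustering choice is illustrative rather than something prescribed in this thread:
              Code:
              * within-firm means of the time-varying covariate(s)
              egen mean_indepvar = mean(indepvar), by(firmid)
              * CRE (Mundlak-style) logit: a regular logit with the group means added
              logit outcome indepvar mean_indepvar i.year, vce(cluster firmid)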

              • #8
                From this thread https://www.statalist.org/forums/for...g-fixed-ffects, -xtlogit- (or -clogit-) already fails (in Stata MP) when the number of groups is 9,000. With 2 million groups, I do not see a solution in Stata as of today. Someone may come up with an efficient algorithm for this at some point.

                • #9
                  Thank you Joao Santos Silva, Fei Wang, and Andrew Musau. I will let -xtlogit- and -clogit- run over the weekend and update this thread for future reference, even though, as Andrew says, the earlier threads make me suspect they might fail.

                  • #10
                    I have the same problem now. I'm working with pooled cross-sectional data, so xtlogit is not an option. I need to control for fixed effects such as survey year, state, and their interaction. When I tried i.states alone it was fine, but once the high-dimensional fixed effect (the state-by-year interaction) comes in, things go wrong: Stata reports that the estimation cannot be completed because there are too many dummy variables. I did try to find something like reghdfe for logit regression, but I haven't found anything.
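
                    One possible direction, sketched here only as an illustration (it is not suggested elsewhere in this thread and may still hit the scale limits noted in #8): -clogit- conditions the group effects out instead of estimating dummies, so the state-by-year interaction can be absorbed as a single group variable. The variable names below are hypothetical:
                    Code:
                    * combine state and survey year into a single fixed-effect group
                    egen stateyear = group(state year)
                    * conditional logit absorbs the state-by-year effects
                    * (groups in which the outcome does not vary are dropped)
                    clogit outcome indepvar, group(stateyear) vce(cluster state)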

                    • #11
                      I ran into the same issue with -mlogit-. Using i.group variables such as city or school fixed effects takes a very, very long time. I have found fast estimation for Poisson regression, but nothing comparable for -mlogit-.
