  • Multinomial Logistic Regression for Unbalanced Panel Data with dependent variable class imbalance

    I have an unbalanced panel of firms for 2011-19. I would like to run a logistic regression on this panel, where the classes of the dependent variable have very unequal numbers of observations.


    To make it clearer, the dependent variable has three classes: (A) good-performing firms with 2,973 observations, (B) medium-performing firms with 34,737 observations, and (C) poor-performing firms with 33,889 observations.

    The independent variables are an industry dummy for the industry the firm belongs to, the firm's debt-equity ratio, and the size and age of the firm.

    I have a set of questions about running logistic regression for panel data:
    (1) Can I run a logistic regression for panel data even when the number of observations within each class of the dependent variable is highly skewed? Will it give sensible results?
    (2) If yes, how does one decide between fixed effects and random effects for panel logistic regression?
    (3) Is weighted panel logistic regression more useful for my work? If yes, how are the weights decided?

  • #2
    See xtmlogit in Stata 17+ (https://www.stata.com/new-in-stata/p...inomial-logit/). Most estimators including xtmlogit handle unbalanced panels, so the fact that you have one is not an issue from an estimation point of view. However, it may be an issue from an inference point of view if the data are non-random missing. For the choice between fixed and random effects, use the Hausman test.

    Code:
    help hausman
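    A minimal sketch of what this could look like, assuming hypothetical variable names (perf for the three-category outcome, id and year as panel identifiers, and industry, de_ratio, log_size and age as regressors); none of these names come from the original data:

    ```stata
    * Declare the panel; an unbalanced panel needs no special handling here
    xtset id year

    * Random-effects multinomial logit (Stata 17+)
    xtmlogit perf i.industry de_ratio log_size age, re

    * Conditional fixed-effects multinomial logit; time-invariant regressors
    * such as the industry dummy are not identified in this model
    xtmlogit perf de_ratio log_size age, fe
    ```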
    Last edited by Andrew Musau; 11 Apr 2022, 04:37.


    • #3
      Hi Andrew Musau. Thank you for replying.

      I could not understand the context in which you are saying "However, it may be an issue from an inference point of view if the data are non-random missing."


      Also, the dependent variable has a severe class imbalance problem [(A) good-performing firms with 2,973 observations, (B) medium-performing firms with 34,737 observations, and (C) poor-performing firms with 33,889 observations]. Would xtmlogit give sensible, interpretable results in such a case?


      • #4
        Sorry, there are two issues that are being confounded here: the panel being unbalanced and the distribution of observations across categories. The former relates to missing values which if not randomly missing may lead to biased inference. The latter given those numbers should not be problematic - if only a few firms in the sample can be classified as "good performing", then well and good. The only thing that you can consider is whether simplification from multinomial logit to ordered logit is warranted.
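        If the three categories are treated as ordered, a random-effects ordered logit is one way to sketch that simplification. This assumes a hypothetical outcome perf coded so that higher values mean better performance, plus hypothetical identifier and regressor names:

        ```stata
        xtset id year

        * Random-effects ordered logit as a simpler alternative to xtmlogit
        xtologit perf i.industry de_ratio log_size age
        ```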


        • #5
          Hi. Thanks again!

          I was wondering about one thing: there is so much literature on penalized logistic regression for class imbalance. Why is class imbalance not a concern when conducting causal analysis? Please help me clarify this doubt.


          • #6
            You have \(\approx\) 3000/70000 = 4 percent of observations in one category which is not that small given the numbers. If you look at applications of penalized logit, the numbers are usually very small, at times a lot less than 1%. Take a look at Richard Williams's notes on some tests for combining dependent categories in multinomial logit which may suggest a move from mlogit to logit: https://www3.nd.edu/~rwilliam/stats3/Mlogit2.pdf.
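            For reference, the exact share can be checked with the counts given in #1. The variable name perf in the second command is a hypothetical placeholder for the outcome variable:

            ```stata
            * Share of category A in the full sample, using the counts from #1
            display %5.2f 100*2973/(2973 + 34737 + 33889)
            * displays 4.15

            * With the data in memory, tabulate reports the same shares
            tabulate perf
            ```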


            • #7
              Thank you so much for clarifying the doubts. I am really grateful. I would also like to ask whether you could provide a reference in the literature suggesting that penalized regression methods are useful when one of the classes has less than 1% of the observations, and that ordinary multinomial logit regression is fine even with class imbalance.


              • #8
                See, e.g., Gary King and Langche Zeng. “Logistic Regression in Rare Events Data.” Political Analysis 9 (2001): 137-163. It is not just the percentage that matters, but also the total sample size. Paul Allison discusses this in his blog at https://statisticalhorizons.com/logi...r-rare-events/.


                • #9
                  Thank you so much! It is a great help!


                  • #10
                    Hi,
                    I have some additional questions and would be grateful for help.
                    I am running a binary panel logistic regression in three ways.
                    The dependent variable is the status of the firm: status=1 if the firm performs well and status=0 if the firm is a bad performer.
                    The independent variables are firm characteristics: ownership of the firm, which rarely changes over time; age of the firm, which increases by one unit each year; size of the firm; profitability ratio; debt-equity ratio; and other variables.

                    (a) Pooling the data as a cross-section and running a binary logit regression with clustered standard errors [logit status i.ownership age log_size other_independent_variables, vce(cluster id)]
                    How do we interpret the beta coefficients?

                    (b) Binary fixed effects panel logit regression [xtlogit status i.ownership age log_size other_independent_variables, fe]
                    How do we interpret the beta coefficients?

                    (c) Binary random effects panel logit regression [xtlogit status i.ownership age log_size other_independent_variables, re]
                    How do we interpret the beta coefficients?


                    (i) The objective is to decide which of (a), (b) and (c) is best suited to my analysis. I would be grateful for help with this, as I have not been able to find any resource that addresses this choice for panel logit.
                    If possible, please point me to such a resource.


                    • #11
                      Your questions are too broad and can easily be resolved if you do some reading. Here are some pointers and references.

                      1. If you estimate a random effects model, at the foot of the table are results of a likelihood-ratio test which compares the pooled estimator (logit) with the panel estimator (xtlogit). This will tell you whether you can justify estimating a pooled model.

                      2. Random effects relies on very strong assumptions, i.e., no correlation between the individual effects and your right-hand side variables. Use a Hausman test to justify estimating a random effects model.

                      3. You are always safe with conditional fixed effects and do not need any justification in estimating the model if you have panel data.

                      4. As far as the output goes, use margins to obtain average marginal effects (AMEs) which are easier to interpret. If I have to interpret the logit coefficients, I always use the terms "positive association", "negative association" or "no association".

                      5. Some helpful references on the above are the manual entry of xtlogit (see references therein) and Richard Williams's paper on margins.

                      https://www.stata.com/manuals13/xtxtlogit.pdf
                      https://journals.sagepub.com/doi/pdf...867X1201200209
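                      A rough sketch of points 1 and 4 for the three models in #10, assuming hypothetical panel identifiers id and year and omitting the other regressors for brevity. Note that the marginal effects after the fixed-effects model are computed with the fixed effect set to zero, which is an assumption of this sketch rather than a recommendation:

                      ```stata
                      xtset id year

                      * (a) Pooled logit with firm-clustered standard errors
                      logit status i.ownership age log_size, vce(cluster id)
                      margins, dydx(*)

                      * (c) Random-effects logit; the foot of the table reports
                      * the LR test of rho = 0 (pooled versus panel estimator)
                      xtlogit status i.ownership age log_size, re
                      margins, dydx(*)

                      * (b) Conditional fixed-effects logit; AMEs assume u_i = 0
                      xtlogit status i.ownership age log_size, fe
                      margins, dydx(*) predict(pu0)
                      ```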
                      Last edited by Andrew Musau; 01 May 2022, 07:31.


                      • #12
                        Thank you, Prof. Andrew Musau.
                        I just wanted to ask a few more clarifying questions. I would be grateful if you could help me with them.

                        (a) Is the Hausman test, which is used for panel regression, also valid for logistic panel regression?
                        (b) Is serial correlation in the error term accounted for when we run xtlogit, re and xtlogit, fe?
                        If not, is there any way to check for it for xtlogit?
                        And, if I find serial correlation, will the coefficients from xtlogit, re and xtlogit, fe still be valid?


                        • #13
                          Originally posted by Jessica Thacker
                          Thank you, Prof. Andrew Musau.
                          (a) Is the Hausman test, which is used for panel regression, also valid for logistic panel regression?
                          Yes. See the following linked example: https://www.stata.com/statalist/arch.../msg00669.html

                          (b) Is serial correlation in the error term accounted for when we run xtlogit, re and xtlogit, fe?
                          If not, is there any way to check for it for xtlogit?
                          And, if I find serial correlation, will the coefficients from xtlogit, re and xtlogit, fe still be valid?
                          Conditional logit is inconsistent in the presence of serial correlation and heteroskedasticity, see the reference provided by Jeff Wooldridge below:
                          https://www.statalist.org/forums/for...tandard-errors

                          As suggested in the link, if your time dimension is large, you can estimate an unconditional fixed effects logit model and cluster the standard errors. You can also cluster with random effects logit but, as I state in #11, it relies on very strong assumptions.
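                          A minimal sketch of both options, assuming a hypothetical firm identifier id and the variable names from #10. The firm dummies absorb any regressor that does not vary within a firm, and with many firms the first command can be slow:

                          ```stata
                          * Unconditional fixed-effects logit: firm dummies entered
                          * directly, standard errors clustered by firm
                          logit status i.ownership age log_size i.id, vce(cluster id)

                          * Random-effects logit with firm-clustered standard errors
                          xtset id year
                          xtlogit status i.ownership age log_size, re vce(cluster id)
                          ```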


                          • #14
                            Dear Prof Andrew Musau & Carlo Lazzaro,

                            Thank you so much for helping me.


                            On (a), I did find the Statalist thread on using the Hausman test for panel logistic regression, but I have not been able to find a journal article advocating it that I could cite.
                            I ran the following commands

                            xtlogit status i.ownership age log_size, re
                            estimates store re
                            xtlogit status i.ownership age log_size, fe
                            estimates store fe

                            I then ran the Hausman test and got two different results for the same analysis. I get the following results. Please let me know where I am going wrong. I am afraid that these results are meaningless, as I suspect serial correlation of the errors in my firm panel.







                            Hence, I used
                            xtlogit status i.ownership age log_size, re vce(robust)
                            estimates store rree
                            xtlogit status i.ownership age log_size, fe vce(boot)
                            estimates store ffee
                            But running the Hausman test after this gives an error, which Carlo had already warned me about.

                            I am perplexed as to how to work with this data now. I believe that serial correlation will be present in my panel of firms and that the unobserved firm effects will also be correlated with the explanatory variables. Is there a way to address this issue or check for it?


                            Warm regards,
                            Jessica
                            Attached Files
                            Last edited by Jessica Thacker; 03 May 2022, 03:52.


                            • #15
                              The Hausman test is well known and has been used extensively, so you do not need a reference. But here is Jerry Hausman's Econometrica article describing the test: https://www.jstor.org/stable/1913827?origin=crossref. The order needs to be

                              Code:
                              hausman fe re
                              as fixed effects is consistent and random effects is efficient in the case that it is consistent. You do not have to test for serial correlation, but you can operate under the assumption that it is present (given the nature of your data).
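                              Put together with the commands from #14, the full sequence could be sketched as follows. It uses the default standard errors, since #14 reports that hausman fails after vce(robust) and vce(boot):

                              ```stata
                              * Consistent estimator first: conditional fixed effects
                              xtlogit status i.ownership age log_size, fe
                              estimates store fe

                              * Efficient estimator under the null: random effects
                              xtlogit status i.ownership age log_size, re
                              estimates store re

                              * Consistent estimator is listed first, efficient second
                              hausman fe re
                              ```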
