  • Multivariate regression with Categorical variables

    Hi awesome people,

    I am working on a multivariate regression with a categorical (4 classes) dependent variable and 52 independent variables, using cross-sectional data (the predictors are both numerical and binary, 0/1).
    Below is the OLS of the 52 variables


    [Image: Capture.PNG]

    I want to test for multicollinearity, heteroscedasticity and autocorrelation. I am planning to use the Breusch-Pagan and White tests for heteroscedasticity, but I am not sure these tests are appropriate for a categorical dependent variable. How should I proceed with these tests given my model?

    Any suggestions on how to proceed will make my day, since this is occupying my nights.



  • #2
    "Siege Taker" could be a standard name in your country but I am guessing not. Please note our long-standing explicit requests for use of full real names.

    https://www.statalist.org/forums/help#realnames

    https://www.statalist.org/forums/help#adviceextras #3


    From your output I guess that you applied regress -- in which case "multivariate regression" is not a good term -- see help mvreg -- and regression is fine.

    It's hard to say too much about your question -- given absolutely no details about any variable -- except that

    1. If your categorical variable is nominal -- in an over-worked but occasionally helpful terminology -- then your regression is meaningless, as different codes for the outcome would yield different results.

    2. If your categorical variable is ordinal -- so that e.g. codes 1, 2, 3, 4 represent unambiguously a monotonic sequence on some scale -- then "wrong in principle but just possibly may be defensible in practice" is my average across what I have read in several lively if not angry debates.
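    To make point 1 concrete, here is a minimal sketch in Python (not Stata, and with made-up data): the same four unordered categories are coded in two different ways, and each coded outcome is regressed on a single predictor. The function `ols_slope`, the data, and both codings are hypothetical and serve only as an illustration.

```python
def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical predictor and a nominal outcome with four unordered classes.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
categories = ["A", "A", "B", "B", "C", "C", "D", "D"]

# Two equally arbitrary numeric codings of the same classes.
coding_1 = {"A": 1, "B": 2, "C": 3, "D": 4}
coding_2 = {"A": 3, "B": 1, "C": 4, "D": 2}

slope_1 = ols_slope(x, [coding_1[c] for c in categories])
slope_2 = ols_slope(x, [coding_2[c] for c in categories])

print("slope under coding 1:", slope_1)
print("slope under coding 2:", slope_2)
```

    The two slopes differ although the underlying categories are identical, which is why a linear regression on arbitrary codes of a nominal outcome cannot be interpreted.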

    52 predictors! I am old enough, and old-fashioned enough, to think that simplicity is a hallmark of a good model, but principles and practices vary across fields.




    • #3
      As usual, Nick gave outstanding advice.
      It is difficult to envisage a situation in which you can safely use more than 50 regressors and obtain a model that makes sense. Even in the hypothetical scenario that your approach were correct, I cannot envisage an effective way to disseminate more than 50 coefficients.
      That said, as far as post-estimation checks are concerned, you do not seem to worry about model misspecification and/or overfitting, which, in my opinion, should be at the top of your checklist.
      Kind regards,
      Carlo
      (Stata 19.0)



      • #4
        Thank you for your valuable comments
        Firstly, I picked "Siege Taker" out of my 3 names Siege Stephen Taker.

        I am new to building models, but I definitely know that I can't build one with 52 variables. My plan is to first test which variables are significant and drop the ones that are not.
        I have not been able to do this yet because I noticed I am getting different coefficients for the independent variables after each "regress". Below is the last regress of distress (dependent variable) on the 52 variables.

        [Image: Ind_Var.PNG]

        And below are the categories of the dependent variable. The categories ought to be ordinal, since normally the financial health of a firm should progress as: NST(0), ST(1), SST(2), Delisted (3) but it is very possible for a firm to go from 0 to 2 and vice versa.

        [Image: dep_var.PNG]

        My questions:
        1. Do you think I can rely on the above regression and pick the 12 or so most significant variables to start with (I can cut further to 6-7 later)? Or is there a more efficient command to test the significance of each variable against DISTRESS?
        2. Is it valid to say the categories of DISTRESS are ordinal, although firms can move between non-adjacent categories?
        Thank you



        • #5
          Siege:
          thanks for clarifying.
          Are your data cross-sectional or panel?
          That said, your aim should be to give a fair and true view of the data generating process: check the literature of your research field in this respect. Currently, your regression model includes too many predictors.
          I would be more parsimonious.
          Finally, I read -DISTRESS- as ordinal.
          Kind regards,
          Carlo
          (Stata 19.0)



          • #6
            Thanks Carlo,
            It is panel data for 15 years.



            • #7
              Siege:
              see -xtlogit- then.
              In your future posts, please be clearer in describing your data.
              Stating in #1 that you have panel data would have saved everybody's time.
              Kind regards,
              Carlo
              (Stata 19.0)



              • #8
                I think Carlo Lazzaro means xtologit.



                • #9
                  Nick is correct.
                  Sorry for the misspelling
                  Kind regards,
                  Carlo
                  (Stata 19.0)



                  • #10
                    Many thanks @Carlo @Nick,
                    This is my first post on the forum and I am pretty new to Stata. Sorry, it is panel data; that slipped my mind.

                    At the moment I have panel data covering 12 years and 1300 firms, with a categorical dependent variable and 52 independent variables that I am working to cut down. My aim is to predict the dependent variable.
                    I have on my to-do list:
                    * Dropping independent variables that are not significant, down to at most 8 variables
                    * GMM (including heteroscedasticity, autocorrelation, and multicollinearity)
                    * xtologit (suggested by Nick)
                    * mvreg
                    I am stuck on how to proceed given this bunch of tests.
                    Any ideas will be much appreciated.



                    • #11
                      Siege:
                      I would follow these steps:
                      1) select, on the grounds of the literature in your research field and/or the qualified opinion of more skilled colleagues, supervisors or teachers, a set of predictors that gives a fair and true view of the data generating process. With 1300 observations, I would say that they should not exceed 15;
                      2) go -xtologit- with -cluster()- standard errors if you detect heteroskedasticity and/or autocorrelation.
                      3) Forget multicollinearity: often, it is not an issue.
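                      For reference, a sketch of the model behind step 2, assuming the standard random-effects ordered logit with a normally distributed firm-level effect. The notation is an assumption for illustration and does not come from the thread: $y_{it}$ is the outcome code of firm $i$ in year $t$, $x_{it}$ the predictors, $u_i$ the firm effect, and $\kappa_0 < \kappa_1 < \kappa_2$ the three cutpoints that separate the four outcome codes 0 to 3 described in #4.

```latex
\Pr(y_{it} \le k \mid x_{it}, u_i)
  = \frac{1}{1 + \exp\{-(\kappa_k - x_{it}\beta - u_i)\}},
  \qquad k = 0, 1, 2,
  \qquad u_i \sim N(0, \sigma_u^2)
```

                      A single coefficient vector $\beta$ is shared across all cutpoints, which is the proportional-odds restriction that makes the ordinal reading of the outcome matter.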
                      Kind regards,
                      Carlo
                      (Stata 19.0)



                      • #12
                        Thanks, Carlo,
                        I will progress with that and see how it goes



                        • #13
                          Hi there,

                          I just realized I have unbalanced panel data, since some firms were only listed partway through the 12-year research period. Does anyone have experience with unbalanced panel data, especially on whether it is going to be a problem for a logistic regression model?
                          I have the options of:
                          1. Balancing the panel data by dropping firms that don't have 12 years of data; however, this will cut my sample from circa 1350 firms to 900.
                          2. Keeping and working with the unbalanced panel data



                          • #14
                            Siege:
                            option #2 is the way to go. Otherwise (#1), you would end up with a made-up subsample of your original dataset.
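                            As a rough illustration of the trade-off (a Python sketch with a made-up toy panel, not your data), counting the years available per firm shows what forced balancing would discard. The firm labels and years below are hypothetical.

```python
from collections import Counter

# Hypothetical toy panel of (firm_id, year) pairs:
# firms A and B are observed in every year, C was listed late, D has one year.
panel = [(firm, year) for firm in ("A", "B") for year in range(2001, 2005)]
panel += [("C", 2003), ("C", 2004), ("D", 2004)]

# Number of distinct periods in the sample and years observed per firm.
n_periods = len({year for _, year in panel})
years_per_firm = Counter(firm for firm, _ in panel)

# Balancing keeps only firms observed in every period.
balanced_firms = {firm for firm, n in years_per_firm.items() if n == n_periods}
balanced_panel = [(firm, year) for firm, year in panel if firm in balanced_firms]

print("unbalanced:", len(years_per_firm), "firms,", len(panel), "firm-years")
print("balanced:  ", len(balanced_firms), "firms,", len(balanced_panel), "firm-years")
```

                            In this toy case balancing drops every late-listed firm, so the retained subsample is no longer representative of the firms you set out to study.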
                            Kind regards,
                            Carlo
                            (Stata 19.0)



                            • #15
                              @carlo,

                              Right, I just feel somewhat uncomfortable with unbalanced panel data. Are there any safeguards I can put in place to ensure the missing data do not affect my model? For instance, over the 12-year sample period, some of those firms have as little as one year of data.

                              I guess it's not right to use data imputation, since these companies were not listed in those years.
