Multivariate regression with Categorical variables

Siege Taker

Join Date: Apr 2020

Posts: 22
#1

26 Apr 2020, 02:04

Hi awesome people,

I am working on a multivariate regression with a categorical (4 classes) dependent variable and 52 independent variables, using cross-sectional data (the predictors are both numerical and binary, 0/1).
Below is the OLS output for the 52 variables.

I want to test for multicollinearity, heteroscedasticity and autocorrelation. I plan to use the Breusch-Pagan and White tests for heteroscedasticity, but I am not sure these tests are appropriate for a categorical dependent variable. How should I proceed with these tests given my model?

Any suggestions on how to proceed will make my day, since this is occupying my nights.
Nick Cox

Join Date: Mar 2014

Posts: 35438
#2

26 Apr 2020, 02:38

"Siege Taker" could be a standard name in your country but I am guessing not. Please note our long-standing explicit requests for use of full real names.

https://www.statalist.org/forums/help#realnames

https://www.statalist.org/forums/help#adviceextras #3

From your output I guess that you applied regress -- in which case "multivariate regression" is not a good term -- see help mvreg -- and "regression" is fine.
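
To make the terminology concrete, here is a minimal sketch using the auto dataset shipped with Stata. The choice of variables is purely illustrative and has nothing to do with the data in this thread.

```stata
* Illustration only: variables from the shipped auto dataset
sysuse auto, clear

* One outcome, several predictors: (multiple) regression
regress price weight length

* Several outcomes modelled jointly: multivariate regression
mvreg price mpg = weight length
```

The second command has two outcomes on the left of the equals sign, which is the sense in which mvreg is "multivariate".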

It's hard to say too much about your question -- given absolutely no details about any variable -- except that

1. If your categorical variable is nominal -- in an over-worked but occasionally helpful terminology -- then your regression is meaningless, as different codes for the outcome would yield different results.

2. If your categorical variable is ordinal -- so that e.g. codes 1, 2, 3, 4 represent unambiguously a monotonic sequence on some scale -- then "wrong in principle but just possibly may be defensible in practice" is my average across what I have read in several lively if not angry debates.
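
Point 1 can be checked with a small simulation. This is only a sketch: the data are simulated, and the names x, latent, y and y_recoded are made up for the illustration. The same four categories are relabelled arbitrarily, and regress is run on each coding.

```stata
* Simulated data; all variable names are hypothetical
clear
set seed 12345
set obs 500
generate x = rnormal()
generate latent = x + rnormal()

* Four-class outcome coded 0, 1, 2, 3
generate byte y = (latent > -1) + (latent > 0) + (latent > 1)
regress y x

* Same categories, arbitrary relabelling of the codes
recode y (0=2) (1=0) (2=3) (3=1), generate(y_recoded)
regress y_recoded x
```

The two sets of coefficients differ although the categories are the same, which is why the linear regression depends on the codes chosen when the outcome is nominal.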

52 predictors! I am old enough, and old-fashioned enough, to think that simplicity is a hallmark of a good model, but principles and practices vary across fields.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17675
#3

26 Apr 2020, 05:14

As usual, Nick gave outstanding advice.
It is difficult to envisage a situation in which you can safely use more than 50 regressors and obtain a model that makes sense and, even in the hypothetical scenario that your approach were correct, I cannot envisage an effective way to disseminate more than 50 coefficients.
That said, as far as post-estimation checks are concerned, you do not seem to worry about model misspecification and/or overfitting, which, in my opinion, should be at the top of your checklist.

Kind regards,
Carlo
(Stata 19.0)
Siege Taker

Join Date: Apr 2020

Posts: 22
#4

26 Apr 2020, 15:03

Thank you for your valuable comments.
Firstly, I picked "Siege Taker" out of my 3 names, Siege Stephen Taker.

I am new to building models, but I definitely know that I can't build one with 52 variables. My plan is to first test which variables are significant and drop the ones that are not.
I have not been able to do this yet because I noticed I am getting different coefficients for the independent variables after each "regress". Below is the last regress of distress (the dependent variable) on the 52 variables.

Below are the categories of the dependent variable. The categories ought to be ordinal, since normally the financial health of a firm should progress as NST (0), ST (1), SST (2), Delisted (3), but it is quite possible for a firm to go from 0 to 2 and vice versa.

My questions:
1. Do you think I can rely on the above regression and pick about the 12 most significant variables to run with initially? (I can cut it down further to 6-7 later.) Or is there a more efficient command to test the significance of each variable against DISTRESS?
2. Is it valid to say the categories of DISTRESS are ordinal (although firms can jump between non-adjacent categories)?
Thank you
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17675
#5

26 Apr 2020, 15:49

Siege:
thanks for clarifying.
Are your data cross-sectional or panel?
That said, your aim should be to give a fair and true view of the data generating process: check the literature of your research field in this respect. Currently, your regression model includes too many predictors.
I would be more parsimonious.
Finally, I read -DISTRESS- as ordinal.

Kind regards,
Carlo
(Stata 19.0)
Siege Taker

Join Date: Apr 2020

Posts: 22
#6

26 Apr 2020, 19:14

Thanks Carlo,
It is panel data for 15 years.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17675
#7

26 Apr 2020, 23:19

Siege:
see -xtlogit- then.
In your future posts, please be clearer in describing your data.
Stating in #1 that you have panel data would have saved everybody's time.

Kind regards,
Carlo
(Stata 19.0)
Nick Cox

Join Date: Mar 2014

Posts: 35438
#8

27 Apr 2020, 00:32

I think Carlo Lazzaro means xtologit.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17675
#9

27 Apr 2020, 00:38

Nick is correct.
Sorry for the misspelling

Kind regards,
Carlo
(Stata 19.0)
Siege Taker

Join Date: Apr 2020

Posts: 22
#10

01 May 2020, 00:51

Many thanks @Carlo @Nick,
This is my first post on the forum and I am pretty new to Stata. Sorry, it is panel data; that slipped my mind.

At the moment I have panel data covering 12 years and 1300 firms, with a categorical dependent variable and 52 independent variables that I am working to cut down. My aim is to predict the dependent variable.
On my to-do list I have:
* Dropping independent variables that are not significant, down to at most 8 variables
* GMM (including heteroscedasticity, autocorrelation, and multicollinearity)
* xtologit (suggested by Nick)
* mvreg
I am stuck on how to proceed given this bunch of tests.
Any ideas will be much appreciated.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17675
#11

01 May 2020, 04:38

Siege:
I would follow these steps:
1) select, on the grounds of the literature in your research field and/or the qualified opinion of more skilled colleagues/supervisors/teachers, a set of predictors that gives a fair and true view of the data generating process. With 1300 observations, I would say that they should not exceed 15;
2) go -xtologit- with -cluster()- standard errors if you detect heteroskedasticity and/or autocorrelation.
3) Forget multicollinearity: often, it is not an issue.
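
A minimal sketch of step 2 follows. The panel is simulated so that the code runs on its own, and firm, year, distress, x1 and x2 are hypothetical names standing in for the real variables. Here -cluster()- is written as vce(cluster firm), clustering on the panel identifier.

```stata
* Simulated panel; all names and numbers are hypothetical
clear
set seed 2020
set obs 200
generate firm = _n
* Firm-level random effect, constant within firm after expand
generate u = rnormal()
expand 12
bysort firm: generate year = 2007 + _n
generate x1 = rnormal()
generate x2 = rnormal()

* Ordered outcome built from a latent index with logistic noise
generate latent = 0.8*x1 - 0.5*x2 + u + logit(runiform())
generate byte distress = (latent > 0) + (latent > 1.5) + (latent > 3)

* Random-effects ordered logit with cluster-robust standard errors
xtset firm year
xtologit distress x1 x2, vce(cluster firm)
```

With real data, only the xtset and xtologit lines would be needed, with the actual variable names substituted.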

Kind regards,
Carlo
(Stata 19.0)
Siege Taker

Join Date: Apr 2020

Posts: 22
#12

01 May 2020, 15:06

Thanks, Carlo,
I will progress with that and see how it goes
Siege Taker

Join Date: Apr 2020

Posts: 22
#13

08 May 2020, 22:55

Hi there,

I just realized I have unbalanced panel data, since some firms only got listed at some point during the 12-year research period. Does anyone have experience with unbalanced panel data, and in particular with whether it is going to be a problem for a logistic regression model?
I have the options of:
1. Balancing the panel data by dropping firms that don't have 12 years of data; however, this will cut down my sample from circa 1350 to 900.
2. Keeping and working with the unbalanced panel data.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17675
#14

09 May 2020, 03:01

Siege:
option #2 is the way to go. Otherwise (#1), you would end up with a made-up subsample of your original dataset.
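
Before estimating, the pattern of the unbalanced panel can be inspected. The sketch below simulates late-listed firms so that it runs on its own; firm, year, first_year and n_years are hypothetical names.

```stata
* Simulated unbalanced panel; all names are hypothetical
clear
set seed 99
set obs 50
generate firm = _n
* Hypothetical listing year, constant within firm after expand
generate first_year = 2008 + floor(6*runiform())
expand 12
bysort firm: generate year = 2007 + _n
drop if year < first_year

* Describe participation patterns and count years observed per firm
xtset firm year
xtdescribe
bysort firm: generate n_years = _N
tabulate n_years
```

xtdescribe lists the participation patterns, and the tabulation shows how many firm-year rows come from firms observed for each number of years.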

Kind regards,
Carlo
(Stata 19.0)
Siege Taker

Join Date: Apr 2020

Posts: 22
#15

09 May 2020, 22:33

@Carlo,

Right, I just feel somewhat uncomfortable with unbalanced panel data. Are there any safeguards I can put in place to ensure that missing data have no impact on my model? For instance, of the 12-year sample period, some of those firms have as little as one year of data.

I guess it's not right to use data imputation, since these companies were not listed in those years.