Stata command

Ihab Man

Join Date: Jul 2020

Posts: 56
#1

10 Sep 2020, 07:49

Dear all,
First of all, thank you so much to you (the statistical professionals) for your help and advice, as always.
I have some simple questions, and many thanks in advance for your replies.
I have a panel data set for 500 companies over a regular period (2000-2010) with 9 explanatory variables (not dummies). On the other hand, my dependent variable is a dummy (0, 1).
Could you please tell me what the difference is between the results (table) of the logit and xtlogit commands in Stata? Or which one is correct for my case?

Is there a specific command for my methodology (logistic regression) to lag all my explanatory variables, rather than (by companyName : gen lagVAR1 = VAR1 [_n-1])? I mean for the main command logit depV Var1 Var2 …etc. Please take into account that I have some missing values in my independent variables.

Will there be any problem if I change the company name (ID) to a number from 1-500 in Excel before importing the data into Stata?

Thank you so much
All the best
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

10 Sep 2020, 11:11

Could you please tell me what the difference is between the results (table) of the logit and xtlogit commands in Stata? Or which one is correct for my case?

-logit- should be used when all of the observations in the data set are independent of each other. When you have panel data, independence usually does not hold: observations of the same firm over time will generally be correlated with each other. -xtlogit- deals with this kind of situation. In general, when working with panel data, the -xt- command series is your best bet. There are occasional circumstances where it would be appropriate to use -logit- instead of -xtlogit- with panel data, but they arise only rarely.

Is there a specific command for my methodology (logistic regression) to lag all my explanatory variables, rather than (by companyName : gen lagVAR1 = VAR1 [_n-1])? I mean for the main command logit depV Var1 Var2 …etc. Please take into account that I have some missing values in my independent variables.

There is seldom any need in Stata to explicitly create new variables to be the lagged values of variables you already have. And when there is, the code you show is dangerous because if there are any time gaps in the data, that will be handled incorrectly by your code. The simple and correct way to do this is to -xtset- your data (including a time variable) and then use the lag operator. So, for example,

Code:

xtset company_id year
xtlogit depV Var1 L1.Var2

will give you a random effects logistic regression of depV on the current value of Var1 and the 1st lagged value of Var2.

Do read -help xtset-, -help tsvarlist-, and -help xtlogit- for more details.
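
To see why the row-based lag is risky when years are missing, here is a minimal sketch on a made-up toy panel (the numbers and the variable names lag_by_row and lag_by_time are purely illustrative, not from the original data). Company 1 has no row for 2002, so the two approaches disagree for its 2003 observation:

```stata
* Hypothetical toy panel: company 1 has no observation for 2002
clear
input company_id year Var2
1 2000 10
1 2001 11
1 2003 13
2 2000 20
2 2001 21
2 2002 22
end

* Row-based lag: takes the previous row within company, whatever its year
bysort company_id (year): gen lag_by_row = Var2[_n-1]

* Time-aware lag: needs xtset, and is missing when the prior year is absent
xtset company_id year
gen lag_by_time = L1.Var2

list, sepby(company_id)
```

For company 1 in 2003, lag_by_row picks up the 2001 value, while lag_by_time is missing because there is no 2002 observation. Inside an estimation command you would normally write L1.Var2 directly rather than generating the lagged variable.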

Will there be any problem if I change the company name (ID) to a number from 1-500 in Excel before importing the data into Stata?

In fact, in order to -xtset- your data, you must have a numeric company id variable. But you should not do it in Excel. In fact, you should never do any data management in Excel unless you are just playing around. If your work has any serious purpose, all of the data management should be done in a serious statistical package (like Stata, though there are others) and should leave an audit trail of exactly what has been done. To create a numeric company ID in Stata you can run

Code:

encode company_name, gen(company_id)

(And you must do that before you run the -xtset- command).

The new variable company_id created by the -encode- command will actually be numeric, but it will also be value labeled so that when you browse the data, or list or display it, your eyes will see the company names themselves.
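
As a rough sketch of what that looks like in practice, assuming the string variable is called company_name and the time variable is called year (both names are assumptions about the poster's data):

```stata
* Assumes a string variable company_name and a numeric variable year exist
encode company_name, gen(company_id)

* company_id is stored as a number with a value label attached
describe company_id

* With labels shown, company_id displays the names; with nolabel, the numbers
list company_name company_id in 1/5
list company_name company_id in 1/5, nolabel

* The numeric id can now be used to declare the panel structure
xtset company_id year
```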
Ihab Man

Join Date: Jul 2020

Posts: 56
#3

10 Sep 2020, 11:37

Dear Clyde Schechter,
I am happy with your suggestions. Thank you so much, really. Everything is clear now. As you mentioned that I should do all the data management in Stata, I have one final question.
In my data (Excel file) the symbol (%) appears after all the variable names, which means all the numbers are percentages; for example, in the Excel file a number appears as 62.23 but without the symbol (%). Is there any code in Stata to convert all the numbers into percentages? For example, to convert 62.23 into 0.6223 with Stata code? Thank you so much
Rich Goldstein

Join Date: Mar 2014

Posts: 4464
#4

10 Sep 2020, 11:58

.6223 sounds more like a proportion than a percentage but you can get from 62.23 to .6223 by just dividing the variable by 100; e.g.

Code:

gen newvar = oldvar/100
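
If several variables need the same conversion, a loop avoids repeating the command. This is only a sketch: Var1, Var2 and Var3 are placeholder names and the _prop suffix is an arbitrary choice, so substitute your own variable list.

```stata
* Var1 Var2 Var3 are placeholders; list the actual percentage variables here
foreach v of varlist Var1 Var2 Var3 {
    * Keep the original and store the proportion in a new variable
    gen `v'_prop = `v' / 100
}
```

Creating new variables rather than overwriting the originals keeps the source values available for checking.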
Ihab Man

Join Date: Jul 2020

Posts: 56
#5

10 Sep 2020, 12:22

Dear Rich Goldstein,
Perfect, thank you so much. All the best
Ihab Man

Join Date: Jul 2020

Posts: 56
#6

13 Sep 2020, 11:52

Dear Clyde Schechter,
I am sorry for the inconvenience.
I am happy with your suggestions, but I have some simple questions about my case.
As you know, I have a panel data set for 500 companies over a regular period (2000-2010) with 9 explanatory variables (not dummies). On the other hand, my dependent variable is a dummy (0, 1). I don’t have gaps in the years: I have 2000-2010 (11 years for each company), but I have missing values for some independent variables.
You mentioned lagging my independent variables (xtlogit depV Var1 L1.Var2). Does it work well in my case if I have some missing values in the variables that I want to lag?

In my case, if I want to run fixed effects, which one is correct, and can you tell me the difference between them please:

xtlogit DepV IndpV, fe
xtlogit DepV IndpV i.year, fe
If I deal with the missing values using ipolate, will that be fine?

Thank you so much
Kind Regards
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#7

13 Sep 2020, 12:14

You mentioned lagging my independent variables (xtlogit depV Var1 L1.Var2). Does it work well in my case if I have some missing values in the variables that I want to lag?

Yes, that is the correct syntax and it will work when there are missing values.

In my case, if I want to run fixed effects, which one is correct, and can you tell me the difference between them please:

xtlogit DepV IndpV, fe
xtlogit DepV IndpV i.year, fe

The difference between them is that the second one includes adjustment for yearly shocks to the dependent variable, whereas the first does not. As for which is correct, that is a substantive question, not a statistical one. If the dependent variable is, in fact, subject to substantial yearly shocks, then the second model will probably be better, and if not, then the first will avoid adding unnecessary noise. You don't say what your dependent variable is, so nobody can advise you without that information. Even if you said what it is, since it appears you are working in finance or economics, that's outside my area and I wouldn't be able to tell you anyway. So you would have to consult the literature or a colleague in your discipline to answer that.

The implications of missing values are that any observation containing one will be excluded from the estimation of your regression. So your sample size shrinks. There is also the question of whether the missing data introduces a bias into the sample. So whenever you are dealing with missing values in data you need to have some understanding of why the missing values are missing. What causes that to happen? Is it an essentially random process, or might the missingness of those values in fact be associated with the (unknown) real value, or with the values of other variables? This kind of analysis depends on substantive knowledge of your area and how these data were created in the first place.
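
A first look at how much is missing, and where, can be had with Stata's -misstable- command. The sketch below assumes the data are already loaded; Var1 and Var2 are placeholder names and miss_Var1 is a hypothetical helper variable. It describes the missingness but cannot, on its own, tell you why the values are missing.

```stata
* Var1 and Var2 are placeholders for the independent variables
* Count of missing values per variable
misstable summarize Var1 Var2

* Which combinations of variables tend to be missing together
misstable patterns Var1 Var2

* Hypothetical indicator for missingness in Var1, tabulated by year
* to see whether the gaps cluster in particular periods
gen byte miss_Var1 = missing(Var1)
tabulate year miss_Var1, row
```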

Linear interpolation is unlikely to be a good approach to missing values. First, it is only appropriate if there is some reason to believe that the variable actually varies, at least more or less, linearly with time. Next, since your dependent variable is a 0/1 variable, linear interpolation will often produce values between 0 and 1, which are not valid.
Ihab Man

Join Date: Jul 2020

Posts: 56
#8

13 Sep 2020, 12:46

Dear Clyde Schechter

Thank you so much, I understand everything you wrote. I appreciate that.
Regarding the nature of the missing values: they are already missing in the sources from which I downloaded the data, so I think they are unknown values. I think using the average for a missing value would work (the average of the previous and following values). Is there any code for replacing missing values with the average? Or do you think there is another approach if the values are already unknown at the source? I mean just for the independent variables.
Thank You
Kind Regards
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#9

13 Sep 2020, 13:09

Well, you probably should try to find out from the curators of your data sources why the missing values are there. It can be important. To take an example from my own field, in certain kinds of international health data, missing values arise because some governments will refuse to report results that look embarrassingly bad. Clearly, replacing those missing values with average values would be misleading: they are systematically too high (for adverse outcomes) or too low (for favorable outcomes) and not at all average but probably at the tails of the distributions.

Sometimes, though, missing values arise through circumstances that would not introduce bias. For example, again in health data, results of lab tests may be missing because there was a power failure in the lab at a certain time, and the specimens that were being processed at the time were destroyed as a result. Clearly that is a completely random creation of missing values. In that case, the best approach actually is to just omit the observations with missing data (which is what Stata would do by default).

And there are all sorts of situations between those extremes that can arise. The bottom line regarding missing data is that there are no good solutions. It is a matter of finding the least bad solution to the problem.

For an overview of different approaches see https://statisticalhorizons.com/wp-c...aterials-1.pdf.
Ihab Man

Join Date: Jul 2020

Posts: 56
#10

13 Sep 2020, 13:24

Dear Clyde Schechter
Thank you so much. Everything is clear now. Your examples help me a lot to understand the idea behind the missing value. I will do as you have said. I will contact the curators of the data sources and then we will see what will happen. Thank you so much Dear Clyde Schechter
Ihab Man

Join Date: Jul 2020

Posts: 56
#11

15 Sep 2020, 15:10

Dear Clyde Schechter,
I am sorry for the inconvenience, really sorry, but I have something to ask you and I hope there is a solution.
1- As you know, in my case when I run the fixed effects model many observations are dropped. Is this because my DepV does not vary? If yes, is there no solution?
2- I have used this command ( Xtlogit Dep V Indp V i.compnyID i.year) and I got notes that many companies were dropped (e.g. CompanyID != 0 predicts failure perfectly; companyID dropped and 11 obs not used), and only 76 companies out of 500 remain (like when I used fixed effects). Is this because I introduced i.Year and i.comanyID in the random effects command? If yes, we should not use them, right? Or is there another problem? Thank you so much
Kind Regards
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#12

15 Sep 2020, 15:47

Before discussing the issue of the dropped variable and observations, I should point out that the command -Xtlogit Dep V Indp V i.compnyID i.year- is ill-formed. What you have done here is dropped the -fe- option and instead substituted i.companyID as part of the predictor list. This is wrong for a couple of reasons. First, because you do not specify -fe-, you get -re- by default. So your model now contains both fixed and random effects for companyID. That's a really badly specified model, and I wouldn't attempt to interpret any results it gave you, even if there were no warnings or error messages accompanying the output. It is also wrong for another reason: you might have been thinking that you can just use i.companyID to get a fixed-effects logistic regression just the way you can use i.companyID with -regress- to get a fixed effects linear regression. But, in fact, you can't. That trick only works for linear regression. What you would get from -logistic DepV IndpV i.companyID- (with or without i.year) would be an unconditional logistic regression--the results of which are inconsistent due to a technical difficulty known as the "incidental parameters problem."

So the first thing is to fix up your command to -xtlogit DepV IndpV i.year, fe-.

When you do that, most of those warning messages will go away. But you will still have a problem with dropped observations: if your DepV doesn't vary over time within some company (or companies) that company (those companies) are not informative about the within-company relationship between IndpV and DepV: Remember that a fixed-effects regression is always estimating only within-company effects. Consequently these companies are useless in fitting the fixed-effects logistic model. Actually, they are worse than useless: if they are left in, the formulas used in calculating the coefficients "blow up." Consequently, Stata removes those from the analysis. Stata will tell you that it is doing this by including a warning message before the regular output.

There is no solution to this problem. In one sense, it is not a problem at all. It is not a problem because all that is happening is that data that has no information relative to the question you are asking is being ignored. It is a "cosmetic" problem in the sense that your actual sample size will be smaller, perhaps much smaller, than you were hoping for. And if your actual sample size ends up being too small to produce useful results, well, then it is a real problem--but one whose only solution is to get more data that is informative.
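
If you want to know in advance how many companies will be set aside for this reason, something along these lines may help. It is a sketch that assumes DepV is coded 0/1 and the panel identifier is company_id; dep_min, dep_max, dep_varies and one_per_company are hypothetical helper names.

```stata
* Assumes DepV is coded 0/1 and company_id identifies the panel
bysort company_id: egen dep_min = min(DepV)
bysort company_id: egen dep_max = max(DepV)

* 1 if DepV takes both values within the company, 0 if it never changes
gen byte dep_varies = dep_min != dep_max if !missing(dep_min)

* Tag one row per company so the table counts companies, not observations
egen byte one_per_company = tag(company_id)
tabulate dep_varies if one_per_company
```

The count is only approximate relative to what -xtlogit, fe- reports, because observations with missing predictors are also excluded from estimation, and that can remove the within-company variation in further companies.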
Ihab Man

Join Date: Jul 2020

Posts: 56
#13

16 Sep 2020, 03:23

Dear Clyde Schechter
Yes. Everything is clear now. I understand you.
Thank you so much. I appreciate it very much. Thanks
Ihab Man

Join Date: Jul 2020

Posts: 56
#14

03 Oct 2020, 17:33

Dear Clyde Schechter
I hope you are doing well
I have seen some posts about multiple fixed effects, but I did not find a solution for my case, and I hope you can help me with my questions.
As you know, in my case I have a panel data set for 500 companies from 11 countries over a regular period (2000-2010) with 9 explanatory variables (not dummies), and my Dep is a dummy (0, 1). I did not use xtlogit, fe because many observations were dropped. Then I tried to use xtlogit, re vce(cluster CompanyID).
When I run the command (xtlogit Dep IndVar i.Year i.Country, re vce(cluster CompanyID)) Stata tells me that:

note: 2001.Year != 0 predicts failure perfectly
note: 2010.Year omitted because of collinearity
What happened in this case, please? And how can I fix it? When I do not include i.Year, no note appears.
When I lag my independent variables by one year by writing L.Var1, should I sort the data because it is panel data, or is the result the same with and without sorting? Thank you so much
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#15

03 Oct 2020, 18:00

Duplicate post: asked and answered at https://www.statalist.org/forums/for...effect-xtlogit.