Dear Statalist Community,
I am looking for a command that tells me how many observations are lost with each variable that my regression contains.
Let's say my regression would look like this:
reg goals rankdiff teamvalue coachexperience weather
Assume the dataset contains several similarly defined candidate variables, and I am looking for the ones that leave me with the highest number of observations.
So far, I have experimented with excluding single variables to see how the number of observations changes and which combination of variables in the regression causes the largest drop.
I am imagining a command that gives me something like:
------------------
1. goals - 300k observations left (=100%)
2. rankdiff - 280k observations left
3. teamvalue - 270k observations left
4. coachexperience - 110k observations left
5. weather - 100k observations left
------------------
In this scenario, I would then proceed to look for a good replacement for "coachexperience", as it seems to have too many missing values in data rows where the other variables contain values.
The real dataset and regression are larger than this example, so finding out by hand which variables reduce the number of observations the most is tedious.
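To make the desired output concrete, here is a minimal sketch of the tally I have in mind, assuming the variable names from the example regression above. It is not an existing command, just a loop that marks a row as dropped once any variable seen so far is missing and then counts the rows that remain. The temporary variable `ok' is a name introduced only for illustration.

```stata
* Sketch only: cumulative count of complete cases as each variable is added.
* Variable names are taken from the example regression above.
local vars goals rankdiff teamvalue coachexperience weather

tempvar ok
generate byte `ok' = 1

foreach v of local vars {
    * A row stops counting as soon as one of the variables so far is missing
    quietly replace `ok' = 0 if missing(`v')
    quietly count if `ok'
    display as text %-20s "`v'" as result %12.0fc r(N) " observations left"
}
```

The counts are cumulative, so they depend on the order of the variables in the list: a variable listed late is only charged with the rows that earlier variables had not already removed. The sketch also considers missing values only and ignores any other reason an estimation command might drop observations.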
I would appreciate any help regarding this matter.
Thank you very much,
Björn.