Indicate loss of observations by variable

Bjorn Becker


11 Aug 2021, 10:09

Dear Statalist Community,

I am looking for a command that tells me how many observations are lost with each variable that my regression contains.

Let's say my regression would look like this:

reg goals rankdiff teamvalue coachexperience weather

Assume the dataset contains several similarly defined candidate variables, and I want to find the ones that leave me with the largest number of observations.

So far, I have been dropping single variables from the regression and watching how the number of observations changes, to see which combination of variables causes the largest drop.

I am imagining a command that gives me something like:

------------------
1. goals - 300k observations left (=100%)
2. rankdiff - 280k observations left
3. teamvalue - 270k observations left
4. coachexperience - 110k observations left
5. weather - 100k observations left
------------------

In this scenario, I would then proceed to look for a good replacement for "coachexperience", as it seems to have too many missing values in data rows where the other variables contain values.

The real dataset and the regression are bigger than this example, so finding out which variables reduce the number of observations the most is more tedious.

I would appreciate any help regarding this matter.

Thank you very much,

Björn.

Ken Chui


11 Aug 2021, 10:21

There is a downloadable command called mvpatterns that may do this job. Use search mvpatterns to install it. Here is an example:

Code:

. sysuse nlsw88
(NLSW, 1988 extract)

. mvpatterns
variables with no mv's: idcode age race married never_married collgrad south smsa c_city wage ttl_exp

Variable     | type     obs   mv   variable label
-------------+--------------------------------------------
grade        | byte    2244    2   current grade completed
industry     | byte    2232   14   industry
occupation   | byte    2237    9   occupation
union        | byte    1878  368   union worker
hours        | byte    2242    4   usual hours worked
tenure       | float   2231   15   job tenure (years)
----------------------------------------------------------

Patterns of missing values

  +------------------------+
  | _pattern   _mv   _freq |
  |------------------------|
  |   ++++++     0    1848 |
  |   +++.++     1     359 |
  |   +++++.     1      10 |
  |   +.++++     1       8 |
  |   +++.+.     2       5 |
  |------------------------|
  |   +..+++     2       5 |
  |   ++.+++     1       4 |
  |   +++..+     2       3 |
  |   .+++++     1       2 |
  |   ++++.+     1       1 |
  |------------------------|
  |   +.+.++     2       1 |
  +------------------------+

It lists the combinations of missing (.) and non-missing (+) values across the variables. That way you should be able to gauge which model yields the highest n. Just remember to include the dependent variable as well.
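The cumulative listing asked for in the original question can also be sketched with official Stata commands only. The following is a minimal sketch, not a tested solution for the real dataset: it reuses nlsw88 from the example above, and the variable list (with wage standing in for the dependent variable) is purely hypothetical.

```stata
sysuse nlsw88, clear

* Hypothetical model variables, dependent variable first
local modelvars wage grade industry occupation union hours tenure

* Marker that is 1 while an observation is complete on all variables so far
tempvar complete
generate byte `complete' = 1

foreach v of local modelvars {
    * Knock out observations that are missing on the current variable
    quietly replace `complete' = 0 if missing(`v')
    quietly count if `complete'
    display %-12s "`v'" %8.0fc r(N) "  observations left"
}
```

Each line shows how many observations remain once that variable and all variables before it are required to be non-missing, so the order of the list matters. If the data match the extract shown above, the last count should agree with the 1848 complete cases in the ++++++ row.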


Bjorn Becker


18 Aug 2021, 04:57

Thank you so much! This really helped!

While looking for a way to disentangle the obs and mv columns, which run together and read as a single number for larger values in the "mvpatterns" output (the "skip" option doesn't solve it), I also came across other useful commands for finding the main contributors of missing values:

mdesc
rmiss2
misschk

Just dropping them here, if anyone else stumbles upon this thread.

Again, thank you so much for your answer! Really appreciated!

All the best,
Björn.

Nick Cox


18 Aug 2021, 05:32

Immodesty aside, let me add missings (Stata Journal) and missingplot (SSC). The first can be a little hard to find, but the esoteric detail you need is below:

Code:

. search dm0085, entry

Search of official help files, FAQs, Examples, and Stata Journals

SJ-20-4 dm0085_2  . . . . . . . . . . . . . . . . Software update for missings
        (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q4/20   SJ 20(4):1028--1030
        sorting has been extended for missings report

SJ-17-3 dm0085_1  . . . . . . . . . . . . . . . . Software update for missings
        (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q3/17   SJ 17(3):779
        identify() and sort options have been added

SJ-15-4 dm0085  Speaking Stata: A set of utilities for managing missing values
        (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q4/15   SJ 15(4):1174--1185
        provides command, missings, as a replacement for, and extension
        of, previous commands nmissing and dropmiss
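Once missings has been installed from one of the entries above, a report for the variables of a candidate model might look like the sketch below. It assumes the nlsw88 data used earlier in the thread and a hypothetical variable list; the exact option behaviour is best checked in help missings.

```stata
sysuse nlsw88, clear

* Count missing values for each listed variable, with percentages,
* sorted so that the main contributors are easy to spot
missings report wage grade industry occupation union hours tenure, percent sort
```

This reports missing values per variable, so it complements a pattern listing rather than replacing it: it does not show how many observations are complete on all variables jointly.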


Bjorn Becker


18 Aug 2021, 07:25

Very much appreciated! Thank you!