
  • Indicate loss of observations by variable

    Dear Statalist Community,

    I am looking for a command that tells me how many observations are lost with each variable that my regression contains.

    Let's say my regression would look like this:

    reg goals rankdiff teamvalue coachexperience weather

    And let's assume I'd work with a dataset containing several similarly defined variables and I am looking for the ones that leave me with the highest number of observations.

    So far, I have experimented with excluding single variables to see how the number of observations reacts and which combination of variables in the regression causes the largest drop in observations.

    I am imagining a command that gives me something like:

    ------------------
    1. goals - 300k observations left (=100%)
    2. rankdiff - 280k observations left
    3. teamvalue - 270k observations left
    4. coachexperience - 110k observations left
    5. weather - 100k observations left
    ------------------

    In this scenario, I would then proceed to look for a good replacement for "coachexperience", as it seems to have too many missing values in data rows where the other variables contain values.

    The real dataset and the regression are bigger than this example and finding out which variables decrease the overall observations the most is more tedious.


    I would appreciate any help regarding this matter.


    Thank you very much,

    Björn.



  • #2
    There is a downloadable command called mvpatterns that may do this job. Use search mvpatterns to install it. Here is an example:

    Code:
    . sysuse nlsw88
    (NLSW, 1988 extract)
    
    . mvpatterns
    variables with no mv's: idcode age race married never_married collgrad south smsa c_city wage ttl_exp
    
    Variable     | type     obs   mv   variable label
    -------------+--------------------------------------------
    grade        | byte    2244    2   current grade completed
    industry     | byte    2232   14   industry
    occupation   | byte    2237    9   occupation
    union        | byte    1878  368   union worker
    hours        | byte    2242    4   usual hours worked
    tenure       | float   2231   15   job tenure (years)
    ----------------------------------------------------------
    
    Patterns of missing values
    
      +------------------------+
      | _pattern   _mv   _freq |
      |------------------------|
      |   ++++++     0    1848 |
      |   +++.++     1     359 |
      |   +++++.     1      10 |
      |   +.++++     1       8 |
      |   +++.+.     2       5 |
      |------------------------|
      |   +..+++     2       5 |
      |   ++.+++     1       4 |
      |   +++..+     2       3 |
      |   .+++++     1       2 |
      |   ++++.+     1       1 |
      |------------------------|
      |   +.+.++     2       1 |
      +------------------------+
    It creates a matrix showing the combinations of missing and non-missing values. That way you should be able to gauge which model yields the highest n. Just remember to include the dependent variable as well.
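    For the cumulative listing sketched in #1, a short loop over official commands may already be enough. What follows is only a minimal sketch: the variable list (with wage standing in for the dependent variable), the temporary variable `complete' and the local n_left are illustrative choices, not something from this thread.

    ```stata
    * Sketch: complete observations left as each variable is added in turn
    sysuse nlsw88, clear

    * Illustrative list: outcome first, then predictors (an assumption)
    local vars wage grade union hours tenure

    * Marker that stays 1 only while every variable so far is non-missing
    tempvar complete
    generate byte `complete' = 1

    foreach v of local vars {
        quietly replace `complete' = 0 if missing(`v')
        quietly count if `complete'
        local n_left = r(N)
        display as text %-16s "`v'" as result %9.0fc `n_left' as text " observations left"
    }
    ```

    Note that the counts are cumulative, so they depend on the order of the list: a variable placed late only gets blamed for observations that the earlier variables had not already removed.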



    • #3
      Thank you so much! This really helped!

      While looking for a way to disentangle the obs and mv columns, which "mvpatterns" displays as what looks like a single number when the values are large (the "skip" option doesn't solve this), I also came across other useful commands for finding the main contributors of missing values, such as:

      mdesc
      rmiss2
      misschk.

      Just leaving them here in case anyone else stumbles upon this thread.
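      For anyone who prefers not to install anything, official Stata's misstable may cover similar ground in reasonably recent versions. A minimal sketch, assuming the nlsw88 example data and an illustrative variable list that is not taken from this thread:

      ```stata
      * Sketch: missing-value overview with the official misstable command
      sysuse nlsw88, clear

      * Per-variable counts of missing and non-missing values
      misstable summarize wage grade union hours tenure

      * Joint patterns of missingness, shown as frequencies rather than percentages
      misstable patterns wage grade union hours tenure, frequency
      ```

      The patterns table plays a role similar to the mvpatterns output in #2, so it can also be used to see which variable removes the most otherwise complete observations.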

      Again, thank you so much for your answer! Really appreciated!

      All the best,
      Björn.



      • #4
        Immodesty aside, let me add missings (Stata Journal) and missingplot (SSC). The first can be a little hard to find, but the esoteric detail you need is below:


        Code:
        . search dm0085, entry
        
        Search of official help files, FAQs, Examples, and Stata Journals
        
        SJ-20-4 dm0085_2  . . . . . . . . . . . . . . . . Software update for missings
                (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                Q4/20   SJ 20(4):1028--1030
                sorting has been extended for missings report
        
        SJ-17-3 dm0085_1  . . . . . . . . . . . . . . . . Software update for missings
                (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                Q3/17   SJ 17(3):779
                identify() and sort options have been added
        
        SJ-15-4 dm0085  Speaking Stata: A set of utilities for managing missing values
                (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
                Q4/15   SJ 15(4):1174--1185
                provides command, missings, as a replacement for, and extension
                of, previous commands nmissing and dropmiss



        • #5
          Very much appreciated! Thank you!
