  • Comparing final model sample against original full sample

    Hello All –

    I hope all is well – please note that I am relatively new to Stata, so any and all advice is greatly appreciated.

    Data Structure:
    Panel data including 4 waves of survey data
    Issue:
    Across waves, subject attrition and missing data remove cases through listwise deletion in the regression models.
    Goal:
    I need advice on Stata syntax that compares the final model sample against the original full sample to check for significant differences among the included variables; that is, to assess whether the final sample reflects the original sample on all independent variables and covariates.
    Again, I know this is likely a basic/beginner ask, so I appreciate any guidance. Thank you.

    Best -


  • #2
    The key here is that Stata estimation commands leave behind an indicator for which observations were included in the analysis, called e(sample). You can turn it into an explicit variable in your data set and then use that as a grouping variable for comparisons using the standard commands such as -summarize- or -tabstat- or -table-, etc.

    Let me also point out that, from a statistical perspective, the contrast you should look at is not the included sample vs the original whole sample, but the included sample vs the excluded sample. So, for example, your code might look something like this:

    Code:
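    // ANALYSIS
    regress y x1 x2 i.x3 ...
    // Flag the observations that were used in the estimation
    gen byte insample = e(sample)

    // Descriptive table, split by inclusion in the estimation sample
    dtable c.(y x1 x2) i.x3, by(insample)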
    This will create a descriptive table for the variables mentioned, disaggregated by whether or not each observation was included in the regression. (Caution: -dtable- was introduced in version 18, so if you are using an older Stata, you will have to use other commands to create the table.)
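
    For Stata versions older than 18, one possible substitute for -dtable- is sketched here. It runs on the auto dataset shipped with Stata so that it works as-is; price, mpg, rep78, and foreign are stand-ins for y, x1, x2, and x3, not the variables from the question.

    ```stata
    // Illustrative data only: rep78 has missing values, so some cars drop out
    sysuse auto, clear
    regress price mpg rep78 i.foreign
    gen byte insample = e(sample)

    // Continuous variables: N, mean, and SD by inclusion status
    tabstat price mpg, by(insample) statistics(n mean sd) columns(statistics)

    // Categorical variable: column percentages plus a chi-square test
    tabulate foreign insample, column chi2

    // Included vs excluded comparison for a single continuous variable
    ttest mpg, by(insample)
    ```

    With only a handful of excluded observations, as in this toy data, the tests have little power, so the descriptive differences deserve at least as much attention as the p-values.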


    • #3
      Good morning, Clyde:

      Hope all is well - especially as we move into the end of the week! I can’t thank you enough for your insight and guidance with this.

      I have been digging into this syntax and the dtable command suite and learning everything I can. Your advice 100% fits the bill – so, again, thank you.

      With that, is there a way in Stata to compare the analytic sample (i.e. insample == 1) to the full, original sample? I certainly understand comparing the analytic sample against the cases removed, but I was still curious about how to compare the cases analyzed against the full sample while testing for differences.

      Thank you so much again – your guidance is appreciated more than you know.

      Best –
      John


      • #4
        I'm not aware of any tests for comparing a subset with the whole set. And if I were to attempt to develop one, I don't see any way it could be done other than, behind the scenes, comparing the subset with the complementary subset and presenting those results.

        Here are two important reasons why comparing the subset with the whole is problematic. Since the subset is also included in the whole set, the observations in the subset and the whole set are not independent. So the usual t-tests, chi-square tests, etc. all fail because they rely crucially on independence of observations in the groups being compared. Probably even more important is that the size of the difference between the subset and the whole is a very sensitive function of what proportion of the whole set is also in the subset. To take an extreme example that is probably unrealistic but illustrates the point clearly, suppose the whole set contains 1000 observations and the subset of interest contains 990 of those. Then the subset is almost the same as the whole set, and any measure of the difference between them, say a difference in mean values of a variable, will necessarily be quite small, even if the subset was chosen by excluding the 10 most extreme values in the whole set. Consequently, such a test would have great difficulty in detecting the difference between that subset and the whole set even though the former was actually created by culling extreme values out. The whole scheme is just unworkable.
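
        The 1000-versus-990 example can be tried out with simulated data. This is only a sketch: the standard normal draws, the seed, and the choice to cull the 10 largest values (rather than extremes in both tails) are assumptions made for illustration.

        ```stata
        clear
        set seed 12345
        set obs 1000
        gen double x = rnormal()

        // Assumption: the "culled" cases are the 10 largest values of x
        sort x
        gen byte insample = _n <= 990

        // Subset vs whole: the two means are forced to be close
        summarize x
        scalar mean_all = r(mean)
        summarize x if insample
        scalar mean_in = r(mean)
        display "whole minus subset mean: " mean_all - mean_in

        // Subset vs complement: the culling shows up clearly
        ttest x, by(insample)
        ```

        The displayed difference between the whole-set and subset means should be small relative to the standard deviation of x, while the included-vs-excluded comparison makes the culling obvious.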


        • #5
          Thank you again! This all makes perfect sense and I am aware of the methodological snags in trying to accomplish this. I was only curious after being asked to provide evidence that the analytic sample reflects the full data sample.
