Comparing sample to a sub-sample

Tommaso Felici Netherlands

Join Date: Dec 2022

Posts: 3
#1

10 Dec 2024, 03:44

Hi everyone,

I have a survey (N = 18,327 obs) of the Dutch population, with weights that make it representative of the entire population.

I have to run some regressions. Because of missing values when I merge different databases and other circumstances, my final database drops 1,937 obs. So, my final sample has 16,390 obs.

Now, I would still like to use the original weights. My idea is to check whether the socio-demographic information (age, income, gender and education) in the new sample (16,390) is the same as in the previous one (18,327). How can I check this, considering that the variables are both continuous and categorical?

If I cannot do it, are there any other ways to still use the weights? If not, would it be a big problem to run my regressions without using them?

Thank you for your help!
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17711
#2

10 Dec 2024, 07:31

Tommaso:
take a look at -suest-.

Kind regards,
Carlo
(Stata 19.0)
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#3

10 Dec 2024, 10:29

Well, Carlo's advice in #2 is about ways to check whether the socio-demographic variables are producing statistically significant differences in their regression coefficients in the two versions of the data set. If that's what you are asking about, it is excellent advice.

But I interpreted the question differently: I read it as asking how to verify that the actual values of the socio-demographic variables are the same in the subset observations as they were in the original complete data set. That's a different matter. Answering that question depends on there being some variable(s) in the data set that identify which observation in the original data set corresponds to any given observation in the subset. Those variables need to be present in both data sets, to link them. For demonstration purposes, I'll assume that a pair of such variables, called id and date, do this, and that they uniquely identify observations in both data sets.

Code:

use original_data_set, clear
frame create subset_data
frame subset_data: use subset_data_set

* link each original observation to its counterpart in the subset, if any
frlink 1:1 id date, frame(subset_data)

foreach v of varlist age income gender education {
    * compare only linked observations with a non-missing original value
    capture assert `v' == frval(subset_data, `v') ///
        if !missing(subset_data, `v')
    if c(rc) == 0 {
        display "Variable `v' OK"
    }
    else {
        display "Variable `v' does not match"
    }
}

Note: This code considers the variable to match in the original and subset data if the value is the same, or if the value is missing in the original data but not in the subset. It also allows some id date pairs to be present in the original data but not at all in the subset. That's based on my interpretation of your question. Unfortunately, the question is not entirely clear on these points, and the absence of example data from both data sets leaves much to my imagination. So, with the hope that my imaginary code above works, I ask that if this is not what you wanted, please post back with example data from both data sets (using the -dataex- command for this purpose, of course!) and a more detailed explanation of the desired state of the data.
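Since the code above tolerates id date pairs that are absent from the subset, it can help to count them separately. The following is only a minimal sketch on made-up toy data (the values, and the use of -frame put- to fabricate a subset, are assumptions for illustration); it relies on the fact that -frlink- leaves the link variable missing for unmatched observations.

```stata
* Toy illustration only: the data and variable names are made up.
clear
input id date age income
1 1 30 100
2 1 41 200
3 1 55 300
end

* Build a "subset" frame that lacks one observation (id 2).
frame put id date age income if id != 2, into(subset_data)

* Link each original observation to its counterpart in the subset, if any.
frlink 1:1 id date, frame(subset_data)

* A missing link variable means no counterpart exists in the subset.
count if missing(subset_data)
display "Original observations absent from the subset: " r(N)

* List them for inspection.
list id date if missing(subset_data)
```

In real data, replace the toy input and the -frame put- step with the two -use- commands shown above.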

If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Tommaso Felici Netherlands

Join Date: Dec 2022
Posts: 3

11 Dec 2024, 04:07

First of all, thank you both.
Unfortunately (my fault!), the answers don't solve my problem yet. I'll try to rephrase and add a -dataex- example. Please note that I cannot publish my database for privacy reasons, so I have created a hypothetical database with 50 obs.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id Age Education Income Policy_Support Sample Sub_Sample)
 1 39 2  77650 3 1 1
 2 58 1  47417 . 1 0
 3 72 0  21160 2 1 1
 4 54 2 110546 2 1 1
 5 68 2  33154 1 1 1
 6 51 0  85394 2 1 1
 7 44 2  20123 . 1 0
 8 66 2  65089 3 1 1
 9 70 2  51107 2 1 1
10 39 3  24919 3 1 1
11 45 2  56601 3 1 1
12 57 3  93542 3 1 1
13 57 1  58582 . 1 0
14 26 3  46459 . 1 0
15 79 2  21235 1 1 1
16 78 2  88054 2 1 1
17 19 2  30743 2 1 1
18 50 1 105503 . 1 0
19 22 0  94054 . 1 0
20 54 1  37341 3 1 1
21 49 2  26423 . 1 0
22 70 0  72328 3 1 1
23 48 1  63439 2 1 1
24 74 2  79861 . 1 0
25 75 1  33601 1 1 1
26 31 0  95347 3 1 1
27 31 1  61199 . 1 0
28 62 1  71658 . 1 0
29 21 2  23763 . 1 0
30 35 2  21269 1 1 1
31 23 1  51821 2 1 1
32 51 2  35878 . 1 0
33 33 3  73247 2 1 1
34 29 3  73057 1 1 1
35 62 2 118494 3 1 1
36 69 2  42175 . 1 0
37 75 1  70306 1 1 1
38 40 1  88107 2 1 1
39 64 2  91865 3 1 1
40 57 2 112950 2 1 1
41 78 3 104630 . 1 0
42 58 0 103933 2 1 1
43 35 1  99624 . 1 0
44 39 0  85679 2 1 1
45 31 0  88029 1 1 1
46 52 2  83137 3 1 1
47 76 1 106846 3 1 1
48 46 3  50056 . 1 0
49 78 1  45232 3 1 1
50 69 1  27082 . 1 0
end

In this database, I have a sample of 50 obs. Age, Income and Education of these 50 obs are representative of the entire population. These 50 obs are marked with Sample ==1. However, when running the analyses, I have to drop some observations because of some missing data in Policy_Support. So, when Policy_Support == ., the obs should be dropped. The sub-sample I can run the analyses on is marked with Sub_Sample==1. In this example, this means dropping 17 obs when running the analyses.

What I want to check is whether the distributions of Age, Education and Income are statistically the same in the entire sample (n=50 \ Sample ==1) and in the sub-sample (n=33 \ Sub_Sample==1).

Is it clearer now?


Carlo Lazzaro

Join Date: Apr 2014
Posts: 17711

11 Dec 2024, 04:48

Tommaso:
I would consider -suest-:

Code:

.

. regress Age if Sample==1

      Source |       SS           df       MS      Number of obs   =        50
-------------+----------------------------------   F(0, 49)        =      0.00
       Model |           0         0           .   Prob > F        =         .
    Residual |    15539.38        49  317.130204   R-squared       =    0.0000
-------------+----------------------------------   Adj R-squared   =    0.0000
       Total |    15539.38        49  317.130204   Root MSE        =    17.808

------------------------------------------------------------------------------
         Age | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _cons |      52.18   2.518453    20.72   0.000     47.11898    57.24102
------------------------------------------------------------------------------

. estimates store A

. regress Age if Sub_Sample==1

      Source |       SS           df       MS      Number of obs   =        33
-------------+----------------------------------   F(0, 32)        =      0.00
       Model |           0         0           .   Prob > F        =         .
    Residual |  10182.1818        32  318.193182   R-squared       =    0.0000
-------------+----------------------------------   Adj R-squared   =    0.0000
       Total |  10182.1818        32  318.193182   Root MSE        =    17.838

------------------------------------------------------------------------------
         Age | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       _cons |   53.54545   3.105192    17.24   0.000     47.22039    59.87052
------------------------------------------------------------------------------

. estimates store B

. suest A B

Simultaneous results for A, B                               Number of obs = 50

------------------------------------------------------------------------------
             |               Robust
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
A_mean       |
       _cons |      52.18   2.518453    20.72   0.000     47.24392    57.11608
-------------+----------------------------------------------------------------
A_lnvar      |
       _cons |   5.759312    .131744    43.72   0.000     5.501099    6.017526
-------------+----------------------------------------------------------------
B_mean       |
       _cons |   53.54545   3.088826    17.34   0.000     47.49147    59.59944
-------------+----------------------------------------------------------------
B_lnvar      |
       _cons |   5.762659   .1571248    36.68   0.000       5.4547    6.070618
------------------------------------------------------------------------------

.  lincom [B_mean]_cons - [A_mean]_cons

 ( 1)  - [A_mean]_cons + [B_mean]_cons = 0

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         (1) |   1.365455   1.813707     0.75   0.452    -2.189346    4.920255
------------------------------------------------------------------------------

Kind regards,
Carlo
(Stata 19.0)


Andrea Discacciati

Join Date: Feb 2016

Posts: 194
#6

11 Dec 2024, 05:24

If you want to compare mean age etc, this can be useful: https://www.statalist.org/forums/for...tandard-errors

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17711

11 Dec 2024, 08:12

Tommaso:
if you plan to use -ttest-, the following code might be useful:

Code:

. sum Age if Sample==1

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         Age |         50       52.18    17.80815         19         79

. sum Age if Sub_Sample ==1

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         Age |         33    53.54545    17.83797         19         79

. ttesti 50 52.18 17.81 33 53.55 17.84

Two-sample t test with equal variances
------------------------------------------------------------------------------
         |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       x |      50       52.18    2.518714       17.81    47.11845    57.24155
       y |      33       53.55    3.105545       17.84    47.22421    59.87579
---------+--------------------------------------------------------------------
Combined |      83     52.7247    1.945648    17.72569    48.85419    56.59521
---------+--------------------------------------------------------------------
    diff |               -1.37    3.997146               -9.323067    6.583067
------------------------------------------------------------------------------
    diff = mean(x) - mean(y)                                      t =  -0.3427
H0: diff = 0                                     Degrees of freedom =       81

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.3663         Pr(|T| > |t|) = 0.7327          Pr(T > t) = 0.6337
.

Kind regards,
Carlo
(Stata 19.0)


Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#8

11 Dec 2024, 09:30

Re #4: OK, now I understand what is wanted. I correctly read the request as dealing with a comparison of the distributions of the variables themselves, not their coefficients in regression analyses. But I misunderstood it, thinking that the subset was created by some complicated data analysis and that there was a concern that some values of the variables might have been overwritten by incorrect data.

This is actually a straightforward issue. The confusion arises because O.P. thinks of this in terms of a comparison between the entire sample and the subsample and can't think of any statistics that are designed to work that way. That confusion is natural because, in fact, there are no statistics that do that. However, it just requires a change in perspective. If the distribution of a variable in the subsample is the same as its distribution in the entire sample, then that distribution is also the same as its distribution in the complementary sample, and vice versa. So it is just a simple matter of applying the usual sample comparison tests to these variables in the sample and its complement.

Code:

foreach v of varlist Age Income {
    svy: regress `v' i.Sub_Sample if Sample
}
foreach v of varlist Education {
    svy: tab `v' Sub_Sample if Sample, pearson
}
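To make the sample-versus-complement idea concrete, here is a rough sketch using the first 20 observations of the -dataex- example in #4. That example has no weights or design variables, so this version is unweighted and drops the svy: prefix; that is an assumption made only so the sketch runs on the posted data. With the real survey, the svy: versions above are the ones to use.

```stata
* First 20 observations of the example posted in #4.
clear
input id Age Education Income Policy_Support
 1 39 2  77650 3
 2 58 1  47417 .
 3 72 0  21160 2
 4 54 2 110546 2
 5 68 2  33154 1
 6 51 0  85394 2
 7 44 2  20123 .
 8 66 2  65089 3
 9 70 2  51107 2
10 39 3  24919 3
11 45 2  56601 3
12 57 3  93542 3
13 57 1  58582 .
14 26 3  46459 .
15 79 2  21235 1
16 78 2  88054 2
17 19 2  30743 2
18 50 1 105503 .
19 22 0  94054 .
20 54 1  37341 3
end

* The sub-sample is whatever survives the missing-data drop.
generate byte Sub_Sample = !missing(Policy_Support)

* Continuous variables: the coefficient on 1.Sub_Sample is the difference
* in means between the sub-sample and its complement.
foreach v of varlist Age Income {
    regress `v' i.Sub_Sample
}

* Categorical variable: test of association with sub-sample membership.
* (Cells are tiny in this toy excerpt, so the result is illustrative only.)
tabulate Education Sub_Sample, chi2
```

A coefficient on 1.Sub_Sample close to zero, or a weak association in the table, is what one would expect if the dropped observations resemble the retained ones on that variable.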

Turning to the question of whether it would be OK to run the analyses ignoring the weights, the answer is emphatically no. Even simple statistics like means can be seriously biased if calculated without appropriate weights.
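A toy illustration of that point, with made-up numbers (the variables y and w are hypothetical and have nothing to do with the actual survey): when a heavily weighted observation differs from the rest, the unweighted and weighted means diverge.

```stata
* Toy data: y is an outcome, w a hypothetical sampling weight.
clear
input y w
10 1
10 1
50 8
end

* Unweighted mean: (10 + 10 + 50)/3, about 23.3
mean y

* Weighted mean: (10*1 + 10*1 + 50*8)/(1 + 1 + 8) = 42
mean y [pweight = w]
```

The third observation stands for eight population members, so ignoring the weights understates the population mean considerably in this example.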

My knowledge of working with survey data is limited, so it would not be appropriate for me to recommend a workaround if the subsample distributions turn out to be different from those of the full sample. For that matter, even if the sample and subsample do have the same distributions of these variables, I am not entirely sure that the survey design parameters (weights, stratification, sampling units) can be used without modification for the subset.
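One relevant piece of official Stata functionality is the subpop() option of the svy prefix, which estimates on a subset while keeping the full design information for variance estimation. The sketch below uses simulated data, and every variable name in it (psu, stratum, wt, x, y) is hypothetical. It only shows the syntax; it does not settle whether the original weights remain appropriate once observations are lost to missing data, which is a substantive question about why they are missing.

```stata
* Simulated data; all names and values are hypothetical.
clear
set seed 12345
set obs 200
generate psu     = ceil(_n/10)          // 20 sampling units
generate stratum = ceil(psu/5)          // 4 strata of 5 units each
generate wt      = 1 + 4*runiform()     // made-up sampling weights
generate x       = rnormal()
generate y       = 1 + 0.5*x + rnormal()
generate byte Sub_Sample = runiform() > 0.1   // about 10% dropped

* Declare the design once, on the full data set.
svyset psu [pweight = wt], strata(stratum)

* Estimate on the sub-sample without discarding the other observations,
* so that standard errors still reflect the full design.
svy, subpop(Sub_Sample): regress y x
```

The output header reports both the total number of observations and the number in the subpopulation.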