Statistical test of subpopulation vs. entire population (using jackknife standard errors)

Mickey Jackson

Join Date: Jul 2014

Posts: 2
#1

09 Jul 2014, 06:35

I'm trying to compare a subpopulation to the overall population for the purpose of evaluating survey nonresponse bias. In other words, I have frame data for the entire sample (both respondents and nonrespondents), and I want to run a t-test to evaluate whether the proportion of the respondent subpopulation with a certain characteristic is significantly different from the proportion of the overall population (not just the proportion of the nonrespondent population; I want the "respondent + nonrespondent" population) with that characteristic. Furthermore, I need to use jackknife standard errors in this analysis.

In theory, it seems that the best thing to do would be to run "svy: tabulate" (or "svy: proportion") on the overall sample, then run it again using the "if" qualifier to restrict the sample to respondents only, and then use the "suest" command to compare the proportions from the two tabulations. Unfortunately, however, "suest" does not support jackknife standard errors. I've come up with a workaround, but I'm not sure if it's correct, so I was hoping to get some input. Here's an example of what I'm doing, using a hypothetical "gender" variable:

svyset [pweight=pweight], vce(jackknife) jkrweight(jkweight1-jkweight70) mse

expand 2 if complete==1, generate(respondentsonly) /*Duplicating the respondent observations and creating a new variable respondentsonly that equals 1 for the "respondents only" sample and 0 for the "respondents + nonrespondents" sample*/

svy: proportion gender, over(respondentsonly)

lincom _b[Male:0] - _b[Male:1] /*Testing whether the estimated population proportion of males from the "respondents + nonrespondents" sample is different from estimated proportion from the "respondents only" sample*/

What I'm specifically concerned about is the way that the "over" option interacts with the duplicate observations for the respondents. From what I've been able to find in Stata's subpopulation estimation documentation, it seems that the "if" qualifier would be the more appropriate means of subsetting the sample between the "respondent + nonrespondent" observations and the (duplicate) respondent observations, since I don't want Stata to count the duplicate observations twice when calculating the standard errors; however, I can't figure out a way to run a hypothesis test on the coefficients from two separate tabulations. I'm very much a nonspecialist when it comes to survey variance estimation, so I thought I'd see if anyone here can tell me whether my hack is correct, or if I've made a monumentally stupid mistake (which is entirely possible). It's worth noting that the proportions and standard errors that result from "svy: proportion gender, over(respondentsonly)" are the same as those that result from running "svy: proportion gender if respondentsonly==1" and "svy: proportion gender if respondentsonly==0."

Thanks! Let me know if you need any more info; I'm new to Statalist, so excuse any rookie mistakes or omissions.

Last edited by Mickey Jackson; 09 Jul 2014, 06:38.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

11 Jul 2014, 08:46

I want to run a t-test to evaluate whether the proportion of the respondent subpopulation with a certain characteristic is significantly different from the proportion of the overall population (not just the proportion of the nonrespondent population)...

The two comparisons are logically and statistically equivalent, so just do the comparison of responders to non-responders. The proof is simple:

Let $P_1$ and $P_2$ be the prevalence rates for responders and non-responders, respectively, and let $w$ be the proportion of responders in the population. Then the overall population prevalence can be written

$$
\quad P = w P_1 + (1-w)P_2
$$
Then the difference $P - P_1$ can be written:

$$
P- P_1 = (1-w)(P_2 - P_1)
$$
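
For readers who want the intermediate algebra, the step follows by substituting the expression for $P$ and collecting terms (same symbols as above, nothing new assumed):

$$
P - P_1 = wP_1 + (1-w)P_2 - P_1 = (1-w)P_2 - (1-w)P_1 = (1-w)(P_2 - P_1)
$$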

As a consequence, any hypothesis about $P - P_1$ is equivalent to a hypothesis about $P_2 - P_1$. The test statistics are identical: since

$$
\text{se}(\hat{P}- \hat{P}_1) = (1-w)\times \text{se}(\hat{P}_2 - \hat{P}_1),
$$

$(1-w)$ will cancel out in numerator and denominator. To get a confidence interval for $P - P_1$, just multiply the endpoints of the $P_2 - P_1$ interval by $(1-w)$.
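
Written out, and taking $w$ as a fixed constant as the argument above does (if $w$ is itself estimated from the sample, the standard-error relationship is only approximate), the cancellation looks like this:

$$
t = \frac{\hat{P} - \hat{P}_1}{\text{se}(\hat{P} - \hat{P}_1)} = \frac{(1-w)(\hat{P}_2 - \hat{P}_1)}{(1-w)\times \text{se}(\hat{P}_2 - \hat{P}_1)} = \frac{\hat{P}_2 - \hat{P}_1}{\text{se}(\hat{P}_2 - \hat{P}_1)}
$$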

Last edited by Steve Samuels; 11 Jul 2014, 08:58.

Steve Samuels
Statistical Consulting

Stata 14.2
skolenik

Join Date: Mar 2014

Posts: 100
#3

13 Jul 2014, 04:25

Welcome to Statalist, Mickey.

To pick up where Steve Samuels left off, all you need to do is

Code:

svy: proportion gender, over(respondentsonly)

to compare respondents and non-respondents on gender. Then you can use the standard test commands to generate formal tests and p-values. Type matrix list e(b) to figure out the parameter names to test.
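
A minimal sketch of the respondent versus nonrespondent comparison, run on the original (unexpanded) data. It assumes that `complete` from post #1 is a 0/1 respondent indicator and that `gender == 1` codes males; that coding, the `male` indicator, and the scalar names are hypothetical. A linear regression of the indicator on `i.complete` is used here only because the coefficient name does not depend on how `svy: proportion` labels its estimates in a given Stata version. The last lines rescale the difference by $(1-w)$, treating the estimated $w$ as fixed.

```stata
* Sketch only: unexpanded data, complete == 1 for respondents (from post #1),
* gender == 1 for males (hypothetical coding).
svyset [pweight=pweight], vce(jackknife) jkrweight(jkweight1-jkweight70) mse

* Indicator for the characteristic of interest (coding is an assumption)
generate byte male = (gender == 1) if !missing(gender)

* Estimated share of respondents in the population, w
svy: mean complete
scalar w_hat = _b[complete]

* Coefficient on 1.complete estimates P1 - P2 (respondents minus
* nonrespondents); its t test uses the jackknife standard error
svy: regress male i.complete
lincom 1.complete

* Rescale to P - P1 = (1 - w)(P2 - P1) = -(1 - w)(P1 - P2), w treated as fixed
scalar d_hat  = -(1 - w_hat) * r(estimate)
scalar se_d   = (1 - w_hat) * r(se)
scalar t_crit = invttail(r(df), 0.025)
scalar ci_lb  = d_hat - t_crit * se_d
scalar ci_ub  = d_hat + t_crit * se_d
display "P - P1 = " d_hat
display "95% CI lower = " ci_lb
display "95% CI upper = " ci_ub
```

The t statistic and p-value reported for `1.complete` are the test of interest; the rescaled interval is only needed if the difference from the overall population is to be reported on its own scale.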

In terms of your rookiness, your post is very difficult to read without code formatting. Maybe you can take a minute to go back and edit it to make it more readable by applying [ CODE ] or boldface formatting to your Stata commands. (I would do this myself if we were on StackExchange... but we aren't.)

-- Stas Kolenikov || http://stas.kolenikov.name
-- Principal Survey Scientist, Abt SRBI
-- Opinions stated in this post are mine only
Mickey Jackson

Join Date: Jul 2014

Posts: 2
#4

14 Jul 2014, 12:16

Thank you both! So, just to confirm, the equivalence--in particular, the relationship between the standard errors--still holds if I'm using the jackknife, correct?

Originally posted by Steve Samuels

The two comparisons are logically and statistically equivalent, so just do the comparison of responders to non-responders. The proof is simple:

Let $P_1$ and $P_2$ be the prevalence rates for responders and non-responders, respectively, and let $w$ be the proportion of responders in the population. Then the overall population prevalence can be written

$$
\quad P = w P_1 + (1-w)P_2
$$
Then the difference $P - P_1$ can be written:

$$
P- P_1 = (1-w)(P_2 - P_1)
$$

As a consequence, any hypothesis about $P - P_1$ is equivalent to a hypothesis about $P_2 - P_1$. The test statistics are identical: since

$$
\text{se}(\hat{P}- \hat{P}_1) = (1-w)\times \text{se}(\hat{P}_2 - \hat{P}_1),
$$

$(1-w)$ will cancel out in numerator and denominator. To get a confidence interval for $P - P_1$, just multiply the endpoints of the $P_2 - P_1$ interval by $(1-w)$.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#5

14 Jul 2014, 18:48

So, just to confirm, the equivalence--in particular, the relationship between the standard errors--still holds if I'm using the jackknife, correct?

Correct.

Steve Samuels
Statistical Consulting

Stata 14.2