Specifying Appropriate PSU or Cluster ID in svy

Stephen Hornbeck

Join Date: May 2018

Posts: 3
#1

22 May 2018, 14:10

Hi Everyone,

I am analyzing a multi-stage cluster sample and am attempting to appropriately calculate the design effect. The Primary Sampling Units are districts that were selected using PPS with replacement. The population sizes of the districts are large enough that multiple clusters are allocated to a district. Population information below the district is not available and the remaining stages are selected using SRS.

District (dis)   Cluster Id (clus)   Household   Respondent in HH
1                1                   1           3
1                1                   2           1
1                1                   3           4
1                1                   4           1
1                2                   1           1
1                2                   2           3
1                2                   3           2
1                2                   4           1
2                3                   1           1
etc.

When using the svyset command, should the PSU be specified as the district or should the PSU be specified as the cluster id?

For example, should it be:

svyset dis

or

svyset clus

Thank you for the help!

Last edited by Stephen Hornbeck; 22 May 2018, 14:52.
Tags: cluster, deff, psu, svy
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

23 May 2018, 09:18

Welcome to Statalist, Stephen!
Be sure to read the FAQ, especially FAQ 12 on how to ask questions and how to show graphs, code, results, and data listings.

As district was the highest-level unit selected by random numbers, the svyset statement should start with:

Code:

svyset dis

Because districts were sampled with replacement, the only other elements of the svyset statement should be the pweight option and (if districts were stratified) the strata() option.
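
For readers who want to see the full statement and the design effect together, here is a minimal sketch. Only `dis` comes from the first post; `wt` (sampling weight), `stratum` (stratification variable), and `y` (analysis variable) are hypothetical names, so substitute whatever the data set actually contains and drop strata() if districts were not stratified.

```stata
* Hypothetical variables: wt = sampling weight, stratum = stratum ID,
* y = outcome of interest. Only dis (district ID) is from the original post.
svyset dis [pweight = wt], strata(stratum)

* Estimate a mean under the declared design
svy: mean y

* Report the design effect (DEFF) and its square root (DEFT)
estat effects, deff deft
```

Running `estat effects` directly after the `svy:` command ties the reported design effect to whatever PSU was declared in svyset, which makes it easy to see how the choice of PSU changes the result.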

Last edited by Steve Samuels; 23 May 2018, 09:23.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Stephen Hornbeck

Join Date: May 2018

Posts: 3
#3

23 May 2018, 16:42

Thank you for the response Steve.

My apologies, I looked at the FAQ and saw the note about specifying cross-posting. The question I had was based on the following post on stackexchange: https://stats.stackexchange.com/ques...luster-samples.

Reading through the svy documentation I saw that the PSU should be specified, which for the example in the first post I was specifying as district. However, the stackexchange post states that I should be specifying the cluster. I was worried specifying the district would drastically overestimate the standard errors and wanted to see what variable would be best to include in the svyset.

Thanks again for your help!
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#4

24 May 2018, 10:56

Thanks for reminding me of the Stackexchange post, Stephen. In that post, I stated that the "draw" was the unit: as in the sampling texts, the \(n\) is the number of draws, not the number of distinct PSUs. I realize that my advice there is not correct: PSU is the PSU, whether sampling with or without replacement.

Stephen, with-replacement surveys are not common. For my benefit and that of other readers, could you clarify your design a bit? By cluster, you obviously mean a cluster of HH located in a small geographic area. When a PSU was drawn, how many of the "clusters" were sampled at the second stage? If the PSU was drawn another time, were the clusters selected previously also eligible for the new subsample? Thanks.

Last edited by Steve Samuels; 24 May 2018, 11:37.

Stephen Hornbeck

Join Date: May 2018

Posts: 3
#5

24 May 2018, 11:34

Thanks for the response.

The design is a stratified multi-stage cluster sample meant to be representative of a geographic region.

Stage 1: In total there are 30 clusters with ten respondents selected within each cluster. These are allocated proportionally to district for the stratification.

Stage 2: We have sub-district level information that we use for the PPS selection with replacement. Some sub-districts are large and receive multiple clusters for the PSU.

Stage 3: Once the PPS is complete, we randomly assign the allocated clusters to a street without replacement within the sub-districts for the SSU.

Stage 4: Enumeration of households and random selection of households

Stage 5: Random selection within the household

Based on this design, I am using the code:

Code:

svyset subdistrict [pweight=weight], strata(district)

but thought it might be:

Code:

svyset street [pweight=weight], strata(district)

Last edited by Stephen Hornbeck; 24 May 2018, 11:37.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#6

24 May 2018, 13:22

Thanks, Stephen. If "sub-districts" are the largest units selected by random numbers, then your first svyset is the right one.
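
One way to check what that declaration implies is to list the strata and PSU counts after setting the design. This is only a sketch: `subdistrict`, `weight`, and `district` are the names used in post #5, while `y` is a hypothetical analysis variable.

```stata
* Declare sub-district as the PSU, stratified by district (names from post #5)
svyset subdistrict [pweight = weight], strata(district)

* List each stratum with its number of PSUs and observations
svydescribe

* y is a hypothetical outcome variable; report its mean and design effect
svy: mean y
estat effects, deff
```

With only 30 clusters spread across districts, it is worth looking at the `svydescribe` output for any stratum that contains a single PSU, since svy estimation commands report missing standard errors in that case unless the singleunit() option of svyset is specified.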

Steve Samuels

Join Date: Mar 2014

Posts: 1786
#7

04 Jun 2018, 12:49

I must revert to the original opinion I expressed in the Stack Exchange thread: the PSU for svyset is the draw number, not the physical cluster. I was persuaded by a comment added by David Rae in that thread:

“In PPS sampling with replacement, a given cluster (i.e., hospital, in this instance) can be sampled more than once, and each drawing of a cluster is considered a primary sampling unit.” Levy, Lemeshow Sampling of Populations p. 346

I apologize for the flip-flop. I didn't consider that authors (like you) of with-replacement sampling data sets would provide the draw number as well as the ID of the physical unit. Although it's uncommon to encounter a with-replacement data set these days, in Deming's (1960) replicated sampling designs, the same physical unit can appear in more than one replicate. I've often used those designs in my own work.
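
If the data set does carry a draw identifier, the declaration might look like the sketch below. The variable `draw` is hypothetical (it is not named anywhere in this thread) and is assumed to take a distinct value for each PPS selection, so a sub-district drawn twice contributes two PSUs; `weight` and `district` are the names from post #5.

```stata
* draw is a hypothetical variable: one distinct value per PPS selection,
* so a physical unit selected twice appears under two draw numbers.
svyset draw [pweight = weight], strata(district)

* Confirm that the number of PSUs per stratum equals the number of draws
svydescribe
```

Comparing the PSU counts from `svydescribe` under this declaration with those obtained when the physical sub-district is the PSU shows directly how many units were drawn more than once.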

Reference:
Deming, W. E. 1960. Sample Design in Business Research. Wiley.

