Sampling with probability proportional to size, without replacement

Rahul Kumar

Join Date: Jul 2015

Posts: 6
#1

04 Feb 2016, 02:47

I'm trying to draw 14 clusters from a population of 35 clusters (the 14 is unfortunately dictated by project funding), from which I will then go on to sample my end units. I want the selection probability to be proportional to cluster size, i.e. the number of end units present in each cluster. I used gsample:

gsample 14 [aw=size], wor

and get the error

mm_upswor(): 3300 3 cases have w_i*n/sum(w)>1

I understand from the mm_sample help file that this happens because some of the clusters are too large relative to the total size.

What can I do in this case to get the sample I need?

Thanks

Last edited by Rahul Kumar; 04 Feb 2016, 03:24.
Andrew Lover

Join Date: Apr 2014

Posts: 182
#2

04 Feb 2016, 03:59

You might look at -SAMPLEPPS- (on SSC) which should do exactly what you need.

__________________________________________________ __
Assistant Professor, Department of Biostatistics and Epidemiology
School of Public Health and Health Sciences
University of Massachusetts- Amherst
Rahul Kumar

Join Date: Jul 2015

Posts: 6
#3

04 Feb 2016, 05:53

Thanks for the response Andrew. Unfortunately, samplepps gives me exactly the same error.
Andrew Lover

Join Date: Apr 2014

Posts: 182
#4

04 Feb 2016, 06:09

Hi Rahul, it's unlikely anyone will be able to help without a 'dummy' example to reproduce the error (see the FAQ). Alternatively, you can try using -trace- which might help sort out the issue.

Last edited by Andrew Lover; 04 Feb 2016, 07:07.


Rahul Kumar

Join Date: Jul 2015
Posts: 6
#5

04 Feb 2016, 07:32

Hi Andrew,
Thank you for the pointer.
The following dataex output and commands should allow you to reproduce the issue:

Code:

clear
input int size str2 cluster
  49 "1" 
  80 "10"
  51 "11"
  31 "12"
 128 "13"
 252 "14"
  96 "15"
  33 "16"
  95 "17"
 199 "18"
 944 "19"
 155 "2" 
1564 "20"
7298 "21"
4716 "22"
1081 "23"
1295 "24"
4515 "25"
 563 "26"
  69 "27"
 143 "28"
  52 "29"
 692 "3" 
 363 "30"
 387 "31"
 217 "32"
  51 "33"
1459 "34"
 531 "35"
 269 "4" 
 335 "5" 
  76 "6" 
 217 "7" 
 110 "8" 
1169 "9" 
end

* gsample relies on mm_sample() from the moremata package
ssc install moremata
ssc install gsample

gsample 14 [aw=size], wor

The equivalent samplepps code:

samplepps test, n(14) size(size)


William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

04 Feb 2016, 10:03

I think you have a tough problem rooted in the nature of sampling without replacement.

The output of help gsample indicates that it uses the mm_sample() function from the moremata package. For those without moremata installed, the help information for mm_sample() is available with the following command

Code:

rnethelp "http://fmwww.bc.edu/RePEc/bocode/m/mf_mm_sample.hlp"

The key is the discussion of unequal probability sampling without replacement.

Unequal probability sampling is also possible without replacement. However, note that in the without replacement case a problem exists if there are population members for which w(i) * n / sum(w) > 1. Consider the following example:

Code:

: mm_sample(4, 5, ., (1::5),1,1)
           mm_upswor():  3300  2 cases have w_i*n/sum(w)>1
           mm_sample():     -  function returned error
               <istmt>:     -  function returned error

What happened? Population member no. 5 has size 5 and the sum of sizes over all members is 15. That is, the population share of member no. 5 is 5/15 = 33.3%. However, even if member no. 5 is selected with certainty into the sample, i.e. if member no. 5 is sampled with probability 1, it can only reach a maximum sample share of 1/4 = 25%. (A similar problem exists with member no. 4 whose population share is 4/15 = 26.7%.) Apparently, unbiased PPS sampling without replacement is not possible in this situation.
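To see which clusters trip this condition in the data posted earlier in this thread, here is a minimal sketch. It assumes that dataset (variables size and cluster) is in memory; the variable name incl and the local n are made up for illustration. The count should match the "3 cases" reported in the original error message.

```stata
* Assumes the posted data (variables size and cluster) are in memory.
* n = number of clusters to draw
local n = 14
quietly summarize size, meanonly
* w_i*n/sum(w) for every cluster (incl is a hypothetical variable name)
generate double incl = size * `n' / r(sum)
* how many clusters violate the constraint
count if incl > 1
* list the offending clusters, largest first
gsort -size
list cluster size incl if incl > 1, noobs
```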

Sorry to say I have no advice to offer on a solution. Perhaps one of our members with expertise in sampling will weigh in.
Stephen Jenkins

Join Date: Apr 2014

Posts: 1425
#7

04 Feb 2016, 10:40

I suspect there isn't a "solution". There is a binding constraint that comes into play when one tries to apply the rule. Thanks for the plug, Andrew. (Actually, Ben Jann's program, written after mine, is probably better than samplepps, and it's more general, I recall.) The constraint cited by William is also cited in my help file, and the bibliography may provide more detailed explanations of its origin.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#8

04 Feb 2016, 13:59

Some possibilities come to mind. Try them in this order.

1. Rank the clusters in descending size order. Set aside the largest as a certainty unit and see if the constraint is violated for the remainder. If not, then take the PPS sample from the remainder. If the constraint is violated, then set aside the next largest cluster as a certainty unit and check the constraint. Repeat, if necessary, until you can take a PPS sample with gsample from the remainder. Each certainty unit goes into its own stratum. As an alternative to gsample, try one or both of Jonathan Mendelson's ppschromy and ppssampford commands (SSC). One great advantage of ppschromy is that it implements a hierarchical serpentine sort of designated characteristics, which will implicitly stratify by these characteristics; see the help. The current version of ppssampford is described as a beta.

2. Stratify the clusters by size, with the strata formed so that the measure of size (MOS) totals are about equal. (The largest clusters might each go into their own stratum and would be designated certainty units.) Say you have seven strata; then draw an SRS without replacement of two clusters in each stratum. This is sampling with probability approximately proportional to size. You can apply slight reweighting corrections if necessary to make stratum representation exactly proportional to the MOS. A disadvantage is the loss of degrees of freedom (one per stratum).

3. Sample PPS with minimum replacement: use ppschromy with the pmr option. You will wind up with fewer than 14 distinct clusters, though you might make more selections to get 14 unique ones. Select a new second-stage sample of the same size each time a cluster is drawn. I'm not sure that there is a finite population correction for this design. I would risk \(1- f\) or, perhaps more safely, \(1- f/2\), where \(f\) is the fraction of clusters selected. See Cochran (1977, p. 30) for a reference to the \(1-f/2\) correction for simple random sampling. Stata's standard errors will be an approximation.

4. There is an approach related to systematic sampling of "sampling units" that consist of nominal groups of secondary units in the entire population. To get standard errors you draw several independent systematic samples of these sampling units and visit the clusters in which they fall. You do get an fpc with this. See Chapter 7 of Deming's Sample Design in Business Research (Wiley, 1960) for an example. This approach will have a small number of degrees of freedom (the number of systematic samples minus 1).
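Approach 1 can be sketched in Stata roughly as follows. This is only a sketch under stated assumptions: the posted data (size, cluster) are in memory, and certainty, n, m, total and changed are illustrative names. It flags every violating cluster in each pass rather than one at a time; that should end at the same set, because setting aside a violator only lowers the threshold for the clusters that remain.

```stata
* Assumes variables size and cluster from the posted data are in memory.
* certainty, n, m, total and changed are illustrative names.
local n = 14
generate byte certainty = 0

local changed = 1
while `changed' {
    * clusters still to be drawn by PPS after the certainty units so far
    quietly count if certainty
    local m = `n' - r(N)
    * MOS total of the clusters not yet set aside
    quietly summarize size if !certainty, meanonly
    local total = r(sum)
    * flag remaining clusters with w_i*m/sum(w) > 1
    quietly count if !certainty & size * `m' / `total' > 1
    local changed = r(N)
    quietly replace certainty = 1 if !certainty & size * `m' / `total' > 1
}

quietly count if certainty
local m = `n' - r(N)
display "certainty units: " r(N) ", still to draw by PPS: " `m'

* PPS draw without replacement from the remainder only
preserve
keep if !certainty
gsample `m' [aw=size], wor
list cluster size, noobs
restore
```

The flagged certainty units are not part of the draw; as described in approach 1, each would go into its own stratum.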


Reference: WG Cochran (1977), Sampling Techniques, Wiley.

Last edited by Steve Samuels; 04 Feb 2016, 14:45.

Steve Samuels
Statistical Consulting

Stata 14.2
Rahul Kumar

Join Date: Jul 2015

Posts: 6
#9

04 Feb 2016, 23:08

Thank you all for your responses.

Originally posted by Steve Samuels

Some possibilities come to mind. Try them in this order.

1. Rank the clusters in descending size order. Set aside the largest as a certainty unit and see if the constraint is violated for the remainder. If not, then take the PPS sample from the remainder. If the constraint is violated, then set aside the next largest cluster as a certainty unit and check the constraint. Repeat, if necessary, until you can take a PPS sample with gsample from the remainder. Each certainty unit goes into its own stratum. As an alternative to gsample, try one or both of Jonathan Mendelson's ppschromy and ppssampford commands (SSC). One great advantage of ppschromy is that it implements a hierarchical serpentine sort of designated characteristics, which will implicitly stratify by these characteristics; see the help. The current version of ppssampford is described as a beta.

Thank you for your response as well Steve.

A follow-up question on your preferred approach: at the estimation stage (once the survey is done), how should I weight the largest clusters that I set aside as certainty units?
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#10

05 Feb 2016, 09:50

Please start a new thread. See Nick Cox's post at http://www.statalist.org/forums/foru...-similar-topic

Rahul Kumar

Join Date: Jul 2015

Posts: 6
#11

05 Feb 2016, 23:50

Ok thanks. I would, but the method didn't pan out anyway. Setting the largest unit aside and sampling PPS without replacement on the remaining units was just creating new problem units. I was (often) ending up selecting the largest units as my sample. I'll try some of the other methods.
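Working through the arithmetic on the posted sizes shows why (a hand calculation, so worth rechecking). The sizes sum to 29285, and a cluster is a problem unit whenever its size exceeds the remaining total divided by the number of clusters still to be drawn:

```latex
\[
\begin{aligned}
n = 14:&\quad 29285/14 \approx 2091.8 &&\Rightarrow\ 7298,\ 4716,\ 4515 \text{ exceed it (the 3 cases in the error)}\\
n = 11:&\quad 12756/11 \approx 1159.6 &&\Rightarrow\ 1564,\ 1459,\ 1295,\ 1169 \text{ exceed it}\\
n = 7:&\quad 7269/7 \approx 1038.4 &&\Rightarrow\ 1081 \text{ exceeds it}\\
n = 6:&\quad 6188/6 \approx 1031.3 &&\Rightarrow\ \text{largest remaining size is } 944,\ \text{no violation}
\end{aligned}
\]
```

On these numbers, 8 of the 14 selections end up as certainty units and only 6 clusters are drawn by PPS from the remaining 27, which is consistent with the largest units dominating the sample.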