  • Sampling with probability proportional to size, without replacement

    I'm trying to draw 14 clusters from my population of 35 clusters (14 is unfortunately dictated by project funding), and I will then sample my end units from within those 14 clusters. I want to select the clusters with probability proportional to size, where size is the number of end units in each cluster. I used gsample

    gsample 14 [aw=size], wor

    and get the error

    mm_upswor(): 3300 3 cases have w_i*n/sum(w)>1

    I understand from the mm_sample help file that this is happening because the size of some of the clusters is too large.

    What can I do in this case to get the sample I need?

    Thanks
    Last edited by Rahul Kumar; 04 Feb 2016, 03:24.

  • #2
    You might look at -SAMPLEPPS- (on SSC) which should do exactly what you need.
    __________________________________________________ __
    Assistant Professor, Department of Biostatistics and Epidemiology
    School of Public Health and Health Sciences
    University of Massachusetts- Amherst



    • #3
      Thanks for the response Andrew. Unfortunately, samplepps gives me exactly the same error.



      • #4
        Hi Rahul, it's unlikely anyone will be able to help without a 'dummy' example to reproduce the error (see the FAQ). Alternatively, you can try using -trace- which might help sort out the issue.
        Last edited by Andrew Lover; 04 Feb 2016, 07:07.



        • #5
          Hi Andrew
          Thank you for the pointer.
          The following is dataex output and my command that should allow you to reproduce the issue
          Code:
          clear
          input int size str2 cluster
            49 "1" 
            80 "10"
            51 "11"
            31 "12"
           128 "13"
           252 "14"
            96 "15"
            33 "16"
            95 "17"
           199 "18"
           944 "19"
           155 "2" 
          1564 "20"
          7298 "21"
          4716 "22"
          1081 "23"
          1295 "24"
          4515 "25"
           563 "26"
            69 "27"
           143 "28"
            52 "29"
           692 "3" 
           363 "30"
           387 "31"
           217 "32"
            51 "33"
          1459 "34"
           531 "35"
           269 "4" 
           335 "5" 
            76 "6" 
           217 "7" 
           110 "8" 
          1169 "9" 
          end
          
          ssc install gsample
          
          gsample 14 [aw=size], wor
          The equivalent samplepps code:

          samplepps test, n(14) size(size)



          • #6
            I think you have a tough problem rooted in the nature of sampling without replacement.

            The output of help gsample indicates that it uses the mm_sample() function from the moremata package. For those without moremata installed, the help information for mm_sample() is available with the following command
            Code:
            rnethelp "http://fmwww.bc.edu/RePEc/bocode/m/mf_mm_sample.hlp"
            The key is the discussion of unequal probability sampling without replacement.

            Unequal probability sampling is also possible without replacement. However, note that in the without replacement case a problem exists if there are population members for which w(i) * n / sum(w) > 1. Consider the following example:
            Code:
                    : mm_sample(4, 5, ., (1::5),1,1)
                                mm_upswor():  3300  2 cases have w_i*n/sum(w)>1
                                 mm_sample():     -  function returned error
                                     <istmt>:     -  function returned error
            What happened? Population member no. 5 has size 5 and the sum of sizes over all members is 15. That is, the population share of member no. 5 is 5/15 = 33.3%. However, even if member no. 5 is selected with certainty into the sample, i.e. if member no. 5 is sampled with probability 1, it can only reach a maximum sample share of 1/4 = 25%. (A similar problem exists with member no. 4 whose population share is 4/15 = 26.7%.) Apparently, unbiased PPS sampling without replacement is not possible in this situation.
            Sorry to say I have no advice to offer on a solution. Perhaps one of our members with expertise in sampling will weigh in.
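
The constraint in the quoted help text can be checked before any sampling is attempted. The following is a minimal sketch, assuming it is run as a do-file (it uses local macros); the variable name incl is illustrative and is not something gsample or mm_sample() creates. It rebuilds the help file's toy population (sizes 1 to 5, sample of 4) and lists the units whose implied inclusion probability n*w_i/sum(w) exceeds 1.

```stata
* Toy population from the mm_sample() help example: sizes 1..5, sample of n = 4
clear
set obs 5
generate int size = _n
local n = 4

* Implied inclusion probability n * w_i / sum(w);
* PPS sampling without replacement needs every value to be <= 1
quietly summarize size
generate double incl = `n' * size / r(sum)
list size incl if incl > 1

* Number of offending units (the count reported in the mm_upswor() error)
quietly count if incl > 1
display "units with n*w_i/sum(w) > 1: " r(N)
```

This flags the units of size 4 and 5, the two cases reported in the help file's error. Applied to the data in post #5 with n = 14, the same check flags clusters 21, 22 and 25 (sizes 7298, 4716 and 4515), because each exceeds 29285/14, or about 2092, which matches the "3 cases" in the original error message.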



            • #7
              I suspect there isn't a "solution". There is a binding constraint that comes into play when one tries to apply the rule. Thanks for the plug, Andrew. (Actually, Ben Jann's program, written after mine, is probably better than samplepps, and it's more general, I recall.) The constraint cited by William is also cited in my help file, and the bibliography may provide more detailed explanations of its origin.



              • #8
                Some possibilities come to mind. Try them in this order.

                1. Rank the clusters in descending size order. Set aside the largest as a certainty unit and see if the constraint is violated for the remainder. If not, then take the PPS sample from the remainder. If the constraint is violated, then set aside the next largest cluster as a certainty unit and check the constraint again. Repeat, if necessary, until you can take a PPS sample with gsample from the remainder. Each certainty unit goes into its own stratum. As an alternative to gsample, try one or both of Jonathan Mendelson's ppschromy and ppssampford commands (SSC). One great advantage of ppschromy is that it implements a hierarchical serpentine sort of designated characteristics, which will implicitly stratify by these characteristics; see the help. The current version of ppssampford is described as a beta.

                2. Stratify the clusters by size, with the strata formed so that the MOS (measure of size) totals are about equal. (The largest clusters might each go into their own stratum and would be designated certainty units.) If you have seven strata, say, then draw an SRS without replacement of two clusters in each stratum. This is sampling with probability approximately proportional to size. You can apply slight reweighting corrections if necessary to make stratum representation exactly proportional to the MOS. A disadvantage is the loss of degrees of freedom (one per stratum).

                3. Sample PPS with minimum replacement: use ppschromy with the pmr option. You will wind up with fewer than 14 distinct clusters, though you could make more than 14 selections to get 14 unique ones. Select a new second-stage sample of the same size each time a cluster is drawn. I'm not sure that there is a finite population correction for this design. I would risk \(1- f\) or, perhaps more safely, \(1- f/2\), where \(f\) is the fraction of clusters selected. See Cochran, 1977, p 30 for a reference to the \(1-f/2\) factor for simple random sampling. Stata's standard errors will be an approximation.

                4. There is an approach related to systematic sampling of "sampling units" that consist of nominal groups of secondary units in the entire population. To get standard errors you draw several independent systematic samples of these sampling units and visit the clusters in which they fall. You do get an fpc with this. See Chapter 7 of Deming's Sample Design in Business Research (Wiley, 1960) for an example. This approach will have a small number of degrees of freedom (no. of systematic samples -1).
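
Option 1 above can be sketched as follows on the data from post #5. This is a minimal illustration, assuming it is run as a do-file; the variable name certainty and the local macro names are made up for the example. It flags every violating cluster in a round rather than one at a time, which ends at the same set because removing a certainty unit only lowers the cutoff for the clusters that remain. The final PPS draw itself is left out.

```stata
* Cluster sizes and ids copied from post #5
clear
input int size str2 cluster
  49 "1"
  80 "10"
  51 "11"
  31 "12"
 128 "13"
 252 "14"
  96 "15"
  33 "16"
  95 "17"
 199 "18"
 944 "19"
 155 "2"
1564 "20"
7298 "21"
4716 "22"
1081 "23"
1295 "24"
4515 "25"
 563 "26"
  69 "27"
 143 "28"
  52 "29"
 692 "3"
 363 "30"
 387 "31"
 217 "32"
  51 "33"
1459 "34"
 531 "35"
 269 "4"
 335 "5"
  76 "6"
 217 "7"
 110 "8"
1169 "9"
end

local n = 14
generate byte certainty = 0
local added = 1
while `added' > 0 {
    * m = clusters still to be drawn by PPS after the certainty units
    quietly count if certainty
    local m = `n' - r(N)
    * a remaining cluster violates the constraint if m*size/total > 1,
    * that is, if its size exceeds total/m
    quietly summarize size if !certainty
    local cutoff = r(sum) / `m'
    quietly count if !certainty & size > `cutoff'
    local added = r(N)
    quietly replace certainty = 1 if !certainty & size > `cutoff'
}

quietly count if certainty
local ncert = r(N)
display "certainty units: `ncert'; clusters left to draw by PPS: " `n' - `ncert'
list cluster size if certainty, noobs
```

With these data the loop stops after flagging eight certainty units (clusters 21, 22, 25, 20, 34, 24, 9 and 23), which leaves only six clusters to be drawn by PPS from the remaining 27. Most of the sample is therefore fixed in advance by cluster size.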


                Reference: WG Cochran (1977), Sampling Techniques, Wiley.
                Last edited by Steve Samuels; 04 Feb 2016, 14:45.
                Steve Samuels
                Statistical Consulting
                [email protected]

                Stata 14.2



                • #9
                  Thank you all for your responses.

                  Originally posted by Steve Samuels
                  Some possibilities come to mind. Try them in this order.

                  1. Rank the clusters in descending size order. Set aside the largest as a certainty unit and see if the constraint is violated for the remainder. If not, then take the PPS sample from the remainder. If the constraint is violated, then set aside the next largest cluster as a certainty unit and check the constraint. Repeat, if necessary, until you can take a PPS sample with gsample from the remainder. Each certainty unit goes into its own stratum. As an alternative to gsample, try one or both of Jonathan Mendelson's ppschromy and ppssampford commands (SSC). One great advantage of ppschromy is that it implements a hierarchical serpentine sort of designated characteristics, which will implicitly stratify by these characteristics; see the help. The current version of ppssampford is described as a beta.
                  Thank you for your response as well Steve.

                  A follow-up question on your preferred approach: at the estimation stage (once the survey is done), how should I weight the largest clusters that I set aside as certainty units?



                  • #10
                    Please start a new thread. See Nick Cox's post at http://www.statalist.org/forums/foru...-similar-topic

                    Steve Samuels



                    • #11
                      Ok thanks. I would, but the method didn't pan out anyway. Setting the largest unit aside and sampling PPS without replacement on the remaining units was just creating new problem units. I was (often) ending up selecting the largest units as my sample. I'll try some of the other methods.
