
  • Two stage sampling

    Dear all,

    I have a dataset of children and their schooling preferences from a school district, which I am using for discrete choice modeling of school choices. There are about 25,000 children spread across 200 neighborhoods. I have the money to survey 600 children and want to do two-stage sampling (neighborhoods at stage 1, children within neighborhoods at stage 2). I have arrived at a design of 60 neighborhoods and 10 children per neighborhood. I want to do PPS at stage 1, based on the population of applicants in each neighborhood, and stratified sampling at stage 2 (half poor and half non-poor applicants). In the end I want 600 sampled children plus a buffer list from each neighborhood in case some of the sampled parents don't take part in the survey. I tried the gsample and samplepps commands, but they don't give me the buffer lists or the stratification at stage 2. An example of my dataset is below (id: child identifier; nei_code: neighborhood code; available_schools: number of schools available in the neighborhood; population: number of applicants in the neighborhood; poor: indicator for a poor applicant). I would really appreciate your help.

    Many thanks,
    Vijay
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long(id nei_code) float(available_schools population poor)
      8805 292001046  4  63 0
     34582 292001046  4  63 1
      6759 292001046  4  63 1
    204042 292001129 28 374 1
    148069 292001129 28 374 1
    213387 292001129 28 374 1
     28516 292001129 28 374 1
    169220 292001129 28 374 0
     39276 292001129 28 374 1
    282710 292001129 28 374 0
     26063 292001129 28 374 1
    250315 292001129 28 374 0
    277483 292001129 28 374 1
    286421 292001129 28 374 1
    214859 292001129 28 374 1
    286817 292001129 28 374 0
     11116 292001130 30 316 1
    259525 292001130 30 316 1
    193110 292001130 30 316 1
     26748 292001130 30 316 1
    end

  • #2
    This is an interesting problem, but I have some questions and comments:


    * I'd like to make sure I understand your design. From your data listing, it looks like you have a list of all children in the population, with neighborhood and poverty status. You will select neighborhoods PPS at the first stage, then at the second stage, sample five children with SRS from each poverty group. Is this correct?

    * You say you have funds (and time) enough to "survey 600 children". (I assume that you mean parents of selected children.) Have you factored in the time and cost to: 1) contact parents and arrange for interviews; 2) try to convert (a sample of) nonrespondents (two-phase sampling, Lohr 2010, Section 8.3); and 3) recruit parents of the "buffer" children?

    * What size of buffer are you planning for? If you don't have data from similar studies or from pilot tests, you will be forced to guess, and to guess high.

    However, if participation rates are low in the initial sample, they are also likely to be low for the replacements. It is better to pilot test the protocol and survey instrument to improve initial response, then do intensive follow-up of (at least a sample of) nonparticipants before going to the buffer. See also Lohr (2010), Chapter 8, and Groves et al. (2009), Chapter 6.

    Your design has a noteworthy feature: the random selection of children will bias the sample towards children from large families or, equivalently, children from small families will be underrepresented. You have to do something about this bias. My advice is to use families as the sampling unit, perhaps picking one child from the family to report on. You can also re-weight after the fact or use family size as a control variable.
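
    A minimal sketch of the "one child per family" idea, on toy data. The variable fam_id is hypothetical (the posted listing has no family identifier), and the ids and the seed are made up for illustration. One child is picked at random within each family, and n_kids records family size so that the within-family selection probability (1/n_kids) is available for re-weighting.

```stata
* Toy data: fam_id is a HYPOTHETICAL family identifier, not in the posted data
clear
input long(id fam_id)
1 101
2 101
3 102
4 103
5 103
6 103
end

set seed 2018                      // arbitrary seed, for reproducibility
gen double u = runiform()

* Random order within family; the first child in that order is reported on
bysort fam_id (u id): gen byte report_child = (_n == 1)

* Family size: a child in a family of n_kids is picked with probability 1/n_kids
bysort fam_id: gen n_kids = _N
gen double p_within = 1 / n_kids

list id fam_id report_child n_kids p_within, sepby(fam_id)
```

    With families as the sampling unit, the stage 2 draw would be made on one record per fam_id rather than on children, and p_within would enter the final weight as an extra factor.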

    To your question: you draw the sample in two stages (code not tested).
    Stage 1: select neighborhoods
    Code:
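    set seed 43619
    tempfile t1
    save `t1'
    * One record per neighborhood; keep only the key and the size measure so
    * that child-level variables are not overwritten in the merge below
    bysort nei_code: keep if _n == 1
    keep nei_code population
    samplepps pick1, size(population) n(60)
    keep if pick1
    * Bring back all children in the selected neighborhoods
    merge 1:m nei_code using `t1', keep(match) nogenerate
    save nei_sampl, replace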
    Stage 2: select, e.g., 8 children (5 + 3 buffer) from each poverty stratum of the selected neighborhoods
    Code:
    local k = 8
    use nei_sampl, clear
    * Draw k children per neighborhood-by-poverty stratum
    sample `k', count by(nei_code poor)
    save child_sampl, replace
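
    One way to tell the main sample from the buffer in a draw like the one above is to put the selected children in random order within each stratum and call the first 5 the main sample. The sketch below does this on a toy stand-in for child_sampl; the toy data, the seed, and the names u, draw_order, and buffer are assumptions for illustration, not part of the original code.

```stata
* Toy stand-in for child_sampl.dta: 3 neighborhoods x 2 poverty strata x 8 children
* (with the real file, replace this block by: use child_sampl, clear)
clear
set obs 48
gen long id = _n
gen long nei_code = ceil(_n / 16)
gen byte poor = mod(_n, 2)

set seed 43620                     // arbitrary seed, for reproducibility
gen double u = runiform()

* Random order within each neighborhood-by-poverty stratum
bysort nei_code poor (u id): gen byte draw_order = _n

* First 5 in random order form the main sample; the rest are the buffer,
* to be approached in draw_order when a main-sample family does not take part
gen byte buffer = draw_order > 5
label define bufl 0 "main" 1 "buffer"
label values buffer bufl

tabulate nei_code buffer
```

    Because the 8 children in a stratum are a simple random sample, the first 5 in random order are also a simple random sample from that stratum.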
    References:

    Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2009). Survey methodology (2nd ed.). Hoboken, NJ: Wiley.

    Lohr, S. (2010). Sampling: Design and analysis. Boston: Brooks/Cole.
    Last edited by Steve Samuels; 08 Jan 2018, 19:10.
    Steve Samuels
    Statistical Consulting
    [email protected]

    Stata 14.2



    • #3
      Dear Steve,

      Thank you for the detailed comments and the code.

      * I'd like to make sure I understand your design. From your data listing, it looks like you have a list of all children in the population, with neighborhood and poverty status. You will select neighborhoods PPS at the first stage, then at the second stage, sample five children with SRS from each poverty group. Is this correct?

      That is correct, though I am now considering sampling a different number of children in different neighborhoods. This is because I am now thinking of using school fixed effects in the model (to account for the unobserved heterogeneity of schools). As neighborhoods differ in their number of schools (a few have more than 10), I will need more than 10 observations in some neighborhoods. I was advised to apply a rule of thumb in selecting the sample size per cluster rather than, say, sampling proportionate to the number of schools. One rule of thumb could be to sample multiples of 5 children per cluster: 5 in smaller clusters (fewer than 5 schools), 10 in clusters with 5-10 schools, 15 in clusters with more than 10 but fewer than 15 schools, and 20 in the rest. This approach might require re-weighting at the end. Does this sound okay? Finally, I now want to have the poor and the non-poor in a 3:2 ratio rather than a 1:1 ratio. What would the second stage code be in this kind of situation? It will probably be very complicated.

      * You say you have funds (and time) enough to "survey 600 children". (I assume that you mean parents of selected children.) Have you factored in the time and cost to: 1) contact parents and arrange for interviews; 2) try to convert (a sample of) nonrespondents (two-phase sampling, Lohr 2010, Section 8.3); and 3) recruit parents of the "buffer" children?

      I already did a bigger survey (1600 children and their parents (testing children and HH survey)) six months ago and the response rate is typically 80 percent. My survey team is pretty hands on with the local logistics.


      * What size of buffer are you planning for? If you don't have data from similar studies or from pilot tests, you will be forced to guess, and to guess high.

      However, if participation rates are low in the initial sample, they are also likely to be low for the replacements. It is better to pilot test the protocol and survey instrument to improve initial response, then do intensive follow-up of (at least a sample of) nonparticipants before going to the buffer. See also Lohr (2010), Chapter 8, and Groves et al. (2009), Chapter 6.


      The buffer size should be about 20-30 percent. In the code for stage 2, is there a way to number the sampled observations so that I know which are the sample and which are the buffer? Maybe using runiform()? I will also read the references carefully.

      Your design has a noteworthy feature: the random selection of children will bias the sample towards children from large families or, equivalently, children from small families will be underrepresented. You have to do something about this bias. My advice is to use families as the sampling unit, perhaps picking one child from the family to report on. You can also re-weight after the fact or use family size as a control variable.

      I suspect this won't be a huge problem as it is unlikely that families have two children of the same age (the dataset is for entry into grade 1 only). However, I will try to see if I have a family level identifier in the data.

      I would be grateful for your thoughts and code on the revised stage 2 sampling strategy. I really appreciate your help.

      Regards,
      Vijay






      • #4
        Sorry, I don't understand enough about discrete choice modeling to advise you on the first stage design.
        As to the second stage, instead of selecting 5 poor and 5 "not poor" families (or multiples) from the poverty strata, draw 6 poor and 4 not-poor families (or multiples).

        Here's sample code (not tested) for drawing the second-stage samples. I've kept the 3:2 poor:not-poor ratio in each school-size group, so neighborhood poor sample sizes will be, e.g., 6, 12, 18, 24 and not-poor sample sizes will be 4, 8, 12, 16. I'm sure that you can improve this. This can't be your final code: you'll have to add something for the buffer. I define a variable "wtc", which holds the second-stage probability of selection. This will have to be multiplied by the first-stage probability, and the product inverted, to get the final sampling weight.


        Code:
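        * Run on the file of children in the selected neighborhoods (e.g. nei_sampl)
        * School-size strata
        gen sstrat = 1 if available_schools < 6
        replace sstrat = 2 if inrange(available_schools, 6, 10)
        replace sstrat = 3 if inrange(available_schools, 11, 15)
        replace sstrat = 4 if available_schools > 15 & available_schools < .

        tempfile t1 t2 t3 tpoor
        preserve

        * Poor children: 6, 12, 18, 24 per neighborhood
        * (wtc is capped at 1 in case a neighborhood has fewer children than requested)
        keep if sstrat == 1 & poor
        bysort nei_code: gen wtc = min(1, 6/_N)
        sample 6, count by(nei_code)
        save `t1', replace
        restore, preserve
        keep if sstrat == 2 & poor
        bysort nei_code: gen wtc = min(1, 12/_N)
        sample 12, count by(nei_code)
        save `t2', replace
        restore, preserve
        keep if sstrat == 3 & poor
        bysort nei_code: gen wtc = min(1, 18/_N)
        sample 18, count by(nei_code)
        save `t3', replace
        restore, preserve
        keep if sstrat == 4 & poor
        bysort nei_code: gen wtc = min(1, 24/_N)
        sample 24, count by(nei_code)
        append using `t1' `t2' `t3'
        save `tpoor', replace
        restore, preserve

        * Not-poor children: 4, 8, 12, 16 per neighborhood
        keep if sstrat == 1 & !poor
        bysort nei_code: gen wtc = min(1, 4/_N)
        sample 4, count by(nei_code)
        save `t1', replace
        restore, preserve
        keep if sstrat == 2 & !poor
        bysort nei_code: gen wtc = min(1, 8/_N)
        sample 8, count by(nei_code)
        save `t2', replace
        restore, preserve
        keep if sstrat == 3 & !poor
        bysort nei_code: gen wtc = min(1, 12/_N)
        sample 12, count by(nei_code)
        save `t3', replace
        restore, preserve
        keep if sstrat == 4 & !poor
        bysort nei_code: gen wtc = min(1, 16/_N)
        sample 16, count by(nei_code)
        append using `t1' `t2' `t3' `tpoor'

        * Discard the preserved copy and save the sample under its own name
        restore, not
        save child_sampl, replace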
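
        A rough sketch of how the final weight could be assembled from the two stages, on toy data. It assumes the stage 1 draw gives each neighborhood an inclusion probability proportional to its population (n × population / total), which is the usual PPS target but should be checked against what samplepps actually does. The total of 25,000 applicants is the approximate figure from the first post, the wtc values are made up, and the names p1 and pw are hypothetical.

```stata
* Toy neighborhood-by-poverty cells; the wtc values are made up for illustration
clear
input long nei_code float(population poor wtc)
292001046  63 1 .5
292001129 374 1 .1
292001130 316 1 .2
end

scalar n_nei = 60       // neighborhoods drawn at stage 1 (from the thread)
scalar N_tot = 25000    // ASSUMPTION: total applicants over all 200 neighborhoods

* Stage 1 inclusion probability under PPS, capped at 1 for very large neighborhoods
gen double p1 = min(1, n_nei * population / N_tot)

* Final sampling weight = inverse of the overall selection probability
gen double pw = 1 / (p1 * wtc)

list nei_code population p1 wtc pw
```

        In practice N_tot would be computed from the full neighborhood list before sampling, and p1 saved alongside pick1 at stage 1 so that it is carried into the child-level file.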
        Steve Samuels
        Statistical Consulting
        [email protected]

        Stata 14.2



        • #5
          Thanks Steve! This code was very helpful in finalizing my sampling.
