  • Weighted sampling

    I'd like to simulate the results of a poll of n=400 voters drawn from the population of nearly 5 million Illinois voters in 1960. Is there a way to do this without writing a population dataset of nearly 5 million rows? It seems to me it should be possible by applying some simple command to a 3-line dataset like this (but -sample- won't do it):

    candidate votes
    Kennedy 2377846
    Nixon 2368988
    Other 10000

  • #2
    I do not understand where the technical challenge lies (e.g., are you opposed to generating 5 million cases, or are you not sure how to generate 5 million cases easily?). I'm going to actually generate the 5 million cases and then draw the sample. It takes just one more line, -expand-:

    Code:
    * The three original cases:
    clear
    input str10 candidate votes
    Kennedy 2377846
    Nixon 2368988
    Other 10000
    end
    
    expand votes
    * Optionally add a "set seed ###" for reproducible outcomes.
    sample 400, count
    Last edited by Ken Chui; 23 Feb 2022, 15:17.

    • #3
      A potentially faster solution is to use Ben Jann's -gsample- (available from SSC; it requires -moremata-). I say potentially because I didn't benchmark it, and it felt slow when I drew a sample of 10M records (just to verify the sampling proportions). You can also "roll your own" sampling program if you prefer a bespoke solution.

      Code:
      set seed 17
      
      * Start from an empty dataset; -input- fails if variables are already in memory.
      clear
      input byte candidate long votes
      1 2377846
      2 2368988
      3 10000
      end
      compress
      label define candidate 1 "Kennedy" 2 "Nixon" 3 "Other"
      label values candidate candidate
      list
      tab candidate [fw=votes]
      
      gsample 400 [iw=votes], replace
      tab candidate

      • #4
        This problem might also be framed as "generate a multinomial random variable" in Stata, which was discussed here a few years ago in a conversation I took part in but had forgotten. The short story is that Stata lacks a multinomial random-variate function, but the obscure built-in -irecode()- function can be used to make something like one, as can Mata's -rdiscrete()- function. Neither of these offers an easier solution in the current context, but someone thinking in that direction might find the pointer useful.
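
        For anyone who does want to try the -irecode()- route, here is a minimal sketch. It assumes that drawing with replacement is acceptable, which seems reasonable here: with n=400 from about 4.76 million voters, the difference from sampling without replacement should be negligible. The seed and the local macro names (total, c1, c2) are arbitrary choices for illustration, not something from the earlier discussion.

        Code:
        clear
        set seed 17
        set obs 400
        * Cumulative vote shares serve as the cut points for -irecode()-.
        local total = 2377846 + 2368988 + 10000
        local c1 = 2377846/`total'
        local c2 = (2377846 + 2368988)/`total'
        * irecode(x, c1, c2) returns 0 if x <= c1, 1 if c1 < x <= c2, and 2 if x > c2.
        generate byte candidate = 1 + irecode(runiform(), `c1', `c2')
        label define candidate 1 "Kennedy" 2 "Nixon" 3 "Other"
        label values candidate candidate
        tabulate candidate

        Each of the 400 observations is an independent draw with probabilities proportional to the vote counts, so the tabulated counts are one multinomial draw and no 5-million-row dataset is ever created.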

        Comment


        • #5
          Thanks to all! I agree that -gsample- is the fastest and simplest solution here.
