Weighted sampling

paulvonhippel

Join Date: Apr 2014

Posts: 502
#1

23 Feb 2022, 13:20

I'd like to simulate the results of a poll of n=400 voters from the population of nearly 5 million Illinois voters in 1960. Is there a way to do this without writing a population dataset of nearly 5 million rows? It seems to me it should be possible by applying some simple command to a 3-line dataset like this (but -sample- won't do it):

candidate votes
Kennedy 2377846
Nixon 2368988
Other 10000
Ken Chui

Join Date: Aug 2014

Posts: 1058
#2

23 Feb 2022, 13:48

I do not understand where the technical challenge lies (e.g., do you object to generating 5 million cases, or are you unsure how to generate 5 million cases easily?). I would actually generate the 5 million cases and then draw a sample. It takes just one more line, -expand-:

Code:

* The three original cases:
clear
input str10 candidate votes
Kennedy 2377846
Nixon 2368988
Other 10000
end

expand votes

* Optionally add a "set seed ###" for reproducible outcomes.
sample 400, count

Last edited by Ken Chui; 23 Feb 2022, 14:17.
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#3

23 Feb 2022, 14:52

A potentially faster solution is to use Ben Jann's gsample. I say potentially because I didn't benchmark times but it felt slow with a sample of 10M records (just to verify sampling proportions). You can also "roll your own" sample program if you prefer a bespoke solution.

Code:

set seed 17
input byte candidate long votes
1 2377846
2 2368988
3 10000
end
compress
label define candidate 1 "Kennedy" 2 "Nixon" 3 "Other"
label values candidate candidate
list
tab candidate [fw=votes]
gsample 400 [iw=votes], replace
tab candidate
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#4

23 Feb 2022, 15:55

This problem might also be framed as "generate a multinomial random variable" in Stata, which was discussed here a few years ago, a conversation in which I participated but had forgotten. The short story is that Stata lacks a multinomial r. v. function, but the obscure built-in -irecode- function can be used to make something like that, as can Mata's -rdiscrete- function. Neither of these offers an easier solution in the current context, but perhaps someone thinking in that direction might find a pointer useful.
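For anyone following that pointer, here is a minimal sketch of how the -irecode- idea might look for this poll. It is an illustration under stated assumptions, not a tested replacement for the solutions above: it draws the 400 voters with replacement (a multinomial draw), which for n=400 out of roughly 4.76 million should differ negligibly from sampling without replacement. The variable and macro names (u, candidate, total, c1, c2) are made up for the example.

```stata
* Sketch only: draws 400 voters with replacement (a multinomial draw).
* Variable and macro names (u, candidate, total, c1, c2) are illustrative.
clear
set seed 17
set obs 400

* Cumulative vote shares: Kennedy, then Kennedy + Nixon
local total = 2377846 + 2368988 + 10000
local c1 = 2377846/`total'
local c2 = (2377846 + 2368988)/`total'

* irecode() returns 0 if u <= c1, 1 if c1 < u <= c2, and 2 if u > c2
generate double u = runiform()
generate byte candidate = irecode(u, `c1', `c2') + 1

label define candidate 1 "Kennedy" 2 "Nixon" 3 "Other"
label values candidate candidate
tabulate candidate
```

No 5-million-row dataset is created; only the 400 sampled observations ever exist in memory.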
paulvonhippel

Join Date: Apr 2014

Posts: 502
#5

23 Feb 2022, 20:05

Thanks to all! I agree that -gsample- is the fastest and simplest solution here.