Bootstrapping with frequency weights

paulvonhippel

Join Date: Apr 2014

Posts: 496
#1

23 Jan 2015, 19:51

I'm trying to bootstrap from data that represents a frequency table. To simplify, say x is a variable and n is its frequency, and say I have data called original, which summarizes the distribution as follows:

x n
1 40
2 30
3 30

So the original data summarizes 100 cases with x=(1,2,3) in proportions 40:30:30.

What I'd like to do is generate another dataset representing the distribution of 100 cases drawn at random, with replacement, from the distribution described by the original data. Or actually I'd like to do that 200 times and stack the results. I'm open to different ways of representing the results, but they might look something like this:

sample x n
1 1 42
1 2 32
1 3 26
2 1 35
2 2 30
2 3 35
....
200 1 34
200 2 27
200 3 39

bsample 3, weight(n) doesn't do this, and neither does bsample 100, weight(n).

Many thanks for any suggestions.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#2

24 Jan 2015, 07:28

Paul:
some years ago, I started a thread on a similar topic http://statalist.1588530.n2.nabble.c...td3743854.html
I do hope that the related replies can be helpful.

Kind regards,
Carlo
(Stata 19.0)
ben earnhart

Join Date: May 2014

Posts: 1027
#3

24 Jan 2015, 08:44

"expand n" then bootstrap? Each bootstrap sample will be a little off from the original frequencies, but on average the samples will reproduce them.

Last edited by ben earnhart; 24 Jan 2015, 09:16.
paulvonhippel

Join Date: Apr 2014

Posts: 496
#4

24 Jan 2015, 10:14

n can be pretty big (outside my toy example). I'm not sure "expand n" is the way to go.
Stephen Jenkins

Join Date: Apr 2014

Posts: 1425
#5

24 Jan 2015, 11:19

Paul: does the following code achieve the sort of thing that you're after? What writing the code brought home to me was that one has to make assumptions about how/where the randomness comes in. I've written something that works directly on your frequency table, even though the random process generating the distribution across categories is presumably occurring at some individual unit level (which you've then summarised).

Code:

clear all
set obs 600 // 3 * 200
ge id = _n
seq cat, from(1) to(3)
seq rep, b(3)
list in 1/21, noobs sepby(rep)
set seed 12345
ge catprop = .
replace catprop = round( 40 + 100*rnormal(0,.05) ) if cat == 1
replace catprop = round( 30 + 100*rnormal(0,.05) ) if cat == 2
bysort rep (cat): replace catprop = 100 - catprop[_n-1] - catprop[_n-2] if cat == 3
sort id
list in 1/30, noobs sepby(rep)
ta cat, su(catprop)
paulvonhippel

Join Date: Apr 2014

Posts: 496
#6

24 Jan 2015, 19:02

I don't think you need to make any assumptions. I just want to repeatedly simulate 100 draws from a multinomial distribution with values x=(1,2,3) in proportions 40:30:30, summarized in the stated form.
Stephen Jenkins

Join Date: Apr 2014

Posts: 1425
#7

25 Jan 2015, 04:18

Paul: I don't think you quite take my remarks in the constructive spirit in which they were intended, though I concede the remarks may not have been entirely clear. I'll try again. You refer to a summary table of frequencies and associated proportions. Underlying the table is presumably a sample of units (let's call them persons), each of which has an associated categorical outcome value (1, 2, or 3). With access to the underlying (unit-record) data on the persons, I think one can sample from a multinomial distribution using methods such as those set out at e.g. http://en.wikipedia.org/wiki/Multino...l_distribution . My point was that I don't know how one goes about this when one only has the summary table. That led to my code -- which did at least produce something looking like what you said you wanted. Are there literature references on related (re)sampling problems that you might point us to?
paulvonhippel

Join Date: Apr 2014

Posts: 496
#8

25 Jan 2015, 09:51

Thanks! I do recognize the constructive spirit of your comments -- so sorry if that didn't come across.

I just don't think you need unit-level data to estimate and sample from a multinomial distribution. The frequencies alone are adequate to do that.
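To illustrate that the frequencies alone are enough, here is a rough sketch (not a tested or official solution) that draws the cell counts directly, one category at a time, as a chain of conditional binomials. It never expands the data to one row per case, so it should cope with large n. The variable names sample, k, N, p, left and nboot are made up for this example, and the sketch assumes every original frequency is positive.

```stata
clear
input x n
1 40
2 30
3 30
end

set seed 12345
expand 200                                   // 200 copies of the table rows, not of the cases
bysort x: generate sample = _n               // sample id 1..200 within each category
sort sample x
by sample: generate k = _n                   // position of the category within a sample
by sample: egen double N = total(n)          // number of draws per sample (100 here)
by sample: generate double p = n / (N - sum(n) + n)  // P(category k | not an earlier category)

summarize k, meanonly
local K = r(max)                             // number of categories

generate double left = N                     // draws not yet allocated to a category
generate double nboot = .
forvalues j = 1/`K' {
    // binomial draw from the remaining cases; the last category (p == 1) takes what is left
    quietly replace nboot = cond(p < 1 & left > 0, rbinomial(left, p), left) if k == `j'
    quietly by sample: replace left = left - nboot[`j'] if k > `j'
}

keep sample x nboot
rename nboot n
list in 1/9, noobs sepby(sample)
```

Within each sample the counts add up to the original total by construction, and the result is stacked in the sample/x/n layout from post #1.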

Since originally posting, I've come across Stata's rmultinom() function, and Buis' suggestions for using the uniform() function to simulate multinomial data. I can work with those functions, but I was sort of hoping for something as simple and elegant as the bootstrap, bsample, sample, or gsample command.

Best,
Paul
paulvonhippel

Join Date: Apr 2014

Posts: 496
#9

25 Jan 2015, 10:04

Stephen, I just realized that you will be familiar with the actual problem that motivated my question. Consider the situation where (x1,x2) is a range of incomes and n is the number of households with incomes in that range. Then it is common to see an income distribution summarized like this:

x1 x2 n
0 9999 4000
10000 19999 6600
20000 29999 11340
etc.

Now there are a variety of methods for estimating summary statistics such as the Gini coefficient -- including your favored approach of fitting the generalized beta distribution or something similar.

What I'm trying to do is estimate a confidence interval for the Gini by using the bootstrap. Make sense?
paulvonhippel

Join Date: Apr 2014

Posts: 496
#10

25 Jan 2015, 10:05

P.S. And I'd like to generate the bootstrap samples using only the multinomial frequencies from the original data. I don't want to assume anything further about the underlying distribution, which may or may not be generalized beta etc.
Stephen Jenkins

Join Date: Apr 2014

Posts: 1425
#11

25 Jan 2015, 10:47

I've come across Stata's rmultinom() function

Would you give a more precise reference please? I can't find this (which may be my obtuseness). I know of binomial functions in Stata and Mata, but not that multinomial one. [If I had found one, I wouldn't have written the code in post #5!] Maarten's recommendations to use uniform() are in effect implementing the sort of algorithm in the Wikipedia article I cited, I think. I've used that approach in other work (data generation for Monte Carlo analysis; working with unit record data, however, not a table of frequencies.)

But thanks for setting out more clearly what you actually want to do. Got it! However, I am now led to ask whether your cart is a bit before your horse. You have grouped data (published US Census data?), so there are serious issues to consider about the estimation of inequality indices per se, let alone their sampling variances. Which estimator is "best" to use also depends on how much "information" you have, including e.g. the mean within each band, and what you know about the top interval (typically open). Yes, parametric estimators are one way to go. (BTW they are not my favoured approach necessarily; it depends what one is trying to do.) Indeed, I like non-parametrically estimated indices, which are what you want. One of my favourite articles on this is:
Cowell, F.A. and Mehta, F. 1982. The estimation and interpolation of inequality measures. Review of Economic Studies 49 (2): 273-290. This also reviews previous literature (about placing bounds, and getting point estimates of inequality indices like the Gini). In short, won't your resampling design also depend on which estimator you use? Whatever, Cowell and Mehta also refer to SEs and CIs in their empirical section (see also footnote 15 re methods). Given your samples are likely large, won't a linearization formula for the SE work as well as bootstrapping?
paulvonhippel

Join Date: Apr 2014

Posts: 496
#12

25 Jan 2015, 15:26

Sorry, I misspoke. rmultinom() is a function in R. The closest thing in the Stata environment is rdiscrete(), but that's actually part of Mata. So I'm starting to think that the type of resampling I want to do will be a bit of work -- not excessive, but not as easy as just invoking the bootstrap command.
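For what it's worth, a minimal Mata sketch along these lines might look like the following. It assumes the toy 40:30:30 table from post #1, and the names nfreq, p, draws and counts are just illustrative. Note that it simulates every one of the 100 draws for each of the 200 samples, so memory use grows with the total frequency.

```stata
mata:
rseed(12345)
nfreq  = (40 \ 30 \ 30)              // frequencies from the original table
p      = nfreq / sum(nfreq)          // multinomial probabilities
draws  = rdiscrete(100, 200, p)      // 100 draws (rows) for each of 200 samples (columns)
counts = J(200, 3, 0)                // one row per sample, one column per category
for (k = 1; k <= 3; k++) {
    counts[., k] = colsum(draws :== k)'
}
counts[(1::5), .]                    // frequencies for the first five samples
end
```

Each row of counts then holds one bootstrap frequency table, and each row sums to 100.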

Regarding the estimation of inequality: as you might imagine there is a larger project beyond my question about the bootstrap. I can share that with Stephen separately, outside of Statalist.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#13

25 Jan 2015, 16:38

Interesting problem, and one I'm not familiar with. I do have a couple of thoughts, perhaps way off the mark. If you have census data, then, presumably, there is no sampling error, though there might be measurement error. But the re-sampling approach would estimate non-existent sampling error, unless you want to take a super-population modeling approach. If, on the other hand, you have sample survey data, then why not base confidence intervals on the sample design?

Steve Samuels
Statistical Consulting

Stata 14.2
paulvonhippel

Join Date: Apr 2014

Posts: 496
#14

25 Jan 2015, 18:45

Unfortunately income distributions are estimated from samples, not populations. I'm not sure what I can do about the sampling design without unit-level data or a published design effect.
Maarten Buis

Join Date: Mar 2014

Posts: 3426
#15

26 Jan 2015, 01:30

Originally posted by Stephen Jenkins

Maarten's recommendations to use uniform() are in effect implementing the sort of algorithm in the Wikipedia article I cited, I think.

That is correct.

For those following this thread: the complete reference to that Stata tip is: M.L. Buis (2007), "Stata tip 48: Discrete uses for uniform()", The Stata Journal, 7(3), pp. 434-435. It can be freely downloaded here: http://www.stata-journal.com/article...article=pr0032

Note that I wrote that tip when runiform() was still called uniform(). The code in that article still works (StataCorp does a great job in ensuring that old code continues to work on newer versions of Stata), but if I were to write such code now I would replace all occurrences of uniform() with runiform().
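As a rough illustration of that idea applied to the toy example in this thread (this is a sketch written for this thread, not the code from the tip itself): cut a uniform draw at the cumulative proportions .40 and .70, then collapse the simulated cases back to frequencies with contract. The variable names sample, u, x and n are illustrative. It creates one observation per simulated case, so it is only practical when the total frequency is modest.

```stata
clear
set seed 12345
set obs 200                          // one row per bootstrap sample
generate sample = _n
expand 100                           // 100 simulated cases per sample
generate double u = runiform()
generate byte x = cond(u < .4, 1, cond(u < .7, 2, 3))   // cut points at cumulative proportions
contract sample x, freq(n) zero      // back to a frequency table; zero keeps empty cells
list in 1/9, noobs sepby(sample)
```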

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------