Random assignment of a categorical variable - with known frequency distribution - taking weights into account

Andreas Thiemann

Join Date: Feb 2017

Posts: 4
#1

20 Mar 2017, 08:29

Dear all,

I would like to kindly ask for suggestions on the following imputation/random assignment problem:

Let’s assume that a categorical variable X (values are 1,2,3,4) has the following distribution (based on ‘reference data’):

X          %    Cum. %
1        20%       20%
2        20%       40%
3        30%       70%
4        30%      100%
Total   100%

Now, let's have a look at my example data:

Code:

sysuse nlsw88, clear

Further, let's assume that the frequency weights are distributed in the following (very unequal) way:

Code:

gen weight = (2.25/(_n^(2.5)))*20000

Now, I generate the new variable X and randomly assign its values to the observations, replicating the initial distribution:

Code:

set seed 123
gen random = runiform()
gen x=0
replace x=1 if random < .2
replace x=2 if inrange(random,.2,.4)
replace x=3 if inrange(random,.4,.7)
replace x=4 if random>.7
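The same unweighted assignment can also be written in a single line by counting how many cut points the uniform draw has passed. This is only a compact sketch of the mechanism described above, not a change to it; storing random as a double is an assumption made here so that the comparisons with the cut points are not affected by float rounding.

```stata
sysuse nlsw88, clear
set seed 123
gen double random = runiform()

* each logical expression evaluates to 0 or 1, so x counts the cut points passed:
* random < .2 gives 1, [.2,.4) gives 2, [.4,.7) gives 3, >= .7 gives 4
gen byte x = 1 + (random >= .2) + (random >= .4) + (random >= .7)

tabulate x
```

Each observation falls into exactly one category, so no initial placeholder value such as x=0 is needed.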

Replicating the initial sample distribution has worked relatively well in the unweighted case:

Code:

tabulate x

However, when tabulating the weighted frequency distribution, the random assignment has not worked well:

Code:

tabulate x [aw=weight]

Does anyone have a suggestion how I can randomly assign a categorical variable, taking weights into account when they are unequally distributed?

Any suggestion is greatly appreciated!

Andreas

-----------------------------------------------------------
Plain Stata code:

Code:

sysuse nlsw88, clear
gen weight = (2.25/(_n^(2.5)))*20000
sum weight
set seed 123
gen random = runiform()
gen x=0
replace x=1 if random < .2
replace x=2 if inrange(random,.2,.4)
replace x=3 if inrange(random,.4,.7)
replace x=4 if random>.7
tab x
tab x [aw=weight]

Andrew Musau

Join Date: Oct 2014
Posts: 10077

20 Mar 2017, 12:02

You should search -help weights- to see the different kinds of weights that Stata has and how they work. Here are some points:

1. The weighting function matters (i.e., you cannot ignore its consequences). In your case,

gen weight = (2.25/(_n^(2.5)))*20000

you have a rapidly declining function which attributes most of the weight to the first observation in the dataset.

Code:

set obs 10
gen weight = (2.25/(_n^(2.5)))*20000
gen n= _n
scatter weight n, mlabel(n)

[Figure: weight.png, scatter plot of weight against observation number n]

2. The weighted share of x=1 using aweights (or fweights if weights are integers) can be calculated as follows in your example, as the sum of the weights for x=1 divided by the sum of all weights:

Code:

. sysuse nlsw88, clear
(NLSW, 1988 extract)

. gen weight = (2.25/(_n^(2.5)))*20000

. set seed 123

. gen random = runiform()
 
. gen x=0
 
. replace x=1 if random < .2
(439 real changes made)
 
. replace x=2 if inrange(random,.2,.4)
(440 real changes made)
 
. replace x=3 if inrange(random,.4,.7)
(695 real changes made)
 
. replace x=4 if random>.7
(672 real changes made)

. sum weight

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      weight |      2,246     26.8774    966.7423   .0001882      45000

. scalar total = r(N)*r(mean)

. sum weight if x==1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      weight |        439    3.395281    45.58292   .0001884   804.9845

. scalar total0 = r(N)*r(mean)

. scalar w1= (total0/total)

. di w1
.02469126
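The two points above can be checked together in a few lines. The sketch below repeats the thread's setup, then uses r(sum) from -summarize- to get the share of total weight carried by the first observation and the weighted share of every category. The scalar name wtotal is introduced here for illustration only. Going by the summary statistics shown above, the first observation alone should carry about 45000/(2,246*26.8774), or roughly three quarters, of the total weight.

```stata
sysuse nlsw88, clear
gen weight = (2.25/(_n^(2.5)))*20000
set seed 123
gen random = runiform()
gen x=0
replace x=1 if random < .2
replace x=2 if inrange(random,.2,.4)
replace x=3 if inrange(random,.4,.7)
replace x=4 if random>.7

* total weight (wtotal is a hypothetical name, not from the thread)
quietly summarize weight
scalar wtotal = r(sum)

* share of total weight carried by the first observation
display "share of observation 1: " %6.4f weight[1]/scalar(wtotal)

* weighted share of each category: sum of its weights over the total
forvalues k = 1/4 {
    quietly summarize weight if x == `k'
    display "x = `k': weighted share = " %6.4f r(sum)/scalar(wtotal)
}
```

Whichever category the first observation happens to land in will dominate the weighted table, regardless of how the other observations are assigned.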


Andreas Thiemann

Join Date: Feb 2017
Posts: 4

21 Mar 2017, 02:00

Hi Andrew,

thank you very much for your suggestions and the illustration of the weighting function! I am aware of the fact that the weights quickly become smaller. The underlying weighting function is an application of the probability density function of the Pareto distribution, multiplied by a constant. Unfortunately, I cannot change this function.

You are right, frequency weights probably would have been more appropriate in this example. However, even if I had integer frequency weights (that are unequally distributed) the problem remains. Modifying the 'weighting function' slightly:

Code:

. sysuse nlsw88, clear
(NLSW, 1988 extract)

.
. gen weight  = (2.25/(_n^(2.5)))*20000

. gen weight2=round(weight)+1

. sum weight weight2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      weight |      2,246     26.8774    966.7423   .0001882      45000
     weight2 |      2,246    27.86554    966.7429          1      45001

.
. set seed 123

. gen random = runiform()

.
. gen x=0

. replace x=1 if random < .2
(439 real changes made)

. replace x=2 if inrange(random,.2,.4)
(440 real changes made)

. replace x=3 if inrange(random,.4,.7)
(695 real changes made)

. replace x=4 if random>.7
(672 real changes made)

.
. tab x

          x |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        439       19.55       19.55
          2 |        440       19.59       39.14
          3 |        695       30.94       70.08
          4 |        672       29.92      100.00
------------+-----------------------------------
      Total |      2,246      100.00

. tab x [fw=weight2]

          x |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,926        3.08        3.08
          2 |     45,865       73.28       76.36
          3 |      9,099       14.54       90.90
          4 |      5,696        9.10      100.00
------------+-----------------------------------
      Total |     62,586      100.00

I was wondering whether Stata might provide a random assignment mechanism that takes weights into account.

Thanks again!


Stephen Jenkins

Join Date: Apr 2014

Posts: 1425
#4

21 Mar 2017, 05:18

Do you need something special? Here is a quick thought. You have categorised data; let r_k be the proportion of total obs in category k (your reference/target), for k = 1,...,K. You want to randomly assign obs to categories such that the post-randomisation raw proportion p_k, multiplied by the relevant weight w_k, satisfies p_k * w_k = r_k. Things are more complicated than this, because you have individual-level, not category-level, weights. But won't the solution use the same basic idea? I.e., in the random assignment you want to use something like r_k/w_k rather than r_k as you are currently doing.
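One possible way to let the individual-level weights drive the assignment is sketched below. It is not necessarily what is proposed above, and Stata has no built-in command for it as far as this thread establishes: it puts the observations in random order, accumulates the weights in that order, and cuts the running weight share at the target cumulative proportions (.2, .4, .7). The names wtotal and cumshare are introduced here for illustration.

```stata
sysuse nlsw88, clear
gen double weight = (2.25/(_n^(2.5)))*20000

* put the observations in random order
set seed 123
gen double random = runiform()
sort random

* running share of total weight in that random order
quietly summarize weight
scalar wtotal = r(sum)
gen double cumshare = sum(weight)/scalar(wtotal)

* cut the running share at the target cumulative proportions
gen byte x = 1
replace x = 2 if cumshare > .2
replace x = 3 if cumshare > .4
replace x = 4 if cumshare > .7

tabulate x
tabulate x [aw=weight]
```

This can only match the target as closely as the weights allow. With weights as concentrated as in the example, where a single observation carries most of the total, the category that receives that observation will necessarily exceed its target share and some categories may end up with very few observations. With less extreme weights, the weighted shares should land close to the targets while the unweighted shares drift away from them.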