(Stata: Version 14.2)
Dear all,
I would like to kindly ask for suggestions on the following imputation/random assignment problem:
Let’s assume that a categorical variable X (values are 1,2,3,4) has the following distribution (based on ‘reference data’):
X | % | Cum. %
1 | 20% | 20%
2 | 20% | 40%
3 | 30% | 70%
4 | 30% | 100%
Total | 100% |
Now, let's have a look at my example data:
sysuse nlsw88, clear
// Further, let's assume that the frequency weights are distributed in the following (very unequal) way:
gen weight = (2.25/(_n^(2.5)))*20000
//Now, I generate the new variable X and randomly assign its values to the observations, replicating the initial distribution:
set seed 123
gen random = runiform()
gen x=0
replace x=1 if random < .2
replace x=2 if inrange(random,.2,.4)
replace x=3 if inrange(random,.4,.7)
replace x=4 if random>.7
//Replicating the initial sample distribution has worked relatively well in the unweighted case:
tabulate x
X | Freq | % | Cum. % |
1 | 439 | 19.55% | 19.55% |
2 | 440 | 19.59% | 39.14% |
3 | 695 | 30.94% | 70.08% |
4 | 672 | 29.92% | 100% |
Total | 2246 | 100% |
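However, when tabulating the weighted frequency distribution, the random assignment has not worked well:
tabulate x [aw=weight]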
X | Freq | % | Cum. % |
1 | 55.4565667 | 2.47% | 2.47% |
2 | 1,690.2808 | 75.26% | 77.73% |
3 | 312.920003 | 13.93% | 91.66% |
4 | 187.342635 | 8.34% | 100% |
Total | 2,246 | 100% |
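One way to see why the weighted tabulation drifts so far from the target is to check how concentrated the weights are. The following is a minimal diagnostic sketch using the same example data; the variable names wshare and cumshare are introduced here for illustration only and are not part of the original code.

```stata
sysuse nlsw88, clear
gen weight = (2.25/(_n^(2.5)))*20000

* Share of the total weight carried by each observation
quietly summarize weight
gen double wshare = weight / r(sum)

* The weights fall as _n rises, so the data are already in descending
* weight order; a running sum gives the share held by the k heaviest cases
gen double cumshare = sum(wshare)

list wshare cumshare in 1/5
```

With this weight formula the first observation alone carries roughly three quarters of the total weight, so whichever category it happens to draw dominates the weighted table. Note also that aweights are rescaled to sum to the number of observations, which is why the weighted total still reads 2,246.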
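Does anyone have a suggestion for randomly assigning a categorical variable while taking frequency weights into account when the weights are very unequally distributed?
Any suggestion is greatly appreciated!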
-----------------------------------------------------------
Plain Stata code:
sysuse nlsw88, clear
gen weight = (2.25/(_n^(2.5)))*20000
sum weight
set seed 123
gen random = runiform()
gen x=0
replace x=1 if random < .2
replace x=2 if inrange(random,.2,.4)
replace x=3 if inrange(random,.4,.7)
replace x=4 if random>.7
tab x
tab x [aw=weight]
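For discussion, here is a minimal sketch of one possible weight-aware variant of the plain code above, not a definitive solution. It assumes that it is acceptable to put the observations in random order and to cut the cumulative weight share, rather than the raw random draw, at the target percentages; cumw is a hypothetical helper variable introduced only for this illustration.

```stata
sysuse nlsw88, clear
gen weight = (2.25/(_n^(2.5)))*20000

set seed 123
gen double random = runiform()
sort random                      // put the observations in random order

* Cumulative share of the total weight along the random ordering
quietly summarize weight
gen double cumw = sum(weight) / r(sum)

* Cut the cumulative weight share (not the raw random draw) at the targets
gen byte x = 1 if cumw <= .2
replace x = 2 if cumw > .2 & cumw <= .4
replace x = 3 if cumw > .4 & cumw <= .7
replace x = 4 if cumw > .7

tab x
tab x [aw=weight]
```

With weights as extreme as in this example, the weighted shares can only get as close to 20/20/30/30 as the largest single weight allows: the one observation holding most of the weight pulls its category far above target wherever it lands, and some categories may even end up empty. The unweighted distribution of x will also generally no longer match the targets.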