  • Random assignment of a categorical variable - with known frequency distribution - taking weights into account

    Dear all,

    I would like to kindly ask for suggestions on the following imputation/random assignment problem:

    Let’s assume that a categorical variable X (values are 1,2,3,4) has the following distribution (based on ‘reference data’):
    X % Cum. %
    1 20% 20%
    2 20% 40%
    3 30% 70%
    4 30% 100%
    Total 100%

    Now, let's have a look at my example data:

    Code:
    sysuse nlsw88, clear
    Further, let's assume that the frequency weights are distributed in the following (very unequal) way:

    Code:
    gen weight = (2.25/(_n^(2.5)))*20000
    Now, I generate the new variable X and randomly assign its values to the observations, replicating the initial distribution:

    Code:
    set seed 123
    gen random = runiform()
    
    gen x=0
    replace x=1 if random < .2
    replace x=2 if inrange(random,.2,.4)
    replace x=3 if inrange(random,.4,.7)
    replace x=4 if random>.7
    Replicating the initial sample distribution has worked relatively well in the unweighted case:

    Code:
    tabulate x
    [Image: tabx.png]

    However, when tabulating the weighted frequency distribution, the random assignment has not worked well:

    Code:
    tabulate x [aw=weight]

    [Image: tabxw.png]

    Does anyone have a suggestion for how I can randomly assign a categorical variable so that the weighted distribution matches the reference distribution, even when the weights are very unequally distributed?

    Any suggestion is greatly appreciated!

    Andreas


    -----------------------------------------------------------
    Plain Stata code:

    Code:
    sysuse nlsw88, clear
    
    gen weight = (2.25/(_n^(2.5)))*20000
    sum weight
    
    set seed 123
    gen random = runiform()
    
    gen x=0
    replace x=1 if random < .2
    replace x=2 if inrange(random,.2,.4)
    replace x=3 if inrange(random,.4,.7)
    replace x=4 if random>.7
    
    tab x
    tab x [aw=weight]
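
    The four -replace- statements above can also be collapsed into a single call to irecode(). This is only a minimal sketch, assuming the same cut-offs (.2, .4, .7); the variable name x2 is introduced here for illustration and is not part of the original code. It treats a draw that lands exactly on a cut-off slightly differently from the inrange() version, which has negligible probability with runiform().

    ```stata
    sysuse nlsw88, clear
    set seed 123
    gen random = runiform()

    * irecode() returns 0 if random <= .2, 1 if .2 < random <= .4,
    * 2 if .4 < random <= .7, and 3 if random > .7; adding 1 gives 1..4
    gen byte x2 = 1 + irecode(random, .2, .4, .7)

    tab x2
    ```

    Like the original code, this ignores the weights entirely, so it reproduces the target distribution only in the unweighted tabulation.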

  • #2
    You should read -help weights- to see the different kinds of weights that Stata has and how they work. Here are some points:

    1. The weighting function matters (i.e., you cannot ignore its consequences). In your case,

    gen weight = (2.25/(_n^(2.5)))*20000
    you have a rapidly declining function which attributes most of the weight to the first observation in the dataset.
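
    To put a number on "most of the weight", one can divide the first observation's weight by the total. This is a small sketch, assuming the nlsw88 example data and the weight formula from post #1; storing weight as double is a choice made here, not in the original code.

    ```stata
    sysuse nlsw88, clear
    gen double weight = (2.25/(_n^(2.5)))*20000

    * after -summarize-, r(sum) holds the total of weight
    quietly summarize weight
    display "share of total weight held by observation 1: " %5.3f weight[1]/r(sum)
    ```

    Going by the -summarize- output shown in this thread (maximum 45000, mean 26.8774 over 2,246 observations), that share is roughly three quarters, so whichever category the first observation falls into will dominate any weighted tabulation.
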


    Code:
    set obs 10
    gen weight = (2.25/(_n^(2.5)))*20000
    gen n= _n
    scatter weight n, mlabel(n)
    [Image: weight.png]

    2. The weighted share of x=1 under aweights (or fweights, if the weights are integers) can be calculated as follows in your example:



    Code:
    . sysuse nlsw88, clear
    (NLSW, 1988 extract)
    
    . gen weight = (2.25/(_n^(2.5)))*20000
    
    . set seed 123
    
    . gen random = runiform()
     
    . gen x=0
     
    . replace x=1 if random < .2
    (439 real changes made)
     
    . replace x=2 if inrange(random,.2,.4)
    (440 real changes made)
     
    . replace x=3 if inrange(random,.4,.7)
    (695 real changes made)
     
    . replace x=4 if random>.7
    (672 real changes made)
    
    . sum weight
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
          weight |      2,246     26.8774    966.7423   .0001882      45000
    
    . scalar total = r(N)*r(mean)
    
    . sum weight if x==1
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
          weight |        439    3.395281    45.58292   .0001884   804.9845
    
    . scalar total0 = r(N)*r(mean)
    
    . scalar w1= (total0/total)
    
    . di w1
    .02469126
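
    The same calculation can be looped over all four categories. A minimal sketch, assuming the variables are built exactly as in post #1; it uses r(sum), which -summarize- returns directly and which equals r(N)*r(mean).

    ```stata
    sysuse nlsw88, clear
    gen weight = (2.25/(_n^(2.5)))*20000
    set seed 123
    gen random = runiform()
    gen x = 0
    replace x = 1 if random < .2
    replace x = 2 if inrange(random, .2, .4)
    replace x = 3 if inrange(random, .4, .7)
    replace x = 4 if random > .7

    * total weight over all observations
    quietly summarize weight
    scalar total = r(sum)

    * weighted share of each category = sum of its weights / total weight
    forvalues k = 1/4 {
        quietly summarize weight if x == `k'
        display "x = `k': weighted share = " %6.4f r(sum)/total
    }
    ```

    These shares should agree with the Percent column of -tab x [aw=weight]-, divided by 100.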



    • #3
      Hi Andrew,

      thank you very much for your suggestions and the illustration of the weighting function! I am aware of the fact that the weights quickly become smaller. The underlying weighting function is an application of the probability density function of the Pareto distribution, multiplied by a constant. Unfortunately, I cannot change this function.

      You are right, frequency weights probably would have been more appropriate in this example. However, even if I had integer frequency weights (that are unequally distributed), the problem would remain. Modifying the 'weighting function' slightly:

      Code:
      . sysuse nlsw88, clear
      (NLSW, 1988 extract)
      
      .
      . gen weight  = (2.25/(_n^(2.5)))*20000
      
      . gen weight2=round(weight)+1
      
      . sum weight weight2
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
            weight |      2,246     26.8774    966.7423   .0001882      45000
           weight2 |      2,246    27.86554    966.7429          1      45001
      
      .
      . set seed 123
      
      . gen random = runiform()
      
      .
      . gen x=0
      
      . replace x=1 if random < .2
      (439 real changes made)
      
      . replace x=2 if inrange(random,.2,.4)
      (440 real changes made)
      
      . replace x=3 if inrange(random,.4,.7)
      (695 real changes made)
      
      . replace x=4 if random>.7
      (672 real changes made)
      
      .
      . tab x
      
                x |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                1 |        439       19.55       19.55
                2 |        440       19.59       39.14
                3 |        695       30.94       70.08
                4 |        672       29.92      100.00
      ------------+-----------------------------------
            Total |      2,246      100.00
      
      . tab x [fw=weight2]
      
                x |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                1 |      1,926        3.08        3.08
                2 |     45,865       73.28       76.36
                3 |      9,099       14.54       90.90
                4 |      5,696        9.10      100.00
      ------------+-----------------------------------
            Total |     62,586      100.00


      I was wondering whether Stata might provide a random assignment mechanism that takes weights into account.

      Thanks again!




      • #4
        Do you need something special? Here is a quick thought. You have categorised data; let r_k be the proportion of total obs in category k (your reference/target), for k = 1,...,K. You want to randomly assign obs to a category such that the post-randomised raw proportion p_k, multiplied by the relevant weight w_k, satisfies p_k * w_k = r_k. Things are more complicated than this -- because you have individual-level, not category-level, weights. But won't the solution use the same basic idea? I.e., in the random assignment you want to use something like r_k/w_k rather than r_k, as you are currently doing.
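
        One way to bring individual-level weights into the assignment is sketched below. Note that this is a different technique from the r_k/w_k adjustment described above: it puts the observations in random order and cuts the running share of total weight at the reference cut-offs (.2, .4, .7). It is only a sketch under the assumptions of the example in post #1; the names wtotal and cumshare are introduced here for illustration.

        ```stata
        sysuse nlsw88, clear
        gen double weight = (2.25/(_n^(2.5)))*20000

        set seed 123
        gen double random = runiform()
        sort random                      // put observations in random order

        * running share of total weight; rises to (about) 1 at the last observation
        quietly summarize weight
        scalar wtotal = r(sum)
        gen double cumshare = sum(weight)/wtotal

        * cut the running weighted share at the reference cumulative percentages
        gen byte x = 1
        replace x = 2 if cumshare > .2
        replace x = 3 if cumshare > .4
        replace x = 4 if cumshare > .7

        tab x
        tab x [aw=weight]
        ```

        Two caveats apply. An observation is always assigned whole, so the weighted shares can match the target only as closely as the largest single weight allows; with one observation carrying most of the total weight, as in this example, no assignment can reproduce 20/20/30/30 in the weighted tabulation. And because the cut-offs are applied to weight rather than to counts, the unweighted distribution of x will generally drift away from the reference.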

