Random assignment of values according to existing empirical distribution of other variable

Maria Ventura

Join Date: Jun 2018

Posts: 40
#1


09 Dec 2020, 16:04

Hi,

I have two populations in my data: a population of parents (mothers and fathers) with their respective occupations, and a population of children. I want to create a variable that randomly assigns occupations to children, based on parental occupation. In other words, I want to reproduce the occupational distribution of parents in the children's population (e.g. if 60% of parents are doctors and 40% are lawyers, I want 60% of children to be doctors and 40% to be lawyers). Even better if that could be conditional on some characteristics (say, education).

If I had exactly the same number of observations for parents and children I could just assign random numbers (within the same range) to both and match them. Any idea of how to do it with different population sizes?

Thanks!

Last edited by Maria Ventura; 09 Dec 2020, 16:05. Reason: Added tags
Tags: categorical, data, distribution, random
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

09 Dec 2020, 20:24

Do you want parents' occupations to be assigned to their own children (difficult, I think), or do you want to use the overall distribution of parental occupations? And in what way would you want that assignment to be conditioned on some other variable? For example, I can imagine that you might want something like "if a child has 10 yr. of education, then assign an occupation chosen with equal probability from among all occupations held by parents with 10 yr. of education." Finally, I don't think there is any general way to do this "perfectly" with unequal sample sizes, but it should be possible to do it so that the children's distribution is sampled from the parental distribution. I'm not positive I know how to do this, but at least for me, these are necessary clarifications. Also, as described in the StataList FAQ, can you post some example data?
Maria Ventura

Join Date: Jun 2018

Posts: 40
#3

10 Dec 2020, 04:15

Hi Mike,

Thanks a lot for this! You are correct: I don't want occupations to be assigned to parents' own children, only according to the overall distribution. And your example is exactly what I meant by conditioning on other variables.

An example:

This would be the parents' sample:

Parent_ID  Parent_occupational_code  Parent_education
1          42                        bachelor
2          65                        master
3          23                        master
4          74                        highschool
5          42                        bachelor
6          65                        phd
7          74                        highschool
8          45                        bachelor
9          13                        bachelor

And the children's sample:

Child_ID  Child_education
1         bachelor
2         bachelor
3         highschool
4         master
5         master
6         highschool
7         bachelor
8         master
9         highschool
10        highschool
11        bachelor
12        bachelor
13        bachelor
14        phd

And I'd want to add a variable to the second dataset with this randomly assigned occupation. Here, for instance, half of the parents with education = bachelor are in the occupation with code 42. So half of the children with a bachelor's degree should be randomly assigned occupation 42 (where a share does not give a whole number of children, I'm happy to round to the nearest integer).

Thanks!

Mike Lacy

Join Date: Apr 2014
Posts: 2416
#4

10 Dec 2020, 08:54

Here's one approach. It assigns occupational values for each child that are sampled from the distribution of values of all parents with the same value for education. If your parent and child files are too large, this might use too much memory to be feasible, in which case I think there are other more difficult ways to do the same thing.

Code:

// Simulate parent and child data to have something to illustrate the method.
set seed 84673
local Nocc = 10   // # of occ categories
local Neduc = 5  // # of educ categories 
local NParent = 1000   // sample sizes for parents and children
local NChild = 2000
clear
set obs `NParent'
gen int parent_id = _n
gen int educ = ceil(runiform() * `Neduc')
gen int occ = ceil(runiform() * `Nocc')
gen byte parent = 1
tempfile parents
save `parents'
clear
set obs `NChild'
gen int child_id = _n
gen int educ = ceil(runiform() * `Neduc')
gen byte parent = 0
//
// Now that we have data to play with, the real stuff starts.
//
// -joinby- will combine each child observation with all parent
// observations that match on education. This creates multiple
// observations with parent occupation values for each child.
// If you need to condition on several variables, note that
// -joinby- accepts a varlist.
joinby educ using `parents'
// Randomly choose one of the observations for each child.
gen random = runiform()
bysort child_id (random): keep if _n == 1
drop random parent_id
//
//  Compare parent and child distributions at each education
// level, just to check the result.
append using `parents'
// While of course not identical, occ distributions do not systematically differ.
forval i = 1/`Neduc'  {
    tab2 occ parent if educ == `i', chi2 col
}
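
For files where the full -joinby- expansion would not fit in memory, one possible lower-memory variant is to draw a random parent index for each child instead of pairing every child with every matching parent. This is only a sketch under assumptions: it reuses the simulated variable names from the code above (educ, occ, parent_id, child_id), while the names k, n_par, `indexed' and `counts' are invented here for illustration. Like the -joinby- version, it samples with replacement from the parents who share the child's education level.

```stata
// Simulated parent data, same layout as the example above (illustrative only).
clear
set seed 84673
set obs 1000
gen int parent_id = _n
gen int educ = ceil(runiform() * 5)
gen int occ = ceil(runiform() * 10)
// Number the parents 1..n_par within each education level.
bysort educ (parent_id): gen long k = _n
by educ: gen long n_par = _N
tempfile indexed counts
save `indexed'
// One row per education level, holding the number of parents at that level.
keep educ n_par
duplicates drop
save `counts'

// Simulated child data.
clear
set obs 2000
gen int child_id = _n
gen int educ = ceil(runiform() * 5)
// Attach the parent count, then draw an index uniformly from 1..n_par.
merge m:1 educ using `counts', keep(master match) nogenerate
gen long k = ceil(runiform() * n_par)
// Look up the occupation of the parent that has the drawn index.
merge m:1 educ k using `indexed', keep(master match) keepusing(occ) nogenerate
drop k n_par
```

The child file never grows beyond one row per child, at the cost of two merges. Children whose education level has no parents would be kept with a missing occ.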


Maria Ventura

Join Date: Jun 2018

Posts: 40
#5

11 Dec 2020, 08:25

Hi Mike, thank you very much. I think this works!

I am still slightly confused by the distribution comparison you do at the very end. Do you think that is better than, for instance, having the randomly assigned occupation and the parental occupation in two separate columns and tabulating those instead?

Thanks!
Maria Ventura

Join Date: Jun 2018

Posts: 40
#6

11 Dec 2020, 09:10

OK, never mind. Please correct me if I am wrong, but I guess what I said wouldn't make sense, as I'd be looking at all the different combinations of parental and randomized occupations, when in fact I only care about whether the shares of each occupation for parent = 1 or 0 are significantly different, right?

Thanks!
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#7

11 Dec 2020, 10:52

"...whether the shares of each occupation for parent = 1 or 0 are significantly different, right?" Yes, that would be my understanding of what I think you would want. And, by the way, I did the statistical comparison of the distributions at the end just as a check on whether my code actually worked the way I thought it should.

Now, I have a broader comment: I wonder if what you asked for is the best way to achieve your ultimate goal. For example, I can imagine that for certain purposes, some kind of matching method might be better. On that count, *perhaps* you might want to explain the context and purpose of this procedure, as maybe someone (not necessarily me <grin>) might have a different/better idea of how to do that. On the other hand, you might have reason to be certain that what you asked for is the best choice.

Jeph Herrin

Join Date: Apr 2014
Posts: 335
#8

11 Dec 2020, 12:08

Another approach that is likely faster - get the specific % for each occupation first, then allocate the kids:

Code:

use parentfile, clear
levelsof parent_occ_codes, local(occs)
local i = 0
local p0 = 0
foreach O of local occs {
   local ++i
   local prev = `i' - 1
   count if parent_occ_codes == `O'
   local p`i' = `p`prev'' + r(N)/_N      // note these are cumulative shares
   local occ_`i' = `O'
}

use childfile, clear
set seed 20201211
gen rnd = runiform()
sort rnd                                 // put the children in random order
gen child_occ_code = .
forvalues j = 1/`i' {
   // fill the next block of children, up to the cumulative share for occupation j
   replace child_occ_code = `occ_`j'' if mi(child_occ_code) & _n <= round(`p`j'' * _N)
}
drop rnd

NB, I didn't test this.

hth
Jeph
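
The same cumulative-share idea can be made conditional on education, as asked in #1. The following is a sketch only, not Jeph's code: it uses the example data from #3 with shortened variable names (occ, educ), and the names lower, upper, pos and `shares' are made up for illustration. Within each education level, children are put in random order and each occupation receives a block of children proportional to its share among parents with that education. Children whose education level has no parents would be dropped by -joinby- as written.

```stata
// Parent sample from the example in #3 (variable names shortened).
clear
input int parent_id int occ str10 educ
1 42 bachelor
2 65 master
3 23 master
4 74 highschool
5 42 bachelor
6 65 phd
7 74 highschool
8 45 bachelor
9 13 bachelor
end
// Cumulative share of each occupation within education level:
// each occupation owns the interval (lower, upper] of (0, 1].
contract educ occ, freq(n)
bysort educ (occ): gen double cum = sum(n)
by educ: gen double upper = cum / cum[_N]
by educ: gen double lower = cond(_n == 1, 0, upper[_n - 1])
drop cum
tempfile shares
save `shares'

// Child sample from the example in #3.
clear
input int child_id str10 educ
1 bachelor
2 bachelor
3 highschool
4 master
5 master
6 highschool
7 bachelor
8 master
9 highschool
10 highschool
11 bachelor
12 bachelor
13 bachelor
14 phd
end
set seed 20201211
gen double rnd = runiform()
// Random position strictly between 0 and 1 within each education level.
bysort educ (rnd): gen double pos = (_n - 0.5) / _N
// Pair each child with the occupation intervals for its education level,
// then keep the single occupation whose interval contains the child's position.
joinby educ using `shares'
keep if pos > lower & pos <= upper
drop rnd pos lower upper n
sort child_id
```

With the example data, three of the six children with a bachelor degree end up with occupation 42, matching the one-half share among parents with a bachelor degree; which three depends on the seed. Because -joinby- here multiplies each child only by the number of occupations at its education level, not by the number of parents, the intermediate file stays small.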
