  • Random assignment of values according to existing empirical distribution of other variable

    Hi,

    I have two populations in my data: parents (mothers and fathers) with their respective occupations, and children. I want to create a variable that randomly assigns occupations to children based on parental occupation. In other words, I want to reproduce the occupational distribution of parents in the children's population (e.g. if 60% of parents are doctors and 40% are lawyers, I want 60% of children to be doctors and 40% to be lawyers). Even better if the assignment could be conditional on some characteristics (say, education).

    If I had exactly the same number of observations for parents and children I could just assign random numbers (within the same range) to both and match them. Any idea of how to do it with different population sizes?

    Thanks!
    Last edited by Maria Ventura; 09 Dec 2020, 16:05. Reason: Added tags

  • #2
    Do you want parents' occupations to be assigned to their own children (difficult, I think), or do you want to use the overall distribution of parental occupations? And in what way would you want that assignment to be conditioned on some other variable? For example, I can imagine that you might want something like "if a child has 10 yr. of education, then assign an occupation chosen with equal probability from among all occupations held by parents with 10 yr. of education." Finally, I don't think there is any general way to do this "perfectly" with unequal sample sizes, but it should be possible to do it so that the children's distribution is sampled from the parental distribution. I'm not positive I know how to do this, but at least for me, these are necessary clarifications. Also, as described in the Statalist FAQ, can you post some example data?

    • #3
      Hi Mike,

      Thanks a lot for this! You are correct: I don't want occupations to be assigned to parents' own children, only according to the overall distribution. And your example is exactly what I meant by conditioning on other variables.

      An example:

      This would be the parents' sample:
      Parent_ID Parent_occupational_code Parent_education
      1 42 bachelor
      2 65 master
      3 23 master
      4 74 highschool
      5 42 bachelor
      6 65 phd
      7 74 highschool
      8 45 bachelor
      9 13 bachelor
      And the children's sample:
      Child_ID Child_education
      1 bachelor
      2 bachelor
      3 highschool
      4 master
      5 master
      6 highschool
      7 bachelor
      8 master
      9 highschool
      10 highschool
      11 bachelor
      12 bachelor
      13 bachelor
      14 phd
      And I'd want to add a variable in the second dataset with this randomly assigned occupation. Here for instance a half of the parents with education = bachelor are in the occupation corresponding to code 42. So half of children with a bachelor degree should have occupation 42 randomly assigned (happens to be an odd number here so happy to round up share to the closest integer).

      Thanks!
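
      For anyone who wants to reproduce the example, here is a minimal sketch that types the parents' sample above into Stata and tabulates occupation shares by education. The short variable names (parent_id, occ, educ) are assumptions standing in for the column headers in the post, not names from a real file.

      ```stata
      clear
      // Parents' sample copied from the post; variable names are hypothetical
      input byte parent_id int occ str10 educ
      1 42 bachelor
      2 65 master
      3 23 master
      4 74 highschool
      5 42 bachelor
      6 65 phd
      7 74 highschool
      8 45 bachelor
      9 13 bachelor
      end
      // Occupation shares within each education level (row percentages)
      tabulate educ occ, row
      // Shares among bachelor-educated parents only
      tabulate occ if educ == "bachelor"
      ```

      In this sample, code 42 accounts for two of the four bachelor-educated parents, which is the "half" referred to above.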

      • #4
        Here's one approach. It assigns occupational values for each child that are sampled from the distribution of values of all parents with the same value for education. If your parent and child files are too large, this might use too much memory to be feasible, in which case I think there are other more difficult ways to do the same thing.

        Code:
        // Simulate parent and child data to have something to illustrate the method.
        set seed 84673
        local Nocc = 10   // # of occ categories
        local Neduc = 5  // # of educ categories 
        local NParent = 1000   // sample sizes for parents and children
        local NChild = 2000
        clear
        set obs `NParent'
        gen int parent_id = _n
        gen int educ = ceil(runiform() * `Neduc')
        gen int occ = ceil(runiform() * `Nocc')
        gen byte parent = 1
        tempfile parents
        save `parents'
        clear
        set obs `NChild'
        gen int child_id = _n
        gen int educ = ceil(runiform() * `Neduc')
        gen byte parent = 0
        //
        // Now that we have data to play with, the real stuff starts.
        //
        // -joinby- will combine each child observation with all parent
        // observations that match on education. This creates multiple
        // observations with parent occupation values for each child.
        // If you need to condition on several variables, note that
        // -joinby- accepts a varlist.
        joinby educ using `parents'
        // Randomly choose one of the observations for each child.
        gen random = runiform()
        bysort child_id (random): keep if _n == 1
        drop random parent_id
        //
        //  Compare parent and child distributions at each education
        // level, just to check the result.
        append using `parents'
        // While of course not identical, occ distributions do not systematically differ.
        forval i = 1/`Neduc'  {
            tab2 occ parent if educ == `i', chi2 col
        }
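
        If the -joinby- step is too memory-hungry, one possible lighter variant is sketched below. It is not the method posted above: instead of pairing every child with every matching parent, it numbers the parents within each education cell, draws one position at random for each child, and merges on (educ, position). Like the -joinby- version, it samples with replacement from parents with the same education. The data are simulated, and the names pick, n_par, and cells are hypothetical helpers introduced for illustration.

        ```stata
        clear all
        set seed 84673
        // Simulated parent file (stands in for real data)
        set obs 1000
        gen int educ = ceil(runiform() * 5)
        gen int occ  = ceil(runiform() * 10)
        // Number each parent within its education cell: pick = 1, ..., cell size
        bysort educ: gen long pick = _n
        by educ: gen long n_par = _N
        tempfile parents cells
        save `parents'
        // One row per education level, holding the number of parents in that cell
        keep educ n_par
        duplicates drop
        save `cells'
        // Simulated child file
        clear
        set obs 2000
        gen long child_id = _n
        gen int educ = ceil(runiform() * 5)
        // Attach the parent cell size; children whose education level has
        // no parents at all are dropped by keep(match)
        merge m:1 educ using `cells', keep(match) nogenerate
        // Draw one parent position uniformly from 1..n_par for every child
        gen long pick = ceil(runiform() * n_par)
        // Fetch the occupation of the drawn parent
        merge m:1 educ pick using `parents', keep(match) keepusing(occ) nogenerate
        drop pick n_par
        ```

        The result has one row per child, so memory use grows with the number of children rather than with the number of child-parent pairs. To condition on several variables, the same variable list would need to replace educ in both the bysort and the two merges.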

        • #5
          Hi Mike thank you very much, I think this works!

          I am still slightly confused by the distribution comparison you do at the very end. Do you think that is better than, for instance, keeping the randomly assigned occupation and the parental occupation in two different columns and tabulating those instead?

          Thanks!

          • #6
            OK, never mind. Please correct me if I am wrong, but I guess what I said wouldn't make sense: I'd be looking at all the different combinations of parental and randomized occupations, when in fact I only care about whether the shares of each occupation for parent = 1 or 0 are significantly different, right?

            Thanks!

              • #7
              "...whether the shares of each occupation for parent = 1 or 0 are significantly different, right?" Yes, that would be my understanding of what I think you would want. And, by the way, I did the statistical comparison of the distributions at the end just as a check on whether my code actually worked the way I thought it should.

              Now, I have a broader comment: I wonder if what you asked for is the best way to achieve your ultimate goal. For example, I can imagine that for certain purposes, some kind of matching method might be better. On that count, *perhaps* you might want to explain the context and purpose of this procedure, as maybe someone (not necessarily me <grin>) might have a different/better idea of how to do that. On the other hand, you might have reason to be certain that what you asked for is the best choice.

              • #8
                Another approach that is likely faster - get the specific % for each occupation first, then allocate the kids:

                Code:
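                use parentfile, clear
                levelsof parent_occ_codes, local(occs)
                local i = 0
                local p0 = 0
                foreach O of local occs {
                   local ++i
                   local prev = `i' - 1
                   count if parent_occ_codes == `O'
                   local p`i' = `p`prev'' + r(N)/_N        // note these are cumulative shares
                   local occ_`i' = `O'
                }

                use childfile, clear
                set seed 20201211
                gen rnd = runiform()
                sort rnd
                gen child_occ_code = .
                forvalues j = 1/`i' {
                   // fill the next block of the shuffled children, up to the cumulative share
                   replace child_occ_code = `occ_`j'' if mi(child_occ_code) & _n <= `p`j'' * _N
                }
                // rounding in the cumulative shares can leave the last rows unassigned
                replace child_occ_code = `occ_`i'' if mi(child_occ_code)
                drop rnd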
                NB, I didn't test this.


                hth
                Jeph
