Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Joining records using predefined weights

    Stata Win ver 13.1

    Here's my problem: I have 2 files. One is a record file where each record has a 6 digit postal code. The second file is an "allocation" file that has 108 neighbourhoods and the postal codes that map to that neighbourhood. Because the postal code and neighbourhood boundaries overlap, each postal code is assigned a weight which denotes the proportion of records with that postal code that should be allocated to a particular neighbourhood.

    Are there any recommendations on how to approach this? I've used a somewhat mechanistic approach as below and I'm concerned I'm overlooking something more concise and elegant.

    The data format in the second file would appear as below if sorted by postal code:

    Neighbourhood Postal Code Weight
    N1 K0A1A0 0.96
    N2 K0A1A0 0.04
    N2 K0A1B0 0.50
    N3 K0A1B0 0.50
    If there were 200 records with 100 having each postal code, N1 would get 96 records, N2 would get 54 records and N2 would get 50.

    The approach I've used is to use joinby to get every valid combination of postal code to neighbourhood and then, using -levelsof pcodes- (postalcode), iterating through each postal code and assign the records as randomly as possible to the neighbourhood (nhoods) using the weight. The rather convoluted code is below. Some randomization process in the code was my attempt to see how much, if at all, allocation varied between runs on the same data. I've included the comments because not all the steps are intuitive.

    The problem I'm now having is that, when using large record files, -levelsof pcodes- generates a macro that exceeds the allowable length. Starting with neighbourhood may fix that, but this problem makes me think that I'm probably overlooking something fairly straightforward to accomplish what I'm trying to do.

    Any suggestions would be greatly appreciated.

    -cam



    Code:
    use records.dta, clear
    gen original=_n  /* generates a sequential number for each original record*/
    
    joinby pcodes using "nhood_postalcode.dta", unmatched(none)
    
    egen tag=tag(original) /* select only one instance of each original record */
    gen random=runiform()   /* generates a random number for each record */
    gen assigned=.  /* Use to determine if a record has already been assigned to a neighbourhood*/
    * Assign a random number to each neighbourhood otherwise assignment is always alphabetical*********
    egen taghood=tag(nhood)  /* indexes neighbourhood */
    egen hood=cut(random) if taghood==1, group(108) /* assigns a random integer from the number of neighbourhoods */
    bysort nhoods: egen tothoods=total(hood)  /* assigns all neighbourhood value the random cut value "hood" */
    labmask hood, values(tothoods)  /* labels the random cut value with the neighbourhood name */
    drop nhoods
    rename tothoods nhoods
    drop tothoods taghood
    
    *****************************************************
    * Attribute records to the neighbourhoods by postal code according to the weights
    ****************************************************
    levelsof pcodes, loc(pcodes)
    foreach pcode of loc pcodes {
        sum tag if pcodes=="`pcode'" /*  How many people in the original dataset belong to a given postal code */
        loc denominator `r(sum)'
        levelsof nhoods if pcodes=="`pcode'",loc(nhoods)  /* All neighbourhoods included in the given postal code */
        gsort pcode -tag random /*  Sorts to put the original records in random order at the start */
        gen index=_n if pcodes=="`pcode'" /* Gives a temporary sequential number to each record within the postal code */
     
        loc total=0
            foreach nhood of loc nhoods{
                        sum proportion if nhoods==`nhood' & pcodes=="`pcode'"  /* this is a unique value within nhood and pcode */
                        local p `r(mean)'  /* Just a way of extracting the weight */
                        local assign=ceil(`denominator'*`p')  /* The number of records from the total which are to be assigned to this nhood */
                        loc total=`total'+`assign'  /* used with index as below allows you to assign sequential records */
                        replace assigned=`nhood' if index<=`total' &amp; pcodes=="`pcode'" & missing(assigned)
                        
            }
                
    drop index
    drop if tag!=1 & pcodes=="`pcode'"  /* Keeps only original records within the postal code*/
    }

  • #2
    Cam,

    I assume that a given zip code will contain a relatively limited number of postal codes. Accordingly, you might be better off reshaping your allocation file so that there is one record per postal code:

    Code:
    bysort PostalCode: gen seq=_n
    reshape wide Neighborhood Weight, i(PostalCode) j(seq)
    merge 1:m PostalCode using RecordFile
    gen nhood=""
    forval i=1/6 {     // Change "6" to the maximum number of Neighborhoods per Postal Code
       replace nhood=Neighborhood`i' if runiform()<=Weight`i'    // Modify to taste
    }
    // Add code here necessary to make sure that each record was assigned to a neighborhood
    Sorry, I didn't look very carefully at your code to see what variable names you used.

    One slight problem with this is that since the neighborhood assignments are assigned independently, you may end up with allocations that are slightly off from what you expect. One possible solution is to sort the records in a random order and then assign them sequentially:

    Code:
    gen r=runiform()
    forval i=1/6 {
      bysort PostalCode (r): replace nhood=Neighborhood`i' if _n<=_N*Weight`i'
    }
    You may still need to tweak this some, but the general idea should be right.

    Regards,
    Joe

    Comment


    • #3
      Hi Joe

      This is a very elegant approach, thanks very much. The approach is also blindingly fast compared to my joinby and reduction approach. It works pretty well as you've described with the minor change from

      Code:
      replace nhood=Neighborhood`i' if runiform()<=Weight`i'
      to
      Code:
      replace nhood=Neighborhood`i' if runiform()<=Weight`i'& !missing(Weight`i')
      and I've added

      Code:
      ds Neighborhood*
      local count: word count `r(varlist)'
      to generate the upper limit `i' for the number of neighbourhoods by postal code.

      Less than 2% of the records aren't assigned and I'll come up with a way to deal with that perhaps by looping through again for those records that haven't been assigned a postal code until there are none.

      I'm just looking at the allocations individually as opposed to by postalcode as you've described to see if they make a substantive enough difference to worry about. I've credited you in the code; I'm sure this script and any derivatives based on the approach will prove very useful.

      Thanks again for your help.

      -cam



      Comment


      • #4
        Cam,

        Glad to help. Good idea on generating the upper limit of the number of neighborhoods; I should probably do that more often myself.

        Regards,
        Joe

        Comment

        Working...
        X