Joining records using predefined weights

cammcdermaid

Join Date: May 2014
Posts: 23

Joining records using predefined weights

08 May 2014, 10:22

Stata Win ver 13.1

Here's my problem: I have 2 files. One is a record file where each record has a 6 digit postal code. The second file is an "allocation" file that has 108 neighbourhoods and the postal codes that map to that neighbourhood. Because the postal code and neighbourhood boundaries overlap, each postal code is assigned a weight which denotes the proportion of records with that postal code that should be allocated to a particular neighbourhood.

Are there any recommendations on how to approach this? I've used a somewhat mechanistic approach as below and I'm concerned I'm overlooking something more concise and elegant.

The data format in the second file would appear as below if sorted by postal code:

Neighbourhood	Postal Code	Weight
N1	K0A1A0	0.96
N2	K0A1A0	0.04
N2	K0A1B0	0.50
N3	K0A1B0	0.50

If there were 200 records with 100 having each postal code, N1 would get 96 records, N2 would get 54 records and N2 would get 50.

The approach I've used is to use joinby to get every valid combination of postal code to neighbourhood and then, using -levelsof pcodes- (postalcode), iterating through each postal code and assign the records as randomly as possible to the neighbourhood (nhoods) using the weight. The rather convoluted code is below. Some randomization process in the code was my attempt to see how much, if at all, allocation varied between runs on the same data. I've included the comments because not all the steps are intuitive.

The problem I'm now having is that, when using large record files, -levelsof pcodes- generates a macro that exceeds the allowable length. Starting with neighbourhood may fix that, but this problem makes me think that I'm probably overlooking something fairly straightforward to accomplish what I'm trying to do.

Any suggestions would be greatly appreciated.

-cam

Code:

use records.dta, clear
gen original=_n  /* generates a sequential number for each original record*/

joinby pcodes using "nhood_postalcode.dta", unmatched(none)

egen tag=tag(original) /* select only one instance of each original record */
gen random=runiform()   /* generates a random number for each record */
gen assigned=.  /* Use to determine if a record has already been assigned to a neighbourhood*/
* Assign a random number to each neighbourhood otherwise assignment is always alphabetical*********
egen taghood=tag(nhood)  /* indexes neighbourhood */
egen hood=cut(random) if taghood==1, group(108) /* assigns a random integer from the number of neighbourhoods */
bysort nhoods: egen tothoods=total(hood)  /* assigns all neighbourhood value the random cut value "hood" */
labmask hood, values(tothoods)  /* labels the random cut value with the neighbourhood name */
drop nhoods
rename tothoods nhoods
drop tothoods taghood

*****************************************************
* Attribute records to the neighbourhoods by postal code according to the weights
****************************************************
levelsof pcodes, loc(pcodes)
foreach pcode of loc pcodes {
    sum tag if pcodes=="`pcode'" /*  How many people in the original dataset belong to a given postal code */
    loc denominator `r(sum)'
    levelsof nhoods if pcodes=="`pcode'",loc(nhoods)  /* All neighbourhoods included in the given postal code */
    gsort pcode -tag random /*  Sorts to put the original records in random order at the start */
    gen index=_n if pcodes=="`pcode'" /* Gives a temporary sequential number to each record within the postal code */
 
    loc total=0
        foreach nhood of loc nhoods{
                    sum proportion if nhoods==`nhood' & pcodes=="`pcode'"  /* this is a unique value within nhood and pcode */
                    local p `r(mean)'  /* Just a way of extracting the weight */
                    local assign=ceil(`denominator'*`p')  /* The number of records from the total which are to be assigned to this nhood */
                    loc total=`total'+`assign'  /* used with index as below allows you to assign sequential records */
                    replace assigned=`nhood' if index<=`total' &amp; pcodes=="`pcode'" & missing(assigned)
                    
        }
            
drop index
drop if tag!=1 & pcodes=="`pcode'"  /* Keeps only original records within the postal code*/
}

Tags: None

Joe Canner

Join Date: Mar 2014

Posts: 580
#2

08 May 2014, 11:26

Cam,

I assume that a given zip code will contain a relatively limited number of postal codes. Accordingly, you might be better off reshaping your allocation file so that there is one record per postal code:

Code:

bysort PostalCode: gen seq=_n reshape wide Neighborhood Weight, i(PostalCode) j(seq) merge 1:m PostalCode using RecordFile gen nhood="" forval i=1/6 { // Change "6" to the maximum number of Neighborhoods per Postal Code replace nhood=Neighborhood`i' if runiform()<=Weight`i' // Modify to taste } // Add code here necessary to make sure that each record was assigned to a neighborhood

Sorry, I didn't look very carefully at your code to see what variable names you used.

One slight problem with this is that since the neighborhood assignments are assigned independently, you may end up with allocations that are slightly off from what you expect. One possible solution is to sort the records in a random order and then assign them sequentially:

Code:

gen r=runiform() forval i=1/6 { bysort PostalCode (r): replace nhood=Neighborhood`i' if _n<=_N*Weight`i' }

You may still need to tweak this some, but the general idea should be right.

Regards,
Joe
1 like
Comment
cammcdermaid

Join Date: May 2014

Posts: 23
#3

08 May 2014, 13:39

Hi Joe

This is a very elegant approach, thanks very much. The approach is also blindingly fast compared to my joinby and reduction approach. It works pretty well as you've described with the minor change from

Code:

replace nhood=Neighborhood`i' if runiform()<=Weight`i'

to

Code:

replace nhood=Neighborhood`i' if runiform()<=Weight`i'& !missing(Weight`i')

and I've added

Code:

ds Neighborhood* local count: word count `r(varlist)'

to generate the upper limit `i' for the number of neighbourhoods by postal code.

Less than 2% of the records aren't assigned and I'll come up with a way to deal with that perhaps by looping through again for those records that haven't been assigned a postal code until there are none.

I'm just looking at the allocations individually as opposed to by postalcode as you've described to see if they make a substantive enough difference to worry about. I've credited you in the code; I'm sure this script and any derivatives based on the approach will prove very useful.

Thanks again for your help.

-cam
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#4

08 May 2014, 13:55

Cam,

Glad to help. Good idea on generating the upper limit of the number of neighborhoods; I should probably do that more often myself.

Regards,
Joe
Comment

Announcement

Joining records using predefined weights

Comment

Comment

Comment