Stata Win ver 13.1
Here's my problem: I have 2 files. One is a record file where each record has a 6 digit postal code. The second file is an "allocation" file that has 108 neighbourhoods and the postal codes that map to that neighbourhood. Because the postal code and neighbourhood boundaries overlap, each postal code is assigned a weight which denotes the proportion of records with that postal code that should be allocated to a particular neighbourhood.
Are there any recommendations on how to approach this? I've used a somewhat mechanistic approach as below and I'm concerned I'm overlooking something more concise and elegant.
The data format in the second file would appear as below if sorted by postal code:
If there were 200 records with 100 having each postal code, N1 would get 96 records, N2 would get 54 records and N2 would get 50.
The approach I've used is to use joinby to get every valid combination of postal code to neighbourhood and then, using -levelsof pcodes- (postalcode), iterating through each postal code and assign the records as randomly as possible to the neighbourhood (nhoods) using the weight. The rather convoluted code is below. Some randomization process in the code was my attempt to see how much, if at all, allocation varied between runs on the same data. I've included the comments because not all the steps are intuitive.
The problem I'm now having is that, when using large record files, -levelsof pcodes- generates a macro that exceeds the allowable length. Starting with neighbourhood may fix that, but this problem makes me think that I'm probably overlooking something fairly straightforward to accomplish what I'm trying to do.
Any suggestions would be greatly appreciated.
-cam
Here's my problem: I have 2 files. One is a record file where each record has a 6 digit postal code. The second file is an "allocation" file that has 108 neighbourhoods and the postal codes that map to that neighbourhood. Because the postal code and neighbourhood boundaries overlap, each postal code is assigned a weight which denotes the proportion of records with that postal code that should be allocated to a particular neighbourhood.
Are there any recommendations on how to approach this? I've used a somewhat mechanistic approach as below and I'm concerned I'm overlooking something more concise and elegant.
The data format in the second file would appear as below if sorted by postal code:
Neighbourhood | Postal Code | Weight |
N1 | K0A1A0 | 0.96 |
N2 | K0A1A0 | 0.04 |
N2 | K0A1B0 | 0.50 |
N3 | K0A1B0 | 0.50 |
The approach I've used is to use joinby to get every valid combination of postal code to neighbourhood and then, using -levelsof pcodes- (postalcode), iterating through each postal code and assign the records as randomly as possible to the neighbourhood (nhoods) using the weight. The rather convoluted code is below. Some randomization process in the code was my attempt to see how much, if at all, allocation varied between runs on the same data. I've included the comments because not all the steps are intuitive.
The problem I'm now having is that, when using large record files, -levelsof pcodes- generates a macro that exceeds the allowable length. Starting with neighbourhood may fix that, but this problem makes me think that I'm probably overlooking something fairly straightforward to accomplish what I'm trying to do.
Any suggestions would be greatly appreciated.
-cam
Code:
use records.dta, clear gen original=_n /* generates a sequential number for each original record*/ joinby pcodes using "nhood_postalcode.dta", unmatched(none) egen tag=tag(original) /* select only one instance of each original record */ gen random=runiform() /* generates a random number for each record */ gen assigned=. /* Use to determine if a record has already been assigned to a neighbourhood*/ * Assign a random number to each neighbourhood otherwise assignment is always alphabetical********* egen taghood=tag(nhood) /* indexes neighbourhood */ egen hood=cut(random) if taghood==1, group(108) /* assigns a random integer from the number of neighbourhoods */ bysort nhoods: egen tothoods=total(hood) /* assigns all neighbourhood value the random cut value "hood" */ labmask hood, values(tothoods) /* labels the random cut value with the neighbourhood name */ drop nhoods rename tothoods nhoods drop tothoods taghood ***************************************************** * Attribute records to the neighbourhoods by postal code according to the weights **************************************************** levelsof pcodes, loc(pcodes) foreach pcode of loc pcodes { sum tag if pcodes=="`pcode'" /* How many people in the original dataset belong to a given postal code */ loc denominator `r(sum)' levelsof nhoods if pcodes=="`pcode'",loc(nhoods) /* All neighbourhoods included in the given postal code */ gsort pcode -tag random /* Sorts to put the original records in random order at the start */ gen index=_n if pcodes=="`pcode'" /* Gives a temporary sequential number to each record within the postal code */ loc total=0 foreach nhood of loc nhoods{ sum proportion if nhoods==`nhood' & pcodes=="`pcode'" /* this is a unique value within nhood and pcode */ local p `r(mean)' /* Just a way of extracting the weight */ local assign=ceil(`denominator'*`p') /* The number of records from the total which are to be assigned to this nhood */ loc total=`total'+`assign' /* used with index as below allows you to assign sequential records */ replace assigned=`nhood' if index<=`total' & pcodes=="`pcode'" & missing(assigned) } drop index drop if tag!=1 & pcodes=="`pcode'" /* Keeps only original records within the postal code*/ }
Comment