Spatial Regression Discontinuity and Computing Distance to border

Maxence Morlet

Join Date: Mar 2021

Posts: 634
#1

Spatial Regression Discontinuity and Computing Distance to border

14 Apr 2022, 09:27

Hi all,

My data:

Code:

input str2 progState long ZIP double(LAT LNG) "GA" 30523 34.716543 -83.552878 "GA" 30274 33.554561 -84.399687 "GA" 30274 33.554561 -84.399687 "GA" 30274 33.554561 -84.399687 "GA" 30274 33.554561 -84.399687 "GA" 30274 33.554561 -84.399687 "GA" 30274 33.554561 -84.399687 "GA" 30274 33.554561 -84.399687 "GA" 31532 31.72207 -82.758822 "SC" 29102 33.641983 -80.188481 "SC" 29169 33.997527 -81.097406 "SC" 29169 33.997527 -81.097406 "SC" 29169 33.997527 -81.097406 "SC" 29169 33.997527 -81.097406 "SC" 29169 33.997527 -81.097406 "SC" 29169 33.997527 -81.097406 "SC" 29169 33.997527 -81.097406 "SC" 29169 33.997527 -81.097406

I would like to analyse a reform that has happened in South Carolina using spatial regression discontinuity. South Carolina shares a border with Georgia, so I would like to generate a running variable being the distance of each entity to the Georgia-South Carolina border. Distances within Georgia should receive a negative value.

The ZIP code, latitude and longitude you see are the ZIP codes, latitudes and longitudes of units as they are geographically positioned.

I would like to conduct a spatial RD specification resembling that of this paper: https://www.iza.org/publications/dp/...road-accidents

My problem is that of identifying the coordinates of the border points that are the closest to each unit, "as the crow flies". Once this is done, it would then be a case of using Picard's geodist package in order to compute distance, and multiplying distances in Georgia by -1 I presume.

I have taken a look at this thread: https://www.statalist.org/forums/for...nt-to-a-border, however Robert Picard already starts with a grid file for Mexico. In which case, I guess my question could be: would there be an equivalent grid file for South Eastern USA that I could use?

I was also informed that obtaining coordinates is possible using QGIS.

Bottom line is, any help with identifying coordinates on a US state border would be greatly appreciated . Please let me know if I need to furnish any additional information.
Tags: None

Jared Greathouse

Join Date: Sep 2021
Posts: 2170

14 Apr 2022, 11:08

So let's begin from... the beginning. Do you have a shapefile of the United States?

EDIT: Because if not, a good place to start is here

Code:

copy "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip" "cb_2018_us_zcta510_500k.zip", replace


// Grabbed from: https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html

Comment

Maxence Morlet

Join Date: Mar 2021

Posts: 634
#3

14 Apr 2022, 11:21

I do now thanks to you, thank you! I have run the code you have written, and it has downloaded files to my documents folder which I have now located.

Is it now a case of running Kevin Crow's SSC installed

Code:

shp2dta using "TM_WORLD_BORDERS-0.3.shp", data("us_data.dta") coor("us_coor.dta") replace

?

i.e. converting the shp / dbf files into data files?
Comment

Jared Greathouse

Join Date: Sep 2021
Posts: 2170

14 Apr 2022, 11:39

I don't know what version of Stata you use. If you have anything north of Stata 15.1, I think shp2dta is part of standard Stata software. But either way yes, convert this to a Stata shapefile and then what you'll want to do is merge your individual units (businesses were they?) to which state the entity is in using Picard's geoinpoly. Then, this directly matches up the businesses/entities you have to the state of interest. From there, I'd imagine some form of Picard's solution would work, although I've never implemented it before.

EDIT: I know you didn't ask me, but I'll just provide what I think is a useful tool for panel data researchers, as it seems like you are given that you study labor econ.

Shapefiles are a good way of quickly and bloodlessly collecting and building a panel dataset. Here's my code from a file for a paper I'm currently doing on Puerto Rico. I use Python to grab my datafiles, but that's just because I'm ultra-obsessed with replicability and having a precise trail of where I got stuff from.

Code:

cd "$MP/Raw Data" /* Changes to raw data to grab the
municipalities and abbreviations */

foreach name in dta {

** Grabs the shapefile data we're interested in
local files : dir "`c(pwd)'" files "*`name'*"

foreach f of loc files { // moves these back to the raw data directory
    
erase  `f'
}
}


cls

/******************************************************************************

***            Section 1

This section makes all the macros we'll need to do this.

The file name here is obtained directly from INEGI's website

******************************************************************************/
loc zf "cb_2020_us_county_500k"

loc ttime: disp (td(01aug2021)-td(01jan2021))+1  /* Obviously, this represents the total 
number of time periods expressed in Stata language. However, since this number can
change (what if the analysis was done through 2020?), I make a macro here. 

In context, it stands for total time periods.*/



/******************************************************************************

***            Section 2

For some stupid reason, when you attempt to use the copy command
multiple times, Stata complains with a web error 403. No clue
what this means and I don't care, selenium does the job
perfectly.

******************************************************************************/
cls
python:
import time, os
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select

options = webdriver.ChromeOptions()
preferences= {"download.default_directory": os.getcwd(), "directory_upgrade": True}
options.add_experimental_option("prefs", preferences)
#options.headless = True
options.add_experimental_option('excludeSwitches', ['enable-logging'])

url = "https://tinyurl.com/yfyb68st"

# Path of my WebDriver
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

wait = WebDriverWait(driver, 20)


# to maximize the browser window
driver.maximize_window()

#get method to launch the URL
driver.get(url)

path = "body > div.uscb-main-container > div > div > div.responsivegrid.uscb-main-responsivegrid.aem-GridColumn--tablet--12.aem-GridColumn--offset--tablet--0.aem-GridColumn--default--none.aem-GridColumn--phone--none.aem-GridColumn--phone--12.aem-GridColumn--tablet--none.aem-GridColumn.aem-GridColumn--default--7.aem-GridColumn--offset--phone--0.aem-GridColumn--offset--default--0 > div > div.pagetab.tab.aem-GridColumn.aem-GridColumn--default--12 > div.uscb-margin-LR-sm--20 > div.aem-Grid.aem-Grid--12.aem-Grid--default--12 > div:nth-child(24) > div > section > div > p:nth-child(2) > a:nth-child(1)"

wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, path))).click()

end
global file_ready = 1

while $file_ready {

sleep 1

local file : dir "`c(pwd)'" files "*.zip"

cap conf f `file' // make sure the file is here

if _rc ==7 {
disp "File loading..."
}

else if _rc ~= 7 {

global file_ready = 0

local file : dir "`c(pwd)'" files "*.zip"

disp `file'

disp c(current_time)


}

}

python:
from selenium import webdriver

driver.quit()
end

unzipfile `zf'.zip, replace

erase `zf'.zip

* Makes Stata format shapefile
spshape2dta cb_2020_us_county_500k.dbf, replace saving(us_final)

foreach name in cb {

** Grabs the shapefile data we're interested in
local files : dir "`c(pwd)'" files "*`name'*"

foreach f of loc files { // moves these back to the raw data directory
    
erase  `f'
}
}


cls

u us_final, clear

drop COUNTYNS A* LS* GEO

// NY FIPS = 36

destring *FP*, replace

sort STATEFP COUNTYFP

rename (STATEFP COUNTYFP NAME STATE_NAME) (sid cid county state)

keep if sid ==72 // FIPS ID for Puerto Rivo

sort sid cid 

cls
expand `ttime'

cap drop date day
loc t1: disp td(01jan2021)

loc t2: disp td(01aug2021)

qbys sid cid: egen date = seq(), f(`t1') t(`t2')

format date %td

sort sid cid date

labvars sid cid state date ///
"State ID" "County ID" "State" "Day"

order _ID sid cid state county  _CX _CY date

duplicates drop

xtset _ID date, d
cls

replace state = proper( state )

cls

sort sid cid date

g month=month(date), a(date)

g day=day(date), a(month)

lab var month "Month"

drop NAMELSAD STUSPS


sa "$MP/Master Data/masterpanel", replace

copy "$MP/Raw Data/us_final_shp.dta" "$MP/Master Data/us_final_shp.dta"

foreach name in dta {

** Grabs the shapefile data we're interested in
local files : dir "`c(pwd)'" files "*`name'*"

foreach f of loc files { // moves these back to the raw data directory
    
erase  `f'
}
}

cd "$MP/Replication Files/Management/Sub-Files"

Again, this is just my example, but what this means is that anytime I need to do a spatial merge, I can, and I also have the official IDs for each American state. I use the shapefiles to build my panel datasets, and relying on them here might be a good idea for you.

Last edited by Jared Greathouse; 14 Apr 2022, 11:49.

Comment

Maxence Morlet

Join Date: Mar 2021

Posts: 634
#5

14 Apr 2022, 11:44

I'm using Stata 17. The entities are programmes, in this case geographical locations of where apprenticeships are being conducted.

Thank you very much for pointing me in this direction, I'll repost once I got that far. In any case, I will try to follow advice issued by Robert Picard on existing threads.
Comment
Maxence Morlet

Join Date: Mar 2021

Posts: 634
#6

14 Apr 2022, 12:00

Thank you very much for providing an excerpt of your code; will definetely give it a look! And yes, I'm a doctoral student in microeconmetrics, you guessed right!
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#7

14 Apr 2022, 15:22

Maxence Morlet Oh I didn't guess, I just looked at your profile.
Comment

Scott Merryman

Join Date: Mar 2014
Posts: 895

14 Apr 2022, 15:45

Maxence Morlet

Here is an approximate method by calculating the distance from the ZIP points to the points that make the common border (one could also calculate the distance between the ZIP points and each line segment of the border):

Code:

clear*
input str2 progState long ZIP double(LAT LNG)
"GA" 30523 34.716543 -83.552878
"GA" 30274 33.554561 -84.399687
"GA" 30274 33.554561 -84.399687
"GA" 30274 33.554561 -84.399687
"GA" 30274 33.554561 -84.399687
"GA" 30274 33.554561 -84.399687
"GA" 30274 33.554561 -84.399687
"GA" 30274 33.554561 -84.399687
"GA" 31532  31.72207 -82.758822
"SC" 29102 33.641983 -80.188481
"SC" 29169 33.997527 -81.097406
"SC" 29169 33.997527 -81.097406
"SC" 29169 33.997527 -81.097406
"SC" 29169 33.997527 -81.097406
"SC" 29169 33.997527 -81.097406
"SC" 29169 33.997527 -81.097406
"SC" 29169 33.997527 -81.097406
"SC" 29169 33.997527 -81.097406
end
duplicates drop
save "zip_points", replace

copy "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_state_20m.zip" "cb_2018_us_state_20m.zip"
unzipfile cb_2018_us_state_20m.zip ,  replace
spshape2dta  cb_2018_us_state_20m, replace saving(us)

use us.dta,clear
//Keep GA & SC
keep if inlist(STATEFP, "45", "13")
save sc_ga,replace

//Get _ID values for GA & SC
levelsof _ID, clean separate(,)
use us_shp
keep if inlist(_ID, `=r(levels)')
save sc_ga_shp,replace

use  sc_ga,clear
spmap using  sc_ga_shp, id(_ID) point(data(zip_points) y(LAT) x(LNG)) label(data(zip_points) label(ZIP) pos(12)  y(LAT) x(LNG))

//Find common border points
use sc_ga_shp,clear
keep if _ID == 41
sort shape_order
//Drop duplicates
drop in 1/2
save shp_41, replace

use sc_ga_shp,clear
keep if _ID == 15
drop in 1/2
merge m:1 _X _Y using shp_41
keep if _m == 3
drop _m
sort  shape_order
cross using zip_points
sort ZIP shape_order

//Calculate distance ZIP points to border points
geodist   _Y _X LAT LNG  , miles gen(dist)
sort ZIP dist
by ZIP: keep if _n == 1
list

Comment

Maxence Morlet

Join Date: Mar 2021

Posts: 634
#9

15 Apr 2022, 02:36

Thanks! I'll give it a go.
Comment
Maxence Morlet

Join Date: Mar 2021

Posts: 634
#10

10 Nov 2023, 05:06

Dear Scott Merryman,

Thank you very much for your previous answer. In #8, you suggested "(one could also calculate the distance between the ZIP points and each line segment of the border)". How could that be done?

Many thanks in advance!

Best,
Maxence
Comment
Maxence Morlet

Join Date: Mar 2021

Posts: 634
#11

10 Nov 2023, 10:50

Scott Merryman one follow-up question, sorry: how could I add border segments to the code you provided in #8? In order to ensure that each unit on one side of the border is compared to the nearest unit on the other side of the border, within a segment (as in Dell, 2010 and the spatial RD literature)?

Many thanks in advance again
Comment

Scott Merryman

Join Date: Mar 2014
Posts: 895

#12

11 Nov 2023, 13:30

Maxence Morlet. Below is an example of calculating the distance between the border segments and ZIP code points. I am not sure I understand the follow-up question but you can increase the number segments using -tsfil- and then -ipolate- to fill in the coordinates. (note: Dell, M. (2010), The Persistent Effects of Peru's Mining Mita. Econometrica, 78: 1863-1903. https://scholar.harvard.edu/files/de...ecta8121_0.pdf)

Using the example from #8.

Code:

use sc_ga_shp,clear
keep if _ID == 15
drop in 1/2
merge m:1 _X _Y using shp_41
keep if _m == 3
drop _m
sort  shape_order
cross using zip_points
drop rec
sort ZIP shape_order

//From: https://stackoverflow.com/questions/849211/shortest-distance-between-a-point-and-a-line-segment

gen double px = _X - _X[_n-1]
gen double py = _Y - _Y[_n-1]
gen double norm = px*px + py*py
gen double u = ((LNG - _X[_n-1]) * px + (LAT - _Y[_n-1]) * py)
replace u = 1 if u > 1 & u != .
replace u = 0 if u<0  & u != .

generate double x = _X[_n-1] + u * px
generate double y = _Y[_n-1] + u * py
generate double dx = x - LNG
generate double dy = y - LAT
generate double dist =   sqrt(dx*dx + dy*dy)
geodist  y x LAT  LNG ,gen(distance) miles 
sort ZIP distance
by ZIP: keep if _n == 1
list ZIP LAT LNG x y distance


//Add more border segments, linear interpolation of border points
use sc_ga_shp,clear
keep if _ID == 15
drop in 1/2
merge m:1 _X _Y using shp_41
keep if _m == 3
drop _m
sort  shape_order
gen t = (_n-1)*10
rename (_Y _X) =_original
tsset t
tsfill
sort t 

ipolate _Y_o t, gen(_Y)
ipolate _X_o t, gen(_X)

cross using zip_points
drop rec
sort ZIP t
//calculate distance from point to line segments ...

Comment

Maxence Morlet

Join Date: Mar 2021

Posts: 634
#13

13 Nov 2023, 10:47

Dear Scott Merryman,

Thank you very much for this! Apologies, my questions in #10 and #11 were very unclear.

In #8, does the code calculate the distance between the ZIP points and the nearest border points from the ZIP points?

I understand that the difference between #8 and #12 is that #8 calculates the distance between points, whereas #12 calculates the distance between a point and a line segment.

In the below code, please could you explain the difference between the distance measured by the variable "distance" and the distance measured by the variable "dist"?

Code:

gen double px = _X - _X[_n-1] gen double py = _Y - _Y[_n-1] gen double norm = px*px + py*py gen double u = ((LNG - _X[_n-1]) * px + (LAT - _Y[_n-1]) * py) replace u = 1 if u > 1 & u != . replace u = 0 if u<0 & u != . generate double x = _X[_n-1] + u * px generate double y = _Y[_n-1] + u * py generate double dx = x - LNG generate double dy = y - LAT generate double dist = sqrt(dx*dx + dy*dy) geodist y x LAT LNG ,gen(distance) miles sort ZIP distance by ZIP: keep if _n == 1 list ZIP LAT LNG x y distance

Thank you very much again! You have been extremely helpful.
Comment

Maxence Morlet

Join Date: Mar 2021
Posts: 634

#14

17 Jan 2024, 08:57

Dear Scott Merryman,

Thank you so much for your help so far.

Apologies, I just realised I displayed the code in a way that is particularly difficult to read in #13. Here it is, more legible:

Code:

 gen double px = _X - _X[_n-1]  
gen double py = _Y - _Y[_n-1]  
gen double norm = px*px + py*py  
gen double u = ((LNG - _X[_n-1]) * px + (LAT - _Y[_n-1]) * py)  
replace u = 1 if u > 1 & u != .  
replace u = 0 if u<0  & u != .  
generate double x = _X[_n-1] + u * px  
generate double y = _Y[_n-1] + u * py  
generate double dx = x - LNG  
generate double dy = y - LAT  
generate double dist =   sqrt(dx*dx + dy*dy)  
geodist  y x LAT  LNG ,gen(distance) miles  
sort ZIP distance by ZIP: keep if _n == 1  
list ZIP LAT LNG x y distance

Really sorry to ask you this, but would you mind explaining what each chunk / group of lines of code does? And what its purpose is? This would be immensely helpful for me for the future!

Many thanks in advance!

Best regards!

Announcement