Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Spatial Regression Discontinuity and Computing Distance to border

    Hi all,

    My data:

    Code:
    input str2 progState long ZIP double(LAT LNG)
    "GA" 30523 34.716543 -83.552878
    "GA" 30274 33.554561 -84.399687
    "GA" 30274 33.554561 -84.399687
    "GA" 30274 33.554561 -84.399687
    "GA" 30274 33.554561 -84.399687
    "GA" 30274 33.554561 -84.399687
    "GA" 30274 33.554561 -84.399687
    "GA" 30274 33.554561 -84.399687
    "GA" 31532  31.72207 -82.758822
    "SC" 29102 33.641983 -80.188481
    "SC" 29169 33.997527 -81.097406
    "SC" 29169 33.997527 -81.097406
    "SC" 29169 33.997527 -81.097406
    "SC" 29169 33.997527 -81.097406
    "SC" 29169 33.997527 -81.097406
    "SC" 29169 33.997527 -81.097406
    "SC" 29169 33.997527 -81.097406
    "SC" 29169 33.997527 -81.097406

    I would like to analyse a reform that has happened in South Carolina using spatial regression discontinuity. South Carolina shares a border with Georgia, so I would like to generate a running variable being the distance of each entity to the Georgia-South Carolina border. Distances within Georgia should receive a negative value.

    The ZIP code, latitude and longitude you see are the ZIP codes, latitudes and longitudes of units as they are geographically positioned.

    I would like to conduct a spatial RD specification resembling that of this paper: https://www.iza.org/publications/dp/...road-accidents

    My problem is that of identifying the coordinates of the border points that are the closest to each unit, "as the crow flies". Once this is done, it would then be a case of using Picard's geodist package in order to compute distance, and multiplying distances in Georgia by -1 I presume.

    I have taken a look at this thread: https://www.statalist.org/forums/for...nt-to-a-border, however Robert Picard already starts with a grid file for Mexico. In which case, I guess my question could be: would there be an equivalent grid file for South Eastern USA that I could use?

    I was also informed that obtaining coordinates is possible using QGIS.

    Bottom line is, any help with identifying coordinates on a US state border would be greatly appreciated . Please let me know if I need to furnish any additional information.

  • #2
    So let's begin from... the beginning. Do you have a shapefile of the United States?


    EDIT: Because if not, a good place to start is here
    Code:
    copy "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip" "cb_2018_us_zcta510_500k.zip", replace
    
    
    // Grabbed from: https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html

    Comment


    • #3
      I do now thanks to you, thank you! I have run the code you have written, and it has downloaded files to my documents folder which I have now located.

      Is it now a case of running Kevin Crow's SSC installed
      Code:
      shp2dta using "TM_WORLD_BORDERS-0.3.shp", data("us_data.dta") coor("us_coor.dta") replace
      ?

      i.e. converting the shp / dbf files into data files?

      Comment


      • #4
        I don't know what version of Stata you use. If you have anything north of Stata 15.1, I think shp2dta is part of standard Stata software. But either way yes, convert this to a Stata shapefile and then what you'll want to do is merge your individual units (businesses were they?) to which state the entity is in using Picard's geoinpoly. Then, this directly matches up the businesses/entities you have to the state of interest. From there, I'd imagine some form of Picard's solution would work, although I've never implemented it before.

        EDIT: I know you didn't ask me, but I'll just provide what I think is a useful tool for panel data researchers, as it seems like you are given that you study labor econ.

        Shapefiles are a good way of quickly and bloodlessly collecting and building a panel dataset. Here's my code from a file for a paper I'm currently doing on Puerto Rico. I use Python to grab my datafiles, but that's just because I'm ultra-obsessed with replicability and having a precise trail of where I got stuff from.
        Code:
        cd "$MP/Raw Data" /* Changes to raw data to grab the
        municipalities and abbreviations */
        
        foreach name in dta {
        
        ** Grabs the shapefile data we're interested in
        local files : dir "`c(pwd)'" files "*`name'*"
        
        foreach f of loc files { // moves these back to the raw data directory
            
        erase  `f'
        }
        }
        
        
        cls
        
        /******************************************************************************
        
        ***            Section 1
        
        This section makes all the macros we'll need to do this.
        
        The file name here is obtained directly from INEGI's website
        
        ******************************************************************************/
        loc zf "cb_2020_us_county_500k"
        
        loc ttime: disp (td(01aug2021)-td(01jan2021))+1  /* Obviously, this represents the total 
        number of time periods expressed in Stata language. However, since this number can
        change (what if the analysis was done through 2020?), I make a macro here. 
        
        In context, it stands for total time periods.*/
        
        
        
        /******************************************************************************
        
        ***            Section 2
        
        For some stupid reason, when you attempt to use the copy command
        multiple times, Stata complains with a web error 403. No clue
        what this means and I don't care, selenium does the job
        perfectly.
        
        ******************************************************************************/
        cls
        python:
        import time, os
        from selenium import webdriver
        from webdriver_manager.chrome import ChromeDriverManager
        from selenium.webdriver.support.ui import WebDriverWait
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support import expected_conditions as EC
        from selenium.webdriver.support.ui import Select
        
        options = webdriver.ChromeOptions()
        preferences= {"download.default_directory": os.getcwd(), "directory_upgrade": True}
        options.add_experimental_option("prefs", preferences)
        #options.headless = True
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        
        url = "https://tinyurl.com/yfyb68st"
        
        # Path of my WebDriver
        driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
        
        wait = WebDriverWait(driver, 20)
        
        
        # to maximize the browser window
        driver.maximize_window()
        
        #get method to launch the URL
        driver.get(url)
        
        path = "body > div.uscb-main-container > div > div > div.responsivegrid.uscb-main-responsivegrid.aem-GridColumn--tablet--12.aem-GridColumn--offset--tablet--0.aem-GridColumn--default--none.aem-GridColumn--phone--none.aem-GridColumn--phone--12.aem-GridColumn--tablet--none.aem-GridColumn.aem-GridColumn--default--7.aem-GridColumn--offset--phone--0.aem-GridColumn--offset--default--0 > div > div.pagetab.tab.aem-GridColumn.aem-GridColumn--default--12 > div.uscb-margin-LR-sm--20 > div.aem-Grid.aem-Grid--12.aem-Grid--default--12 > div:nth-child(24) > div > section > div > p:nth-child(2) > a:nth-child(1)"
        
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, path))).click()
        
        end
        global file_ready = 1
        
        while $file_ready {
        
        sleep 1
        
        local file : dir "`c(pwd)'" files "*.zip"
        
        cap conf f `file' // make sure the file is here
        
        if _rc ==7 {
        disp "File loading..."
        }
        
        else if _rc ~= 7 {
        
        global file_ready = 0
        
        local file : dir "`c(pwd)'" files "*.zip"
        
        disp `file'
        
        disp c(current_time)
        
        
        }
        
        }
        
        python:
        from selenium import webdriver
        
        driver.quit()
        end
        
        unzipfile `zf'.zip, replace
        
        erase `zf'.zip
        
        * Makes Stata format shapefile
        spshape2dta cb_2020_us_county_500k.dbf, replace saving(us_final)
        
        foreach name in cb {
        
        ** Grabs the shapefile data we're interested in
        local files : dir "`c(pwd)'" files "*`name'*"
        
        foreach f of loc files { // moves these back to the raw data directory
            
        erase  `f'
        }
        }
        
        
        cls
        
        u us_final, clear
        
        drop COUNTYNS A* LS* GEO
        
        // NY FIPS = 36
        
        destring *FP*, replace
        
        sort STATEFP COUNTYFP
        
        rename (STATEFP COUNTYFP NAME STATE_NAME) (sid cid county state)
        
        keep if sid ==72 // FIPS ID for Puerto Rivo
        
        sort sid cid 
        
        cls
        expand `ttime'
        
        cap drop date day
        loc t1: disp td(01jan2021)
        
        loc t2: disp td(01aug2021)
        
        qbys sid cid: egen date = seq(), f(`t1') t(`t2')
        
        format date %td
        
        sort sid cid date
        
        labvars sid cid state date ///
        "State ID" "County ID" "State" "Day"
        
        order _ID sid cid state county  _CX _CY date
        
        duplicates drop
        
        xtset _ID date, d
        cls
        
        replace state = proper( state )
        
        cls
        
        sort sid cid date
        
        g month=month(date), a(date)
        
        g day=day(date), a(month)
        
        lab var month "Month"
        
        drop NAMELSAD STUSPS
        
        
        sa "$MP/Master Data/masterpanel", replace
        
        copy "$MP/Raw Data/us_final_shp.dta" "$MP/Master Data/us_final_shp.dta"
        
        foreach name in dta {
        
        ** Grabs the shapefile data we're interested in
        local files : dir "`c(pwd)'" files "*`name'*"
        
        foreach f of loc files { // moves these back to the raw data directory
            
        erase  `f'
        }
        }
        
        cd "$MP/Replication Files/Management/Sub-Files"
        Again, this is just my example, but what this means is that anytime I need to do a spatial merge, I can, and I also have the official IDs for each American state. I use the shapefiles to build my panel datasets, and relying on them here might be a good idea for you.
        Last edited by Jared Greathouse; 14 Apr 2022, 11:49.

        Comment


        • #5
          I'm using Stata 17. The entities are programmes, in this case geographical locations of where apprenticeships are being conducted.

          Thank you very much for pointing me in this direction, I'll repost once I got that far. In any case, I will try to follow advice issued by Robert Picard on existing threads.

          Comment


          • #6
            Thank you very much for providing an excerpt of your code; will definetely give it a look! And yes, I'm a doctoral student in microeconmetrics, you guessed right!

            Comment


            • #7
              Maxence Morlet Oh I didn't guess, I just looked at your profile.

              Comment


              • #8
                Maxence Morlet

                Here is an approximate method by calculating the distance from the ZIP points to the points that make the common border (one could also calculate the distance between the ZIP points and each line segment of the border):

                Code:
                clear*
                input str2 progState long ZIP double(LAT LNG)
                "GA" 30523 34.716543 -83.552878
                "GA" 30274 33.554561 -84.399687
                "GA" 30274 33.554561 -84.399687
                "GA" 30274 33.554561 -84.399687
                "GA" 30274 33.554561 -84.399687
                "GA" 30274 33.554561 -84.399687
                "GA" 30274 33.554561 -84.399687
                "GA" 30274 33.554561 -84.399687
                "GA" 31532  31.72207 -82.758822
                "SC" 29102 33.641983 -80.188481
                "SC" 29169 33.997527 -81.097406
                "SC" 29169 33.997527 -81.097406
                "SC" 29169 33.997527 -81.097406
                "SC" 29169 33.997527 -81.097406
                "SC" 29169 33.997527 -81.097406
                "SC" 29169 33.997527 -81.097406
                "SC" 29169 33.997527 -81.097406
                "SC" 29169 33.997527 -81.097406
                end
                duplicates drop
                save "zip_points", replace
                
                copy "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_state_20m.zip" "cb_2018_us_state_20m.zip"
                unzipfile cb_2018_us_state_20m.zip ,  replace
                spshape2dta  cb_2018_us_state_20m, replace saving(us)
                
                use us.dta,clear
                //Keep GA & SC
                keep if inlist(STATEFP, "45", "13")
                save sc_ga,replace
                
                //Get _ID values for GA & SC
                levelsof _ID, clean separate(,)
                use us_shp
                keep if inlist(_ID, `=r(levels)')
                save sc_ga_shp,replace
                
                use  sc_ga,clear
                spmap using  sc_ga_shp, id(_ID) point(data(zip_points) y(LAT) x(LNG)) label(data(zip_points) label(ZIP) pos(12)  y(LAT) x(LNG))
                
                //Find common border points
                use sc_ga_shp,clear
                keep if _ID == 41
                sort shape_order
                //Drop duplicates
                drop in 1/2
                save shp_41, replace
                
                use sc_ga_shp,clear
                keep if _ID == 15
                drop in 1/2
                merge m:1 _X _Y using shp_41
                keep if _m == 3
                drop _m
                sort  shape_order
                cross using zip_points
                sort ZIP shape_order
                
                //Calculate distance ZIP points to border points
                geodist   _Y _X LAT LNG  , miles gen(dist)
                sort ZIP dist
                by ZIP: keep if _n == 1
                list

                Comment


                • #9
                  Thanks! I'll give it a go.

                  Comment


                  • #10
                    Dear Scott Merryman,

                    Thank you very much for your previous answer. In #8, you suggested "(one could also calculate the distance between the ZIP points and each line segment of the border)". How could that be done?

                    Many thanks in advance!

                    Best,
                    Maxence

                    Comment


                    • #11
                      Scott Merryman one follow-up question, sorry: how could I add border segments to the code you provided in #8? In order to ensure that each unit on one side of the border is compared to the nearest unit on the other side of the border, within a segment (as in Dell, 2010 and the spatial RD literature)?

                      Many thanks in advance again

                      Comment


                      • #12
                        Maxence Morlet. Below is an example of calculating the distance between the border segments and ZIP code points. I am not sure I understand the follow-up question but you can increase the number segments using -tsfil- and then -ipolate- to fill in the coordinates. (note: Dell, M. (2010), The Persistent Effects of Peru's Mining Mita. Econometrica, 78: 1863-1903. https://scholar.harvard.edu/files/de...ecta8121_0.pdf)


                        Using the example from #8.
                        Code:
                        use sc_ga_shp,clear
                        keep if _ID == 15
                        drop in 1/2
                        merge m:1 _X _Y using shp_41
                        keep if _m == 3
                        drop _m
                        sort  shape_order
                        cross using zip_points
                        drop rec
                        sort ZIP shape_order
                        
                        //From: https://stackoverflow.com/questions/849211/shortest-distance-between-a-point-and-a-line-segment
                        
                        gen double px = _X - _X[_n-1]
                        gen double py = _Y - _Y[_n-1]
                        gen double norm = px*px + py*py
                        gen double u = ((LNG - _X[_n-1]) * px + (LAT - _Y[_n-1]) * py)
                        replace u = 1 if u > 1 & u != .
                        replace u = 0 if u<0  & u != .
                        
                        generate double x = _X[_n-1] + u * px
                        generate double y = _Y[_n-1] + u * py
                        generate double dx = x - LNG
                        generate double dy = y - LAT
                        generate double dist =   sqrt(dx*dx + dy*dy)
                        geodist  y x LAT  LNG ,gen(distance) miles 
                        sort ZIP distance
                        by ZIP: keep if _n == 1
                        list ZIP LAT LNG x y distance
                        
                        
                        //Add more border segments, linear interpolation of border points
                        use sc_ga_shp,clear
                        keep if _ID == 15
                        drop in 1/2
                        merge m:1 _X _Y using shp_41
                        keep if _m == 3
                        drop _m
                        sort  shape_order
                        gen t = (_n-1)*10
                        rename (_Y _X) =_original
                        tsset t
                        tsfill
                        sort t 
                        
                        ipolate _Y_o t, gen(_Y)
                        ipolate _X_o t, gen(_X)
                        
                        cross using zip_points
                        drop rec
                        sort ZIP t
                        //calculate distance from point to line segments ...

                        Comment


                        • #13
                          Dear Scott Merryman,

                          Thank you very much for this! Apologies, my questions in #10 and #11 were very unclear.

                          In #8, does the code calculate the distance between the ZIP points and the nearest border points from the ZIP points?

                          I understand that the difference between #8 and #12 is that #8 calculates the distance between points, whereas #12 calculates the distance between a point and a line segment.

                          In the below code, please could you explain the difference between the distance measured by the variable "distance" and the distance measured by the variable "dist"?

                          Code:
                           
                           gen double px = _X - _X[_n-1] gen double py = _Y - _Y[_n-1] gen double norm = px*px + py*py gen double u = ((LNG - _X[_n-1]) * px + (LAT - _Y[_n-1]) * py) replace u = 1 if u > 1 & u != . replace u = 0 if u<0  & u != .  generate double x = _X[_n-1] + u * px generate double y = _Y[_n-1] + u * py generate double dx = x - LNG generate double dy = y - LAT generate double dist =   sqrt(dx*dx + dy*dy) geodist  y x LAT  LNG ,gen(distance) miles  sort ZIP distance by ZIP: keep if _n == 1 list ZIP LAT LNG x y distance
                          Thank you very much again! You have been extremely helpful.

                          Comment


                          • #14
                            Dear Scott Merryman,

                            Thank you so much for your help so far.

                            Apologies, I just realised I displayed the code in a way that is particularly difficult to read in #13. Here it is, more legible:

                            Code:
                             gen double px = _X - _X[_n-1]  
                            gen double py = _Y - _Y[_n-1]  
                            gen double norm = px*px + py*py  
                            gen double u = ((LNG - _X[_n-1]) * px + (LAT - _Y[_n-1]) * py)  
                            replace u = 1 if u > 1 & u != .  
                            replace u = 0 if u<0  & u != .  
                            generate double x = _X[_n-1] + u * px  
                            generate double y = _Y[_n-1] + u * py  
                            generate double dx = x - LNG  
                            generate double dy = y - LAT  
                            generate double dist =   sqrt(dx*dx + dy*dy)  
                            geodist  y x LAT  LNG ,gen(distance) miles  
                            sort ZIP distance by ZIP: keep if _n == 1  
                            list ZIP LAT LNG x y distance
                            Really sorry to ask you this, but would you mind explaining what each chunk / group of lines of code does? And what its purpose is? This would be immensely helpful for me for the future!

                            Many thanks in advance!

                            Best regards!

                            Comment

                            Working...
                            X