  • Looping with Geodist

    I have a dataset of individuals with their geocoordinates and supermarket branches with their geocoordinates. I need to calculate the distance from the house of each individual to all the supermarket branches operating in that individual's year of birth.

    I know the syntax of geodist and geonear, but I don't know how to loop so that each individual's year of birth is matched to the supermarket branches' year.

    For example:


    clear
    set obs 10
    gen lat1 = 37 + (41 - 37) * uniform()
    gen lon1 = -109 + (109 - 102) * uniform()
    gen yrbirth= round(1950 + (12-5) * uniform())
    save "individual.dta", replace

    clear
    set obs 15
    gen lat2 = 37 + (41 - 37) * uniform()
    gen lon2 = -109 + (109 - 102) * uniform()
    gen stryr= round(1950 + (12-5) * uniform())
    save "store.dta", replace

    merge 1:1 _n using "individual.dta", nogen



    Now, in this example, individual 1 with year of birth 1952 does not have any supermarket branch operating in that year so I want the distance to be (.). For the individual with year of birth 1955, I want to calculate the distance to the 7 supermarket branches operating in 1955. I want to do it for all individuals and then keep for each individual the nearest two supermarket branches.

    I would appreciate it if someone could help me.

  • #2
    Because you only have a single year for stores, my only guess would be that you are assuming that if that single store year is less than the individual's year of birth, the store is still open at the year of birth. I would not think that is a good assumption, but I suppose it could be.

    Anyway, that being said, I'd use -cross- to create all possible pairs of individuals and stores, and then use -geodist- on that file. There might also be a role for -geonear- (available from SSC) here. I don't have time to lay out all the details here at the moment, but perhaps this will get you started.



    • #3
      Thank you for your responses!

      Liu: Yes, of course. I am sorry, I should have used set seed to make the results reproducible. Your output is different from mine, but the question remains the same even with your output.

      Mike: My dataset is too big to use cross (I have previously used geodist after crossing to make all pairwise combinations). As I do not need every individual paired with every store, I do not want to use cross.
      In addition, I should have written more clearly. Yes, I agree with you, it does not seem like a good assumption. So I am only considering the stores operating in the particular year in which the individual was born. I want to match only stores whose year equals the year of birth.



      • #4
        Hello Shreya Alam, I am still confused about your rule (forgive my slow understanding), as I cannot see your dataset or the rule you have in mind. If possible, please share a data example so that others can see more clearly how to help. Do you mean that, for a given individual, you just want to calculate the distance to stores whose store year equals the birth year?
        Last edited by Liu Qiang; 16 Jun 2019, 09:05.
        2B or not 2B, that's a question!



        • #5
          Hi Liu, yes, for every individual I want to calculate the distance to all the stores operating in the year of birth of that individual.



          • #6
            Hello Shreya Alam, I think Mike's advice is very helpful. If you don't care about which two stores are the nearest, the following code may help you.
            Code:
            clear
            version 15.0
            set seed 12345
            set obs 10
            gen lat1 = 37 + (41 - 37) * uniform()
            gen lon1 = -109 + (109 - 102) * uniform()
            gen yrbirth= round(1950 + (12-5) * uniform())
            save "individual.dta", replace
            
            clear
            set obs 15
            gen lat2 = 37 + (41 - 37) * uniform()
            gen lon2 = -109 + (109 - 102) * uniform()
            gen stryr= round(1950 + (12-5) * uniform())
            save "store.dta", replace
            
            merge 1:1 _n using "individual.dta", nogen 
            
            
            use "store.dta", clear
            sort stryr
            gen id_sto=_n
            save "store.dta", replace
            
            
            use "individual.dta",clear
            sort yrbirth
            gen id_ind=_n
            save "individual.dta", replace
            
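            use "individual.dta", clear
            count
            local n = r(N)
            forvalues i = 1/`n' {
                // pair the current individual with every store
                use "individual.dta", clear
                keep in `i'
                cross using "store.dta"
                geodist lat1 lon1 lat2 lon2, generate(dist)
                // only stores whose year equals the birth year count
                replace dist = . if yrbirth != stryr
                // missing values sort last, so matching stores come first
                sort dist
                keep in 1/2
                // relabel as rank 1 and 2 (the store identity is lost)
                replace id_sto = 1 in 1
                replace id_sto = 2 in 2
                keep id_ind id_sto dist
                reshape wide dist, i(id_ind) j(id_sto)
                // save to a real temporary file rather than the working directory
                tempfile temp`i'
                save "`temp`i''"
            }
            // stack the one-row results for all individuals
            clear
            forvalues i = 1/`n' {
                append using "`temp`i''"
            }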



            • #7
              To get good advice, you need good data examples. In this case, there are no identifiers and a single store date. I suppose the store data may be annual but I'll assume that you have a store closing date instead and the stryr variable holds the year the store opened. You also need to be more specific about how far you want to look and still consider a store to be a neighbor. Here's how to create the data I use for this post:

              Code:
              version 14
              clear
              set seed 588
              
              set obs 10
              gen long personid = _n
              gen lat1 = 37 + runiform()
              gen lon1 = -109 + runiform()
              gen yrbirth= runiformint(1950,1960)
              save "individual.dta", replace
              
              clear
              set obs 15
              gen long storeid = _n
              gen lat2 = 37 + uniform()
              gen lon2 = -109 + uniform()
              gen stryr= runiformint(1952,1970)
              gen strclose = stryr + runiformint(0,50) if runiform() < .4
              save "store.dta", replace
              The real data may be too large to form all pairwise combinations of individuals/stores but you can do it in parts. The natural choice in this case is to do it by birth year. For any given birth year, there is a sample of stores in operation that year, so the task reduces to forming all pairwise combinations of individuals and stores in that birth year. That may still be too large a problem, so I would recommend using geonear (from SSC) to find the nearest stores. I'm also going to assume that stores that are more than 50km away are not suitable neighbors.

              Finding neighbors by group can require a fair amount of data management gymnastics. I prefer to use runby (from SSC) to process commands by groups. runby loads the data for each group into memory separately and then calls a user-defined program to perform the desired tasks for that group. Whatever is left in memory when the user-defined program terminates is treated as results and accumulated. Applied to this case, the oneyear program first notes the current year and sets the data aside in order to load the appropriate sample of stores (in this case, those open in the current year) and save it to a temporary file. The oneyear program then returns to the data for the current year and, if there are stores in the sample, calls geonear to find all stores within 50km.

              Code:
              version 14
              
              clear all
              
              * define a program that computes results on a single by-group
              program oneyear
              
                  // take note of the current birth year and preserve data
                  local year = yrbirth[1]
                  preserve
              
                  // select applicable store sample, in this case stores open on yob
                  use if inrange(`year', stryr, strclose) using "store.dta", clear
                  local obs = _N
                  tempfile stores_in_operation
                  save "`stores_in_operation'"
              
                  // return to data for the current birth year and 
                  // find all stores within a 50km radius
                  restore
              
                  if `obs' > 0 {
                      geonear personid lat1 lon1 using "`stores_in_operation'", ///
                          n(storeid lat2 lon2) within(50) long
                  }
                  else {
                      keep personid
                  }
              
              end
              
              * load the data on individuals and find nearby stores open on yob
              use "individual.dta"
              list
              runby oneyear, by(yrbirth)
              
              * order by increasing distance and save results
              sort personid km_to_storeid storeid
              list, sepby(personid)
              save "results.dta", replace
              
              * if desired, reduce to the nearest 2 stores
              by personid: keep if _n <= 2
              list, sepby(personid)
              Here is the output of the list command once the results are ordered by increasing distance:
              Code:
              . list, sepby(personid)
              
                   +--------------------------------+
                   | personid   storeid   km_to_s~d |
                   |--------------------------------|
                1. |        1         9   77.543915 |
                   |--------------------------------|
                2. |        2         .           . |
                   |--------------------------------|
                3. |        3         5   12.819188 |
                4. |        3         1    17.17072 |
                5. |        3        12   38.183865 |
                6. |        3        14   40.202161 |
                   |--------------------------------|
                7. |        4         8   26.677266 |
                8. |        4         5   41.481993 |
                9. |        4         1   44.248372 |
               10. |        4        12   47.040715 |
                   |--------------------------------|
               11. |        5        14   13.487636 |
               12. |        5         1   13.562937 |
               13. |        5         5   18.715073 |
               14. |        5        12   24.366718 |
                   |--------------------------------|
               15. |        6         8   1.9787716 |
               16. |        6         9   48.974283 |
                   |--------------------------------|
               17. |        7         9   26.646381 |
                   |--------------------------------|
               18. |        8         9   35.492403 |
               19. |        8        14   41.231934 |
                   |--------------------------------|
               20. |        9         9   6.0726626 |
               21. |        9         8   42.067373 |
                   |--------------------------------|
               22. |       10         .           . |
                   +--------------------------------+
              
              . save "results.dta", replace
              file results.dta saved
              It's always a good idea to spot check results to make sure that you have correctly coded a solution. In this case, I would create a little do-file that loads the data for one individual, draws the appropriate sample of stores (those open in the year of birth), and calculates distances using geodist. I make it flexible so it is easy to spot check any individual:
              Code:
              version 14
              
              * check results for a single test case
              local case = 5
              use "individual.dta", clear
              keep if personid == `case'
              list
              
              * copy the observation's particular to locals
              local id   = personid
              local byear = yrbirth
              local lat  = lat1
              local lon  = lon1
              
              * get stores in operation in the year of birth
              use "store.dta", clear
              keep if stryr <= `byear' & strclose >= `byear'
              list
              
              * special case if there's no store open on the yob
              if _N == 0 exit
              
              * calculate the distance between the test case and the target group
              geodist `lat' `lon' lat2 lon2, gen(d) sphere
              
              sort d storeid
              drop if d > 50
              
              list 
              
              * compare to results
              use "results.dta", clear
              list if personid == `case'
              Here's the full output from running a test case for personid == 5:
              Code:
              . do check_one_case
              
              . version 14
              
              . 
              . * check results for a single test case
              . local case = 5
              
              . use "individual.dta", clear
              
              . keep if personid == `case'
              (9 observations deleted)
              
              . list
              
                   +-------------------------------------------+
                   | personid       lat1        lon1   yrbirth |
                   |-------------------------------------------|
                1. |        5   37.21551   -108.8116      1957 |
                   +-------------------------------------------+
              
              . 
              . * copy the observation's particular to locals
              . local id   = personid
              
              . local byear = yrbirth
              
              . local lat  = lat1
              
              . local lon  = lon1
              
              . 
              . * get stores in operation in the year of birth
              . use "store.dta", clear
              
              . keep if stryr <= `byear' & strclose >= `byear'
              (9 observations deleted)
              
              . list
              
                   +---------------------------------------------------+
                   | storeid       lat2        lon2   stryr   strclose |
                   |---------------------------------------------------|
                1. |       1   37.10405   -108.7494    1952          . |
                2. |       5   37.06136   -108.7268    1955       1993 |
                3. |       8   37.20441   -108.0742    1954          . |
                4. |       9    37.6328   -108.0706    1952       1991 |
                5. |      12   37.34875   -108.5929    1957          . |
                   |---------------------------------------------------|
                6. |      14   37.33517   -108.7866    1955          . |
                   +---------------------------------------------------+
              
              . 
              . * special case if there's no store open on the yob
              . if _N == 0 exit
              
              . 
              . * calculate the distance between the test case and the target group
              . geodist `lat' `lon' lat2 lon2, gen(d) sphere
              
              . 
              . sort d storeid
              
              . drop if d > 50
              (2 observations deleted)
              
              . 
              . list 
              
                   +---------------------------------------------------------------+
                   | storeid       lat2        lon2   stryr   strclose           d |
                   |---------------------------------------------------------------|
                1. |      14   37.33517   -108.7866    1955          .   13.487636 |
                2. |       1   37.10405   -108.7494    1952          .   13.562937 |
                3. |       5   37.06136   -108.7268    1955       1993   18.715073 |
                4. |      12   37.34875   -108.5929    1957          .   24.366718 |
                   +---------------------------------------------------------------+
              
              . 
              . * compare to results
              . use "results.dta", clear
              
              . list if personid == `case'
              
                   +--------------------------------+
                   | personid   storeid   km_to_s~d |
                   |--------------------------------|
               11. |        5        14   13.487636 |
               12. |        5         1   13.562937 |
               13. |        5         5   18.715073 |
               14. |        5        12   24.366718 |
                   +--------------------------------+
              
              . 
              end of do-file
              
              .
              Note that, as coded, geonear returns at least one neighbor, even if it is more than 50km away (as happens with personid == 1). If no stores are open anywhere in the birth year, then both km_to_storeid and storeid are missing.
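
If a lone neighbor beyond the radius should instead count as "no store in range", one possible follow-up is to blank it out after the fact. The lines below are only a sketch: they re-enter three rows copied from the listing above rather than loading results.dta, and the 50km cutoff is the same assumption used throughout this post.

```stata
* sketch only: three rows copied from the results listing above
clear
input long personid long storeid double km_to_storeid
1 9 77.543915
3 5 12.819188
3 1 17.17072
end

* treat any neighbor beyond the assumed 50km radius as "no store in range"
replace storeid = . if km_to_storeid > 50 & !missing(km_to_storeid)
replace km_to_storeid = . if missing(storeid)

list, sepby(personid)
```

With this rule, personid == 1 ends up looking like an individual with no store open in the birth year, so whether to apply it depends on whether those two situations need to be told apart.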



              • #8
                Dear Robert Picard thanks A LOT for your response. I should have created identifiers in the sample data.

                I do not have a store closing year. I have data over a range of years for each supermarket branch; some branches have 10 years of data, some 5. So I am assuming that if a particular store's data is available from 1950-1960 and missing afterwards, then, according to the source of the data, the branch has closed down. So I am not considering whether a branch has closed, only the branches for which I have data in the birth year itself.

                I want to code what you mentioned in your second paragraph: for any given birth year, there is a sample of stores in operation that year, so I want to form pairwise combinations of that individual's house with all the stores operating in that year. Then I want to calculate the shortest distance to each of those stores and sort the distances to get the nearest two (however far they may be). Whether they are suitable neighbours might not matter much, because I do not exactly need neighbours; I need to calculate the distance from the house of each individual to all the supermarket branches operating in the year of birth of that individual.


                I tried the code above; it worked for the generated data but it is not working for the original data. It gives an error after the runby command. It says observations saved 0. I am trying to figure out the problem with my data.
                Last edited by Shreya Alam; 17 Jun 2019, 11:46.



                • #9
                  If I understand correctly, you do not care how far away the stores are. If that's the case, then there is no advantage in using geonear and you must calculate all pairwise distances by year. Here's much simpler code that does that, using joinby to form all pairwise combinations within each year.

                  Code:
                  version 14
                  clear all
                  set seed 588
                  
                  set obs 10
                  gen long personid = _n
                  gen lat1 = 37 + runiform()
                  gen lon1 = -109 + runiform()
                  gen yrbirth= runiformint(1950,1958)
                  save "individual.dta", replace
                  list
                  
                  clear
                  set obs 50
                  gen long storeid = _n
                  gen lat2 = 37 + uniform()
                  gen lon2 = -109 + uniform()
                  gen stryr= runiformint(1950,1957)
                  save "store.dta", replace
                  list
                  
                  use "individual.dta", clear
                  
                  gen stryr = yrbirth
                  joinby stryr using "store.dta", unmatched(master)
                  
                  geodist lat1 lon1 lat2 lon2, gen(d)
                  sort personid d storeid
                  list personid d storeid, sepby(personid)
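
The original question also asks to keep only the two nearest stores per individual. A minimal sketch of that last step follows. It runs on made-up toy data (the coordinates and years are purely illustrative and not taken from the thread) and assumes geodist is installed from SSC. Unmatched individuals stay in the data with a missing distance, and missing values sort last.

```stata
* sketch on toy data; assumes geodist is installed (ssc install geodist)
version 14
clear all

* made-up store data, for illustration only
input long storeid double lat2 double lon2 int stryr
1 37.10 -108.75 1955
2 37.06 -108.73 1955
3 37.20 -108.07 1955
4 37.63 -108.07 1952
end
tempfile stores
save "`stores'"

* made-up individual data, for illustration only
clear
input long personid double lat1 double lon1 int yrbirth
1 37.21 -108.81 1955
2 37.50 -108.50 1951
end

* pair each person only with stores whose year equals the birth year
gen stryr = yrbirth
joinby stryr using "`stores'", unmatched(master)
drop _merge

geodist lat1 lon1 lat2 lon2, gen(d)

* keep the two nearest stores per person (ties broken by storeid)
bysort personid (d storeid): keep if _n <= 2
list personid storeid d, sepby(personid)
```

Here person 2 has no store in the birth year, so that row is kept with storeid and d missing, which matches the (.) requested in the first post.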

