Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Looping through observations, should I do it?

    Dear all,

    I have a computational problem (see below) that I have only solved by looping through all observations, which is incredibly slow and seems not to be recommended (http://www.stata.com/statalist/archi.../msg00525.html). I would be grateful if anyone had alternative suggestions.

    Each of my observation has a location (x, y) and for each observation that satisfies a condition (here: status==1) I want to count the number of observations that are within a radius of 1. So far I am doing it in the following way:

    gen distance=.
    gen within_radius =.

    count if status == 1
    local C = r(N)

    local i = 1
    while `i' <= `C' {
    quietly replace distance = sqrt((X-X[`i'])^2 + (Y-Y[`i'])^2)
    quietly count if distance<1
    quietly replace within_radius = r(N) if _n == `i'

    local i = `i' + 1
    }

    Technical details: Stata 13 on Windows 10.

    Many thanks in advance,
    Jan

  • #2
    You can remove the sqrt() call. If the squared distance is less than 1 then so also is the distance. Even if your threshold were not 1, you could still use the implied threshold for squared distance.

    Also,

    Code:
    quietly replace within_radius = r(N) in `i'
    is much faster than

    Code:
    quietly replace within_radius = r(N) if _n == `i'
    as the first is automatically restricted to a single observation and the second tests all observations to see whether the condition is true.

    These are tweaks. The bigger deal would be rewriting in Mata.

    But all that said, I would be surprised if this were not an existing program, but I can't remember writing it or which it might be.

    Comment


    • #3
      If these coordinates are geographic (lat/lon), you can use geonear (from SSC) to find all points within a radius of 1.

      I'm working on a new version that will be able to use cartesian distances. It was put on hold a while ago so that I could move on to another project. I'm not sure when I'll get back to it.

      Comment


      • #4
        Dear Jan Bakker
        Try this example, I hope maybe help you
        about using geographic latitude and longitude
        to create spatial weight matrix (W)

        Code:
        clear all
        input X Y
        38 36
        37 36
        36 36
        36 36
        40 35
        38 34
        39 34
         end
        
         qui {
         preserve
         local N=_N
         matrix _WM=J(`N',`N',0)
         local i=1
         while `i'<=`N' {
         local j=`i'+1
         while `j'<=`N' {
         local D=sqrt((X[`i']-X[`j'])^2+(Y[`i']-Y[`j'])^2)
         if `D'>0 & `D'<=3 {
         matrix _WM[`i',`j']=1
         matrix _WM[`j',`i']=1
         }
         local j=`j'+1
         }
         local i=`i'+1
         }
         restore
         }
         matlist _WM , twidth(5) border(all) lines(cells) rowtitle(#) format(%4.0f) nohalf
        HTML Code:
        .  matlist _WM , twidth(5) border(all) lines(cells) rowtitle(#) format(%4.0f) nohalf
        
        +--------------------------------------------------------+
        |     # |   c1 |   c2 |   c3 |   c4 |   c5 |   c6 |   c7 |
        |-------+------+------+------+------+------+------+------|
        |    r1 |    0 |    1 |    1 |    1 |    1 |    1 |    1 |
        |-------+------+------+------+------+------+------+------|
        |    r2 |    1 |    0 |    1 |    1 |    0 |    1 |    1 |
        |-------+------+------+------+------+------+------+------|
        |    r3 |    1 |    1 |    0 |    0 |    0 |    1 |    0 |
        |-------+------+------+------+------+------+------+------|
        |    r4 |    1 |    1 |    0 |    0 |    0 |    1 |    0 |
        |-------+------+------+------+------+------+------+------|
        |    r5 |    1 |    0 |    0 |    0 |    0 |    1 |    1 |
        |-------+------+------+------+------+------+------+------|
        |    r6 |    1 |    1 |    1 |    1 |    1 |    0 |    1 |
        |-------+------+------+------+------+------+------+------|
        |    r7 |    1 |    1 |    0 |    0 |    1 |    1 |    0 |
        +--------------------------------------------------------+

        Last edited by Emad Shehata; 29 Jun 2016, 22:29.
        Emad A. Shehata
        Professor (PhD Economics)
        Agricultural Research Center - Agricultural Economics Research Institute - Egypt
        Email: [email protected]
        IDEAS: http://ideas.repec.org/f/psh494.html
        EconPapers: http://econpapers.repec.org/RAS/psh494.htm
        Google Scholar: http://scholar.google.com/citations?...r=cOXvc94AAAAJ

        Comment


        • #5
          Dear Nick, Robert and Emad,

          Many thanks for your great replies.

          Nick, I did not know about this difference between using in and if. This is very helpful, even though in this case it is more like a drop in the ocean. I have been looking into Mata (I have not really worked with it before). Could you suggest any keywords that could guide a google or -findit- search for existing programs that might deal with this or a similar problem?


          Robert, I have not tried geonear because I realised that there are additional computations that I did not report earlier (sorry about that), that still require me to loop through all observations. So I have not given up on a more general solution.

          Emad, this is quite a useful code. For my problem it does not give any speed advantages unfortunately, at least in a small sample trial run.

          I realised that my minimal code might have been too minimal, as I missed out an additional step. After identifying which locations are within this radius I want to add up the number of people that live within the radius of each location. So far the code looks like this:

          Code:
          gen distance=.
          gen within_radius =.
          gen people_in_range=.
          
          count if status == 1
          local C = r(N)
          
          local i = 1
          while `i' <= `C' {
          quietly replace distance = sqrt((X-X[`i'])^2 + (Y-Y[`i'])^2)
          quietly count if distance<1
          quietly replace within_radius = r(N) if _n == `i'
          quietly gen temp = people if distance<50 & country[`i'] = country
          quietly egen temp2= sum(temp)
          
          quietly replace people_in_range= temp2 if _n == `i'
          
          quietly drop temp temp2
          local i = `i' + 1
          }
          I apologies for not bringing this up immediately.

          I would be very grateful about any additional comments.

          Comment


          • #6
            You haven't yet included all the suggestions. There are further problems with this code.

            1. It seems that you only want observations with status == 1, and you count them, but you loop over the first so many observations any way. There is no check on whether any observation processed satisfies that constraint. I've not fixed that below because you don't explain any more on that.

            2. Why is distance < 1 in one part of the code and distance < 50 in another?. Same comment. If you want different thresholds, note that the root of 50 is not 50.

            3. Firing up egen to get a sum, putting that constant in a variable and taking it out is very inefficient. There is a bug in that = should be == in your creation of temp in any case. You can and should use summarize directly. http://www.stata-journal.com/sjpdf.h...iclenum=st0135 says more.

            4. I've made other stylistic and efficiency changes.

            This code is untested.

            Code:
            gen double distance = .
            gen double within_radius = .
            gen double people_in_range = .
            
            count if status == 1
            local C = r(N)
            
            quietly forval i = 1/`C' { 
                replace distance = (X-X[`i'])^2 + (Y-Y[`i'])^2
                count if distance < 1
                replace within_radius = r(N) in `i'
                su people if distance < 50 & country[`i'] == country, meanonly 
                replace people_in_range= r(sum) in `i'
            }

            Comment


            • #7
              Dear Nick,

              Many thanks for your suggestions. I apologies for the bugs in my code, I was going for a minimal example and coded them in on the way.

              This has been really helpful, not only for this particularly problem but for my understanding regarding good/efficient coding in general. Thanks!

              On 1.:
              I do sort the variables before (now added to the code).

              On 2.:
              This is 'just' a typo.

              On 3./4.:
              Thanks for your suggestions.

              I made some changes to your code below. I am currently rerunning run both codes to test and compare the runtime. I will let you know how it goes. Do you have any further suggestions on how to avoid the explicit loop through the observations?

              Code:
              gen double distance = .
              gen double within_radius = .
              gen double people_in_range = .  
              
              gsort - status id
              
              count if status == 1
              local C = r(N)
              
              local t = (50)^2 // where 50 is the actual distance cut-off
              
              quietly forval i = 1/`C' {      
              replace distance = (X-X[`i'])^2 + (Y-Y[`i'])^2    
              count if distance < `t'    
              replace within_radius = r(N) in `i'    
              su people if distance < `t' & country[`i'] == country, meanonly      
              replace people_in_range= r(sum) in `i'
              }

              Comment


              • #8
                The sort is one thing; but otherwise values of status other than 1 are not ignored unless you say so. Do you want

                Code:
                  
                count if status == 1 & distance < `t'    
                ... 
                su people if status == 1 distance < `t' & country[`i'] == country, meanonly

                Comment


                • #9
                  Jan, you haven't explained if you are working with geographic coordinates or not. If these are latitudes and longitudes, then you are not calculating the distances correctly.

                  You also say that you would like to avoid the explicit loop but I see no task that requires a loop. You could use geonear to match all points within your specified radius and perform calculations on these matches. If you need to restrict the matches by country, then you can do it in parts (by country). I can post a small wrapper around geonear that can do that if needed.

                  With respect to efficiency, it all depends on the size of the problem. How many points to do have?

                  Please use dataex (from SSC) to generate a small data example with the relevant variables and post it here. I'm sure you'll get better code suggestions if we have a real dataset to play with. To install dataex, type in Stata's Command window:

                  Code:
                  ssc install dataex

                  Comment

                  Working...
                  X