
  • Working with Google APIs in Stata

    A user sent me this message individually, but I thought it might be helpful to other members of the Stata community:

    Hi, I recently saw that you finished a new user-written command to replace the old geocode3, among other things. Since you have some experience using Google APIs in Stata, I was hoping you could help with an issue we've been having using insheetjson and Google APIs with one of our own commands. We are using the Google Places API to pull a different set of results instead of the Geocoding API, but the two APIs are fairly similar.

    The issue is that when we run the command on any decently sized data set (anything over 1,000 observations), the program will, for no reason we can identify, randomly terminate with an "Obs. nos. out of range" r(198) error. The strange thing is that if we run the command again over the same data set (the original data set, with no API results from the previous run), we often do not get the same error; the program passes right by the observation it previously terminated on. For example, if on one run it terminated on observation 200, on the next run it will usually run fine through observation 200.

    At first I thought this must be due to timing out when contacting Google, so we added a delay in our loop with the sleep command. However, this did nothing to fix the issue, no matter how long we made the delay. We tried a few other approaches, and even used different hard drives in case there was a communication issue there, but nothing solved it. I was curious whether you had similar problems when running your command and, if so, how you solved them?

    Thanks in advance,
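    One pattern worth noting about the sleep-based workaround described above: a fixed delay between all calls does not help if the failure is an occasional bad response rather than request pacing, whereas retrying only the request that failed, with exponential backoff, often does. A minimal sketch of the idea (in Python, since the logic is language-agnostic; the `fetch` callable is a hypothetical stand-in for whatever actually contacts the API):

```python
import time

def fetch_with_retry(fetch, max_tries=4, base_delay=1.0):
    """Call fetch(), retrying transient failures with exponential backoff.

    `fetch` is any zero-argument callable that returns the payload or
    raises an OSError-family exception on failure (network errors and
    timeouts in the standard library derive from OSError).
    """
    for attempt in range(max_tries):
        try:
            return fetch()
        except OSError:
            if attempt == max_tries - 1:
                raise  # exhausted retries; surface the real error
            # Back off 1s, 2s, 4s, ... before the next attempt.
            time.sleep(base_delay * (2 ** attempt))
```

With this shape, an intermittent one-in-a-few-hundred failure gets absorbed by the retry instead of killing the whole run, which is exactly the symptom described in the message.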

  • #2
    The first thing that is helpful to identify is that there are three distinct APIs that all fall under the larger Google Places API: place search, place details, and autocomplete. Each of these APIs uses a distinct JSON schema of varying complexity (e.g., objects nested in arrays nested in other objects and/or arrays). Even with fairly well-established tools for JSON processing, like the Jackson JSON library, it can be challenging to access specific elements in the payload in a consistent manner (e.g., if the payload from an API call includes null elements, they can't really be used as reference/access points to other data in the payload, and they need to be handled if the data are to be rectangularized).
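    The null-handling problem above is worth making concrete. A defensive accessor that treats missing keys, out-of-range indices, and JSON nulls uniformly is one common way to walk these nested payloads; a Python sketch (the example payload is made up, loosely shaped like a place search response):

```python
import json

def dig(payload, *path, default=None):
    """Walk a nested dict/list structure, returning `default` whenever a
    key is missing, an index is out of range, or a value is null."""
    cur = payload
    for step in path:
        if isinstance(cur, dict):
            cur = cur.get(step)
        elif isinstance(cur, list) and isinstance(step, int) and 0 <= step < len(cur):
            cur = cur[step]
        else:
            return default
        if cur is None:
            return default
    return cur

# A toy payload with a null element, for illustration only.
doc = json.loads('{"results": [{"name": "Cafe", "rating": null}]}')
```

Here `dig(doc, "results", 0, "rating", default=-1)` yields -1 instead of raising, so a null rating never derails the rest of the extraction.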

    Without additional details it would be difficult to say with any certainty what is causing the problem, but given the error message (Obs. nos. out of range, r(198)) I would suspect it is a problem with rectangularizing the payload and joining it into the existing source data. In other words, if the payload for data on a single observation returns a series of nested objects/arrays, there are only two ways of managing those data in Stata: a long format (i.e., continue to keep the data nested within the observation) or a wide format (create additional variables for each level of nesting).

    The challenge with managing the data in long format is inserting new records and then populating them correctly and consistently (e.g., with the details API there will be varying numbers of ratings for given locations, sometimes no ratings at all, other times several hundred or several thousand). If this is what insheetjson is attempting to do, it would make sense that an out-of-range error could occur if it attempts to add more nested elements than the current data set can hold. The benefit of this approach, however, is that it is much more amenable to handling complexity (e.g., arrays nested within objects nested within other objects that are themselves nested within arrays). The biggest challenge I see with managing the data in wide format is handling some of the more complex objects in a consistent manner.
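    The long-format reshaping described above can be sketched in a few lines. This is not what insheetjson does internally, just an illustration of the idea: each nested array element becomes its own row, and a record with an empty or missing array still gets one padded row (the field names are invented for the example):

```python
def to_long(records):
    """Flatten each record's nested 'reviews' array into one row per
    (place, review), padding places that have no reviews with None so
    every place survives the reshape."""
    rows = []
    for rec in records:
        reviews = rec.get("reviews") or [None]  # missing/empty -> one padded row
        for rev in reviews:
            rows.append({"place": rec["place"], "review": rev})
    return rows

places = [
    {"place": "A", "reviews": ["good", "ok"]},
    {"place": "B"},  # no reviews at all
]
```

The subtle part, and a plausible source of r(198)-style failures, is the allocation: a varying number of rows must be inserted per source observation, so the target storage has to grow by an amount that is only known after the payload is parsed.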

    The approach I took didn't rely on insheetjson, so I can't comment on that aspect of things. One of the advantages of handling this on the JVM is that I can pull a representation of the data into the JVM and use that language's mechanisms to handle some of the complexity. For example, all of the JSON can be processed and parsed before any of it is returned to Stata, in which case it becomes possible to know how many additional observations, if any, need to be created to store/maintain the structure. It also makes it a bit easier to handle the structure of the JSON in a more consistent and transparent manner (e.g., if done well, you could probably look at the JSON and the resulting data set and see fairly similar structures).
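    The parse-everything-first idea reduces to a simple pre-pass: once all payloads are parsed, the total number of observations the flattened data will need can be computed up front, and storage expanded once rather than mid-loop. A Python sketch under the same toy schema as above (a `results` array per payload, one padded row for empty results):

```python
def rows_needed(parsed_payloads):
    """Pre-compute how many observations the flattened data will need,
    so storage can be expanded once, up front, instead of during the
    per-observation write loop (where a miscount triggers
    out-of-range errors)."""
    return sum(max(1, len(p.get("results") or [])) for p in parsed_payloads)
```

For example, payloads with three results, zero results, and no results key at all need 3 + 1 + 1 = 5 rows.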



    • #3
      The two Places APIs we use are place search and place details, but when we started running into issues while using both, we switched to using just place search (since you can't do a details lookup without a place ID, the search API is more useful to us anyway) to see if we could iron out the issue. The issue still occurs using just the place search API.

      What's really confusing me is that we can run the command we made three or four times over the same data set (the original data set each time), and it may terminate on only one of those four runs, with the other three passing right through the observation that caused the termination. Why would Stata terminate the program one time but not the other three if the input is the same? This is what makes me think it's an issue with the Places API itself and not just a parsing issue. Our current theory is that something in Google's Places API policy that is not explicitly stated in their user agreement is getting us blocked by whatever algorithm they use to check for that sort of thing.

      Anyway, I'm not at work at the moment, but tomorrow I can post some more details.



      • #4
        Google tends to have public-use API limits, so I wouldn't be too surprised if that caused some of the issues. I'll take a look at the Places API again in the morning, but the driving/travel directions API returns a lot of data that is hard to handle in a way that stays manageable and easy for users to work with (e.g., the API can return up to 23 distinct waypoints, which means either appending a lot of observations, along with rules about how to add them, or generating a ton of variables).
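        The wide-format alternative mentioned above (generating a variable per waypoint slot) looks like this as a sketch; the 23-slot cap comes from the post, while the field names are invented for illustration:

```python
def to_wide(rec, max_waypoints=23):
    """Flatten a route's waypoint list into fixed columns wp_1..wp_N,
    padding unused slots with None so every row has the same shape."""
    wps = rec.get("waypoints") or []
    row = {"route": rec.get("route")}
    for i in range(max_waypoints):
        row[f"wp_{i + 1}"] = wps[i] if i < len(wps) else None
    return row
```

The trade-off is exactly the one described: every row carries all 23 columns even when only two are populated, but no observations ever need to be appended mid-run.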
