I'm trying to interface a large Scala + Akka + PlayMini application with an external REST API. The idea is to periodically poll a root URL (roughly every 1 to 10 minutes) and then crawl through sub-level URLs to extract data, which is then sent to a message queue.
I have come up with two ways to do this:
1st way
Create a hierarchy of actors to match the resource path structure of the API. In the Google Latitude case, that would mean, e.g.
- Actor 'latitude/v1/currentLocation' polls https://www.googleapis.com/latitude/v1/currentLocation
- Actor 'latitude/v1/location' polls https://www.googleapis.com/latitude/v1/location
- Actor 'latitude/v1/location/1' polls https://www.googleapis.com/latitude/v1/location/1
- Actor 'latitude/v1/location/2' polls https://www.googleapis.com/latitude/v1/location/2
- Actor 'latitude/v1/location/3' polls https://www.googleapis.com/latitude/v1/location/3
- etc.
In this case, each actor is responsible for polling its associated resource periodically, as well as creating / deleting child actors for next-level path resources (i.e. actor 'latitude/v1/location' creates actors 1, 2, 3, etc. for all locations it learns about through polling of https://www.googleapis.com/latitude/v1/location).
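To make the first approach concrete, here is a minimal sketch of one such actor, assuming Akka classic actors on the classpath. The `fetchChildren` function is a hypothetical stand-in for the HTTP call and response parsing, and `ResourceActor`, `Poll` and `reconcile` are illustrative names rather than anything from the application described above.

```scala
import akka.actor.{Actor, Props}
import scala.concurrent.duration._

object ResourceActor {
  case object Poll

  // Pure helper: which child names to create and which to stop,
  // given the current children and the names found by the latest poll.
  def reconcile(existing: Set[String], discovered: Set[String]): (Set[String], Set[String]) =
    (discovered -- existing, existing -- discovered)
}

// fetchChildren is an assumed, hypothetical function that polls `path` and
// returns the names of its sub-resources (e.g. Set("1", "2", "3")).
// It is called synchronously here for brevity; a real poller would more
// likely use an async HTTP client and send the result back as a message.
class ResourceActor(path: String, interval: FiniteDuration,
                    fetchChildren: String => Set[String]) extends Actor {
  import ResourceActor._
  import context.dispatcher // ExecutionContext for the scheduler

  // Periodic self-message; newer Akka versions deprecate `schedule`
  // in favour of scheduleWithFixedDelay / scheduleAtFixedRate.
  private val tick =
    context.system.scheduler.schedule(Duration.Zero, interval, self, Poll)

  override def postStop(): Unit = tick.cancel()

  def receive: Receive = {
    case Poll =>
      val existing = context.children.map(_.path.name).toSet
      val (toCreate, toStop) = reconcile(existing, fetchChildren(path))
      toCreate.foreach { name =>
        context.actorOf(
          Props(classOf[ResourceActor], s"$path/$name", interval, fetchChildren), name)
      }
      // Stopping a child also stops its whole subtree.
      toStop.foreach(name => context.child(name).foreach(context.stop))
  }
}
```

Since stopping an actor stops all of its descendants, removing a resource takes down its entire branch, including the timers cancelled in `postStop`.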
2nd way
Create a pool of identical polling actors which receive polling requests (containing the resource path) load-balanced by a router, poll the URL once, do some processing, and schedule polling requests (both for next-level resources and for the polled URL). In Google Latitude, that would mean for instance:
1 router, n poller actors. Initial polling request for https://www.googleapis.com/latitude/v1/location leads to several new (immediate) polling requests for https://www.googleapis.com/latitude/v1/location/1, https://www.googleapis.com/latitude/v1/location/2, etc. and one (delayed) polling request for the same resource, i.e. https://www.googleapis.com/latitude/v1/location.
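A rough sketch of this second approach, assuming Akka classic 2.3 or later (for `RoundRobinPool`). As before, `fetchChildren` is a hypothetical stand-in for the HTTP call. The `repeat` flag is an assumption of this sketch, not part of the description above: only the root request re-schedules itself, while child requests are re-issued by their parent on each cycle, so that delayed requests do not pile up.

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import akka.routing.RoundRobinPool
import scala.concurrent.duration.FiniteDuration

// repeat = true marks a request that re-schedules itself after being served.
final case class PollRequest(path: String, repeat: Boolean)

object Poller {
  // Pure helper: the immediate follow-up requests for the sub-resources.
  def childRequests(path: String, children: Set[String]): Set[PollRequest] =
    children.map(name => PollRequest(s"$path/$name", repeat = false))

  // One round-robin router in front of n identical pollers.
  def startPool(system: ActorSystem, n: Int, interval: FiniteDuration,
                fetchChildren: String => Set[String]): ActorRef =
    system.actorOf(
      RoundRobinPool(n).props(Props(classOf[Poller], interval, fetchChildren)),
      "pollers")
}

class Poller(interval: FiniteDuration,
             fetchChildren: String => Set[String]) extends Actor {
  import context.dispatcher // ExecutionContext for the scheduler

  def receive: Receive = {
    case req @ PollRequest(path, repeat) =>
      // Routees of a pool are children of the router, so sending to the
      // parent puts the message back through the load balancer.
      val router = context.parent
      Poller.childRequests(path, fetchChildren(path)).foreach(router ! _)
      if (repeat) context.system.scheduler.scheduleOnce(interval, router, req)
  }
}
```

Polling would then be started with something like `Poller.startPool(system, 4, 5.minutes, fetch) ! PollRequest("latitude/v1/location", repeat = true)`, where the pool size and interval are arbitrary example values.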
I have implemented both solutions and can't immediately observe any relevant difference in performance, at least not for the API and polling frequencies I am interested in. I find the first approach somewhat easier to reason about, and perhaps easier to use with system.scheduler.schedule(...) than the second approach (where I need scheduleOnce(...)). Also, assuming resources are nested through several levels and somewhat short-lived (e.g. several resources may be added or removed between two polls), Akka's lifecycle management makes it easy to kill off a whole branch in the 1st case. The second approach should (theoretically) be faster, and its code is somewhat easier to write.
My questions are:
- Which approach seems best (in terms of performance, extensibility, code complexity, etc.)?
- Do you see anything wrong with the design of either approach (esp. the 1st one)?
- Has anyone tried to implement anything similar? How was it done?
Thanks!
Why not create a master poller, which then kicks off async resource requests on the schedule?
I'm no expert using Akka, but I gave this a shot:
The poller object that iterates through the list of resources to fetch:
The actor that reads the resource asynchronously and triggers more async reads. You could put the message dispatch on a schedule rather than calling immediately if that would be kinder to the API:
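As a rough illustration of the master-poller idea, the following sketch uses plain `scala.concurrent.Future`s instead of an actor, to stay self-contained. `AsyncCrawler`, `Fetch`, `crawl` and `pollAll` are hypothetical names, and `Fetch` stands in for whatever asynchronous HTTP client is actually used.

```scala
import scala.concurrent.{ExecutionContext, Future}

object AsyncCrawler {
  // Hypothetical stand-in for an asynchronous HTTP GET: given a resource
  // path, eventually yields the paths of the sub-resources it lists.
  type Fetch = String => Future[Seq[String]]

  // Reads one resource, then recursively reads everything it links to.
  // Returns every path visited, each parent before its children.
  def crawl(path: String, fetch: Fetch)
           (implicit ec: ExecutionContext): Future[Seq[String]] =
    fetch(path).flatMap { children =>
      Future.traverse(children)(child => crawl(child, fetch))
        .map(visited => path +: visited.flatten)
    }

  // The "master poller": kicks off one asynchronous crawl per root resource.
  def pollAll(roots: Seq[String], fetch: Fetch)
             (implicit ec: ExecutionContext): Future[Seq[String]] =
    Future.traverse(roots)(root => crawl(root, fetch)).map(_.flatten)
}
```

A scheduler (Akka's or any other) would then only need to call `pollAll` once per interval; where the comment says a resource was visited is where the data would be forwarded to the message queue.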