2011-05-26

Distributed Workers

I have a project coming up where I'll need to use distributed workers. It's a bit odd in that the workers will come and go, and they'll most likely need a copy of the subset of the data they are working on so they can process it efficiently, but the workload is very time dependent. Put another way, I need to keep data synchronized between the master server and the remote clients in a way that is testable.

I expect to see problems where two workers process the same data set but get slightly different results. Different results are not only possible but probable, since processing the data requires the workers to check a resource that sometimes flaps between values, for minutes at a time, while it is in transition.

Here are the criteria for the system as I see them so far:

Server:

  • Canonical data source; data stored in some sort of DB

  • Accepts registrations from clients/workers

  • Creates jobs/tasks in a work queue

  • Assigns jobs/tasks from work queue to registered workers

  • Accepts results from workers or times out task after appropriate wait
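To make the server list above concrete, here is a minimal in-memory sketch of the register / create / assign / accept-or-time-out cycle. It is in Python rather than Perl, purely as an illustration, and every name in it (`Server`, `assign`, `submit`, `expire`, the `timeout` value) is my own invention rather than an existing module's API. It assumes a timed-out task simply goes back on the queue, which is only one possible policy.

```python
import time
from collections import deque


class Server:
    """In-memory stand-in for the master; a real one would sit on a DB."""

    def __init__(self, timeout=5.0, clock=time.monotonic):
        self.timeout = timeout   # seconds a worker may hold a task (assumed)
        self.clock = clock       # injectable so timeouts can be tested
        self.workers = set()     # names of registered workers
        self.queue = deque()     # pending (task_id, payload) pairs
        self.in_flight = {}      # task_id -> (worker, payload, deadline)
        self.results = {}        # task_id -> accepted result
        self._next_id = 0

    def register(self, worker):
        self.workers.add(worker)

    def create_task(self, payload):
        self._next_id += 1
        self.queue.append((self._next_id, payload))
        return self._next_id

    def assign(self, worker):
        """Hand the next pending task to a registered worker, or None."""
        if worker not in self.workers or not self.queue:
            return None
        task_id, payload = self.queue.popleft()
        deadline = self.clock() + self.timeout
        self.in_flight[task_id] = (worker, payload, deadline)
        return task_id, payload

    def submit(self, worker, task_id, result):
        """Accept a result only from the worker currently holding the task."""
        entry = self.in_flight.get(task_id)
        if entry is None or entry[0] != worker:
            return False
        del self.in_flight[task_id]
        self.results[task_id] = result
        return True

    def expire(self):
        """Requeue tasks whose deadline has passed; return their ids."""
        now = self.clock()
        expired = [tid for tid, (_, _, deadline) in self.in_flight.items()
                   if deadline <= now]
        for tid in expired:
            _, payload, _ = self.in_flight.pop(tid)
            self.queue.append((tid, payload))
        return expired
```

Passing the clock in as a parameter is what makes the timeout path testable without sleeping, which matters given how time dependent the workload is. A late result for an expired task is rejected here; whether that is the right call for flapping data is still an open question.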



Client (worker):

  • Mostly shared code base (re-use modules defining data as objects)

  • Registers with server

  • Accepts tasks from server, processes data, returns result

  • Keeps copy of current set of data it is responsible for processing, only returns changes to data, not whole update



Here's what I'm wondering:

How much of this is based on my assumptions about what I'll need underneath? I've already thought of the DB structure needed to support this, and how I'll link between all the structures in the data. If I assume I'm using some sort of NoSQL solution, such as MongoDB, CouchDB (or whatever it's called now), or something else, are there assumptions I can make about the system that reduce complexity?

Are there modules available (preferably in Perl) to manage some of the work assignment tasks for me?

I would prefer to pass object state back and forth for the tasks. I can imagine passing an object name and a way to initialize that object to the state defined; that's not too hard. I DO want the objects I'm passing to be easily abstracted to the DB on the server side. If the workers contain the same object code, can I do that without requiring the client to deal with DB code? That is, can I easily abstract the ORM layer out from the client? Maybe with roles using Moose?
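One way to picture that split, sketched in Python rather than Perl: the shared class carries only state and a wire format, and the storage code lives in a separate class that exists only on the server. The names here (`Resource`, `REGISTRY`, `from_wire`, `ResourceStore`) are hypothetical, the fields are made up, and a plain dict stands in for the database. This uses composition, not Moose roles; the same split could presumably be done with a storage role applied only on the server side.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Resource:
    """Shared by server and workers; knows nothing about storage."""
    name: str
    value: str
    checked_at: float = 0.0

    def to_wire(self):
        # Class name plus state is enough to rebuild the object remotely.
        return json.dumps({"class": type(self).__name__,
                           "state": asdict(self)})


# Classes both sides agree may be rebuilt from a message (hypothetical).
REGISTRY = {"Resource": Resource}


def from_wire(text):
    """Rebuild an object from to_wire() output; unknown classes raise KeyError."""
    message = json.loads(text)
    cls = REGISTRY[message["class"]]
    return cls(**message["state"])


class ResourceStore:
    """Server-only persistence; a dict stands in for the real database."""

    def __init__(self):
        self._rows = {}

    def save(self, resource):
        self._rows[resource.name] = asdict(resource)

    def load(self, name):
        return Resource(**self._rows[name])
```

The worker only ever imports `Resource` and `from_wire`, so it never touches DB code. Looking the class up in an explicit registry, rather than instantiating whatever name arrives, keeps a bad message from constructing arbitrary objects.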

If I use Moose, I know there's a startup speed penalty, which is not a problem. I'm more worried about any execution inefficiencies, since this is time dependent (to a sub-second level, though not quite to the millisecond level). I haven't had a chance to use Moose in a project yet, so I'm not aware of the specifics. I do hear it's tunable so I can omit features for speed, which is a nice trade-off.

Some representation for the changes in a data structure, or just JSON if that works as a common format, would be very useful. If I can find a module that provides this, great. Otherwise, I suspect I'll be writing my own after researching data diffs.
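As a starting point for that research, here is a rough sketch of what a hand-rolled diff over JSON-style data might look like, in Python rather than Perl. It assumes the data is nested dicts with string keys and treats lists and scalars as opaque values. The `diff` and `patch` functions and the `op` / `path` / `value` record shape are my own invention; the shape loosely resembles JSON Patch (RFC 6902) but is not an implementation of it.

```python
import json


def diff(old, new, path=()):
    """List the operations that turn dict `old` into dict `new`."""
    ops = []
    for key in sorted(old.keys() | new.keys()):
        here = path + (key,)
        if key not in new:
            ops.append({"op": "remove", "path": list(here)})
        elif key not in old:
            ops.append({"op": "set", "path": list(here), "value": new[key]})
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            # Recurse so only the changed leaves are reported.
            ops.extend(diff(old[key], new[key], here))
        elif old[key] != new[key]:
            ops.append({"op": "set", "path": list(here), "value": new[key]})
    return ops


def patch(doc, ops):
    """Apply operations from diff() to a copy of `doc` and return the copy."""
    result = json.loads(json.dumps(doc))  # cheap deep copy for JSON-safe data
    for op in ops:
        target = result
        for key in op["path"][:-1]:
            target = target[key]
        last = op["path"][-1]
        if op["op"] == "remove":
            del target[last]
        else:
            target[last] = op["value"]
    return result
```

A worker would keep the copy it was sent, call `diff(original, updated)` after processing, and return only that list; the server would apply it with `patch`. Since the operations are plain dicts and lists, they serialize to JSON as they are.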

In any case, I'll update here as I come to conclusions or find solutions.
