Cook Roadmap
This page will eventually contain a concrete, time-oriented roadmap. In the meantime, it serves to aggregate the features and projects that Cook would benefit from.
Today, Cook exposes an HTTP/REST API. This API is useful, but it is difficult to add support for new languages, particularly with a native interface. There's also no good way to get server-side, event-driven updates about a job or set of jobs that a connection is subscribed to, and the JSON representation isn't efficient for transmitting data about thousands of jobs. Building a gRPC API would give us a higher-performance API that's easier to integrate into target languages.
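To make the idea concrete, here is a minimal sketch of what a streaming subscription could look like from Python. It assumes a hypothetical `cook.proto` with a `CookScheduler` service and a `SubscribeJobUpdates` server-streaming RPC; none of these names exist today.

```python
# Purely illustrative sketch of a streaming job-update subscription over gRPC.
# The cook.proto file, the generated cook_pb2 / cook_pb2_grpc modules, the
# CookScheduler service, and every message/field name here are hypothetical.
import grpc

import cook_pb2        # hypothetical generated messages
import cook_pb2_grpc   # hypothetical generated service stubs


def watch_jobs(job_uuids, address="cook.example.com:5005"):
    """Yield incremental status updates for the given job UUIDs."""
    channel = grpc.insecure_channel(address)
    stub = cook_pb2_grpc.CookSchedulerStub(channel)
    request = cook_pb2.SubscribeRequest(job_uuids=job_uuids)
    # A server-streaming RPC pushes updates as jobs change state, instead of
    # the client polling the REST endpoint and re-parsing large JSON blobs.
    for update in stub.SubscribeJobUpdates(request):
        yield update
```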
Beyond the usefulness of gRPC on its own, we would want to provide pre-made client libraries for any language we want to support as first class. There are two main features for the client libraries:
- The library should handle bundling delta updates into full state structs to pass to the client, so that the client doesn't need to receive large status updates from the server, and so that the user of the API doesn't need to merge the deltas themselves. This should be easy with protobuf's `merge_from` functionality.
- The library should manage UUIDs for newly submitted jobs on its own. Importantly, it should serialize those UUIDs to a local instance of SQLite or another multi-language database. This way, we can provide a set of Python command-line tools to interact with and inspect jobs launched by a single client. This functionality must be extensible, so that other databases and integrations can be added easily. (A sketch covering both features follows this list.)
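As a rough sketch of both features, assuming a hypothetical generated `JobStatus` message: the client library keeps a local cache of full job states, folds server deltas into them with the protobuf merge operation (spelled `MergeFrom` in the Python API), and records submitted UUIDs in a local SQLite file.

```python
# Sketch of the two client-library responsibilities described above, using only
# the Python standard library and the protobuf runtime. JobStatus stands in for
# a hypothetical generated protobuf message describing one job.
import sqlite3
import uuid

from cook_pb2 import JobStatus  # hypothetical generated message


class JobTracker:
    def __init__(self, db_path="cook_jobs.db"):
        # Local SQLite file so Python command-line tools can later inspect
        # every job submitted by this client.
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS jobs (uuid TEXT PRIMARY KEY, name TEXT)")
        self.full_state = {}  # job uuid -> full JobStatus message

    def new_job_uuid(self, name):
        # The library, not the user, picks and persists the job UUID.
        job_uuid = str(uuid.uuid4())
        self.db.execute("INSERT INTO jobs VALUES (?, ?)", (job_uuid, name))
        self.db.commit()
        return job_uuid

    def apply_delta(self, job_uuid, delta):
        # Fold a partial JobStatus update into the cached full state, so the
        # user of the API always sees a complete struct rather than a delta.
        current = self.full_state.setdefault(job_uuid, JobStatus())
        current.MergeFrom(delta)
        return current
```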
Today, Cook has support for Spark in coarse-grained scheduling mode. We could add support for Spark's fine-grained scheduling mode, or for other computational frameworks such as Presto and Hadoop v2.
Cook was built to federate between multiple data centers and automatically balance the set of available jobs between them. This will enable more robust support for hybrid-cloud and multi-region deployments, so that jobs continue to run in spite of the loss of an entire data center.
The major remaining design question is what the semantics of losing a region should be: should we assume that all jobs there are "orphaned", or should every job be replicated to every region, so that regions can autonomously attempt to reclaim and run jobs that previously failed?
Cook should have a DCOS package for getting started easily on DCOS clusters.
Cook could be used to support services that can be preempted. Although this is possible today, Cook assumes that all jobs should complete within a few days, and will take steps to ensure they don't run forever. Supporting services would simply require adding new considerations for service jobs.
Cook could have a mechanism for expressing locality information, so that Fenzo could try to place tasks on hosts with better locality scores. Locality could be used to represent hosts that already have particular URIs on them (thanks to the fetcher cache); it could also be used to have jobs run on hosts that are closer to the datasets/caches they need to use.
Locality would need to be designed to be extensible in a simple manner, so that custom locality hints are easy to express. We must enumerate several use cases, so that we can form a design that covers them.
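For illustration only, one extensible shape would be a list of named, pluggable hints attached to a job at submission time. The `locality-hints` field and the hint kinds below are invented for this sketch and are not part of Cook's API.

```python
# Hypothetical illustration of extensible locality hints attached at submission
# time. The "locality-hints" field and the hint kinds do not exist in Cook's
# API today; they only show the shape an extensible design might take.
import uuid

import requests

job = {
    "uuid": str(uuid.uuid4()),
    "command": "train.sh",
    "cpus": 4,
    "mem": 8192,
    "locality-hints": [
        # Prefer hosts that already fetched this URI (Mesos fetcher cache).
        {"kind": "fetcher-cache-uri", "value": "hdfs://data/model-v3.tar.gz"},
        # Prefer hosts near a dataset; a custom plugin would score candidates.
        {"kind": "dataset", "value": "warehouse/events/2016-11"},
    ],
}

# A locality-aware Fenzo fitness calculator would read these hints when
# scoring candidate hosts for the task.
requests.post("http://cook.example.com:12321/rawscheduler", json={"jobs": [job]})
```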
It would be great if Cook hosted a human-viewable web page that made it easy to see the state of the system: both its health and what it's actually doing. This could be an excellent interactive administrative tool.
Cook should be able to run an instance of Datomic Free as a task, use it as its DB, and automatically manage that DB, back it up, and restart it when necessary, since it's difficult for some organizations to use and manage Datomic themselves. Alternatively, it wouldn't be too hard to refactor Cook to use any DB that supports KV data with secondary indices and efficient bulk queries on those indices. Either of these would make Cook easier for new users to deploy and administer.
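To clarify what that storage contract means, here is a tiny sketch, with SQLite standing in for any candidate store: key-value rows keyed by job UUID, secondary indices, and bulk queries over those indices.

```python
# Minimal illustration of the storage contract described above: key-value rows
# plus secondary indices that support efficient bulk queries. Any store meeting
# this contract could back Cook; SQLite is used here only as a stand-in.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE jobs (uuid TEXT PRIMARY KEY, user TEXT, state TEXT, blob BLOB)")
# Secondary indices enable the bulk queries the scheduler needs, e.g.
# "all waiting jobs" or "all jobs owned by a user".
db.execute("CREATE INDEX jobs_by_state ON jobs (state)")
db.execute("CREATE INDEX jobs_by_user ON jobs (user)")

# Bulk query against a secondary index: fetch every runnable job in one shot.
waiting = db.execute(
    "SELECT uuid, blob FROM jobs WHERE state = ?", ("waiting",)).fetchall()
```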