okdataset is a proof of concept of a lightweight, on-demand map-reduce cluster, intended for prototyping and analysis on small- to medium-sized datasets. A few motivations for creating okdataset:
- An API covering the majority of the PySpark API.
- No JVMs - tuning JVMs is tedious and unnecessary, and this is all Python anyway.
- Supportability - simple JSON-structured logging and basic fine-grained profiling.
- Containers containers containers!
- Schema-less - less time worrying about strict typing and unwinding.
A simple example, summing a list of integers:

```python
from okdataset import ChainableList, Context, Logger

logger = Logger("sum example")
context = Context()

logger.info("Building list")
l = ChainableList([ x for x in xrange(1, 30) ])

logger.info("Building dataset")
ds = context.dataSet(l, label="sum", bufferSize=1)

logger.info("Calling reduce")
print ds.reduce(lambda x, y: x + y)

logger.info("All done!")
```
The cluster runs locally with Docker Compose:

```sh
docker-compose up -d            # starts master, worker, and redis
docker-compose scale worker=8   # scale workers
```
okdataset uses ZeroMQ as its messaging middleware and Redis as a distributed cache. A single server process implements a REQ/REP pattern for client-server connectivity, and a ventilator/sink pattern for delegating calculation tasks from the master to the workers.
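For readers unfamiliar with the ventilator/sink pattern, the sketch below shows the general shape of such a pipeline with pyzmq. It is illustrative only, not okdataset's actual wiring: the addresses, function names, and doubled-value payload are invented for the example.

```python
import zmq

def ventilator(tasks, push_addr="tcp://*:5557"):
    """Master side: fan tasks out to any connected workers."""
    ctx = zmq.Context.instance()
    sender = ctx.socket(zmq.PUSH)
    sender.bind(push_addr)
    # A real setup waits for workers to connect before pushing.
    for task in tasks:
        sender.send_pyobj(task)

def worker(pull_addr="tcp://localhost:5557", sink_addr="tcp://localhost:5558"):
    """Worker side: pull a task, compute, push the result downstream."""
    ctx = zmq.Context.instance()
    receiver = ctx.socket(zmq.PULL)
    receiver.connect(pull_addr)
    results = ctx.socket(zmq.PUSH)
    results.connect(sink_addr)
    while True:
        task = receiver.recv_pyobj()
        results.send_pyobj(task * 2)   # stand-in for applying a real method chain

def sink(n_results, pull_addr="tcp://*:5558"):
    """Master side: collect results from all workers."""
    ctx = zmq.Context.instance()
    receiver = ctx.socket(zmq.PULL)
    receiver.bind(pull_addr)
    return [receiver.recv_pyobj() for _ in range(n_results)]
```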
The `DataSet` class is a subclass of the `ChainableList` class, which is where the PySpark API subset is implemented, providing the usual `map`, `reduce`, `filter`, `flatMap`, and `reduceByKey`. Clients create `DataSet` objects from lists, push the data to the master (and thus into the cache), and push serialized (cloudpickle) methods to the server. `collect` calls then trigger the method chains to be applied to buffers of the dataset on the workers. Results are stored in the cache and aggregated by the master for return to the client.
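To make the chaining model concrete, here is a small, hypothetical sketch of a chainable list exposing that method subset, followed by a word-count style usage example. It is not okdataset's `ChainableList`: the class name, the single-process execution, and the assumption that `reduceByKey` operates on (key, value) tuples are simplifications for illustration.

```python
from collections import defaultdict

class MiniChainableList(list):
    """Simplified, hypothetical stand-in for a chainable list."""

    def map(self, f):
        return MiniChainableList(f(x) for x in self)

    def filter(self, f):
        return MiniChainableList(x for x in self if f(x))

    def flatMap(self, f):
        return MiniChainableList(y for x in self for y in f(x))

    def reduce(self, f):
        # assumes a non-empty list
        it = iter(self)
        acc = next(it)
        for x in it:
            acc = f(acc, x)
        return acc

    def reduceByKey(self, f):
        # assumes elements are (key, value) pairs
        groups = defaultdict(list)
        for k, v in self:
            groups[k].append(v)
        return MiniChainableList(
            (k, MiniChainableList(vs).reduce(f)) for k, vs in groups.items()
        )

# Usage: word count over a few lines of text
lines = MiniChainableList(["a b a", "b c"])
counts = (lines.flatMap(lambda s: s.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda x, y: x + y))
print(counts)   # e.g. [('a', 2), ('b', 2), ('c', 1)] (order may vary)
```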