okdataset is a proof of concept of a lightweight, on-demand map-reduce cluster, intended for prototyping and analysis on small- to medium-sized datasets. A few motivations for creating okdataset:
- An API covering the majority of the PySpark API.
- No JVMs - tuning JVMs is tedious and unnecessary, and this is all Python anyway.
- Supportability - simple JSON-structured logging and basic fine-grained profiling.
- Containers containers containers!
- Schema-less - less time worrying about strict typing and unwinding.
A simple example, summing a list of integers:

```python
from okdataset import ChainableList, Context, Logger

logger = Logger("sum example")
context = Context()

logger.info("Building list")
l = ChainableList([ x for x in xrange(1, 30) ])

logger.info("Building dataset")
ds = context.dataSet(l, label="sum", bufferSize=1)

logger.info("Calling reduce")
print ds.reduce(lambda x, y: x + y)

logger.info("All done!")
```
The cluster runs locally with Docker Compose:

```sh
docker-compose up -d            # starts master, worker, and redis
docker-compose scale worker=8   # scale workers
```
okdataset uses ZeroMQ as its messaging middleware and Redis as a distributed cache. A single server process implements a REQ/REP pattern for client-server connectivity, and a ventilator/sink pattern for delegating calculation tasks from the master to the workers.
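For readers unfamiliar with the ventilator/sink pattern, the sketch below shows the general shape of such a pipeline with pyzmq. It is illustrative only, not okdataset's actual wiring: the addresses, function names, and doubled-value payload are invented for the example.

```python
import zmq

def ventilator(tasks, push_addr="tcp://*:5557"):
    """Master side: fan tasks out to any connected workers."""
    ctx = zmq.Context.instance()
    sender = ctx.socket(zmq.PUSH)
    sender.bind(push_addr)
    # A real setup waits for workers to connect before pushing.
    for task in tasks:
        sender.send_pyobj(task)

def worker(pull_addr="tcp://localhost:5557", sink_addr="tcp://localhost:5558"):
    """Worker side: pull a task, compute, push the result downstream."""
    ctx = zmq.Context.instance()
    receiver = ctx.socket(zmq.PULL)
    receiver.connect(pull_addr)
    results = ctx.socket(zmq.PUSH)
    results.connect(sink_addr)
    while True:
        task = receiver.recv_pyobj()
        results.send_pyobj(task * 2)   # stand-in for applying a real method chain

def sink(n_results, pull_addr="tcp://*:5558"):
    """Master side: collect results from all workers."""
    ctx = zmq.Context.instance()
    receiver = ctx.socket(zmq.PULL)
    receiver.bind(pull_addr)
    return [receiver.recv_pyobj() for _ in range(n_results)]
```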
The `DataSet` class is a subclass of the `ChainableList` class, which is where the PySpark API subset is implemented, providing the usual `map`, `reduce`, `filter`, `flatMap`, and `reduceByKey`. Clients create `DataSet` objects from lists, push the data to the master (and thus into the cache), and push serialized (cloudpickle) methods to the server. `collect` calls then trigger the method chains to be applied to buffers of the dataset on the workers. Results are stored in the cache and aggregated by the master for return to the client.
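To make the chaining model concrete, here is a small, hypothetical sketch of a chainable list exposing that method subset, followed by a word-count style usage example. It is not okdataset's `ChainableList`: the class name, the single-process execution, and the assumption that `reduceByKey` operates on (key, value) tuples are simplifications for illustration.

```python
from collections import defaultdict

class MiniChainableList(list):
    """Simplified, hypothetical stand-in for a chainable list."""

    def map(self, f):
        return MiniChainableList(f(x) for x in self)

    def filter(self, f):
        return MiniChainableList(x for x in self if f(x))

    def flatMap(self, f):
        return MiniChainableList(y for x in self for y in f(x))

    def reduce(self, f):
        # assumes a non-empty list
        it = iter(self)
        acc = next(it)
        for x in it:
            acc = f(acc, x)
        return acc

    def reduceByKey(self, f):
        # assumes elements are (key, value) pairs
        groups = defaultdict(list)
        for k, v in self:
            groups[k].append(v)
        return MiniChainableList(
            (k, MiniChainableList(vs).reduce(f)) for k, vs in groups.items()
        )

# Usage: word count over a few lines of text
lines = MiniChainableList(["a b a", "b c"])
counts = (lines.flatMap(lambda s: s.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda x, y: x + y))
print(counts)   # e.g. [('a', 2), ('b', 2), ('c', 1)] (order may vary)
```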