Skip to content
forked from skale-me/skale

High performance distributed data processing engine

License

Notifications You must be signed in to change notification settings

sfast/skale-engine

 
 

Repository files navigation

skale-engine

Join the chat at https://gitter.im/skale-me/skale-engine Build Status Build Status

High performance distributed data processing engine

Skale-engine is a fast and general purpose distributed data processing system. It provides a high-level API in Javascript and an optimized parallel execution engine on top of NodeJS.

The following figure is a comparison against Spark for a logistic regression program:

logreg1

Word count using skale:

var sc = require('skale-engine').context();

sc.textFile('/path/...')
  .flatMap(line => line.split(' '))
  .map(word => [word, 1])
  .reduceByKey((a, b) => a + b, 0)
  .count().then(console.log);

Features

  • In-memory computing
  • Controlled memory usage, spill to disk when necessary
  • Fast multiple distributed streams
  • realtime lazy compiling and running of execution graphs
  • workers can connect through TCP or websockets
  • simple standalone or fully distributed mode
  • very fast, see benchmark

Docs & community

Quickstart

The best and quickest way to get started with skale-engine is to use skale to create, run and deploy skale applications.

$ sudo npm install -g skale  # Install skale command once and for all
$ skale create my_app        # Create a new app, install skale-engine
$ cd my_app
$ skale run                  # Starts a local cluster if necessary and run

Examples

In the following, we bypass skale toolbelt, and use directly and only skale-engine. It's for you if you are rather more interested by the skale-engine architecture, details and internals.

To run the internal examples, clone the skale-engine repository and install the dependencies:

$ git clone git://github.com/skale-me/skale-engine.git --depth 1
$ cd skale-engine
$ npm install

Then start a skale-engine server and workers on local host:

$ npm start

Then run whichever example you want

$ ./examples/wordcount.js /etc/hosts

Standalone local mode

The standalone mode is the default operating mode. All the processes, master and workers are running on the local host, using the cluster core NodeJS module. This mode is the simplest to operate: no dependency, and no server nor cluster setup and management required. It is used as any standard NodeJS package: simply require('skale-engine'), and that's it.

This mode is perfect for development, fast prototyping and tests on a single machine (i.e. a laptop). For unlimited scalibity, see distributed mode below.

Distributed mode

The distributed mode allows to run the exact same code as in standalone over a network of multiple machines, thus achieving horizontal scalability.

The distributed mode involves two executables, which must be running prior to launch application programs:

  • a skale-server process, which is the access point where the master (user application) and workers (running slaves) connect to, either by direct TCP connections, or by websockets.
  • A skale-worker process, which is a worker controller, running on each machine of the computing cluster, and connecting to the skale-server. The worker controller will spawn worker processes on demand (typically one per CPU), each time a new job is submitted.

To run in distributed mode, the environment variable SKALE_HOST must be set to the skale-server hostname or IP address. If unset, the application will run in standalone mode. Multiple applications, each with its own set of workers and master processes can run simultaneously using the same server and worker controllers.

Although not mandatory, running an external HTTP server on worker hosts, exposing skale temporary files, allows efficient peer-to-peer shuffle data transfer between workers. If not available, this traffic will go through the centralized skale-server. Any external HTTP server such as nginx, apache or busybox httpd, or even NodeJS (although not the most efficient for static file serving) will do.

For further details, see command line help for skale-worker and skale-server.

Docker

A sample docker image and compose file are provided, to run a minimal distributed skale stack for experimentation purpose, on a single machine or on a swarm cluster, as explained here.

Tests

To run the test suite, first install the dependencies, then run npm test:

$ npm install
$ npm test

People

The original authors of skale-engine are Cedric Artigue and Marc Vertes.

List of all contributors

License

Apache-2.0

About

High performance distributed data processing engine

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • JavaScript 98.8%
  • Other 1.2%