Benchmark
The data source is traffic data from NDW (Nationaal Dataportaal Wegverkeer). We use a dataset which contains measurements from a set of road measurement locations in the Netherlands. At these locations, sensors count the number of cars that pass by on a certain lane and measure the average speed of these cars over a time interval.
We use the Data Stream Generator to generate an input stream on two Kafka topics: ndwflow and ndwspeed. Flow measurements (ndwflow) describe the number of cars that passed by a certain measurement point and lane during a time frame. The speed measurements (ndwspeed) describe the average speed of the cars that passed by a certain measurement point and lane during that time frame. The throughput and data characteristics of the stream can be regulated with environment variables. More information can be found in the documentation of the Data Stream Generator.
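For illustration, the parsed measurements can be represented with case classes along these lines. The field names below are assumptions; the actual JSON schema is documented with the Data Stream Generator:

```scala
// Hypothetical shape of the parsed measurements; the actual field names
// are defined by the Data Stream Generator and may differ.
case class FlowObservation(
  measurementId: String, // road measurement location
  internalId: String,    // lane of the road
  timestamp: Long,       // epoch millis of the observation
  flow: Int              // number of cars counted in the interval
)

case class SpeedObservation(
  measurementId: String,
  internalId: String,
  timestamp: Long,
  speed: Double          // average speed in the interval
)
```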
This benchmark suite aims to include the most important stream operations and to study the influence of different implementations: high-level vs. low-level DSL, tumbling vs. interval join, tumbling vs. sliding window, etc. Therefore, multiple pipelines have been included.
Each of these pipelines has different complexities and characteristics. There are three main groups of pipelines. The first group consists of the stateless pipelines: ingest and ETL. The second group consists of the simple stateful pipelines with join and windowing operations. Finally, there are the complex pipelines which contain multiple stateful operations: the end-to-end analytics pipelines.
It is important to include these different pipelines since they stress different aspects of the system: CPU, memory, network, IO, shuffling, state management, and so on.
Most of these pipelines are derived from a base pipeline, which we explain here to keep the following sections concise. A sketch of its stateful steps follows the list below. The base pipeline looks as follows:
Base pipeline (adapted from [1])
- Ingest: Reading JSON event data from Kafka from two input streams: the speed stream and the flow stream.
- Parse: Parsing the data from the two streams and extracting the useful fields.
- Join: Joining the flow stream and the speed stream. The join is done on the key: measurement ID, internal ID and timestamp rounded down to second level. The internal ID describes the lane of the road. After the join, we know how many cars passed and how fast they drove for each lane of the road and time period.
- Tumbling window: Aggregating the speed and the flow over all the lanes belonging to the same measurement ID. The data is grouped by measurement ID and timestamp (rounded down to second level), and we compute the average speed and the accumulated flow over all the lanes. The tumbling window is computed using a reduce function and is therefore incremental.
- Sliding window: Computing the relative change in flow and speed over two lookback periods: a short one and a long one. The lengths of the lookback periods are defined in the configuration file. The relative change is calculated as (speed(t) - speed(t-1)) / speed(t-1).
- Output: Publishing the output back to Kafka.
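As a minimal, framework-agnostic sketch of these stateful steps (assuming the hypothetical case classes above; the real components use the join and windowing APIs of each framework):

```scala
// Join key: measurement ID, internal ID (lane) and the timestamp
// rounded down to second level.
def joinKey(measurementId: String, internalId: String, ts: Long): (String, String, Long) =
  (measurementId, internalId, (ts / 1000) * 1000)

// Tumbling window: per (measurementId, second) group, lane records are
// reduced incrementally into a small partial aggregate.
case class LaneAggregate(accumulatedFlow: Int, speedSum: Double, laneCount: Int) {
  def combine(that: LaneAggregate): LaneAggregate =
    LaneAggregate(
      accumulatedFlow + that.accumulatedFlow,
      speedSum + that.speedSum,
      laneCount + that.laneCount)
  def averageSpeed: Double = speedSum / laneCount
}

// Sliding window: relative change of a metric over a lookback period,
// e.g. (speed(t) - speed(t-1)) / speed(t-1).
def relativeChange(current: Double, previous: Double): Double =
  (current - previous) / previous
```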
The pipeline can be executed up to a configurable stage: the more stages you add, the more complex the pipeline becomes. The stage to stop at is set by the environment variable LAST_STAGE, as explained in the documentation of each of the components. The last stage index is given next to the arrow of the output to Kafka. The following table gives an overview of the executed pipeline for each LAST_STAGE value:
| LAST_STAGE | Pipeline | Type |
|---|---|---|
| 0 | ingest - publish | Stateless |
| 1 | ingest - parse - publish | Stateless |
| 2 | ingest - parse - join - publish | Stateful |
| 3 | ingest - parse - join - tumbling window - publish | Stateful |
| 4 | ingest - parse - join - tumbling window - sliding window - publish | Stateful |
| 5 | ingest - parse - join - low-level tumbling window - publish | Stateful |
| 6 | ingest - parse - join - low-level tumbling window - low-level sliding window - publish | Stateful |
| 100 | ingest - parse - incremental tumbling/sliding window | Stateful |
| 101 | ingest - parse - non-incremental tumbling/sliding window | Stateful |
The single-digit values correspond to pipelines that are subsets of the base pipeline. The values 100 and 101 correspond to pipelines that focus on one particular stateful operation.
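For illustration, a job could read the stage index along these lines; this is only a sketch, and the components document their actual configuration handling and defaults:

```scala
// Sketch: read the stage index from the environment. The default used
// here is an assumption, not the components' documented default.
val lastStage: Int = sys.env.getOrElse("LAST_STAGE", "4").toInt
require((0 to 6).contains(lastStage) || lastStage == 100 || lastStage == 101,
  s"unsupported LAST_STAGE value: $lastStage")
```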
Stateless pipelines do simple transformations for which no other events are required. Within this benchmarking suite, two stateless pipelines are included.
This pipeline executes the base pipeline until stage 0. It reads data from Kafka and directly writes it back to Kafka without applying any transformations. It is essentially an empty pipeline which can be used to measure network and framework overhead.
Characteristics:
- Stateless
- No operations (only adding a timestamp)
This pipeline executes the base pipeline until stage 1 and mimics a simple ETL pipeline.
Characteristics:
- Stateless
- Simple transformations: parsing JSON
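As a sketch of this stage, the raw JSON could be decoded into the hypothetical case classes from above, for example with circe (the components may use a different JSON library, and the field names and sample values here are assumptions):

```scala
import io.circe.generic.auto._
import io.circe.parser.decode

// Hypothetical raw message; field names and values are illustrative only.
val raw =
  """{"measurementId":"RWS01_MONIBAS_0021hb0001ra","internalId":"lane1","timestamp":1585055940000,"flow":3}"""

// Decode into the FlowObservation case class sketched earlier.
val parsed: Either[io.circe.Error, FlowObservation] = decode[FlowObservation](raw)
```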
These pipelines allow benchmarking the performance of common stateful operations: joins, tumbling windows and sliding windows. Each pipeline contains only one stateful operation.
This pipeline executes the base pipeline until stage 2.
To test different join implementations, we included a pipeline which ingests two streams and joins them together. For Flink, both an interval join and a tumbling window join have been included to study the performance differences between them. For Spark Streaming, only a tumbling window join can be done. For Structured Streaming and Kafka Streams, we do interval joins. A sketch of both Flink variants follows the characteristics below.
Characteristics:
- Stateful
- Short window
- Quickly changing key space
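A minimal sketch of the two Flink variants, assuming the hypothetical case classes from above and input streams that already carry event-time timestamps and watermarks; the interval bounds and window size are illustrative, not the benchmark's actual settings:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

object JoinVariants {
  // Join key: measurement ID, lane and timestamp rounded down to the second.
  def key(measurementId: String, internalId: String, ts: Long): (String, String, Long) =
    (measurementId, internalId, (ts / 1000) * 1000)

  // Variant 1: interval join on the keyed streams.
  def intervalJoin(
      flows: DataStream[FlowObservation],
      speeds: DataStream[SpeedObservation]): DataStream[(FlowObservation, SpeedObservation)] =
    flows.keyBy(f => key(f.measurementId, f.internalId, f.timestamp))
      .intervalJoin(speeds.keyBy(s => key(s.measurementId, s.internalId, s.timestamp)))
      .between(Time.milliseconds(-1000), Time.milliseconds(1000)) // illustrative bounds
      .process(new ProcessJoinFunction[FlowObservation, SpeedObservation,
                                       (FlowObservation, SpeedObservation)] {
        override def processElement(
            f: FlowObservation,
            s: SpeedObservation,
            ctx: ProcessJoinFunction[FlowObservation, SpeedObservation,
                                     (FlowObservation, SpeedObservation)]#Context,
            out: Collector[(FlowObservation, SpeedObservation)]): Unit =
          out.collect((f, s))
      })

  // Variant 2: tumbling window join over one-second windows.
  def tumblingWindowJoin(
      flows: DataStream[FlowObservation],
      speeds: DataStream[SpeedObservation]): DataStream[(FlowObservation, SpeedObservation)] =
    flows.join(speeds)
      .where(f => key(f.measurementId, f.internalId, f.timestamp))
      .equalTo(s => key(s.measurementId, s.internalId, s.timestamp))
      .window(TumblingEventTimeWindows.of(Time.seconds(1))) // illustrative window size
      .apply((f, s) => (f, s))
}
```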
The implementations of the sliding and tumbling windows rely on the same code base: the settings for the window duration and the slide duration determine whether a tumbling or a sliding window is executed. When the slide duration and the window duration are equal, the window is tumbling; when the window duration is a multiple of the slide duration, the window is sliding.
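To make this relation concrete, the windows containing a given event can be enumerated as follows (a framework-agnostic sketch, not the benchmark's implementation):

```scala
// Start timestamps (epoch millis) of all windows containing an event at
// time ts, for a given window duration and slide duration.
def windowStarts(ts: Long, windowMs: Long, slideMs: Long): Seq[Long] = {
  val lastStart = ts - (ts % slideMs) // start of the most recent window
  lastStart until (lastStart - windowMs) by -slideMs
}

windowStarts(5500, windowMs = 1000, slideMs = 1000) // tumbling: Seq(5000)
windowStarts(5500, windowMs = 2000, slideMs = 1000) // sliding:  Seq(5000, 4000)
```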
We implemented two versions: incremental and non-incremental. A comparison sketch follows the two characteristics lists below.
This version computes the window results incrementally as data comes in using a reduce function. Therefore, it doesn't require much state to be kept.
Characteristics:
- Stateful
- Longer windows
- Unchanging key space (measurement IDs)
- Only keeps one value per key so small state
- Infrequent output (once per slide interval, i.e. one minute by default)
- Not much shuffling since only one stateful operation
The second implementation keeps all the events of the window in state and computes the window result at the end of the window.
Characteristics:
- Stateful
- Longer windows
- Unchanging key space (measurement IDs)
- Few keys and many values per key
- Large base state because all events are kept in state
- Infrequent output (once per slide interval, i.e. one minute by default)
- Not much shuffling since only one stateful operation
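The difference between the two versions can be sketched as follows (framework-agnostic; the real implementations use the frameworks' window APIs):

```scala
// Shared aggregate for a window of (flow, speed) events.
case class WindowAggregate(count: Long, flowSum: Long, speedSum: Double)

// Incremental: fold every event into a running aggregate as it arrives;
// only one small aggregate per key is kept in state.
def incremental(events: Iterator[(Int, Double)]): WindowAggregate =
  events.foldLeft(WindowAggregate(0, 0L, 0.0)) { case (agg, (flow, speed)) =>
    WindowAggregate(agg.count + 1, agg.flowSum + flow, agg.speedSum + speed)
  }

// Non-incremental: buffer the entire window content in state and
// aggregate only when the window closes.
def nonIncremental(events: Iterator[(Int, Double)]): WindowAggregate = {
  val buffered = events.toVector // all events of the window held in state
  WindowAggregate(buffered.size.toLong, buffered.map(_._1.toLong).sum, buffered.map(_._2).sum)
}
```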
Complex stateful pipelines contain multiple chained stateful operations. Chaining operations leads to more shuffling and higher latencies from the consecutive buffering. Therefore, they form an important part of benchmarking since they stress different aspects of the system.
This pipeline executes the base pipeline until stage 3.
Characteristics:
- Stateful
- 2x shuffling between stages
- Quickly changing key space
- Short windows
- Not much state but quickly renewing
- Frequent output (every second)
This pipeline executes the base pipeline until stage 4.
Characteristics:
- Stateful
- 3x shuffling between stages
- Quickly changing key space
- Short windows
- Not much state but quickly renewing
- Frequent output (every second)
[1] van Dongen, G., & Van den Poel, D. (2020). Evaluation of Stream Processing Frameworks. IEEE Transactions on Parallel and Distributed Systems, 31(8), 1845-1858.
This work has been made possible by Klarrio