
Benchmark


Data

The data source is traffic data from the NDW (Nationaal Dataportaal Wegverkeer), the Dutch national road traffic data portal. We use a dataset that contains measurements from a set of road measurement locations in the Netherlands. At these locations, sensors count the number of cars that pass by on a certain lane and measure the average speed of these cars within a time interval.

We use the Data Stream Generator to generate an input stream on two Kafka topics: ndwflow and ndwspeed. Flow measurements (ndwflow) describe the number of cars that passed by a certain measurement point and lane. Speed measurements (ndwspeed) describe the average speed of the cars that passed by a certain measurement point and lane during that time frame. The throughput and data characteristics of the stream can be regulated with environment variables; more information can be found in the documentation of the Data Stream Generator.
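To make the two message types concrete, the following is a minimal sketch of how they could be modeled; the field names are illustrative assumptions, not the exact NDW message schema.

```scala
// Illustrative sketch only: field names are assumptions, not the exact NDW schema.
// A flow event counts cars per measurement point and lane in an interval;
// a speed event carries the average speed for the same point and lane.
case class FlowEvent(
  measurementId: String, // ID of the road measurement location
  internalId: String,    // ID of the lane at that location
  timestamp: Long,       // epoch millis of the measurement interval
  flow: Int              // number of cars counted in the interval
)

case class SpeedEvent(
  measurementId: String,
  internalId: String,
  timestamp: Long,
  speed: Double          // average speed of the counted cars
)
```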

Processing Pipelines

This benchmark suite aims to include the most important stream operations and to study the influence of different implementations: high-level vs. low-level DSL, tumbling window join vs. interval join, tumbling vs. sliding window, etc. Therefore, multiple pipelines have been included.

Each of these pipelines has different complexities and characteristics. There are three main groups of pipelines. The first group consists of the stateless pipelines: ingest and ETL. The second group consists of simple stateful pipelines with joins and windowing operations. Finally, there are the complex pipelines, which chain multiple stateful operations into end-to-end analytics pipelines.

It is important to include these different pipelines since they stress different aspects of the system: CPU, memory, network, I/O, shuffling, state management, etc.

Most of these pipelines are derived from a base pipeline, which we explain here to keep the following sections concise. The base pipeline looks as follows:

(Figure: processing flow of the base pipeline; adapted from [1])

  1. Ingest: Reading JSON event data from Kafka on two input streams: the speed stream and the flow stream.

  2. Parse: Parsing the data from the two streams and extracting the useful fields.

  3. Join: Joining the flow stream and the speed stream. The join key consists of the measurement ID, the internal ID and the timestamp rounded down to the second. The internal ID identifies the lane of the road. After the join, we know for each lane and time period how many cars passed and how fast they drove.

  4. Tumbling window: Aggregating the speed and the flow over all the lanes belonging to the same measurement ID. Here, the data is grouped by measurement ID and timestamp (rounded down to the second), and we compute the average speed and the accumulated flow over all the lanes. The tumbling window is computed using a reduce function and is therefore incremental.

  5. Sliding window: For the sliding window stage, we compute the relative change in flow and speed over two lookback periods: a short one and a long one. The length of the lookback periods is defined in the configuration file. The relative change (of e.g. the speed) is calculated as follows; see also the sketch after this list:

                       (speed(t) - speed(t-1)) / speed(t-1)
    
  6. Output: Publishing output back to Kafka.
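As a minimal illustration of the relative-change formula above (the function and example values are ours, not taken from the benchmark code):

```scala
// Relative change between the current value and the value one lookback
// period ago. Returns e.g. 0.25 for a 25% increase; assumes previous != 0.
def relativeChange(current: Double, previous: Double): Double =
  (current - previous) / previous

// Example: the speed dropped from 100 km/h to 80 km/h over the lookback period.
val change = relativeChange(current = 80.0, previous = 100.0) // -0.2, i.e. -20%
```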

The pipeline does not have to be executed in full: the more stages you enable, the more complex it becomes. Which stages run is controlled by the environment variable LAST_STAGE, as explained in the documentation of each of the components. In the figure of the base pipeline, the index of the last stage is shown next to the arrow of the output to Kafka. The following table gives an overview of the executed pipeline for each LAST_STAGE value:

| LAST_STAGE | Pipeline | Type |
|------------|----------|------|
| 0 | ingest - publish | Stateless |
| 1 | ingest - parse - publish | Stateless |
| 2 | ingest - parse - join - publish | Stateful |
| 3 | ingest - parse - join - tumbling window - publish | Stateful |
| 4 | ingest - parse - join - tumbling window - sliding window - publish | Stateful |
| 5 | ingest - parse - join - low-level tumbling window - publish | Stateful |
| 6 | ingest - parse - join - low-level tumbling window - low-level sliding window - publish | Stateful |
| 100 | ingest - parse - incremental tumbling/sliding window | Stateful |
| 101 | ingest - parse - non-incremental tumbling/sliding window | Stateful |

The single-digit indices denote pipelines that are subsets of the base pipeline. The indices 100 and 101 denote pipelines that each focus on one particular stateful operation.
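To illustrate how such a stage switch could work, here is a hypothetical, self-contained sketch; the stage functions are stubs standing in for the real ingest/parse/join/window stages, and only the variable name LAST_STAGE and the stage order come from this documentation:

```scala
// Hypothetical sketch of stage selection for LAST_STAGE values 0-4.
// The stage functions below are stubs, not the benchmark's actual code.
object StageSelection extends App {
  type Stream = Vector[String] // stand-in for a real streaming abstraction

  def ingest(): Stream                  = Vector("""{"raw":"event"}""")
  def parse(s: Stream): Stream          = s.map(e => s"parsed($e)")
  def join(s: Stream): Stream           = s.map(e => s"joined($e)")
  def tumblingWindow(s: Stream): Stream = s.map(e => s"tumbled($e)")
  def slidingWindow(s: Stream): Stream  = s.map(e => s"slid($e)")
  def publish(s: Stream): Unit          = s.foreach(println)

  val lastStage = sys.env.getOrElse("LAST_STAGE", "0").toInt

  // Each LAST_STAGE value enables one more stage on top of the previous one.
  val stages = Vector(parse _, join _, tumblingWindow _, slidingWindow _)
  val output = stages.take(lastStage).foldLeft(ingest())((s, stage) => stage(s))
  publish(output)
}
```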

Stateless pipelines

Stateless pipelines do simple transformations for which no other events are required. Within this benchmarking suite, two stateless pipelines are included.

Ingest

This pipeline executes the base pipeline up to stage 0. It reads data and directly writes this data back to Kafka without doing any transformations on it. It is essentially an empty pipeline that can be used to measure network and framework overhead.

(Architecture diagram)

Characteristics:

  • Stateless
  • No operations (only adding a timestamp)
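As an illustration, here is a minimal sketch of what stage 0 could look like in Flink's Scala DataStream API (Flink being one of the frameworks in the suite); the bootstrap address, consumer group and output topic names are placeholder assumptions:

```scala
// Ingest sketch (stage 0): consume from Kafka and publish straight back.
// Connector setup is simplified; addresses and topic names are placeholders.
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}

object IngestPipeline extends App {
  val props = new Properties()
  props.setProperty("bootstrap.servers", "kafka:9092")
  props.setProperty("group.id", "ospbench-ingest")

  val env = StreamExecutionEnvironment.getExecutionEnvironment

  // Read both input topics and write the events back out unchanged.
  for (topic <- Seq("ndwflow", "ndwspeed")) {
    env
      .addSource(new FlinkKafkaConsumer[String](topic, new SimpleStringSchema(), props))
      .addSink(new FlinkKafkaProducer[String](s"$topic-output", new SimpleStringSchema(), props))
  }

  env.execute("ingest")
}
```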

ETL: ingest, parse and publish

This pipeline executes the base pipeline up to stage 1 and mimics a simple ETL pipeline.

(Architecture diagram)

Characteristics:

  • Stateless
  • Simple transformations: parsing JSON
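A hedged sketch of the parse step, here using circe as an example JSON library; the case class repeats the illustrative schema from the Data section and is an assumption, not the exact NDW message format:

```scala
// Parse sketch: decode a raw JSON event into a typed record, dropping
// malformed events. Field names are illustrative assumptions.
import io.circe.generic.auto._
import io.circe.parser.decode

case class SpeedEvent(measurementId: String, internalId: String, timestamp: Long, speed: Double)

def parseSpeed(json: String): Option[SpeedEvent] =
  decode[SpeedEvent](json).toOption // None for malformed input

val sample = """{"measurementId":"loc-42","internalId":"lane1","timestamp":1620000000000,"speed":96.5}"""
println(parseSpeed(sample)) // Some(SpeedEvent(loc-42,lane1,1620000000000,96.5))
```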

Simple stateful pipelines

These pipelines allow benchmarking the performance of common stateful operations: joins, tumbling windows and sliding windows. Each pipeline contains only one stateful operation.

Join pipeline

This pipeline executes the base pipeline up to stage 2.

(Architecture diagram)

To test different join implementations, we included a pipeline which ingests two streams and joins them together. For Flink, both an interval join and a tumbling window join have been included to study the performance differences between them. For Spark Streaming, only a tumbling window join is possible. For Structured Streaming and Kafka Streams, we use interval joins.
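For illustration, a sketch of the tumbling window join variant in Flink's Scala API, keyed on measurement ID, lane and second-level timestamp as described above; the event types reuse the illustrative fields from the Data section, and the sketch assumes timestamps and watermarks have already been assigned:

```scala
// Tumbling window join sketch (not the benchmark's exact code).
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class FlowEvent(measurementId: String, internalId: String, timestamp: Long, flow: Int)
case class SpeedEvent(measurementId: String, internalId: String, timestamp: Long, speed: Double)
case class Joined(measurementId: String, internalId: String, timestamp: Long, flow: Int, speed: Double)

def joinStreams(speed: DataStream[SpeedEvent], flow: DataStream[FlowEvent]): DataStream[Joined] =
  speed
    .join(flow)
    // join key: measurement ID, lane ID and timestamp rounded down to the second
    .where(s => (s.measurementId, s.internalId, s.timestamp / 1000))
    .equalTo(f => (f.measurementId, f.internalId, f.timestamp / 1000))
    .window(TumblingEventTimeWindows.of(Time.seconds(1)))
    .apply((s, f) => Joined(s.measurementId, s.internalId, s.timestamp, f.flow, s.speed))
```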

Characteristics:

  • Stateful
  • Short window
  • Quickly changing key space

Tumbling or sliding window pipeline

The implementation of sliding and tumbling windows relies on the same code base. The settings for window duration and slide duration determine whether a tumbling or sliding window will be executed.

When the slide and window duration are equal:

(Architecture diagram)

When the window duration is a multiple of the slide duration:

(Architecture diagram)
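In Flink terms, for example, that choice could look like the following sketch; the duration values are placeholders for the configured settings:

```scala
// Sketch: the same code path yields a tumbling or a sliding window,
// depending only on the configured window and slide durations.
import org.apache.flink.streaming.api.windowing.assigners.{SlidingEventTimeWindows, TumblingEventTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time

val windowDuration = 60L // seconds; placeholder for the configured value
val slideDuration  = 60L // seconds; equal to windowDuration here => tumbling

val assigner =
  if (windowDuration == slideDuration)
    TumblingEventTimeWindows.of(Time.seconds(windowDuration))
  else
    SlidingEventTimeWindows.of(Time.seconds(windowDuration), Time.seconds(slideDuration))

// keyedStream.window(assigner) then behaves as a tumbling or sliding window
// without any further code changes.
```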

We implemented two versions: incremental and non-incremental.

Incremental with reduce

This version computes the window results incrementally as data comes in, using a reduce function. Therefore, it only needs to keep a small amount of state (see the sketch after the characteristics below).

Characteristics:

  • Stateful
  • Longer windows
  • Unchanging key space (measurement IDs)
  • Only keeps one value per key so small state
  • Infrequent output (once per slide interval, i.e. one minute by default)
  • Not much shuffling since only one stateful operation
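A hedged sketch of the incremental variant in Flink's Scala API; the aggregate record, field names and window lengths are illustrative placeholders:

```scala
// Incremental windowing sketch: reduce folds events into a single running
// aggregate per key, so only one value per key is kept in state.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Per measurement ID: total flow, plus the pieces needed to derive the
// average speed once the window fires.
case class Agg(measurementId: String, accumulatedFlow: Long, speedSum: Double, count: Long) {
  def avgSpeed: Double = speedSum / count
}

def windowed(aggs: DataStream[Agg]): DataStream[Agg] =
  aggs
    .keyBy(_.measurementId)
    .window(SlidingEventTimeWindows.of(Time.minutes(30), Time.minutes(1)))
    .reduce((a, b) =>
      Agg(a.measurementId,
          a.accumulatedFlow + b.accumulatedFlow,
          a.speedSum + b.speedSum,
          a.count + b.count))
```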

Non-incremental tumbling and sliding window pipeline

The second implementation keeps all the events of the window in state and computes the window result at the end of the window.

Characteristics:

  • Stateful
  • Longer windows
  • Unchanging key space (measurement IDs)
  • Few keys and many values per key
  • Large base state because all events are kept in state
  • Infrequent output (once per slide interval, i.e. one minute by default)
  • Not much shuffling since only one stateful operation
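For contrast, a sketch of the non-incremental variant: a process-style window function receives all buffered events at once, so the framework keeps every event in state until the window fires. Types, field names and window lengths are again illustrative:

```scala
// Non-incremental windowing sketch: all events of a window are buffered in
// state and handed to the function together when the window closes.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

case class Measurement(measurementId: String, flow: Int, speed: Double)
case class WindowResult(measurementId: String, accumulatedFlow: Long, avgSpeed: Double)

class ComputeAtWindowEnd extends ProcessWindowFunction[Measurement, WindowResult, String, TimeWindow] {
  override def process(key: String, context: Context,
                       elements: Iterable[Measurement], out: Collector[WindowResult]): Unit = {
    // `elements` holds every buffered event of this window: large state.
    val flows  = elements.map(_.flow.toLong).sum
    val speeds = elements.map(_.speed)
    out.collect(WindowResult(key, flows, speeds.sum / speeds.size))
  }
}

def windowed(ms: DataStream[Measurement]): DataStream[WindowResult] =
  ms.keyBy(_.measurementId)
    .window(SlidingEventTimeWindows.of(Time.minutes(30), Time.minutes(1)))
    .process(new ComputeAtWindowEnd)
```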

Complex stateful pipelines

Complex stateful pipelines contain multiple chained stateful operations. Chaining stateful operations leads to more shuffling and to higher latencies caused by the consecutive buffering. These pipelines therefore form an important part of the benchmark, since they stress different aspects of the system.

Join and tumbling window aggregation

This pipeline executes the base pipeline up to stage 3.

(Architecture diagram)

Characteristics:

  • Stateful
  • 2x shuffling between stages
  • Quickly changing key space
  • Short windows
  • Not much state but quickly renewing
  • Frequent output (every second)

Join, tumbling window and sliding window aggregation

This pipeline executes the base pipeline up to stage 4.

(Architecture diagram)

Characteristics:

  • Stateful
  • 3x shuffling between stages
  • Quickly changing key space
  • Short windows
  • Not much state but quickly renewing
  • Frequent output (every second)


References

[1] van Dongen, G., & Van den Poel, D. (2020). Evaluation of Stream Processing Frameworks. IEEE Transactions on Parallel and Distributed Systems, 31(8), 1845-1858.