Document metrics and consumer tuning based on metrics (#1280)
Also: fix typo and make metric descriptions consistent.
erikvanoosten authored Jul 14, 2024
1 parent 9a60ccd commit e6af874
Showing 5 changed files with 142 additions and 7 deletions.
4 changes: 4 additions & 0 deletions docs/consumer-internals.svg
36 changes: 36 additions & 0 deletions docs/consumer-tuning.md
@@ -90,3 +90,39 @@ On older zio-kafka versions `withMaxPollInterval` is not available. Use the foll
⚠️ In zio-kafka versions 2.2 up to 2.5.0 it may also be necessary to increase the `runloopTimeout` setting.
When no stream is processing data for this amount of time (while new data is available), the consumer will halt with a
failure. In zio-kafka 2.5.0 `runloopTimeout` defaults to 4 minutes, a little bit lower than `max.poll.interval.ms`.

## Using metrics to tune the consumer

Zio-kafka exposes [metrics](metrics.md) that can be used to tune the consumer further. To interpret these metrics, you need to know how zio-kafka works internally.

![Zio-kafka consumer internals](consumer-internals.svg)

The runloop is at the heart of every zio-kafka consumer.
It creates a zstream for each partition; eventually, this is the zstream your application consumes from.
When the zstream starts, and every time the records queue is empty, it sends a request for data to the runloop.
The request causes the runloop to resume the partition so that the next poll may receive records.
Any received records are put in the records queue.
When the queue reaches a certain size (as determined by the configured `FetchStrategy`), the partition is paused.
Meanwhile, the zstream reads from the queue and emits the records to your application.
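
For orientation, here is a minimal consumer sketch (the topic, group id, and broker address are placeholder assumptions). `plainStream` merges the per-partition zstreams described above into a single stream:

```scala
import zio._
import zio.kafka.consumer._
import zio.kafka.serde.Serde

// Minimal sketch; topic, group id and broker address are placeholders.
object ConsumerSketch extends ZIOAppDefault {

  private val consumerLayer: ZLayer[Any, Throwable, Consumer] =
    ZLayer.scoped(
      Consumer.make(
        ConsumerSettings(List("localhost:9092")).withGroupId("example-group")
      )
    )

  override val run =
    Consumer
      // Under the hood: one zstream (and one records queue) per assigned partition.
      .plainStream(Subscription.topics("example-topic"), Serde.string, Serde.string)
      .mapZIO(record => ZIO.debug(record.value).as(record.offset))
      .aggregateAsync(Consumer.offsetBatches)
      .mapZIO(_.commit)
      .runDrain
      .provide(consumerLayer)
}
```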

An optimally configured consumer has the following properties:

- the zstreams never have to wait for new records (to get high throughput),
- most of the time, the record queues are empty (to get low latency and low heap usage).

The following strategy can help you get to this state:

1. First make sure that `pollTimeout` and `max.poll.records` make sense for the latency and throughput requirements
of your application.
2. Configure `partitionPreFetchBufferLimit` to `0` (see the configuration sketch after this list).
3. Observe the metric `ziokafka_consumer_queue_polls`, which gives the number of polls during which records are idling
in the queue.
4. Increase `partitionPreFetchBufferLimit` in steps until most measurements of the `ziokafka_consumer_queue_polls`
histogram are in the `0` bucket.
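
Steps 1 and 2 are all `ConsumerSettings` configuration. A minimal sketch, assuming a recent zio-kafka version where `withPartitionPreFetchBufferLimit` is available; all values are examples, not recommendations:

```scala
import zio._
import zio.kafka.consumer.ConsumerSettings

// Sketch only: tune the values below for your own workload.
val settings: ConsumerSettings =
  ConsumerSettings(List("localhost:9092"))
    .withGroupId("example-group")
    .withPollTimeout(50.millis)          // step 1: how long a single poll may block
    .withMaxPollRecords(1000)            // step 1: `max.poll.records`
    .withPartitionPreFetchBufferLimit(0) // step 2: start at 0, then increase (step 4)
```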

During this process, it is useful to observe the metric `ziokafka_consumer_queue_size` (number of records in the
queues) to see whether the queues are indeed increasing in size.

When many (hundreds of) partitions need to be consumed, the metric `ziokafka_consumer_all_queue_size` should also be
observed, since increasing `partitionPreFetchBufferLimit` can lead to high heap usage. (See 'High number of partitions'
above.)
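
Zio-kafka publishes these metrics through ZIO's built-in metric system, so any ZIO metrics backend can expose them. A sketch, assuming the `zio.metrics.connectors` Prometheus connector is on the classpath; the intervals are arbitrary examples:

```scala
import zio._
import zio.metrics.connectors.{prometheus, MetricsConfig}
import zio.metrics.connectors.prometheus.PrometheusPublisher

// Sketch: periodically dump all metrics (including the ziokafka_consumer_*
// ones) in Prometheus exposition format. A real application would serve this
// text on an HTTP endpoint for Prometheus to scrape instead of printing it.
object MetricsDump extends ZIOAppDefault {
  override val run =
    ZIO
      .serviceWithZIO[PrometheusPublisher](_.get)
      .flatMap(ZIO.debug(_))
      .repeat(Schedule.fixed(30.seconds))
      .provide(
        ZLayer.succeed(MetricsConfig(interval = 5.seconds)), // capture interval
        prometheus.publisherLayer,
        prometheus.prometheusLayer
      )
}
```

The `ziokafka_consumer_queue_polls`, `ziokafka_consumer_queue_size`, and `ziokafka_consumer_all_queue_size` series can then be inspected in the printed output or in your Prometheus dashboard while you adjust `partitionPreFetchBufferLimit`.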