
USMON-830: Kafka fetch latency #27620

Merged
merged 25 commits into main on Aug 7, 2024

Conversation

@DanielLavie DanielLavie (Contributor) commented Jul 16, 2024

What does this PR do?

Adds support for measuring Kafka fetch request latency and incorporating it into the Kafka aggregation

Motivation

Additional Notes

For the initial latency check of a given stat, we cannot rely on the count field as we do with other protocols, because the kernel might report a count greater than 1 for a single transaction (it counts the number of Kafka produce/consume messages, which can exceed 1 per Kafka transaction). Therefore, we check that FirstLatencySample is greater than 0 instead, which replicates the same "first sample" behavior.
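
A minimal sketch of that check (field names follow this PR's kafka.RequestStat; everything else is illustrative, not the PR's actual code):

package kafka

// RequestStat shows only the fields relevant to the check above (illustrative subset).
type RequestStat struct {
	Count              int     // may exceed 1 for a single Kafka transaction
	FirstLatencySample float64 // latency (ns) of the first request seen for this stat
}

// hasInitialLatency reports whether a latency was already recorded for this stat.
// Unlike HTTP, we cannot test Count == 1 here, so we look at FirstLatencySample instead.
func hasInitialLatency(s *RequestStat) bool {
	return s.FirstLatencySample > 0
}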

Load test results:

(load test screenshots omitted from this text capture)

Staging Deployment

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes


pr-commenter bot commented Jul 16, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=40993567 --os-family=ubuntu

Note: This applies to commit 62a4515


pr-commenter bot commented Jul 16, 2024

Regression Detector

Regression Detector Results

Run ID: 2acd5305-61ba-4ad7-9d9f-2b73d9f1a52e Metrics dashboard Target profiles

Baseline: ff7f7c6
Comparison: 62a4515

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI links
tcp_syslog_to_blackhole ingress throughput +2.58 [-10.41, +15.57] Logs
basic_py_check % cpu utilization +1.65 [-1.01, +4.31] Logs
uds_dogstatsd_to_api_cpu % cpu utilization +0.47 [-0.41, +1.35] Logs
otel_to_otel_logs ingress throughput +0.08 [-0.73, +0.89] Logs
uds_dogstatsd_to_api ingress throughput -0.00 [-0.00, +0.00] Logs
tcp_dd_logs_filter_exclude ingress throughput -0.00 [-0.01, +0.01] Logs
idle memory utilization -0.10 [-0.13, -0.07] Logs
file_tree memory utilization -0.46 [-0.53, -0.39] Logs
pycheck_1000_100byte_tags % cpu utilization -0.58 [-5.28, +4.13] Logs

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
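
For illustration only, the decision rule above can be expressed roughly like this (a sketch; the names are made up and this is not the Regression Detector's actual code):

package regressiondetector

import "math"

// experiment holds the values reported per row in the table above.
type experiment struct {
	deltaMeanPct float64    // estimated Δ mean %
	ci           [2]float64 // 90% confidence interval for Δ mean %
	erratic      bool       // whether the experiment's configuration marks it "erratic"
}

// isRegression applies the three criteria listed above.
func isRegression(e experiment, tolerancePct float64) bool {
	bigEnough := math.Abs(e.deltaMeanPct) >= tolerancePct // criterion 1
	ciExcludesZero := e.ci[0] > 0 || e.ci[1] < 0           // criterion 2
	return bigEnough && ciExcludesZero && !e.erratic       // criterion 3
}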

@DanielLavie DanielLavie marked this pull request as ready for review July 18, 2024 16:47
@DanielLavie DanielLavie requested review from a team as code owners July 18, 2024 16:47
@leeavital leeavital (Contributor) left a comment

LGTM for NPM owned files (just a test)

Comment on lines +95 to +99
if newRequests.Latencies == nil {
	// In this case, newRequests must have only FirstLatencySample, so use it when adding the request
	r.AddRequest(statusCode, newRequests.Count, newRequests.StaticTags, newRequests.FirstLatencySample)
	continue
}
Contributor

where and how do you ensure it runs only for fetch requests?

@DanielLavie DanielLavie (Contributor Author) Jul 21, 2024

What’s the issue with it running for produce requests? The latency will always be 0 in that case. Additionally, on the backend we'll only be processing latency for fetch requests at the moment.

// This field holds the value (in nanoseconds) of the first HTTP request
// in this bucket. We do this as optimization to avoid creating sketches with
// a single value. This is quite common in the context of HTTP requests without
// keep-alives where a short-lived TCP connection is used for a single request.
Contributor

This is talking about HTTP; does this rationale hold for Kafka too?

Contributor Author

That's a good observation. From what I've seen, the answer is no; most modern Kafka clients are reusing sockets and setting keepalives. An example of this can be found in the Apache Kafka Java Client

Contributor Author

I'll work on removing the FirstLatencySample to simplify the code, as it appears to be an unnecessary optimization

@DanielLavie DanielLavie (Contributor Author) Aug 4, 2024

After discussing with @guyarb, to fully understand the performance impact, I plan to:

  1. Determine the difference between creating a ddsketch and a simple FirstLatency setup.
  2. Estimate how often the FirstLatency scenario occurs for a connection in the Datadog environment. To do this, we can use the datadog.network_tracer.usm.kafka.events_captured metric for the number of Kafka requests and compare it to the number of Kafka connections.

@DanielLavie DanielLavie (Contributor Author) Aug 4, 2024

Implemented the following benchmark tests:

// Assumed imports for this snippet (paths inferred, not shown in the original comment):
//   "testing"
//   "github.com/DataDog/sketches-go/ddsketch"
//   "github.com/stretchr/testify/require"
//   "github.com/DataDog/datadog-agent/pkg/network/protocols/kafka"
func benchmarkCreateRequestStat(b *testing.B, useLatencies bool) {
	b.ReportAllocs()
	b.ResetTimer()

	latencySample := 3.2
	count := 10

	for i := 0; i < b.N; i++ {
		if useLatencies {
			latencies, err := ddsketch.NewDefaultDDSketch(0.01)
			require.NoError(b, err)
			require.NoError(b, latencies.AddWithCount(latencySample, float64(count)))

			_ = kafka.RequestStat{
				Latencies: latencies,
				Count:     count,
			}
		} else {
			_ = kafka.RequestStat{
				FirstLatencySample: latencySample,
				Count:              count,
			}
		}
	}
}

func BenchmarkCreateRequestStatWithLatencies(b *testing.B) {
	benchmarkCreateRequestStat(b, true)
}

func BenchmarkCreateRequestStatWithFirstLatencySample(b *testing.B) {
	benchmarkCreateRequestStat(b, false)
}

Contributor Author

The benchmark results are as follows:
BenchmarkCreateRequestStatWithLatencies 1811176 664.7 ns/op 784 B/op 8 allocs/op
BenchmarkCreateRequestStatWithFirstLatencySample 1000000000 0.3180 ns/op 0 B/op 0 allocs/op
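
(These are standard go test -benchmem output columns: iteration count, ns/op, B/op, and allocs/op per benchmark.)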

Contributor Author

Results Breakdown

BenchmarkCreateRequestStatWithLatencies:

  • Time per operation (ns/op): 664.7 ns/op
    On average, each operation takes approximately 664.7 nanoseconds.

  • Memory allocated per operation (B/op): 784 B/op
    Each operation allocates around 784 bytes of memory.

  • Allocations per operation (allocs/op): 8 allocs/op
    Each operation results in 8 memory allocations.

BenchmarkCreateRequestStatWithFirstLatencySample:

  • Time per operation (ns/op): 0.3180 ns/op
    On average, each operation takes approximately 0.318 nanoseconds. This is extraordinarily fast and likely means the compiler optimized the struct creation away, so the benchmark is essentially measuring loop overhead.

  • Memory allocated per operation (B/op): 0 B/op
    Each operation does not allocate any memory.

  • Allocations per operation (allocs/op): 0 allocs/op
    Each operation results in no memory allocations.

Interpretation

Performance:

The BenchmarkCreateRequestStatWithFirstLatencySample is significantly faster (0.318 ns/op vs. 664.7 ns/op) compared to BenchmarkCreateRequestStatWithLatencies. This is expected because storing a single float value (the FirstLatencySample) is much simpler and less costly than managing a DDSketch for latencies.

Memory Usage:

The BenchmarkCreateRequestStatWithFirstLatencySample uses no additional memory (0 B/op and 0 allocs/op), indicating that it’s very efficient in terms of memory usage.

The BenchmarkCreateRequestStatWithLatencies allocates 784 bytes per operation and performs 8 allocations, indicating higher memory overhead.

Contributor Author

We don't currently have an effective way to estimate how often Kafka connections have a single request per connection, which is relevant for the FirstLatencySample optimization. We'll need to add a metric to track how frequently the FirstLatencySample is used and test this in staging and production. Since we lack conclusive data to support either approach at the moment, I'll proceed with the FirstLatencySample optimization for now and we can revisit it in a future task
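
A hypothetical sketch of such a counter (names are made up for illustration; the real agent would report this through its telemetry package rather than plain atomics):

package kafka

import "sync/atomic"

// Counters for how often a stat stays on the FirstLatencySample fast path
// versus allocating a full DDSketch. Illustrative only.
var (
	firstLatencySampleOnly atomic.Int64
	fullSketch             atomic.Int64
)

// recordLatencyPath is called once per aggregated stat to track which path was used.
func recordLatencyPath(usedFirstSampleOnly bool) {
	if usedFirstSampleOnly {
		firstLatencySampleOnly.Add(1)
		return
	}
	fullSketch.Add(1)
}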

Latencies *ddsketch.DDSketch
// Note: every time we add a latency value to the DDSketch, it's possible for the sketch to discard that value
// (ie if it is outside the range that is tracked by the sketch). For that reason, in order to keep an accurate count
// the number of http transactions processed, we have our own count field (rather than relying on DDSketch.GetCount())
Contributor

kafka

if stats.FirstLatencySample != 0 {
	err := stats.Latencies.AddWithCount(stats.FirstLatencySample, float64(stats.Count))
	if err != nil {
		log.Debugf("could not add kafka request latency to ddsketch: %v", err)
Contributor

Should these errors be propagated to the caller and/or have a higher log level?

Contributor Author

This behaves the same as in HTTP. I’ll open a ticket to address the potential issues related to ignoring errors from these functions


@DanielLavie (Contributor Author)

/merge


dd-devflow bot commented Aug 7, 2024

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 23m.

Use /merge -c to cancel this operation!

@dd-mergequeue dd-mergequeue bot merged commit e0baff3 into main Aug 7, 2024
293 of 294 checks passed
@dd-mergequeue dd-mergequeue bot deleted the daniel.lavie/USMON-830-kafka-fetch-latency branch August 7, 2024 09:30
@github-actions github-actions bot added this to the 7.57.0 milestone Aug 7, 2024