Refactor Go-DI template generation to use location expressions #31424
base: main
Conversation
Regression Detector
Regression Detector Results
Metrics dashboard
Baseline: e49efd5
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_logs | % cpu utilization | +2.56 | [-0.43, +5.55] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +1.18 | [+1.05, +1.31] | 1 | Logs, bounds checks dashboard |
| ➖ | otel_to_otel_logs | ingress throughput | +1.01 | [+0.33, +1.69] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | +0.43 | [-0.29, +1.16] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.10 | [-0.66, +0.87] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | +0.09 | [-0.71, +0.89] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.05 | [-0.60, +0.70] | 1 | Logs |
| ➖ | file_to_blackhole_300ms_latency | egress throughput | +0.05 | [-0.58, +0.68] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.04 | [-0.80, +0.87] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.01, +0.02] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.11, +0.11] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | -0.05 | [-0.96, +0.86] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.16 | [-0.94, +0.62] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | -0.23 | [-0.69, +0.22] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.24 | [-0.28, -0.19] | 1 | Logs, bounds checks dashboard |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.53 | [-0.60, -0.46] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.55 | [-0.67, -0.43] | 1 | Logs |
Bounds Checks: ❌ Failed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ❌ | file_to_blackhole_100ms_latency | lost_bytes | 5/10 | |
| ❌ | file_to_blackhole_0ms_latency | lost_bytes | 9/10 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| ✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | lost_bytes | 10/10 | |
| ✅ | quality_gate_logs | memory_usage | 10/10 | |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
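The three criteria above can be expressed as a small predicate. This is a sketch of the decision rule as described, not the detector's actual implementation:

```go
package main

import "fmt"

// isRegression flags a change for investigation only when all three
// criteria hold: the effect size is at least 5%, the 90% confidence
// interval excludes zero, and the experiment is not marked erratic.
func isRegression(deltaMeanPct, ciLow, ciHigh float64, erratic bool) bool {
	abs := deltaMeanPct
	if abs < 0 {
		abs = -abs
	}
	bigEnough := abs >= 5.0
	ciExcludesZero := ciLow > 0 || ciHigh < 0
	return bigEnough && ciExcludesZero && !erratic
}

func main() {
	// quality_gate_logs above: +2.56 with CI [-0.43, +5.55] is not flagged,
	// because the effect is small and the interval contains zero.
	fmt.Println(isRegression(2.56, -0.43, 5.55, false))
}
```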
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
Force-pushed from b3edbd7 to 924fb7f
pkg/dynamicinstrumentation/testutil/sample/kafka_go_forwarder/kafka_go_forwarder (outdated; resolved)
Force-pushed from e1ffc3a to 3869f77
Package size comparison: comparison with ancestor, diff per package
Test changes on VM
Use this command from test-infra-definitions to manually test this PR's changes on a VM: inv aws.create-vm --pipeline-id=50960741 --os-family=ubuntu
Note: This applies to commit 26663d4
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
This is a major refactor of the way Go-DI generates bpf programs. Instead of generating bpf programs via templates for specific types, it breaks down captures into basic building blocks of operations such as reading from registers/stack, dereferencing, adding offsets, writing to output, etc... Templates are now these basic operations and as a result we can express much more complex captures, such as pointers to pointers, slices of strings or any combination of types that were previously supported. Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
…instead of right before the values. Still need to update event parsing code Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
This changes the logic for collection limit labels to create a new label before every slice element and jump to it accordingly. This is to avoid collisions such as in the case of embedded slices. Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
Sending a first batch of comments, I still have a few files I didn't get to yet
I think this is an improvement over the previous code, it's more structured and I like the idea of the location opcodes driving code generation - however I find it hard to follow the logic here in these changes.
I'm treating the end to end tests as a confirmation that this works.
"github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/ditypes"
"github.com/DataDog/datadog-agent/pkg/util/log"
"golang.org/x/exp/rand"
I think you want math/rand or math/rand/v2 here.
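A minimal sketch of what that swap could look like, using the standard library's math/rand in place of golang.org/x/exp/rand (helper name hypothetical):

```go
package main

import (
	"fmt"
	"math/rand" // standard library, as suggested, instead of golang.org/x/exp/rand
)

// randomLabel returns a short random uppercase label. Using 'A' + Intn(26)
// covers the full alphabet without magic ASCII numbers.
func randomLabel(length int) string {
	b := make([]byte, length)
	for i := range b {
		b[i] = byte('A' + rand.Intn(26)) // 'A'..'Z'
	}
	return string(b)
}

func main() {
	fmt.Println(len(randomLabel(6)))
}
```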
length := 6
randomString := make([]byte, length)
for i := 0; i < length; i++ {
    randomString[i] = byte(65 + rand.Intn(25))
What is 65 here? If you just need a random alphanumeric label, then this code works well:
func randomLabel() string {
length := 6
buf := make([]byte, length/2)
_, _ = rand.Read(buf)
return hex.EncodeToString(buf)
}
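A self-contained, runnable version of that suggestion, assuming crypto/rand for the random bytes since the snippet doesn't show which rand package is meant:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomLabel hex-encodes random bytes: 3 bytes yield a 6-character
// lowercase hex label, with no magic ASCII offsets involved.
func randomLabel() string {
	length := 6
	buf := make([]byte, length/2)
	_, _ = rand.Read(buf)
	return hex.EncodeToString(buf)
}

func main() {
	fmt.Println(len(randomLabel()))
}
```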
@@ -127,7 +127,7 @@ func (cm *RCConfigManager) installConfigProbe(procInfo *ditypes.ProcessInfo) err
	svcConfigProbe := *configProbe
	svcConfigProbe.ServiceName = procInfo.ServiceName
	procInfo.ProbesByID[configProbe.ID] = &svcConfigProbe

	log.Infof("Installing config probe for service: %s.", svcConfigProbe.ServiceName)
nit: logs generally don't have punctuation at the end
"github.com/Shopify/sarama"
"github.com/confluentinc/confluent-kafka-go/v2/kafka"
These are net-new imports for the agent. Is there a way to write this test program using github.com/twmb/franz-go?
Signed-off-by: grantseltzer <[email protected]>
…slice length limit Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
err = bpf_probe_read(&context.event->output[*(context.output_offset)+i], element_size, &valueHolder);
if (err != 0) {
    log_debug("error when reading data while popping from bpf stack: %ld", err);
    return_err = err;
Any reason you don't return early here?
if (length > limit) {
    length = limit;
}
duplicate of lines 294-296?
__uint(key_size, sizeof(__u32));
__uint(value_size, sizeof(char[PARAM_BUFFER_SIZE]));
__uint(max_entries, 1);
} zeroval SEC(".maps");
FYI we have helper macros for defining maps in map-defs.h, so you can do things like BPF_ARRAY_MAP(zeroval, char[PARAM_BUFFER_SIZE], 1)
Signed-off-by: grantseltzer <[email protected]>
Force-pushed from 63034f8 to bc87164
First (partial) iteration.
Covered mostly eBPF code and some small changes in Go.
Also, as discussed in private, a few general comments on the bpf code:
- replace bpf_probe_read with bpf_probe_read_user/bpf_probe_read_kernel
- replace bpf_printk with log_debug
- error flow: in expressions.h, if you encounter an error, do you want to continue the execution of the function? Why not bail out immediately?
#include "ktypes.h"

// NOTE: Be careful when adding fields, alignment should always be to 8 bytes
struct base_event {
Consider adding a comment to explain the "magic numbers" for this struct's fields of array type (304 and 10); even better, replace them with #define.
struct event {
    struct base_event base;
    char output[PARAM_BUFFER_SIZE];
};

struct expression_context {
    __u64 *output_offset;
    struct pt_regs *ctx;
    struct event *event;
    __u64 *temp_storage;
    char *zero_string;
};
I assume those are the very core structs in your module; consider adding a short paragraph in code to explain their purpose (since the names of the structs are pretty generic).
struct event {
    struct base_event base;
    char output[PARAM_BUFFER_SIZE];
I think you need to include macros.h for PARAM_BUFFER_SIZE.
{
    long err;
    __u64 valueHolder = 0;
    err = bpf_probe_read(&valueHolder, element_size, &context.ctx->DWARF_REGISTER(reg));
- It is better to use bpf_probe_read_kernel/bpf_probe_read_user instead of this helper.
- Consider using the wrappers (bpf_probe_read_kernel_with_telemetry/bpf_probe_read_user_with_telemetry) if you wish to get bpf errors telemetry (more details here).
This comment is relevant to all bpf code of DI.
__u64 valueHolder = 0;
err = bpf_probe_read(&valueHolder, element_size, &context.ctx->DWARF_REGISTER(reg));
if (err != 0) {
    bpf_printk("error when reading data from register: %d", err);
replace with log_debug as @gjulianm suggested
Also, do you always try to push the parameter? If bpf_probe_read failed, don't you want to bail out?
Those two comments are relevant for all functions defined in this file.
  MaxReferenceDepth uint8 = 4 // MaxReferenceDepth is the default depth that DI will traverse datatypes for capturing values
  MaxFieldCount = 20 // MaxFieldCount is the default limit for how many fields DI will capture in a single data type
  SliceMaxSize = 1800 // SliceMaxSize is the default limit in bytes of a slice
- SliceMaxLength = 100 // SliceMaxLength is the default limit in number of elements of a slice
+ SliceMaxLength = 3 // SliceMaxLength is the default limit in number of elements of a slice
was that change intentional? (seems like a large decrease)
package ditypes

const StackRegister = 29
Add a comment to explain this const. Btw, won't you have a name conflict with the same const in the amd64 file? (The build tags for those files are not mutually exclusive, iiuc.)
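One way to address both points is a build-constrained per-arch file whose constant carries a doc comment. A hypothetical single-file sketch (the value 7 is DWARF register 7, rsp, on amd64; this is not the PR's actual file layout):

```go
package main

import "fmt"

// StackRegister is the DWARF register number of the stack pointer for
// this architecture. In the real tree this would live in an arch-specific
// file whose first line is a build constraint such as
//
//	//go:build amd64
//
// so the amd64 and arm64 definitions can never be compiled together.
const StackRegister = 7

func main() {
	fmt.Println(StackRegister)
}
```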
package ditypes

const StackRegister = 7
Add a comment to explain this const.
Location *Location
LocationExpressions []LocationExpression
FieldOffset uint64
NotCaptureReason NotCaptureReason
I would just call it Reason
@@ -34,14 +34,16 @@ type TypeMap struct {

// Parameter represents a function parameter as read from DWARF info
type Parameter struct {
document this struct's fields (inline one liners would be sufficient)
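A hypothetical sketch of the requested inline one-liners; field names are taken from the diff above, but the types are simplified here to keep the example self-contained:

```go
package main

import "fmt"

// Parameter represents a function parameter as read from DWARF info.
type Parameter struct {
	Name                string   // parameter name as it appears in DWARF
	FieldOffset         uint64   // byte offset of the field within its parent type
	LocationExpressions []string // ordered location-expression ops used to capture the value
	NotCaptureReason    string   // why the value was not captured, if applicable
}

func main() {
	p := Parameter{Name: "ctx", FieldOffset: 8}
	fmt.Println(p.Name, p.FieldOffset)
}
```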
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
Signed-off-by: grantseltzer <[email protected]>
What does this PR do?
This is a major refactor of the way Go-DI generates bpf programs. Instead
of generating bpf programs via templates for specific types, it breaks
down captures into basic building blocks of operations such as reading
from registers/stack, dereferencing, adding offsets, writing to output,
etc...
Templates are now these basic operations and as a result we can express
much more complex captures, such as pointers to pointers, slices of strings
or any combination of types that were previously supported.
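The building-block idea described above can be sketched as a small opcode list; all names here are hypothetical, not the PR's actual types:

```go
package main

import "fmt"

// opKind enumerates the basic capture operations the description lists:
// reading from a register or the stack, dereferencing, applying an
// offset, and writing to the output buffer.
type opKind int

const (
	opReadRegister opKind = iota
	opDereference
	opApplyOffset
	opWriteOutput
)

// locationExpression is one step of a capture; complex captures compose
// by chaining steps.
type locationExpression struct {
	Kind opKind
	Arg  uint64 // register number, byte offset, or output size, depending on Kind
}

func main() {
	// Capturing a field behind a pointer-to-pointer: read the register,
	// dereference, apply the field offset, dereference again, write out.
	expr := []locationExpression{
		{Kind: opReadRegister, Arg: 0},
		{Kind: opDereference},
		{Kind: opApplyOffset, Arg: 16},
		{Kind: opDereference},
		{Kind: opWriteOutput, Arg: 8},
	}
	fmt.Println(len(expr))
}
```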
Motivation
The previous iteration of Go-DI could not express complex types such as pointers to pointers; we had no way of expressing an arbitrary number of dereferences. This change greatly increases the set of features and types our customers can instrument.
Describe how to test/QA your changes
Run e2e tests
Possible Drawbacks / Trade-offs
Additional Notes