-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
eBPF instrumentation manager #1776
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Graet job!!!
Looks good, added many comments but most are nits and style
func (m *Manager) UpdateConfig(ctx context.Context, workloadKey types.NamespacedName, config *odigosv1.InstrumentationConfig) error { | ||
// send a config update event for the specified workload | ||
select { | ||
case m.configUpdates <- configUpdate{workloadKey: workloadKey, config: config}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can block quite easily. should we give some capacity to the channel so we can handle bursts without blocking the caller to UpdateConfig
?
configUpdates: make(chan configUpdate)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should probably make this channel buffered, not sure about what the size should be?
@@ -53,5 +55,12 @@ func (i *InstrumentationConfigReconciler) Reconcile(ctx context.Context, req ctr | |||
} | |||
} | |||
|
|||
if i.OnUpdate != nil { | |||
err = i.OnUpdate(ctx, types.NamespacedName{Namespace: req.Namespace, Name: workloadName}, instrumentationConfig) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thoughts - maybe instead of passing a function here and potentially blocking this part for some time, WDYT about passing the channel directly and having this code pushing the event to the channel.
This make it clear that an async operation is about to take place, and let the caller handle cancelation and errors.
For example: the error returned from OnUpdate
is only the error for sending to the channel, but future readers might expect the update result itself to be returned here which can lead to errors.
Having it as a channel, communicates how it is used, what to expect, and leave more flexibility to the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I agree, the only disadvantage to this is how we handle closing the channel.
The idiomatic way is for the sender to close the channel, but the reconciler itself is not aware if the manager is stopped or not.
odiglet/pkg/ebpf/manager.go
Outdated
return containerName, ok | ||
} | ||
|
||
func (m *Manager) languageSdk(pod *corev1.Pod, containerName string) (common.ProgrammingLanguage, common.OtelSdk, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion for alternative name
func (m *Manager) languageSdk(pod *corev1.Pod, containerName string) (common.ProgrammingLanguage, common.OtelSdk, error) { | |
func (m *Manager) getContainerDeviceInfo(pod *corev1.Pod, containerName string) (common.ProgrammingLanguage, common.OtelSdk, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to k8sutils
, and renamed LanguageSdkFromPodContainer
} | ||
|
||
// Fetch initial config based on the InstrumentationConfig CR | ||
sdkConfig := m.instrumentationSDKConfig(ctx, workload, lang, types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sdkConfig
can be nil here, we should check for nil everytime we use it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used for the initial config. I added a check in the go.go
file to check for that, and use the default config if nil is provided.
odiglet/pkg/ebpf/manager.go
Outdated
case <-runLoopCtx.Done(): | ||
m.Logger.Info("stopping Odiglet instrumentation manager") | ||
for pid, details := range m.DetailsByPid { | ||
err := details.Inst.Close(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: nil check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #1776 (comment)
m.stop = stop | ||
|
||
// main event loop for handling instrumentations | ||
for { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how or if we should handle it, but I think this design has a noisy neighbor issue:
Imaging a Pod with the following bash script:
while true; do
echo "test"
done
This will trigger tons of ProcEvents and will cause starvation for all other pods / configuration changes.
What do you think? cc @blumamir
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aka Odiglet DDos
odiglet/pkg/ebpf/manager.go
Outdated
|
||
m.Logger.Info("cleaning instrumentation resources", "pid", pid) | ||
|
||
err := details.Inst.Close(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same nil check, maybe we should move the Close
function to the details instance instead of closing the individual fields?
This will probably never be nil but still afraid of panics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created the startTrackInstrumentation
which is the only place where instrumentationDetails
is created, since this is an internal logic to this package, and not dependant on user input I think if the inst
is nill - it is a result of some logical error on our side which should never happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
||
// active instrumentations by workload, and aggregated by pid | ||
// this map is not concurrent safe, so it should be accessed only from the main event loop | ||
detailsByWorkload map[types.NamespacedName]map[int]*instrumentationDetails |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add workload kind and language here and simplify the usage
This PR is a follow-up to #1645.
Adding the new
Manager
which will eventually replace theDirector
.The new design has the following key features and improvements:
note: the
runtime-detector
is configured to filter process events and will only pass events according to its configuration.Manager
does not have locks and uses an internal event loop.InstrumentationConfig
reconciler (same as before) - those updates will be handled in the event loop.Factory
interface is refactored and aSettings
option can be expanded in the future to add more initial configuration options.Instrumentation
interface is introduced and will replaceOtelEbpfSdk
.go.opentelemetry.io/auto
tov0.18.0-alpha
.This change will currently only apply for OSS Go instrumentation.