fix(instrumentor): reconcile device on restarts #1710
…d at instrumentor restarts
🚀
// based on the current state of the cluster.
// if the instrumented application is deleted but the device is not cleaned,
// the instrumented application controller will not be invoked after restart, which is why we need to handle this case here.
return true
Maybe we should add a check here that the odigos label is present?
After this change, when the instrumentor starts it will receive an event for each workload in the cluster, even those which odigos does not care about.
For those, the new check added in this PR below, `if apierrors.IsNotFound(err)`, will be true.
That means that even for workloads which were not instrumented at all, we will call `removeInstrumentationDeviceFromWorkload`.
We need to handle workloads which have no label.
Consider the following sequence:
- label is removed from a workload which has the device
- instrumented application deleted
- instrumentor went down
When the instrumentor starts, we need to remove the device from this workload. If we add the filter you suggested, nothing will remove the device from this workload.
How about checking if there is a device present in the pod spec? Since we have the workload object here I think we can check that.
I assume it will be fine with the current approach as well, but since this has the potential of bringing a huge batch of events, we should try to handle them as fast as possible by filtering here.
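A rough sketch of what such a filter could look like. This is only an illustration, not the code in this PR: the `container` and `podTemplate` types are hypothetical stand-ins for the Kubernetes pod template types, and the `devicePrefix` value is an assumption about how the device resource is named. The real check would inspect the resource limits of each container in the workload's pod template.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical stand-ins for the Kubernetes pod template types. Real code
// would read the container resource limits from the workload object.
type container struct {
	Name   string
	Limits map[string]string // resource name -> quantity
}

type podTemplate struct {
	Containers []container
}

// devicePrefix is an ASSUMED resource-name prefix for the instrumentation device.
const devicePrefix = "instrumentation.odigos.io/"

// hasInstrumentationDevice reports whether any container in the template
// requests a resource whose name starts with devicePrefix.
func hasInstrumentationDevice(tpl podTemplate) bool {
	for _, c := range tpl.Containers {
		for name := range c.Limits {
			if strings.HasPrefix(name, devicePrefix) {
				return true
			}
		}
	}
	return false
}

func main() {
	// A workload odigos never touched: the Create event can be dropped.
	plain := podTemplate{Containers: []container{
		{Name: "app", Limits: map[string]string{"cpu": "500m"}},
	}}
	// A workload with a stale device: the Create event must be reconciled.
	stale := podTemplate{Containers: []container{
		{Name: "app", Limits: map[string]string{devicePrefix + "go": "1"}},
	}}
	fmt.Println(hasInstrumentationDevice(plain), hasInstrumentationDevice(stale)) // prints "false true"
}
```

Filtering on the device rather than on the label keeps the label-removed case from the sequence above covered, while still dropping events for workloads that were never instrumented.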
I think we should refine the Create condition.
}

result, err := controllerutil.CreateOrPatch(ctx, kubeClient, workloadObj, func() error {
err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
I think we should verify that this behaves as we expect by doing a large batch of un-instrumentation and adding some debug logs here.
// if we didn't change anything, we don't need to update the object
// skip the api-server call, return no-op and skip the log message
if !webhookLabelRemoved && !deviceRemoved && !envChanged {
	return nil
Maybe we should return a custom error here indicating that nothing has been done?
This is for the logging below, to avoid logging for irrelevant workloads.
Another option is to set a bool `updatedWorkload` in the `removeInstrumentationDeviceFromWorkload` scope: initialize it to true, and set it to false here. This will allow us to know whether nil is returned because of a successful update or because of this case.
if apierrors.IsNotFound(err) {
	// if there is no instrumented application, make sure the device is removed from the workload pod template manifest
	workloadName, workloadKind, err := workload.ExtractWorkloadInfoFromRuntimeObjectName(instrumentedAppName)
	if err != nil {
		return err
	}
	err = removeInstrumentationDeviceFromWorkload(ctx, k8sClient, namespace, workloadKind, workloadName, ApplyInstrumentationDeviceReasonNoRuntimeDetails)
	return err
} else {
	return err
}
nit:
if apierrors.IsNotFound(err) {
	// if there is no instrumented application, make sure the device is removed from the workload pod template manifest
	workloadName, workloadKind, err := workload.ExtractWorkloadInfoFromRuntimeObjectName(instrumentedAppName)
	if err != nil {
		return err
	}
	err = removeInstrumentationDeviceFromWorkload(ctx, k8sClient, namespace, workloadKind, workloadName, ApplyInstrumentationDeviceReasonNoRuntimeDetails)
	return err
} else {
	return err
}

if apierrors.IsNotFound(err) {
	// if there is no instrumented application, make sure the device is removed from the workload pod template manifest
	workloadName, workloadKind, err := workload.ExtractWorkloadInfoFromRuntimeObjectName(instrumentedAppName)
	if err != nil {
		return err
	}
	err = removeInstrumentationDeviceFromWorkload(ctx, k8sClient, namespace, workloadKind, workloadName, ApplyInstrumentationDeviceReasonNoRuntimeDetails)
}
return err
	return err
}
err = removeInstrumentationDeviceFromWorkload(ctx, k8sClient, namespace, workloadKind, workloadName, ApplyInstrumentationDeviceReasonNoRuntimeDetails)
return err
This means that a conflict error will be printed by the controller runtime, right?
A typical error we see when trying to update the instrumentation config CR is:

> 2024-11-12T08:43:28Z ERROR error updating instrumentation config {"controller": "instrumentor-instrumentationconfig-instrumentedapplication", "controllerGroup": "odigos.io", "controllerKind": "InstrumentedApplication", "InstrumentedApplication": {"name":"deployment-pricing","namespace":"simple-demo19"}, "namespace": "simple-demo19", "name": "deployment-pricing", "reconcileID": "8c980fc2-ead9-47c7-b2ac-69be6e880002", "workload": "deployment-pricing", "error": "Operation cannot be fulfilled on instrumentationconfigs.odigos.io \"deployment-pricing\": the object has been modified; please apply your changes to the latest version and try again"}

This is due to the Get and Update pattern we use.
Following #1710, use the update error util for this case as well.
Adding to the error handling util - a check for `IsNotFound` error and ignoring it.

---------

Co-authored-by: Amir Blum <[email protected]>
At the moment, the sequence of events is:
Device should be removed if the instrumented application is deleted, and specifically in these cases:
I reproduced these cases locally with the old version, observed that the device is not cleaned up properly in these edge cases, and then ran the changes in this PR to make sure they are fixed.