-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Shut down profiler if any components failed #3197
Conversation
rescue Exception => e # rubocop:disable Lint/RescueException | ||
Datadog.logger.warn( | ||
'Profiling::Scheduler thread error. ' \ | ||
"Cause: #{e.class.name} #{e.message} Location: #{Array(e.backtrace).first}" | ||
) | ||
on_failure_proc&.call | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the extra begin, couldn't we have added the rescue
to the existing block? The only functional change I can see would be in the order of the debug vs warn lines but that doesn't seem that important? Also with the added rescue, we can maybe get away with the need for the interrupted local?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right -- cleaned it up + added a spec for it in 844cf50 .
**What does this PR do?** This PR changes the profiler `Collectors::CpuAndWallTimeWorker` and `Scheduler` components to signal the top `Profiler` instance when they failed, so the `Profiler` can stop the other component as well. **Motivation:** While never reported by a customer, it ocurred to me that right now if one of the component fails, the other will keep working in a weird state. Of particular concern, if the scheduler failed, the `StackRecorder` would trigger a memory leak: it would start accumulating samples and never flush them. This PR changes the profiler so that if one component stops with a failure, the other will do so as well. **Additional Notes:** I'm not especially happy with how this change turned out. 1. The error handling in the scheduler somewhat duplicates similar logic in the `Core::Workers::Async` helper, which it uses. I wanted to avoid changing the shared helper just for this use-case, which I managed, but at the cost of a bit of duplication (including a duplicated log message, sigh). 2. I don't like the `on_failure_proc: nil`, but adding the argument would mean changing A LOT of tests, which is doable but has its own ugliness as well. I've considered passing the `on_failure_proc` via the constructor, but it's not entirely straightforward: the `Profiler` instance is created after the two others, so we'd need a way to break the loop. In the end, I decided to move forward with something, as I think it's important to plug this gap. **How to test the change?** This change includes test coverage.
I haven't yet gotten used to updating these as part of my usual workflow :sweatsmile:.
As pointed out during PR review by @AlexJF.
9e6a78a
to
d459fbc
Compare
Update: I had tried to merge master into this branch but that triggered a bunch of gemfile updates that were missing. Instead, I pulled the gemfile updates into a separate PR (#3213) and rebased this branch on top of master once that other PR got merged. |
…er-error-handling Backport #3197 to 1.x-stable: Shut down profiler if any components failed
What does this PR do?
This PR changes the profiler
Collectors::CpuAndWallTimeWorker
andScheduler
components to signal the topProfiler
instance when they failed, so theProfiler
can stop the other component as well.Motivation:
While never reported by a customer, it ocurred to me that right now if one of the component fails, the other will keep working in a weird state.
Of particular concern, if the scheduler failed, the
StackRecorder
would trigger a memory leak: it would start accumulating samples and never flush them.This PR changes the profiler so that if one component stops with a failure, the other will do so as well.
Additional Notes:
I'm not especially happy with how this change turned out.
The error handling in the scheduler somewhat duplicates similar logic in the
Core::Workers::Async
helper, which it uses.I wanted to avoid changing the shared helper just for this use-case, which I managed, but at the cost of a bit of duplication (including a duplicated log message, sigh).
I don't like the
on_failure_proc: nil
, but adding the argument would mean changing A LOT of tests, which is doable but has its own ugliness as well.I've considered passing the
on_failure_proc
via the constructor, but it's not entirely straightforward: theProfiler
instance is created after the two others, so we'd need a way to break the loop.In the end, I decided to move forward with something, as I think it's important to plug this gap.
How to test the change?
This change includes test coverage.
For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance
.