
Shut down profiler if any components failed #3197

Merged: 5 commits from ivoanjo/improve-profiler-error-handling into master on Oct 24, 2023

Conversation

@ivoanjo (Member) commented Oct 9, 2023

What does this PR do?

This PR changes the profiler Collectors::CpuAndWallTimeWorker and Scheduler components to signal the top Profiler instance when they fail, so the Profiler can stop the other component as well.

Motivation:

While never reported by a customer, it occurred to me that right now, if one of the components fails, the other will keep working in a weird state.

Of particular concern, if the scheduler failed, the StackRecorder would trigger a memory leak: it would start accumulating samples and never flush them.

This PR changes the profiler so that if one component stops with a failure, the other will do so as well.
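
To make the shutdown flow concrete, here is a minimal sketch of the idea using hypothetical ToyComponent/ToyProfiler stand-ins (not the actual dd-trace-rb classes): each background component invokes an on_failure_proc from its rescue handler, and the owning profiler uses that callback to stop both components.

```ruby
# Minimal sketch with hypothetical stand-in classes (ToyComponent, ToyProfiler);
# the real dd-trace-rb components and Profiler are considerably more involved.
class ToyComponent
  attr_accessor :on_failure_proc

  def initialize(name)
    @name = name
    @stop_requested = false
  end

  def start
    @thread = Thread.new do
      begin
        sleep(0.1) until @stop_requested # placeholder for the real work loop
      rescue Exception => e # rubocop:disable Lint/RescueException
        warn "#{@name} thread error: #{e.class} #{e.message}"
        on_failure_proc&.call # tell the owner this component has died
        raise
      end
    end
  end

  def stop
    @stop_requested = true # the work loop exits on its own once it sees this
  end
end

class ToyProfiler
  def initialize(worker, scheduler)
    @worker = worker
    @scheduler = scheduler
  end

  def start
    # Whichever component fails first triggers a shutdown of the other, so a
    # lone survivor can't keep accumulating data that will never be flushed.
    [@worker, @scheduler].each { |component| component.on_failure_proc = proc { stop } }
    @worker.start
    @scheduler.start
  end

  def stop
    @worker.stop
    @scheduler.stop
  end
end

profiler = ToyProfiler.new(ToyComponent.new('worker'), ToyComponent.new('scheduler'))
profiler.start
```

In the PR itself the components are Collectors::CpuAndWallTimeWorker and Scheduler, and the callback they invoke signals the top-level Profiler, which then stops the other component.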

Additional Notes:

I'm not especially happy with how this change turned out.

  1. The error handling in the scheduler somewhat duplicates similar logic in the Core::Workers::Async helper, which it uses.

    I wanted to avoid changing the shared helper just for this use-case, which I managed, but at the cost of a bit of duplication (including a duplicated log message, sigh).

  2. I don't like the on_failure_proc: nil, but adding the argument would mean changing A LOT of tests, which is doable but has its own ugliness as well.

    I've considered passing the on_failure_proc via the constructor, but it's not entirely straightforward: the Profiler instance is created after the two others, so we'd need a way to break the loop (see the sketch after these notes).

In the end, I decided to move forward with something, as I think it's important to plug this gap.
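
To make note 2 above concrete, here is a hedged sketch of the trade-off (hypothetical names and signatures, not the PR's actual API): a nil-defaulted keyword argument keeps existing callers untouched, while constructor injection would need a way around the fact that the Profiler is built after its components.

```ruby
# Hedged sketch of the trade-off in note 2; the class name and signature are
# hypothetical, not the PR's actual API.
class SketchScheduler
  # A keyword argument with a nil default keeps every existing caller (and
  # test) working unchanged, at the cost of the slightly awkward `nil`.
  def start(on_failure_proc: nil)
    @on_failure_proc = on_failure_proc
    # ... spawn the background thread; it calls @on_failure_proc on error ...
  end
end

# Constructor injection isn't straightforward because the Profiler is created
# after the worker and scheduler, so it can't be handed to their constructors;
# supplying the callback only when the components are started breaks that loop.
scheduler = SketchScheduler.new
scheduler.start(on_failure_proc: proc { warn 'the profiler would shut everything down here' })
```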

How to test the change?

This change includes test coverage.

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

@ivoanjo ivoanjo requested review from a team as code owners October 9, 2023 17:42
@github-actions github-actions bot added the profiling Involves Datadog profiling label Oct 9, 2023
Comment on lines +66 to +68
rescue Exception => e # rubocop:disable Lint/RescueException
  Datadog.logger.warn(
    'Profiling::Scheduler thread error. ' \
    "Cause: #{e.class.name} #{e.message} Location: #{Array(e.backtrace).first}"
  )
  on_failure_proc&.call
  raise
@AlexJF (Contributor) commented Oct 10, 2023:
Why the extra begin? Couldn't we have added the rescue to the existing block? The only functional change I can see would be in the order of the debug vs. warn lines, but that doesn't seem that important. Also, with the added rescue, can we maybe do away with the need for the interrupted local?

@ivoanjo (Member, Author) replied:
You're right -- cleaned it up + added a spec for it in 844cf50.
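
For readers following along: the suggestion amounts to hanging the rescue off the existing method body rather than nesting an extra begin block. A hedged sketch of what that shape looks like (run_worker_loop and perform_work are illustrative names, not the actual Scheduler internals):

```ruby
# Illustrative only: `run_worker_loop` and `perform_work` are hypothetical
# stand-ins for the Scheduler method being discussed in this thread.
def run_worker_loop
  perform_work
  # (any "finished normally" debug logging would go here)
rescue Exception => e # rubocop:disable Lint/RescueException
  # Same handling as the diff above, but attached to the existing block, so no
  # separate `interrupted` local is needed to tell success and failure apart.
  Datadog.logger.warn(
    'Profiling::Scheduler thread error. ' \
    "Cause: #{e.class.name} #{e.message} Location: #{Array(e.backtrace).first}"
  )
  on_failure_proc&.call
  raise
end
```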

@ivoanjo ivoanjo force-pushed the ivoanjo/improve-profiler-error-handling branch from 9e6a78a to d459fbc on October 24, 2023 13:13
@ivoanjo (Member, Author) commented Oct 24, 2023

Update: I had tried to merge master into this branch but that triggered a bunch of gemfile updates that were missing.

Instead, I pulled the gemfile updates into a separate PR (#3213) and rebased this branch on top of master once that other PR got merged.

@ivoanjo ivoanjo merged commit bd2a376 into master Oct 24, 2023
177 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/improve-profiler-error-handling branch October 24, 2023 16:42
@marcotc marcotc added this to the 1.16.0 milestone Oct 25, 2023
ivoanjo added a commit that referenced this pull request on Oct 27, 2023:

Backport #3197 to 1.x-stable: Shut down profiler if any components failed