
Shut down profiler if any components failed #3197

Merged: 5 commits from ivoanjo/improve-profiler-error-handling into master on Oct 24, 2023

Conversation

@ivoanjo (Member) commented Oct 9, 2023

What does this PR do?

This PR changes the profiler Collectors::CpuAndWallTimeWorker and Scheduler components to signal the top Profiler instance when they fail, so the Profiler can stop the other component as well.

Motivation:

While never reported by a customer, it occurred to me that right now, if one of the components fails, the other will keep working in a weird state.

Of particular concern, if the scheduler failed, the StackRecorder would trigger a memory leak: it would start accumulating samples and never flush them.

This PR changes the profiler so that if one component stops with a failure, the other will do so as well.
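
To make the shutdown flow concrete, here is a minimal sketch of the idea using hypothetical ToyComponent/ToyProfiler stand-ins (not the actual dd-trace-rb classes): each background component invokes an on_failure_proc from its rescue handler, and the owning profiler uses that callback to stop both components.

```ruby
# Minimal sketch with hypothetical stand-in classes (ToyComponent, ToyProfiler);
# the real dd-trace-rb components and Profiler are considerably more involved.
class ToyComponent
  attr_accessor :on_failure_proc

  def initialize(name)
    @name = name
    @stop_requested = false
  end

  def start
    @thread = Thread.new do
      begin
        sleep(0.1) until @stop_requested # placeholder for the real work loop
      rescue Exception => e # rubocop:disable Lint/RescueException
        warn "#{@name} thread error: #{e.class} #{e.message}"
        on_failure_proc&.call # tell the owner this component has died
        raise
      end
    end
  end

  def stop
    @stop_requested = true # the work loop exits on its own once it sees this
  end
end

class ToyProfiler
  def initialize(worker, scheduler)
    @worker = worker
    @scheduler = scheduler
  end

  def start
    # Whichever component fails first triggers a shutdown of the other, so a
    # lone survivor can't keep accumulating data that will never be flushed.
    [@worker, @scheduler].each { |component| component.on_failure_proc = proc { stop } }
    @worker.start
    @scheduler.start
  end

  def stop
    @worker.stop
    @scheduler.stop
  end
end

profiler = ToyProfiler.new(ToyComponent.new('worker'), ToyComponent.new('scheduler'))
profiler.start
```

In the PR itself the components are Collectors::CpuAndWallTimeWorker and Scheduler, and the callback they invoke signals the top-level Profiler, which then stops the other component.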

Additional Notes:

I'm not especially happy with how this change turned out.

  1. The error handling in the scheduler somewhat duplicates similar logic in the Core::Workers::Async helper, which it uses.

    I wanted to avoid changing the shared helper just for this use-case, which I managed, but at the cost of a bit of duplication (including a duplicated log message, sigh).

  2. I don't like the on_failure_proc: nil, but adding the argument would mean changing A LOT of tests, which is doable but has its own ugliness as well.

    I've considered passing the on_failure_proc via the constructor, but it's not entirely straightforward: the Profiler instance is created after the two others, so we'd need a way to break the loop (see the sketch after these notes).

In the end, I decided to move forward with something, as I think it's important to plug this gap.
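
To make note 2 above concrete, here is a hedged sketch of the trade-off (hypothetical names and signatures, not the PR's actual API): a nil-defaulted keyword argument keeps existing callers untouched, while constructor injection would need a way around the fact that the Profiler is built after its components.

```ruby
# Hedged sketch of the trade-off in note 2; the class name and signature are
# hypothetical, not the PR's actual API.
class SketchScheduler
  # A keyword argument with a nil default keeps every existing caller (and
  # test) working unchanged, at the cost of the slightly awkward `nil`.
  def start(on_failure_proc: nil)
    @on_failure_proc = on_failure_proc
    # ... spawn the background thread; it calls @on_failure_proc on error ...
  end
end

# Constructor injection isn't straightforward because the Profiler is created
# after the worker and scheduler, so it can't be handed to their constructors;
# supplying the callback only when the components are started breaks that loop.
scheduler = SketchScheduler.new
scheduler.start(on_failure_proc: proc { warn 'the profiler would shut everything down here' })
```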

How to test the change?

This change includes test coverage.

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

@ivoanjo ivoanjo requested review from a team as code owners October 9, 2023 17:42
@github-actions github-actions bot added the profiling Involves Datadog profiling label Oct 9, 2023
Comment on lines +66 to +68
rescue Exception => e # rubocop:disable Lint/RescueException
  Datadog.logger.warn(
    'Profiling::Scheduler thread error. ' \
    "Cause: #{e.class.name} #{e.message} Location: #{Array(e.backtrace).first}"
  )
  on_failure_proc&.call
  raise
@AlexJF (Contributor) commented Oct 10, 2023:
Why the extra begin? Couldn't we have added the rescue to the existing block? The only functional change I can see would be in the order of the debug vs. warn lines, but that doesn't seem that important. Also, with the added rescue, can we maybe do away with the need for the interrupted local?

@ivoanjo (Member, Author) replied:
You're right -- cleaned it up + added a spec for it in 844cf50.
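
For readers following along: the suggestion amounts to hanging the rescue off the existing method body rather than nesting an extra begin block. A hedged sketch of what that shape looks like (run_worker_loop and perform_work are illustrative names, not the actual Scheduler internals):

```ruby
# Illustrative only: `run_worker_loop` and `perform_work` are hypothetical
# stand-ins for the Scheduler method being discussed in this thread.
def run_worker_loop
  perform_work
  # (any "finished normally" debug logging would go here)
rescue Exception => e # rubocop:disable Lint/RescueException
  # Same handling as the diff above, but attached to the existing block, so no
  # separate `interrupted` local is needed to tell success and failure apart.
  Datadog.logger.warn(
    'Profiling::Scheduler thread error. ' \
    "Cause: #{e.class.name} #{e.message} Location: #{Array(e.backtrace).first}"
  )
  on_failure_proc&.call
  raise
end
```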

@ivoanjo ivoanjo force-pushed the ivoanjo/improve-profiler-error-handling branch from 9e6a78a to d459fbc on October 24, 2023 13:13
@ivoanjo (Member, Author) commented Oct 24, 2023

Update: I had tried to merge master into this branch but that triggered a bunch of gemfile updates that were missing.

Instead, I pulled the gemfile updates into a separate PR (#3213) and rebased this branch on top of master once that other PR got merged.

@ivoanjo ivoanjo merged commit bd2a376 into master Oct 24, 2023
177 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/improve-profiler-error-handling branch October 24, 2023 16:42
@marcotc marcotc added this to the 1.16.0 milestone Oct 25, 2023
ivoanjo added a commit that referenced this pull request on Oct 27, 2023:

Backport #3197 to 1.x-stable: Shut down profiler if any components failed