Better callback registration/deregistration in host provider's lifecycle #2485

Merged: 12 commits, Apr 11, 2023

@ycombinator ycombinator commented Apr 11, 2023

What does this PR do?

This PR makes the following changes to the host provider:

  • It moves the registration of the FQDN feature flag change callback function to the host provider's constructor.
  • It introduces a new CloseableProvider interface and updates the host provider to use this new interface. Specifically, the Close() method on the host provider is used to deregister the FQDN feature flag change callback.

It also changes when feature flags from a standalone configuration are applied: they are now applied at the very end of the Agent's construction process (application.New()).

Why is it important?

Prior to this PR, callback registration (and deregistration) were happening inside the host provider's Run() method. This introduced a race condition between when the FQDN feature flag change callback registration happened and when the same callback might be called due to a change in the FQDN feature flag:

  • When the timing is just right, the host provider would've registered the callback prior to a change in the FQDN feature flag, and thus the host provider would get notified of the change correctly.
  • When the timing is just wrong, a change in the FQDN feature flag would happen prior to the host provider registering the callback, and thus the host provider would miss the notification of the change.

By moving the callback registration to the host provider's constructor (and, less relevantly but symmetrically, the callback deregistration to the host provider's new Close() method), we now ensure that the callback will be registered when the provider is constructed. Providers are constructed in the controller's constructor, which is now called before feature flags from standalone configuration or fleet-managed configuration are applied and any callbacks are called.
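The lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the Agent's actual code: `flagRegistry`, `newHostProvider`, the `changes` counter, and the `"host_provider"` callback ID are hypothetical names invented for the example; only the idea of registering in the constructor and deregistering in `Close()` comes from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// flagRegistry is a hypothetical stand-in for the feature-flag package.
// It only models callback registration and change notification.
type flagRegistry struct {
	mu        sync.Mutex
	callbacks map[string]func(newVal, oldVal bool)
	fqdn      bool
}

func newFlagRegistry() *flagRegistry {
	return &flagRegistry{callbacks: make(map[string]func(newVal, oldVal bool))}
}

func (r *flagRegistry) register(id string, cb func(newVal, oldVal bool)) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.callbacks[id] = cb
}

func (r *flagRegistry) deregister(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.callbacks, id)
}

// setFQDN updates the flag and, if the value changed, calls every
// callback that is registered at that moment.
func (r *flagRegistry) setFQDN(v bool) {
	r.mu.Lock()
	old := r.fqdn
	r.fqdn = v
	cbs := make([]func(bool, bool), 0, len(r.callbacks))
	for _, cb := range r.callbacks {
		cbs = append(cbs, cb)
	}
	r.mu.Unlock()
	if old == v {
		return
	}
	for _, cb := range cbs {
		cb(v, old)
	}
}

// closeableProvider mirrors the idea behind CloseableProvider; the real
// interface's exact shape is an assumption here.
type closeableProvider interface {
	Close() error
}

// hostProvider is a hypothetical provider that counts flag changes it saw.
type hostProvider struct {
	flags   *flagRegistry
	changes int
}

// newHostProvider registers the callback at construction time, so a flag
// change that happens before Run() starts is not missed.
func newHostProvider(flags *flagRegistry) *hostProvider {
	p := &hostProvider{flags: flags}
	flags.register("host_provider", func(_, _ bool) { p.changes++ })
	return p
}

// Close deregisters the callback, symmetric with the constructor.
func (p *hostProvider) Close() error {
	p.flags.deregister("host_provider")
	return nil
}

func main() {
	flags := newFlagRegistry()
	p := newHostProvider(flags)

	flags.setFQDN(true) // change arrives before any Run(): still observed

	var c closeableProvider = p
	_ = c.Close()
	flags.setFQDN(false) // change after Close(): not observed

	fmt.Println(p.changes) // prints 1
}
```

With registration inside `Run()` instead, the first `setFQDN(true)` call in `main` would find no callback registered, which is the "timing is just wrong" case described above.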

A secondary benefit of these changes is that the host.TestFQDNFeatureFlagToggle unit test is no longer flaky. It was this flakiness that pointed to the race condition in the implementation prior to this PR.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

How to test this PR locally

To check that the host.TestFQDNFeatureFlagToggle unit test is no longer flaky, run it 1,000 times:

time go test github.com/elastic/elastic-agent/internal/pkg/composable/providers/host -test.run ^TestFQDNFeatureFlagToggle$ -test.count 1000
ok  	github.com/elastic/elastic-agent/internal/pkg/composable/providers/host	103.303s
go test  -test.run ^TestFQDNFeatureFlagToggle$ -test.count 1000  3.39s user 1.89s system 5% cpu 1:44.45 total

@ycombinator ycombinator added Team:Elastic-Agent Label for the Agent team Cleanup backport-v8.7.0 Automated backport with mergify labels Apr 11, 2023
@ycombinator ycombinator requested a review from a team as a code owner April 11, 2023 17:24
@ycombinator ycombinator requested review from michel-laterman and pchila and removed request for a team April 11, 2023 17:24
mergify bot commented Apr 11, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b provider-closer upstream/provider-closer
git merge upstream/main
git push upstream provider-closer

elasticmachine commented Apr 11, 2023

💚 Build Succeeded

Build stats

  • Start Time: 2023-04-11T18:43:48.520+0000

  • Duration: 17 min 46 sec

Test stats 🧪

Test Results
Failed 0
Passed 5451
Skipped 23
Total 5474

💚 Flaky test report

Tests succeeded.

elasticmachine commented Apr 11, 2023

🌐 Coverage report

Name          Metrics % (covered/total)   Diff
Packages      98.507% (66/67)             👍
Files         69.432% (159/229)           👍
Classes       68.65% (300/437)            👍
Methods       54.14% (922/1703)           👍 0.086
Lines         39.329% (10364/26352)       👍 0.003
Conditionals  100.0% (0/0)                💚

@@ -137,7 +153,7 @@ func (c *controller) Run(ctx context.Context) error {
	state.Context = localCtx
	state.signal = notify
	go func(name string, state *dynamicProviderState) {
		defer wg.Done()
		defer closeProvider(name, state.provider)
Contributor

I think this changes the controller's behavior somewhat. If you were to create a controller, run it, stop it, and run it again, the callback would not be registered, because Close is called at the end of Run but registration happens in New.

Contributor Author

Yes, good point. The closeable providers need to be closed when a controller is done with all its runs. Let me try to find an appropriate place in the controller lifecycle where this could happen. Thanks!

Contributor Author

Addressed in 00b781a and 50a0a19.

@@ -114,6 +112,16 @@ func ContextProviderBuilder(log *logger.Logger, c *config.Config, _ bool) (corec
	if p.CheckInterval <= 0 {
		p.CheckInterval = DefaultCheckInterval
	}

	p.fqdnFFChangeCh = make(chan struct{})
Contributor

I would also make this a buffered channel so that it doesn't block when the callback hook is called while the provider is not running.

Now that registration is separate from Run, it is possible that no one is reading from the channel when a value is pushed onto it.

Contributor Author

Thanks. Addressed in b0b55c3.

}
func (c *contextProvider) onFQDNFeatureFlagChange(new, old bool) {
	// FQDN feature flag was toggled, so notify on channel
	c.fqdnFFChangeCh <- struct{}{}
Contributor

@blakerouse blakerouse Apr 11, 2023

See the comment about the buffered channel.

You only need to ensure that a struct is on the channel; if one is already there, you don't need to push another. So I would change this line to:

select {
case c.fqdnFFChangeCh <- struct{}{}:
default:
}

This removes any chance of blocking, but ensures that if Run is running, or will be run, it gets the notification.

Contributor Author

Thanks. Addressed in 4a5a7c2.
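To illustrate the two suggestions in this thread together, here is a small self-contained sketch of a buffered channel combined with a non-blocking send. The `notifier` type and its methods are hypothetical, and the buffer size of 1 is an assumption; the thread only says "buffered channel" and does not state the capacity used in the actual commit.

```go
package main

import "fmt"

// notifier is a hypothetical type that coalesces "something changed"
// signals into at most one pending notification.
type notifier struct {
	ch chan struct{}
}

func newNotifier() *notifier {
	// Capacity 1 (an assumption): one pending signal is enough, since
	// the reader re-reads the current state when it wakes up.
	return &notifier{ch: make(chan struct{}, 1)}
}

// notify never blocks: if a signal is already pending, the new one is
// dropped, because the pending one already tells the reader to wake up.
func (n *notifier) notify() {
	select {
	case n.ch <- struct{}{}:
	default:
	}
}

// pending reports how many signals are waiting to be read.
func (n *notifier) pending() int {
	return len(n.ch)
}

func main() {
	n := newNotifier()

	// No reader is running yet; none of these calls block.
	n.notify()
	n.notify()
	n.notify()
	fmt.Println(n.pending()) // prints 1

	// A reader that starts later still receives the notification.
	<-n.ch
	fmt.Println(n.pending()) // prints 0
}
```

With an unbuffered channel and a plain send, the first `notify()` call would block until a reader appeared, which is the situation the review comment warns about.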

Contributor

@blakerouse blakerouse left a comment

Thanks for the fixes, changes look good!

@ycombinator ycombinator merged commit 86c3395 into elastic:main Apr 11, 2023
@ycombinator ycombinator deleted the provider-closer branch April 11, 2023 20:07
mergify bot pushed a commit that referenced this pull request Apr 11, 2023
…cle (#2485)

* Implement CloseableProvider

* Make hostProvider implement CloseableProvider

* Update test to be more resilient

* Move feature flags parsing from standalone configuration to late in the Agent constructor

* Add explanatory comment for closeProvider function

* Running mage fmt

* Remove unused function

* Add Close() to controller interface and implement it

* Call composable controller's Close() method

* Use buffered channel to prevent blocking

* Don't block until Run() is running

* Call composable controller's Close() in tests

(cherry picked from commit 86c3395)

# Conflicts:
#	internal/pkg/agent/application/application.go
#	internal/pkg/agent/cmd/run.go
ycombinator added a commit that referenced this pull request Apr 11, 2023
…host provider's lifecycle (#2486)

* Better callback registration/deregistration in host provider's lifecycle (#2485)

* Implement CloseableProvider

* Make hostProvider implement CloseableProvider

* Update test to be more resilient

* Move feature flags parsing from standalone configuration to late in the Agent constructor

* Add explanatory comment for closeProvider function

* Running mage fmt

* Remove unused function

* Add Close() to controller interface and implement it

* Call composable controller's Close() method

* Use buffered channel to prevent blocking

* Don't block until Run() is running

* Call composable controller's Close() in tests

(cherry picked from commit 86c3395)

# Conflicts:
#	internal/pkg/agent/application/application.go
#	internal/pkg/agent/cmd/run.go

* Fixing conflicts

---------

Co-authored-by: Shaunak Kashyap <[email protected]>
Labels
backport-v8.7.0 Automated backport with mergify Cleanup Team:Elastic-Agent Label for the Agent team