Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the ability to specify 'use_cached' connections in config #2000

Merged
merged 2 commits into from
Aug 7, 2018

Conversation

kellydunn
Copy link
Contributor

During our use of the datadog pgbouncer check, we found an issue where the agent would fail to reconnect to a recovered pgbouncer process. We simulated this by starting our pgbouncer and datadog-agent processes, and then subsequently stopping pgbouncer, waiting a few minutes, and restarting pgbouncer.

The agent would correctly alert us that pgbouncer was down, but upon recovery, the agent would continue to be in an error state. We would have to restart the agent to get it reporting "healthy" again. We have reason to believe this is due to the caching of the underlying Postgres connection.

We simulated this again with the change in this PR and the agent behaved as expected.

Please let us know if there's additional testing coverage or simulations we can provide.

Cheers!

What does this PR do?

This enables users to configure whether or not they want to used cached connections to check pgbouncer.

Motivation

As mentioned above, we were evaluating this check in our development environment and we observed that when a pgbouncer process stopped, and restarted again, the datadog agent would continually think it was in an error state. We have reason to believe it is due to caching the connection object and attempting to call functions on it after it has been closed.

Review checklist

  • PR has a meaningful title or PR has the no-changelog label attached
  • Feature or bugfix has tests
  • Git history is clean
  • If PR impacts documentation, docs team has been notified or an issue has been opened on the documentation repo

Additional Notes

I'm more than open to explore other possibilities too, like catching a different exception or finding out why ShouldRestartException isn't being thrown in this instance.

@kellydunn kellydunn requested a review from a team as a code owner August 7, 2018 00:35
During our use of the datadog pgbouncer check, we found an issue where the agent would fail
to reconnect to a recovered pgbouncer process.  We simulated this by starting our
`pgbouncer` and `datadog-agent` processes, and then subsequently stopping `pgbouncer`,
waiting a few minutes, and restarting `pgbouncer`.

The agent would correctly alert us that `pgbouncer` was down, but upon recovery, the agent
would continue to be in an error state.  We would have to restart the agent to get it reporting "healthy" again.
We have reason to believe this is due to the caching of the underlying Postgres connection.

We simulated this again with the change in this PR and the agent behaved as expected.

Please let us know if there's additional testing coverage or simulations we can provide.

Cheers!
@kellydunn
Copy link
Contributor Author

We've been running this fork of the core-integrations over night and we've observed the following:

  • No additional connections seem to be active when viewing SHOW CLIENTS; on the pgbouncer database
  • No visibile memory footprint from trying to create a new connection each time with each check
  • Monitor continues to be green, even after stopping and restarting the pgbouncer check, and does not require an agent restart after it gets into an error state.

i-0e1bd96feccc63266 datadog 2018-08-07 09-25-05

monitor status pgbouncer on void is straight borked datadog 2018-08-07 09-25-52

Let us know any more context or perfomance data you may need when reviewing this PR :)

Thanks!

Copy link
Contributor

@ofek ofek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kellydunn Awesome investigative work!

@@ -206,6 +206,7 @@ def check(self, instance):
password = instance.get('password', '')
tags = instance.get('tags', [])
database_url = instance.get('database_url')
use_cached = instance.get('use_cached', True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please wrap this with is_affirmative? from datadog_checks.config import is_affirmative

@@ -206,6 +206,7 @@ def check(self, instance):
password = instance.get('password', '')
tags = instance.get('tags', [])
database_url = instance.get('database_url')
use_cached = instance.get('use_cached', True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kellydunn
Copy link
Contributor Author

@ofek Thanks for the feedback! Let me know if you think I need to address anything else! Thanks!

@ofek
Copy link
Contributor

ofek commented Aug 7, 2018

LGTM. Also, congratulations on getting PR number 2000 😉

@ofek ofek merged commit f26fb09 into DataDog:master Aug 7, 2018
@ofek ofek changed the title Adds in the ability to specify 'use_cached' in the pgbouncer config. Add the ability to specify 'use_cached' connections in config Aug 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants