Add the ability to specify 'use_cached' connections in config #2000
During our use of the datadog pgbouncer check, we found an issue where the agent would fail to reconnect to a recovered pgbouncer process. We simulated this by starting our `pgbouncer` and `datadog-agent` processes, and then subsequently stopping `pgbouncer`, waiting a few minutes, and restarting `pgbouncer`. The agent would correctly alert us that `pgbouncer` was down, but upon recovery, the agent would continue to be in an error state. We would have to restart the agent to get it reporting "healthy" again. We have reason to believe this is due to the caching of the underlying Postgres connection. We simulated this again with the change in this PR and the agent behaved as expected. Please let us know if there's additional testing coverage or simulations we can provide. Cheers!
We've been running this fork of the
Let us know if you need any more context or performance data when reviewing this PR :) Thanks!
@kellydunn Awesome investigative work!
@@ -206,6 +206,7 @@ def check(self, instance):
         password = instance.get('password', '')
         tags = instance.get('tags', [])
         database_url = instance.get('database_url')
+        use_cached = instance.get('use_cached', True)
Can you please wrap this with `is_affirmative`? `from datadog_checks.config import is_affirmative`
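For context on why the wrapper matters, here is a rough sketch. The `is_affirmative` below is a simplified stand-in written for illustration, not the real helper from `datadog_checks.config`, and the accepted strings are an assumption. The point is that a config value arriving as the string `"false"` is truthy in Python, so a bare `instance.get(...)` would leave caching enabled.

```python
def is_affirmative(value):
    """Simplified, hypothetical stand-in for datadog_checks.config.is_affirmative."""
    if value is None:
        return False
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        return bool(value)
    # Assumed set of affirmative strings; the real helper may differ.
    return str(value).strip().lower() in ("yes", "true", "1", "y", "on")


# An instance config where the option was written as a quoted string.
instance = {"use_cached": "false"}

# Without the wrapper: any non-empty string is truthy.
raw = instance.get("use_cached", True)
naive_use_cached = bool(raw)

# With the wrapper: the string is interpreted as a boolean.
use_cached = is_affirmative(instance.get("use_cached", True))

print(naive_use_cached, use_cached)
```

With the wrapper, the default of `True` is preserved when the key is absent, and string values are read the way a user would expect.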
@@ -206,6 +206,7 @@ def check(self, instance):
         password = instance.get('password', '')
         tags = instance.get('tags', [])
         database_url = instance.get('database_url')
+        use_cached = instance.get('use_cached', True)
Can you please also document this here? https://github.com/DataDog/integrations-core/blob/master/pgbouncer/datadog_checks/pgbouncer/data/conf.yaml.example
@ofek Thanks for the feedback! Let me know if you think I need to address anything else! Thanks!
LGTM. Also, congratulations on getting PR number
What does this PR do?
This enables users to configure whether or not they want to use cached connections to check pgbouncer.
Motivation
As mentioned above, we were evaluating this check in our development environment and we observed that when a `pgbouncer` process stopped and then restarted, the Datadog agent would continue to report an error state. We have reason to believe this is due to caching the connection object and attempting to call functions on it after it has been closed.
Review checklist
`no-changelog` label attached
Additional Notes
I'm more than open to exploring other possibilities too, like catching a different exception or finding out why `ShouldRestartException` isn't being thrown in this instance.
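To illustrate the suspected failure mode, here is a minimal sketch. It does not use the actual check code: `FakeConnection`, `FakeCheck`, and `run_check` are hypothetical names made up for this example, and marking the connection closed stands in for a `pgbouncer` restart. It only shows how a cached connection object can keep failing after recovery while an uncached lookup reconnects.

```python
class FakeConnection:
    """Hypothetical stand-in for a database connection object."""

    def __init__(self):
        self.closed = False

    def query(self):
        if self.closed:
            raise RuntimeError("connection already closed")
        return "ok"


class FakeCheck:
    """Hypothetical check that caches one connection per key."""

    def __init__(self):
        self._cache = {}
        self.connects = 0

    def _connect(self):
        self.connects += 1
        return FakeConnection()

    def get_connection(self, key, use_cached=True):
        # Reuse the cached object only when the caller allows it.
        if use_cached and key in self._cache:
            return self._cache[key]
        conn = self._connect()
        self._cache[key] = conn
        return conn


def run_check(check, key, use_cached):
    try:
        return check.get_connection(key, use_cached=use_cached).query()
    except RuntimeError:
        return "error"


check = FakeCheck()
first = run_check(check, "pgbouncer", use_cached=True)

# Simulate pgbouncer going away: the cached object is now unusable.
check._cache["pgbouncer"].closed = True

stale = run_check(check, "pgbouncer", use_cached=True)   # keeps hitting the dead object
fresh = run_check(check, "pgbouncer", use_cached=False)  # opens a new connection

print(first, stale, fresh)
```

In this sketch the cached path stays in an error state until something evicts the entry, which matches what we saw before restarting the agent.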