Add the ability to specify 'use_cached' connections in config #2000

kellydunn · 2018-08-07T00:35:12Z

During our use of the datadog pgbouncer check, we found an issue where the agent would fail to reconnect to a recovered pgbouncer process. We simulated this by starting our pgbouncer and datadog-agent processes, and then subsequently stopping pgbouncer, waiting a few minutes, and restarting pgbouncer.

The agent would correctly alert us that pgbouncer was down, but upon recovery, the agent would continue to be in an error state. We would have to restart the agent to get it reporting "healthy" again. We have reason to believe this is due to the caching of the underlying Postgres connection.

We simulated this again with the change in this PR and the agent behaved as expected.

Please let us know if there's additional testing coverage or simulations we can provide.

Cheers!

What does this PR do?

This enables users to configure whether or not they want to use cached connections to check pgbouncer.

Motivation

As mentioned above, we were evaluating this check in our development environment and observed that when a pgbouncer process stopped and then restarted, the datadog agent would keep reporting an error state. We have reason to believe this is due to caching the connection object and attempting to call functions on it after it has been closed.
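To make the suspected failure mode concrete, here is a minimal sketch of it. This is not the check's actual code: `FakeServer`, `FakeConnection`, and `Checker` are hypothetical stand-ins, and the assumption is simply that a connection opened against one pgbouncer process is unusable after that process restarts.

```python
class FakeServer:
    """Hypothetical stand-in for a pgbouncer process that can be restarted."""

    def __init__(self):
        self.up = True
        self.generation = 0  # incremented on every restart

    def stop(self):
        self.up = False

    def start(self):
        self.up = True
        self.generation += 1


class FakeConnection:
    """Hypothetical stand-in for a database connection (not psycopg2)."""

    def __init__(self, server):
        if not server.up:
            raise ConnectionError("server is down")
        self.server = server
        self.generation = server.generation  # the process we attached to

    def query(self):
        # A connection opened against an earlier process is dead after a restart.
        if not self.server.up or self.generation != self.server.generation:
            raise ConnectionError("connection already closed")
        return "ok"


class Checker:
    """Caches one connection per key unless use_cached is False."""

    def __init__(self, server):
        self.server = server
        self._cache = {}

    def _get_connection(self, key, use_cached=True):
        if use_cached and key in self._cache:
            return self._cache[key]
        conn = FakeConnection(self.server)
        self._cache[key] = conn
        return conn

    def check(self, use_cached=True):
        try:
            return self._get_connection("main", use_cached).query()
        except ConnectionError:
            return "error"


server = FakeServer()
checker = Checker(server)
print(checker.check())                  # ok
server.stop()
print(checker.check())                  # error
server.start()
print(checker.check())                  # error: the stale cached connection is reused
print(checker.check(use_cached=False))  # ok: a fresh connection is opened
```

In this toy model the cached path never recovers on its own, which matches what we observed, while skipping the cache recovers on the first check after the restart.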

Review checklist

PR has a meaningful title or PR has the no-changelog label attached
Feature or bugfix has tests
Git history is clean
If PR impacts documentation, docs team has been notified or an issue has been opened on the documentation repo

Additional Notes

I'm more than open to explore other possibilities too, like catching a different exception or finding out why ShouldRestartException isn't being thrown in this instance.
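One such alternative, sketched under assumptions: keep the cache, but drop the cached connection whenever a query fails so that the next run reconnects. `FlakyBackend` and `ReconnectingChecker` are hypothetical names for illustration only; this does not show how the check or `ShouldRestartException` behave today.

```python
class FlakyBackend:
    """Hypothetical backend that can be stopped and restarted."""

    def __init__(self):
        self.up = True
        self.epoch = 0

    def stop(self):
        self.up = False

    def start(self):
        self.up = True
        self.epoch += 1

    def connect(self):
        if not self.up:
            raise ConnectionError("backend is down")
        return _Conn(self, self.epoch)


class _Conn:
    """Connection bound to the backend epoch it was opened in."""

    def __init__(self, backend, epoch):
        self.backend = backend
        self.epoch = epoch

    def query(self):
        if not self.backend.up or self.epoch != self.backend.epoch:
            raise ConnectionError("connection already closed")
        return "ok"


class ReconnectingChecker:
    """Reuses a cached connection but discards it after any failure."""

    def __init__(self, connect):
        self._connect = connect  # callable returning a new connection
        self._conn = None

    def run(self):
        try:
            if self._conn is None:
                self._conn = self._connect()
            return self._conn.query()
        except ConnectionError:
            self._conn = None  # forget the broken connection; next run reconnects
            raise
```

The trade-off in this sketch is that the first run after a failure still raises, but no configuration option is needed for the check to recover afterwards.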

kellydunn · 2018-08-07T16:40:46Z

We've been running this fork of integrations-core overnight and we've observed the following:

No additional connections seem to be active when viewing SHOW CLIENTS; on the pgbouncer database
No visible memory footprint from creating a new connection on each check
Monitor continues to be green, even after stopping and restarting the pgbouncer check, and does not require an agent restart after it gets into an error state.

Let us know if you need any more context or performance data when reviewing this PR :)

Thanks!

ofek

@kellydunn Awesome investigative work!

ofek · 2018-08-07T19:48:59Z

pgbouncer/datadog_checks/pgbouncer/pgbouncer.py

@@ -206,6 +206,7 @@ def check(self, instance):
        password = instance.get('password', '')
        tags = instance.get('tags', [])
        database_url = instance.get('database_url')
+        use_cached = instance.get('use_cached', True)


Can you please wrap this with is_affirmative?

from datadog_checks.config import is_affirmative
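A short illustration of why the wrapper matters, under stated assumptions: `is_affirmative_sketch` below is a hypothetical approximation written for this example, not the library's implementation, and the `instance` dict stands in for a parsed config where the value arrived as a string.

```python
def is_affirmative_sketch(value):
    """Hypothetical approximation of a truthiness helper for config values."""
    if value is None:
        return False
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        return value != 0
    return str(value).strip().lower() in ("yes", "true", "1", "y", "on")


# Assumed example: the option was written as a quoted string in the config.
instance = {"use_cached": "false"}

naive = instance.get("use_cached", True)
wrapped = is_affirmative_sketch(instance.get("use_cached", True))

print(bool(naive))  # True: any non-empty string is truthy in Python
print(wrapped)      # False: the string is interpreted as a boolean
```

Without the wrapper, a string such as "false" would silently keep caching enabled.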

ofek · 2018-08-07T19:50:28Z

pgbouncer/datadog_checks/pgbouncer/pgbouncer.py

@@ -206,6 +206,7 @@ def check(self, instance):
        password = instance.get('password', '')
        tags = instance.get('tags', [])
        database_url = instance.get('database_url')
+        use_cached = instance.get('use_cached', True)


Can you please also document this here? https://github.com/DataDog/integrations-core/blob/master/pgbouncer/datadog_checks/pgbouncer/data/conf.yaml.example

kellydunn · 2018-08-07T22:09:43Z

@ofek Thanks for the feedback! Let me know if you think I need to address anything else! Thanks!

ofek · 2018-08-07T23:17:10Z

LGTM. Also, congratulations on getting PR number 2000 😉

kellydunn requested a review from a team as a code owner August 7, 2018 00:35

kellydunn force-pushed the pgbouncer-exposed-cached branch from 73df382 to 8731fd6 Compare August 7, 2018 00:41

masci added community integration/pgbouncer changelog/Added labels Aug 7, 2018

ofek requested changes Aug 7, 2018

Adding in feedback from ofek, wrt documentation and sanitizing input

ca8bcb9

ofek approved these changes Aug 7, 2018

ofek merged commit f26fb09 into DataDog:master Aug 7, 2018

ofek changed the title ~~Adds in the ability to specify 'use_cached' in the pgbouncer config.~~ Add the ability to specify 'use_cached' connections in config Aug 7, 2018
