Upgrade postgres check to psycopg3 #15411
Conversation
```diff
@@ -81,7 +81,7 @@ def explain_statement(self, dbname, statement, obfuscated_statement):
             result = self._explain_prepared_statement(dbname, statement, obfuscated_statement, query_signature)
             self._deallocate_prepared_statement(dbname, query_signature)
             if result:
-                return result[0][0][0]
+                return result[0]['explain_statement'][0]
```
I do prefer `['explain_statement']` to `[0]`. 🙇
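For context, a minimal sketch of why the keyed lookup works: in psycopg3 the row factory controls the row shape, and `dict_row` makes columns addressable by name. This assumes a reachable Postgres; the connection string and query are placeholders, not the integration's real setup.

```python
import psycopg
from psycopg.rows import dict_row

# dict_row returns each row as a dict keyed by column name.
with psycopg.connect("dbname=datadog_test user=datadog", row_factory=dict_row) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT 1 AS explain_statement")
        row = cur.fetchone()
        # With tuple rows this would be row[0]; the column name documents
        # intent and survives column reordering.
        print(row["explain_statement"])
```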
```python
tasks = [
    executor.submit(thread.cancel)
    for thread in [self.statement_samples, self.statement_metrics, self.metadata_samples]
]
concurrent.futures.wait(tasks)
```
Should there be a timeout to this where we just continue forcefully? I'm thinking of some of the reports of the agent service not being able to be restarted, given the running hypothesis that one of our db integrations wasn't able to gracefully terminate.
yeah, i thought about adding that, better to be safe
> I'm thinking of some of the reports of the agent service not being able to be restarted, given the running hypothesis that one of our db integrations wasn't able to gracefully terminate
oh interesting, I hadn't heard about this
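A standalone sketch of the forceful-continue idea discussed above. Note that `concurrent.futures.wait` returns `(done, not_done)` sets rather than raising on timeout, so the timeout is detected by checking `not_done`; the one-second deadline and `slow_cancel` helper are illustrative stand-ins.

```python
import concurrent.futures
import time

def slow_cancel():
    # Stand-in for a job loop cancel that hangs on a stuck connection.
    time.sleep(5)

executor = concurrent.futures.ThreadPoolExecutor()
tasks = [executor.submit(slow_cancel) for _ in range(3)]

# wait() never raises on timeout; it hands back whatever didn't finish.
done, not_done = concurrent.futures.wait(tasks, timeout=1.0)
if not_done:
    print(f"{len(not_done)} cancel task(s) still running; continuing shutdown forcefully")
executor.shutdown(wait=False)  # don't block on the stragglers
```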
```diff
@@ -183,6 +184,8 @@ def default_json_event_encoding(o):
         return float(o)
     if isinstance(o, (datetime.date, datetime.datetime)):
         return o.isoformat()
+    if isinstance(o, IPv4Address):
```
We've been seeing JSON serialization errors for a long time now due to some activity payloads containing `IPv4Address` objects.
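A self-contained sketch of the resulting encoder. `json.dumps` calls `default` only for types it can't serialize natively; the diff adds the `IPv4Address` branch, and returning `str(o)` for it is the natural completion assumed here (the `Decimal` branch is likewise assumed from the `return float(o)` context line).

```python
import datetime
import json
from decimal import Decimal
from ipaddress import IPv4Address

def default_json_event_encoding(o):
    if isinstance(o, Decimal):
        return float(o)
    if isinstance(o, (datetime.date, datetime.datetime)):
        return o.isoformat()
    if isinstance(o, IPv4Address):
        return str(o)  # assumed completion: serialize as a dotted-quad string
    raise TypeError

print(json.dumps({"client_addr": IPv4Address("10.0.0.1")}, default=default_json_event_encoding))
# {"client_addr": "10.0.0.1"}
```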
Thanks for the PR!
Seems like this new dependency makes a couple of agent builds fail: https://gitlab.ddbuild.io/DataDog/datadog-agent/-/jobs/305458412 Could you have a look? Do not hesitate to reach out to us if you need help solving this.
```python
done, not_done = concurrent.futures.wait(tasks, timeout=self._config.min_collection_interval)
if not_done:
    self.log.warning("timeout reached on cancel, proceeding with unclean shutdown.")
```
All warning and error logs should be more user- and support-friendly, indicating the context and impact. I don't think SEs would know what to do with this one, and there's a high possibility of misinterpreting its meaning.
OK, do you have a suggestion for more helpful wording? The impact is that we could be closing connections that are in use by the job loop threads, which would result in some nasty exceptions being thrown.
"Not all job loops were completed in time when cancelling the main check. Proceeding with the check cancellation. Some unexpected errors related to closed connections may occur after this message."
```diff
@@ -22,6 +22,7 @@
 class VersionUtils(object):
     def __init__(self):
         self.log = get_check_logger()
+        self.seen_aurora_exception = False
```
Suggested change:
```diff
-        self.seen_aurora_exception = False
+        self._seen_aurora_exception = False
```
```diff
@@ -31,6 +32,8 @@ def get_raw_version(db):
         return raw_version

+    def is_aurora(self, db):
+        if self.seen_aurora_exception:
```
Suggested change:
```diff
-        if self.seen_aurora_exception:
+        if self._seen_aurora_exception:
```
```diff
@@ -39,6 +42,7 @@ def is_aurora(self, db):
         except Exception as e:
             self.log.debug("Captured exception %s while determining if the DB is aurora. Assuming is not", str(e))
             db.rollback()
+            self.seen_aurora_exception = True
```
Suggested change:
```diff
-            self.seen_aurora_exception = True
+            self._seen_aurora_exception = True
```
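Pulling the three hunks together, the pattern being introduced is a memoized failure flag: once the Aurora probe raises, every later check run skips the query instead of re-raising and rolling back. A condensed sketch with the suggested underscore naming; the probe query, import path, and return values are assumptions, since the method body is only partially visible in the hunks.

```python
from datadog_checks.base.log import get_check_logger  # assumed import path

class VersionUtils(object):
    def __init__(self):
        self.log = get_check_logger()
        self._seen_aurora_exception = False

    def is_aurora(self, db):
        # Memoized failure: don't re-run a probe we already know fails here.
        if self._seen_aurora_exception:
            return False
        try:
            with db.cursor() as cursor:
                cursor.execute("SELECT AURORA_VERSION();")  # assumed Aurora-only probe
                return True
        except Exception as e:
            self.log.debug("Captured exception %s while determining if the DB is aurora. Assuming is not", str(e))
            db.rollback()
            self._seen_aurora_exception = True
            return False
```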
```diff
@@ -837,3 +865,9 @@ def check(self, _):
         finally:
             # Add the warnings saved during the execution of the check
             self._report_warnings()
+            if self.check_cancelled and self.db:
```
Suggested change:
```diff
-            if self.check_cancelled and self.db:
+            if self._check_cancelled and self.db:
```
```python
self._relations_manager = RelationsManager(self._config.relations, self._config.max_relations)
self.check_cancelled = False
```
Suggested change:
```diff
-        self.check_cancelled = False
+        self._check_cancelled = False
```
```python
self.log.warning("timeout reached on cancel, proceeding with unclean shutdown.")

self._close_db_pool()
self.check_cancelled = True
```
Suggested change:
```diff
-        self.check_cancelled = True
+        self._check_cancelled = True
```
```python
finally:
    self.db_pool.set_conn_inactive(dbname)
```
This is decoupling the connection getting from its context manager, which seems unnecessary. This can be accomplished by replacing:

```python
try:
    self.db_pool.get_connection(dbname, self._conn_ttl_ms)
except:
    ...
```

with

```python
try:
    with self.db_pool.get_connection(dbname, self._conn_ttl_ms):
        pass
except:
    ...
```

The reason I prefer this is that it keeps the API tighter and more defensible, making it harder to introduce flows where the connection never gets marked inactive.
This is actually a mistake, and was part of the original implementation I had. If you look at the other uses, I am using a context manager in all other places.
Thanks for catching it, I'll fix it!
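To make the reviewer's point concrete, here is a minimal sketch (illustrative names, not the integration's real API) of a pool whose `get_connection` is a context manager, so the inactive-marking runs on every exit path:

```python
from contextlib import contextmanager

class ConnectionPool:
    def __init__(self):
        self._conns = {}

    @contextmanager
    def get_connection(self, dbname, ttl_ms):
        conn = self._conns.setdefault(dbname, object())  # stand-in for opening/reusing a connection
        try:
            yield conn
        finally:
            # Runs on success, exception, or early return alike, so callers
            # can't forget to mark the connection inactive.
            self.set_conn_inactive(dbname)

    def set_conn_inactive(self, dbname):
        print(f"connection to {dbname} marked inactive")

# Even a "just warm it up" caller keeps the guarantee:
pool = ConnectionPool()
with pool.get_connection("postgres", ttl_ms=1000):
    pass
```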
This PR requires this change to the build process (DataDog/datadog-agent#18532) to be merged before we can proceed.
```diff
@@ -259,6 +262,9 @@ def cancel(self):
         Send a signal to cancel the job loop asynchronously.
         """
         self._cancel_event.set()
+        # after setting cancel event, wait for job loop to fully shutdown
+        if self._job_loop_future:
+            self._job_loop_future.result()
```
Without this change, we see a race condition that can happen when multiple job loop threads are running & cancel is called. This can sometimes result in a segfault:

```
tests/test_statements.py::test_async_job_enabled[True3-True1-True0] Fatal Python error: Segmentation fault

Current thread 0x00007f4924d0b700 (most recent call first):
  File "/home/ec2-user/.local/share/hatch/env/virtual/datadog-postgres/550Gv7FF/py3.9-12.1/lib/python3.9/site-packages/psycopg/cursor.py", line 516 in _select_current_result
  File "/home/ec2-user/.local/share/hatch/env/virtual/datadog-postgres/550Gv7FF/py3.9-12.1/lib/python3.9/site-packages/psycopg/cursor.py", line 205 in _execute_gen
  File "/home/ec2-user/.local/share/hatch/env/virtual/datadog-postgres/550Gv7FF/py3.9-12.1/lib/python3.9/site-packages/psycopg/connection.py", line 957 in wait
  ....
  ...
  File "/home/ec2-user/dd/integrations-core/datadog_checks_base/datadog_checks/base/utils/tracking.py", line 71 in wrapper
  File "/home/ec2-user/dd/integrations-core/postgres/datadog_checks/postgres/statement_samples.py", line 434 in run_job
  File "/home/ec2-user/dd/integrations-core/datadog_checks_base/datadog_checks/base/utils/db/utils.py", line 348 in _run_job_traced
  File "/home/ec2-user/dd/integrations-core/datadog_checks_base/datadog_checks/base/utils/db/utils.py", line 342 in _run_job_rate_limited
  File "/home/ec2-user/dd/integrations-core/datadog_checks_base/datadog_checks/base/utils/db/utils.py", line 303 in _job_loop
  File "/home/ec2-user/anaconda3/envs/agent39/lib/python3.9/concurrent/futures/thread.py", line 58 in run
  File "/home/ec2-user/anaconda3/envs/agent39/lib/python3.9/concurrent/futures/thread.py", line 83 in _worker
```
This is because the call to cancel would simply set the cancel() threading event & would return immediately, which would result in a call to close_all_connections. This is a problem because setting the cancel event on the job loop thread does not guarantee that the thread isn't already executing at the time we end up calling close for all connections in the db pool. (we only check if cancel is set [here](https://github.com/DataDog/integrations-core/blob/master/datadog_checks_base/datadog_checks/base/utils/db/utils.py#L290) and [here](https://github.com/DataDog/integrations-core/blob/master/datadog_checks_base/datadog_checks/base/utils/db/utils.py#L340))
Therefore, the only way to safely cancel all job loops is to do the following (sketched below):
- `cancel` is called from the main check thread
- Wait for all job loop `cancel` methods to return
- Job loop `cancel` methods do not return until the job loop is no longer running
- Once all job loop threads' `cancel()` calls return, close all database connections in the conn pool
- Finally, close the main check connection once the check is no longer running
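Condensing the list above into code: the key line is blocking on the job loop's future, which is exactly what the diff adds. Only the members relevant to the ordering guarantee are shown.

```python
def cancel(self):
    """Signal the job loop to stop, then block until it has fully exited."""
    self._cancel_event.set()
    # result() does not return until _job_loop has finished, so once every
    # job's cancel() returns it is safe to close the pooled connections.
    if self._job_loop_future:
        self._job_loop_future.result()
```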
This PR allows the datadog-agent build to work with the psycopg3 dependency, which was added as part of the integrations-core PR DataDog/integrations-core#15411.
What does this PR do?

This PR upgrades the postgres integration to use the latest `psycopg` library, psycopg3. This moves us to the latest and greatest tech when it comes to running postgres queries via python.

Motivation

- The integration is currently running a version of `psycopg2` that is 3 years old.
- We suspect a long-standing memory leak is coming from the `libpq` it ships under the hood.

We were able to reproduce this in our own cluster, and saw that the place we were leaking from was in native memory... after doing some investigative work, we were able to narrow down the source of the leak to likely be coming from `libpq`, and it seemingly was triggered when many `psycopg2` exceptions were being thrown, as you can see here:

[chart: bumps in memory & OOMs seem to line up with spikes in `psycopg2` exceptions being thrown]

From running the datadog native profiler, we could see memory growing over time in the `libpq` code path.

Since `libpq` ships with `psycopg2-binary`, we were unable to upgrade it without upgrading `psycopg2`. After upgrading to psycopg3, we saw memory stabilize for our test agent.

The next step is to test across all of our clusters & with customers to confirm that this indeed solves the issue.
Other Fixes / Improvements

- Prevented `psycopg.OperationalError: consuming input failed: connection pointer is NULL` errors from being thrown; in some cases this resulted in a segfault being thrown from the underlying C code. Prior to this commit, the following could happen:
  1. On `cancel`, the job loop threads would shut down.
  2. This would call the shutdown hook that closes all db connections... from one of the job loop threads.
  3. Meanwhile, the check could still be running using the main db connection from the pool & see its connection closed while it's trying to finish running all of its queries.

  When testing this code against hosts where `cancel` is called frequently (which could happen to dbs that are not properly configured for the agent), we see null pointer connection exceptions / segfaults as a result. In order to fix this, we must wait for every job loop `cancel` to return before closing pooled connections, and close the main check connection only once the check is no longer running (see the cancel discussion above).
- Added `IPv4Address` to our custom JSON event encoding implementation.
- Honor the case where `dbstrict` or `ignore_databases` is set in the user's config. Without this, I saw many `SyntaxError`s being thrown by psycopg, which I believe were previously swallowed.
s being thrown by psycopg, which I believe were previously swallowedReview checklist (to be filled by reviewers)
changelog/
andintegration/
labels attachedqa/skip-qa
label.