
Collect mysql statement samples & execution plans #8629

Merged
merged 26 commits into master from djova/mysql-dbm-statement-samples
Mar 6, 2021

Conversation

djova
Contributor

@djova djova commented Feb 15, 2021

What does this PR do?

Adds a new feature to "Deep Database Monitoring", enabling collection of statement samples and execution plans.

Follow-up to:

How it works

If enabled, a Python thread is launched during a regular check run. The thread:

  • collects statement samples at the configured rate limit (default: 1 collection per second)
  • maintains its own pymysql connection, because pymysql is not thread safe and the connection can't be shared with the main check
  • collects execution plans through a MySQL procedure that the user must install into each monitored database (if the agent collected execution plans directly by running EXPLAIN, it would need full write permission to all tables)
  • shuts down if it detects that the main check has stopped running

During one "collection", the thread does the following (sketched after the list):

  1. read out all new statements from performance_schema.events_statements_{current|history|history_long}
  2. try to collect an execution plan for each statement
  3. submit events directly to the new database monitoring event intake
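
A minimal sketch of one collection pass, assuming a dedicated pymysql connection and the user-installed `datadog.explain_statement` procedure described above; `collect_statement_samples`, `submit_event`, and the event payload shape are illustrative, not the check's actual internals:

```python
import json

import pymysql


def collect_statement_samples(conn, submit_event):
    """One collection pass (illustrative sketch, not the integration's actual code)."""
    with conn.cursor() as cursor:
        # 1. Read out recent statements from the chosen events_statements table.
        #    The real check also filters out rows it has already seen.
        cursor.execute(
            "SELECT sql_text, digest_text, current_schema "
            "FROM performance_schema.events_statements_history_long LIMIT 5000"
        )
        rows = cursor.fetchall()

    for sql_text, digest_text, current_schema in rows:
        plan = None
        # 2. Try to collect an execution plan via the user-installed procedure, so the
        #    agent never needs write permission to run EXPLAIN itself.
        try:
            with conn.cursor() as cursor:
                cursor.execute("CALL datadog.explain_statement(%s)", (sql_text,))
                result = cursor.fetchone()
                plan = result[0] if result else None
        except pymysql.err.MySQLError:
            pass  # plan collection is best-effort
        # 3. Submit the event directly to the database monitoring event intake.
        submit_event(json.dumps({
            "db": {"statement": sql_text, "digest_text": digest_text,
                   "schema": current_schema, "plan": plan},
        }))
```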

Rate Limiting

There are several rate limits to keep load on the database to a minimum and to avoid re-ingesting duplicate events (a sketch follows the list):

  • collections_per_second: limits how often collections are done (each collection is a query to an events_statements_* table)
  • explained_statements_cache: a TTL cache that limits how often we attempt to collect an execution plan for a given normalized query
  • seen_samples_cache: a TTL cache that limits how often we ingest statement samples for the same normalized query and execution plan
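
A minimal sketch of how these limits could fit together, assuming `cachetools.TTLCache` for the two caches; the sizes, TTLs, and function names below are placeholders rather than the integration's defaults:

```python
import time

from cachetools import TTLCache

# Placeholder sizes and TTLs, not the integration's defaults.
collections_per_second = 1
explained_statements_cache = TTLCache(maxsize=5000, ttl=60)  # per normalized query
seen_samples_cache = TTLCache(maxsize=10000, ttl=60)         # per (normalized query, plan)

_last_collection = 0.0


def should_collect_now():
    """collections_per_second: at most one events_statements_* query per interval."""
    global _last_collection
    now = time.time()
    if now - _last_collection < 1.0 / collections_per_second:
        return False
    _last_collection = now
    return True


def should_explain(normalized_query):
    """explained_statements_cache: skip queries whose plan was collected recently."""
    if normalized_query in explained_statements_cache:
        return False
    explained_statements_cache[normalized_query] = True
    return True


def should_ingest(normalized_query, plan_signature):
    """seen_samples_cache: skip (query, plan) pairs already ingested recently."""
    key = (normalized_query, plan_signature)
    if key in seen_samples_cache:
        return False
    seen_samples_cache[key] = True
    return True
```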

Events statements tables

The check collects samples from the best available events_statements table. It's up to the user to enable the required events_statements consumers. Some managed databases, such as RDS Aurora replicas, don't support events_statements_history_long, in which case the check falls back to one of the other tables. The order of preference is (see the selection sketch below):

  1. events_statements_history_long - preferred, as it has the longest retention, giving the highest chance of catching infrequent and fast queries
  2. events_statements_current - less likely to catch infrequent and fast queries
  3. events_statements_history - least preferred table
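
The fallback could be resolved roughly as in the following sketch, which checks `performance_schema.setup_consumers` for enabled consumers; the function and constant names are illustrative, not the check's internals:

```python
# Preference order described above.
PREFERRED_TABLES = [
    "events_statements_history_long",
    "events_statements_current",
    "events_statements_history",
]


def choose_events_statements_table(conn, configured_table=None):
    """Pick the best available events_statements table, honoring a user override.

    `conn` is a pymysql (or other DB-API) connection to the monitored database.
    """
    if configured_table:
        return configured_table
    with conn.cursor() as cursor:
        # A consumer must be enabled for its events_statements_* table to be populated.
        cursor.execute(
            "SELECT name FROM performance_schema.setup_consumers "
            "WHERE name LIKE 'events_statements_%' AND enabled = 'YES'"
        )
        enabled = {row[0] for row in cursor.fetchall()}
    for table in PREFERRED_TABLES:
        if table in enabled:
            return table
    return None  # no usable consumer; statement samples can't be collected
```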

Configuration

```yaml
statement_samples:
  enabled: false
  # default rate depends on which events_statements table is being used. user can override.
  collections_per_second: -1
  # the best table is chosen automatically by the check. user can override.
  events_statements_table: ''
  events_statements_row_limit: 5000
  explain_procedure: 'explain_statement'
  fully_qualified_explain_procedure: 'datadog.explain_statement'
  events_statements_temp_table_name: 'datadog.temp_events'
  events_statements_enable_procedure: 'datadog.enable_events_statements_consumers'
```

Motivation

Collect statement samples & execution plans, enabling deeper insight into what's running on the database and how queries are being executed.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • PR title must be written as a CHANGELOG entry (see why)
  • File changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
  • PR must have changelog/ and integration/ labels attached

Contributor

@jtappa jtappa left a comment

Some suggestions to switch to a more active voice and consistency across descriptions.

Review comments (resolved) on mysql/assets/configuration/spec.yaml and mysql/datadog_checks/mysql/data/conf.yaml.example.
```
@@ -124,6 +127,9 @@ def check(self, _):
        finally:
            self._conn = None

    def cancel(self):
        self._statement_samples.cancel()
```
Contributor Author

@djova djova Mar 3, 2021

@olivielpeau @hush-hush I added support for cancel here. I think we should still keep the "stop thread if it notices the check is no longer running" logic as a fallback if the main check stops running for any other reason (agent overload, or a bug somewhere where it gets stuck).

Member

ok, LGTM.

I don't have a strong opinion about the fallback. Just a couple of thoughts: if the check is stuck it'll cause other problems (it'll hog a check runner, etc.), but the fallback would at least free up some resources, that's true. If the check is inactive because of an agent overload, and the fallback mechanism stops the background thread, you'll want to make sure the background thread can be started up again cleanly when the check is run again.

Contributor Author

you'll want to make sure the background thread can be started up again cleanly when the check is run again.

Yes, that is the current behavior: on every check run, the check starts the thread if it is not currently running (see the sketch below).
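
A minimal sketch of that restart-on-demand pattern (illustrative class and method names, not the integration's actual code):

```python
import threading
import time


class StatementSamplesCollector:
    """Illustrative restart pattern only, not the integration's actual class."""

    def __init__(self):
        self._thread = None
        self._cancel = False

    def run_sampler(self):
        # Called on every check run: (re)start the collection thread if it isn't alive.
        if self._thread is None or not self._thread.is_alive():
            self._cancel = False
            self._thread = threading.Thread(target=self._collection_loop, daemon=True)
            self._thread.start()

    def cancel(self):
        # Called when the check is unscheduled; the loop can also exit on its own if it
        # notices the main check has stopped running (the fallback discussed above).
        self._cancel = True

    def _collection_loop(self):
        while not self._cancel:
            # ... collect statement samples at the configured rate limit ...
            time.sleep(1)
```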

@djova
Contributor Author

djova commented Mar 3, 2021

Some suggestions to switch to a more active voice and consistency across descriptions.

Thanks @jtappa for the review!

jtappa
jtappa previously approved these changes Mar 4, 2021
Contributor

@jtappa jtappa left a comment

👍🏻 from docs!

@djova djova requested a review from a team as a code owner March 4, 2021 21:22
djova and others added 23 commits March 5, 2021 10:43
@djova djova force-pushed the djova/mysql-dbm-statement-samples branch from ff63029 to 37cd220 Compare March 5, 2021 15:53
@ofek ofek merged commit b878d2f into master Mar 6, 2021
@ofek ofek deleted the djova/mysql-dbm-statement-samples branch March 6, 2021 04:04
djova added a commit to DataDog/datadog-agent that referenced this pull request Apr 9, 2021
Add a new aggregator API through which checks can submit "event platform events" of various types.

All supported `eventTypes` are hardcoded in `EventPlatformForwarder`.

The `dbm-samples` and `dbm-metrics` events are expected to arrive fully serialized, so their pipelines are simple "HTTP passthrough" pipelines that skip the other features of logs pipelines, such as processing rules and encoding.

Future event types will be able to add more detailed processing if they need it.

**Overall flow**

1. `aggregator.submit_event_platform_event(check_id, rawEvent, "{eventType}")` - python API. Here's how the postgres check would be updated to use it: DataDog/integrations-core#9045.
2. `BufferedAggregator` forwards events to the `EventPlatformForwarder`. Events are **dropped** here if `EventPlatformForwarder` is backed up for any reason.
3. `EventPlatformForwarder` forwards events to the pipeline for the given `eventType`, **dropping** events for unknown `eventTypes`
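
For illustration, a minimal sketch of a check-side call following the signature quoted in step 1; the direct `import aggregator`, the payload shape, and the check_id are assumptions, not confirmed usage:

```python
import json

# `aggregator` is the Agent's builtin (rtloader) Python module, importable only when the
# check runs inside the Agent; the payload and check_id here are illustrative.
import aggregator

raw_event = json.dumps({"dbm_type": "sample", "db": {"statement": "SELECT 1"}})
aggregator.submit_event_platform_event(
    "postgres:1df52d84fb6f603c",  # check_id
    raw_event,                    # fully serialized event
    "dbm-samples",                # eventType; must be one hardcoded in EventPlatformForwarder
)
```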

**Internal Agent Stats**

*Prometheus*: `aggregator.flush - data_type:{eventType}, state:{ok|error}`

*ExpVar*: `EventPlatformEvents` & `EventPlatformEventsErrors`: counts by `eventType`

**User-Facing Agent Stats**

Statistics for each `eventType` will be tracked alongside other types of telemetry (`Service Checks`, `Series`, ...). Where appropriate, the raw `eventType` is translated to a human-readable name (e.g. `dbm-samples` --> `Database Monitoring Query Samples`).

`agent status` output:
```
=========
Collector
=========

  Running Checks
  ==============

    postgres (5.4.0)
    ----------------
      Instance ID: postgres:1df52d84fb6f603c [OK]
      Metric Samples: Last Run: 366, Total: 7,527
      Database Monitoring Query Samples: Last Run: 11, Total: 176
      ...

=========
Aggregator
=========
  Checks Metric Sample: 29,818
  Database Monitoring Query Samples: 473
  ...
```

`agent check {check_name}` output:

```
=== Metrics ===
...
=== Database Monitoring Query Samples ===
...
```

`agent check {check_name} --json` output will use the raw event types instead of the human-readable names:

```
"aggregator": {
  "metrics": [...],
  "dbm-samples": [...],
  ...
}
```

**Motivation**

As of DataDog/integrations-core#8627 and DataDog/integrations-core#8629, the postgres & mysql checks post statement sample payloads to the intake directly from Python. With this change, responsibility for posting payloads can move to the more robust agent Go code, with proper batching, buffering, retries, error handling, and tracking of statistics.
remeh added a commit to DataDog/datadog-agent that referenced this pull request Apr 16, 2021
* add new generic event platform aggregator API

* simplify

* remove debug log

* move json marshaling to check.go

* check enabled before lock

* refactor, add noop ep forwarder

* Update pkg/collector/check/stats.go

Co-authored-by: maxime mouial <[email protected]>

* remove purge during flush

* remove global

* Update rtloader/include/datadog_agent_rtloader.h

Co-authored-by: Rémy Mathieu <[email protected]>

* Update rtloader/common/builtins/aggregator.h

Co-authored-by: Rémy Mathieu <[email protected]>

* Update pkg/collector/check/stats.go

Co-authored-by: Rémy Mathieu <[email protected]>

* remove unnecessary

* rename lock

* refactor pipelines

* remove unnecessary nil check

* revert

* Update releasenotes/notes/event-platform-aggregator-api-33e92539f08ac5c2.yaml

Co-authored-by: Alexandre Yang <[email protected]>

* track processed

* move locking into ep forwarder

* move to top

* Update pkg/aggregator/aggregator.go

Co-authored-by: Alexandre Yang <[email protected]>

* remove read lock

* refactor error logging

* move to pkg/epforwarder

* update default dbm-metrics endpoint

* local var

Co-authored-by: maxime mouial <[email protected]>
Co-authored-by: Rémy Mathieu <[email protected]>
Co-authored-by: Alexandre Yang <[email protected]>