
Collect mysql statement samples & execution plans #8629

Merged
merged 26 commits into master from djova/mysql-dbm-statement-samples
Mar 6, 2021

Conversation

djova
Contributor

@djova djova commented Feb 15, 2021

What does this PR do?

Adds a new feature to "Deep Database Monitoring", enabling collection of statement samples and execution plans.

Follow-up to:

How it works

If enabled, a Python thread is launched during a regular check run. The thread:

  • collects statement samples at the configured rate limit (default: 1 collection per second)
  • maintains its own pymysql connection, because pymysql is not thread safe and the connection can't be shared with the main check
  • collects execution plans through a MySQL procedure that the user must install into each monitored database (if the agent collected execution plans directly by running EXPLAIN, it would need full write permission to all tables)
  • shuts down if it detects that the main check has stopped running

During one "collection", the thread does the following (sketched after the list):

  1. read out all new statements from performance_schema.events_statements_{current|history|history_long}
  2. try to collect an execution plan for each statement
  3. submit events directly to the new database monitoring event intake
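
A minimal sketch of one collection pass, assuming a dedicated pymysql connection and the user-installed `datadog.explain_statement` procedure described above; `collect_statement_samples`, `submit_event`, and the event payload shape are illustrative, not the check's actual internals:

```python
import json

import pymysql


def collect_statement_samples(conn, submit_event):
    """One collection pass (illustrative sketch, not the integration's actual code)."""
    with conn.cursor() as cursor:
        # 1. Read out recent statements from the chosen events_statements table.
        #    The real check also filters out rows it has already seen.
        cursor.execute(
            "SELECT sql_text, digest_text, current_schema "
            "FROM performance_schema.events_statements_history_long LIMIT 5000"
        )
        rows = cursor.fetchall()

    for sql_text, digest_text, current_schema in rows:
        plan = None
        # 2. Try to collect an execution plan via the user-installed procedure, so the
        #    agent never needs write permission to run EXPLAIN itself.
        try:
            with conn.cursor() as cursor:
                cursor.execute("CALL datadog.explain_statement(%s)", (sql_text,))
                result = cursor.fetchone()
                plan = result[0] if result else None
        except pymysql.err.MySQLError:
            pass  # plan collection is best-effort
        # 3. Submit the event directly to the database monitoring event intake.
        submit_event(json.dumps({
            "db": {"statement": sql_text, "digest_text": digest_text,
                   "schema": current_schema, "plan": plan},
        }))
```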

Rate Limiting

There are several rate limits to keep load on the database to a minimum and to avoid re-ingesting duplicate events (a sketch follows the list):

  • collections_per_second: limits how often collections are done (each collection is a query to an events_statements_* table)
  • explained_statements_cache: a TTL cache that limits how often we attempt to collect an execution plan for a given normalized query
  • seen_samples_cache: a TTL cache that limits how often we ingest statement samples for the same normalized query and execution plan
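
A minimal sketch of how these limits could fit together, assuming `cachetools.TTLCache` for the two caches; the sizes, TTLs, and function names below are placeholders rather than the integration's defaults:

```python
import time

from cachetools import TTLCache

# Placeholder sizes and TTLs, not the integration's defaults.
collections_per_second = 1
explained_statements_cache = TTLCache(maxsize=5000, ttl=60)  # per normalized query
seen_samples_cache = TTLCache(maxsize=10000, ttl=60)         # per (normalized query, plan)

_last_collection = 0.0


def should_collect_now():
    """collections_per_second: at most one events_statements_* query per interval."""
    global _last_collection
    now = time.time()
    if now - _last_collection < 1.0 / collections_per_second:
        return False
    _last_collection = now
    return True


def should_explain(normalized_query):
    """explained_statements_cache: skip queries whose plan was collected recently."""
    if normalized_query in explained_statements_cache:
        return False
    explained_statements_cache[normalized_query] = True
    return True


def should_ingest(normalized_query, plan_signature):
    """seen_samples_cache: skip (query, plan) pairs already ingested recently."""
    key = (normalized_query, plan_signature)
    if key in seen_samples_cache:
        return False
    seen_samples_cache[key] = True
    return True
```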

Events statements tables

The check collects samples from the best available events_statements table. It's up to the user to enable the required events_statements consumers. Some managed databases, such as RDS Aurora replicas, don't support events_statements_history_long, in which case the check falls back to one of the other tables. The order of preference is (see the selection sketch below):

  1. events_statements_history_long - preferred, as it has the longest retention, giving the highest chance of catching infrequent and fast queries
  2. events_statements_current - less likely to catch infrequent and fast queries
  3. events_statements_history - least preferred table
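
The fallback could be resolved roughly as in the following sketch, which checks `performance_schema.setup_consumers` for enabled consumers; the function and constant names are illustrative, not the check's internals:

```python
# Preference order described above.
PREFERRED_TABLES = [
    "events_statements_history_long",
    "events_statements_current",
    "events_statements_history",
]


def choose_events_statements_table(conn, configured_table=None):
    """Pick the best available events_statements table, honoring a user override.

    `conn` is a pymysql (or other DB-API) connection to the monitored database.
    """
    if configured_table:
        return configured_table
    with conn.cursor() as cursor:
        # A consumer must be enabled for its events_statements_* table to be populated.
        cursor.execute(
            "SELECT name FROM performance_schema.setup_consumers "
            "WHERE name LIKE 'events_statements_%' AND enabled = 'YES'"
        )
        enabled = {row[0] for row in cursor.fetchall()}
    for table in PREFERRED_TABLES:
        if table in enabled:
            return table
    return None  # no usable consumer; statement samples can't be collected
```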

Configuration

```yaml
statement_samples:
  enabled: false
  # default rate depends on which events_statements table is being used. user can override.
  collections_per_second: -1
  # the best table is chosen automatically by the check. user can override.
  events_statements_table: ''
  events_statements_row_limit: 5000
  explain_procedure: 'explain_statement'
  fully_qualified_explain_procedure: 'datadog.explain_statement'
  events_statements_temp_table_name: 'datadog.temp_events'
  events_statements_enable_procedure: 'datadog.enable_events_statements_consumers'
```

Motivation

Collect statement samples & execution plans, enabling deeper insight into what's running on the database and how queries are being executed.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • PR title must be written as a CHANGELOG entry (see why)
  • File changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
  • PR must have changelog/ and integration/ labels attached

Contributor

@jtappa jtappa left a comment

Some suggestions to switch to a more active voice and consistency across descriptions.

Review comments (resolved) on mysql/assets/configuration/spec.yaml and mysql/datadog_checks/mysql/data/conf.yaml.example.
```
@@ -124,6 +127,9 @@ def check(self, _):
        finally:
            self._conn = None

    def cancel(self):
        self._statement_samples.cancel()
```
Contributor Author

@djova djova Mar 3, 2021

@olivielpeau @hush-hush I added support for cancel here. I think we should still keep the "stop thread if it notices the check is no longer running" logic as a fallback if the main check stops running for any other reason (agent overload, or a bug somewhere where it gets stuck).

Member

ok, LGTM.

I don't have a strong opinion about the fallback. Just a couple of thoughts: if the check is stuck it'll cause other problems (it'll hog a check runner, etc.), but the fallback would at least free up some resources, that's true. If the check is inactive because of an agent overload, and the fallback mechanism stops the background thread, you'll want to make sure the background thread can be started up again cleanly when the check is run again.

Contributor Author

you'll want to make sure the background thread can be started up again cleanly when the check is run again.

Yes, that is the current behavior: on every check run, the check starts the thread if it is not currently running (see the sketch below).
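
A minimal sketch of that restart-on-demand pattern (illustrative class and method names, not the integration's actual code):

```python
import threading
import time


class StatementSamplesCollector:
    """Illustrative restart pattern only, not the integration's actual class."""

    def __init__(self):
        self._thread = None
        self._cancel = False

    def run_sampler(self):
        # Called on every check run: (re)start the collection thread if it isn't alive.
        if self._thread is None or not self._thread.is_alive():
            self._cancel = False
            self._thread = threading.Thread(target=self._collection_loop, daemon=True)
            self._thread.start()

    def cancel(self):
        # Called when the check is unscheduled; the loop can also exit on its own if it
        # notices the main check has stopped running (the fallback discussed above).
        self._cancel = True

    def _collection_loop(self):
        while not self._cancel:
            # ... collect statement samples at the configured rate limit ...
            time.sleep(1)
```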

@djova
Contributor Author

djova commented Mar 3, 2021

Some suggestions to switch to a more active voice and consistency across descriptions.

Thanks @jtappa for the review!

jtappa
jtappa previously approved these changes Mar 4, 2021
Contributor

@jtappa jtappa left a comment

👍🏻 from docs!

@djova djova requested a review from a team as a code owner March 4, 2021 21:22
djova and others added 23 commits March 5, 2021 10:43
@djova djova force-pushed the djova/mysql-dbm-statement-samples branch from ff63029 to 37cd220 Compare March 5, 2021 15:53
@ofek ofek merged commit b878d2f into master Mar 6, 2021
@ofek ofek deleted the djova/mysql-dbm-statement-samples branch March 6, 2021 04:04
djova added a commit to DataDog/datadog-agent that referenced this pull request Apr 9, 2021
Add a new aggregator API through which checks can submit "event platform events" of various types.

All supported `eventTypes` are hardcoded in `EventPlatformForwarder`.

The `dbm-samples` and `dbm-metrics` events are expected to arrive fully serialized, so their pipelines are simple "HTTP passthrough" pipelines that skip the other features of logs pipelines, such as processing rules and encoding.

Future event types will be able to add more detailed processing if they need it.

**Overall flow**

1. `aggregator.submit_event_platform_event(check_id, rawEvent, "{eventType}")` - python API. Here's how the postgres check would be updated to use it: DataDog/integrations-core#9045.
2. `BufferedAggregator` forwards events to the `EventPlatformForwarder`. Events are **dropped** here if `EventPlatformForwarder` is backed up for any reason.
3. `EventPlatformForwarder` forwards events to the pipeline for the given `eventType`, **dropping** events for unknown `eventTypes`
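
For illustration, a minimal sketch of a check-side call following the signature quoted in step 1; the direct `import aggregator`, the payload shape, and the check_id are assumptions, not confirmed usage:

```python
import json

# `aggregator` is the Agent's builtin (rtloader) Python module, importable only when the
# check runs inside the Agent; the payload and check_id here are illustrative.
import aggregator

raw_event = json.dumps({"dbm_type": "sample", "db": {"statement": "SELECT 1"}})
aggregator.submit_event_platform_event(
    "postgres:1df52d84fb6f603c",  # check_id
    raw_event,                    # fully serialized event
    "dbm-samples",                # eventType; must be one hardcoded in EventPlatformForwarder
)
```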

**Internal Agent Stats**

*Prometheus*: `aggregator.flush - data_type:{eventType}, state:{ok|error}`

*ExpVar*: `EventPlatformEvents` & `EventPlatformEventsErrors`: counts by `eventType`

**User-Facing Agent Stats**

Statistics for each `eventType` will be tracked alongside other types of telemetry (`Service Checks`, `Series`, ...). Where appropriate, the raw `eventType` is translated to a human-readable name (e.g. `dbm-samples` --> `Database Monitoring Query Samples`).

`agent status` output:
```
=========
Collector
=========

  Running Checks
  ==============

    postgres (5.4.0)
    ----------------
      Instance ID: postgres:1df52d84fb6f603c [OK]
      Metric Samples: Last Run: 366, Total: 7,527
      Database Monitoring Query Samples: Last Run: 11, Total: 176
      ...

=========
Aggregator
=========
  Checks Metric Sample: 29,818
  Database Monitoring Query Samples: 473
  ...
```

`agent check {check_name}` output:

```
=== Metrics ===
...
=== Database Monitoring Query Samples ===
...
```

`agent check {check_name} --json` output will use the raw event types instead of the human-readable names:

```
"aggregator": {
  "metrics": [...],
  "dbm-samples": [...],
  ...
}
```

**Motivation**

As of DataDog/integrations-core#8627 and DataDog/integrations-core#8629, the postgres & mysql checks post statement sample payloads to the intake directly from Python. With this change, responsibility for posting payloads can move to the more robust agent Go code, with proper batching, buffering, retries, error handling, and tracking of statistics.
remeh added a commit to DataDog/datadog-agent that referenced this pull request Apr 16, 2021
* add new generic event platform aggregator API

* simplify

* remove debug log

* move json marshaling to check.go

* check enabled before lock

* refactor, add noop ep forwarder

* Update pkg/collector/check/stats.go

Co-authored-by: maxime mouial <[email protected]>

* remove purge during flush

* remove global

* Update rtloader/include/datadog_agent_rtloader.h

Co-authored-by: Rémy Mathieu <[email protected]>

* Update rtloader/common/builtins/aggregator.h

Co-authored-by: Rémy Mathieu <[email protected]>

* Update pkg/collector/check/stats.go

Co-authored-by: Rémy Mathieu <[email protected]>

* remove unnecessary

* rename lock

* refactor pipelines

* remove unnecessary nil check

* revert

* Update releasenotes/notes/event-platform-aggregator-api-33e92539f08ac5c2.yaml

Co-authored-by: Alexandre Yang <[email protected]>

* track processed

* move locking into ep forwarder

* move to top

* Update pkg/aggregator/aggregator.go

Co-authored-by: Alexandre Yang <[email protected]>

* remove read lock

* refactor error logging

* move to pkg/epforwarder

* update default dbm-metrics endpoint

* local var

Co-authored-by: maxime mouial <[email protected]>
Co-authored-by: Rémy Mathieu <[email protected]>
Co-authored-by: Alexandre Yang <[email protected]>