Collect mysql statement samples & execution plans (#8629)
* Collect mysql statement samples & execution plans

**What does this PR do?**

Adds a new feature to "Deep Database Monitoring", enabling collection of statement samples and execution plans. Follow-up to #7851.

**How it works**

If enabled, a Python thread is launched during a regular check run (a minimal sketch follows this list); it:
* collects statement samples at the configured rate limit (default: one collection per second)
* maintains its own `pymysql` connection, because `pymysql` is not thread-safe and the connection can't be shared with the main check
* collects execution plans through a MySQL procedure that the user must install into each database being monitored (if the agent collected execution plans directly by running `EXPLAIN`, it would need full write permission to all tables)
* shuts down if it detects that the main check has stopped running
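
A minimal sketch of that lifecycle, under the assumption of a simple sleep-based rate limit; the class and method names here are illustrative, not the check's actual API:

```python
import threading
import time


class StatementSamplerSketch:
    """Illustrative lifecycle only; not the check's real class or API."""

    def __init__(self, collections_per_second=1.0, shutdown_after_s=60):
        self._collection_interval = 1.0 / collections_per_second
        self._shutdown_after_s = shutdown_after_s
        self._last_check_run = 0.0
        self._thread = None

    def run_sampler(self):
        # Called from each main check run: record liveness, lazily start the thread.
        self._last_check_run = time.time()
        if self._thread is None or not self._thread.is_alive():
            self._thread = threading.Thread(target=self._loop, daemon=True)
            self._thread.start()

    def _loop(self):
        # The loop would open its own pymysql connection here: pymysql is not
        # thread-safe, so the main check's connection cannot be shared.
        while time.time() - self._last_check_run < self._shutdown_after_s:
            # Shuts down once the main check stops refreshing _last_check_run.
            self._collect_statement_samples()
            time.sleep(self._collection_interval)  # crude stand-in for a rate limiter

    def _collect_statement_samples(self):
        pass  # query events_statements_*, collect plans, submit events
```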

During one "collection" we do the following (sketched in code after the list):
1. read out all new statements from `performance_schema.events_statements_{current|history|history_long}`
1. try to collect an execution plan for each statement
1. submit events directly to the new database monitoring event intake
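
A hedged sketch of one collection pass, assuming the `events_statements_history_long` table; the column list, checkpointing, and helper callables (`collect_plan`, `submit_event`) are illustrative assumptions, not the check's real code:

```python
def one_collection(cursor, row_limit, last_timer_start, collect_plan, submit_event):
    # 1. read out all new statements since the previous collection
    cursor.execute(
        "SELECT current_schema, sql_text, timer_start "
        "FROM performance_schema.events_statements_history_long "
        "WHERE timer_start > %s LIMIT %s",
        (last_timer_start, row_limit),
    )
    max_timer_start = last_timer_start
    for schema, sql_text, timer_start in cursor.fetchall():
        # 2. try to collect an execution plan for each statement
        plan = collect_plan(schema, sql_text)  # may return None
        # 3. submit the event directly to the database monitoring event intake
        submit_event({"schema": schema, "query": sql_text, "plan": plan})
        max_timer_start = max(max_timer_start, timer_start)
    return max_timer_start  # checkpoint for the next collection
```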

**Rate Limiting**

There are several different rate limits to keep load on the database to a minimum and to avoid reingesting duplicate events (a sketch of the rate-limiting caches follows this list):
* `collections_per_second`: limits how often collections are done (each collection is a query to an `events_statements_*` table)
* `explained_statements_cache`: a TTL limits how often we attempt to collect an execution plan for a given normalized query
* `seen_samples_cache`: a TTL limits how often we ingest statement samples for the same normalized query and execution plan
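
The two deduplication caches could be implemented with TTL caches. A sketch assuming `cachetools` and the spec defaults below (60 plans and 15 samples per hour per query); `should_explain` and `should_submit` are illustrative names:

```python
from cachetools import TTLCache

# TTLs derived from the spec defaults: 60 plans and 15 samples per hour per query.
explained_statements_cache = TTLCache(maxsize=5000, ttl=3600.0 / 60)
seen_samples_cache = TTLCache(maxsize=10000, ttl=3600.0 / 15)


def should_explain(query_signature):
    # Skip queries whose normalized form was explained within the TTL window.
    if query_signature in explained_statements_cache:
        return False
    explained_statements_cache[query_signature] = True
    return True


def should_submit(query_signature, plan_signature):
    # Skip samples whose (query, plan) pair was already ingested within the window.
    key = (query_signature, plan_signature)
    if key in seen_samples_cache:
        return False
    seen_samples_cache[key] = True
    return True
```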

**Events statements tables**

The check will collect samples from the best available `events_statements` table. It's up to the user to enable the required `events_statements` consumers. Some managed databases, such as RDS Aurora replicas, don't support `events_statements_history_long`, in which case we fall back to one of the other tables, in the following order of preference (a selection sketch follows the list):
1. `events_statements_history_long` - preferred as it has the longest retention giving us the highest chance of catching infrequent & fast queries
1. `events_statements_current` - less likely to catch infrequent & fast queries
1. `events_statements_history` - least preferred table because it retains only the most recent statements per thread
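
A hedged sketch of that fallback order, with illustrative names; the real check also verifies that the matching consumer is enabled:

```python
PREFERRED_EVENTS_STATEMENTS_TABLES = (
    "events_statements_history_long",
    "events_statements_current",
    "events_statements_history",
)


def choose_events_statements_table(cursor, forced_table=None):
    if forced_table:
        return forced_table  # honor the events_statements_table override
    for table in PREFERRED_EVENTS_STATEMENTS_TABLES:
        try:
            cursor.execute("SELECT 1 FROM performance_schema.{} LIMIT 1".format(table))
        except Exception:
            continue  # table unsupported here (e.g. some managed databases)
        if cursor.fetchone():
            return table  # enabled and not empty
    return None
```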

**Configuration**

```yaml
statement_samples:
  enabled: false
  # default rate depends on which events_statements table is being used. user can override.
  collections_per_second: -1
  # the best table is chosen automatically by the check. user can override.
  events_statements_table: ''
  events_statements_row_limit: 5000
  explain_procedure: 'explain_statement'
  fully_qualified_explain_procedure: 'datadog.explain_statement'
  events_statements_temp_table_name: 'datadog.temp_events'
```
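
For illustration, a hedged sketch of how the configured explain procedures might be invoked, following the `{schema}.explain_statement({statement})` pattern described in the spec; the cursor handling and single-JSON-column return shape are assumptions:

```python
def collect_execution_plan(cursor, statement, schema=None,
                           explain_procedure='explain_statement',
                           fully_qualified_explain_procedure='datadog.explain_statement'):
    if schema:
        # Use the per-schema procedure when the connection has a default schema.
        proc = '{}.{}'.format(schema, explain_procedure)
    else:
        # Fall back to the fully qualified procedure otherwise.
        proc = fully_qualified_explain_procedure
    cursor.execute('CALL {}(%s)'.format(proc), (statement,))
    row = cursor.fetchone()
    return row[0] if row else None  # assumed: one row holding the JSON plan
```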

**Motivation**

Collect statement samples & execution plans, enabling deeper insight into what's running on the database and how queries are being executed.

* style

* Update mysql/assets/configuration/spec.yaml

Co-authored-by: Jorie Helwig <[email protected]>

* Update spec

* support cancel

* Conf example update from spec

* fix client race condition in tests

* remove unnecessary

* style

* use json dep

* bump base

Co-authored-by: Jorie Helwig <[email protected]>
djova and jtappa authored Mar 6, 2021
1 parent c0d96f9 commit b878d2f
Showing 12 changed files with 1,290 additions and 19 deletions.
105 changes: 104 additions & 1 deletion mysql/assets/configuration/spec.yaml
@@ -261,7 +261,110 @@ files:
            time: [100, 0]
            <METRIC_COLUMN>: [<TOP_K_LIMIT>, <BOTTOM_K_LIMIT>]
      display_default: false

    - name: statement_samples
      description: Configure collection of statement samples
      options:
        - name: enabled
          description: |
            Enables collection of statement samples. Requires `deep_database_monitoring: true`.
          value:
            type: boolean
            example: false
        - name: collections_per_second
          description: |
            Sets the maximum statement sample collection rate. Each collection involves a single query to one
            of the `performance_schema.events_statements_*` tables, followed by at most one `EXPLAIN` query per
            unique statement seen.
          value:
            type: number
            example: 1
        - name: explained_statements_per_hour_per_query
          description: |
            Sets the rate limit for how many execution plans will be collected per hour per normalized statement.
          value:
            type: integer
            example: 60
        - name: samples_per_hour_per_query
          description: |
            Sets the rate limit for how many statement sample events will be ingested per hour per normalized
            execution plan.
          value:
            type: integer
            example: 15
        - name: explained_statements_cache_maxsize
          description: |
            Sets the max size of the cache used for the explained_statements_per_hour_per_query rate limit. This
            should be increased for databases with a very large number of unique normalized queries which exceed the
            cache's limit.
          value:
            type: integer
            example: 5000
        - name: seen_samples_cache_maxsize
          description: |
            Sets the max size of the cache used for the samples_per_hour_per_query rate limit. This should be
            increased for databases with a very large number of unique normalized execution plans which exceed the
            cache's limit.
          value:
            type: integer
            example: 10000
        - name: events_statements_row_limit
          description: |
            Sets the maximum number of rows to read out from a `performance_schema.events_statements_*` table during
            a single collection.
          value:
            type: integer
            example: 5000
        - name: events_statements_table
          description: |
            Forces a specific events statements table. Must be one of events_statements_current,
            events_statements_history, events_statements_history_long. If not set then the agent will choose
            the best available table that is enabled and not empty.
          value:
            type: string
            example: events_statements_history_long
          display_default: null
        - name: explain_procedure
          description: |
            Overrides the default procedure used for collecting execution plans. The agent will use this
            procedure in each schema where it exists: `{schema}.explain_statement({statement})`.
          value:
            type: string
            example: explain_statement
        - name: fully_qualified_explain_procedure
          description: |
            Overrides the default fully qualified explain procedure used for collecting execution plans for
            statements sent from connections that do not have a default schema configured.
          value:
            type: string
            example: datadog.explain_statement
        - name: events_statements_enable_procedure
          description: |
            Overrides the default procedure used for enabling events statements consumers.
          value:
            type: string
            example: datadog.enable_events_statements_consumers
        - name: events_statements_temp_table_name
          description: |
            Overrides the default fully qualified name for the temp table the agent creates while collecting
            samples.
          value:
            type: string
            example: datadog.temp_events
        - name: collection_strategy_cache_maxsize
          description: |
            Sets the max size of the cache used for caching collection strategies. This value should be increased
            to be at least as many as the number of unique schemas that are being monitored.
          value:
            type: integer
            example: 1000
        - name: collection_strategy_cache_ttl
          description: |
            Sets how long to cache collection strategies. This should only be decreased if the set of enabled
            `events_statements_*` tables changes frequently enough to cause stale strategies to return empty
            results for an extended period of time.
          value:
            type: integer
            example: 300
    - template: instances/default

    - template: logs
2 changes: 2 additions & 0 deletions mysql/datadog_checks/mysql/config.py
@@ -30,6 +30,8 @@ def __init__(self, instance):
        self.charset = instance.get('charset')
        self.deep_database_monitoring = is_affirmative(instance.get('deep_database_monitoring', False))
        self.statement_metrics_limits = instance.get('statement_metrics_limits', None)
        self.statement_samples_config = instance.get('statement_samples', {}) or {}
        self.min_collection_interval = instance.get('min_collection_interval', 15)
        self.configuration_checks()

    def _build_tags(self, custom_tags):
90 changes: 90 additions & 0 deletions mysql/datadog_checks/mysql/data/conf.yaml.example
@@ -251,6 +251,96 @@ instances:
    #     - <TOP_K_LIMIT>
    #     - <BOTTOM_K_LIMIT>

    ## Configure collection of statement samples
    #
    statement_samples:

        ## @param enabled - boolean - optional - default: false
        ## Enables collection of statement samples. Requires `deep_database_monitoring: true`.
        #
        # enabled: false

        ## @param collections_per_second - number - optional - default: 1
        ## Sets the maximum statement sample collection rate. Each collection involves a single query to one
        ## of the `performance_schema.events_statements_*` tables, followed by at most one `EXPLAIN` query per
        ## unique statement seen.
        #
        # collections_per_second: 1

        ## @param explained_statements_per_hour_per_query - integer - optional - default: 60
        ## Sets the rate limit for how many execution plans will be collected per hour per normalized statement.
        #
        # explained_statements_per_hour_per_query: 60

        ## @param samples_per_hour_per_query - integer - optional - default: 15
        ## Sets the rate limit for how many statement sample events will be ingested per hour per normalized
        ## execution plan.
        #
        # samples_per_hour_per_query: 15

        ## @param explained_statements_cache_maxsize - integer - optional - default: 5000
        ## Sets the max size of the cache used for the explained_statements_per_hour_per_query rate limit. This
        ## should be increased for databases with a very large number of unique normalized queries which exceed the
        ## cache's limit.
        #
        # explained_statements_cache_maxsize: 5000

        ## @param seen_samples_cache_maxsize - integer - optional - default: 10000
        ## Sets the max size of the cache used for the samples_per_hour_per_query rate limit. This should be
        ## increased for databases with a very large number of unique normalized execution plans which exceed the
        ## cache's limit.
        #
        # seen_samples_cache_maxsize: 10000

        ## @param events_statements_row_limit - integer - optional - default: 5000
        ## Sets the maximum number of rows to read out from a `performance_schema.events_statements_*` table during
        ## a single collection.
        #
        # events_statements_row_limit: 5000

        ## @param events_statements_table - string - optional
        ## Forces a specific events statements table. Must be one of events_statements_current,
        ## events_statements_history, events_statements_history_long. If not set then the agent will choose
        ## the best available table that is enabled and not empty.
        #
        # events_statements_table: events_statements_history_long

        ## @param explain_procedure - string - optional - default: explain_statement
        ## Overrides the default procedure used for collecting execution plans. The agent will use this
        ## procedure in each schema where it exists: `{schema}.explain_statement({statement})`.
        #
        # explain_procedure: explain_statement

        ## @param fully_qualified_explain_procedure - string - optional - default: datadog.explain_statement
        ## Overrides the default fully qualified explain procedure used for collecting execution plans for
        ## statements sent from connections that do not have a default schema configured.
        #
        # fully_qualified_explain_procedure: datadog.explain_statement

        ## @param events_statements_enable_procedure - string - optional - default: datadog.enable_events_statements_consumers
        ## Overrides the default procedure used for enabling events statements consumers.
        #
        # events_statements_enable_procedure: datadog.enable_events_statements_consumers

        ## @param events_statements_temp_table_name - string - optional - default: datadog.temp_events
        ## Overrides the default fully qualified name for the temp table the agent creates while collecting
        ## samples.
        #
        # events_statements_temp_table_name: datadog.temp_events

        ## @param collection_strategy_cache_maxsize - integer - optional - default: 1000
        ## Sets the max size of the cache used for caching collection strategies. This value should be increased
        ## to be at least as many as the number of unique schemas that are being monitored.
        #
        # collection_strategy_cache_maxsize: 1000

        ## @param collection_strategy_cache_ttl - integer - optional - default: 300
        ## Sets how long to cache collection strategies. This should only be decreased if the set of enabled
        ## `events_statements_*` tables changes frequently enough to cause stale strategies to return empty
        ## results for an extended period of time.
        #
        # collection_strategy_cache_ttl: 300

    ## @param tags - list of strings - optional
    ## A list of tags to attach to every metric and service check emitted by this instance.
    ##
8 changes: 7 additions & 1 deletion mysql/datadog_checks/mysql/mysql.py
@@ -45,6 +45,7 @@
    SQL_WORKER_THREADS,
    show_replica_status_query,
)
from .statement_samples import MySQLStatementSamples
from .statements import MySQLStatementMetrics
from .version_utils import get_version

@@ -76,10 +77,11 @@ def __init__(self, name, init_config, instances):
        self._conn = None

        self._query_manager = QueryManager(self, self.execute_query_raw, queries=[], tags=self._config.tags)
        self._statement_metrics = MySQLStatementMetrics(self._config)
        self.check_initializations.append(self._query_manager.compile_queries)
        self.innodb_stats = InnoDBMetrics()
        self.check_initializations.append(self._config.configuration_checks)
        self._statement_metrics = MySQLStatementMetrics(self._config)
        self._statement_samples = MySQLStatementSamples(self, self._config, self._get_connection_args())

    def execute_query_raw(self, query):
        with closing(self._conn.cursor(pymysql.cursors.SSCursor)) as cursor:
@@ -111,6 +113,7 @@ def check(self, _):
            self._collect_system_metrics(self._config.host, db, self._config.tags)
            if self._config.deep_database_monitoring:
                self._collect_statement_metrics(db, self._config.tags)
                self._statement_samples.run_sampler(self.service_check_tags)

            # keeping track of these:
            self._put_qcache_stats()
@@ -124,6 +127,9 @@ def check(self, _):
        finally:
            self._conn = None

    def cancel(self):
        self._statement_samples.cancel()

    def _set_qcache_stats(self):
        host_key = self._get_host_key()
        qcache_st = self.qcache_stats.get(host_key, (None, None, None))
