Collect mysql statement samples & execution plans #8629
Conversation
Some suggestions to switch to a more active voice and to keep descriptions consistent.
```diff
@@ -124,6 +127,9 @@ def check(self, _):
         finally:
             self._conn = None
+
+    def cancel(self):
+        self._statement_samples.cancel()
```
@olivielpeau @hush-hush I added support for `cancel` here. I think we should still keep the "stop the thread if it notices the check is no longer running" logic as a fallback in case the main check stops running for any other reason (agent overload, or a bug somewhere that leaves it stuck).
OK, LGTM.
I don't have a strong opinion about the fallback, just a couple of thoughts: if the check is stuck it'll cause other problems anyway (it'll hog a check runner, etc.), though the fallback would at least free up some resources. If the check is inactive because of agent overload and the fallback mechanism stops the background thread, you'll want to make sure the background thread can be started up again cleanly when the check runs again.
> you'll want to make sure the background thread can be started up again cleanly when the check is run again.
Yes, that is the current behavior: on every check run, the check starts the thread if it is not currently running.
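That relaunch-if-dead behavior can be sketched with a small, self-contained class. This is illustrative only: the class and method names (`StatementSamples`, `start_if_not_running`) are assumptions, not the check's actual API.

```python
import threading


class StatementSamples:
    """Hedged sketch of the restart behavior described above."""

    def __init__(self):
        self._thread = None
        self._cancel_event = threading.Event()

    def _collection_loop(self):
        # stand-in for the real sampler loop: collect, sleep, repeat
        while not self._cancel_event.wait(timeout=0.05):
            pass

    def start_if_not_running(self):
        # called from every check run: if the thread was cancelled,
        # stopped by the fallback, or crashed, start a fresh one
        if self._thread is None or not self._thread.is_alive():
            self._cancel_event.clear()
            self._thread = threading.Thread(
                target=self._collection_loop, daemon=True
            )
            self._thread.start()

    def cancel(self):
        # mirrors the new cancel() hook: signal the loop to exit
        self._cancel_event.set()
```

Because `start_if_not_running` is idempotent while the thread is alive, calling it on every check run is safe, and it also recovers cleanly after a cancel or a fallback shutdown.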
Thanks @jtappa for the review!
👍🏻 from docs!
Co-authored-by: Jorie Helwig <[email protected]>
Add a new aggregator API through which checks can submit "event platform events" of various types. All supported `eventTypes` are hardcoded in `EventPlatformForwarder`. The `dbm-samples` and `dbm-metrics` events are expected to arrive fully serialized, so their pipelines are simply "HTTP passthrough" pipelines which skip all of the other features of logs pipelines like processing rules and encoding. Future event types will be able to add more detailed processing if they need it.

**Overall flow**

1. `aggregator.submit_event_platform_event(check_id, rawEvent, "{eventType}")` - python API. Here's how the postgres check would be updated to use it: DataDog/integrations-core#9045.
2. `BufferedAggregator` forwards events to the `EventPlatformForwarder`. Events are **dropped** here if `EventPlatformForwarder` is backed up for any reason.
3. `EventPlatformForwarder` forwards events to the pipeline for the given `eventType`, **dropping** events for unknown `eventTypes`.

**Internal Agent Stats**

*Prometheus*: `aggregator.flush - data_type:{eventType}, state:{ok|error}`
*ExpVar*: `EventPlatformEvents` & `EventPlatformEventsErrors`: counts by `eventType`

**User-Facing Agent Stats**

Statistics for each `eventType` will be tracked alongside other types of telemetry (`Service Checks`, `Series`, ...). Where appropriate, the raw `eventType` is translated to a human-readable name (i.e. `dbm-samples` --> `Database Monitoring Query Samples`).

`agent status` output:

```
=========
Collector
=========

  Running Checks
  ==============
    postgres (5.4.0)
    ----------------
      Instance ID: postgres:1df52d84fb6f603c [OK]
      Metric Samples: Last Run: 366, Total: 7,527
      Database Monitoring Query Samples: Last Run: 11, Total: 176
      ...

=========
Aggregator
=========
  Checks Metric Sample: 29,818
  Database Monitoring Query Samples: 473
  ...
```

`agent check {check_name}` output:

```
=== Metrics ===
...
=== Database Monitoring Query Samples ===
...
```

`agent check {check_name} --json` output will use the raw event types instead of the human-readable names:

```
"aggregator": {
  "metrics": [...],
  "dbm-samples": [...],
  ...
}
```

**Motivation**

The posting of statement samples payloads to the intake for postgres & mysql checks is done directly from python as of DataDog/integrations-core#8627 and DataDog/integrations-core#8629. With this change we'll be able to move responsibility for posting payloads to the more robust agent go code with proper batching, buffering, retries, error handling, and tracking of statistics.
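From the check's side, usage of the new API can be sketched as follows. Only `submit_event_platform_event` and the `dbm-samples` event type come from the description above; the helper function and the fake aggregator are illustrative.

```python
import json


def submit_dbm_sample(aggregator, check_id, event):
    # dbm-samples events must arrive fully serialized: the
    # "HTTP passthrough" pipeline does no further processing
    aggregator.submit_event_platform_event(
        check_id, json.dumps(event), "dbm-samples"
    )


class FakeAggregator:
    """Test double standing in for the real aggregator binding."""

    def __init__(self):
        self.calls = []

    def submit_event_platform_event(self, check_id, raw_event, event_type):
        self.calls.append((check_id, raw_event, event_type))


agg = FakeAggregator()
submit_dbm_sample(agg, "mysql:1df52d84fb6f603c", {"query_signature": "abc123"})
```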
What does this PR do?
Adds a new feature to "Deep Database Monitoring", enabling collection of statement samples and execution plans.
Follow-up to:
How it works
If enabled, a python thread is launched during a regular check run:

- the thread opens its own `pymysql` connection, as `pymysql` is not thread safe so the connection can't be shared with the main check
- execution plans are collected using `EXPLAIN` (if the check user ran `EXPLAIN` directly then it would need full write permission to all tables)

During one "collection" we do the following:

- query `performance_schema.events_statement_{current|history|history_long}` for recent statement events
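A single collection boils down to querying the chosen events table. A rough sketch follows; the query text is an assumption based on the `performance_schema` column names, not the check's exact SQL, and the fake cursor only exists so the sketch runs without a database:

```python
# Illustrative query; these columns exist in
# performance_schema.events_statements_history_long
EVENTS_STATEMENTS_QUERY = """
SELECT current_schema, digest, digest_text, sql_text, timer_wait
FROM performance_schema.events_statements_history_long
WHERE sql_text IS NOT NULL
"""


def collect_statement_samples(cursor):
    """Run one collection against an open DB-API cursor (e.g. pymysql)."""
    cursor.execute(EVENTS_STATEMENTS_QUERY)
    return cursor.fetchall()


class FakeCursor:
    """Stand-in cursor so the sketch runs without a database."""

    def execute(self, query):
        self.query = query

    def fetchall(self):
        return [("mydb", "d1", "SELECT ?", "SELECT 1", 42)]


rows = collect_statement_samples(FakeCursor())
```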
Rate Limiting
There are several different rate limits to keep load on the database to a minimum and to avoid reingesting duplicate events:
- `collections_per_second`: limits how often collections are done (each collection is a query to an `events_statements_*` table)
- `explained_statements_cache`: TTL limits how often we attempt to collect an execution plan for a given normalized query
- `seen_samples_cache`: TTL limits how often we ingest statement samples for the same normalized query and execution plan
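Both caches behave like TTL-keyed dedup sets: an entry suppresses further work for that key until it expires. A minimal stdlib sketch of the idea (the real check presumably uses a proper cache implementation; this class and its names are illustrative):

```python
import time


class TTLRateLimiter:
    """Allow an action for a given key at most once per `ttl` seconds."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._expiry = {}

    def acquire(self, key):
        # returns True only if the key is unseen or its entry expired
        now = time.monotonic()
        if self._expiry.get(key, 0) > now:
            return False
        self._expiry[key] = now + self.ttl
        return True


# seen_samples_cache-style dedup on (normalized query, execution plan)
seen = TTLRateLimiter(ttl=60)
```

With a key of `(normalized_query, plan)`, the first `acquire` for a pair succeeds and subsequent ones fail until the TTL elapses, which is exactly the "don't reingest duplicates" behavior described above.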
Events statements tables

The check will collect samples from the best available `events_statements` table. It's up to the user to enable the required `events_statements` consumers. Some managed databases like RDS Aurora replicas don't support `events_statements_history_long`, in which case we fall back to one of the other tables:

- `events_statements_history_long` - preferred, as it has the longest retention, giving us the highest chance of catching infrequent & fast queries
- `events_statements_current` - less likely to catch infrequent & fast queries
- `events_statements_history` - least preferred table

Configuration
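For illustration, enabling sampling in an instance config might look like the following. Apart from `collections_per_second`, which is named above, the option names and structure here are assumptions, not the check's documented configuration:

```yaml
instances:
  - host: localhost
    port: 3306
    username: datadog
    password: '<PASSWORD>'       # placeholder
    statement_samples:           # assumed option block
      enabled: true
      collections_per_second: 1  # rate limit described above
```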
Motivation
Collect statement samples & execution plans, enabling deeper insight into what's running on the database and how queries are being executed.
Review checklist (to be filled by reviewers)
- [ ] `changelog/` and `integration/` labels attached