jobsprofiler: periodically persist aggregated stats during job execution #100126

Closed
adityamaru opened this issue Mar 30, 2023 · 1 comment · Fixed by #108359 or #108458
Assignees
Labels
A-jobs C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-disaster-recovery

Comments

adityamaru (Contributor) commented Mar 30, 2023

Processors of some jobs have a TracingAggregator attached to their root ctx before starting execution. This aggregator subscribes to all StructuredEvents emitted in the associated trace and maintains a rolling aggregate of the events it is notified about. Today, this rolling aggregate is held in memory and is thrown away once the job completes execution. With the introduction of the job_info table we should start periodically persisting these aggregated stats over the lifetime of the job. This will give us a time series of all the collected stats over the lifetime of the job. The information persisted in the job_info table can then be consumed at a future point by tooling that we build to analyze the performance of a job.

Epic: CRDB-8964

Jira issue: CRDB-26268

@adityamaru adityamaru added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-disaster-recovery labels Mar 30, 2023
@adityamaru adityamaru self-assigned this Mar 30, 2023

blathers-crl bot commented Mar 30, 2023

cc @cockroachdb/disaster-recovery

@blathers-crl blathers-crl bot added the T-jobs label Apr 4, 2023
adityamaru added a commit to adityamaru/cockroach that referenced this issue Apr 4, 2023
This change moves the TracingAggregator to `pkg/util/tracing`.
It also moves the CapturedStack proto into `pkg/util/tracing`
from `pkg/util/tracing/tracingpb` so that it can implement
the AggregatorEvent interface. `tracingpb` cannot import `tracing`
because of a dependency cycle.

Release note: None
Informs: cockroachdb#100126
adityamaru added a commit to adityamaru/cockroach that referenced this issue Apr 4, 2023
This change teaches the backup processor to
periodically flush collected AggregatorEvents.
For the time being we only log the CapturedStack structured
events but this sets us up to persist other aggregated
statistics in the future.

Release note: None
Informs: cockroachdb#100126
adityamaru added a commit to adityamaru/cockroach that referenced this issue Apr 13, 2023
This change moves the TracingAggregator to `pkg/util/tracing`.
It also moves the CapturedStack proto into `pkg/util/tracing`
from `pkg/util/tracing/tracingpb` so that it can implement
the AggregatorEvent interface. `tracingpb` cannot import `tracing`
because of a dependency cycle.

This change also teaches the CapturedStack about the AggregatorEvent
so that CapturedStacks emitted during job execution can be aggregated
by tracing aggregators.

Release note: None
Informs: cockroachdb#100126
adityamaru added a commit to adityamaru/cockroach that referenced this issue Apr 13, 2023
This change teaches the backup processor to
periodically flush collected AggregatorEvents.
For the time being we only log the CapturedStack structured
events but this sets us up to persist other aggregated
statistics in the future.

Release note: None
Informs: cockroachdb#100126
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 1, 2023
This change introduces a new meta type that allows processors in
a DistSQL flow to send back `TracingAggregatorEvents`. These events
capture valuable information about the current execution state of
the job and will be exposed in a future commit for improved observability.

Currently, only the backup processor has been taught to periodically
send metas of this type to the coordinator. In the future we will teach
C2C, restore and import to do the same.

Informs: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 1, 2023
This commit teaches the coordinator of the backup job
to listen for `TracingAggregatorEvents` metas
from nodes that are executing in the DistSQL flow.
Each `TracingAggregatorEvent` can be identified by its
tag. The received metas are categorized by node and further
categorized by tag so that we have an up-to-date in-memory
representation of the latest `TracingAggregatorEvent` of each
tag on each node.

Periodically, this in-memory state is flushed to the `system.job_info`
table in both machine-readable and human-readable file formats:

- A file per node, for each aggregated TracingAggregatorEvent. These files
  contain the machine-readable proto bytes of the TracingAggregatorEvent.

- A text file that contains a cluster-wide and per-node summary of each
  TracingAggregatorEvent in its human-readable format.

Example:
```
-- SQL Instance ID: 1; Flow ID: 831caaf5-75cd-4e00-9e11-9a7469727eb5

- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

-- Cluster-wide

-- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

```

These files can be viewed and downloaded in the Advanced Debugging
tab of the job details page. These files will help in understanding the execution
state of the job at different points in time.

Some future work items that will build off this infrastructure are:
- Annotating the job's DistSQL diagram with per-processor stats.
- Displaying relevant stats in the job details page.
- Teaching restore, import and C2C jobs to also start persisting aggregator
  stats for improved observability.

Informs: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 3, 2023
This change introduces a new meta type that allows processors in
a DistSQL flow to send back `TracingAggregatorEvents`. These events
capture valuable information about the current execution state of
the job and will be exposed in a future commit for improved observability.

Currently, only the backup processor has been taught to periodically
send metas of this type to the coordinator. In the future we will teach
C2C, restore and import to do the same.

Informs: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 8, 2023
This commit teaches the coordinator of the backup job
to listen for `TracingAggregatorEvents` metas
from nodes that are executing in the DistSQL flow.
Each `TracingAggregatorEvent` can be identified by its
tag. The received metas are categorized by node and further
categorized by tag so that we have an up-to-date in-memory
representation of the latest `TracingAggregatorEvent` of each
tag on each node.

Periodically, this in-memory state is flushed to the `system.job_info`
table in both machine-readable and human-readable file formats:

- A file per node, for each aggregated TracingAggregatorEvent. These files
  contain the machine-readable proto bytes of the TracingAggregatorEvent.

- A text file that contains a cluster-wide and per-node summary of each
  TracingAggregatorEvent in its human-readable format.

Example:
```
-- SQL Instance ID: 1; Flow ID: 831caaf5-75cd-4e00-9e11-9a7469727eb5

- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

-- Cluster-wide

-- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

```

These files can be viewed and downloaded in the Advanced Debugging
tab of the job details page. These files will help in understanding the execution
state of the job at different points in time.

Some future work items that will build off this infrastructure are:
- Annotating the job's DistSQL diagram with per-processor stats.
- Displaying relevant stats in the job details page.
- Teaching restore, import and C2C jobs to also start persisting aggregator
  stats for improved observability.

We are not equipped to handle special characters
in the path of a status/admin server URL. To bypass
this problem in the face of filenames with special
characters we move the filename from the path component
of the URL to a query parameter.

Informs: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 8, 2023
This change builds on top of cockroachdb#107994 and wires up each restore
data processor to emit TracingAggregatorEvents to the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Fixes: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 9, 2023
This change builds on top of cockroachdb#107994 and wires up each restore
data processor to emit TracingAggregatorEvents to the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Fixes: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 9, 2023
This change builds on top of cockroachdb#107994 and wires up each stream
ingestion data processor to emit TracingAggregatorEvents to
the frontier and subsequently the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Currently, the only aggregator event that is propagated is the
IngestionPerformanceStats emitted by the sst batcher.

Fixes: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 18, 2023
This change introduces a new meta type that allows processors in
a DistSQL flow to send back `TracingAggregatorEvents`. These events
capture valuable information about the current execution state of
the job and will be exposed in a future commit for improved observability.

Currently, only the backup processor has been taught to periodically
send metas of this type to the coordinator. In the future we will teach
C2C, restore and import to do the same.

Informs: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 18, 2023
This commit teaches the coordinator of the backup job
to listen for `TracingAggregatorEvents` metas
from nodes that are executing in the DistSQL flow.
Each `TracingAggregatorEvent` can be identified by its
tag. The received metas are categorized by node and further
categorized by tag so that we have an up-to-date in-memory
representation of the latest `TracingAggregatorEvent` of each
tag on each node.

Periodically, this in-memory state is flushed to the `system.job_info`
table in both machine-readable and human-readable file formats:

- A file per node, for each aggregated TracingAggregatorEvent. These files
  contain the machine-readable proto bytes of the TracingAggregatorEvent.

- A text file that contains a cluster-wide and per-node summary of each
  TracingAggregatorEvent in its human-readable format.

Example:
```
-- SQL Instance ID: 1; Flow ID: 831caaf5-75cd-4e00-9e11-9a7469727eb5

- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

-- Cluster-wide

-- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

```

These files can be viewed and downloaded in the Advanced Debugging
tab of the job details page. These files will help in understanding the execution
state of the job at different points in time.

Some future work items that will build off this infrastructure are:
- Annotating the job's DistSQL diagram with per-processor stats.
- Displaying relevant stats in the job details page.
- Teaching restore, import and C2C jobs to also start persisting aggregator
  stats for improved observability.

We are not equipped to handle special characters
in the path of a status/admin server URL. To bypass
this problem in the face of filenames with special
characters we move the filename from the path component
of the URL to a query parameter.

Informs: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 18, 2023
This change introduces a new meta type that allows processors in
a DistSQL flow to send back `TracingAggregatorEvents`. These events
capture valuable information about the current execution state of
the job and will be exposed in a future commit for improved observability.

Currently, only the backup processor has been taught to periodically
send metas of this type to the coordinator. In the future we will teach
C2C, restore and import to do the same.

Informs: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 18, 2023
This commit teaches the coordinator of the backup job
to listen for `TracingAggregatorEvents` metas
from nodes that are executing in the DistSQL flow.
Each `TracingAggregatorEvent` can be identified by its
tag. The received metas are categorized by node and further
categorized by tag so that we have an up-to-date in-memory
representation of the latest `TracingAggregatorEvent` of each
tag on each node.

Periodically, this in-memory state is flushed to the `system.job_info`
table in both machine-readable and human-readable file formats:

- A file per node, for each aggregated TracingAggregatorEvent. These files
  contain the machine-readable proto bytes of the TracingAggregatorEvent.

- A text file that contains a cluster-wide and per-node summary of each
  TracingAggregatorEvent in its human-readable format.

Example:
```
-- SQL Instance ID: 1; Flow ID: 831caaf5-75cd-4e00-9e11-9a7469727eb5

- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

-- Cluster-wide

-- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

```

These files can be viewed and downloaded in the Advanced Debugging
tab of the job details page. These files will help in understanding the execution
state of the job at different points in time.

Some future work items that will build off this infrastructure are:
- Annotating the job's DistSQL diagram with per-processor stats.
- Displaying relevant stats in the job details page.
- Teaching restore, import and C2C jobs to also start persisting aggregator
  stats for improved observability.

We are not equipped to handle special characters
in the path of a status/admin server URL. To bypass
this problem in the face of filenames with special
characters we move the filename from the path component
of the URL to a query parameter.

Informs: cockroachdb#100126
Release note: None
craig bot pushed a commit that referenced this issue Aug 19, 2023
107994: bulk,backupccl: process and persist aggregator stats r=stevendanna a=adityamaru

Please refer to individual commit messages.

Informs: #100126

108961: batcheval: make Get and {,Rev}Scan cmds read write r=nvanbenschoten a=arulajmani

Previously, Get, Scan, and RevScan commands were registered as read-only commands. This meant they only had access to a storage.Reader. With the imminent introduction of replicated locks, Get/Scan/RevScan requests that acquire replicated locks will need access to a storage.ReadWriter. In preparation, we now register them as read write commands.

Note that non-replicated lock acquiring Get/Scan/RevScan commands will continue to go through the read-only execution path. This patch doesn't affect that behavior, as that distinction is made based on the request's flags.

We're losing some of the type safety on the read-only path for these requests that was added in 5e6e11c. We could bring it back in the future, but it'll likely not be in the current structure of how commands are registered. The current mechanism wasn't designed to have both a read-only and read-write variant for a single request type. For now, this patch shall do.

Epic: none

Release note: None

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
craig bot pushed a commit that referenced this issue Aug 25, 2023
108359: backupccl: hookup tracing aggregator events from the restore job r=stevendanna a=adityamaru

This change builds on top of #107994 and wires up each restore
data processor to emit TracingAggregatorEvents to the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Fixes: #100126
Release note: None

109291: streamingccl: stream span config checkpoints r=stevendanna a=msbutler

This patch modifies the span config event stream to emit a checkpoint event
containing the rangefeed frontier after the event stream processes each
rangefeed cache flush.

The span config client can then use this information while processing updates.
Specifically, the subscription.Next() call may return a
checkpoint which indicates that all updates up to a given frontier have been
emitted by the rangefeed.

This patch also fixes two bugs:
- prevents sending an empty batch of updates
- prevents sending system target span config updates

Informs #106823

Release note: None

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
@craig craig bot closed this as completed in e89c74f Aug 25, 2023
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 25, 2023
This change builds on top of cockroachdb#107994 and wires up each stream
ingestion data processor to emit TracingAggregatorEvents to
the frontier and subsequently the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Currently, the only aggregator event that is propagated is the
IngestionPerformanceStats emitted by the sst batcher.

Fixes: cockroachdb#100126
Release note: None
craig bot pushed a commit that referenced this issue Aug 28, 2023
108458: streamingccl: hookup tracing aggregator events for the C2C job r=stevendanna a=adityamaru

This change builds on top of #107994 and wires up each stream
ingestion data processor to emit TracingAggregatorEvents to
the frontier and subsequently the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Currently, the only aggregator event that is propagated is the
IngestionPerformanceStats emitted by the sst batcher.

Informs: #108374
Fixes: #100126
Release note: None

109529: concurrency: correctly establish joint claims when a lock is released r=nvanbenschoten a=arulajmani

This patch closes the loop on joint claims. In particular, it correctly
handles which locking requests are allowed to proceed when a lock is
released. We also handle the case where a request that holds a claim
(but not the lock) drops out without acquiring the lock. The handling
itself is simple -- the head of the locking requests wait queue that is
compatible with each other is allowed to proceed. The compatible
request(s) are said to have established a (possibly joint) claim.

Most of this patch is beefing up testing. Some of the testing additions
here weren't strictly related to the code change.

Closes #102272

Epic: none

Release note: None

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>