jobsprofiler: periodically persist aggregated stats during job execution #100126
Labels: A-jobs, C-enhancement, T-disaster-recovery

Comments
adityamaru added the C-enhancement and A-disaster-recovery labels on Mar 30, 2023
cc @cockroachdb/disaster-recovery
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Apr 4, 2023:
This change moves the TracingAggregator to `pkg/util/tracing`. It also moves the CapturedStack proto into `pkg/util/tracing` from `pkg/util/tracing/tracingpb` so that it can implement the AggregatorEvent interface. `tracingpb` cannot import `tracing` because of a dependency cycle.

Informs: cockroachdb#100126
Release note: None
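For context, a minimal sketch of what such an aggregation interface might look like in Go; the method names below are modeled on the commit description and are assumptions, not the exact CockroachDB API:

```go
package tracing

// AggregatorEvent is a sketch of an interface that structured trace
// events could implement so that an aggregator can maintain a rolling
// aggregate of them. All method names are illustrative.
type AggregatorEvent interface {
	// Identity returns a zero-valued event of the same type, suitable
	// as a starting point for aggregation.
	Identity() AggregatorEvent
	// Combine merges another event of the same type into the receiver.
	Combine(other AggregatorEvent)
	// Tag returns a name that uniquely identifies the event type.
	Tag() string
}
```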
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Apr 4, 2023:
This change teaches the backup processor to periodically flush collected AggregatorEvents. For the time being we only log the CapturedStack structured events, but this sets us up to persist other aggregated statistics in the future.

Informs: cockroachdb#100126
Release note: None
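Continuing the sketch above, a periodic flush of in-memory aggregates is typically driven by a ticker; the `Aggregator` type, the `Snapshot` accessor, and the 30-second interval below are all assumptions for illustration, not the actual backup processor code:

```go
package tracing

import (
	"context"
	"time"
)

// Aggregator is a stand-in for the tracing aggregator; Snapshot is a
// hypothetical accessor returning the current aggregate per event tag.
type Aggregator struct{}

func (a *Aggregator) Snapshot() map[string]AggregatorEvent { return nil }

// FlushLoop periodically hands the current aggregates to a flush
// callback (today a logger, later persistent storage).
func FlushLoop(ctx context.Context, agg *Aggregator, flush func(map[string]AggregatorEvent)) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			flush(agg.Snapshot())
		}
	}
}
```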
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Apr 13, 2023:
This change moves the TracingAggregator to `pkg/util/tracing`. It also moves the CapturedStack proto into `pkg/util/tracing` from `pkg/util/tracing/tracingpb` so that it can implement the AggregatorEvent interface. `tracingpb` cannot import `tracing` because of a dependency cycle. This change also teaches CapturedStack about the AggregatorEvent interface so that CapturedStacks emitted during job execution can be aggregated by tracing aggregators.

Informs: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Aug 1, 2023:
This change introduces a new meta type that allows processors in a DistSQL flow to send back `TracingAggregatorEvents`. These events capture valuable information about the current execution state of the job and will be exposed in a future commit for improved observability.

Currently, only the backup processor has been taught to periodically send metas of this type to the coordinator. In the future we will teach C2C, restore and import to do the same.

Informs: cockroachdb#100126
Release note: None
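Conceptually, such a meta record only needs to identify the event type and carry its encoded payload so the coordinator can decode and merge it; a hedged sketch, with field names assumed for illustration:

```go
// TracingAggregatorEventMeta sketches a DistSQL producer metadata
// payload carrying one aggregated event. Field names are illustrative,
// not the actual execinfrapb definition.
type TracingAggregatorEventMeta struct {
	Tag   string // identifies the event type, e.g. "ExportStats"
	Bytes []byte // proto-encoded TracingAggregatorEvent
}
```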
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Aug 1, 2023:
This commit teaches the coordinator of the backup job to listen for `TracingAggregatorEvents` metas from nodes that are executing in the DistSQL flow. Each `TracingAggregatorEvent` can be identified by its tag. The received metas are categorized by node and further categorized by tag so that we have an up-to-date in-memory representation of the latest `TracingAggregatorEvent` of each tag on each node.

Periodically, this in-memory state is flushed to the `system.job_info` table in both machine-readable and human-readable file formats:

- A file per node, for each aggregated TracingAggregatorEvent. These files contain the machine-readable proto bytes of the TracingAggregatorEvent.
- A text file that contains a cluster-wide and per-node summary of each TracingAggregatorEvent in its human-readable format. Example:

```
-- SQL Instance ID: 1; Flow ID: 831caaf5-75cd-4e00-9e11-9a7469727eb5
- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

-- Cluster-wide --
ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s
```

These files can be viewed and downloaded in the Advanced Debugging tab of the job details page. The files will help understand the execution state of the job at different points in time.

Some future work items that will build off this infrastructure are:

- Annotating the job's DistSQL diagram with per-processor stats.
- Displaying relevant stats in the job details page.
- Teaching restore, import and C2C jobs to also start persisting aggregator stats for improved observability.

Informs: cockroachdb#100126
Release note: None
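The per-node, per-tag bookkeeping described above amounts to a two-level map, with the cluster-wide summary computed by combining events across nodes. A minimal sketch reusing the AggregatorEvent interface from the earlier snippet; all identifiers are illustrative, and Combine is assumed to mutate its receiver in place:

```go
// perNodeAggregates tracks the latest aggregate of each event tag on
// each node: nodeID -> tag -> event.
type perNodeAggregates map[int32]map[string]AggregatorEvent

// record stores the most recent event received for a (node, tag) pair.
func (p perNodeAggregates) record(nodeID int32, tag string, ev AggregatorEvent) {
	if p[nodeID] == nil {
		p[nodeID] = make(map[string]AggregatorEvent)
	}
	p[nodeID][tag] = ev
}

// clusterWide combines the per-node aggregates of each tag into a
// single cluster-wide summary for the human-readable text file.
func (p perNodeAggregates) clusterWide() map[string]AggregatorEvent {
	out := make(map[string]AggregatorEvent)
	for _, byTag := range p {
		for tag, ev := range byTag {
			agg, ok := out[tag]
			if !ok {
				agg = ev.Identity()
				out[tag] = agg
			}
			agg.Combine(ev)
		}
	}
	return out
}
```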
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Aug 8, 2023:
This commit teaches the coordinator of the backup job to listen for `TracingAggregatorEvents` metas from nodes that are executing in the DistSQL flow. Each `TracingAggregatorEvent` can be identified by its tag. The received metas are categorized by node and further categorized by tag so that we have an up-to-date in-memory representation of the latest `TracingAggregatorEvent` of each tag on each node.

Periodically, this in-memory state is flushed to the `system.job_info` table in both machine-readable and human-readable file formats:

- A file per node, for each aggregated TracingAggregatorEvent. These files contain the machine-readable proto bytes of the TracingAggregatorEvent.
- A text file that contains a cluster-wide and per-node summary of each TracingAggregatorEvent in its human-readable format. Example:

```
-- SQL Instance ID: 1; Flow ID: 831caaf5-75cd-4e00-9e11-9a7469727eb5
- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

-- Cluster-wide --
ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s
```

These files can be viewed and downloaded in the Advanced Debugging tab of the job details page. The files will help understand the execution state of the job at different points in time.

Some future work items that will build off this infrastructure are:

- Annotating the job's DistSQL diagram with per-processor stats.
- Displaying relevant stats in the job details page.
- Teaching restore, import and C2C jobs to also start persisting aggregator stats for improved observability.

We are not equipped to handle special characters in the path of a status/admin server URL. To bypass this problem in the face of filenames with special characters, we move the filename from the path component of the URL to a query parameter.

Informs: cockroachdb#100126
Release note: None
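Moving the filename out of the URL path and into a query parameter sidesteps path-segment escaping; a small stand-alone illustration using Go's standard library (the endpoint path shown is made up):

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// A filename containing characters that are problematic in a path segment.
	filename := "aggregatorstats/ExportStats#1.binpb"

	// Instead of /job_info/files/<filename>, pass the name as a query
	// parameter; url.Values escapes it safely.
	u := url.URL{Path: "/job_info/files"}
	q := url.Values{}
	q.Set("filename", filename)
	u.RawQuery = q.Encode()
	fmt.Println(u.String())
	// Output: /job_info/files?filename=aggregatorstats%2FExportStats%231.binpb
}
```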
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Aug 8, 2023:
This change builds on top of cockroachdb#107994 and wires up each restore data processor to emit TracingAggregatorEvents to the job coordinator. These events are periodically flushed to files in the `job_info` table and are consumable via the DBConsole Job Details page.

Fixes: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue on Aug 9, 2023:
This change builds on top of cockroachdb#107994 and wires up each stream ingestion data processor to emit TracingAggregatorEvents to the frontier and subsequently the job coordinator. These events are periodically flushed to files in the `job_info` table and are consumable via the DBConsole Job Details page.

Currently, the only aggregator event that is propagated is the IngestionPerformanceStats emitted by the sst batcher.

Fixes: cockroachdb#100126
Release note: None
craig bot pushed a commit that referenced this issue on Aug 19, 2023:
107994: bulk,backupccl: process and persist aggregator stats r=stevendanna a=adityamaru

Please refer to individual commit messages.

Informs: #100126

108961: batcheval: make Get and {,Rev}Scan cmds read write r=nvanbenschoten a=arulajmani

Previously, Get, Scan, and RevScan commands were registered as read-only commands. This meant they only had access to a storage.Reader. With the imminent introduction of replicated locks, Get/Scan/RevScan requests that acquire replicated locks will need access to a storage.ReadWriter. In preparation, we now register them as read-write commands.

Note that non-replicated lock acquiring Get/Scan/RevScan commands will continue to go through the read-only execution path. This patch doesn't affect that behavior, as that distinction is made based on the request's flags.

We're losing some of the type safety on the read-only path for these requests that was added in 5e6e11c. We could bring it back in the future, but it'll likely not be in the current structure of how commands are registered. The current mechanism wasn't designed to have both a read-only and read-write variant for a single request type. For now, this patch shall do.

Epic: none
Release note: None

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
craig bot pushed a commit that referenced this issue on Aug 25, 2023:
108359: backupccl: hookup tracing aggregator events from the restore job r=stevendanna a=adityamaru

This change builds on top of #107994 and wires up each restore data processor to emit TracingAggregatorEvents to the job coordinator. These events are periodically flushed to files in the `job_info` table and are consumable via the DBConsole Job Details page.

Fixes: #100126
Release note: None

109291: streamingccl: stream span config checkpoints r=stevendanna a=msbutler

This patch modifies the span config event stream to emit a checkpoint event containing the rangefeed frontier after the event stream processes each rangefeed cache flush. The span config client can then use this information while processing updates. Specifically, the subscription.Next() call may return a checkpoint which indicates that all updates up to a given frontier have been emitted by the rangefeed.

This patch also fixes two bugs:

- prevents sending an empty batch of updates
- prevents sending system target span config updates

Informs #106823
Release note: None

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
craig bot pushed a commit that referenced this issue on Aug 28, 2023:
108458: streamingccl: hookup tracing aggregator events for the C2C job r=stevendanna a=adityamaru

This change builds on top of #107994 and wires up each stream ingestion data processor to emit TracingAggregatorEvents to the frontier and subsequently the job coordinator. These events are periodically flushed to files in the `job_info` table and are consumable via the DBConsole Job Details page. Currently, the only aggregator event that is propagated is the IngestionPerformanceStats emitted by the sst batcher.

Informs: #108374
Fixes: #100126
Release note: None

109529: concurrency: correctly establish joint claims when a lock is released r=nvanbenschoten a=arulajmani

This patch closes the loop on joint claims. In particular, it correctly handles which locking requests are allowed to proceed when a lock is released. We also handle the case where a request that holds a claim (but not the lock) drops out without acquiring the lock. The handling itself is simple -- the head of the locking requests wait queue that is compatible with each other is allowed to proceed. The compatible request(s) are said to have established a (possibly joint) claim.

Most of this patch is beefing up testing. Some of the testing additions here weren't strictly related to the code change.

Closes #102272
Epic: none
Release note: None

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Processors of some jobs have a TracingAggregator attached to their root ctx before starting execution. This aggregator is subscribed to listen for all StructuredEvents that are emitted in the associated trace, and it maintains a rolling aggregate of the StructuredEvents it is notified about. Today, this rolling aggregate is held in memory and is thrown away once the job completes execution.

With the introduction of the `job_info` table we should start periodically persisting these aggregated stats over the lifetime of the job. This will give us a timeseries of all the collected stats over the lifetime of the job. The information persisted in the `job_info` table can then be consumed at a future point by tooling that we build to analyze the performance of a job.

Epic: CRDB-8964

Jira issue: CRDB-26268
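As a rough illustration of the persistence scheme proposed here, each flush could write the current aggregates under a timestamped key so that successive flushes form a timeseries; the info-key layout and the writeJobInfo callback below are assumptions, not the actual `job_info` schema or accessors:

```go
package jobsprofiler

import (
	"context"
	"fmt"
	"time"
)

// persistAggregates writes the current per-tag aggregates under a
// timestamped info key so that repeated flushes build up a timeseries
// for the job. The key layout and writeJobInfo are hypothetical.
func persistAggregates(
	ctx context.Context,
	jobID int64,
	nodeID int32,
	aggregates map[string][]byte, // tag -> proto-encoded aggregate
	writeJobInfo func(ctx context.Context, jobID int64, infoKey string, value []byte) error,
) error {
	ts := time.Now().UnixNano()
	for tag, value := range aggregates {
		key := fmt.Sprintf("aggregatorstats/%d/%d/%s", ts, nodeID, tag)
		if err := writeJobInfo(ctx, jobID, key, value); err != nil {
			return err
		}
	}
	return nil
}
```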