jobsprofiler: periodically persist aggregated stats during job execution #100126

Closed
adityamaru opened this issue Mar 30, 2023 · 1 comment · Fixed by #108359 or #108458
Assignees
Labels
A-jobs C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-disaster-recovery

Comments

adityamaru (Contributor) commented Mar 30, 2023

Processors of some jobs have a TracingAggregator attached to their root ctx before starting execution. This aggregator subscribes to all StructuredEvents emitted in the associated trace and maintains a rolling aggregate of the events it is notified about. Today, this rolling aggregate is held in memory and is thrown away once the job completes execution. With the introduction of the job_info table we should start periodically persisting these aggregated stats over the lifetime of the job. This will give us a time series of all the collected stats over the lifetime of the job. The information persisted in the job_info table can then be consumed at a future point by tooling that we build to analyze the performance of a job.

Epic: CRDB-8964

Jira issue: CRDB-26268

@adityamaru adityamaru added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-disaster-recovery labels Mar 30, 2023
@adityamaru adityamaru self-assigned this Mar 30, 2023

blathers-crl bot commented Mar 30, 2023

cc @cockroachdb/disaster-recovery

@blathers-crl blathers-crl bot added the T-jobs label Apr 4, 2023
adityamaru added a commit to adityamaru/cockroach that referenced this issue Apr 4, 2023
This change moves the TracingAggregator to `pkg/util/tracing`.
It also moves the CapturedStack proto into `pkg/util/tracing`
from `pkg/util/tracing/tracingpb` so that it can implement
the AggregatorEvent interface. `tracingpb` cannot import `tracing`
because of a dependency cycle.

Release note: None
Informs: cockroachdb#100126
adityamaru added a commit to adityamaru/cockroach that referenced this issue Apr 4, 2023
This change teaches the backup processor to
periodically flush collected AggregatorEvents.
For the time being we only log the CapturedStack structured
events but this sets us up to persist other aggregated
statistics in the future.

Release note: None
Informs: cockroachdb#100126
adityamaru added a commit to adityamaru/cockroach that referenced this issue Apr 13, 2023
This change moves the TracingAggregator to `pkg/util/tracing`.
It also moves the CapturedStack proto into `pkg/util/tracing`
from `pkg/util/tracing/tracingpb` so that it can implement
the AggregatorEvent interface. `tracingpb` cannot import `tracing`
because of a dependency cycle.

This change also teaches the CapturedStack about the AggregatorEvent
so that CapturedStacks emitted during job execution can be aggregated
by tracing aggregators.

Release note: None
Informs: cockroachdb#100126
adityamaru added a commit to adityamaru/cockroach that referenced this issue Apr 13, 2023
This change teaches the backup processor to
periodically flush collected AggregatorEvents.
For the time being we only log the CapturedStack structured
events but this sets us up to persist other aggregated
statistics in the future.

Release note: None
Informs: cockroachdb#100126
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 1, 2023
This change introduces a new meta type that allows processors in
a DistSQL flow to send back `TracingAggregatorEvents`. These events
capture valuable information about the current execution state of
the job and will be exposed in a future commit for improved observability.

Currently, only the backup processor has been taught to periodically
send metas of this type to the coordinator. In the future we will teach
C2C, restore and import to do the same.

Informs: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 1, 2023
This commit teaches the coordinator of the backup job
to listen for `TracingAggregatorEvents` metas
from nodes that are executing in the DistSQL flow.
Each `TracingAggregatorEvent` can be identified by its
tag. The received metas are categorized by node and further
categorized by tag so that we have an up-to-date in-memory
representation of the latest `TracingAggregatorEvent` of each
tag on each node.

Periodically, this in-memory state is flushed to the `system.job_info`
table in both machine-readable and human-readable file formats:

- A file per node, for each aggregated TracingAggregatorEvent. These files
  contain the machine-readable proto bytes of the TracingAggregatorEvent.

- A text file that contains a cluster-wide and per-node summary of each
  TracingAggregatorEvent in its human-readable format.

Example:
```
-- SQL Instance ID: 1; Flow ID: 831caaf5-75cd-4e00-9e11-9a7469727eb5

- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

-- Cluster-wide

-- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

```

These files can be viewed and downloaded in the Advanced Debugging
tab of the job details page. These files will help in understanding the execution
state of the job at different points in time.

Some future work items that will build off this infrastructure are:
- Annotating the job's DistSQL diagram with per-processor stats.
- Displaying relevant stats in the job details page.
- Teaching restore, import and C2C jobs to also start persisting aggregator
  stats for improved observability.

Informs: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 3, 2023
This change introduces a new meta type that allows processors in
a DistSQL flow to send back `TracingAggregatorEvents`. These events
capture valuable information about the current execution state of
the job and will be exposed in a future commit for improved observability.

Currently, only the backup processor has been taught to periodically
send metas of this type to the coordinator. In the future we will teach
C2C, restore and import to do the same.

Informs: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 8, 2023
This commit teaches the coordinator of the backup job
to listen for `TracingAggregatorEvents` metas
from nodes that are executing in the DistSQL flow.
Each `TracingAggregatorEvent` can be identified by its
tag. The received metas are categorized by node and further
categorized by tag so that we have an up-to-date in-memory
representation of the latest `TracingAggregatorEvent` of each
tag on each node.

Periodically, this in-memory state is flushed to the `system.job_info`
table in both machine-readable and human-readable file formats:

- A file per node, for each aggregated TracingAggregatorEvent. These files
  contain the machine-readable proto bytes of the TracingAggregatorEvent.

- A text file that contains a cluster-wide and per-node summary of each
  TracingAggregatorEvent in its human-readable format.

Example:
```
-- SQL Instance ID: 1; Flow ID: 831caaf5-75cd-4e00-9e11-9a7469727eb5

- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

-- Cluster-wide

-- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

```

These files can be viewed and downloaded in the Advanced Debugging
tab of the job details page. These files will help in understanding the execution
state of the job at different points in time.

Some future work items that will build off this infrastructure are:
- Annotating the job's DistSQL diagram with per-processor stats.
- Displaying relevant stats in the job details page.
- Teaching restore, import and C2C jobs to also start persisting aggregator
  stats for improved observability.

We are not equipped to handle special characters
in the path of a status/admin server URL. To bypass
this problem in the face of filenames with special
characters we move the filename from the path component
of the URL to a query parameter.

Informs: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 8, 2023
This change builds on top of cockroachdb#107994 and wires up each restore
data processor to emit TracingAggregatorEvents to the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Fixes: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 9, 2023
This change builds on top of cockroachdb#107994 and wires up each restore
data processor to emit TracingAggregatorEvents to the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Fixes: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 9, 2023
This change builds on top of cockroachdb#107994 and wires up each stream
ingestion data processor to emit TracingAggregatorEvents to
the frontier and subsequently the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Currently, the only aggregator event that is propagated is the
IngestionPerformanceStats emitted by the sst batcher.

Fixes: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 18, 2023
This change introduces a new meta type that allows processors in
a DistSQL flow to send back `TracingAggregatorEvents`. These events
capture valuable information about the current execution state of
the job and will be exposed in a future commit for improved observability.

Currently, only the backup processor has been taught to periodically
send metas of this type to the coordinator. In the future we will teach
C2C, restore and import to do the same.

Informs: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 18, 2023
This commit teaches the coordinator of the backup job
to listen for `TracingAggregatorEvents` metas
from nodes that are executing in the DistSQL flow.
Each `TracingAggregatorEvent` can be identified by its
tag. The received metas are categorized by node and further
categorized by tag so that we have an up-to-date in-memory
representation of the latest `TracingAggregatorEvent` of each
tag on each node.

Periodically, this in-memory state is flushed to the `system.job_info`
table in both machine-readable and human-readable file formats:

- A file per node, for each aggregated TracingAggregatorEvent. These files
  contain the machine-readable proto bytes of the TracingAggregatorEvent.

- A text file that contains a cluster-wide and per-node summary of each
  TracingAggregatorEvent in its human-readable format.

Example:
```
-- SQL Instance ID: 1; Flow ID: 831caaf5-75cd-4e00-9e11-9a7469727eb5

- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

-- Cluster-wide

-- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

```

These files can be viewed and downloaded in the Advanced Debugging
tab of the job details page. These files will help in understanding the execution
state of the job at different points in time.

Some future work items that will build off this infrastructure are:
- Annotating the job's DistSQL diagram with per-processor stats.
- Displaying relevant stats in the job details page.
- Teaching restore, import and C2C jobs to also start persisting aggregator
  stats for improved observability.

We are not equipped to handle special characters
in the path of a status/admin server URL. To bypass
this problem in the face of filenames with special
characters we move the filename from the path component
of the URL to a query parameter.

Informs: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 18, 2023
This change introduces a new meta type that allows processors in
a DistSQL flow to send back `TracingAggregatorEvents`. These events
capture valuable information about the current execution state of
the job and will be exposed in a future commit for improved observability.

Currently, only the backup processor has been taught to periodically
send metas of this type to the coordinator. In the future we will teach
C2C, restore and import to do the same.

Informs: cockroachdb#100126
Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 18, 2023
This commit teaches the coordinator of the backup job
to listen for `TracingAggregatorEvents` metas
from nodes that are executing in the DistSQL flow.
Each `TracingAggregatorEvent` can be identified by its
tag. The received metas are categorized by node and further
categorized by tag so that we have an up-to-date in-memory
representation of the latest `TracingAggregatorEvent` of each
tag on each node.

Periodically, this in-memory state is flushed to the `system.job_info`
table in both machine-readable and human-readable file formats:

- A file per node, for each aggregated TracingAggregatorEvent. These files
  contain the machine-readable proto bytes of the TracingAggregatorEvent.

- A text file that contains a cluster-wide and per-node summary of each
  TracingAggregatorEvent in its human-readable format.

Example:
```
-- SQL Instance ID: 1; Flow ID: 831caaf5-75cd-4e00-9e11-9a7469727eb5

- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

-- Cluster-wide

-- ExportStats
num_files: 443
data_size: 1639.29 MB
throughput: 54.63 MB/s

```

These files can be viewed and downloaded in the Advanced Debugging
tab of the job details page. These files will help in understanding the execution
state of the job at different points in time.

Some future work items that will build off this infrastructure are:
- Annotating the job's DistSQL diagram with per-processor stats.
- Displaying relevant stats in the job details page.
- Teaching restore, import and C2C jobs to also start persisting aggregator
  stats for improved observability.

We are not equipped to handle special characters
in the path of a status/admin server URL. To bypass
this problem in the face of filenames with special
characters we move the filename from the path component
of the URL to a query parameter.

Informs: cockroachdb#100126
Release note: None
craig bot pushed a commit that referenced this issue Aug 19, 2023
107994: bulk,backupccl: process and persist aggregator stats r=stevendanna a=adityamaru

Please refer to individual commit messages.

Informs: #100126

108961: batcheval: make Get and {,Rev}Scan cmds read write r=nvanbenschoten a=arulajmani

Previously, Get, Scan, and RevScan commands were registered as read-only commands. This meant they only had access to a storage.Reader. With the imminent introduction of replicated locks, Get/Scan/RevScan requests that acquire replicated locks will need access to a storage.ReadWriter. In preparation, we now register them as read write commands.

Note that non-replicated lock acquiring Get/Scan/RevScan commands will continue to go through the read-only execution path. This patch doesn't affect that behavior, as that distinction is made based on the request's flags.

We're losing some of the type safety on the read-only path for these requests that was added in 5e6e11c. We could bring it back in the future, but it'll likely not be in the current structure of how commands are registered. The current mechanism wasn't designed to have both a read-only and read-write variant for a single request type. For now, this patch shall do.

Epic: none

Release note: None

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
craig bot pushed a commit that referenced this issue Aug 25, 2023
108359: backupccl: hookup tracing aggregator events from the restore job r=stevendanna a=adityamaru

This change builds on top of #107994 and wires up each restore
data processor to emit TracingAggregatorEvents to the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Fixes: #100126
Release note: None

109291: streamingccl: stream span config checkpoints r=stevendanna a=msbutler

This patch modifies the span config event stream to emit a checkpoint event
containing the rangefeed frontier after the event stream processes each
rangefeed cache flush.

The span config client can then use this information while processing updates.
Specifically, the subscription.Next() call may return a
checkpoint which indicates that all updates up to a given frontier have been
emitted by the rangefeed.

This patch also fixes two bugs:
- prevents sending an empty batch of updates
- prevents sending system target span config updates

Informs #106823

Release note: None

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
@craig craig bot closed this as completed in e89c74f Aug 25, 2023
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 25, 2023
This change builds on top of cockroachdb#107994 and wires up each stream
ingestion data processor to emit TracingAggregatorEvents to
the frontier and subsequently the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Currently, the only aggregator event that is propagated is the
IngestionPerformanceStats emitted by the sst batcher.

Fixes: cockroachdb#100126
Release note: None
craig bot pushed a commit that referenced this issue Aug 28, 2023
108458: streamingccl: hookup tracing aggregator events for the C2C job r=stevendanna a=adityamaru

This change builds on top of #107994 and wires up each stream
ingestion data processor to emit TracingAggregatorEvents to
the frontier and subsequently the job coordinator.
These events are periodically flushed to files in the `job_info`
table and are consumable via the DBConsole Job Details page.

Currently, the only aggregator event that is propagated is the
IngestionPerformanceStats emitted by the sst batcher.

Informs: #108374
Fixes: #100126
Release note: None

109529: concurrency: correctly establish joint claims when a lock is released r=nvanbenschoten a=arulajmani

This patch closes the loop on joint claims. In particular, it correctly
handles which locking requests are allowed to proceed when a lock is
released. We also handle the case where a request that holds a claim
(but not the lock) drops out without acquiring the lock. The handling
itself is simple -- the head of the locking requests wait queue that is
compatible with each other is allowed to proceed. The compatible
request(s) are said to have established a (possibly joint) claim.

Most of this patch is beefing up testing. Some of the testing additions
here weren't strictly related to the code change.

Closes #102272

Epic: none

Release note: None

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>