jobs: excessive dsp-diag-url profiler rows #126083

Closed
dt opened this issue Jun 23, 2024 · 3 comments · Fixed by #126084
Labels
A-jobs
branch-release-23.1: Used to mark GA and release blockers, technical advisories, and bugs for 23.1
branch-release-23.2: Used to mark GA and release blockers, technical advisories, and bugs for 23.2
branch-release-24.1: Used to mark GA and release blockers, technical advisories, and bugs for 24.1
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
P-1: Issues/test failures with a fix SLA of 1 month
T-jobs

Comments

@dt
Member

dt commented Jun 23, 2024

In 23.1 we started storing the physical plan for many job types in rows of the job info table, under each job's prefix, to aid in debugging. These rows are created every time the job plans a physical flow, and are generally deleted when the job is deleted. For very long-running jobs that replan many times over their lifetime, such as a changefeed that runs for months or years and encounters errors, node restarts, rolling upgrades, etc., the number of persisted plans can grow large. This is currently compounded significantly by the fact that the diagram URLs for changefeeds in particular can be massive, even for a single plan, because nothing limits the detail copied from the spec into the diagram URL, such as the number or length of spans.

At one point we considered deleting "debugging" data like these persisted plans even before the job itself and all of its data were deleted. Indeed, in 23.2 we started prefixing these plan rows, and other similar rows persisted only for later manual debugging inspection, with the character ~, so that this prefix could be scanned and culled if we later added a clean-up loop, similar to the one for terminal jobs, specifically for "ephemeral" debugging data.
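To illustrate why a shared prefix makes such a clean-up loop cheap, here is a minimal sketch in Python. It assumes the info keys of one job are available as a sorted list of strings; the key names and the `ephemeral_keys` helper are hypothetical and are not taken from the actual job info table code.

```python
from bisect import bisect_left

# Assumption for illustration: ephemeral debugging rows use keys starting with "~".
EPHEMERAL_PREFIX = "~"


def ephemeral_keys(sorted_keys):
    """Return the keys that start with EPHEMERAL_PREFIX.

    Because the input is sorted, all prefixed keys are contiguous, so a
    single range scan starting at the prefix finds them.
    """
    start = bisect_left(sorted_keys, EPHEMERAL_PREFIX)
    found = []
    for key in sorted_keys[start:]:
        if not key.startswith(EPHEMERAL_PREFIX):
            break  # past the end of the prefixed range
        found.append(key)
    return found


# Hypothetical key names, used only to exercise the helper.
keys = sorted(["payload", "progress", "~diagram-2", "~diagram-1"])
to_cull = ephemeral_keys(keys)
```

The point of the sketch is only that a clean-up loop never has to inspect the non-ephemeral rows: it seeks to the prefix and stops at the first key that does not carry it.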

A time-based expiration for all "ephemeral" data has the downside, however, of potentially removing the most recent diagram, the one that is still actively executing, if it executes for longer than the retention period. Also, a job in a tight replanning loop could still end up with "too many" rows without any of them being old enough to be eligible for cleanup under this time-based policy. This led us to hold off on implementing this blanket policy until we saw evidence that it was needed and that justified its downsides.
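Both downsides can be seen in a small Python sketch of a purely age-based policy. The row layout (a key plus a write timestamp), the `expired_by_age` helper, and the numbers are assumptions chosen for illustration only.

```python
def expired_by_age(rows, now, retention):
    """Return the keys of rows written more than `retention` time units ago.

    rows is a list of (key, written_at) pairs.
    """
    return [key for key, written_at in rows if now - written_at > retention]


# Downside 1: a single plan that has been executing longer than the retention
# period is eligible for deletion, even though it is the current plan.
long_running = [("plan-1", 0)]
removed_current = expired_by_age(long_running, now=100, retention=30)

# Downside 2: a job that replanned 1000 times very recently has many rows,
# but none of them is old enough to be cleaned up.
tight_loop = [(f"plan-{i}", 99) for i in range(1000)]
removed_in_loop = expired_by_age(tight_loop, now=100, retention=30)
```

In the first case the only row, which describes the running plan, is selected for removal; in the second case nothing is selected despite there being 1000 rows.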

However, for the specific case of diagram URLs, we can have the task that persists them also delete older ones for that job, keeping the number retained per job limited regardless of their age.
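A minimal sketch of this count-based approach, in Python, might look like the following. The in-memory `store`, the `persist_diagram` helper, and the `keep` parameter are hypothetical stand-ins for the real persistence task and are not the actual implementation.

```python
def persist_diagram(store, job_id, written_at, url, keep=5):
    """Record a new diagram URL for job_id and drop all but the newest `keep`.

    store maps job_id -> list of (written_at, url) pairs.
    Returns the pairs that were deleted.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    rows = store.setdefault(job_id, [])
    rows.append((written_at, url))
    rows.sort()  # oldest first, ordered by written_at
    deleted = rows[:-keep]  # everything except the newest `keep` rows
    del rows[:-keep]
    return deleted


# A job that replans ten times retains only its three newest diagrams.
store = {}
for i in range(10):
    persist_diagram(store, job_id=1, written_at=i, url=f"url-{i}", keep=3)
```

Because the write path does the trimming, the newest diagram is always retained no matter how long it executes, and a tight replanning loop cannot accumulate more than `keep` rows.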

Jira issue: CRDB-39767

@dt dt added the C-bug and P-1 labels Jun 23, 2024
@dt dt self-assigned this Jun 23, 2024

blathers-crl bot commented Jun 23, 2024

Hi @dt, please add branch-* labels to identify which branch(es) this C-bug affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@dt dt added the O-support, branch-master, branch-release-23.1, branch-release-24.1, and branch-release-23.2 labels Jun 23, 2024
dt added a commit to dt/cockroach that referenced this issue Jun 23, 2024
Mitigation for cockroachdb#126083.

Release note (ops change): Some debugging-only information about physical plans is no longer collected
in the system.job_info table for changefeeds due to it having the potential to grow very large.

Epic: none.
craig bot pushed a commit that referenced this issue Jun 24, 2024
126085: changefeedccl: disable physical plan debug persistence r=dt a=dt

Mitigation for #126083.

Release note (ops change): Some debugging-only information about physical plans is no longer collected in the system.job_info table for changefeeds due to it having the potential to grow very large.

Epic: none.

Co-authored-by: David Taylor <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Jun 24, 2024
Mitigation for #126083.

Release note (ops change): Some debugging-only information about physical plans is no longer collected
in the system.job_info table for changefeeds due to it having the potential to grow very large.

Epic: none.
@benbardin
Collaborator

Will this be an issue with any other type of metadata, or just DSP diagrams?

@craig craig bot closed this as completed in 6693f4a Jun 24, 2024
blathers-crl bot pushed a commit that referenced this issue Jun 24, 2024
Mitigation for #126083.

Release note (ops change): Some debugging-only information about physical plans is no longer collected
in the system.job_info table for changefeeds due to it having the potential to grow very large.

Epic: none.
@dt dt reopened this Jun 24, 2024
@dt dt removed the branch-master label Jun 24, 2024
@dt
Member Author

dt commented Jun 24, 2024

re-opening until the backports land

asg0451 pushed a commit to asg0451/cockroach that referenced this issue Jun 25, 2024
Mitigation for cockroachdb#126083.

Release note (ops change): Some debugging-only information about physical plans is no longer collected
in the system.job_info table for changefeeds due to it having the potential to grow very large.

Epic: none.
@blathers-crl blathers-crl bot added the A-jobs label Jun 25, 2024
dt added a commit that referenced this issue Jun 27, 2024
Mitigation for #126083.

Release note (ops change): Some debugging-only information about physical plans is no longer collected
in the system.job_info table for changefeeds due to it having the potential to grow very large.

Epic: none.
@dt dt closed this as completed Jul 1, 2024