jobs: excessive dsp-diag-url profiler rows #126083
Labels
A-jobs
branch-release-23.1
Used to mark GA and release blockers, technical advisories, and bugs for 23.1
branch-release-23.2
Used to mark GA and release blockers, technical advisories, and bugs for 23.2
branch-release-24.1
Used to mark GA and release blockers, technical advisories, and bugs for 24.1
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-support
Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
P-1
Issues/test failures with a fix SLA of 1 month
T-jobs
Comments
dt
added
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
P-1
Issues/test failures with a fix SLA of 1 month
labels
Jun 23, 2024
Hi @dt, please add branch-* labels to identify which branch(es) this C-bug affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
dt
added
O-support
Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
branch-master
Failures and bugs on the master branch.
branch-release-23.1
Used to mark GA and release blockers, technical advisories, and bugs for 23.1
branch-release-24.1
Used to mark GA and release blockers, technical advisories, and bugs for 24.1
branch-release-23.2
Used to mark GA and release blockers, technical advisories, and bugs for 23.2
labels
Jun 23, 2024
dt
added a commit
to dt/cockroach
that referenced
this issue
Jun 23, 2024
Mitigation for cockroachdb#126083. Release note (ops change): Some debugging-only information about physical plans is no longer collected in the system.job_info table for changefeeds due to it having the potential to grow very large. Epic: none.
craig bot
pushed a commit
that referenced
this issue
Jun 24, 2024
126085: changefeedccl: disable physical plan debug persistence r=dt a=dt Mitigation for #126083. Release note (ops change): Some debugging-only information about physical plans is no longer collected in the system.job_info table for changefeeds due to it having the potential to grow very large. Epic: none. Co-authored-by: David Taylor <[email protected]>
blathers-crl bot
pushed a commit
that referenced
this issue
Jun 24, 2024
Mitigation for #126083. Release note (ops change): Some debugging-only information about physical plans is no longer collected in the system.job_info table for changefeeds due to it having the potential to grow very large. Epic: none.
Will this be an issue with any other type of metadata, or just DSP diagrams?
blathers-crl bot
pushed a commit
that referenced
this issue
Jun 24, 2024
Mitigation for #126083. Release note (ops change): Some debugging-only information about physical plans is no longer collected in the system.job_info table for changefeeds due to it having the potential to grow very large. Epic: none.
Re-opening until the backports land.
asg0451
pushed a commit
to asg0451/cockroach
that referenced
this issue
Jun 25, 2024
Mitigation for cockroachdb#126083. Release note (ops change): Some debugging-only information about physical plans is no longer collected in the system.job_info table for changefeeds due to it having the potential to grow very large. Epic: none.
dt
added a commit
that referenced
this issue
Jun 27, 2024
Mitigation for #126083. Release note (ops change): Some debugging-only information about physical plans is no longer collected in the system.job_info table for changefeeds due to it having the potential to grow very large. Epic: none.
In 23.1 we started storing the physical plan for many job types in rows under their prefix of the job info table to aid in debugging. These rows are created every time the job plans a physical flow, and are generally deleted when the job is deleted. For very long-running jobs that replan many times in their lifetime, such as a changefeed that runs for months or years and encounters errors, node restarts, rolling upgrades, etc., the number of persisted plans can grow large. This is currently compounded by the fact that diagram URLs for changefeeds in particular can be massive, even for a single plan, because nothing limits details such as the number or length of spans copied from the spec into the diagram URL.
At one point we considered deleting "debugging" data like these persisted plans even before the job itself and all of its data were deleted. Indeed, in 23.2 we started prefixing these plan rows, and other similar rows persisted only for later manual debugging inspection, with the character `~`, so that this prefix could be scanned and culled if we later added a clean-up loop, similar to the one for terminal jobs, specifically for "ephemeral" debugging data.
A time-based expiration for all "ephemeral" data has the downside, however, of potentially removing the most recent diagram, the one that is still actively executing, if it executes for longer than the retention period. Also, a job in a tight replanning loop could still end up with "too many" rows without any of them being old enough to be eligible for cleanup under this time-based policy. This led us to hold off on implementing this blanket policy until we saw evidence that it was needed and that justified its downsides.
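To make the two downsides concrete, here is a minimal Python sketch of an age-based policy. It is only a toy model: the dict standing in for a job's `~`-prefixed rows, the `~plan-N` key names, the `expire_by_age` helper, and the 7-day retention are all assumptions for illustration, not the actual job info schema or any existing cleanup code.

```python
from datetime import datetime, timedelta


def expire_by_age(rows, now, retention):
    """Keep only the rows written within `retention` of `now`.

    `rows` is a hypothetical stand-in for one job's "~"-prefixed rows,
    mapping an info key to the time the row was written.
    """
    return {key: written for key, written in rows.items() if now - written <= retention}


now = datetime(2024, 6, 23)
retention = timedelta(days=7)  # illustrative retention period

# Downside 1: the only plan of a job that planned 30 days ago and is still
# executing that plan is older than the retention period, so it is dropped.
long_running = {"~plan-0": now - timedelta(days=30)}
after_expiry = expire_by_age(long_running, now, retention)

# Downside 2: a job that replanned 1000 times in the last ~17 minutes keeps
# every row, because none of them is old enough to be eligible for cleanup.
tight_loop = {f"~plan-{i}": now - timedelta(seconds=i) for i in range(1000)}
survivors = expire_by_age(tight_loop, now, retention)

print(len(after_expiry), len(survivors))
```

In this model the long-running job loses its only (still active) plan, while the tightly looping job keeps all 1000 rows.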
However, for the specific case of diagram URLs, we can have the task that persists them also delete older ones for that job, keeping the number retained per job limited regardless of their age.
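A rough sketch of that count-based approach, again as a toy in-memory model in Python rather than the real jobs code. The `~dsp-diag-url-` key prefix is assumed from the issue title, and `persist_diagram`, the zero-padded timestamp suffix, and the limit of 3 are hypothetical names and values chosen for illustration.

```python
DIAG_PREFIX = "~dsp-diag-url-"  # assumed key prefix, modeled on the issue title
MAX_DIAGRAMS_PER_JOB = 3        # hypothetical per-job retention limit


def persist_diagram(job_info, job_id, written_at_micros, url, keep=MAX_DIAGRAMS_PER_JOB):
    """Write a new diagram row for `job_id`, then delete that job's oldest
    diagram rows so that at most `keep` of them remain.

    `job_info` is a dict of job ID -> {info key -> value}, standing in for
    the job info table. Rows without the diagram prefix are never touched.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1 so the newest plan survives")
    rows = job_info.setdefault(job_id, {})
    # Zero-pad the timestamp so lexicographic key order is chronological.
    rows[f"{DIAG_PREFIX}{written_at_micros:020d}"] = url
    diagram_keys = sorted(key for key in rows if key.startswith(DIAG_PREFIX))
    for stale_key in diagram_keys[:-keep]:
        del rows[stale_key]


job_info = {42: {"checkpoint": "unrelated row"}}
for ts in range(1, 11):  # ten replans, regardless of how far apart they are
    persist_diagram(job_info, 42, ts, f"https://example.invalid/diagram/{ts}")

remaining = sorted(key for key in job_info[42] if key.startswith(DIAG_PREFIX))
print(remaining)
```

Because the limit is on count rather than age, the most recently written diagram is always retained however long it has been executing, and a tight replanning loop cannot push a job past `keep` diagram rows.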
Jira issue: CRDB-39767