Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: Add telemetry counter when descriptor corruption error is encountered #61786

Closed
vy-ton opened this issue Mar 10, 2021 · 5 comments · Fixed by #62546
Closed

sql: Add telemetry counter when descriptor corruption error is encountered #61786

vy-ton opened this issue Mar 10, 2021 · 5 comments · Fixed by #62546
Assignees
Labels
A-telemetry C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@vy-ton
Copy link
Contributor

vy-ton commented Mar 10, 2021

In 20.2, we added descriptor validation on write. In 21.1, we will start logging/creating Sentry reports for namespace validation errors.

Desired behavior
Whenever we encounter a corruption error, we should increment a telemetry counter(s). The number of counters should be determined by the granularity level we have for corruption errors (Ex. descriptor corruption vs namespace corruption).

@vy-ton vy-ton added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-telemetry labels Mar 10, 2021
@postamar postamar self-assigned this Mar 15, 2021
@postamar
Copy link
Contributor

I've been giving this some thought. Are there any arguments against adding telemetry for all descriptor and namespace validation failures?

@vy-ton
Copy link
Contributor Author

vy-ton commented Mar 16, 2021

I would love that! Would it be a single counter for all descriptor validation failures?

@postamar
Copy link
Contributor

I'm in the process of finding out how to best do this but I should have an answer for you today.

@postamar
Copy link
Contributor

I eventually came up with a solution involving a small number of counters with the following dimensions:

  • failure class: was this a descriptor self-check? or a cross-reference check?
  • failure subject type: was this descriptor a table descriptor? a database descriptor? etc.
  • failure time: did the failure occur at read time or at write time? Note that validations also take place when attempting to rename things like columns and whatnot. I opted to not count these because they're ultimately the result of user error. These errors usually have proper PG codes anyway (but not always!)

Please correct me if I'm wrong but I believe this addresses the original ask and it's not too big a change that it goes against the spirit of the stability period.

@vy-ton
Copy link
Contributor Author

vy-ton commented Mar 24, 2021

I'll let @ajwerner or @jordanlewis weigh in on the right dimensions.

Initially, I would likely aggregate all description corruption-related counters to view the trendline. So this proposal SGTM.

postamar pushed a commit to postamar/cockroach that referenced this issue Mar 24, 2021
This commit adds `sql.schema.validation_errors.*` telemetry keys to
descriptor validation errors.

Fixes cockroachdb#61786.

Release note: None
craig bot pushed a commit that referenced this issue Mar 31, 2021
62546: catalog: add telemetry for descriptor validation errors r=postamar a=postamar

This commit adds `sql.schema.validation_errors.*` telemetry keys to
descriptor validation errors.

Fixes #61786.

Release note: None

Co-authored-by: Marius Posta <[email protected]>
@craig craig bot closed this as completed in edcf5a8 Mar 31, 2021
postamar pushed a commit to postamar/cockroach that referenced this issue Mar 31, 2021
This commit adds `sql.schema.validation_errors.*` telemetry keys to
descriptor validation errors.

Fixes cockroachdb#61786.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-telemetry C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants