-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
schematelemetry: emit metrics and logs about invalid objects #108559
Conversation
Couple of notes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super familiar with the jobs framework, but was interested in learning more about it so I took a stab at reviewing this PR.
One related question: do you know how/when this job is created in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @chrisseto)
pkg/sql/catalog/schematelemetry/schema_telemetry_job.go
line 114 at r1 (raw file):
maxRecords int, ) error { // TODO(before merge): is this the right txn type? Does this need to be run within a txn?
Yeah this looks correct
pkg/sql/catalog/schematelemetry/schema_telemetry_job.go
line 121 at r1 (raw file):
} rows, err := txn.QueryIteratorEx(ctx, "sql-telemetry-invalid-objects", txn.KV(), sessiondata.NodeUserSessionDataOverride, `SELECT id, error FROM "".crdb_internal.invalid_objects LIMIT `+strconv.Itoa(maxRecords))
Use $1 and pass it as a parameter.
pkg/sql/catalog/schematelemetry/schema_telemetry_test.go
line 179 at r1 (raw file):
// Ensure that a log line is emitted for each invalid object. errorRE := regexp.MustCompile("found invalid object with ID.*") entries, err := log.FetchEntriesFromFiles(0, math.MaxInt64, 1000, errorRE, log.SelectEditMode(false, false))
Do we want to take it further and validate the types of validation errors?
9899c85
to
01d9d0d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTRs!
do you know how/when this job is created in the first place?
It's a combo of scheduled jobs and the schematelemetrycontroller
package.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @fqazi)
pkg/sql/catalog/schematelemetry/scheduled_job_executor.go
line 30 at r1 (raw file):
Previously, andyyang890 (Andy Yang) wrote…
nit: maybe split the renaming into its own commit
What would be the reasoning for that OOC? It can certainly be it's own atomic commit but it would be very small and largely irrelevant without the context of the following commit. I'm happy to do so if that's the standard practice! This seems like it may be pushing towards too granular. My "canary" for determining if something is too granular is whether or not I feel it would be possible to write a commit message longer than a sentence that provides more meaning/value than a one sentence summary.
pkg/sql/catalog/schematelemetry/schema_telemetry_job.go
line 51 at r1 (raw file):
Previously, andyyang890 (Andy Yang) wrote…
nit:
// MetricStruct implements the metric.Struct interface.
Done.
pkg/sql/catalog/schematelemetry/schema_telemetry_job.go
line 114 at r1 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
Yeah this looks correct
Done.
pkg/sql/catalog/schematelemetry/schema_telemetry_job.go
line 121 at r1 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
Use $1 and pass it as a parameter.
Huh, I swear that didn't work the first time around. Fixed now!
pkg/sql/catalog/schematelemetry/schema_telemetry_job.go
line 132 at r1 (raw file):
Previously, andyyang890 (Andy Yang) wrote…
I feel like this for loop might be more readable like this:
for { ok, err = rows.Next(ctx) if !ok { break }
Also, should the
err != nil
check happen before checkingok
?
Done.
I just copied this pattern from somewhere else 😅
pkg/sql/catalog/schematelemetry/schema_telemetry_test.go
line 175 at r1 (raw file):
Previously, andyyang890 (Andy Yang) wrote…
Do you mind explaining in a comment why there should be 9 invalid objects?
Done.
pkg/sql/catalog/schematelemetry/schema_telemetry_test.go
line 179 at r1 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
Do we want to take it further and validate the types of validation errors?
I leaning towards no. In part there's some laziness here but I feel that quantifying the various schema corruptions within this test wouldn't provide any additional confidence about this specific test and would overall make the test more brittle.
I played around with "canning" corruptions and pairing them with a matcher. It ended up being pretty nasty given the broad scope of possible corruption and didn't change this particular test too much short of asserting that invalid_objects
works as expected. Also additional runs of the job ended up making the test take like 90s :/.
I do think there's a good argument for a follow up PR that's focused on classifying corruptions and further testing crdb_internal.invalid_objects. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @chrisseto and @fqazi)
pkg/sql/catalog/schematelemetry/scheduled_job_executor.go
line 30 at r1 (raw file):
Previously, chrisseto (Chris Seto) wrote…
What would be the reasoning for that OOC? It can certainly be it's own atomic commit but it would be very small and largely irrelevant without the context of the following commit. I'm happy to do so if that's the standard practice! This seems like it may be pushing towards too granular. My "canary" for determining if something is too granular is whether or not I feel it would be possible to write a commit message longer than a sentence that provides more meaning/value than a one sentence summary.
Yeah I guess it's a judgement call, I do find it's more useful when revisiting the git history that commits that rename things have their own commit since it's immediately obvious at what point it was renamed. I'd be okay if you want to keep it in the same commit.
pkg/sql/catalog/schematelemetry/schema_telemetry_job.go
line 51 at r1 (raw file):
Previously, chrisseto (Chris Seto) wrote…
Done.
nit: missing "the" after "implements"
pkg/sql/catalog/schematelemetry/schema_telemetry_test.go
line 175 at r1 (raw file):
Previously, chrisseto (Chris Seto) wrote…
Done.
Thanks! Would you mind also adding the specific test name to the comment (i.e. TestInvalidObjects
)?
Short of continuously polling `crdb_internal.invalid_objects`, there was not a convenient way to monitor a cluster for descriptor corruption. Having such an indicator would allow customers to perform preflight checks ahead of upgrades to avoid being stuck in a mixed version state. It would also allow CRL to more easily monitor cloud clusters for corruptions in the wild. This commit updates the schematelemetry job to additionally update the `sql.schema.invalid_objects` gauge and emit logs for any encountered corruptions. Informs: cockroachdb#104266 Epic: CRDB-28665 Release note: None
01d9d0d
to
f6af97a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @fqazi)
pkg/sql/catalog/schematelemetry/scheduled_job_executor.go
line 30 at r1 (raw file):
Previously, andyyang890 (Andy Yang) wrote…
Yeah I guess it's a judgement call, I do find it's more useful when revisiting the git history that commits that rename things have their own commit since it's immediately obvious at what point it was renamed. I'd be okay if you want to keep it in the same commit.
I'll opt to keep this one bundled in given that it's a private variable!
pkg/sql/catalog/schematelemetry/schema_telemetry_job.go
line 51 at r1 (raw file):
Previously, andyyang890 (Andy Yang) wrote…
nit: missing "the" after "implements"
Done.
pkg/sql/catalog/schematelemetry/schema_telemetry_test.go
line 175 at r1 (raw file):
Previously, andyyang890 (Andy Yang) wrote…
Thanks! Would you mind also adding the specific test name to the comment (i.e.
TestInvalidObjects
)?
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @fqazi)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andyyang890 and @chrisseto)
pkg/sql/catalog/schematelemetry/schema_telemetry_test.go
line 179 at r1 (raw file):
Previously, chrisseto (Chris Seto) wrote…
I leaning towards no. In part there's some laziness here but I feel that quantifying the various schema corruptions within this test wouldn't provide any additional confidence about this specific test and would overall make the test more brittle.
I played around with "canning" corruptions and pairing them with a matcher. It ended up being pretty nasty given the broad scope of possible corruption and didn't change this particular test too much short of asserting that
invalid_objects
works as expected. Also additional runs of the job ended up making the test take like 90s :/.I do think there's a good argument for a follow up PR that's focused on classifying corruptions and further testing crdb_internal.invalid_objects. WDYT?
That's fair
TFTRs! bors r=fqazi,andyyang890 |
Build succeeded: |
I just realized we missed adding a release note for this one (so that customers are aware of the new metric that was added). I filed https://cockroachlabs.atlassian.net/browse/DOC-8710 Also, let's get this backported: blathers backport 23.1 |
Short of continuously polling
crdb_internal.invalid_objects
, there was not a convenient way to monitor a cluster for descriptor corruption.Having such an indicator would allow customers to perform preflight checks ahead of upgrades to avoid being stuck in a mixed version state. It would also allow CRL to more easily monitor cloud clusters for corruptions in the wild.
This commit updates the schematelemetry job to additionally update the
sql.schema.invalid_objects
gauge and emit logs for any encountered corruptions.Informs: #104266
Epic: CRDB-28665
(release note was added after commit was merged)
Release note (ops change): Added a new sql.schema.invalid_objects gauge
metric. This gauge is periodically updated based on the schedule set by
the sql.schema.telemetry.recurrence cluster setting. When it is updated,
it counts the number of schema objects (tables, types, schemas, databases,
and functions) that are in an invalid state according to CockroachDB’s
internal validation checks. This metric is expected to be zero in a healthy
cluster, and if it is not, it indicates that there is a problem that must
be repaired.