schematelemetry: emit metrics and logs about invalid objects #108559

chrisseto · 2023-08-10T21:26:53Z

Short of continuously polling crdb_internal.invalid_objects, there was not a convenient way to monitor a cluster for descriptor corruption.

Having such an indicator would allow customers to perform preflight checks ahead of upgrades to avoid being stuck in a mixed version state. It would also allow CRL to more easily monitor cloud clusters for corruptions in the wild.

This commit updates the schematelemetry job to additionally update the sql.schema.invalid_objects gauge and emit logs for any encountered corruptions.

Informs: #104266
Epic: CRDB-28665

(release note was added after commit was merged)
Release note (ops change): Added a new sql.schema.invalid_objects gauge
metric. This gauge is periodically updated based on the schedule set by
the sql.schema.telemetry.recurrence cluster setting. When it is updated,
it counts the number of schema objects (tables, types, schemas, databases,
and functions) that are in an invalid state according to CockroachDB’s
internal validation checks. This metric is expected to be zero in a healthy
cluster, and if it is not, it indicates that there is a problem that must
be repaired.

cockroach-teamcity · 2023-08-10T21:27:02Z

This change is

chrisseto · 2023-08-10T21:28:56Z

Couple of notes:
I'm not sure if the metrics are named or hooked up correctly.
I had a bit of work on "canning" descriptor corruptions but ultimately tabled it to make get this work over the finish line sooner.
I would have liked to get more iterations on the job run to verify that the gauge is updated as expected but the job takes about ~30s to run or it would require exporting a bunch of internal types 🤔

andyyang890

I'm not super familiar with the jobs framework, but was interested in learning more about it so I took a stab at reviewing this PR.

One related question: do you know how/when this job is created in the first place?

pkg/sql/catalog/schematelemetry/scheduled_job_executor.go

pkg/sql/catalog/schematelemetry/schema_telemetry_job.go

pkg/sql/catalog/schematelemetry/schema_telemetry_test.go

fqazi

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @chrisseto)

pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 114 at r1 (raw file):

	maxRecords int,
) error {
	// TODO(before merge): is this the right txn type? Does this need to be run within a txn?

Yeah this looks correct

pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 121 at r1 (raw file):

		}

		rows, err := txn.QueryIteratorEx(ctx, "sql-telemetry-invalid-objects", txn.KV(), sessiondata.NodeUserSessionDataOverride, `SELECT id, error FROM "".crdb_internal.invalid_objects LIMIT `+strconv.Itoa(maxRecords))

Use $1 and pass it as a parameter.

pkg/sql/catalog/schematelemetry/schema_telemetry_test.go line 179 at r1 (raw file):

	// Ensure that a log line is emitted for each invalid object.
	errorRE := regexp.MustCompile("found invalid object with ID.*")
	entries, err := log.FetchEntriesFromFiles(0, math.MaxInt64, 1000, errorRE, log.SelectEditMode(false, false))

Do we want to take it further and validate the types of validation errors?

chrisseto

TFTRs!

do you know how/when this job is created in the first place?

It's a combo of scheduled jobs and the schematelemetrycontrollerpackage.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @fqazi)

pkg/sql/catalog/schematelemetry/scheduled_job_executor.go line 30 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

nit: maybe split the renaming into its own commit

What would be the reasoning for that OOC? It can certainly be it's own atomic commit but it would be very small and largely irrelevant without the context of the following commit. I'm happy to do so if that's the standard practice! This seems like it may be pushing towards too granular. My "canary" for determining if something is too granular is whether or not I feel it would be possible to write a commit message longer than a sentence that provides more meaning/value than a one sentence summary.

pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 51 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

nit:

// MetricStruct implements the metric.Struct interface.

Done.

pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 114 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Yeah this looks correct

Done.

pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 121 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Use $1 and pass it as a parameter.

Huh, I swear that didn't work the first time around. Fixed now!

pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 132 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

I feel like this for loop might be more readable like this:
		for {
			ok, err = rows.Next(ctx)
			if !ok {
				break
			}
Also, should the err != nil check happen before checking ok?

Done.

I just copied this pattern from somewhere else 😅

pkg/sql/catalog/schematelemetry/schema_telemetry_test.go line 175 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

Do you mind explaining in a comment why there should be 9 invalid objects?

Done.

pkg/sql/catalog/schematelemetry/schema_telemetry_test.go line 179 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Do we want to take it further and validate the types of validation errors?

I leaning towards no. In part there's some laziness here but I feel that quantifying the various schema corruptions within this test wouldn't provide any additional confidence about this specific test and would overall make the test more brittle.

I played around with "canning" corruptions and pairing them with a matcher. It ended up being pretty nasty given the broad scope of possible corruption and didn't change this particular test too much short of asserting that invalid_objects works as expected. Also additional runs of the job ended up making the test take like 90s :/.

I do think there's a good argument for a follow up PR that's focused on classifying corruptions and further testing crdb_internal.invalid_objects. WDYT?

andyyang890

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @chrisseto and @fqazi)

pkg/sql/catalog/schematelemetry/scheduled_job_executor.go line 30 at r1 (raw file):

Previously, chrisseto (Chris Seto) wrote…

What would be the reasoning for that OOC? It can certainly be it's own atomic commit but it would be very small and largely irrelevant without the context of the following commit. I'm happy to do so if that's the standard practice! This seems like it may be pushing towards too granular. My "canary" for determining if something is too granular is whether or not I feel it would be possible to write a commit message longer than a sentence that provides more meaning/value than a one sentence summary.

Yeah I guess it's a judgement call, I do find it's more useful when revisiting the git history that commits that rename things have their own commit since it's immediately obvious at what point it was renamed. I'd be okay if you want to keep it in the same commit.

pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 51 at r1 (raw file):

Previously, chrisseto (Chris Seto) wrote…

Done.

nit: missing "the" after "implements"

pkg/sql/catalog/schematelemetry/schema_telemetry_test.go line 175 at r1 (raw file):

Previously, chrisseto (Chris Seto) wrote…

Done.

Thanks! Would you mind also adding the specific test name to the comment (i.e. TestInvalidObjects)?

Short of continuously polling `crdb_internal.invalid_objects`, there was not a convenient way to monitor a cluster for descriptor corruption. Having such an indicator would allow customers to perform preflight checks ahead of upgrades to avoid being stuck in a mixed version state. It would also allow CRL to more easily monitor cloud clusters for corruptions in the wild. This commit updates the schematelemetry job to additionally update the `sql.schema.invalid_objects` gauge and emit logs for any encountered corruptions. Informs: cockroachdb#104266 Epic: CRDB-28665 Release note: None

chrisseto

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @fqazi)

pkg/sql/catalog/schematelemetry/scheduled_job_executor.go line 30 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

Yeah I guess it's a judgement call, I do find it's more useful when revisiting the git history that commits that rename things have their own commit since it's immediately obvious at what point it was renamed. I'd be okay if you want to keep it in the same commit.

I'll opt to keep this one bundled in given that it's a private variable!

pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 51 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

nit: missing "the" after "implements"

Done.

pkg/sql/catalog/schematelemetry/schema_telemetry_test.go line 175 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

Thanks! Would you mind also adding the specific test name to the comment (i.e. TestInvalidObjects)?

Done.

andyyang890

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @fqazi)

fqazi

Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andyyang890 and @chrisseto)

pkg/sql/catalog/schematelemetry/schema_telemetry_test.go line 179 at r1 (raw file):

Previously, chrisseto (Chris Seto) wrote…

I leaning towards no. In part there's some laziness here but I feel that quantifying the various schema corruptions within this test wouldn't provide any additional confidence about this specific test and would overall make the test more brittle.

I played around with "canning" corruptions and pairing them with a matcher. It ended up being pretty nasty given the broad scope of possible corruption and didn't change this particular test too much short of asserting that invalid_objects works as expected. Also additional runs of the job ended up making the test take like 90s :/.

I do think there's a good argument for a follow up PR that's focused on classifying corruptions and further testing crdb_internal.invalid_objects. WDYT?

That's fair

chrisseto · 2023-08-24T18:40:07Z

TFTRs!

bors r=fqazi,andyyang890

craig · 2023-08-24T20:02:06Z

Build succeeded:

Bazel Essential CI (Cockroach)

rafiss · 2023-08-30T17:12:18Z

I just realized we missed adding a release note for this one (so that customers are aware of the new metric that was added). I filed https://cockroachlabs.atlassian.net/browse/DOC-8710

Also, let's get this backported:

blathers backport 23.1

chrisseto requested a review from a team as a code owner August 10, 2023 21:26

andyyang890 reviewed Aug 11, 2023

View reviewed changes

fqazi requested changes Aug 11, 2023

View reviewed changes

chrisseto force-pushed the invalid-objs-metric branch from 9899c85 to 01d9d0d Compare August 14, 2023 17:26

chrisseto commented Aug 14, 2023

View reviewed changes

andyyang890 reviewed Aug 18, 2023

View reviewed changes

chrisseto force-pushed the invalid-objs-metric branch from 01d9d0d to f6af97a Compare August 22, 2023 18:24

chrisseto commented Aug 22, 2023

View reviewed changes

andyyang890 approved these changes Aug 22, 2023

View reviewed changes

chrisseto requested a review from fqazi August 22, 2023 18:57

fqazi approved these changes Aug 24, 2023

View reviewed changes

craig bot merged commit 7437f5e into cockroachdb:master Aug 24, 2023

exalate-issue-sync bot mentioned this pull request Aug 30, 2023

PR #108559 - schematelemetry: emit metrics and logs about invalid objects cockroachdb/docs#17790

Open

rafiss mentioned this pull request Aug 30, 2023

Run descriptor validation checks periodically to find any corrupted descriptors and provide logs/alerts or auto-repair #104266

Closed

blathers-crl bot mentioned this pull request Aug 30, 2023

release-23.1: schematelemetry: emit metrics and logs about invalid objects #109733

Merged

rafiss added backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.1.9-rc labels Aug 30, 2023

blathers-crl bot mentioned this pull request Aug 30, 2023

release-23.1.9-rc: schematelemetry: emit metrics and logs about invalid objects #109739

Merged

chrisseto deleted the invalid-objs-metric branch October 18, 2023 13:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

schematelemetry: emit metrics and logs about invalid objects #108559

schematelemetry: emit metrics and logs about invalid objects #108559

chrisseto commented Aug 10, 2023 •

edited by rafiss

Loading

cockroach-teamcity commented Aug 10, 2023

chrisseto commented Aug 10, 2023

andyyang890 left a comment

fqazi left a comment

chrisseto left a comment

andyyang890 left a comment

chrisseto left a comment

andyyang890 left a comment

fqazi left a comment

chrisseto commented Aug 24, 2023

craig bot commented Aug 24, 2023

rafiss commented Aug 30, 2023

schematelemetry: emit metrics and logs about invalid objects #108559

schematelemetry: emit metrics and logs about invalid objects #108559

Conversation

chrisseto commented Aug 10, 2023 • edited by rafiss Loading

cockroach-teamcity commented Aug 10, 2023

chrisseto commented Aug 10, 2023

andyyang890 left a comment

Choose a reason for hiding this comment

fqazi left a comment

Choose a reason for hiding this comment

chrisseto left a comment

Choose a reason for hiding this comment

andyyang890 left a comment

Choose a reason for hiding this comment

chrisseto left a comment

Choose a reason for hiding this comment

andyyang890 left a comment

Choose a reason for hiding this comment

fqazi left a comment

Choose a reason for hiding this comment

chrisseto commented Aug 24, 2023

craig bot commented Aug 24, 2023

rafiss commented Aug 30, 2023

chrisseto commented Aug 10, 2023 •

edited by rafiss

Loading