Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

schematelemetry: emit metrics and logs about invalid objects #108559

Merged
merged 1 commit into from
Aug 24, 2023

Conversation

chrisseto
Copy link
Contributor

@chrisseto chrisseto commented Aug 10, 2023

Short of continuously polling crdb_internal.invalid_objects, there was not a convenient way to monitor a cluster for descriptor corruption.

Having such an indicator would allow customers to perform preflight checks ahead of upgrades to avoid being stuck in a mixed version state. It would also allow CRL to more easily monitor cloud clusters for corruptions in the wild.

This commit updates the schematelemetry job to additionally update the sql.schema.invalid_objects gauge and emit logs for any encountered corruptions.

Informs: #104266
Epic: CRDB-28665

(release note was added after commit was merged)
Release note (ops change): Added a new sql.schema.invalid_objects gauge
metric. This gauge is periodically updated based on the schedule set by
the sql.schema.telemetry.recurrence cluster setting. When it is updated,
it counts the number of schema objects (tables, types, schemas, databases,
and functions) that are in an invalid state according to CockroachDB’s
internal validation checks. This metric is expected to be zero in a healthy
cluster, and if it is not, it indicates that there is a problem that must
be repaired.

@chrisseto chrisseto requested a review from a team as a code owner August 10, 2023 21:26
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@chrisseto
Copy link
Contributor Author

Couple of notes:
I'm not sure if the metrics are named or hooked up correctly.
I had a bit of work on "canning" descriptor corruptions but ultimately tabled it to make get this work over the finish line sooner.
I would have liked to get more iterations on the job run to verify that the gauge is updated as expected but the job takes about ~30s to run or it would require exporting a bunch of internal types 🤔

Copy link
Collaborator

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super familiar with the jobs framework, but was interested in learning more about it so I took a stab at reviewing this PR.

One related question: do you know how/when this job is created in the first place?

Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @chrisseto)


pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 114 at r1 (raw file):

	maxRecords int,
) error {
	// TODO(before merge): is this the right txn type? Does this need to be run within a txn?

Yeah this looks correct


pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 121 at r1 (raw file):

		}

		rows, err := txn.QueryIteratorEx(ctx, "sql-telemetry-invalid-objects", txn.KV(), sessiondata.NodeUserSessionDataOverride, `SELECT id, error FROM "".crdb_internal.invalid_objects LIMIT `+strconv.Itoa(maxRecords))

Use $1 and pass it as a parameter.


pkg/sql/catalog/schematelemetry/schema_telemetry_test.go line 179 at r1 (raw file):

	// Ensure that a log line is emitted for each invalid object.
	errorRE := regexp.MustCompile("found invalid object with ID.*")
	entries, err := log.FetchEntriesFromFiles(0, math.MaxInt64, 1000, errorRE, log.SelectEditMode(false, false))

Do we want to take it further and validate the types of validation errors?

@chrisseto chrisseto force-pushed the invalid-objs-metric branch from 9899c85 to 01d9d0d Compare August 14, 2023 17:26
Copy link
Contributor Author

@chrisseto chrisseto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

do you know how/when this job is created in the first place?

It's a combo of scheduled jobs and the schematelemetrycontrollerpackage.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @fqazi)


pkg/sql/catalog/schematelemetry/scheduled_job_executor.go line 30 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

nit: maybe split the renaming into its own commit

What would be the reasoning for that OOC? It can certainly be it's own atomic commit but it would be very small and largely irrelevant without the context of the following commit. I'm happy to do so if that's the standard practice! This seems like it may be pushing towards too granular. My "canary" for determining if something is too granular is whether or not I feel it would be possible to write a commit message longer than a sentence that provides more meaning/value than a one sentence summary.


pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 51 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

nit:

// MetricStruct implements the metric.Struct interface.

Done.


pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 114 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Yeah this looks correct

Done.


pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 121 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Use $1 and pass it as a parameter.

Huh, I swear that didn't work the first time around. Fixed now!


pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 132 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

I feel like this for loop might be more readable like this:

		for {
			ok, err = rows.Next(ctx)
			if !ok {
				break
			}

Also, should the err != nil check happen before checking ok?

Done.

I just copied this pattern from somewhere else 😅


pkg/sql/catalog/schematelemetry/schema_telemetry_test.go line 175 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

Do you mind explaining in a comment why there should be 9 invalid objects?

Done.


pkg/sql/catalog/schematelemetry/schema_telemetry_test.go line 179 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Do we want to take it further and validate the types of validation errors?

I leaning towards no. In part there's some laziness here but I feel that quantifying the various schema corruptions within this test wouldn't provide any additional confidence about this specific test and would overall make the test more brittle.

I played around with "canning" corruptions and pairing them with a matcher. It ended up being pretty nasty given the broad scope of possible corruption and didn't change this particular test too much short of asserting that invalid_objects works as expected. Also additional runs of the job ended up making the test take like 90s :/.

I do think there's a good argument for a follow up PR that's focused on classifying corruptions and further testing crdb_internal.invalid_objects. WDYT?

Copy link
Collaborator

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @chrisseto and @fqazi)


pkg/sql/catalog/schematelemetry/scheduled_job_executor.go line 30 at r1 (raw file):

Previously, chrisseto (Chris Seto) wrote…

What would be the reasoning for that OOC? It can certainly be it's own atomic commit but it would be very small and largely irrelevant without the context of the following commit. I'm happy to do so if that's the standard practice! This seems like it may be pushing towards too granular. My "canary" for determining if something is too granular is whether or not I feel it would be possible to write a commit message longer than a sentence that provides more meaning/value than a one sentence summary.

Yeah I guess it's a judgement call, I do find it's more useful when revisiting the git history that commits that rename things have their own commit since it's immediately obvious at what point it was renamed. I'd be okay if you want to keep it in the same commit.


pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 51 at r1 (raw file):

Previously, chrisseto (Chris Seto) wrote…

Done.

nit: missing "the" after "implements"


pkg/sql/catalog/schematelemetry/schema_telemetry_test.go line 175 at r1 (raw file):

Previously, chrisseto (Chris Seto) wrote…

Done.

Thanks! Would you mind also adding the specific test name to the comment (i.e. TestInvalidObjects)?

Short of continuously polling `crdb_internal.invalid_objects`, there was
not a convenient way to monitor a cluster for descriptor corruption.

Having such an indicator would allow customers to perform preflight
checks ahead of upgrades to avoid being stuck in a mixed version state.
It would also allow CRL to more easily monitor cloud clusters for
corruptions in the wild.

This commit updates the schematelemetry job to additionally update the
`sql.schema.invalid_objects` gauge and emit logs for any encountered
corruptions.

Informs: cockroachdb#104266
Epic: CRDB-28665
Release note: None
@chrisseto chrisseto force-pushed the invalid-objs-metric branch from 01d9d0d to f6af97a Compare August 22, 2023 18:24
Copy link
Contributor Author

@chrisseto chrisseto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @fqazi)


pkg/sql/catalog/schematelemetry/scheduled_job_executor.go line 30 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

Yeah I guess it's a judgement call, I do find it's more useful when revisiting the git history that commits that rename things have their own commit since it's immediately obvious at what point it was renamed. I'd be okay if you want to keep it in the same commit.

I'll opt to keep this one bundled in given that it's a private variable!


pkg/sql/catalog/schematelemetry/schema_telemetry_job.go line 51 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

nit: missing "the" after "implements"

Done.


pkg/sql/catalog/schematelemetry/schema_telemetry_test.go line 175 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

Thanks! Would you mind also adding the specific test name to the comment (i.e. TestInvalidObjects)?

Done.

Copy link
Collaborator

@andyyang890 andyyang890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @fqazi)

@chrisseto chrisseto requested a review from fqazi August 22, 2023 18:57
Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andyyang890 and @chrisseto)


pkg/sql/catalog/schematelemetry/schema_telemetry_test.go line 179 at r1 (raw file):

Previously, chrisseto (Chris Seto) wrote…

I leaning towards no. In part there's some laziness here but I feel that quantifying the various schema corruptions within this test wouldn't provide any additional confidence about this specific test and would overall make the test more brittle.

I played around with "canning" corruptions and pairing them with a matcher. It ended up being pretty nasty given the broad scope of possible corruption and didn't change this particular test too much short of asserting that invalid_objects works as expected. Also additional runs of the job ended up making the test take like 90s :/.

I do think there's a good argument for a follow up PR that's focused on classifying corruptions and further testing crdb_internal.invalid_objects. WDYT?

That's fair

@chrisseto
Copy link
Contributor Author

TFTRs!

bors r=fqazi,andyyang890

@craig
Copy link
Contributor

craig bot commented Aug 24, 2023

Build succeeded:

@rafiss
Copy link
Collaborator

rafiss commented Aug 30, 2023

I just realized we missed adding a release note for this one (so that customers are aware of the new metric that was added). I filed https://cockroachlabs.atlassian.net/browse/DOC-8710

Also, let's get this backported:

blathers backport 23.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants