cli/doctor, doctor: use right jobs table, skip dropped descs #123298

Merged 2 commits on May 16, 2024

Conversation

annrpom
Contributor

@annrpom annrpom commented Apr 30, 2024

cli/doctor: doctor should read from the right jobs table

In #97762, we started writing a job's payload (and progress)
information to the system.jobs_info table. As a result, we
had to change the parts of our code that relied on the system.jobs
table to use crdb_internal.system_jobs instead (since system.jobs
would inaccurately report that some payloads were NULL).
This change was never made for the in-memory representation of the jobs
table created by debug doctor, which can result in false-positive
"missing job" reports. This patch updates debug doctor's representation of
the jobs table by referring to crdb_internal.system_jobs instead.

Epic: none
Fixes: #122675

Release note: None
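
To make the false positive concrete, here is a minimal Go sketch; it is not the code from this patch. The types `jobRow` and `infoRow`, the functions `resolvePayloads` and `missingJobs`, and the info key `"payload"` are all hypothetical. They only model a jobs table whose payload column is empty next to a side table (standing in for system.jobs_info) that holds the real payload.

```go
package main

import "fmt"

// jobRow is a hypothetical stand-in for one row of the jobs table.
// Payload is nil when the table itself no longer stores it.
type jobRow struct {
	ID      int64
	Payload []byte
}

// infoRow is a hypothetical stand-in for one row of the side table
// that now holds payloads. The key name used here is an assumption.
type infoRow struct {
	JobID int64
	Key   string
	Value []byte
}

// resolvePayloads fills in nil payloads from the info rows, which is
// roughly what reading through a joined view gives the caller.
func resolvePayloads(jobs []jobRow, infos []infoRow) []jobRow {
	byJob := make(map[int64][]byte)
	for _, in := range infos {
		if in.Key == "payload" {
			byJob[in.JobID] = in.Value
		}
	}
	out := make([]jobRow, len(jobs))
	for i, j := range jobs {
		out[i] = j
		if j.Payload == nil {
			out[i].Payload = byJob[j.ID] // stays nil if truly absent
		}
	}
	return out
}

// missingJobs returns the IDs of jobs whose payload is nil, i.e. the
// jobs a naive check would treat as unusable or missing.
func missingJobs(jobs []jobRow) []int64 {
	var ids []int64
	for _, j := range jobs {
		if j.Payload == nil {
			ids = append(ids, j.ID)
		}
	}
	return ids
}

func main() {
	jobs := []jobRow{{ID: 1}, {ID: 2}}
	infos := []infoRow{{JobID: 1, Key: "payload", Value: []byte("p1")}}

	// Reading the bare table: every job looks like it has no payload.
	fmt.Println(missingJobs(jobs))
	// Reading through the joined data: only job 2 is really missing one.
	fmt.Println(missingJobs(resolvePayloads(jobs, infos)))
}
```

In this model, a checker that reads only the bare rows flags job 1 even though its payload exists elsewhere, which is the kind of false positive the commit message describes.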


doctor: skip validation for dropped descriptors

In some cases, dropped descriptors appear in our system.descriptor
table with dangling mutations, i.e. mutations that reference a job
that no longer exists. This patch teaches debug doctor examine to
skip validation on such dropped descriptors.

Epic: none

Fixes: #123477
Fixes: #122956

Release note: none
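
The skip can be pictured with a small Go sketch under simplifying assumptions. The `descriptor` type, the `descState` constants, and `examine` below are hypothetical and far simpler than the real descriptor validation; they only show the shape of "do not validate mutation jobs on a dropped descriptor".

```go
package main

import "fmt"

// descState is a hypothetical, reduced descriptor state.
type descState int

const (
	statePublic descState = iota
	stateDrop
)

// descriptor is a hypothetical, simplified descriptor: an ID, a state,
// and the job IDs that its pending mutations reference.
type descriptor struct {
	ID           int64
	State        descState
	MutationJobs []int64
}

// examine reports mutations whose job does not exist. Dropped
// descriptors are skipped entirely, so their dangling mutations are
// not reported as problems.
func examine(descs []descriptor, jobs map[int64]bool) []string {
	var problems []string
	for _, d := range descs {
		if d.State == stateDrop {
			continue // skip validation for dropped descriptors
		}
		for _, jobID := range d.MutationJobs {
			if !jobs[jobID] {
				problems = append(problems,
					fmt.Sprintf("descriptor %d: mutation job %d not found", d.ID, jobID))
			}
		}
	}
	return problems
}

func main() {
	descs := []descriptor{
		{ID: 104, State: stateDrop, MutationJobs: []int64{7}},
		{ID: 105, State: statePublic, MutationJobs: []int64{8}},
	}
	jobs := map[int64]bool{} // no jobs exist in this toy cluster
	for _, p := range examine(descs, jobs) {
		fmt.Println(p)
	}
}
```

Only the public descriptor is reported here; the dropped one is ignored even though its mutation points at a job that is gone.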

@cockroach-teamcity
Member

This change is Reviewable

@annrpom annrpom force-pushed the fp-debug-doctor-jobs branch 3 times, most recently from c47b474 to 5baaac0 Compare May 2, 2024 22:10
@annrpom annrpom changed the title cli/doctor: doctor should read from the right jobs table cli/doctor, doctor: use right jobs table, skip dropped descs May 3, 2024
@annrpom annrpom force-pushed the fp-debug-doctor-jobs branch from 5baaac0 to 483722d Compare May 6, 2024 15:38
@annrpom
Contributor Author

annrpom commented May 6, 2024

this is RFAL -- i had a question about mimicking the json file version (system.jobs.json)

@annrpom annrpom marked this pull request as ready for review May 6, 2024 15:42
@annrpom annrpom requested review from a team as code owners May 6, 2024 15:42
@annrpom annrpom requested a review from fqazi May 6, 2024 15:42
Collaborator

@fqazi fqazi left a comment

Reviewed 14 of 14 files at r1, 2 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @annrpom)


pkg/cli/doctor.go line 558 at r1 (raw file):

		}, 0)
		// TODO(before merge): when does this happen?
		if err := parseJSONFile(zipDirPath, "system.jobs.json", &jobsTableJSON); err != nil {

On master / 24.1 a debug zip will format tables as JSON, so we need to account for that


pkg/cli/doctor_test.go line 214 at r2 (raw file):

// TestDoctorClusterDropped tests that debug doctor examine will avoid validating dropped descriptors.
func TestDoctorClusterDropped(t *testing.T) {

Could we just combine it with the one above? I think it will be a bit cleaner since it will be a different path and toggling between the two flag states.

@annrpom annrpom force-pushed the fp-debug-doctor-jobs branch 2 times, most recently from 0bcea2f to 26388b0 Compare May 7, 2024 22:40
Contributor Author

@annrpom annrpom left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @fqazi)


pkg/cli/doctor.go line 558 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

On master / 24.1 a debug zip will format tables as JSON, so we need to account for that

faizan and i had a talk offline and found out that the original PR that changed debug zip output to JSON format was partially reverted (38cf825), so the default format is still the beloved tsv that we see today, with an option to request JSON formatting by running:

cockroach debug zip debug-json.zip --format=json

just did a little update to ensure that JSON formatting is supported properly
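
One way to support both layouts is to prefer the JSON dump when it exists and otherwise fall back to the text dump. The Go sketch below illustrates that idea only; `pickTableFile` is a hypothetical helper, not the function in pkg/cli/doctor.go, and the `.txt` extension for the default dump is an assumption (only `system.jobs.json` is named in this thread).

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// pickTableFile returns the dump file to read for a table inside an
// unzipped debug zip directory, and whether that file is JSON.
// It prefers "<table>.json" and falls back to "<table>.txt"
// (the .txt name is an assumption made for this sketch).
func pickTableFile(zipDir, table string) (string, bool, error) {
	jsonPath := filepath.Join(zipDir, table+".json")
	_, err := os.Stat(jsonPath)
	if err == nil {
		return jsonPath, true, nil
	}
	if !errors.Is(err, fs.ErrNotExist) {
		return "", false, err // unexpected error, e.g. permissions
	}
	txtPath := filepath.Join(zipDir, table+".txt")
	if _, err := os.Stat(txtPath); err != nil {
		return "", false, err
	}
	return txtPath, false, nil
}

func main() {
	dir, err := os.MkdirTemp("", "debugzip")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Simulate a zip produced with --format=json.
	if err := os.WriteFile(filepath.Join(dir, "system.jobs.json"), []byte("[]"), 0o600); err != nil {
		panic(err)
	}
	path, isJSON, err := pickTableFile(dir, "system.jobs")
	fmt.Println(filepath.Base(path), isJSON, err)
}
```

The caller can then branch on the returned flag to choose a JSON or text parser, so a zip taken with or without `--format=json` is handled by the same entry point.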


pkg/cli/doctor_test.go line 214 at r2 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Could we just combine it with the one above? I think it will be a bit cleaner since it will be a different path and toggling between the two flag states.

I tried seeing how this one would pan out, but it's not looking too hot annrpom@9eb1aec

I think we could merge these two tests together if we did something like

// setup for 1st one:
create table
unsafe upsert
run debug doctor examine

// followed by setup for 2nd one:
unsafe upsert -- adding dropped fields
run debug doctor examine

However, I am getting the error linked in the commit above when I try this; also, would

cockroach/pkg/sql/repair.go

Lines 101 to 104 in 48ebfd1

	if existing.IsUncommittedVersion() {
		return pgerror.Newf(pgcode.ObjectNotInPrerequisiteState,
			"cannot modify a modified descriptor (%d) with UnsafeUpsertDescriptor", id)
	}
not block me in any way? I am probably missing something here, wdyt 🙇 ?
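
For readers unfamiliar with the guard quoted above, here is a toy Go model of the concern. It assumes, purely for illustration, that "uncommitted version" means the descriptor was already written earlier in the same transaction; the `txn` type and `unsafeUpsert` method are hypothetical and do not reflect the real implementation in pkg/sql/repair.go.

```go
package main

import (
	"errors"
	"fmt"
)

// errUncommitted mirrors the wording of the quoted error message.
var errUncommitted = errors.New("cannot modify a modified descriptor")

// txn is a hypothetical transaction that remembers which descriptor
// IDs it has already written.
type txn struct {
	written map[int64]bool
}

func newTxn() *txn { return &txn{written: map[int64]bool{}} }

// unsafeUpsert models the guard: a second write to the same descriptor
// inside one transaction is rejected (an assumption of this sketch).
func (t *txn) unsafeUpsert(id int64) error {
	if t.written[id] {
		return fmt.Errorf("%w (%d)", errUncommitted, id)
	}
	t.written[id] = true
	return nil
}

func main() {
	t := newTxn()
	fmt.Println(t.unsafeUpsert(104))        // first write in this txn
	fmt.Println(t.unsafeUpsert(104))        // second write in the same txn is rejected
	fmt.Println(newTxn().unsafeUpsert(104)) // a fresh txn is accepted
}
```

Under that assumption, a combined test would need each upsert to run in its own transaction, which is one reason keeping the two tests separate is the simpler option.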

@annrpom annrpom requested a review from fqazi May 7, 2024 22:46
Collaborator

@fqazi fqazi left a comment

:lgtm:

Reviewed 2 of 3 files at r3, 5 of 5 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @annrpom)


pkg/cli/doctor_test.go line 214 at r2 (raw file):

Previously, annrpom (annie pompa) wrote…

I tried seeing how this one would pan out, but it's not looking too hot annrpom@9eb1aec

I think we could merge these two tests together if we did something like

// setup for 1st one:
create table
unsafe upsert
run debug doctor examine

// followed by setup for 2nd one:
unsafe upsert -- adding dropped fields
run debug doctor examine

However, I am getting the error linked in the commit above when I try this; also, would

cockroach/pkg/sql/repair.go

Lines 101 to 104 in 48ebfd1

	if existing.IsUncommittedVersion() {
		return pgerror.Newf(pgcode.ObjectNotInPrerequisiteState,
			"cannot modify a modified descriptor (%d) with UnsafeUpsertDescriptor", id)
	}
not block me in any way? I am probably missing something here, wdyt 🙇 ?

Yeah, it's okay to leave it separate then.

@rafiss
Collaborator

rafiss commented May 8, 2024

nice work! looks like there's one issue in the new test.

=== RUN   TestDoctorClusterDropped/examine
[debug --host=127.0.0.1:39133 --insecure=false --certs-dir=/var/lib/engflow/worker/work/0/exec/_tmp/1855398a359288315eef18ccd1f2c1dd/cli-test2994948257 doctor examine cluster]
    datadriven.go:144: 
        /var/lib/engflow/worker/work/0/exec/bazel-out/k8-fastbuild/bin/pkg/cli/cli_test_/cli_test.runfiles/com_github_cockroachdb_cockroach/pkg/cli/testdata/doctor/test_examine_cluster_dropped:1:
         
        expected:
        debug doctor examine cluster
        Examining 57 descriptors and 58 namespace entries...
        Examining 2 jobs...
        No problems found!
        
        found:
        debug doctor examine cluster
        Examining 62 descriptors and 63 namespace entries...
          ParentID 100, ParentSchemaID 101: namespace entry "foo" (104): descriptor is being dropped
        Examining 7 jobs...
        ERROR: validation failed

also, could you take care of adding the correct backport labels? (is it just 24.1 or 23.2 as well?)

@annrpom annrpom added backport-23.2.x Flags PRs that need to be backported to 23.2. backport-24.1.x Flags PRs that need to be backported to 24.1. labels May 13, 2024
@annrpom annrpom force-pushed the fp-debug-doctor-jobs branch from 26388b0 to 06ec4b3 Compare May 14, 2024 18:08
@annrpom annrpom added the backport-23.1.x Flags PRs that need to be backported to 23.1 label May 14, 2024
@annrpom annrpom force-pushed the fp-debug-doctor-jobs branch from 06ec4b3 to ace490b Compare May 15, 2024 16:21
@annrpom annrpom force-pushed the fp-debug-doctor-jobs branch from ace490b to 52a697b Compare May 16, 2024 18:23
@annrpom annrpom requested a review from a team as a code owner May 16, 2024 18:23
@annrpom annrpom requested review from nameisbhaskar and vidit-bhat and removed request for a team May 16, 2024 18:23
@annrpom
Contributor Author

annrpom commented May 16, 2024

TFTR! ('-')7

backporting all the way back to 23.1 because that looks like the release where we stopped writing payload/progress to system.jobs (#99458)

bors r=fqazi

@craig craig bot merged commit 6a45828 into cockroachdb:master May 16, 2024
22 checks passed

blathers-crl bot commented May 16, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 2566787 to blathers/backport-release-23.1-123298: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


error creating merge commit from 2566787 to blathers/backport-release-23.2-123298: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Labels
backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. backport-24.1.x Flags PRs that need to be backported to 24.1.