Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: fix bugs in & fortify tenant refcounting #72471

Merged
merged 5 commits into from
Nov 18, 2021

Conversation

tbg
Copy link
Member

@tbg tbg commented Nov 4, 2021

This PR fixes a sandwich of two bugs around refcounting the tenant rate limiters and metrics that I found while prototyping around #72374.

We had an accidental early return in postDestroyRaftMuLocked that meant that tenant metrics and rate limiters were essentially never released.

We were also continuing to use at least the tenant metrics object after the call to postDestroyRaftMuLocked had returned (but note that the above bug meant that we hadn't actually released the ref).

This PR fixes both and adds precautions against regressions of such bugs.

Despite having fixed bugs, I would be cautious about backporting this to 21.2 and 21.1; the bugs here never seem to have caused any problems, and since our day-to-day testing isn't heavy on tenants, it's unclear how reliably we'd be shaking out problems that were previously masked by the bug.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg force-pushed the kv-tenant-metrics-refcount branch 2 times, most recently from ec534f2 to 5111caf Compare November 10, 2021 13:56
@tbg
Copy link
Member Author

tbg commented Nov 10, 2021

cc @ajwerner curious about your opinion on backporting the bug fixes here. I'd definitely want a lot of bake time but I do wonder if it's even a good idea in general to backport.

@tbg tbg force-pushed the kv-tenant-metrics-refcount branch from 5111caf to d49a91b Compare November 10, 2021 14:02
@tbg tbg changed the title [wip] kvserver: avoid use-after-release for tenant metrics kvserver: fix bugs & fortify tenant refcounting Nov 10, 2021
@tbg tbg changed the title kvserver: fix bugs & fortify tenant refcounting kvserver: fix bugs in & fortify tenant refcounting Nov 10, 2021
@tbg tbg marked this pull request as ready for review November 10, 2021 14:21
@tbg tbg requested a review from a team as a code owner November 10, 2021 14:21
@tbg
Copy link
Member Author

tbg commented Nov 10, 2021

This is now RFAL.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

I don't have much to say on the defensive checks for the tenant ID. I suspect I added them as I was implementing it to try to catch anything I might have missed.

Reviewed 1 of 1 files at r1, 12 of 12 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvserver/metrics.go, line 1446 at r2 (raw file):

	_tenantID roachpb.TenantID
	_state    int32 // atomic; 0=usable 1=poisoned

why the _ prefix?


pkg/kv/kvserver/metrics.go, line 1447 at r2 (raw file):

	_tenantID roachpb.TenantID
	_state    int32 // atomic; 0=usable 1=poisoned
	stack     []byte

Comment on when this gets populated and note that this should effectively never be populated and then stick around for a long time. It'd probably be sad if we populated and stored a stack trace for all replicas at all times (which this does not do).


pkg/kv/kvserver/metrics.go, line 1554 at r2 (raw file):

	m := sm.getTenant(ctx, ref) // NB: asserts against use-after-release
	if atomic.SwapInt32(&ref._state, 1) != 0 {
		log.Fatalf(ctx, "metrics ref released twice")

Log the stack trace here?


pkg/kv/kvserver/metrics.go, line 1556 at r2 (raw file):

		log.Fatalf(ctx, "metrics ref released twice")
	}
	ref.stack = debug.Stack()

Does this not need to be protected by the mutex? Is there some assumption that the Replica.mu or a different mutex is locked? If so, comment it.


pkg/kv/kvserver/metrics.go, line 1479 at r3 (raw file):

	// This struct is invisible to the metric package.
	//
	// TODO(tbg): seems bad that this uses int64 keys but tenantIDs can

Sometimes I wish uint63 was a datatype.

tbg added 4 commits November 17, 2021 15:13
Since this early return has been present since at least 19.1, the
tenant limiter release path was never triggered.

Release note: None
Give tenants an explicit token that they hold on to to account for their
tenant metrics reference. This gives us a convenient place to assert
that references are not double-freed or used after free, which in fact
hey were in multiple locations (all fixed in this commit).

Release note: None
These are effectively unsupported since using them would generate
undefined behavior, at least in this one map that keys by int64
without shifting the value range around (officially, tenantIDs
are uint64s).

Release note: None
@tbg tbg force-pushed the kv-tenant-metrics-refcount branch 2 times, most recently from 63da385 to 4b086eb Compare November 17, 2021 15:19
@tbg tbg requested a review from ajwerner November 17, 2021 15:19
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)


pkg/kv/kvserver/metrics.go, line 1446 at r2 (raw file):

Previously, ajwerner wrote…
	_tenantID roachpb.TenantID
	_state    int32 // atomic; 0=usable 1=poisoned

why the _ prefix?

This stuff lives in the same package as the "user", I want to somehow signal that these fields aren't to be accessed directly by anyone except the type's receivers. This is the best I could do without pulling this into a separate package, which I felt was not worth it.


pkg/kv/kvserver/metrics.go, line 1479 at r3 (raw file):

Previously, ajwerner wrote…

Sometimes I wish uint63 was a datatype.

Actually not sure what I was thinking in this comment - everything does work. It would be annoying to look at this map through a debugger and to have to do a manual int64->uint64 cast, but honestly that's fine. Re-worded the comment.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 11 of 12 files at r6, 2 of 2 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @tbg)


pkg/kv/kvserver/metrics.go, line 1463 at r9 (raw file):

		ref._stack.Lock()
		defer ref._stack.Unlock()
		log.Fatalf(ctx, "tenantMetricsRef already finalized in:\n%s", ref._stack.b)

want to use depth 1?


pkg/kv/kvserver/metrics.go, line 1572 at r9 (raw file):

		ref._stack.Lock()
		defer ref._stack.Unlock()
		log.Fatalf(ctx, "metrics ref released twice, first time in: %s", ref._stack)

This needs to be ref._stack.b

Also, this is almost exactly assert but with a different message. Maybe just add a string or redect.SafeString to assert like inc, get, release? Also, want to make sure the stack doesn't get redacted when we do assert?

@tbg tbg requested a review from ajwerner November 18, 2021 09:26
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)


pkg/kv/kvserver/metrics.go, line 1572 at r9 (raw file):

Previously, ajwerner wrote…

This needs to be ref._stack.b

Also, this is almost exactly assert but with a different message. Maybe just add a string or redect.SafeString to assert like inc, get, release? Also, want to make sure the stack doesn't get redacted when we do assert?

I simplified this by calling assert. When it gets hit we get the stack of the second and the first caller so I don't think we need to custom-tailor the message.

I would like to have the stack unredacted, but unfortunately a stack trace in general contains PII, and I don't want to get into the game of trying to guess who the callers are (nor write the redaction logic for a stack trace, we can probably use panicparse to facilitate that though).

This adds some finishing touches that came out of PR review.

Release note: None
@tbg tbg force-pushed the kv-tenant-metrics-refcount branch from 4b086eb to d1e3fc1 Compare November 18, 2021 09:27
@tbg
Copy link
Member Author

tbg commented Nov 18, 2021

bors r=ajwerner

@craig
Copy link
Contributor

craig bot commented Nov 18, 2021

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Nov 18, 2021

Build succeeded:

@craig craig bot merged commit e1c7d6a into cockroachdb:master Nov 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants