
kv: Replace multiTestContext with TestCluster in consistency_queue_test #50257

Conversation

lunevalex (Collaborator)

Makes progress on #8299

multiTestContext is a legacy construct that is deprecated in favor of running tests via TestCluster. This is one PR out of many to remove the usage of multiTestContext in the consistency_queue test cases.

Release note: None
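For context, a minimal TestCluster-based test skeleton of the kind these tests are being moved onto might look like the following (a sketch for illustration only, not code from this PR; the test name is made up):

```go
package kvserver_test

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
	"github.com/cockroachdb/cockroach/pkg/util/leaktest"
)

func TestConsistencyQueueSkeleton(t *testing.T) {
	defer leaktest.AfterTest(t)()
	ctx := context.Background()

	// Start a real three-node cluster instead of a multiTestContext.
	tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{
		ReplicationMode: base.ReplicationAuto,
	})
	defer tc.Stopper().Stop(ctx)

	// Per-node state is reached through tc.Servers rather than through
	// multiTestContext internals.
	if len(tc.Servers) != 3 {
		t.Fatalf("expected 3 servers, got %d", len(tc.Servers))
	}
}
```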

@cockroach-teamcity (Member)

This change is Reviewable

@lunevalex lunevalex requested review from tbg and knz June 16, 2020 14:02
@tbg (Member) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @lunevalex, and @tbg)


pkg/kv/kvserver/consistency_queue_test.go, line 73 at r1 (raw file):

	// observe a timestamp in the past to trick the clock.
	ts.Clock().Update(hlc.Timestamp{
		WallTime: ts.Clock().PhysicalNow() + store.LeaseExpiration(),

Thanks for working on this!
We should not be passing bogus future timestamps into clocks, not even in tests. The reason is that our system is built on assumptions of bounded clock skew, which is certainly violated here. The test would likely survive this (it may panic sporadically when the clock skew is detected), but as a general rule we want to avoid this pattern, at least in its naive form. Being able to expire leases is a necessary tool to get the test to run faster in some cases, though. When we do need it, we could look into running a test cluster with a very large MaxOffset (larger than the lease expiration), which would allow us to bump this at least once. I'm sure there's a can of worms there somewhere, though.

This particular test would benefit from a simpler refactor: make the consistency checker's shouldQueue method unit testable and don't start any nodes at all.

Basically, take

func (q *consistencyQueue) shouldQueue(
ctx context.Context, now hlc.Timestamp, repl *Replica, _ *config.SystemConfig,
) (bool, float64) {
and replace *Replica with a replica interface { getQueueLastProcessed() (hlc.Timestamp, error); Desc() *roachpb.RangeDescriptor }, pass a similarly tailored interface in for NodeLiveness, and when all is done you get to test the desired functionality directly. For inspiration, check out raft_log_queue.go and replica_gc_queue.go; both of them do this sort of thing.
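A rough sketch of that direction, for illustration only (the interface names, the liveness shape, and the priority computation are all assumptions here, not the change that was ultimately made):

```go
package kvserver

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// Hypothetical interfaces exposing just what shouldQueue needs.
type shouldQueueReplica interface {
	getQueueLastProcessed(ctx context.Context) (hlc.Timestamp, error)
	Desc() *roachpb.RangeDescriptor
}

type shouldQueueLiveness interface {
	IsLive(nodeID roachpb.NodeID) (bool, error)
}

// consistencyShouldQueueSketch contains only the decision logic, so a unit
// test can call it with fakes instead of starting any nodes.
func consistencyShouldQueueSketch(
	ctx context.Context,
	now hlc.Timestamp,
	repl shouldQueueReplica,
	liveness shouldQueueLiveness,
	interval time.Duration,
) (bool, float64) {
	if interval <= 0 {
		return false, 0
	}
	lastProcessed, err := repl.getQueueLastProcessed(ctx)
	if err != nil {
		return false, 0
	}
	elapsed := time.Duration(now.WallTime - lastProcessed.WallTime)
	if elapsed < interval {
		return false, 0
	}
	// Skip ranges with a replica on a non-live node; the check could not
	// collect checksums from all replicas anyway.
	for _, rd := range repl.Desc().Replicas().All() { // accessor name may vary by version
		if live, err := liveness.IsLive(rd.NodeID); err != nil || !live {
			return false, 0
		}
	}
	// Priority grows with the time elapsed since the last check.
	return true, float64(elapsed) / float64(interval)
}
```

With something like that in place, a unit test only needs a few lines of fakes for each interface.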

@tbg tbg self-requested a review June 17, 2020 09:44
@lunevalex (Collaborator, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/kv/kvserver/consistency_queue_test.go, line 73 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Thanks for working on this!
We should not be passing bogus future timestamps into clocks, not even in tests. The reason is that our system is built on assumptions of bounded clock skew, which is certainly violated here. The test would likely survive this (it may panic sporadically when the clock skew is detected), but as a general rule we want to avoid this pattern, at least in its naive form. Being able to expire leases is a necessary tool to get the test to run faster in some cases, though. When we do need it, we could look into running a test cluster with a very large MaxOffset (larger than the lease expiration), which would allow us to bump this at least once. I'm sure there's a can of worms there somewhere, though.

This particular test would benefit from a simpler refactor: make the consistency checker's shouldQueue method unit testable and don't start any nodes at all.

Basically, take

func (q *consistencyQueue) shouldQueue(
ctx context.Context, now hlc.Timestamp, repl *Replica, _ *config.SystemConfig,
) (bool, float64) {
and replace *Replica with a replica interface { getQueueLastProcessed() (hlc.Timestamp, error); Desc() *roachpb.RangeDescriptor }, pass a similarly tailored interface in for NodeLiveness, and when all is done you get to test the desired functionality directly. For inspiration, check out raft_log_queue.go and replica_gc_queue.go; both of them do this sort of thing.

Thank you for the suggestion; this makes sense. I will refactor and update.

@lunevalex lunevalex force-pushed the alex/remove-multi-test-context-consistency-test branch from b3aaaa7 to 3c21991 on June 22, 2020 21:59
@lunevalex (Collaborator, Author)


pkg/kv/kvserver/consistency_queue_test.go, line 73 at r1 (raw file):

Previously, lunevalex wrote…

Thank you for the suggestion; this makes sense. I will refactor and update.

It ended up being messier than originally intended because of the kvserver vs. kvserver_test split, but the test is a lot simpler now.

@lunevalex lunevalex force-pushed the alex/remove-multi-test-context-consistency-test branch from 3c21991 to 7eeb6db on June 23, 2020 14:18
@tbg (Member) left a comment


LGTM with some comments, thanks!

@@ -49,6 +49,15 @@ type consistencyQueue struct {
replicaCountFn func() int
}

// A data wrapper to allow for the shouldQueue method to be easier to test
tbg (Member):

You're probably getting tired of this, but .

@@ -77,30 +86,50 @@ func newConsistencyQueue(store *Store, gossip *gossip.Gossip) *consistencyQueue
func (q *consistencyQueue) shouldQueue(
ctx context.Context, now hlc.Timestamp, repl *Replica, _ *config.SystemConfig,
) (bool, float64) {
interval := q.interval()
if interval <= 0 {
return consistencyQueueShouldQueueImpl(ctx, now,
tbg (Member):

Curious why now isn't also on consistencyShouldQueueData.

@lunevalex (Collaborator, Author) commented Jun 25, 2020:

My mindset was to wrap everything that we use from Replica in this struct, and now was already part of the original method signature, so I left it out of the struct. No strong preference on this one; it could go either way.
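For readers following along, the wrapper under discussion can be pictured roughly like this (the field names are an approximation based on the diff and discussion above, not a verbatim copy; now stays a separate argument of consistencyQueueShouldQueueImpl rather than a field):

```go
package kvserver

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// Approximate shape of the data wrapper; exact fields may differ in the
// merged change.
type consistencyShouldQueueData struct {
	desc                  *roachpb.RangeDescriptor
	getQueueLastProcessed func(ctx context.Context) (hlc.Timestamp, error)
	isNodeAvailable       func(nodeID roachpb.NodeID) bool
	interval              time.Duration
}
```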

}

// With a real clock its not possible to check the actual diff,
// so we settle for just a key check.
tbg (Member):

Why can't you add diff[0].Timestamp.WallTime=123 to the old code?

lunevalex (Collaborator, Author):

My understanding is that the timestamp is part of the raw MVCC record, which makes this fairly hard to diff. Here is an example with two consecutive runs of the test with this code restored as it was:

```
consistency_queue_test.go:257: expected:
    --- leaseholder
    +++ follower
    +0.000000123,987 "e"
    +    ts:1970-01-01 00:00:00.000000123 +0000 UTC
    +    value:"\x00\x00\x00\x00\x01T"
    +    raw mvcc_key/value: 6500000000000000007b000003db0d 000000000154

    got:
    --- leaseholder
    +++ follower
    +1593109825.736774000,987 "e"
    +    ts:2020-06-25 18:30:25.736774 +0000 UTC
    +    value:"\x00\x00\x00\x00\x01T"
    +    raw mvcc_key/value: 6500161bdcf223399170000003db0d 000000000154

consistency_queue_test.go:257: expected:
    --- leaseholder
    +++ follower
    +0.000000123,987 "e"
    +    ts:1970-01-01 00:00:00.000000123 +0000 UTC
    +    value:"\x00\x00\x00\x00\x01T"
    +    raw mvcc_key/value: 6500000000000000007b000003db0d 000000000154

    got:
    --- leaseholder
    +++ follower
    +1593109975.875626000,987 "e"
    +    ts:2020-06-25 18:32:55.875626 +0000 UTC
    +    value:"\x00\x00\x00\x00\x01T"
    +    raw mvcc_key/value: 6500161bdd151832a410000003db0d 000000000154
```

lunevalex (Collaborator, Author):

Specifically note: raw mvcc_key/value: 6500161bdcf223399170000003db0d 000000000154 vs raw mvcc_key/value: 6500161bdd151832a410000003db0d 000000000154

tbg (Member):

Sorry, this dropped off my radar. I would "just" regexp-replace s~(raw mvcc_key/value: ).*~\1 <omitted>~g so that we compare the parts we know should match.
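A sketch of that normalization, assuming the comparison is done on the diff's textual form (the function name is made up):

```go
package kvserver_test

import "regexp"

// rawKVRe matches the raw mvcc_key/value lines, whose encoded bytes embed
// the wall time and therefore change from run to run.
var rawKVRe = regexp.MustCompile(`(raw mvcc_key/value: ).*`)

// normalizeDiffOutput blanks out the clock-dependent bytes so the remaining
// lines can be compared verbatim.
func normalizeDiffOutput(s string) string {
	return rawKVRe.ReplaceAllString(s, "${1}<omitted>")
}
```

As the replies below note, the ts: and version lines embed the clock reading as well, so this alone would not make the full diff comparable.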

lunevalex (Collaborator, Author):

I think in that case the only thing that is comparable is the value, since everything else has a timestamp, so we can just compare those directly without using strings.

lunevalex (Collaborator, Author):

Hmm, I just overrode the timestamp and that seems to have solved it all. OK, going with that.
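Roughly, the idea in test form (a sketch only: the helper and its caller are hypothetical stand-ins for the test's plumbing, and only the fields that are stable across runs are compared):

```go
package kvserver_test

import (
	"bytes"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/kv/kvserver"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// checkInjectedDiff pins the clock-dependent timestamp on the reported diff
// entry and then checks the stable parts (key and value) directly. The
// caller passes in whatever diff the consistency check reported.
func checkInjectedDiff(t *testing.T, diff []kvserver.ReplicaSnapshotDiff) {
	t.Helper()
	if len(diff) != 1 {
		t.Fatalf("expected exactly one diff entry, got %d", len(diff))
	}
	// Override the wall-clock timestamp with a fixed value so that any
	// rendering of the diff is deterministic.
	diff[0].Timestamp = hlc.Timestamp{WallTime: 123, Logical: 987}
	if !bytes.Equal(diff[0].Key, roachpb.Key("e")) ||
		!bytes.Equal(diff[0].Value, []byte("\x00\x00\x00\x00\x01T")) {
		t.Fatalf("unexpected diff entry: %+v", diff[0])
	}
}
```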

@@ -91,10 +90,18 @@ func (s *Store) ComputeMVCCStats() (enginepb.MVCCStats, error) {

// ConsistencyQueueShouldQueue invokes the shouldQueue method on the
// store's consistency queue.
func (s *Store) ConsistencyQueueShouldQueue(
ctx context.Context, now hlc.Timestamp, r *Replica, cfg *config.SystemConfig,
func ConsistencyQueueShouldQueue(
tbg (Member):

why is this exported?

tbg (Member):

(and why does it even exist? On master, it's called only from TestConsistencyQueueRequiresLive which you've refactored).

lunevalex (Collaborator, Author):

TestConsistencyQueueRequiresLive still calls this, and it lives in kvserver_test. Unfortunately we can't move that file into kvserver because of the circular dependency between TestCluster and kvserver.

tbg (Member):

Makes sense!
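To spell out the pattern discussed in this thread: the kvserver package re-exports the unexported decision logic through a small wrapper so that tests living in package kvserver_test (which can import TestCluster without an import cycle) can call it. Roughly (the exact parameter list is an assumption; the real one is in the diff above, and consistencyQueueShouldQueueImpl is defined elsewhere in the package):

```go
package kvserver

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// ConsistencyQueueShouldQueue exposes the consistency queue's shouldQueue
// decision to tests outside the package. NB: sketch only; the merged
// signature may differ.
func ConsistencyQueueShouldQueue(
	ctx context.Context, now hlc.Timestamp, data consistencyShouldQueueData,
) (bool, float64) {
	return consistencyQueueShouldQueueImpl(ctx, now, data)
}
```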

@lunevalex lunevalex force-pushed the alex/remove-multi-test-context-consistency-test branch from 7eeb6db to 6c243dc on June 25, 2020 18:37
@tbg (Member)

tbg commented Jun 26, 2020 via email

@lunevalex lunevalex force-pushed the alex/remove-multi-test-context-consistency-test branch from 6c243dc to 5171245 on June 26, 2020 17:24
Makes progress on cockroachdb#8299

multiTestContext is a legacy construct that is deprecated in favor of running tests via TestCluster. This is one PR out of many to remove the usage of multiTestContext in the consistency_queue test cases.

Release note: None
@lunevalex lunevalex force-pushed the alex/remove-multi-test-context-consistency-test branch from 5171245 to 59306ed on June 26, 2020 19:02
@lunevalex (Collaborator, Author)

bors r+

@craig (Contributor)

craig bot commented Jun 27, 2020

Build succeeded

@craig craig bot merged commit d8b3058 into cockroachdb:master Jun 27, 2020