Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

intentresolver: fix testrace flake by extending timeouts #35085

Merged

Conversation

ajwerner
Copy link
Contributor

The unit tests which failed assumed that all messages intents in a single call
to ResolveIntents could be queued within the batch timeout and idle window which
is by default rather short (10 and 5 ms respectively). The testing validation
logic is rather strict and assumes that the batch will fire due to size rather
than time constraints. The fix is to allow the test to increase these values to
make the test more robust to load. Before this change the flake was reproducible
running testrace within several hundred iterations. After the test it has not
failed after running over 20k test race iterations from two concurrently running
executions.

Fixes #35064.

Release note: None

@ajwerner ajwerner requested a review from a team February 20, 2019 04:51
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajwerner ajwerner changed the title intentresolver: fix testrace flake by extending intentresolver: fix testrace flake by extending timeoutes Feb 20, 2019
@ajwerner ajwerner changed the title intentresolver: fix testrace flake by extending timeoutes intentresolver: fix testrace flake by extending timeouts' Feb 20, 2019
@ajwerner ajwerner changed the title intentresolver: fix testrace flake by extending timeouts' intentresolver: fix testrace flake by extending timeouts Feb 20, 2019
@ajwerner ajwerner force-pushed the ajwerner/intent-resolver-testrace-flake branch from 4bcf67e to aa1c66f Compare February 20, 2019 04:53
@tbg
Copy link
Member

tbg commented Feb 20, 2019

in your CI:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xcf5ecb]

goroutine 1039 [running]:
testing.tRunner.func1(0xc00038f300)
	/usr/local/go/src/testing/testing.go:792 +0x387
panic(0xe6ddc0, 0x1a180a0)
	/usr/local/go/src/runtime/panic.go:513 +0x1b9
github.com/cockroachdb/cockroach/pkg/storage/intentresolver.TestCleanupIntents.func2(0xc00038f300)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/intentresolver/intent_resolver_test.go:697 +0x5b
testing.tRunner(0xc00038f300, 0xc0002647c0)
	/usr/local/go/src/testing/testing.go:827 +0xbf
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:878 +0x35c

// intentResolutionBatchWait is used to configure the RequestBatcher which
// batches intent resolution requests across transactions. Intent resolution
// needs to occur in a relatively short period of time after the completion
// of a transaction in order to minimize the contention footprint of the write
// for other contending reads or writes. The chosen value was selected based
// on some light experimentation to ensure that performance does not degrade
// in the face of highly contended workloads.
// in the face of highly contended workloads. This setting is a var rather
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would greatly prefer if you added that to the intentResolver itself (unexported) instead so that your test can manipulate it directly. The pattern here is kind of dirty, and while sometimes it's the right thing to do just to "get things done" it seems that here you have everything in place to avoid it. I think your before would take the *intentResolver and after would go away.

tbg added a commit to tbg/cockroach that referenced this pull request Feb 20, 2019
craig bot pushed a commit that referenced this pull request Feb 20, 2019
35089: intentresolver: skip TestCleanupIntents r=knz a=tbg

See #35085.

(the PR is in flight, but this seems pretty flaky)

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@ajwerner ajwerner force-pushed the ajwerner/intent-resolver-testrace-flake branch 2 times, most recently from 0dde36f to 3807721 Compare February 20, 2019 14:05
@ajwerner
Copy link
Contributor Author

TFTR! I shouldn't push PRs just before going to bed. Firstly I forgot to git add the change to fix the panic and secondly yes the injection was sloppy.

Given that the setting is passed along to the requestbatcher which I wasn't eager to fully re-create inside of the before func. I instead exposed the parameters for batching timeout which also adds symmetry to the gc batcher. Hopefully this feels a little bit better.

@ajwerner ajwerner force-pushed the ajwerner/intent-resolver-testrace-flake branch from 3807721 to 9585216 Compare February 20, 2019 15:02
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:, thanks for polishing this.

I shouldn't push PRs just before going to bed

It also never works out for me when I do it.

I also agree with your comment in the commit message that if this becomes flaky again, the test should be weakened as opposed to adding hacks. But hope this just works in eternity.

Reviewed 2 of 2 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)

@ajwerner
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 20, 2019

Build failed (retrying...)

@craig
Copy link
Contributor

craig bot commented Feb 20, 2019

Merge conflict (retrying...)

The unit tests which failed assumed that all messages intents in a single call
to ResolveIntents could be queued within the batch timeout and idle window which
is by default rather short (10 and 5 ms respectively). The testing validation
logic is rather strict and assumes that the batch will fire due to size rather
than time constraints. The fix is to allow the test to increase these values to
make the test more robust to load. Before this change the flake was reproducible
running testrace within several hundred iterations. Running testrace now seems
to not provoke any failures.

All that being said, I can see an argument that the testing logic should be
made less rigid and should accept that the batches may be split up.

Fixes cockroachdb#35064.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/intent-resolver-testrace-flake branch from 9585216 to 7fdc815 Compare February 20, 2019 23:26
@craig
Copy link
Contributor

craig bot commented Feb 20, 2019

Canceled

@ajwerner
Copy link
Contributor Author

bors r+

craig bot pushed a commit that referenced this pull request Feb 21, 2019
35085: intentresolver: fix testrace flake by extending timeouts r=ajwerner a=ajwerner

The unit tests which failed assumed that all messages intents in a single call
to ResolveIntents could be queued within the batch timeout and idle window which
is by default rather short (10 and 5 ms respectively). The testing validation
logic is rather strict and assumes that the batch will fire due to size rather
than time constraints. The fix is to allow the test to increase these values to
make the test more robust to load. Before this change the flake was reproducible
running testrace within several hundred iterations. After the test it has not
failed after running over 20k test race iterations from two concurrently running
executions.

Fixes #35064.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
@craig
Copy link
Contributor

craig bot commented Feb 21, 2019

Build succeeded

@craig craig bot merged commit 7fdc815 into cockroachdb:master Feb 21, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

teamcity: failed test: TestCleanupIntents
3 participants