Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: hotspots splits test is flaky #25036

Closed
tbg opened this issue Apr 24, 2018 · 3 comments · Fixed by #25261
Closed

roachtest: hotspots splits test is flaky #25036

tbg opened this issue Apr 24, 2018 · 3 comments · Fixed by #25261
Assignees
Labels
A-kv-client Relating to the KV client and the KV interface.

Comments

@tbg
Copy link
Member

tbg commented Apr 24, 2018

hotspotsplits.go:102,hotspotsplits.go:118: range size 198 MiB exceeded 192 MiB

https://teamcity.cockroachdb.com/viewLog.html?buildId=620631&buildTypeId=Cockroach_Nightlies_WorkloadNightly

@tbg tbg assigned m-schneider and nvanbenschoten and unassigned m-schneider Apr 24, 2018
@tbg
Copy link
Member Author

tbg commented Apr 24, 2018

@nvanbenschoten any idea how this happens? Is this perhaps because the blobs we're writing are so large that we can overshoot by 4mb?

@nvanbenschoten
Copy link
Member

Yeah, that may be part of it, but I think the bigger issue is that the backpressure mechanism is only a soft upper bound at this point. We won't currently delay/reject a query if a split is not in-flight, and it's possible that if the system is so overloaded that the splitQueue can't keep up with the load then splits might not be in-flight on all ranges at the same time. I was hoping that this test was tuned to avoid entering this performance regime, but I guess it isn't.

Part of the proposed fix for #24215 is to turn this into a hard upper bound at some larger range size. Once we hit that size, we'll start rejecting all clients instead of just slowing them down. That should help, at which point we can be more confident in this test and will only need to worry about the small overshoot you mentioned.

How flaky is this? We might want to skip it until the hard upper bound is introduced.

@m-schneider
Copy link
Contributor

It looks pretty flaky, I see two failures for today.

tbg added a commit to tbg/cockroach that referenced this issue Apr 24, 2018
See cockroachdb#25036.

Also introduce test skipping.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Apr 26, 2018
See cockroachdb#25036.

Also introduce test skipping.

Release note: None
craig bot pushed a commit that referenced this issue Apr 26, 2018
25059: roachtest: skip hotspot workload r=benesch,nvanbenschoten a=tschottdorf

See #25036.

Also introduce test skipping.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 3, 2018
Fixes cockroachdb#25036.

This fixes the flakes in the test. They have passed 5 straight times for
me. It's also easy to prove that backpressure is necessary for the test
to pass, just run the following and it quickly fails:

```
set cluster setting kv.range.backpressure_range_size_multiplier=0;
```

Release note: None
@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label May 15, 2018
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 16, 2018
Fixes cockroachdb#25036.

This fixes the flakes in the test. They have passed 5 straight times for
me. It's also easy to prove that backpressure is necessary for the test
to pass, just run the following and it quickly fails:

```
set cluster setting kv.range.backpressure_range_size_multiplier=0;
```

Release note: None
craig bot pushed a commit that referenced this issue May 16, 2018
25261: storage: harden replica write backpressure mechanism r=nvanbenschoten a=nvanbenschoten

Fixes #24215.
Fixes #25036.
Fixes #24974.
Fixes #25141.
Fixes #25142.

This PR includes a change to harden the replica's write backpressure
mechanism. We now reject backpressured write when a range is unsplittable.
This is important for placing an upper bound on the size of a range when it
consists of just a single row (#24215).

The change also fixes a few flaky tests that resulted from backpressure
not being restrictive enough.

25539: storage: put an Aborted txn inside a TxnAbortedError r=andreimatei a=andreimatei

TxnAbortedErrors have an updated transaction it them. It makes sense for
that transaction to be marked as Aborted, instead of Pending.

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
@craig craig bot closed this as completed in #25261 May 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-client Relating to the KV client and the KV interface.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants