acceptance: stats-based rebalancing makes Allocator Test 3 to 5 10G flaky #17685

Closed
a-robinson opened this issue Aug 16, 2017 · 5 comments

@a-robinson
Contributor

https://teamcity.cockroachdb.com/viewType.html?buildTypeId=Cockroach_AllocatorTest3to510g&tab=buildTypeHistoryList&branch_Cockroach_Nightlies=__all_branches__

Opening this to make it clear it's on my plate.

@a-robinson a-robinson self-assigned this Aug 16, 2017
@cuongdo
Contributor

cuongdo commented Aug 16, 2017

Could this be a case of the passed-in max standard deviation being too strict for the latest allocator changes? It seems like we were trying to balance range counts more aggressively before.

@a-robinson
Contributor Author

a-robinson commented Aug 16, 2017

The two most likely explanations are:

  1. The goal of the new allocator isn't to balance range counts - it's to balance the combination of range counts, writes per second, and disk usage per store. If writes and disk usage are balanced but the number of ranges isn't, that's considered fine. This test only verifies range counts, so the allocator could have succeeded in its new goals while failing the test. Changing the test would be needed to fix this.
  2. If writes and disk usage are evenly distributed, you might expect that balancing them is approximately the same as balancing range count. However, the test could still be failing because we only require writes per second and disk usage to be within 20% of the mean on each store, but are testing that the standard deviation of range count is less than 10%. Lowering that 20% threshold for the duration of the test may fix this.

There is a third explanation (that the allocator is actually broken), but both of the above explanations are likely to lead to flakiness of this test even if everything is working perfectly in the allocator.
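
To make the second explanation concrete, here is a minimal sketch of how a set of per-store values can all sit within 20% of the mean while their standard deviation still exceeds 10% of the mean. The store values, the helper names (`meanAndStdDev`, `withinFractionOfMean`), the use of population standard deviation, and the reading of "10%" as a fraction of the mean are all assumptions for illustration; none of this is taken from the allocator or from `allocator_test.go`.

```go
package main

import (
	"fmt"
	"math"
)

// meanAndStdDev returns the mean and the population standard deviation of xs.
// (Hypothetical helper; the real test may compute its statistic differently.)
func meanAndStdDev(xs []float64) (mean, stddev float64) {
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))

	var sumSq float64
	for _, x := range xs {
		d := x - mean
		sumSq += d * d
	}
	return mean, math.Sqrt(sumSq / float64(len(xs)))
}

// withinFractionOfMean reports whether every value in xs lies within
// frac*mean of the mean, e.g. frac = 0.20 for "within 20% of the mean".
func withinFractionOfMean(xs []float64, frac float64) bool {
	mean, _ := meanAndStdDev(xs)
	for _, x := range xs {
		if math.Abs(x-mean) > frac*mean {
			return false
		}
	}
	return true
}

func main() {
	// Made-up per-store values for a five-store cluster (mean is 100).
	counts := []float64{85, 115, 85, 115, 100}

	mean, sd := meanAndStdDev(counts)
	fmt.Printf("every store within 20%% of mean: %t\n", withinFractionOfMean(counts, 0.20))
	fmt.Printf("stddev as a fraction of mean: %.3f\n", sd/mean)
}
```

With these made-up values, every store passes the per-store 20% check, yet the standard deviation is above 10% of the mean, so a stricter standard-deviation assertion would fail even though the looser per-store criterion is satisfied.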

@a-robinson
Contributor Author

And for future reference, here are the artifacts from last night's failure:
Nightlies_Allocator_Test_3_to_5_10G_461_artifacts.zip

[TestRebalance_3To5Small] allocator_test.go:162: 3m15s822ms670us elapsed without changes, but replica count standard deviation is 21.05 (>14.00)

@a-robinson
Contributor Author

It's looking like #17733 may have fixed this. The only failures since it went in have been due to acceptance test infrastructure issues, which were fixed in #17776. I'll continue to monitor, but things are looking good at this point.

@a-robinson a-robinson reopened this Oct 26, 2017
@a-robinson a-robinson changed the title acceptance: Allocator Test 3 to 5 10G is flaky acceptance: stats-based rebalancing makes Allocator Test 3 to 5 10G flaky Oct 26, 2017
@tbg
Member

tbg commented Apr 19, 2018

Closing as not actionable.

@tbg tbg closed this as completed Apr 19, 2018