Roachperf shows poor performance on some days for YCSB #34458

Closed
awoods187 opened this issue Jan 31, 2019 · 4 comments

@awoods187
Contributor

Roachperf shows large dips in YCSB performance on some days in January.


@ajwerner
Contributor

ajwerner commented Feb 5, 2019

With AWS data added, we see the same sort of pattern on one day. This is definitely something to dig into.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Feb 11, 2019
This PR is a temporary measure to aid in debugging the very peculiar cockroachdb#34458.
The idea is that if a run fails to meet the expected throughput (which is above
but near the bad runs), we'd like the opportunity to poke around.

Release note: None
ajwerner added a commit that referenced this issue Feb 12, 2019
This PR is a temporary measure to aid in debugging the very peculiar #34458.
The idea is that if a run fails to meet the expected throughput (which is above
but near the bad runs), we'd like the opportunity to poke around.

Release note: None
craig bot pushed a commit that referenced this issue Feb 12, 2019
34808: roachtest: fail YCSB for debugging if below performance expectations r=ajwerner a=ajwerner

This PR is a temporary measure to aid in debugging the very peculiar #34458.
The idea is that if a run fails to meet the expected throughput (which is above
but near the bad runs), we'd like the opportunity to poke around.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
@ajwerner
Contributor

The good news is that there seems to be a very reasonable explanation for all of this. The bad news is that it's going to be somewhat difficult to rectify.

These performance drops seem to occur because we run the same nightly tests off of different branches and releases, yet the test script is completely unaware of this. The dips are highly correlated across cloud providers because the roachperf code processes only the first completed result for a given day in lexicographical order, which ends up being ordered by build ID, and skips the other builds. This means that the dips generally correspond to runs of previous releases.
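
To illustrate the selection behavior described above, here is a minimal Go sketch. It is not the roachperf code: the `<day>/<buildID>` path layout, the function name `firstPerDay`, and the build IDs are all assumptions made up for the example. It only shows how "sort lexicographically, keep the first result per day" can let a release-branch build shadow a master build from the same day.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// firstPerDay sorts result paths lexicographically and keeps only the first
// path seen for each day, skipping the rest. The "<day>/<buildID>" layout is
// a hypothetical stand-in for the real storage scheme.
func firstPerDay(paths []string) map[string]string {
	sorted := append([]string(nil), paths...) // copy so the caller's slice is untouched
	sort.Strings(sorted)

	picked := make(map[string]string)
	for _, p := range sorted {
		day := strings.SplitN(p, "/", 2)[0]
		if _, seen := picked[day]; !seen {
			picked[day] = p // later builds for this day are ignored
		}
	}
	return picked
}

func main() {
	// Hypothetical build IDs: the release-branch build happens to have the
	// lower ID, so it shadows the master build from the same day.
	paths := []string{
		"20190120/1101733", // master (assumed)
		"20190120/1101500", // release branch (assumed)
		"20190121/1102000", // master (assumed)
	}
	picked := firstPerDay(paths)
	fmt.Println(picked["20190120"])
	fmt.Println(picked["20190121"])
}
```

In this made-up input, the result chosen for `20190120` is the release-branch build, which is the kind of mix-up that would show up as a dip on the graph.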

There are a couple of mitigations and forward-looking changes I have in mind:

  1. We should capture the build hash for each test run somewhere.
  2. We should probably only store data for master runs, or at least store runs from different branches elsewhere.
  3. We need a way to go back and pick the right data for a given day. My first approach will likely be to change roachperf to process all of the data for a given day and keep the best result, which I'll take as a proxy for the newest one. It's not perfect, and it's entirely possible that some prior release performed better on some metric, but it will give us a better signal than what we've got today.
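
The third item, processing every result for a day and keeping the best one, could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual roachperf change: the `Run` type, its fields, and the throughput numbers are hypothetical.

```go
package main

import "fmt"

// Run is a hypothetical record of one nightly result; the real roachperf
// data format may look quite different.
type Run struct {
	Day       string
	BuildID   string
	OpsPerSec float64
}

// bestPerDay examines every run and keeps the highest-throughput one per day,
// using "best" as a proxy for "newest" as proposed in the list above.
func bestPerDay(runs []Run) map[string]Run {
	best := make(map[string]Run)
	for _, r := range runs {
		cur, ok := best[r.Day]
		if !ok || r.OpsPerSec > cur.OpsPerSec {
			best[r.Day] = r
		}
	}
	return best
}

func main() {
	// Made-up numbers purely for illustration.
	runs := []Run{
		{Day: "20190120", BuildID: "1101500", OpsPerSec: 9000},
		{Day: "20190120", BuildID: "1101733", OpsPerSec: 12000},
		{Day: "20190121", BuildID: "1102000", OpsPerSec: 12100},
	}
	best := bestPerDay(runs)
	fmt.Println(best["20190120"].BuildID, best["20190120"].OpsPerSec)
}
```

As the list notes, this heuristic picks the wrong run whenever an older release outperforms master on a metric, so recording the branch or build hash with each run remains the more reliable fix.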

@petermattis
Collaborator

Heh, this is somewhat amusing. I think it is worth writing up a short note to eng@ about the dips and how they were correlated across cloud providers, and then providing a reveal as to the cause.

Storing only master runs makes sense to me. At the very least, we should include the build branch in the storage directory.
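
A rough sketch of both options, assuming a hypothetical `<root>/<branch>/<day>/<buildID>` layout (none of these names come from the actual scripts):

```go
package main

import (
	"fmt"
	"path"
)

// resultDir builds a storage directory that includes the build branch, so
// results from different branches never land side by side.
// The layout and parameter names are assumptions for illustration.
func resultDir(root, branch, day, buildID string) string {
	return path.Join(root, branch, day, buildID)
}

// isMaster reports whether a run should be stored at all under a
// "master runs only" policy.
func isMaster(branch string) bool {
	return branch == "master"
}

func main() {
	fmt.Println(resultDir("artifacts", "release-2.1", "20190120", "1101500"))
	fmt.Println(isMaster("release-2.1"))
}
```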

@ajwerner
Contributor

Closing with the tweak to the TeamCity scripts and some cleanup. There's still more to be done to make roachperf's data collection and management more robust, but that's for a different issue.
