
Throughput performance drop #367

Closed
zezic opened this issue Oct 7, 2021 · 10 comments · Fixed by #368
Labels
bug Something isn't working

Comments

@zezic

zezic commented Oct 7, 2021

We are experiencing a significant throughput regression after commit ef71107, which was introduced with #358.
Before this commit our tests were able to run at ~18,000 RPS and it was possible to detect the bottleneck of our service, but after ef71107 we are only getting around ~6,000 RPS with the same tests, and now Goose is the bottleneck.

@jeremyandrews
Member

Are you using .set_wait_time()? If so, what are some example values you're setting?

I'll do some analysis; perhaps my performance regression test needs some enhancement -- I didn't see a regression in my earlier testing.

@zezic
Author

zezic commented Oct 7, 2021

No, I'm not using .set_wait_time(). Also, I'm not using GooseDefault::ThrottleRequests or anything else that should limit the request rate; I'm trying to achieve the maximum request rate.

@jeremyandrews
Member

I was able to duplicate this ... I need to review how it slipped through my pre-release testing, and will work on a fix to restore performance. Thanks for reporting!

@jeremyandrews
Member

jeremyandrews commented Oct 9, 2021

I've not had time to work on a proper solution yet. I have run a few tests, though, and confirmed that reverting the set_wait_time() PR restores performance (for each test I restarted all the server-side processes and reloaded the database, then ran the drupal_memcache loadtest 3 times in a row for 10 minutes):

0.13.3

  • 10,747 rps
  • 11,201 rps
  • 11,094 rps

0.14.1-dev

  • 8,055 rps
  • 8,410 rps
  • 8,460 rps

0.14.1-dev reverted set_wait_time() changes

  • 10,265 rps
  • 10,702 rps
  • 10,966 rps

@jeremyandrews
Member

@zezic Can you test these changes? #368

@zezic
Author

zezic commented Oct 12, 2021

@jeremyandrews Sure! I will check it today and report back as soon as possible.

jeremyandrews added the bug label on Oct 12, 2021
@zezic
Author

zezic commented Oct 12, 2021

@jeremyandrews I tested your branch https://github.com/jeremyandrews/goose/tree/revert-test and I can confirm that it fixes the issue! Now I'm getting identical results for v0.13.3 and v0.14.1-dev (revert-test); the difference is within the margin of error.

@jeremyandrews
Member

@zezic Thank you for reporting the problem and testing the fix! I'll get this merged and a release rolled soon.

@jeremyandrews
Member

To confirm: this regression happened because there is a very small amount of overhead in converting a Duration to an integer, and that overhead adds up when making tens of thousands of requests per second. The fix is to skip the Duration -> integer conversions entirely when set_wait_time() is not being used.
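
For illustration, a minimal sketch of that pattern (the type and field names here are hypothetical, not Goose's actual internals): keep the configured wait time as an Option and only pay for the conversion when it is actually set.

use std::time::Duration;

// Hypothetical sketch of the fix's idea, not Goose's real code: store the
// configured wait time as an Option and skip all Duration -> integer work
// on the hot path when no wait time was configured.
struct UserState {
    // None when .set_wait_time() was never called.
    wait_time: Option<(Duration, Duration)>,
}

impl UserState {
    // The conversion only happens when a wait time exists.
    fn wait_millis(&self) -> Option<(u64, u64)> {
        self.wait_time
            .map(|(min, max)| (min.as_millis() as u64, max.as_millis() as u64))
    }
}

fn main() {
    let no_wait = UserState { wait_time: None };
    assert_eq!(no_wait.wait_millis(), None); // no conversion performed at all

    let with_wait = UserState {
        wait_time: Some((Duration::from_millis(500), Duration::from_secs(2))),
    };
    assert_eq!(with_wait.wait_millis(), Some((500, 2000)));
}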

@zezic
Author

zezic commented Oct 13, 2021

You mean that Duration::as_millis() is producing that noticeable difference? I doubt it adds significant overhead, because it's just simple arithmetic with a few type casts – https://doc.rust-lang.org/src/core/time.rs.html#408

pub const fn as_millis(&self) -> u128 {
    self.secs as u128 * MILLIS_PER_SEC as u128 + (self.nanos / NANOS_PER_MILLI) as u128
}

My guess is that the problem has something to do with the granularity and the nature of the Tokio timers. In the documentation for the tokio::time::sleep() function I read that:

Sleep operates at millisecond granularity and should not be used for tasks that require high-resolution timers.

Maybe that's not the exact reason for the regression, but I suspect it's related to some mechanic of the Tokio timers that makes awaiting a sleep with a 0 ms duration produce some unwanted latency.
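
A rough, self-contained way to measure that suspicion (an illustration only, not Goose code; it assumes a binary with the tokio crate and its rt, time, and macros features available):

use std::time::{Duration, Instant};

#[tokio::main]
async fn main() {
    let iterations = 10_000;

    // Baseline: a minimal await that never touches the timer.
    let start = Instant::now();
    for _ in 0..iterations {
        tokio::task::yield_now().await;
    }
    println!("yield_now only: {:?}", start.elapsed());

    // Awaiting a zero-duration sleep; if the timer's millisecond granularity
    // adds latency here, that would explain a per-request slowdown at high RPS.
    let start = Instant::now();
    for _ in 0..iterations {
        tokio::time::sleep(Duration::from_millis(0)).await;
    }
    println!("sleep(0 ms): {:?}", start.elapsed());
}

If the second loop is dramatically slower than the first, the zero-duration sleep is indeed paying for a trip through the timer machinery.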
