io_tester: implement latency correction for jobs with RPS #2260
base: master
Conversation
Currently, for jobs with an RPS (requests-per-second) rate specified, io_tester measures only service time. This means that if servicing one of the requests takes much longer than the time available between consecutive requests (so issuing the next request is delayed), the measured latency does not reflect that fact. In such a case we issue fewer requests than required, and the high latency is reported only for the first request.

For instance, if there is a latency spike for one of the requests that exceeds the time available for service according to the RPS schedule, then the total number of scheduled requests does not match the expected count calculated as 'TOTAL = duration_seconds * RPS'.

Furthermore, the latency percentiles printed at the end of the simulation may show inaccurate data. Firstly, the count of samples is lower than expected. Secondly, if the time needed to handle requests returns to its ordinary value after the latency spike, our statistics show that only one request took long, which is not true - io_tester stopped sending requests at the given RPS, so the other requests could not be measured properly. This indicates that io_tester suffers from the coordinated omission problem.

This change implements a latency correction flag. When it is enabled, io_tester measures the total request latency, including the delay between the expected schedule time and the actual schedule time. Moreover, if any request takes more time than available, io_tester tries to schedule the 'delayed' requests as soon as possible to return to the defined schedule.

Signed-off-by: Patryk Wrobel <[email protected]>
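To make the measurement change concrete, here is a minimal sketch of the corrected calculation, assuming a steady-clock based schedule; the function and parameter names are hypothetical and this is not the PR's actual code:

```cpp
#include <chrono>

using clock_type = std::chrono::steady_clock;

// Hypothetical helper, not the PR's actual code.
// Without correction, io_tester records only the service time
// (completion - actual_issue). With correction, latency is measured against
// the moment the request should have been issued according to the RPS
// schedule, so any lag behind the schedule is included in the result.
clock_type::duration corrected_latency(clock_type::time_point expected_issue,
                                        clock_type::time_point actual_issue,
                                        clock_type::time_point completion) {
    auto service_time   = completion - actual_issue;     // what is measured today
    auto schedule_delay = actual_issue - expected_issue; // lag behind the RPS schedule
    return service_time + schedule_delay;                // == completion - expected_issue
}
```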
fb659bc to 7f0febd
From ScyllaDB blog - "Coordinated omission" [1]:
Also a short example of the behavior without the patch. Given the following configuration file:
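(The original file is not preserved in this excerpt; the sketch below is only illustrative. The field names follow io_tester's YAML job format as far as I can tell, and the values are assumptions, e.g. rps: 1000 so that a 30-second run expects 30 * 1000 = 30000 requests.)

```yaml
# Hypothetical job definition, not the one used in the PR discussion.
- name: latency_reads
  shards: all
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 1
    reqsize: 512
    rps: 1000        # assumed rate; 30 s * 1000 RPS => 30000 expected requests
```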
And executing io_tester with
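(The exact invocation is likewise not preserved; something along these lines, assuming io_tester's usual options, would match the numbers quoted below.)

```sh
# Hypothetical invocation; option names assumed, values illustrative.
./io_tester --conf job.yaml --storage /mnt/test --duration 30 -c1
```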
Despite expecting 30k issued requests, only 24035 were issued. Only service time was measured and the missed requests were not counted, so the final statistics may not reflect the actual state. When the same test is run with the code from this patch, the delay between the expected submission time of a request and its actual submission time is added to the final latency. Also, the missed requests are scheduled as soon as possible to get back on track with the schedule.
Hi @avikivity, @xemul, I tried to implement latency correction in io_tester - the fix is related to the coordinated omission problem. Currently, io_tester measures only service time.
I don't think io-tester should be overly smart. It's enough if we have a watchdog that detects whether each request finishes in a timely manner and screams into the logs otherwise. For that it can be as simple as
Or even simpler -- if the request latency exceeds the "expected" one -- print a warning.
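For what it's worth, a minimal sketch of that simpler suggestion might look like the following; the function and its parameters are hypothetical, not code from the PR or from the comment above:

```cpp
#include <chrono>
#include <iostream>

// Hypothetical check, not code from the PR or the comment above:
// warn when a request's latency exceeds the per-request budget
// implied by the configured RPS rate.
void warn_if_delayed(std::chrono::microseconds latency, unsigned rps) {
    // With an RPS schedule, each request has roughly 1/RPS seconds available.
    const auto budget = std::chrono::microseconds(1000000 / rps);
    if (latency > budget) {
        std::cerr << "io_tester: request latency " << latency.count()
                  << "us exceeds the 1/RPS budget of " << budget.count() << "us\n";
    }
}

int main() {
    warn_if_delayed(std::chrono::microseconds(2500), 1000); // 2.5 ms vs a 1 ms budget
}
```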