High latency of REST locally #803
I think there are a couple of initial comments:
Finally, gRPC will be much faster. See figure 1 in this blog post: https://www.intel.ai/inference-performance-boost-with-seldon-on-intel-xeon-scalable-processors/#gs.vrh9uz
Thanks, @cliveseldon! I will try the latest commit soon. I benchmarked with gRPC as well; it was half of the REST latency but still a bit high. We hope the total latency could be ~10 ms.
@cliveseldon I installed the latest commit of seldon-core and re-ran the profiling test, but didn't see a change:
The code path is still the same. My guess is the
Can we clarify which lines are the issue? Is it:
@cliveseldon With
@cliveseldon OK, looks like the code goes to
extracting the request still takes 30-40 ms.
Yes, I think parsing the proto is the issue: seldon-core/python/seldon_core/utils.py Lines 414 to 420 in 12d1bff
@axsaucedo can you look at this?
I think we should be able to optimize this without parsing the Proto.
For sure. As discussed this morning, this was done for validation, but given the proto is also processed in the SeldonEngine prior to this, it's possible to remove it. The work to be done here is to remove the current proto conversions from the Python wrapper, which should remove this overhead.
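For context, a rough sketch of the overhead under discussion, using `google.protobuf.Struct` as a stand-in for the wrapper's actual SeldonMessage proto, so this illustrates the pattern rather than seldon-core's exact code:

```python
import time
import numpy as np
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Struct

payload = {"data": {"ndarray": np.random.rand(200, 30).tolist()}}

# Pattern being removed: parse the JSON dict into a proto (for
# validation), then convert back before extracting the array.
start = time.perf_counter()
msg = json_format.ParseDict(payload, Struct())
arr_via_proto = np.array(json_format.MessageToDict(msg)["data"]["ndarray"])
print(f"via proto round-trip: {(time.perf_counter() - start) * 1000:.1f} ms")

# Direct path: read the ndarray straight out of the parsed JSON dict.
start = time.perf_counter()
arr_direct = np.array(payload["data"]["ndarray"])
print(f"direct:               {(time.perf_counter() - start) * 1000:.1f} ms")
```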
Thanks @cliveseldon and @axsaucedo. (Note: we are using Ambassador as the gateway for now, and I talked to Ryan; there is a problem that means we cannot use gRPC, see here and here. So REST could be the only option for us for now.)
Thanks for the heads up @lennon310. We've added #803 with the fixes for this issue. It basically removes the conversion into Proto (except for TFTensor), which should remove the overhead. Before landing it would be great if you could give it a try to see if it addresses the latency issues.
OK, I will re-run the profiler once this is merged and update the results here. Thanks!
I pulled @axsaucedo's commit locally and re-ran the profiler.
[REST profiler output screenshot]
gRPC:
[gRPC profiler output screenshot]
So (1) parsing in REST seems fixed, but in gRPC it is not, and (2) there is still baseline latency from the network request in both REST and gRPC. @cliveseldon
@lennon310 thank you for rerunning the profiler on the pull request. Good to see the REST path seems to be fixed; we should be able to merge this soon. That is correct, the current code changes are focused purely on the REST code path. Regarding the gRPC latency, is that still the case if you use the predict_raw function instead of the predict function? Furthermore, when you say there is baseline latency from the network, what do you mean? Is this on a Kubernetes deployment? If so, are you referring to latency in the Seldon Engine handling the request, or the network itself?
There should be no parsing in gRPC, as the message should already be a protobuf.
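For reference, a hedged sketch of what the predict_raw path mentioned above might look like in a model class. The exact method contract should be verified against the seldon-core version in use; the return format here is an assumption, and the field layout follows the Tensor message in prediction.proto:

```python
import numpy as np

class MyModel:
    def predict(self, X, features_names=None):
        # Standard path: X arrives as a numpy array, already converted
        # from the request by the wrapper.
        return X

    def predict_raw(self, msg):
        # Raw path: msg is assumed to be the incoming SeldonMessage proto
        # on the gRPC path, so the model controls the conversion and can
        # skip whatever it does not need.
        t = msg.data.tensor                      # flat values + shape
        X = np.fromiter(t.values, dtype=np.float64).reshape(tuple(t.shape))
        # Assumed return format: a SeldonMessage-compatible dict.
        return {"data": {"ndarray": X.tolist()}}
```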
Looks like testing locally has higher latency. I deployed @axsaucedo's branch to Kubernetes and ran a simple load test:
[load test results]
It looks good. Thanks for the help @axsaucedo and @cliveseldon!
@lennon310 from the positive response I assume you meant lower latency and higher throughput? Clive is correct: there should be no parsing in gRPC, as the message already comes as proto (except conversions to ndarray and back to proto for the response). We'll add a few more tests to the PR before landing. I've added it to our next release 👍
@axsaucedo Yeah, I meant the latency and throughput. So am I understanding correctly that with ndarray as input, the input data-type conversion in REST costs slightly more than sending the input as proto in gRPC, by probably no more than 10 ms (and this part makes gRPC faster)? Is this consistent with your team's benchmarking?
gRPC is faster for many reasons: the binary protocol, faster serialization/deserialization, and also because it shares a single HTTP/2 channel. See the Intel benchmarking post above in my initial reply.
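As a rough illustration of the serialization point only (this captures text-vs-binary encoding, not HTTP/2 multiplexing or the rest of the gRPC stack):

```python
import json
import timeit
import numpy as np

# Same 200 x 30 float payload used in the profiling above.
arr = np.random.rand(200, 30)

json_time = timeit.timeit(lambda: json.dumps(arr.tolist()), number=100)
bin_time = timeit.timeit(lambda: arr.tobytes(), number=100)

print(f"JSON encode:   {json_time / 100 * 1000:.3f} ms per call, "
      f"{len(json.dumps(arr.tolist()))} bytes")
print(f"binary encode: {bin_time / 100 * 1000:.3f} ms per call, "
      f"{arr.nbytes} bytes")
```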
It would be good to add your work as a PR to the python library as part of the test infrastructure, if you have time.
@cliveseldon I would love to. Which file(s) should I make changes to?
I would add a new entry to the Makefile (Lines 41 to 46 in b9d2032)
And add whatever files you need to run the test, plus extend tests_require in the setup.py.
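A hedged sketch of what such a latency test could look like. `get_rest_microservice` and the `/predict?json=...` request format follow patterns from seldon-core's existing Python tests but should be verified against the target version, and the 50 ms budget is purely illustrative:

```python
import json
import time
import numpy as np
from seldon_core.wrapper import get_rest_microservice  # assumed import path

class EchoModel:
    def predict(self, X, features_names=None):
        return X

def test_predict_latency():
    # Stand up the REST wrapper in-process and time a single predict call.
    app = get_rest_microservice(EchoModel())
    client = app.test_client()
    payload = json.dumps({"data": {"ndarray": np.random.rand(200, 30).tolist()}})

    start = time.perf_counter()
    rv = client.get("/predict", query_string={"json": payload})
    elapsed_ms = (time.perf_counter() - start) * 1000

    assert rv.status_code == 200
    assert elapsed_ms < 50  # illustrative budget; tune for CI hardware
```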
OK, I will do that. Thanks! |
I think we were talking about this, but I was still wondering: for the gRPC path, even if no parsing is used, is the time spent on getting data from the proto expected? I tested our service and gRPC (mean ~50 ms) seems to have higher latency than REST (mean ~25 ms), so I'm not sure if there is still something that needs optimizing? @cliveseldon
As per my other comment, the conversion is necessary: in gRPC the request is received as Proto, and it needs to be converted into a numpy array before it's passed to the python wrapper. You can access the raw proto through the predict_raw function. We're happy to take suggestions, but if you have a look at the functions, it's basically converting the Proto directly to a numpy array. If you find alternative implementations of methods to convert numpy arrays to/from proto objects that are more efficient, we would certainly be keen to introduce them.
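For anyone exploring such alternatives, here are two candidate conversions from a SeldonMessage Tensor (flat `values` plus `shape`, per prediction.proto) to a numpy array that could be benchmarked against the wrapper's current implementation; `msg` is an assumed, already-parsed SeldonMessage:

```python
import numpy as np

def tensor_to_array_naive(msg):
    # Materialize a python list first, then copy it into numpy.
    t = msg.data.tensor
    return np.array(list(t.values)).reshape(tuple(t.shape))

def tensor_to_array_fromiter(msg):
    # Stream the repeated field straight into a preallocated buffer,
    # avoiding the intermediate python list.
    t = msg.data.tensor
    n = int(np.prod(tuple(t.shape)))
    return np.fromiter(t.values, dtype=np.float64, count=n).reshape(
        tuple(t.shape))
```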
We just updated the image version of the operator, and this change now applies to the prepackaged server as well. However, it seems like the
I was wondering if the PR needs to apply to other code paths as well? @axsaucedo @cliveseldon
@lennon310 that is correct, this will be resolved with #739
I'm trying to profile the seldon client locally. I was using
seldon-core-microservice <MyModel> REST
to serve the model. I was using an XGBoost model with ~30 features and 200 rows. My script is:
The profiling result shows:
As you can see, the network has a latency of ~50 ms (see the API calls from connectionpool.py), and data conversion contributes ~30 ms (MessageToJson from protobuf). That makes our prediction take ~100+ ms, which is too slow.
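Since the original script was not captured in the thread, here is a hedged reconstruction of the kind of client profiling setup described. The `/predict` endpoint, port 5000, and the `json` form field follow the Seldon REST protocol defaults, but verify them against your setup:

```python
import cProfile
import json
import numpy as np
import requests

# Same shape as the reported workload: ~30 features, 200 rows.
payload = json.dumps({"data": {"ndarray": np.random.rand(200, 30).tolist()}})

def call_predict():
    # The wrapper is assumed to read the request from the `json` form field.
    r = requests.post("http://localhost:5000/predict", data={"json": payload})
    r.raise_for_status()
    return r.json()

# Sorting by cumulative time surfaces the network frames (connectionpool.py)
# and protobuf conversion frames (e.g. MessageToJson) discussed above.
cProfile.run("call_predict()", sort="cumulative")
```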