WIP: Data Ingestor Batch Processing With Long Running and Finite Functionality #1505
Issues go stale after 30d of inactivity.
Closing in favour of #1714 |
Data Ingestor Batch Processing
Following on from the learnings obtained whilst building the multi-streaming async implementation in #1447, this PR contains a version that decouples the batch functionality into a new component called the "Data Ingestor".
The Data Ingestor is a generic extra component within a Seldon Deployment that provides the batch capabilities outlined in the requirements for #1413 and #1391, namely:
Currently the functionality for [2.2] is still being explored, but it is now a matter of deciding on the best implementation practice; that is, whether we want to handle the functionality from the Operator or from the Seldon Deployment itself.
Furthermore, the functionality for [2.3], scheduled jobs, is not covered in this PR and will be explored in a future PR.
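As a rough illustration of what a finite-batch run through such a component could look like, here is a minimal Python sketch; it is not code from this PR, and the `SeldonClient` parameters, deployment name, and record format are assumptions made purely for the example.

```python
# Hypothetical sketch of a finite "Data Ingestor" run: feed a fixed set of
# records through a deployed Seldon graph and collect the responses.
# Names and endpoints below are illustrative assumptions, not part of this PR.
import numpy as np
from seldon_core.seldon_client import SeldonClient

def run_finite_batch(records,
                     deployment_name="my-model",       # assumed deployment name
                     namespace="seldon",
                     gateway_endpoint="localhost:8003"):
    client = SeldonClient(
        deployment_name=deployment_name,
        namespace=namespace,
        gateway_endpoint=gateway_endpoint,
        transport="grpc",  # long-running use assumes a gRPC graph (see below)
    )
    results = []
    for record in records:
        prediction = client.predict(data=np.array([record]))
        results.append(prediction.response)
    return results

if __name__ == "__main__":
    print(run_finite_batch([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]))
```

A long-running variant would replace the fixed `records` list with a blocking source such as a Kafka consumer (see the example links below).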
Long Running Capabilities
After some research into the internals of gRPC, it was confirmed that unary gRPC calls establish the same HTTP/2 channel for bidirectional message passing as the one set up for gRPC streaming, and hence it is built for reliable long-lived connections. I recommend this fantastic YouTube talk, which dives into the internals of gRPC: https://www.youtube.com/watch?v=S7WIYLcPS1Y
The only thing is that we may have to consider restricting batch to work only over gRPC, or somehow making it very explicit that any "long running" capabilities require the graph to be set up with gRPC. This could be added into the CRD definition as
"long-running": "true/false"
which could be made compulsory so that the check is done explicitly, but it is certainly something to explore further. More specifically, the attached diagram gives a high-level overview of the gRPC connection.
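To make the channel-reuse point concrete, here is a minimal sketch (not from this PR) of many unary gRPC calls multiplexed over one long-lived channel; the target address and fully-qualified method path are placeholder assumptions.

```python
# Minimal sketch (an assumption, not code from this PR): many unary calls
# multiplexed over a single long-lived gRPC channel, i.e. one HTTP/2
# connection, which is what makes long-running batch feeds viable without
# switching to streaming RPCs.
import grpc

def feed_batch(serialized_requests,
               target="localhost:5000",                  # placeholder address
               method="/seldon.protos.Seldon/Predict"):  # placeholder method path
    # The channel is created once and reused for every unary call below.
    with grpc.insecure_channel(target) as channel:
        call = channel.unary_unary(method)  # no serializers: raw bytes in/out
        for request_bytes in serialized_requests:
            yield call(request_bytes, timeout=30.0)
```

Because the channel (and its underlying HTTP/2 connection) is created once and reused, a long-running ingestor pays the connection-setup cost only once, which is the property discussed above.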
Functionality Added
This PR encompasses the following functionality:
The Kafka example can be found at the following link: https://github.com/SeldonIO/seldon-core/blob/868fecd0078555e7d76778a63bc961e9079ef6da/examples/batch/kafka_streaming/README.md
The ELK example can be found at the following link: https://github.com/SeldonIO/seldon-core/blob/868fecd0078555e7d76778a63bc961e9079ef6da/examples/batch/elastic_search_simple/README.md
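In the spirit of the Kafka example linked above, a long-running ingestor loop might look roughly like the following sketch; the topic names, broker address, kafka-python usage, and `SeldonClient` parameters are all assumptions for illustration and are not taken from the linked notebooks.

```python
# Rough sketch of a long-running ingestor loop in the spirit of the Kafka
# example above. Topic names, the broker address, and the SeldonClient
# parameters are illustrative assumptions, not values from the linked notebook.
import json
import numpy as np
from kafka import KafkaConsumer, KafkaProducer
from seldon_core.seldon_client import SeldonClient

consumer = KafkaConsumer(
    "inference-input",                       # assumed input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
client = SeldonClient(deployment_name="my-model", namespace="seldon",
                      gateway_endpoint="localhost:8003", transport="grpc")

# Long-running loop: block on new messages and forward predictions downstream.
for message in consumer:
    prediction = client.predict(data=np.array([message.value["data"]]))
    producer.send("inference-output", {"success": prediction.success,
                                       "msg": prediction.msg})
```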
Still TODO in this PR