
WIP: Data Ingestor Batch Processing With Long Running and Finite Functionality #1505

Closed
wants to merge 3 commits

Conversation

axsaucedo
Contributor

@axsaucedo commented Mar 5, 2020

Data Ingestor Batch Processing

Following the learnings obtained while building the multi-streaming async implementation in #1447, this PR contains a version that decouples the batch functionality into a new component called the "Data Ingestor".

The Data Ingestor is a generic extra component within a Seldon Deployment that enables the batch capabilities outlined in the requirements of #1413 and #1391, namely:

  • [1.1] Long-running batch jobs
  • [1.2] Non-long-running batch jobs
  • [2.1] Jobs that are asynchronous
  • [2.2] Jobs that terminate when finite dataset has been processed

The functionality for [2.2] is still being explored, but it is now a question of which implementation would be best practice; that is, whether we want to handle the functionality from the Operator or from the Seldon Deployment itself.

Furthermore, the functionality for [2.3] Scheduled jobs is not covered in this PR and will be explored in a future PR.

Long Running Capabilities

After some research into the internals of GRPC, it was confirmed that unary GRPC calls establish the same HTTP/2 channel for bidirectional message passing as the one set up for GRPC streaming, and hence it is built for reliable long-lived connections. I recommend this fantastic YouTube talk, which dives into the internals of GRPC: https://www.youtube.com/watch?v=S7WIYLcPS1Y

The only caveat is that we may have to consider restricting batch to work only over GRPC, or somehow make it very explicit that any "long running" capabilities require the graph to be set up with GRPC. This could be added to the CRD definition as "long-running": "true/false", which could be made compulsory so that the check is done explicitly, but it is certainly something to explore further.

More specifically, this is a high-level overview of the GRPC connection:

[Image: high-level overview of the GRPC connection]
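To make the long-running point concrete, below is a minimal sketch (not code from this PR) of reusing a single GRPC channel for many unary Predict calls; because the underlying HTTP/2 connection is established once and kept alive, unary calls are suitable for long-running batch jobs. The seldon_core.proto imports, the SeldonStub/SeldonMessage names and the localhost:5000 address are assumptions and may not match the actual package layout.

```python
# Minimal sketch, not code from this PR: one gRPC channel (one HTTP/2
# connection) is created up front and reused for every unary Predict call.
# The Seldon proto imports and message fields are assumptions and may differ
# from the actual seldon-core Python package.
import grpc

from seldon_core.proto import prediction_pb2, prediction_pb2_grpc  # assumed module path


def run_batch(rows, host="localhost:5000"):
    with grpc.insecure_channel(
        host,
        options=[("grpc.keepalive_time_ms", 60000)],  # keep the connection alive between calls
    ) as channel:
        stub = prediction_pb2_grpc.SeldonStub(channel)  # assumed stub name
        for row in rows:
            request = prediction_pb2.SeldonMessage(
                data=prediction_pb2.DefaultData(
                    tensor=prediction_pb2.Tensor(shape=[1, len(row)], values=row)
                )
            )
            # Each unary call is multiplexed over the same long-lived connection.
            yield stub.Predict(request, timeout=30)


if __name__ == "__main__":
    for response in run_batch([[1.0, 2.0], [3.0, 4.0]]):
        print(response)
```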

Functionality Added

This PR encompasses the following functionality:

  • Seldon Core Python library extension to support a BatchWorker class wrapper (a hypothetical sketch of such a worker is shown after this list)
  • Simple "finite-dataset" example that consumes from an ELK stack once
  • Simple "streaming dataset" example that continuously consumes from and publishes to a Kafka cluster

The Kafka example can be found at the following link: https://github.com/SeldonIO/seldon-core/blob/868fecd0078555e7d76778a63bc961e9079ef6da/examples/batch/kafka_streaming/README.md

[Image: Kafka streaming example]
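For reference, a minimal sketch of the streaming pattern, written with kafka-python rather than taken from the linked example; the topic names, group id and predict() placeholder are assumptions.

```python
# Minimal sketch of the "streaming dataset" pattern, using kafka-python.
# The topic names, group id and predict() placeholder are assumptions and
# are not taken from the linked example.
import json

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "model-input",                          # assumed input topic
    bootstrap_servers="localhost:9092",
    group_id="seldon-batch-ingestor",       # assumed group id
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)


def predict(payload):
    # Placeholder for the call into the Seldon graph (e.g. the unary GRPC
    # Predict call sketched earlier).
    return {"input": payload, "prediction": None}


# Long-running loop: consumes and publishes indefinitely ([1.1]/[2.1]).
for message in consumer:
    producer.send("model-output", predict(message.value))  # assumed output topic
```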

The ELK example can be found at the following link: https://github.com/SeldonIO/seldon-core/blob/868fecd0078555e7d76778a63bc961e9079ef6da/examples/batch/elastic_search_simple/README.md

[Image: Elasticsearch example]
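Similarly, a minimal sketch of the finite pattern using the elasticsearch Python client; the index name, query and predict() placeholder are assumptions, not code from the linked example.

```python
# Minimal sketch of the "finite dataset" pattern, using the elasticsearch
# Python client. The index name, query and predict() placeholder are
# assumptions and are not taken from the linked example.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(hosts=["http://localhost:9200"])


def predict(payload):
    # Placeholder for the call into the Seldon graph.
    return {"input": payload, "prediction": None}


# Finite job ([2.2]): iterate over the index exactly once, then exit.
processed = 0
for hit in scan(es, index="batch-input", query={"query": {"match_all": {}}}):
    predict(hit["_source"])
    processed += 1

print(f"Processed {processed} documents")
```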

Still TODO in this PR

  • Explore how to have re-usable source/process/sink classes (a la prepackaged)
    • Currently data ingestors seem quite generic, so they could be built separately (like the TFServing example) to simplify these examples
  • Explore the CRD structure for the data ingestor, as well as the terminology to be represented in the CRD
  • For the Kafka example, potentially use defaults for group_id and topics based on predictor identifiers
  • Explore ways to make a multiprocess-safe logger available to methods
  • A structured way to handle failures at either the job level or the data-point level
  • Explore whether to restrict to just GRPC to ensure long-running behaviour
  • Explore the best way to deal with data logging
  • Explore the best way to handle this batch functionality from a management-tool perspective

@seldondev
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: axsaucedo

If they are not already assigned, you can assign the PR to them by writing /assign @axsaucedo in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@seldondev
Collaborator

Thu Mar 5 07:51:57 UTC 2020
The logs for [pr-build] [1] will show after the pipeline context has finished.
https://github.com/SeldonIO/seldon-core/blob/gh-pages/jenkins-x/logs/SeldonIO/seldon-core/PR-1505/1.log

If you are impatient, try:
jx get build logs SeldonIO/seldon-core/PR-1505 --build=1

@seldondev
Collaborator

Thu Mar 5 07:52:00 UTC 2020
The logs for [lint] [2] will show after the pipeline context has finished.
https://github.com/SeldonIO/seldon-core/blob/gh-pages/jenkins-x/logs/SeldonIO/seldon-core/PR-1505/2.log

If you are impatient, try:
jx get build logs SeldonIO/seldon-core/PR-1505 --build=2

@axsaucedo
Contributor Author

axsaucedo commented Mar 5, 2020

Long Running Capabilities

After some research into the internals of GRPC, it was confirmed that unary GRPC calls establish the same HTTP/2 channel for bidirectional message passing as the one set up for GRPC streaming, and hence it is built for reliable long-lived connections. I recommend this fantastic YouTube talk, which dives into the internals of GRPC: https://www.youtube.com/watch?v=S7WIYLcPS1Y

The only caveat is that we may have to consider restricting batch to work only over GRPC, or somehow make it very explicit that any "long running" capabilities require the graph to be set up with GRPC. This could be added to the CRD definition as "long-running": "true/false", which could be made compulsory so that the check is done explicitly, but it is certainly something to explore further.

More specifically, this is a high-level overview of the GRPC connection:

[Image: high-level overview of the GRPC connection]

@seldondev
Collaborator

Issues go stale after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale

@seldondev added the lifecycle/stale label Apr 23, 2020
@axsaucedo
Contributor Author

Closing in favour of #1714

@axsaucedo closed this May 18, 2020
Labels: do-not-merge/work-in-progress, lifecycle/stale, size/XL