-
Notifications
You must be signed in to change notification settings - Fork 834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(dataflow): handle pipeline errors and clear kafka streams state #5358
Merged
lc525
merged 4 commits into
SeldonIO:v2
from
lc525:fix.dataflow.kafka-streams-state-cleanup
Feb 26, 2024
Merged
fix(dataflow): handle pipeline errors and clear kafka streams state #5358
lc525
merged 4 commits into
SeldonIO:v2
from
lc525:fix.dataflow.kafka-streams-state-cleanup
Feb 26, 2024
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Fixes INFRA-755 (internal): Exceptions in pipeline creation/deletion are unrecoverable because of uncleaned Kafka Streams state Because we're now continuously retrying the subscription to the scheduler when the subscription terminates, we need to also clean the kafka streams state of failed pipelines when the subscription is terminated with an error. This commit also introduces a new way of handling errors and the changes in pipeline status. It moves us towards being able to inspect the status of the status of the underlying kafka streams from outside the Pipeline object. The end-goal (not archieved here) is to react to pipelines reaching error states during their operation, rather than just when they are created for the first time (what's currently implemented). Instead of propagating exceptions, the idea is to return and handle common errors in a more similar way to golang.
lc525
commented
Feb 23, 2024
scheduler/data-flow/src/main/kotlin/io/seldon/dataflow/PipelineSubscriber.kt
Show resolved
Hide resolved
lc525
commented
Feb 23, 2024
scheduler/data-flow/src/main/kotlin/io/seldon/dataflow/kafka/Pipeline.kt
Outdated
Show resolved
Hide resolved
Moved to draft as I've realised a couple more improvements can/should be made before merging |
- propose a new way of reporting error states and returning them from function calls - deal with edge-cases and document them in code
Revamps error handling for pipelines so that: 1. Scheduler subscription is not ended when kafka stream errors/exceptions appear 2. Any error messages are propagated towards the scheduler so that it can show the reason for a broad array of cases where the kafka streams app for a given pipeline gets stopped. 3. Lay groundwork so that errors are propagated to the scheduler not only at pipeline creation time but on transitioning to an error state.
lc525
commented
Feb 25, 2024
lc525
commented
Feb 25, 2024
scheduler/data-flow/src/main/kotlin/io/seldon/dataflow/kafka/Configuration.kt
Show resolved
Hide resolved
lc525
commented
Feb 25, 2024
scheduler/data-flow/src/main/kotlin/io/seldon/dataflow/kafka/Configuration.kt
Show resolved
Hide resolved
lc525
changed the title
fix(dataflow): clear kafka stream state for failed pipelines
fix(dataflow): handle pipeline errors and clear kafka streams state
Feb 25, 2024
sakoush
approved these changes
Feb 26, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great change! I have added comments / questions to the extent I can read kotlin.
scheduler/data-flow/src/main/kotlin/io/seldon/dataflow/DataflowStatus.kt
Outdated
Show resolved
Hide resolved
scheduler/data-flow/src/main/kotlin/io/seldon/dataflow/kafka/PipelineStatus.kt
Outdated
Show resolved
Hide resolved
scheduler/data-flow/src/main/kotlin/io/seldon/dataflow/kafka/PipelineStatus.kt
Outdated
Show resolved
Hide resolved
scheduler/data-flow/src/main/kotlin/io/seldon/dataflow/kafka/PipelineStatus.kt
Outdated
Show resolved
Hide resolved
scheduler/data-flow/src/main/kotlin/io/seldon/dataflow/kafka/PipelineStatus.kt
Outdated
Show resolved
Hide resolved
scheduler/data-flow/src/main/kotlin/io/seldon/dataflow/kafka/PipelineStatus.kt
Show resolved
Hide resolved
scheduler/data-flow/src/main/kotlin/io/seldon/dataflow/kafka/Pipeline.kt
Show resolved
Hide resolved
scheduler/data-flow/src/main/kotlin/io/seldon/dataflow/kafka/Pipeline.kt
Outdated
Show resolved
Hide resolved
scheduler/data-flow/src/main/kotlin/io/seldon/dataflow/kafka/Pipeline.kt
Show resolved
Hide resolved
scheduler/data-flow/src/main/kotlin/io/seldon/dataflow/kafka/Configuration.kt
Show resolved
Hide resolved
addressing PR comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What this PR does / why we need it:
Because we're now continuously retrying the subscription to the scheduler when the subscription terminates, we need to also clean the kafka streams state of failed pipelines when the subscription is terminated with an error.
This commit also introduces a new way of handling errors and the changes in pipeline status. It moves us towards being able to inspect the status of the underlying kafka streams from outside the Pipeline object. The end-goal (not archieved here) is to react to pipelines reaching error states during their operation, rather than just when they are created for the first time (what's currently implemented). However, this PR does get us to a point where the pipeline creation status, including possible errors/exceptions are reported to the scheduler and can show up either in the seldon CLI or in k8s.
Instead of propagating exceptions, the idea is to return and handle common errors in a more similar way to golang.
Which issue(s) this PR fixes:
** PR Merge Sequencing GROUP 663 **
This is part of a sequence of PRs building on each-other, but split in order to simplify reviewing.
This PR has sequence no: G663/1
Next to review in sequence: lc525#47 : fix(dataflow): wait for kafka topic creation