-
Notifications
You must be signed in to change notification settings - Fork 680
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Flyte Map Tasks should only retry subtasks #1276
Comments
There is precedence for us passing the "retry count" down to the downstream system (e.g. Qubole) to do the retries there... I think the right thing here is to follow that and do the retrying in the K8s Array Plugin. One way to do this is:
P.S. We should not use the native Pod retries because there is no way to avoid clobbering the output directory and noway to separate out the logs... besides, they only restart failing container not the entire pod... |
Describe the bug
Currently map tasks will retry the entire map task(all sub tasks) if a single subtask fails and count the retry toward the entire map task. This isn't ideal and should only be retried for that specific subtask that failed and be counted toward a single subtask retry.
We can technically get around this by setting retries to a very large number and with caching at some point it would make progress to completion but this is not ideal.
Expected behavior
I would expect map tasks to only retry failed sub tasks even if caching is not specified and to have retries count at the sub task level.
For example if retries are set to 2 and I have 10 subtasks, each subtask can fail twice before the entire map task would be counted as Failed. If any subtask fails more than the retry limit set, I believe it is reasonable to fail the entire map task and stop any other sub tasks currently running.
[Optional] Additional context
To Reproduce
Steps to reproduce the behavior:
1.
2.
Screenshots
If applicable, add screenshots to help explain your problem.
The text was updated successfully, but these errors were encountered: