[BUG] Driver in OOMKilled state not restarted by Spark Operator (2.1.0-rc.0) #2299
Comments
Can you try increasing the retry interval to about 10 seconds? This looks like a race condition to me:
This may be related to #2285.
I'm also having the same issue. I'm using spark-operator 2.0.2.
Application state:
But it never retries.
@rui-oliveira-bentley try using
Thanks for the feedback, @josecsotomorales. It seems to be fixed in v2.1.0-rc.0 (with onFailureRetryInterval: 600).
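For readers who want to try the same setting, here is a minimal sketch of building a JSON merge patch for the SparkApplication `spec.restartPolicy` block. The `onFailureRetryInterval` value of 600 comes from the comment above and is expressed in seconds; the `OnFailure` type and the retry count of 3 are assumptions chosen for illustration, not values reported in this thread, and the helper name `restart_policy_patch` is hypothetical.

```python
import json


def restart_policy_patch(retries: int, interval_seconds: int) -> str:
    """Return a JSON merge patch that sets spec.restartPolicy on a SparkApplication.

    The field names follow the SparkApplication v1beta2 restartPolicy schema.
    The values passed in are the caller's choice.
    """
    patch = {
        "spec": {
            "restartPolicy": {
                "type": "OnFailure",  # assumption: retry only when the run fails
                "onFailureRetries": retries,  # assumption: not stated in the thread
                "onFailureRetryInterval": interval_seconds,  # seconds between retries
            }
        }
    }
    return json.dumps(patch, indent=2)


# 600 is the interval quoted in the thread; 3 retries is illustrative only.
patch_json = restart_policy_patch(retries=3, interval_seconds=600)
print(patch_json)
```

The printed JSON could be saved to a file and applied as a merge patch with kubectl, or the same three fields could be written directly under `spec.restartPolicy` in the application manifest.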
@rui-oliveira-bentley you can use it in prod. I have been using it since it launched and it's pretty much stable. @ChenYi015 do we have a timeline for an official 2.1.0?
What happened?
Hello,
I'm currently using the latest version of Spark Operator (2.1.0-rc.0) on Kubernetes. I noticed that when the Spark driver is OOMKilled, the operator does not restart the driver, which is inconsistent with the expected behavior.
It's possible that this issue may be related to or should have been resolved by PR #2241 and issue #2237.
Details:
1. Observed Behavior:
2. Logs:
Could you please look into this issue? Let me know if further details or logs are needed to help with debugging.
Thank you!
Pod on Rancher:
Driver pods and the containers inside them (Istio is deployed on the cluster):
SparkApplication status on Rancher:
Output of describing the driver pod on the CLI, showing the OOMKilled timestamp and finished status:
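As a rough way to confirm the same condition without reading the describe output by hand, the sketch below scans a pod object (the JSON form returned by the Kubernetes API) for containers whose current or last terminated state has the reason `OOMKilled`. The sample pod, its container names, and the helper name `oom_killed_containers` are made up for illustration and are not taken from this issue.

```python
def oom_killed_containers(pod: dict) -> list:
    """Return the names of containers in a pod object that were OOMKilled.

    Checks both the current state and the last state of each container,
    since a restarted container reports the kill under lastState.
    """
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = status.get(key, {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break  # avoid listing the same container twice
    return names


# Hypothetical pod status: a driver container that was OOMKilled next to a
# still-running sidecar, similar in shape to a driver pod on an Istio cluster.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "spark-kubernetes-driver",
                "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {"name": "istio-proxy", "state": {"running": {}}},
        ]
    }
}

print(oom_killed_containers(sample_pod))
```

In practice the input would come from loading the JSON output of a pod lookup for the driver pod rather than from a hard-coded dictionary.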
Reproduction Code
No response
Expected behavior
No response
Actual behavior
No response
Environment & Versions
Additional context
No response
Impacted by this bug?
Give it a 👍 We prioritize the issues with most 👍