
[BUG] Driver in OOMKilled state not restarted by Spark Operator (2.1.0-rc.0) #2299

Open · TheDevilDan opened this issue Oct 29, 2024 · 6 comments
Labels: kind/bug (Something isn't working)

TheDevilDan commented Oct 29, 2024

What happened?

Hello,

I'm currently using the latest version of Spark Operator (2.1.0-rc.0) on Kubernetes. I noticed that when the Spark driver is OOMKilled, the operator does not restart the driver, which is inconsistent with the expected behavior.

It's possible that this issue is related to, or should have been resolved by, PR #2241 and issue #2237.

Details:

1. Observed Behavior:

  • Yesterday at 20:00, the driver was successfully restarted after being killed (possibly OOMKilled; I'm not certain).
  • Today at 9:15 AM, the driver entered the OOMKilled state, but this time, it was not restarted by the Spark Operator.
  • I've attached screenshots to illustrate the issue (noting that Istio is deployed on the cluster).

2. Logs:

  • I have logs available from 20:00 when the operator successfully restarted the driver.
  • However, there are no logs corresponding to the event at 9:15 AM, when the driver failed to restart.

Could you please look into this issue? Let me know if further details or logs are needed to help with debugging.

Thank you!

Pod on Rancher (screenshot):

Driver pods and the containers inside them, with Istio deployed on the cluster (screenshot):

SparkApplication status on Rancher (screenshot):

Driver describe from the CLI, showing the OOMKilled timestamp and the finished container status (screenshot):

  Restart Policy:
    On Failure Retry Interval:             5
    On Submission Failure Retry Interval:  5
    Type:                                  Always
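
For reference, the policy above (from the describe output) maps to the spec.restartPolicy block of the SparkApplication. A minimal sketch of that manifest, with placeholder metadata rather than the real values from my cluster:

  apiVersion: sparkoperator.k8s.io/v1beta2
  kind: SparkApplication
  metadata:
    name: my-spark-app        # placeholder
    namespace: spark-jobs     # placeholder
  spec:
    restartPolicy:
      type: Always
      onFailureRetryInterval: 5             # seconds before restarting a failed run
      onSubmissionFailureRetryInterval: 5   # seconds before retrying a failed submission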

Reproduction Code

No response

Expected behavior

No response

Actual behavior

No response

Environment & Versions

  • Kubernetes Version:
  • Spark Operator Version:
  • Apache Spark Version:

Additional context

No response

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

TheDevilDan added the kind/bug label on Oct 29, 2024
josecsotomorales (Contributor) commented:

Can you try increasing the retry interval to about 10 seconds? This looks like a race condition to me:

  restartPolicy:
    type: Always
    onFailureRetries: 10
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 10
    onSubmissionFailureRetryInterval: 10
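
That block goes under spec.restartPolicy in the SparkApplication manifest. To check whether the operator is actually counting attempts, the application status should end up looking roughly like the sketch below (illustrative values; executionAttempts and submissionAttempts are the v1beta2 status field names as I remember them, so treat them as an assumption):

  status:
    applicationState:
      state: RUNNING
    executionAttempts: 2    # attempts to run the application to completion
    submissionAttempts: 2   # attempts to submit the application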

josecsotomorales (Contributor) commented:

This may be related to #2285

rui-oliveira-bentley commented:

I'm also having the same issue. I'm using spark-operator 2.0.2, and my restartPolicy is:

  restartPolicy:
    onFailureRetries: 4
    onFailureRetryInterval: 600
    onSubmissionFailureRetries: 4
    onSubmissionFailureRetryInterval: 600
    type: OnFailure

Application state:

  status:
    applicationState:
      errorMessage: 'driver container failed with ExitCode: 143, Reason: Error'
      state: FAILING

But it never retries.
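
For clarity, my reading of what each field should control (an annotated sketch of the same policy; the comments are my interpretation of the docs, not verified behavior):

  restartPolicy:
    type: OnFailure                          # restart only when the application fails
    onFailureRetries: 4                      # max restarts after a failed run
    onFailureRetryInterval: 600              # seconds to wait between failure retries
    onSubmissionFailureRetries: 4            # max retries if spark-submit itself fails
    onSubmissionFailureRetryInterval: 600    # seconds between submission retries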

josecsotomorales (Contributor) commented:

@rui-oliveira-bentley try using v2.1.0-rc.0; that issue was fixed in that pre-release version.

rui-oliveira-bentley commented:

Thanks for the feedback @josecsotomorales, it seems fixed in v2.1.0-rc.0 (with onFailureRetryInterval: 600).
Do you know when version 2.1.0 will be released?
Do you think it's a good idea to use v2.1.0-rc.0 in a production environment, or would it be better to wait for the final release?

josecsotomorales (Contributor) commented:

@rui-oliveira-bentley you can use it in prod; I have been using it since it launched and it's pretty much stable. @ChenYi015 do we have a timeline for an official 2.1.0?
