
[BUG] Driver in OOMKilled state not restarted by Spark Operator (2.1.0-rc.0) #2299

Open · TheDevilDan opened this issue Oct 29, 2024 · 6 comments
Labels: kind/bug (Something isn't working)

TheDevilDan commented Oct 29, 2024

What happened?

Hello,

I'm currently using the latest version of Spark Operator (2.1.0-rc.0) on Kubernetes. I noticed that when the Spark driver is OOMKilled, the operator does not restart the driver, which is inconsistent with the expected behavior.

It's possible that this issue is related to, or should have been resolved by, PR #2241 and issue #2237.

Details:

1. Observed Behavior:

  • Yesterday at 20:00, the driver was successfully restarted after being killed (possibly OOMKilled; I'm not certain).
  • Today at 9:15 AM, the driver entered the OOMKilled state, but this time, it was not restarted by the Spark Operator.
  • I've attached screenshots to illustrate the issue (noting that Istio is deployed on the cluster).

2. Logs:

  • I have logs available from 20:00 when the operator successfully restarted the driver.
  • However, there are no logs corresponding to the event at 9:15 AM, when the driver failed to restart.

Could you please look into this issue? Let me know if further details or logs are needed to help with debugging.

Thank you!

Pod on Rancher (screenshot):

Driver pods and the containers inside them, with Istio deployed on the cluster (screenshot):

SparkApplication status on Rancher (screenshot):

Driver describe from the CLI, showing the OOMKilled timestamp and the finished container status (screenshot):

  Restart Policy:
    On Failure Retry Interval:             5
    On Submission Failure Retry Interval:  5
    Type:                                  Always
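
For reference, the policy above (from the describe output) maps to the spec.restartPolicy block of the SparkApplication. A minimal sketch of that manifest, with placeholder metadata rather than the real values from my cluster:

  apiVersion: sparkoperator.k8s.io/v1beta2
  kind: SparkApplication
  metadata:
    name: my-spark-app        # placeholder
    namespace: spark-jobs     # placeholder
  spec:
    restartPolicy:
      type: Always
      onFailureRetryInterval: 5             # seconds before restarting a failed run
      onSubmissionFailureRetryInterval: 5   # seconds before retrying a failed submission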

Reproduction Code

No response

Expected behavior

No response

Actual behavior

No response

Environment & Versions

  • Kubernetes Version:
  • Spark Operator Version:
  • Apache Spark Version:

Additional context

No response

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

TheDevilDan added the kind/bug label on Oct 29, 2024
josecsotomorales (Contributor) commented:

Can you try increasing the retry interval to about 10 seconds? This looks like a race condition to me:

  restartPolicy:
    type: Always
    onFailureRetries: 10
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 10
    onSubmissionFailureRetryInterval: 10
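
That block goes under spec.restartPolicy in the SparkApplication manifest. To check whether the operator is actually counting attempts, the application status should end up looking roughly like the sketch below (illustrative values; executionAttempts and submissionAttempts are the v1beta2 status field names as I remember them, so treat them as an assumption):

  status:
    applicationState:
      state: RUNNING
    executionAttempts: 2    # attempts to run the application to completion
    submissionAttempts: 2   # attempts to submit the application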

josecsotomorales (Contributor) commented:

This may be related to #2285

rui-oliveira-bentley commented:

I'm also having the same issue. I'm using spark-operator 2.0.2, and my restartPolicy is:

  restartPolicy:
    onFailureRetries: 4
    onFailureRetryInterval: 600
    onSubmissionFailureRetries: 4
    onSubmissionFailureRetryInterval: 600
    type: OnFailure

Application state:

  status:
    applicationState:
      errorMessage: 'driver container failed with ExitCode: 143, Reason: Error'
      state: FAILING

But it never retries.
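
For clarity, my reading of what each field should control (an annotated sketch of the same policy; the comments are my interpretation of the docs, not verified behavior):

  restartPolicy:
    type: OnFailure                          # restart only when the application fails
    onFailureRetries: 4                      # max restarts after a failed run
    onFailureRetryInterval: 600              # seconds to wait between failure retries
    onSubmissionFailureRetries: 4            # max retries if spark-submit itself fails
    onSubmissionFailureRetryInterval: 600    # seconds between submission retries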

josecsotomorales (Contributor) commented:

@rui-oliveira-bentley try using v2.1.0-rc.0; that issue was fixed in that pre-release version.

rui-oliveira-bentley commented:

Thanks for the feedback @josecsotomorales, it seems fixed in v2.1.0-rc.0 (with onFailureRetryInterval: 600).
Do you know when version 2.1.0 will be released?
Do you think it's a good idea to use v2.1.0-rc.0 in a production environment, or would it be better to wait for the final release?

josecsotomorales (Contributor) commented:

@rui-oliveira-bentley you can use it in prod; I have been using it since it launched and it's pretty much stable. @ChenYi015 do we have a timeline for an official 2.1.0?
