Search before asking
I have searched in the issues and found no similar issues.
What would you like to be improved?
Problem Description:
When multiple Spark jobs are submitted to the same Kubernetes (K8s) namespace via Kyuubi at roughly the same time, several Spark drivers may all start successfully and enter the Running state. Together they can exhaust the namespace's resources, leaving nothing for their executors to start with. Each driver then keeps waiting for resources to launch its executors while holding on to the resources it already has, so the drivers end up in a deadlock, mutually waiting on one another.
How should we improve?
Solution:
We tried using YuniKorn gang scheduling to solve this issue, but it cannot completely avoid the deadlock. Therefore, we adopted another approach: adding a switch to Kyuubi. When the switch is turned on, Spark jobs submitted to the same namespace are processed sequentially rather than in parallel, and the submission of the current job depends on the running status of the previous job (see the sketch after this list):
1. If both the driver and at least one executor exist, the previous Spark job has successfully acquired its resources, and the current Spark job can be submitted.
2. If only the driver exists and no executor is present, the system waits for a period of time and then checks the previous Spark job's status again, repeating until both the driver and an executor are present, at which point the current Spark job is submitted.
3. A timeout can be configured. If the timeout is greater than 0 and the driver-plus-executor condition is not met before it expires, the previous driver is killed and the current Spark job is submitted. If the configured timeout is less than or equal to 0, the system waits indefinitely until the previous Spark job has both a driver and an executor before submitting the current job.
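To make the proposal concrete, here is a minimal sketch of what the check could look like, assuming the fabric8 Kubernetes client that Kyuubi already uses and the standard spark-role / spark-app-selector labels Spark on K8s puts on its pods. The SequentialSubmitGate class, its parameters, and the default interval and timeout are hypothetical names for illustration, not existing Kyuubi APIs:

```scala
import java.util.concurrent.TimeUnit

import scala.jdk.CollectionConverters._

import io.fabric8.kubernetes.client.KubernetesClient

// Hypothetical gate that delays the next driver submission in a namespace until
// the previously submitted Spark application has actually acquired resources
// (rule 1 above), polling while only the driver exists (rule 2) and killing the
// stuck driver on timeout (rule 3).
class SequentialSubmitGate(
    client: KubernetesClient,
    namespace: String,
    checkIntervalMs: Long = 5000L,
    timeoutMs: Long = 300000L) { // <= 0 means wait indefinitely

  // Pods of the given application and role, selected by the labels Spark on K8s
  // sets on every driver and executor pod.
  private def pods(appId: String, role: String) =
    client.pods()
      .inNamespace(namespace)
      .withLabel("spark-app-selector", appId)
      .withLabel("spark-role", role) // "driver" or "executor"
      .list().getItems.asScala

  def awaitPrevious(prevAppId: String): Unit = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (true) {
      val drivers = pods(prevAppId, "driver")
      if (drivers.isEmpty) return // previous app already finished or was cleaned up
      val driverRunning = drivers.exists(_.getStatus.getPhase == "Running")
      val executorRunning =
        pods(prevAppId, "executor").exists(_.getStatus.getPhase == "Running")
      if (driverRunning && executorRunning) return // rule 1: safe to submit
      if (timeoutMs > 0 && System.currentTimeMillis() > deadline) {
        // Rule 3: the previous app never got its executors in time; delete its
        // driver pod so the namespace's resources are released.
        drivers.foreach { d =>
          client.pods().inNamespace(namespace)
            .withName(d.getMetadata.getName).delete()
        }
        return
      }
      // Rule 2: driver is up but no executor yet; back off and re-check.
      TimeUnit.MILLISECONDS.sleep(checkIntervalMs)
    }
  }
}
```

The idea is that Kyuubi would consult such a gate once per namespace before handing the next application to spark-submit; since all checks go through the Kubernetes API, it is independent of which scheduler places the pods.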
Are you willing to submit PR?
Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
No. I cannot submit a PR at this time.
We tried using Yunikorn-Gang scheduling to solve this issue, but it cannot completely avoid the deadlock.
Could you elaborate on this? We're combining YuniKorn and Kyuubi as well and have had no problems so far, including with concurrent requests; I believe it does resolve your case.
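For context, the gang-scheduling setup being discussed is typically wired up through Spark's pod annotations, roughly as below. The task-group sizes, resource amounts, and timeout are placeholders, and the annotation keys should be verified against the YuniKorn release in use:

```
# Hand pod placement to YuniKorn (requires Spark 3.3+); placeholder values throughout
spark.kubernetes.scheduler.name=yunikorn
spark.kubernetes.driver.annotation.yunikorn.apache.org/task-group-name=spark-driver
spark.kubernetes.executor.annotation.yunikorn.apache.org/task-group-name=spark-executor
# Declare both task groups up front so YuniKorn reserves room for the executors
# before the driver is allowed to run (minMember/minResource are placeholders)
spark.kubernetes.driver.annotation.yunikorn.apache.org/task-groups=\
  [{"name":"spark-driver","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"}},\
   {"name":"spark-executor","minMember":2,"minResource":{"cpu":"1","memory":"4Gi"}}]
spark.kubernetes.driver.annotation.yunikorn.apache.org/schedulingPolicyParameters=placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard
```

With a Hard gang-scheduling style, the driver should not be bound until placeholders for its executors have been reserved, which is exactly the situation the sequential-submission switch tries to guarantee by other means.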