XGBoost-Spark killing SparkContext on Task Failure #4826

Closed
lnmohankumar opened this issue Sep 3, 2019 · 12 comments

@lnmohankumar

Hi Team,

While running machine learning training jobs with XGBoost on a cluster, the SparkContext always shuts down after any training task fails with an exception. Every time this happens, we need to restart the cluster to bring it back to a normal state.

After looking for the root cause, we found the code that causes the SparkContext to close. I am not sure why the SparkContext has to shut down on any task failure. It also ends the other model training jobs, which should not be necessary.

https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/org/apache/spark/SparkParallelismTracker.scala#L127

The above code was rolled out in versions 0.82 and 0.9. Is it possible to fix this, or is there a reason for this change in the new versions?

@hanyucui

hanyucui commented Sep 3, 2019

To add to what @lnmohankumar said, I understand there are times when the entire Spark application should exit when XGBoost fails, but there are also many cases where a user would like Spark to stay alive. For example, in a Jupyter notebook, users might experiment with different scenarios and want to continue using Spark even if some XGBoost tasks fail. I can also imagine users wanting other models in the code to continue training even if XGBoost models fail. I think there should be a way to enable or disable this behavior in user code.

@CodingCat
Member

The current version of xgb does not behave normally when there is a failed task, i.e. the application hangs forever in that case.

This is the reason we have to kill the entire application in the case of a failure. @cq is working on fixing the fault recovery strategy in xgb.

@CodingCat
Member

Additionally, training multiple models in parallel is undefined behavior in xgb; rabit has some problems fully supporting it.

@hanyucui

hanyucui commented Sep 3, 2019

@CodingCat Thanks for your response; it makes a lot of sense. However, when xgboost fails, a user might still want to train models from other frameworks, say scikit-learn or TensorFlow, which are independent of xgboost. Any suggestions on how to do that with the current version of xgboost?

@CodingCat
Member

While I haven't tried it myself, I think forking a new process to run spark-submit should work.
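
For readers who want to try this suggestion, here is a minimal Python sketch of launching a job in a child process so that a failed or hung job cannot take down the parent's own session. The helper names `run_isolated` and `spark_submit_cmd`, the `--master yarn` default, and the script name are illustrative assumptions, not anything from XGBoost or from this thread.

```python
import subprocess


def spark_submit_cmd(script, master="yarn"):
    """Build a spark-submit command line (hypothetical script and master)."""
    return ["spark-submit", "--master", master, script]


def run_isolated(cmd, timeout=None):
    """Run cmd in a child process; return (succeeded, captured stderr).

    The timeout guards against a job that hangs: when it expires,
    subprocess.run kills the child and raises TimeoutExpired.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    return result.returncode == 0, result.stderr
```

A caller could then run `run_isolated(spark_submit_cmd("train_xgb.py"), timeout=3600)` and carry on training other models in the parent process whether or not the child succeeded.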

@hanyucui

hanyucui commented Sep 3, 2019

Thanks, @CodingCat. This is essentially what we are doing now. Let me clarify one last thing. Is it true that, when there is a failed task, xgboost would just hang and subsequent code in the main process will not be able to run? If that's the case, I agree we can only wait for the fault recovery fix @cq is working on.

@CodingCat
Member

CodingCat commented Sep 3, 2019 via email

@hanyucui

hanyucui commented Sep 4, 2019

Thanks, @CodingCat. Curious if there is an ETA for the fix.

@a-whitej

I hit the same issue. Is there any update on how to fix it?

@a-whitej

The failure occurs during GridSearch, with this error message:

2020-01-16 03:15:55,167 [ dispatcher-event-loop-2:47767420 ] - [ ERROR ] Lost executor 43 on ip-.cn-northwest-1.compute.internal: Container marked as failed: container_1577262625342_113301_01_000064 on host: ip-10-84-31-201.cn-northwest-1.compute.internal. Exit status: -100. Diagnostics: Container released on a lost node
2020-01-16 03:15:55,167 [ dispatcher-event-loop-7:47767420 ] - [ WARN ] Requesting driver to remove executor 43 for reason Container marked as failed: container_1577262625342_113301_01_000064 on host: ip-
-northwest-1.compute.internal. Exit status: -100. Diagnostics: Container released on a lost node
2020-01-16 03:15:55,167 [ dispatcher-event-loop-3:47767420 ] - [ WARN ] No more replicas available for rdd_461_1045 !
2020-01-16 03:15:55,167 [ spark-listener-group-executorManagement:47767420 ] - [ INFO ] Existing executor 43 has been removed (new total is 62)
2020-01-16 03:15:55,167 [ dispatcher-event-loop-2:47767420 ] - [ INFO ] Removal of executor 43 requested
2020-01-16 03:15:55,167 [ dispatcher-event-loop-2:47767420 ] - [ INFO ] Asked to remove non-existent executor 43
2020-01-16 03:15:55,167 [ dispatcher-event-loop-3:47767420 ] - [ WARN ] No more replicas available for rdd_461_2177 !

@hcho3
Collaborator

hcho3 commented Sep 9, 2020

Fixed in #6019. #6097 documents the behavior of new parameter kill_spark_context_on_worker_failure.

hcho3 closed this as completed Sep 9, 2020

@FantDing

FantDing commented Jan 8, 2021

Yes

On Tue, Sep 3, 2019 at 3:29 PM Hanyu Cui wrote: Thanks, @CodingCat. This is essentially what we are doing now. Let me clarify one last thing. Is it true that, when there is a failed task, xgboost would just hang and subsequent code in the main process will not be able to run? If that's the case, I agree we can only wait for the fault recovery fix @cq is working on.

@CodingCat However, I found that when I killed an executor that an xgb task was running on, xgb still trained normally and did not hang. I used code that just cancels the job rather than killing the SparkContext.
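
The code referred to above is not shown in the thread. As a rough driver-side illustration of the same idea, the sketch below tags the training jobs with a job group and cancels only that group on failure, leaving the SparkContext running. It assumes a PySpark `SparkContext` passed in as `sc` and uses its `setJobGroup` and `cancelJobGroup` methods; the helper name `train_in_job_group` and the `train_fn` callback are hypothetical.

```python
def train_in_job_group(sc, group_id, train_fn):
    """Run train_fn with its Spark jobs tagged under group_id.

    On failure, cancel only the jobs in that group and re-raise, so the
    SparkContext itself stays usable for later work.
    """
    sc.setJobGroup(group_id, "xgboost training")
    try:
        return train_fn()
    except Exception:
        # Cancel the jobs in this group instead of calling sc.stop().
        sc.cancelJobGroup(group_id)
        raise
```

Whether cancelling is enough to avoid the hang described earlier in the thread depends on the XGBoost version and on how the workers failed, so treat this as a starting point for experiments rather than a fix.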
