XGBoost-Spark killing SparkContext on Task Failure #4826
To add to what @lnmohankumar said, I understand there are times when the entire Spark application should exit when XGBoost fails, but there are also many cases where a user would like Spark to stay alive. For example, in a Jupyter notebook, users might play with different scenarios and want to keep using Spark even if some XGBoost tasks fail. I can also imagine users wanting other models in the code to continue training even if XGBoost models fail. I think there should be a way to enable/disable this behavior in user code.
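A minimal sketch of what such a switch could look like from user code. Note that `killSparkContextOnWorkerFailure` is a hypothetical parameter name used purely for illustration, not a confirmed xgboost4j-spark option:

```scala
// Hedged sketch only: "killSparkContextOnWorkerFailure" is a hypothetical
// parameter name for illustration, not a confirmed xgboost4j-spark API.
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val xgbParams = Map(
  "objective"   -> "binary:logistic",
  "num_round"   -> 100,
  "num_workers" -> 4,
  // Desired opt-out: fail the training job but leave the SparkContext alive
  // so notebooks and other models can keep running.
  "killSparkContextOnWorkerFailure" -> false
)

val classifier = new XGBoostClassifier(xgbParams)
```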
The current version of xgb does not behave normally when there is a failed task; the application would hang forever in that case. This is the reason we have to kill the entire application on a failure. @cq is working on fixing the fault recovery strategy in xgb.
Additionally, training multiple models in parallel is undefined behavior in xgb; rabit has some problems fully supporting it.
@CodingCat Thanks for your response; it makes a lot of sense. That said, when xgboost fails, a user might still want to train models from other frameworks, say scikit-learn or tensorflow, which are independent of xgboost. Any suggestion on how to do that with the current version of xgboost?
While I didn't try it myself, I think forking a new process to run spark-submit should work.
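For what it's worth, a minimal sketch of that workaround, assuming placeholder class and jar names: the point is that a separate spark-submit process cannot hang or shut down the parent application's SparkContext.

```scala
// Sketch of the workaround: run XGBoost training in a separate spark-submit
// process. Driver class, master, and jar path below are placeholders.
import sys.process._

val exitCode = Seq(
  "spark-submit",
  "--class", "com.example.TrainXGBoostModel", // placeholder driver class
  "--master", "yarn",
  "/path/to/xgboost-training-job.jar"         // placeholder jar
).!

if (exitCode != 0) {
  // The child job failed, but this application's SparkContext is untouched,
  // so other (non-xgboost) training can continue here.
  println(s"xgboost spark-submit exited with code $exitCode")
}
```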
Thanks, @CodingCat. This is essentially what we are doing now. Let me clarify one last thing: is it true that, when there is a failed task, xgboost would just hang and subsequent code in the main process would not be able to run? If that's the case, I agree we can only wait for the fault recovery fix @cq is working on.
Yes
Thanks, @CodingCat. Curious if there is an ETA for the fix.
Same issue here. Any update on how to fix it?
Failure during GridSearch, with this error message: 2020-01-16 03:15:55,167 [ dispatcher-event-loop-2:47767420 ] - [ ERROR ] Lost executor 43 on ip-.cn-northwest-1.compute.internal: Container marked as failed: container_1577262625342_113301_01_000064 on host: ip-10-84-31-201.cn-northwest-1.compute.internal. Exit status: -100. Diagnostics: Container released on a lost node
@CodingCat However, I found that when I killed one executor that an xgb task was running on, xgb kept training normally; xgboost did not hang. I use code that just cancels the job rather than killing the SparkContext.
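In case it helps others, a hedged sketch of that approach, assuming an existing SparkSession named `spark` and an illustrative job group name: the failure handler cancels only the XGBoost job group instead of stopping the SparkContext.

```scala
// Sketch: cancel only the xgboost job group on failure; keep the SparkContext alive.
// "spark" (a SparkSession) and the group name are assumed/illustrative.
val jobGroup = "xgboost-training"
spark.sparkContext.setJobGroup(jobGroup, "XGBoost model training", interruptOnCancel = true)

try {
  // ... kick off XGBoost training here ...
} catch {
  case e: Throwable =>
    // Only the xgboost job is cancelled; other Spark jobs keep running.
    spark.sparkContext.cancelJobGroup(jobGroup)
} finally {
  spark.sparkContext.clearJobGroup()
}
```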
Hi Team,
While running machine learning training jobs using xgboost on a cluster, the SparkContext always gets shut down after any training task failure exception, so every time we need to restart the cluster to bring it back to a normal state.
After looking for the root cause, we found the code that causes the SparkContext to close. I am not sure why the SparkContext has to shut down for any task failure; this ends the other model-training jobs as well, which is not desirable.
https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/org/apache/spark/SparkParallelismTracker.scala#L127
The above code was rolled out in versions 0.82 and 0.9. Is it possible to fix it, or is there a reason for this change in the new versions?
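For readers who do not want to open the link, the following is not the actual xgboost source, only a rough sketch of the pattern at that line: a SparkListener that stops the whole SparkContext as soon as any task fails.

```scala
// Rough sketch (not the actual xgboost source) of the behavior described above:
// a listener that shuts down the SparkContext on any task failure.
import org.apache.spark.{SparkContext, TaskFailedReason}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class KillOnTaskFailureListener(sc: SparkContext) extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    taskEnd.reason match {
      case _: TaskFailedReason =>
        // Any failed training task brings down the entire application,
        // which is what the comments above are asking to make optional.
        sc.stop()
      case _ => // task succeeded; nothing to do
    }
  }
}
```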