fix hanging trainings #132
There are many changes in the master branch that are not fully validated. We have been syncing with upstream (since 0.82) very slowly, so expect to see more issues.
@trivialfis looks like some CI env problem? python3 not found?
I'll take a look. I've seen the same problem in the CI for dmlc-core.
I don't understand what's happening here. To me it basically reads as: some code is removed, some debug messages are changed, and they no longer use the tracker to print.
It would be nice to have tests that specify the expected behavior.
Basically I am reverting the change at e3d51d3#diff-837b172455c2d356f95a6d61ab505595R277. cc: @chenqin
Also, from your description:
The changes here don't seem to be related to the tracker. I would really appreciate a small test that shows the expected behavior, maybe just a unit test in a single-node env. Any code without unit tests is considered legacy.
No, the changes I reverted bring wrong information about the worker's identity, which makes the tracker fall into a dead loop at https://github.com/dmlc/dmlc-core/blob/master/tracker/dmlc_tracker/tracker.py#L105 I think the problem is that the changes at e3d51d3#diff-837b172455c2d356f95a6d61ab505595R277 claims it is to I don't have the bandwidth to dig further into why the original changes are messing things up. I just put it here because you asked which rabit issue we are working on; this is the one we found some time ago.
The description was not very clear about how this might fix the problem. If the author is confident this is the fix, please go ahead and merge.
I think there are two questions: (1) why did we never see this problem in 0.8x? My answer is that the 0.8x rabit dependency does not have your change. (2) why I reverted your change nothing related
As I said, I did not go into details. The fixing logic here is simple: if any new bug shows up, just revert anything new and suspicious.
Em, so no one can reason about the code?
We tried the fixed version in December, and everything went fine. When I tried to debug this issue, I found many changes in the master branch that lack an explanation of why they were made and have no verification. We should raise the bar for accepting code in Rabit at this point. My personal strategy to move things forward is: "if any new bug shows up, just revert anything new and suspicious/proved to be problematic" (imagine your service goes down after a new deployment: do you roll back the deployment or spend days debugging?).
I think you need a test at the very least. Otherwise we are just reverting the progress of others and we can't move forward. If there is a test, you can revert it and then @chenqin has a chance to modify his code to find the root cause.
@chenqin Would you please help add a test or an explanation for the removed code? I just want to make the code base a little more trustworthy.
After @CodingCat's more detailed explanation, I feel he might be right about finding the problematic PR. But yeah, I can work on getting a test running, with help from @CodingCat on how to run multiple models with xgb-spark.
The removed code was added to "enable test". Now it turns out that without the added code, the tests (including the ones added after this PR) still pass, so should it still be there? And honestly, I don't take bugs as progress. For now the most important progress is releasing a better 1.0, and this is a bug blocking the release IMO... For anyone who wants to try, use https://github.com/CodingCat/xgboost4j-spark-scalability/blob/master/src/main/scala/me/codingcat/xgboost4j/PureXGBoostLearner.scala to run with 0.9, and set
BTW, I used 90 workers when reproducing this problem
Seems reasonable to get this merged for 1.0.
Spent some time understanding the code base of rabit. It would be nice to have a written description of the communication protocol between tracker and workers, and of the recovery consensus. I had to match those function calls one by one ... Out of curiosity, @chenqin, for the code used for "smoothing the tracker", is the root problem the network bandwidth of the hardware, or the single-threaded implementation of the tracker?
@trivialfis I have been rushing for a planning and submission deadline. I will take time to root-cause the bug raised in this issue later. As for the traffic-smoothing part, it is a fix for the rabit "hang" issue caused by the combination of a C++ assert and a killed Spark executor (error message: unable to connect to tracker after one worker dies), as well as rabit's internal reset and reconnect to the tracker if any worker loses its connection to the tracker.
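To make the "reconnect to tracker" point above concrete, here is a minimal sketch of a bounded reconnect, so that a worker fails loudly instead of hanging when the tracker is unreachable. This is not rabit's implementation: `connect_with_retry` and its parameters (`retries`, `timeout`, `backoff`) are hypothetical names chosen for illustration, and only the standard-library `socket.create_connection` call is real.

```python
import socket
import time


def connect_with_retry(host, port, retries=3, timeout=1.0, backoff=0.1):
    """Hypothetical helper: try to reach a tracker a bounded number of times.

    Returns a connected socket, or raises ConnectionError once all attempts
    fail, so the caller never blocks forever on an unreachable tracker.
    """
    last_err = None
    for attempt in range(retries):
        try:
            # create_connection applies the timeout to the connect attempt.
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:
            last_err = err
            # Linear backoff between attempts (an assumption, not rabit's policy).
            time.sleep(backoff * (attempt + 1))
    raise ConnectionError(
        f"could not reach tracker at {host}:{port} after {retries} attempts"
    ) from last_err
```

The design choice here is simply that every wait has an upper bound: a dead tracker or a killed executor then surfaces as an error that the caller can handle, rather than as a training job that never finishes.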
Approving to unblock the next release for XGBoost. We may want to consider JSON RPC for rabit in the future.
"caused by combination of c++ assert and kill spark executor"
There might be a better way to resolve this.
@CodingCat @chenqin @RAMitchell @hcho3 Is there anything you want in this PR, or rabit in general for the next XGBoost release?
I have no objection for this PR to be merged.
thanks folks!
We observed trainings hanging indefinitely, with low probability when training a single model in a Spark app and with high probability when training multiple models.
Basically, the tracker cannot figure out the topology of the workers when the hang occurs.
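As a rough illustration of the failure mode described above, the following toy model shows how wrong identity information from workers can leave a tracker unable to complete the topology. It is a sketch under stated assumptions, not the logic in dmlc-core's `tracker.py`: `assign_ranks` is a hypothetical function, and it assumes every worker reports its own rank and that the tracker needs each rank in `0..n_workers-1` claimed exactly once. The real tracker's rank handling is more involved.

```python
from typing import Dict, Iterable, Optional


def assign_ranks(reported_ranks: Iterable[int], n_workers: int) -> Optional[Dict[int, int]]:
    """Toy tracker loop (hypothetical, not dmlc-core's code).

    Accepts workers in arrival order until every rank is claimed once.
    Returns {rank: arrival_index} on success, or None if the reported
    identities never complete the topology. A real tracker blocking on
    accept() would wait forever in the None case.
    """
    assigned: Dict[int, int] = {}
    for arrival, rank in enumerate(reported_ranks):
        if rank not in range(n_workers) or rank in assigned:
            # Wrong or duplicate identity: this worker cannot be placed.
            continue
        assigned[rank] = arrival
        if len(assigned) == n_workers:
            return assigned
    return None


# Three workers with consistent identities: the topology completes.
ok = assign_ranks([0, 1, 2], n_workers=3)

# One worker reports a rank already taken: rank 2 is never claimed.
stuck = assign_ranks([0, 1, 1], n_workers=3)
```

In this model `ok` is a complete mapping and `stuck` is `None`. A single-node unit test of the kind requested in the review could assert exactly that difference, with the real tracker in place of the toy function.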