
fix hanging trainings #132

Merged · 2 commits into master · Jan 27, 2020
Conversation

@CodingCat (Member)

We observed trainings hanging indefinitely: with low probability when training a single model in a Spark app, and with high probability when training multiple models.

Basically, in the hanging case, the tracker cannot figure out the topology of the workers.

@CodingCat (Member Author)

There are many changes in the master branch that are not fully validated; we have been syncing with upstream (since 0.82) very slowly... I expect to see more issues.

@CodingCat (Member Author)

@trivialfis Looks like a CI environment problem? `python3` not found?

@hcho3 (Contributor)

hcho3 commented Jan 7, 2020

I'll take a look. I've seen the same problem in the CI for dmlc-core.

@hcho3 (Contributor)

hcho3 commented Jan 7, 2020

See dmlc/dmlc-core#589

@trivialfis (Member) left a comment


I don't understand what's happening here. To me it looks like: some code is removed, and some debug messages are changed so that they no longer print through the tracker.

It would be nice if you had tests specifying the expected behavior.

@CodingCat (Member Author)

> I don't understand what's happening here. To me it looks like: some code is removed, and some debug messages are changed so that they no longer print through the tracker.
>
> It would be nice if you had tests specifying the expected behavior.

Basically, I am reverting the change at e3d51d3#diff-837b172455c2d356f95a6d61ab505595R277.

cc: @chenqin

@trivialfis (Member)

trivialfis commented Jan 8, 2020

Also, from your description:

> Basically, in the hanging case, the tracker cannot figure out the topology of the workers.

The changes here don't seem to be related to the tracker. I would really appreciate a small test that shows the expected behavior, maybe just a unit test in a single-node environment. Any code without unit tests is effectively legacy code.

@CodingCat (Member Author)

> Also, from your description:
>
> > Basically, in the hanging case, the tracker cannot figure out the topology of the workers.
>
> The changes here don't seem to be related to the tracker. I would really appreciate a small test that shows the expected behavior, maybe just a unit test in a single-node environment. Any code without unit tests is effectively legacy code.

No, the changes I reverted report wrong information about a worker's identity, which makes the tracker fall into an infinite loop at https://github.com/dmlc/dmlc-core/blob/master/tracker/dmlc_tracker/tracker.py#L105
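To illustrate the failure mode, here is a hypothetical sketch of such a dead loop (names and structure are illustrative, not `dmlc_tracker`'s actual implementation): the tracker blocks until every expected worker has reported its identity, so one wrong or duplicate identity leaves the pending set non-empty forever.

```python
# Hypothetical sketch of the dead loop described above; this is NOT
# the real dmlc_tracker code, only an illustration of the failure mode.
def wait_for_workers(n_workers, recv_identity):
    """Block until every rank in [0, n_workers) has reported in."""
    pending = set(range(n_workers))
    while pending:
        rank = recv_identity()  # blocking read from some worker socket
        # A wrong or duplicate rank is discarded silently, so if any
        # worker reports a bad identity, `pending` never drains and
        # the tracker hangs here forever.
        pending.discard(rank)
    return n_workers
```

With correct identities 0..n-1 the loop terminates; a single mis-reported rank reproduces the hang.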

I think the problem is that the change at e3d51d3#diff-837b172455c2d356f95a6d61ab505595R277 claims, per the PR title, to enable all tests; however, it turns out that reverting the change doesn't impact anything: the tests still pass.

I don't have the bandwidth to dig further into why the original changes mess up the port (the most suspicious entity); maybe it is as simple as a value-initialization problem, especially for multi-model training...

I just put it here because you asked which rabit issue we are working on... this is the one we found some time ago.

@chenqin (Contributor)

chenqin commented Jan 8, 2020

The description is not very clear on how this might fix the problem. The previous code change was related to a weird socket TIME_WAIT issue that I observed from time to time in the reconnectlink method.

If the author is confident this is the fix, please go ahead and merge.
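For context on the TIME_WAIT symptom mentioned above, a common generic mitigation (a sketch under my own assumptions, not rabit's actual code) is to set `SO_REUSEADDR` so a restarted process can rebind a port whose previous socket is still lingering in TIME_WAIT:

```python
import socket

def make_listen_socket(port, backlog=8):
    # SO_REUSEADDR lets us rebind a port whose previous socket is still
    # in TIME_WAIT; without it, a quick restart fails with EADDRINUSE.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", port))
    s.listen(backlog)
    return s
```

Passing `port=0` asks the OS for an ephemeral port, which is handy in tests.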

@CodingCat (Member Author)

CodingCat commented Jan 8, 2020

> The description is not very clear on how this might fix the problem. The previous code change was related to a weird socket TIME_WAIT issue that I observed from time to time in the reconnectlink method.
>
> If the author is confident this is the fix, please go ahead and merge.

I think there are two questions:

(1) Why did we never see this problem in 0.8x?

My answer: the rabit dependency in 0.8x does not contain your change.

(2) Why, after I reverted your change, is nothing related to "the weird socket TIME_WAIT issue I observed from time to time in the reconnectlink method", as you claimed, showing up? <= this is the question for you

@CodingCat (Member Author)

CodingCat commented Jan 8, 2020

> The description is not very clear on how this might fix the problem.

As I said, I did not go into the details; the fixing logic here is simple: if any new bug shows up, just revert anything new and suspicious.

@trivialfis (Member)

Em, so no one can reason about the code?

@CodingCat (Member Author)

> Em, so no one can reason about the code?

We tried the fixed version in December, and everything went fine.

While trying to debug this issue, I found many changes in the master branch that lack any explanation of why they were made and have no verification. We should raise the bar for accepting code into rabit.

At this point, my personal strategy for moving things forward is: "if any new bug shows up, just revert anything new and suspicious/proven problematic" (imagine your service goes down after a new deployment: would you roll back the deployment, or spend days debugging?)

@RAMitchell (Member)

I think you need a test at the very least. Otherwise we are just reverting others' progress and we can't move forward. If there is a test, you can revert, and then @chenqin has a chance to modify his code to find the root cause.

@trivialfis (Member)

@chenqin Would you please help add a test or an explanation for the removed code? I just want to make the code base a little more trustworthy.

@chenqin (Contributor)

chenqin commented Jan 9, 2020

> @chenqin Would you please help add a test or an explanation for the removed code? I just want to make the code base a little more trustworthy.

After @CodingCat's more detailed explanation, I feel he might be right about having found the problematic PR. But yes, I can work on getting a test running, with help from @CodingCat on how to run multiple models with xgb-spark.

@CodingCat (Member Author)

> I think you need a test at the very least. Otherwise we are just reverting others' progress and we can't move forward. If there is a test, you can revert, and then @chenqin has a chance to modify his code to find the root cause.

The removed code was added to "enable tests"; now it turns out that without it, the tests (including the ones added after that PR) still pass. Should it still be there? And honestly, I don't count bugs as progress.

For now, the most important progress is releasing a better 1.0, and this bug is blocking the release, IMO...

For anyone who wants to try: run https://github.com/CodingCat/xgboost4j-spark-scalability/blob/master/src/main/scala/me/codingcat/xgboost4j/PureXGBoostLearner.scala with 0.9 and set `repeated` to, say, 50; you will see it get stuck somewhere almost every time.

@CodingCat (Member Author)

CodingCat commented Jan 9, 2020 via email

@RAMitchell (Member)

Seems reasonable to get this merged for 1.0

@trivialfis (Member)

trivialfis commented Jan 20, 2020

I spent some time understanding the rabit code base. It would be nice to have a written specification of the communication protocol between the tracker and workers, as well as of the recovery consensus; I had to match those function calls one by one...

Out of curiosity, @chenqin: for the code used for "smoothing the tracker", is the root problem network bandwidth in the hardware, or the single-threaded implementation of the tracker?

@chenqin (Contributor)

chenqin commented Jan 20, 2020

> I spent some time understanding the rabit code base. It would be nice to have a written specification of the communication protocol between the tracker and workers, as well as of the recovery consensus; I had to match those function calls one by one...
>
> Out of curiosity, @chenqin: for the code used for "smoothing the tracker", is the root problem network bandwidth in the hardware, or the single-threaded implementation of the tracker?

@trivialfis I have been rushing for a planning and submission deadline; I will take time later to root-cause the bug raised in this issue. As for the traffic-smoothing part: it is a fix for a rabit "hang" caused by the combination of a C++ assert killing the Spark executor (error message: unable to connect to tracker after one worker dies) and rabit's internal reset-and-reconnect to the tracker whenever a worker loses its connection.
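The reset-and-reconnect behavior described above can be sketched as a retry loop with backoff (illustrative only, under my own assumptions; rabit's actual reconnect logic lives in its C++ core, not in this form):

```python
import socket
import time

def connect_with_retry(host, port, attempts=5, base_delay=0.1):
    """Retry the tracker connection instead of asserting and dying."""
    for i in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=5)
        except OSError:
            if i == attempts - 1:
                raise  # give up only after all retries are exhausted
            time.sleep(base_delay * (2 ** i))  # exponential backoff
```

The design point is that a transient tracker outage becomes a delay rather than a fatal assert that takes the executor down with it.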

@trivialfis (Member) left a comment


Approving to unblock the next release of XGBoost. We may want to consider JSON-RPC for rabit in the future.

> caused by the combination of a C++ assert killing the Spark executor

There might be a better way to resolve this.
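As a rough illustration of the JSON-RPC idea floated above (hypothetical: rabit's tracker protocol is a custom format today, and `register_worker` is an invented method name, not a real rabit call):

```python
import json

def make_rpc_request(method, params, req_id):
    # A JSON-RPC 2.0 request is a JSON object with these four keys;
    # using it would make the tracker protocol self-describing and
    # much easier to log and debug than an ad-hoc wire format.
    return json.dumps({"jsonrpc": "2.0", "method": method,
                       "params": params, "id": req_id})

msg = make_rpc_request("register_worker", {"rank": 0, "host": "worker-0"}, 1)
```

Every message would then carry its own method name and id, so mismatched request/response pairs (one suspected cause of hangs) become observable.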

@trivialfis (Member)

@CodingCat @chenqin @RAMitchell @hcho3 Is there anything you want in this PR, or in rabit in general, for the next XGBoost release?

@hcho3 (Contributor)

hcho3 commented Jan 27, 2020

I have no objection to merging this PR.

@CodingCat (Member Author)

thanks folks!

@CodingCat CodingCat merged commit 6e56395 into master Jan 27, 2020
@CodingCat CodingCat deleted the fix_hanging branch January 27, 2020 17:12
@hcho3 hcho3 mentioned this pull request Jan 27, 2020