[HOTFIX] distributed training with hist method #4716

Merged 13 commits into dmlc:master on Aug 13, 2019

Conversation

@sperlingxx (Contributor) commented Jul 30, 2019

Problem:

  • Distributed training with the hist method stops working when synchronizing histograms across different machines.

Debug Stack:

  • The size of expand_nodes differs across workers, because a node may be a leaf on one worker but still splittable on another.
  • The results of SplitEvaluation are inconsistent across workers (machines).
  • The inconsistent split results are observed only when nthread > 1.
  • When I changed the OpenMP schedule policy of EvaluateSplitsBatch from schedule(guided) to schedule(dynamic) num_threads(nthread), everything works (see the sketch below).
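A minimal, self-contained sketch of the schedule change described in the last bullet; the loop body, the entry count, and the nthread value here are placeholders for illustration, not the actual EvaluateSplitsBatch code:

#include <cstdio>
#include <omp.h>

int main() {
  const int nthread = 4;   // placeholder for the configured thread count
  const int n = 16;        // placeholder for the number of split candidates
  int owner[16];

  // before: #pragma omp parallel for schedule(guided)
  // after (as described in the bullet above):
  #pragma omp parallel for schedule(dynamic) num_threads(nthread)
  for (int i = 0; i < n; ++i) {
    owner[i] = omp_get_thread_num();  // record which thread processed entry i
  }

  for (int i = 0; i < n; ++i) {
    printf("entry %d processed by thread %d\n", i, owner[i]);
  }
  return 0;
}

Compile with, e.g., g++ -fopenmp; the printed assignments show how the schedule hands out iterations to threads at run time.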

@CodingCat (Member) commented Jul 30, 2019

Is it only happening on the master branch?

@sperlingxx (Contributor, Author)

> Is it only happening on the master branch?

@CodingCat Yes. I have tested the release_0.9 branch and everything is fine there, so I think some of the changes in PR #4529 caused the problem.

@CodingCat (Member)

> Is it only happening on the master branch?

I think it is related to the issue in #4679.

In the 0.9 branch, the node stats are synced only once, when working on the root, and the left/right children are calculated from the cache (see #4140), so I think everything should be fine there.
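For illustration, a minimal sketch of the pattern being referred to; the struct and function names are assumptions, not the actual xgboost code. Once the root statistics have been synchronized, a child's statistics can be derived locally (e.g., as parent minus sibling) without further cross-worker synchronization:

#include <cstdio>

// Illustrative sketch only; names and layout are assumptions.
struct GradStats {
  double sum_grad;
  double sum_hess;
};

// Derive one child's statistics from the parent and its sibling, so the
// child needs no extra synchronization after the root-level sync.
GradStats SubtractStats(const GradStats& parent, const GradStats& sibling) {
  return GradStats{parent.sum_grad - sibling.sum_grad,
                   parent.sum_hess - sibling.sum_hess};
}

int main() {
  GradStats root{10.0, 20.0};   // synced once across workers
  GradStats left{4.0, 9.0};     // computed locally
  GradStats right = SubtractStats(root, left);
  printf("right child: grad=%.1f hess=%.1f\n", right.sum_grad, right.sum_hess);
  return 0;
}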

@hcho3 @trivialfis @RAMitchell I believe this is a blocking issue for 1.0?

@CodingCat (Member)

> Is it only happening on the master branch?
>
> @CodingCat Yes. I have tested the release_0.9 branch and everything is fine there, so I think some of the changes in PR #4529 caused the problem.

Bingo! Check my previous comment.

@trivialfis (Member)

> I believe this is a blocking issue for 1.0?

Could you please add this to the roadmap, since you have a better idea of what's happening?

@hcho3 (Collaborator) commented Jul 30, 2019

@CodingCat Yes, I think this is blocking.

@sperlingxx Thanks for the report. FYI, see https://xgboost.readthedocs.io/en/latest/contrib/unit_tests.html for how to run the tests locally.

@CodingCat (Member)

I have updated the 1.0.0 roadmap.

@sperlingxx (Contributor, Author)

@hcho3 Thanks for the doc link. I will run the unit tests locally.

@sperlingxx (Contributor, Author)

@hcho3 @trivialfis @CodingCat
I've tested the master branch code with the script below:

import xgboost as xgb
import unittest
import numpy as np


class TestOMP(unittest.TestCase):
    def test_omp(self):
        dpath = 'demo/data/'
        dtrain = xgb.DMatrix(dpath + 'agaricus.txt.train')
        dtest = xgb.DMatrix(dpath + 'agaricus.txt.test')

        param = {'booster': 'gbtree',
                 'objective': 'binary:logistic',
                 'grow_policy': 'depthwise',
                 'tree_method': 'hist',
                 'eval_metric': 'error',
                 'max_depth': 5,
                 'min_child_weight': 0}

        watchlist = [(dtest, 'eval'), (dtrain, 'train')]
        num_round = 5

        def run_trial():
            res = {}
            bst = xgb.train(param, dtrain, num_round, watchlist, evals_result=res)
            metrics = [res['train']['error'][-1], res['eval']['error'][-1]]
            preds = bst.predict(dtest)
            return metrics, preds

        def consist_test(title, n):
            auc, pred = run_trial()
            for i in range(n-1):
                auc2, pred2 = run_trial()
                try:
                    assert auc == auc2
                    assert np.array_equal(pred, pred2)
                except Exception:
                    print('-------test %s failed, num_trial: %d-------' % (title, i))
                    raise
                auc, pred = auc2, pred2
            return auc, pred

        print('test approx ...')
        param['tree_method'] = 'approx'

        param['nthread'] = 1
        auc_1, pred_1 = consist_test('approx_thread_1', 100)

        param['nthread'] = 2
        auc_2, pred_2 = consist_test('approx_thread_2', 100)

        param['nthread'] = 3
        auc_3, pred_3 = consist_test('approx_thread_3', 100)

        assert auc_1 == auc_2 == auc_3
        assert np.array_equal(pred_1, pred_2)
        assert np.array_equal(pred_1, pred_3)

        print('test hist ...')
        param['tree_method'] = 'hist'

        param['nthread'] = 1
        auc_1, pred_1 = consist_test('hist_thread_1', 100)

        param['nthread'] = 2
        auc_2, pred_2 = consist_test('hist_thread_2', 100)

        param['nthread'] = 3
        auc_3, pred_3 = consist_test('hist_thread_3', 100)

        assert auc_1 == auc_2 == auc_3
        assert np.array_equal(pred_1, pred_2)
        assert np.array_equal(pred_1, pred_3)

The approx method passed all the tests, but the hist method always failed the consistency tests when the thread count was greater than 1. (If we reduce the number of trials, the hist tests sometimes pass.)

After I changed all the OpenMP scheduling policies in updater_quantile_hist.cc from guided to static, the hist consistency tests pass.

There are three OpenMP parallel for calls in updater_quantile_hist.cc, each of the form:

#pragma omp parallel for schedule(guided)
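To illustrate the class of issue involved (a general sketch, not the xgboost code): with a non-static OpenMP schedule, the mapping of loop iterations to threads can change from run to run, so thread-local floating-point accumulations are built from different groupings and the rounded result can differ in the last bits, whereas schedule(static) fixes the mapping for a given thread count.

#include <cstdio>
#include <vector>
#include <omp.h>

// Sum a vector into per-thread partial sums, then combine them in a fixed
// order. With schedule(static) the iteration-to-thread mapping is fixed for
// a given thread count, so the result is reproducible across runs; with
// schedule(guided) the mapping may vary, and so may the rounded result.
double ParallelSum(const std::vector<double>& v, bool use_static) {
  std::vector<double> partial(omp_get_max_threads(), 0.0);
  const int n = static_cast<int>(v.size());
  if (use_static) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
      partial[omp_get_thread_num()] += v[i];
    }
  } else {
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; ++i) {
      partial[omp_get_thread_num()] += v[i];
    }
  }
  double total = 0.0;
  for (double p : partial) total += p;  // deterministic combine order
  return total;
}

int main() {
  std::vector<double> v(1 << 20);
  for (size_t i = 0; i < v.size(); ++i) v[i] = 1.0 / static_cast<double>(i + 1);
  printf("guided: %.17g\n", ParallelSum(v, false));
  printf("static: %.17g\n", ParallelSum(v, true));
  return 0;
}

Built with -fopenmp, repeated runs of the guided variant may print slightly different values, while the static variant stays stable for a fixed thread count.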

@trivialfis (Member)

@hcho3 Do you have time to investigate the impact of the previous optimization PR? I have a feeling we might need to revert some problematic parts of it.

@hcho3 (Collaborator) commented Aug 6, 2019

@trivialfis I'll probably have to take some time to investigate. I'll let you know when I do. My org is interested in fixing this problem as well.

@hcho3 (Collaborator) left a comment

Let's merge this for now. I would like to revisit the previous optimization PR within the next two weeks and address #4679.

cc @trivialfis @sperlingxx

@sperlingxx changed the title from "[WIP][HOTFIX] distributed training with hist method" to "[HOTFIX] distributed training with hist method" on Aug 13, 2019
@hcho3 merged commit ef9af33 into dmlc:master on Aug 13, 2019
The lock bot locked this conversation as resolved and limited it to collaborators on Nov 11, 2019