[dask] test training when a worker has no data #3897
Conversation
Thanks very much! I have some ideas to simplify this, without needing to introduce a new function to check that the data is only on one worker.

- Assert that `client.nworkers > 1`, just to be sure the client fixture is for a multi-worker cluster.
- Just repartition the data to 1 partition (`.repartition(npartitions=1)` for a data frame, `.rechunk((X.shape[0], X.shape[1]))` for an array), and assert that `dX.npartitions == 1`. If you have only 1 partition, you know for sure only one worker has data!
other requests
Please make sure this test is run for ranking, classification, and regression. You can borrow from `LightGBM/tests/python_package_test/test_dask.py`, lines 463 to 487 at commit 876bfe5:
```python
@pytest.mark.parametrize('task', ['classification', 'regression', 'ranking'])
def test_training_works_if_client_not_provided_or_set_after_construction(task, listen_port, client):
    if task == 'ranking':
        _, _, _, _, dX, dy, _, dg = _create_ranking_data(
            output='array',
            group=None
        )
        model_factory = lgb.DaskLGBMRanker
    else:
        _, _, _, dX, dy, _ = _create_data(
            objective=task,
            output='array',
        )
        dg = None
        if task == 'classification':
            model_factory = lgb.DaskLGBMClassifier
        elif task == 'regression':
            model_factory = lgb.DaskLGBMRegressor

    params = {
        "time_out": 5,
        "local_listen_port": listen_port,
        "n_estimators": 1,
        "num_leaves": 2
    }
```
For training on a single worker, the result should be identical to non-Dask training. So please also train a regular `lightgbm.sklearn.LGBM[Classifier/Regressor/Ranker]` and then check that `assert_eq(dask_model.predict(dX).compute(), local_model.predict(X))` passes. You can see the other tests in this file for references on how to do that, or @ me if you have any questions.
Hi James. I've included the different tasks and outputs. I just saw the PR you made for the categoricals and saw the new output type; I'll wait until that gets merged to include it.
Thanks so much! I like the changes you've made. We're starting to have enough of these tests where `task` is parameterized that it makes sense to concentrate it at the top of `test_dask.py` like you did.
I left a few minor suggestions. Other than those, I agree with your proposal to wait until #3908 is merged.
Hi James. I've included your comments and the
This looks great! Thanks very much for the help.
This includes a test to check that the training process succeeds even when a worker has no data, e.g. verify that #3882 is solved.
I wanted to persist the collections to a specific worker with something like:
However I got: `TypeError: unhashable type: 'Array'`. So in the test I just create a collection with one chunk, persist it, and check that it is only in one worker. Any feedback to make this test more robust is welcome.
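The workaround described above might be sketched like this (a sketch under the assumption that `dask` and `distributed` are installed; `Client.who_has` reports which workers hold the persisted chunks):

```python
import dask.array as da
import numpy as np
from distributed import Client, LocalCluster, wait

with LocalCluster(n_workers=2, threads_per_worker=1) as cluster, Client(cluster) as client:
    X = np.random.rand(50, 3)

    # one chunk means the persisted data can only land on one worker
    dX = da.from_array(X, chunks=X.shape)
    dX = dX.persist()
    wait(dX)

    # map each stored key to the worker(s) holding it
    who_has = client.who_has(dX)
    workers_with_data = {w for workers in who_has.values() for w in workers}

n_workers_with_data = len(workers_with_data)
assert n_workers_with_data == 1
```

This avoids targeting a specific worker entirely: instead of forcing placement, the single-chunk collection makes the "one worker has all the data" condition hold by construction, and `who_has` verifies it.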