[dask] include multiclass-classification task in tests #4048
Conversation
Thanks very much for doing this!
Please see my suggested changes. If you think it will be more than a few days until you have time to work on these suggestions, could you please submit a separate PR that only has the "put factory lookup in a dict" changes in it? Those changes don't require any review discussion since we'd already agreed to them, and a PR for those could be merged quickly. If you'll be able to get back to this PR in the next few days, I think it's totally fine to leave them together as a single pull request.
Thanks again for all your efforts on lightgbm.dask! Really, really appreciate the time and energy you've put into moving this part of the project forward.
...test_classifier sometimes fails when comparing the local vs dask probabilities, so I'm happy to address that if that's in the scope of this PR.
The test failures look specific to this PR. All of the failures look like they're coming from multiclass classification tests on test_classifier(), which are currently working on master. Given that, I think they should be fixed on this PR.
FAILED ../tests/python_package_test/test_dask.py::test_classifier[multiclass-classification-dataframe]
FAILED ../tests/python_package_test/test_dask.py::test_classifier[multiclass-classification-dataframe-with-categorical]
FAILED ../tests/python_package_test/test_dask.py::test_classifier[multiclass-classification-array]
FAILED ../tests/python_package_test/test_dask.py::test_classifier[multiclass-classification-scipy_csr_matrix]
FAILED ../tests/python_package_test/test_dask.py::test_classifier[multiclass-classification-dataframe-with-categorical]
FAILED ../tests/python_package_test/test_dask.py::test_classifier[multiclass-classification-array]
FAILED ../tests/python_package_test/test_dask.py::test_classifier[multiclass-classification-dataframe]
FAILED ../tests/python_package_test/test_dask.py::test_classifier[multiclass-classification-dataframe-with-categorical]
I have one possible solution I'd like you to try. In the multiclass classification block of _create_data(), instead of centers = 3, please try passing the value currently being passed on master. Something like this:
if objective == 'binary-classification':
    centers = [[-4, -4], [4, 4]]
elif objective == 'multiclass-classification':
    centers = [[-4, -4], [4, 4], [-4, 4]]
I know I originally suggested that just centers = 2 and centers = 3 would be sufficient, but these failing tests suggest that I was wrong to assume that. Maybe it's the case that the choice make_blobs() makes for centers=3 produces classes that are too similar to each other (based on feature values), and as a result maybe the small models we're training in tests here (n_estimators=10, num_leaves=10) are not able to effectively discriminate between classes. In general, for a fixed dataset we should expect distributed learning to take more iterations to achieve the same statistical performance as non-distributed learning.
@@ -1229,12 +1222,9 @@ def test_training_succeeds_when_data_is_dataframe_and_label_is_column_array(
     client.close(timeout=CLIENT_CLOSE_TIMEOUT)


-@pytest.mark.parametrize('task', tasks)
+@pytest.mark.parametrize('task', ['binary-classification', 'ranking', 'regression'])
I noticed that test_init_score fails for the multiclass-classification task but I removed it from the tasks in that test and can make a different PR adding it back once #4046 is solved.
This PR should include a test for init_score used for multiclass classification. I agree with you that the "pass a 1D array" interface could lead to mistakes in the multiclass classification case, but I don't consider #4046 a bug, and I expect that a PR addressing #4046 would only ADD the ability to pass an array of shape (n_samples, n_classes), not remove the current behavior of passing a 1D array. I expect that it will end up that way so that we don't break existing users' code.
Here's a minimal example showing that the current 1D-array behavior can work in the Dask interface.
from sklearn.datasets import make_blobs
import lightgbm as lgb
import numpy as np
import dask.array as da
from distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=2)
client = Client(cluster)
X, y = make_blobs(n_samples=1000, n_features=50, centers=3)
dX = da.from_array(X, chunks=(100, 50))
dy = da.from_array(y, chunks=(100,))
init_scores = dy.map_blocks(
    lambda x: np.repeat(0.8, x.size * 3),
    dtype=np.float64
)
assert init_scores.npartitions == dy.npartitions
dask_model = lgb.DaskLGBMClassifier()
dask_model.fit(dX, dy, init_score=init_scores)
assert dask_model.booster_.trees_to_dataframe()['value'][0] == 0.0
So for this test, can you please just change the init_score setup to something like this?
# init_scores must be a 1D array, even for multiclass classification
# where you need to provide 1 score per class for each row in X
# https://github.com/microsoft/LightGBM/issues/4046
size_factor = 1
if task == "multiclass-classification":
    size_factor = 3

if output.startswith('dataframe'):
    init_scores = dy.map_partitions(
        lambda x: pd_Series(np.repeat(init_score, x.size * size_factor))
    )
else:
    init_scores = dy.map_blocks(
        lambda x: np.repeat(init_score, x.size * size_factor),
        dtype=np.float64
    )
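For comparison, here is a hedged sketch of what the equivalent non-distributed call could look like under the current 1D interface (illustrative only, not part of the suggested test change; the variable names are mine). With a constant score, the exact ordering of the flattened array doesn't matter.

import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=1000, n_features=50, centers=3)

# one raw score per (row, class) pair, flattened into a single 1D array
init_score = np.repeat(0.8, y.size * 3)

local_model = lgb.LGBMClassifier(n_estimators=10, num_leaves=10)
local_model.fit(X, y, init_score=init_score)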
Hi James. Thank you for your comments, I've included them and things work OK locally. However, in the CI all the Dask tests are failing, because of the Python version (3.9) I believe.
Thanks! In the future, when you mention failing tests, it would be useful to include a link to the logs and preferably a small snippet of the error message(s) or other logs you're seeing. I can see from https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=9396&view=logs&j=c28dceab-947a-5848-c21f-eef3695e5f11&t=fa158246-17e2-53d4-5936-86070edbaacf that a lot of the Dask tests are failing with an error like this:
I'm very familiar with this problem, actually! I opened dask/dask#7331 this weekend, which I think describes it. It's possible to get this error if you have mismatched versions of dask and distributed, and I can see that that's what we ended up with when the testing conda env was solved. From the same logs:
I see the same failures on the builds for a totally unrelated PR (#4053): https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=9393&view=logs&j=c28dceab-947a-5848-c21f-eef3695e5f11&t=fa158246-17e2-53d4-5936-86070edbaacf. I'll open a PR right now to try to fix this by forcing our testing conda env to use compatible versions of these two libraries.
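As a side note, a minimal sketch of the kind of compatibility guard this implies (my illustration, not the actual CI fix), assuming dask and distributed are published with matching version numbers, which is how they are normally released together:

import dask
import distributed

# dask and distributed are normally released in lockstep with matching versions;
# a mismatch can produce scheduler/worker errors like the ones in these CI logs
assert dask.__version__ == distributed.__version__, (
    f"dask=={dask.__version__} vs distributed=={distributed.__version__}"
)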
Ok @jmoralez, I think the known issues with this project's continuous integration stuff have been resolved. Could you please merge the most recent master into this branch?
Hi James. Those CI fixes look like they were quite an adventure, haha. Something strange happened with the test for Linux sdist. Truncated logs:
Ha yes, every day in CI world brings new weird stuff to figure out. I'm not sure what happened with that Azure DevOps job you linked to... from those logs, it looks like maybe... I just restarted that job manually; hopefully it will pass this time. The failure definitely seems unrelated to your changes in this PR.
I just changed the PR title, adding the phrase "in tests". We use a bot called release-drafter (https://github.com/microsoft/LightGBM/blob/master/.github/release-drafter.yml) to automatically create changelogs for releases, so each PR title becomes one point in the changelog. I want it to be clear in the changelog that this PR added tests on multiclass classification tasks, not a new feature in the Dask package. Just explaining, since I know that if someone changed the title of one of my PRs, I'd want to know why.
Haha, it's fine. I had noticed the title didn't specify that it was for tests, but I forgot to update it, so thank you. I'll make sure to write well-defined PR titles in the future with this changelog in mind.
Awesome contribution, thanks so much for increasing the test coverage of the Dask package and simplifying the test code!
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
This includes an additional task, multiclass-classification, for test_dask.py and renames classification to binary-classification. I removed the centers argument from the _create_data function and now define the centers inside the function according to the classification task. I also created task_to_dask_factory and task_to_local_factory dictionaries to simplify getting a model factory from a task, which was especially useful now that we have two classification tasks.

I noticed that test_init_score fails for the multiclass-classification task, but I removed it from the tasks in that test and can make a different PR adding it back once #4046 is solved. I also noticed that test_classifier sometimes fails when comparing the local vs Dask probabilities, so I'm happy to address that if that's in the scope of this PR.
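As an illustration of the "factory lookup in a dict" idea mentioned above, here is a minimal sketch. The dictionary names follow this description, but the exact contents in the test code may differ, so treat this as illustrative rather than the actual implementation.

import lightgbm as lgb

# map each task name to the estimator class used to build models for it
task_to_dask_factory = {
    'regression': lgb.DaskLGBMRegressor,
    'binary-classification': lgb.DaskLGBMClassifier,
    'multiclass-classification': lgb.DaskLGBMClassifier,
    'ranking': lgb.DaskLGBMRanker,
}
task_to_local_factory = {
    'regression': lgb.LGBMRegressor,
    'binary-classification': lgb.LGBMClassifier,
    'multiclass-classification': lgb.LGBMClassifier,
    'ranking': lgb.LGBMRanker,
}

# a single dict lookup replaces a chain of if/elif branches in the tests
model_factory = task_to_dask_factory['multiclass-classification']
dask_model = model_factory(n_estimators=10, num_leaves=10)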