Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-packages] [docs] add type hints and define 'array-like' for X, y, group in scikit-learn interface #5757

Merged
merged 8 commits into from
Mar 29, 2023

Conversation

jameslamb
Copy link
Collaborator

@jameslamb jameslamb commented Mar 1, 2023

Contributes to #3756.

I've been putting this off for a while because it's so complicated, but I think it's finally time... this PR is a first attempt at clarifying the allowed input types for the scikit-learn interface 😬

It proposes the following:

  • adds type annotations on X, y, and group in fit() and predict() in the scikit-learn interface
  • replaces "array-like" in scikit-learn docstrings with specific, concrete types
  • adds unit tests checking that all combinations of these types minimally work (run without error and produce a model well-fit to the training data)

This PR also expands the _create_data() utility function in test_sklearn.py, adding some things I wanted for this PR like:

  • controlling the number of features and samples (to ensure there's enough data to get a decent fit)
  • differentiating between binary and multiclass classification

How I arrived at these lists of types

I tested all of the following for X, y, and group:

  • h2o datatable.Frame
  • list of floats
  • list of ints
  • list of list of floats
  • list of list of ints
  • pandas Series
  • pandas DataFrame
  • numpy array
  • scipy csc_matrix
  • scipy csr_matrix

At first, I tested this like this:

import lightgbm as lgb
import scipy.sparse
from sklearn.datasets import make_regression

X, y = make_regression()

lgb.LGBMRegressor(num_iterations=5).fit(
    X,
    scipy.sparse.csr_matrix(y)
)
# ValueError: y should be a 1d array, got an array of shape () instead.

Then tried adding different combinations of types into the unit tests added in this PR.

@jameslamb jameslamb changed the title WIP: [python-packages] [docs] add type hints and define 'array-like' for X, y, group in scikit-learn interface [python-packages] [docs] add type hints and define 'array-like' for X, y, group in scikit-learn interface Mar 1, 2023
@jameslamb jameslamb marked this pull request as ready for review March 1, 2023 06:30
@jameslamb jameslamb requested a review from guolinke March 1, 2023 17:53
@jameslamb
Copy link
Collaborator Author

@jmoralez @guolinke @shiyu1994 could I please get a review on this this week? I know it's complicated but I think it's an important step in the development of the Python package. Thank you.

Copy link
Collaborator

@guolinke guolinke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, LGTM

@jameslamb
Copy link
Collaborator Author

Thanks @guolinke !

@jameslamb jameslamb merged commit 9f03510 into master Mar 29, 2023
@jameslamb jameslamb deleted the python/sklearn-type-hints branch March 29, 2023 03:52
ClaudioSalvatoreArcidiacono added a commit to ClaudioSalvatoreArcidiacono/LightGBM that referenced this pull request Jun 5, 2023
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants