[python-package] check feature names in predict with dataframe (fixes #812) #4909

Merged: 25 commits merged into microsoft:master on Jun 27, 2022

Conversation

@jmoralez (Collaborator) commented Dec 24, 2021

This adds a check to verify that the feature names of the trained booster match the column names of a pandas dataframe when predicting with one and setting validate_features=True.

I believe this closes #812, since the input shape is already checked on the C++ side and a Dataset object can't be constructed with repeated feature names.
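For illustration, here is a minimal sketch of the behavior described above (synthetic data and names; the exact error message is not quoted from this PR):

import lightgbm as lgb
import numpy as np
import pandas as pd

# Train a booster on a DataFrame, so feature names are recorded.
X = pd.DataFrame(np.random.rand(100, 3), columns=["f1", "f2", "f3"])
y = np.random.rand(100)
bst = lgb.train({"verbosity": -1}, lgb.Dataset(X, y))

# Same data, different column name: with validate_features=True,
# predict() is expected to raise an informative error.
X_renamed = X.rename(columns={"f3": "g3"})
bst.predict(X_renamed, validate_features=True)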

@jmoralez (Collaborator, Author):

The tests are failing at

y_pred = bst.predict(X_test)

where the model was trained with feature names different from the column names of the X_test dataframe. This seems more like a feature than a bug: even though the features are correct in this case, these are exactly the kinds of mismatches this PR aims to detect and let the user handle. Should I change that file to make the column names equal to the feature names?

@jameslamb (Collaborator) left a comment:

I don't support adding checks like this to Booster.predict(). This extra overhead might have a noticeable impact on time-per-prediction if you're serving a LightGBM model and doing single-row predictions.

I understand that "columns are in the wrong order" is a failure mode to guard against in deploying machine learning models, but I'm not convinced that the Booster object (or even the lightgbm library) is the right place for such checks. Doing this in the way currently proposed in this PR makes the overhead of this check unavoidable, which latency-sensitive users might not be happy about.

I feel the same way about requests like #4040.

If @StrikerRUS @shiyu1994 disagree and are fine with adding things like this, I won't block this PR.

I'll also raise an alternative idea here...what about something like the following?

  • Booster.predict() = optimized for performance
  • Booster.predict_strict() = same as Booster.predict(), but with additional input data checks; raises informative errors and warnings when the data being scored differs from the training data (see the sketch below)
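As a rough illustration, a strict variant could be approximated today with a user-side wrapper like this (a hypothetical sketch, not an API proposed in this PR):

def predict_strict(booster, df, **kwargs):
    # Compare the DataFrame's columns against the feature names
    # stored in the trained booster before predicting.
    expected = booster.feature_name()
    actual = list(df.columns)
    if actual != expected:
        raise ValueError(f"Feature name mismatch.\nExpected: {expected}\nGot: {actual}")
    return booster.predict(df, **kwargs)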

@jmoralez (Collaborator, Author):

Thank you for your comments @jameslamb! I agree that Booster.predict is a crucial path and that we want the overhead there to be as small as possible. I like your suggestion, and I think we could incorporate it as an option within the Booster.predict method: an argument like strict that defaults to False, running these checks only when it is True.

Having said that, I ran some benchmarks with the latest commit (e0827d0), which only changes the column order when necessary: I get 1.21 milliseconds on master and 1.23 milliseconds in this PR.

Benchmarking script
import argparse
import timeit

import lightgbm as lgb
import numpy as np
import pandas as pd


parser = argparse.ArgumentParser()
parser.add_argument('--n_samples', type=int, default=10_000)
parser.add_argument('--n_features', type=int, default=100)
parser.add_argument('--repeats', type=int, default=20_000)
parser.add_argument('--predict_samples', type=int, default=1)
args = parser.parse_args()

# Generate random training data with named feature columns.
X = np.random.rand(args.n_samples, args.n_features)
y = np.random.rand(args.n_samples)

df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df['y'] = y
ds = lgb.Dataset(df.drop(columns='y'), df['y'])
bst = lgb.train({'num_leaves': 31, 'verbosity': -1}, ds)

# Time repeated predictions on a small random subset of rows.
test_idxs = np.random.choice(range(X.shape[0]), args.predict_samples)
X_test = df.iloc[test_idxs, :-1]
print(f'Measuring time predicting {args.predict_samples} sample(s).')
print(f'{timeit.timeit("bst.predict(X_test)", number=args.repeats, globals=globals()) / args.repeats * 1_000:.2f} milliseconds')

Also, I think users who want the minimum latency when computing predictions use numpy arrays; with a numpy array, the latency for this test was 0.22 milliseconds.
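For example (reusing bst and X_test from the benchmarking script above):

# Converting to numpy up front skips the per-call pandas handling;
# without column names, no name validation would apply.
X_test_np = X_test.to_numpy()
preds = bst.predict(X_test_np)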

Happy to discuss this further, and merry Christmas @jameslamb!

@StrikerRUS (Collaborator):

  • Booster.predict_strict() = same as Booster.predict(), but with additional input data checks.

I like your suggestion and I think we could incorporate it as an option within the Booster.predict method: an argument like strict that defaults to False, running these checks only when it is True.

I'd prefer to add a new argument rather than a new method.

XGBoost has a validate_features parameter for such purposes:
dmlc/xgboost#5191
dmlc/xgboost#3653
dmlc/xgboost#6605

@shiyu1994 (Collaborator):

I'd also prefer adding a new parameter to the predict method for feature name checks. Actually, we already have something similar in predict_disable_shape_check, see
https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_disable_shape_check
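For reference, prediction-time parameters like that one can be passed through predict()'s keyword arguments (a sketch reusing bst and X_test from the script above):

# Extra keyword arguments to predict() are forwarded as prediction
# parameters, so the existing shape-check escape hatch looks like:
preds = bst.predict(X_test, predict_disable_shape_check=True)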

@shiyu1994 (Collaborator):

Should I change that file to make the column names equal to the feature names?

Yes, I think we should do so.

@jameslamb (Collaborator):

I'd prefer to add a new argument rather than a new method.

XGBoost has validate_features parameter for such purposes:

Ok sure. Thanks for the links! I'm fine with adding such a parameter instead of a separate method. I think it should be named validate_features specifically, to be consistent with XGBoost.
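The interface that emerged from this discussion looks like the following (a sketch; per the review comments below, validate_features is opt-in and defaults to False):

# Opt into feature-name validation at prediction time.
preds = bst.predict(X_test, validate_features=True)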

@jmoralez jmoralez requested a review from guolinke as a code owner December 30, 2021 04:12
@jameslamb (Collaborator) left a comment:

I think these changes do address the request in #812 and the added tests cover that new behavior well. Nice work!

However, I have one more question. Still thinking about latency-sensitive users (#4909 (review)), I feel that the default value for validate_features should be False.

As a user of a machine learning framework, I think I'd prefer to be told

"hey this method can now do some extra validation, change your code to opt into it"

instead of

"hey this method now does some extra validation by default, change your code to avoid the performance penalty of that validation"

I feel this way specifically about predict() methods, since those are used for serving models and are therefore latency-sensitive. I expect other methods related to training, cross-validation, or plotting (for example) to be much less latency-sensitive.

Before making changes, let's also hear what @StrikerRUS and @shiyu1994 think about this topic.

[Review thread on python-package/lightgbm/basic.py (outdated, resolved)]
@StrikerRUS (Collaborator):

I have one general question before the review. Can this check be made on the cpp side? Like reading feature names from the file for LGBM_BoosterPredictForFile, and optionally accepting feature names as a new argument for all other LGBM_BoosterPredict* functions? Similarly to https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_disable_shape_check.

@jmoralez (Collaborator, Author) commented Jan 6, 2022

Can this check be made at cpp side?

I guess it's possible; I'm just not sure how hard it would be. If it isn't too hard, I can give it a shot. WDYT @shiyu1994?

@jmoralez changed the title from "[python-package] check feature names and order in predict with dataframe (fixes #812)" to "[python-package] check feature names in predict with dataframe (fixes #812)" on May 30, 2022
@jmoralez jmoralez requested review from guolinke and removed request for hzy46 and tongwu-sh May 30, 2022 21:26
@StrikerRUS (Collaborator) left a comment:

Thank you very much for rethinking the interface. Love it.

LGTM for the approach! However, I left some comments below.

[Review thread on include/LightGBM/c_api.h (outdated, resolved)]
@@ -779,9 +779,6 @@ def predict(self, data, start_iteration=0, num_iteration=-1,
             Prediction result.
             Can be sparse or a list of sparse objects (each element represents predictions for one class) for feature contributions (when ``pred_contrib=True``).
         """
-        if isinstance(data, Dataset):
-            raise TypeError("Cannot use Dataset instance for prediction, please use raw data instead")
-        data = _data_from_pandas(data, None, None, self.pandas_categorical)[0]
A collaborator commented:
Why was this moved from the Predictor code? Now Predictor cannot accept a pandas DataFrame, which means, for example, that the refit() method cannot accept DataFrames anymore:

leaf_preds = predictor.predict(data, -1, pred_leaf=True)

and DataFrames cannot be used together with the init_model argument when constructing a Dataset:

init_score = predictor.predict(data,
                               raw_score=True,
                               data_has_header=data_has_header)

A collaborator replied:

I agree with this comment, and will just add that it would be useful to have unit tests (in a separate PR) for these code paths, so that such regressions can be caught automatically in the future.
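A sketch of what such a regression test might look like (hypothetical test name and data; assumes the lightgbm, numpy, and pandas imports from the benchmarking script above):

def test_refit_accepts_dataframe():
    # Train on a DataFrame, then make sure refit() still accepts one.
    X = pd.DataFrame(np.random.rand(100, 3), columns=["a", "b", "c"])
    y = np.random.rand(100)
    bst = lgb.train({"verbosity": -1}, lgb.Dataset(X, y))
    refitted = bst.refit(X, y)
    assert refitted.num_trees() > 0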

@jmoralez (Author) replied:

Addressed in 774a715

A collaborator replied:

Thanks!

Should we add the validate_features argument to those methods as well?

@jmoralez (Author) replied:

I can work on adding it to refit as well.

[Review threads on src/c_api.cpp and tests/python_package_test/test_sklearn.py (outdated, resolved)]
@StrikerRUS (Collaborator) left a comment:

Thank you very much for adding this feature!

@StrikerRUS (Collaborator):

@guolinke Could you please help with the review of the cpp part?

[Three review threads on src/c_api.cpp (outdated, resolved)]
@jmoralez jmoralez requested a review from guolinke June 20, 2022 16:36
@guolinke (Collaborator) left a comment:

Thank you so much!

@StrikerRUS (Collaborator):

@jameslamb Your earlier "request changes" review is blocking the merge of this PR. Please re-review.

@jameslamb (Collaborator) left a comment:

Very nice work on this @jmoralez , thanks very much!

@StrikerRUS, thanks for the ping. Apologies for holding this up with my delayed review.

@StrikerRUS StrikerRUS merged commit bdb02e0 into microsoft:master Jun 27, 2022
@StrikerRUS (Collaborator):

@jmoralez Could you please comment on this? #4909 (comment)

@github-actions (bot):

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
Successfully merging this pull request may close these issues.

[python package]: suggestion: lgb.Booster.predict() should check that the input X data makes sense