[python-package] check feature names in predict with dataframe (fixes #812) #4909
Conversation
The tests are failing where predictions are made on an X_test dataframe whose column names differ from the training feature names. This seems more like a feature: even though in this case the features are correct, these are exactly the kinds of cases this PR aims to detect and let the user handle. Should I change that file to make the column names equal to the feature names?
I don't support adding checks like this to Booster.predict(). This extra overhead might have a noticeable impact on time-per-prediction if you're serving a LightGBM model and doing single-row predictions.
I understand that "columns are in the wrong order" is a failure mode to guard against in deploying machine learning models, but I'm not convinced that the Booster object (or even the lightgbm library) is the right place for such checks. Doing this in the way currently proposed in this PR makes the overhead of the check unavoidable, which latency-sensitive users might not be happy about.
I feel the same way about requests like #4040.
If @StrikerRUS @shiyu1994 disagree and are fine with adding things like this, I won't block this PR.
I'll also raise an alternative idea here... what about something like the following?
- Booster.predict() = optimized for performance
- Booster.predict_strict() = same as Booster.predict(), but optimized for more input data checks. Raises informative errors and warnings when the data being scored is different from the training data.
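A minimal sketch of what that split could look like, written as a free function; predict_strict is not an existing LightGBM method, and the name and the check here are illustrative only:

# Hypothetical helper illustrating the predict()/predict_strict() split;
# predict_strict is NOT an existing LightGBM API.
import pandas as pd

def predict_strict(booster, data, **kwargs):
    # Same result as booster.predict(), but first validates DataFrame
    # column names against the features seen at training time.
    if isinstance(data, pd.DataFrame):
        expected = booster.feature_name()  # existing Booster API
        if list(data.columns) != expected:
            raise ValueError(
                f"Input columns {list(data.columns)} do not match "
                f"training features {expected}"
            )
    return booster.predict(data, **kwargs)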
Thank you for your comments @jameslamb! I agree that the added overhead is a valid concern. Having said that, I ran some benchmarks with the latest commit (e0827d0) that only changes the column order when necessary, and I get 1.21 milliseconds in master and 1.23 milliseconds in this PR.

Benchmarking script:

import argparse
import timeit
import lightgbm as lgb
import numpy as np
import pandas as pd
parser = argparse.ArgumentParser()
parser.add_argument('--n_samples', type=int, default=10_000)
parser.add_argument('--n_features', type=int, default=100)
parser.add_argument('--repeats', type=int, default=20_000)
parser.add_argument('--predict_samples', type=int, default=1)
args = parser.parse_args()
X = np.random.rand(args.n_samples, args.n_features)
y = np.random.rand(args.n_samples)
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df['y'] = y
ds = lgb.Dataset(df.drop(columns='y'), df['y'])
bst = lgb.train({'num_leaves': 31, 'verbosity': -1}, ds)
test_idxs = np.random.choice(range(X.shape[0]), args.predict_samples)
X_test = df.iloc[test_idxs, :-1]
print(f'Measuring time predicting {args.predict_samples} sample(s).')
print(f'{timeit.timeit("bst.predict(X_test)", number=args.repeats, globals=globals()) / args.repeats * 1_000:.2f} milliseconds')

Also, I think users who want the minimum latency when computing predictions use numpy arrays, where the latency for this test was 0.22 milliseconds. Happy to discuss this further, and merry Christmas @jameslamb!
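For context, the low-latency numpy path mentioned above looks roughly like this (reusing bst and X from the benchmarking script; a numpy array carries no column names, so there is nothing to validate):

# Numpy fast path: the caller is responsible for passing features in
# training column order, since a plain array has no names to check.
X_row = X[:1]               # shape (1, n_features)
preds = bst.predict(X_row)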
I'd prefer to add a new argument rather than a new method. XGBoost has a validate_features argument for this, for example.
I'd also prefer to add new parameters in predict() rather than a separate method.
Yes, I think we should do so.
Ok sure. Thanks for the links! I'm fine with adding such a parameter instead of a separate method. I think it should be specifically named validate_features.
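A minimal sketch of the parameter form under discussion (bst and X_test are placeholder names for a trained Booster and a pandas DataFrame):

# Sketch of the keyword-argument form instead of a separate method.
preds_checked = bst.predict(X_test, validate_features=True)   # validate column names
preds_fast = bst.predict(X_test, validate_features=False)     # skip the check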
I think these changes do address the request in #812 and the added tests cover that new behavior well. Nice work!
However, I have one more question. Still thinking about latency-sensitive users (#4909 (review)), I feel that the default value for validate_features should be False.
As a user of a machine learning framework, I think I'd prefer to be told
"hey this method can now do some extra validation, change your code to opt into it"
instead of
"hey this method now does some extra validation by default, change your code to avoid the performance penalty of that validation"
I feel that way specifically for predict() methods, since those are necessary for serving models and therefore latency-sensitive. I expect other methods related to training, cross-validation, or plotting (for example) to be much less latency-sensitive.
Before making changes, let's also hear what @StrikerRUS and @shiyu1994 think about this topic.
I have one general question before the review. Can this check be made on the cpp side? For example, by reading the feature names from the model file.
I guess it's possible, I'm just not sure how hard it would be. If it isn't too hard, I can give it a shot. WDYT @shiyu1994?
Thank you very much for rethinking the interface. Love it.
LGTM for the approach! However, I left some comments below.
@@ -779,9 +779,6 @@ def predict(self, data, start_iteration=0, num_iteration=-1,
            Prediction result.
            Can be sparse or a list of sparse objects (each element represents predictions for one class) for feature contributions (when ``pred_contrib=True``).
        """
-        if isinstance(data, Dataset):
-            raise TypeError("Cannot use Dataset instance for prediction, please use raw data instead")
-        data = _data_from_pandas(data, None, None, self.pandas_categorical)[0]
Why was this moved from the Predictor code? Now Predictor cannot accept a pandas DataFrame, which means, for example, that the refit() method cannot accept DataFrames anymore:

LightGBM/python-package/lightgbm/basic.py, line 3615 in ab9236b:

leaf_preds = predictor.predict(data, -1, pred_leaf=True)
and DataFrames cannot be used together with the init_model argument when constructing a Dataset:

LightGBM/python-package/lightgbm/basic.py, lines 1402 to 1404 in ab9236b:

init_score = predictor.predict(data,
                               raw_score=True,
                               data_has_header=data_has_header)
I agree with this comment, and will just add that it would be useful to have unit tests (in a separate PR) for these code paths, so that such regressions can be caught automatically in the future.
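A rough sketch of what such a regression test could look like, assuming the public refit() and init_model APIs; the actual fixtures and parameters would follow the existing test suite:

import lightgbm as lgb
import numpy as np
import pandas as pd

def test_predictor_accepts_dataframes():
    # Both refit() and init_model route DataFrames through the internal
    # Predictor, so this guards against the regression described above.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.random((100, 3)), columns=['a', 'b', 'c'])
    y = rng.random(100)
    params = {'num_leaves': 7, 'verbosity': -1}
    bst = lgb.train(params, lgb.Dataset(df, y), num_boost_round=5)
    bst.refit(df, y)  # Predictor must accept the DataFrame here
    lgb.train(params, lgb.Dataset(df, y), num_boost_round=5, init_model=bst)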
Addressed in 774a715
Thanks!
Should we add a validate_features argument to those methods as well?
I can work on adding it to refit() as well.
Co-authored-by: Nikita Titov <[email protected]>
Thank you very much for adding this feature!
@guolinke Could you please help with the review of the cpp part?
Thank you so much!
@jameslamb Your previous "request changes" review doesn't allow this PR to be merged. Please re-review.
Very nice work on this @jmoralez , thanks very much!
@StrikerRUS, thanks for the @. Apologies for holding this up due to my lack of review.
@jmoralez Could you please comment on this? #4909 (comment)
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
This adds a check to verify that the feature names of the trained booster match the column names of a pandas dataframe when predicting with one and setting validate_features=True. I believe this closes #812, since the input shape is already checked at the C++ side and the dataset object can't be constructed with repeated feature names.
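As an illustration, a small end-to-end sketch of the behavior this PR describes; the exact exception type and message come from the C++ check and are not quoted here:

import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 3)), columns=['a', 'b', 'c'])
y = rng.random(100)
bst = lgb.train({'num_leaves': 7, 'verbosity': -1}, lgb.Dataset(df, y), num_boost_round=5)

shuffled = df[['c', 'a', 'b']]
bst.predict(shuffled)                          # default: scores the misordered columns silently
bst.predict(shuffled, validate_features=True)  # raises an informative error instead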