[python-package] check feature names in predict with dataframe (fixes #812) #4909

Merged: 25 commits merged into microsoft:master on Jun 27, 2022

Conversation

@jmoralez (Collaborator) commented Dec 24, 2021

This adds a check to verify that the feature names of the trained booster match the column names of a pandas dataframe when predicting with one and setting validate_features=True.

I believe this closes #812, since the input shape is already checked on the C++ side and a Dataset object can't be constructed with repeated feature names.
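For illustration, here is a minimal sketch of the behavior described above (synthetic data and names; the exact error message is not quoted from this PR):

import lightgbm as lgb
import numpy as np
import pandas as pd

# Train a booster on a DataFrame, so feature names are recorded.
X = pd.DataFrame(np.random.rand(100, 3), columns=["f1", "f2", "f3"])
y = np.random.rand(100)
bst = lgb.train({"verbosity": -1}, lgb.Dataset(X, y))

# Same data, different column name: with validate_features=True,
# predict() is expected to raise an informative error.
X_renamed = X.rename(columns={"f3": "g3"})
bst.predict(X_renamed, validate_features=True)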

@jmoralez (Collaborator, Author):

The tests are failing at

y_pred = bst.predict(X_test)

where the model was trained with feature names different from the column names of the X_test dataframe. This seems more like a feature than a bug: even though the features are correct in this case, these are exactly the kinds of mismatches this PR aims to detect and let the user handle. Should I change that file to make the column names equal to the feature names?

@jameslamb (Collaborator) left a comment:

I don't support adding checks like this to Booster.predict(). This extra overhead might have a noticeable impact on time-per-prediction if you're serving a LightGBM model and doing single-row predictions.

I understand that "columns are in the wrong order" is a failure mode to guard against in deploying machine learning models, but I'm not convinced that the Booster object (or even the lightgbm library) is the right place for such checks. Doing this in the way currently proposed in this PR makes the overhead of this check unavoidable, which latency-sensitive users might not be happy about.

I feel the same way about requests like #4040.

If @StrikerRUS @shiyu1994 disagree and are fine with adding things like this, I won't block this PR.

I'll also raise an alternative idea here...what about something like the following?

  • Booster.predict() = optimized for performance
  • Booster.predict_strict() = same as Booster.predict(), but with additional input data checks; raises informative errors and warnings when the data being scored differs from the training data (see the sketch below)
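As a rough illustration, a strict variant could be approximated today with a user-side wrapper like this (a hypothetical sketch, not an API proposed in this PR):

def predict_strict(booster, df, **kwargs):
    # Compare the DataFrame's columns against the feature names
    # stored in the trained booster before predicting.
    expected = booster.feature_name()
    actual = list(df.columns)
    if actual != expected:
        raise ValueError(f"Feature name mismatch.\nExpected: {expected}\nGot: {actual}")
    return booster.predict(df, **kwargs)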

@jmoralez (Collaborator, Author):

Thank you for your comments @jameslamb! I agree that Booster.predict is a crucial path and that we want the overhead there to be as small as possible. I like your suggestion, and I think we could incorporate it as an option within the Booster.predict method: an argument like strict that defaults to False, running these checks only when it is True.

Having said that, I ran some benchmarks with the latest commit (e0827d0), which only changes the column order when necessary: I get 1.21 milliseconds on master and 1.23 milliseconds in this PR.

Benchmarking script
import argparse
import timeit

import lightgbm as lgb
import numpy as np
import pandas as pd


parser = argparse.ArgumentParser()
parser.add_argument('--n_samples', type=int, default=10_000)
parser.add_argument('--n_features', type=int, default=100)
parser.add_argument('--repeats', type=int, default=20_000)
parser.add_argument('--predict_samples', type=int, default=1)
args = parser.parse_args()

# Generate random training data with named feature columns.
X = np.random.rand(args.n_samples, args.n_features)
y = np.random.rand(args.n_samples)

df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df['y'] = y
ds = lgb.Dataset(df.drop(columns='y'), df['y'])
bst = lgb.train({'num_leaves': 31, 'verbosity': -1}, ds)

# Time repeated predictions on a small random subset of rows.
test_idxs = np.random.choice(range(X.shape[0]), args.predict_samples)
X_test = df.iloc[test_idxs, :-1]
print(f'Measuring time predicting {args.predict_samples} sample(s).')
print(f'{timeit.timeit("bst.predict(X_test)", number=args.repeats, globals=globals()) / args.repeats * 1_000:.2f} milliseconds')

Also, I think users who want the minimum latency when computing predictions use numpy arrays; with a numpy array, the latency for this test was 0.22 milliseconds.
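For example (reusing bst and X_test from the benchmarking script above):

# Converting to numpy up front skips the per-call pandas handling;
# without column names, no name validation would apply.
X_test_np = X_test.to_numpy()
preds = bst.predict(X_test_np)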

Happy to discuss this further, and merry Christmas @jameslamb!

@StrikerRUS (Collaborator):

  • Booster.predict_strict() = same as Booster.predict(), but with additional input data checks.

I like your suggestion and I think we could incorporate it as an option within the Booster.predict method: an argument like strict that defaults to False, running these checks only when it is True.

I'd prefer to add a new argument rather than a new method.

XGBoost has a validate_features parameter for such purposes:
dmlc/xgboost#5191
dmlc/xgboost#3653
dmlc/xgboost#6605

@shiyu1994 (Collaborator):

I'd also prefer adding a new parameter to the predict method for feature name checks. Actually, we already have something similar in predict_disable_shape_check, see
https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_disable_shape_check
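For reference, prediction-time parameters like that one can be passed through predict()'s keyword arguments (a sketch reusing bst and X_test from the script above):

# Extra keyword arguments to predict() are forwarded as prediction
# parameters, so the existing shape-check escape hatch looks like:
preds = bst.predict(X_test, predict_disable_shape_check=True)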

@shiyu1994 (Collaborator):

Should I change that file to make the column names equal to the feature names?

Yes, I think we should do so.

@jameslamb (Collaborator):

I'd prefer to add a new argument rather than a new method.

XGBoost has validate_features parameter for such purposes:

Ok sure. Thanks for the links! I'm fine with adding such a parameter instead of a separate method. I think it should be named validate_features specifically, to be consistent with XGBoost.
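The interface that emerged from this discussion looks like the following (a sketch; per the review comments below, validate_features is opt-in and defaults to False):

# Opt into feature-name validation at prediction time.
preds = bst.predict(X_test, validate_features=True)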

@jmoralez jmoralez requested a review from guolinke as a code owner December 30, 2021 04:12
@jameslamb (Collaborator) left a comment:

I think these changes do address the request in #812 and the added tests cover that new behavior well. Nice work!

However, I have one more question. Still thinking about latency-sensitive users (#4909 (review)), I feel that the default value for validate_features should be False.

As a user of a machine learning framework, I think I'd prefer to be told

"hey this method can now do some extra validation, change your code to opt into it"

instead of

"hey this method now does some extra validation by default, change your code to avoid the performance penalty of that validation"

I feel this way specifically about predict() methods, since those are used for serving models and are therefore latency-sensitive. I expect other methods related to training, cross-validation, or plotting (for example) to be much less latency-sensitive.

Before making changes, let's also hear what @StrikerRUS and @shiyu1994 think about this topic.

[Review thread on python-package/lightgbm/basic.py (outdated, resolved)]
@StrikerRUS (Collaborator):

I have one general question before the review. Can this check be made on the cpp side? Like reading feature names from the file for LGBM_BoosterPredictForFile, and optionally accepting feature names as a new argument for all other LGBM_BoosterPredict* functions? Similarly to https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_disable_shape_check.

@jmoralez (Collaborator, Author) commented Jan 6, 2022

Can this check be made at cpp side?

I guess it's possible; I'm just not sure how hard it would be. If it isn't too hard, I can give it a shot. WDYT @shiyu1994?

@jmoralez changed the title from "[python-package] check feature names and order in predict with dataframe (fixes #812)" to "[python-package] check feature names in predict with dataframe (fixes #812)" on May 30, 2022
@jmoralez jmoralez requested review from guolinke and removed request for hzy46 and tongwu-sh May 30, 2022 21:26
@StrikerRUS (Collaborator) left a comment:

Thank you very much for rethinking the interface. Love it.

LGTM for the approach! However, I left some comments below.

[Review thread on include/LightGBM/c_api.h (outdated, resolved)]
@@ -779,9 +779,6 @@ def predict(self, data, start_iteration=0, num_iteration=-1,
             Prediction result.
             Can be sparse or a list of sparse objects (each element represents predictions for one class) for feature contributions (when ``pred_contrib=True``).
         """
-        if isinstance(data, Dataset):
-            raise TypeError("Cannot use Dataset instance for prediction, please use raw data instead")
-        data = _data_from_pandas(data, None, None, self.pandas_categorical)[0]
A collaborator commented:
Why was this moved from the Predictor code? Now Predictor cannot accept a pandas DataFrame, which means, for example, that the refit() method cannot accept DataFrames anymore:

leaf_preds = predictor.predict(data, -1, pred_leaf=True)

and DataFrames cannot be used together with the init_model argument when constructing a Dataset:

init_score = predictor.predict(data,
                               raw_score=True,
                               data_has_header=data_has_header)

A collaborator replied:

I agree with this comment, and will just add that it would be useful to have unit tests (in a separate PR) for these code paths, so that such regressions can be caught automatically in the future.
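A sketch of what such a regression test might look like (hypothetical test name and data; assumes the lightgbm, numpy, and pandas imports from the benchmarking script above):

def test_refit_accepts_dataframe():
    # Train on a DataFrame, then make sure refit() still accepts one.
    X = pd.DataFrame(np.random.rand(100, 3), columns=["a", "b", "c"])
    y = np.random.rand(100)
    bst = lgb.train({"verbosity": -1}, lgb.Dataset(X, y))
    refitted = bst.refit(X, y)
    assert refitted.num_trees() > 0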

@jmoralez (Author) replied:

Addressed in 774a715

A collaborator replied:

Thanks!

Should we add the validate_features argument to those methods as well?

@jmoralez (Author) replied:

I can work on adding it to refit as well.

[Review threads on src/c_api.cpp and tests/python_package_test/test_sklearn.py (outdated, resolved)]
@StrikerRUS (Collaborator) left a comment:

Thank you very much for adding this feature!

@StrikerRUS (Collaborator):

@guolinke Could you please help with the review of the cpp part?

[Three review threads on src/c_api.cpp (outdated, resolved)]
@jmoralez jmoralez requested a review from guolinke June 20, 2022 16:36
@guolinke (Collaborator) left a comment:

Thank you so much!

@StrikerRUS (Collaborator):

@jameslamb Your earlier "request changes" review is blocking the merge of this PR. Please re-review.

@jameslamb (Collaborator) left a comment:

Very nice work on this @jmoralez , thanks very much!

@StrikerRUS, thanks for the ping. Apologies for holding this up with my delayed review.

@StrikerRUS StrikerRUS merged commit bdb02e0 into microsoft:master Jun 27, 2022
@StrikerRUS (Collaborator):

@jmoralez Could you please comment on this? #4909 (comment)

@github-actions (bot):

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
Successfully merging this pull request may close these issues.

[python package]: suggestion: lgb.Booster.predict() should check that the input X data makes sense