
A model that was trained on a dense dataset makes incorrect predictions for sparse datasets #51

Open
SamWqc opened this issue Sep 14, 2021 · 7 comments


@SamWqc

SamWqc commented Sep 14, 2021

Hi,
I found that the prediction results produced by the Python LightGBM model and by the PMML file are different.
It happens when the training data contains no missing values, but the data being predicted does.

Here is an example that shows this case.

@SamWqc SamWqc changed the title Missing value handling when predicts data containing missing value but training data contains no missing value Missing value handling when test data contains missing value but training data contains no missing value Sep 14, 2021
@vruusmann
Member

from pypmml import Model

@SamWqc The JPMML software project is not the place to complain about third-party projects. Your reported results have no relevance here.

If you keep spamming the JPMML software project, you will be blocked.

@SamWqc
Author

SamWqc commented Sep 14, 2021

@vruusmann
I am so sorry to bother you, and I think there may be some misunderstanding. I did not mean to spam the project at all.
But I still see the same problem when using jpmml_evaluator.
I hope you could have a look. Thanks!

#######
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from jpmml_evaluator.py4j import launch_gateway, Py4JBackend
from jpmml_evaluator import make_evaluator

np.random.seed(1)
n_feature = 20
fea_name = ['Fea' + str(i + 1) for i in range(n_feature)]
#### training without missing values
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
Y = np.random.randint(0, 2, 1000)  # np.random.random_integers is deprecated

my_model = lgb.LGBMClassifier(n_estimators=100)
my_model.fit(X, Y, feature_name=fea_name)

mapper = DataFrameMapper([([i], None) for i in fea_name])  

pipeline = PMMLPipeline([
    ('mapper', mapper), 
    ("classifier", my_model)
])

sklearn2pmml(pipeline, "lgb.pmml")

#####load pmml#####
gateway = launch_gateway()
backend = Py4JBackend(gateway)

evaluator = make_evaluator(backend, "lgb.pmml").verify()

# evaluate with missing values
np.random.seed(9999)
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
X[X < 0] = np.nan

X = pd.DataFrame(X, columns=fea_name).replace({np.nan: None})

results_df = evaluator.evaluateAll(X)
Jpmml_model_pred = results_df.to_numpy()[:, 2]
my_model_pred = my_model.predict_proba(X.to_numpy())[:, 1]

res_df = pd.DataFrame({
    'my_model_pred': my_model_pred,
    'Jpmml_model_pred': Jpmml_model_pred
})
res_df['pred_diff'] = abs(res_df['my_model_pred'] - res_df['Jpmml_model_pred'])

print(res_df.sort_values('pred_diff', ascending=False).head(10))
     my_model_pred  Jpmml_model_pred  pred_diff
321       0.869994          0.049991   0.820004
628       0.873887          0.056304   0.817583
704       0.974523          0.169809   0.804715
984       0.924378          0.131011   0.793367
893       0.822017          0.029407   0.792610
682       0.044943          0.826341   0.781398
921       0.903011          0.128266   0.774745
995       0.155294          0.925298   0.770004
844       0.856560          0.089665   0.766896
963       0.938739          0.173073   0.765666
# evaluate with missing values replaced by 0
np.random.seed(999)
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
X[X < 0] = np.nan

X = pd.DataFrame(X, columns=fea_name).replace({np.nan: 0})

results_df = evaluator.evaluateAll(X)
Jpmml_model_pred = results_df.to_numpy()[:, 2]
my_model_pred = my_model.predict_proba(X.to_numpy())[:, 1]

res_df = pd.DataFrame({
    'my_model_pred': my_model_pred,
    'Jpmml_model_pred': Jpmml_model_pred
})
res_df['pred_diff'] = abs(res_df['my_model_pred'] - res_df['Jpmml_model_pred'])

print(res_df.sort_values('pred_diff', ascending=False).head(10))
     my_model_pred  Jpmml_model_pred  pred_diff
0         0.242943          0.242943        0.0
671       0.458703          0.458703        0.0
658       0.807326          0.807326        0.0
659       0.748976          0.748976        0.0
660       0.690734          0.690734        0.0
661       0.608443          0.608443        0.0
662       0.625638          0.625638        0.0
663       0.706605          0.706605        0.0
664       0.855556          0.855556        0.0
665       0.259897          0.259897        0.0

@vruusmann vruusmann transferred this issue from jpmml/sklearn2pmml Sep 14, 2021
@vruusmann vruusmann changed the title Missing value handling when test data contains missing value but training data contains no missing value A model that was trained on a dense dataset (ie. without missing values) makes incorrect predictions for sparse datasets (ie. with missing values) Sep 14, 2021
@vruusmann vruusmann changed the title A model that was trained on a dense dataset (ie. without missing values) makes incorrect predictions for sparse datasets (ie. with missing values) A model that was trained on a dense dataset makes incorrect predictions for sparse datasets Sep 14, 2021
@vruusmann
Member

@SamWqc But I still see the same problem when using jpmml_evaluator.

That's the correct way of doing things!

I moved this issue to the JPMML-LightGBM project, because it looks like an LGBM-to-PMML conversion issue. Specifically, the "default child" instruction is wrong - it is "send missing values to the left", but it should be "send missing values to the right".

This issue manifests itself when the LGBM model was trained on a dataset that DID NOT contain any missing values.

See for yourself: if you insert some missing values into the training dataset, then JPMML-Evaluator predictions will be correct in both cases:

np.random.seed(1)
n_feature = 20
fea_name = ['Fea' + str(i + 1) for i in range(n_feature)]
#### training WITH missing values this time
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
# THIS!
X[X < 5] = np.nan
Y = np.random.randint(0, 2, 1000)

@vruusmann vruusmann reopened this Sep 14, 2021
@SamWqc
Author

SamWqc commented Sep 15, 2021

For LGBM, when predicting with a missing value for a feature that had no missing values during training, it treats the missing value as 0. But in PMML, it seems to return the last prediction. I also want to know how PMML handles missing values for a feature when the training data does contain missing values for that feature.

<TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction">
  <MiningSchema>
    <MiningField name="Fea1"/>
    <MiningField name="Fea2"/>
    <MiningField name="Fea3"/>
    <MiningField name="Fea4"/>
    <MiningField name="Fea6"/>
    <MiningField name="Fea7"/>
    <MiningField name="Fea8"/>
    <MiningField name="Fea9"/>
    <MiningField name="Fea10"/>
    <MiningField name="Fea12"/>
    <MiningField name="Fea14"/>
    <MiningField name="Fea15"/>
    <MiningField name="Fea16"/>
    <MiningField name="Fea18"/>
    <MiningField name="Fea19"/>
    <MiningField name="Fea20"/>
  </MiningSchema>
  <Node score="-0.09482225077425536">
    <True/>
    <Node score="0.1452248008167614">
      <SimplePredicate field="Fea3" operator="greaterThan" value="-21.8774881362915"/>
      <Node score="0.09188101157431317">
        <SimplePredicate field="Fea6" operator="greaterThan" value="-19.934649467468258"/>
        <Node score="-0.10302898758078581">
          <SimplePredicate field="Fea7" operator="greaterThan" value="-12.244786739349363"/>
          <Node score="-0.06281597722878647">
            <SimplePredicate field="Fea4" operator="greaterThan" value="-12.073745250701903"/>
            <Node score="-0.10815819808486733">
              <SimplePredicate field="Fea4" operator="greaterThan" value="2.551533937454224"/>
              <Node score="-0.04641770127647833">
                <SimplePredicate field="Fea8" operator="greaterThan" value="8.414263725280763"/>
              </Node>
              <Node score="-0.07481832980833732">
                <SimplePredicate field="Fea8" operator="greaterThan" value="-7.9984328746795645"/>
                <Node score="0.1636899586314549">
                  <SimplePredicate field="Fea15" operator="greaterThan" value="-14.471662521362303"/>
                  <Node score="0.07521107743604809">
                    <SimplePredicate field="Fea2" operator="greaterThan" value="-4.377085924148559"/>
                    <Node score="0.14189081398910836">
                      <SimplePredicate field="Fea15" operator="greaterThan" value="10.133028984069826"/>
                    </Node>
                    <Node score="0.09664384989953179">
                      <SimplePredicate field="Fea16" operator="greaterThan" value="7.464995622634889"/>
                    </Node>
                    <Node score="0.09794280580640959">
                      <SimplePredicate field="Fea9" operator="greaterThan" value="5.7242188453674325"/>
                    </Node>
                    <Node score="-0.15578658133705325">
                      <SimplePredicate field="Fea9" operator="greaterThan" value="1.9439340829849245"/>
                    </Node>
                    <Node score="-0.06815035615303128">
                      <SimplePredicate field="Fea14" operator="greaterThan" value="0.9443753361701966"/>
                    </Node>
                  </Node>
                  <Node score="-0.022427108230932868">
                    <SimplePredicate field="Fea1" operator="greaterThan" value="-2.777882814407348"/>
                    <Node score="0.1252208798508433">
                      <SimplePredicate field="Fea4" operator="greaterThan" value="7.320906639099122"/>
                    </Node>
                  </Node>
                </Node>
              </Node>
              <Node score="0.019794809895329203">
                <SimplePredicate field="Fea10" operator="greaterThan" value="-1.9369573593139646"/>
              </Node>
            </Node>
            <Node score="0.0652091169530891">
              <SimplePredicate field="Fea9" operator="greaterThan" value="11.225318908691408"/>
              <Node score="-0.08833449262314681">
                <SimplePredicate field="Fea20" operator="greaterThan" value="-0.8956793844699859"/>
              </Node>
            </Node>
            <Node score="-0.03568022357067151">
              <SimplePredicate field="Fea2" operator="greaterThan" value="-15.326576232910154"/>
              <Node score="-0.030809703683317567">
                <SimplePredicate field="Fea18" operator="greaterThan" value="-5.7649521827697745"/>
                <Node score="0.12866983174151886">
                  <SimplePredicate field="Fea14" operator="greaterThan" value="5.901069641113282"/>
                  <Node score="-0.05868613548098403">
                    <SimplePredicate field="Fea18" operator="greaterThan" value="2.781560301780701"/>
                  </Node>
                </Node>
                <Node score="0.1842068006477812">
                  <SimplePredicate field="Fea15" operator="greaterThan" value="8.100379943847658"/>
                </Node>
                <Node score="0.18886971928785534">
                  <SimplePredicate field="Fea16" operator="greaterThan" value="8.023637294769289"/>
                </Node>
                <Node score="0.10299430099982321">
                  <SimplePredicate field="Fea10" operator="greaterThan" value="-6.957179784774779"/>
                </Node>
              </Node>
              <Node score="0.0954853216582624">
                <SimplePredicate field="Fea19" operator="greaterThan" value="0.7611226439476014"/>
              </Node>
            </Node>
          </Node>
          <Node score="0.011865327710640965">
            <SimplePredicate field="Fea12" operator="greaterThan" value="-1.6984871029853819"/>
          </Node>
        </Node>
        <Node score="0.1680864247778106">
          <SimplePredicate field="Fea9" operator="greaterThan" value="6.78233814239502"/>
        </Node>
        <Node score="-0.018285509687264518">
          <SimplePredicate field="Fea9" operator="greaterThan" value="0.5637701749801637"/>
        </Node>
      </Node>
    </Node>
  </Node>
</TreeModel>
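
For illustration, here is a simplified Python sketch of how noTrueChildStrategy="returnLastPrediction" plays out with missing values. This is a sketch of the PMML TreeModel scoring semantics, not JPMML-Evaluator's actual code:

# Simplified sketch; a SimplePredicate over a missing field evaluates to
# UNKNOWN, which is not "true", so the walker skips to the next sibling.
def score_node(node, row):
    # node = {"score": float, "children": [(field, threshold, child), ...]}
    for field, threshold, child in node["children"]:
        value = row.get(field)
        if value is not None and value > threshold:
            return score_node(child, row)
    # No child predicate evaluated to true: return the "last prediction",
    # i.e. the score of the deepest node reached.
    return node["score"]

# With Fea3 missing, the walk never leaves the root of the tree above and
# returns its score, whereas LightGBM itself would map the missing value
# to 0 and keep descending.
tree = {"score": -0.09482225077425536,
        "children": [("Fea3", -21.8774881362915,
                      {"score": 0.1452248008167614, "children": []})]}
print(score_node(tree, {"Fea3": None}))  # -0.09482225077425536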

@vruusmann
Member

vruusmann commented Sep 15, 2021

For LGBM, when predicting with a missing value for a feature that had no missing values during training, it treats the missing value as 0. But in PMML, it seems to return the last prediction.

You can choose between different PMML representations when converting by toggling the compact flag:

pipeline = PMMLPipeline(..)
pipeline.fit(X, y)
# THIS
pipeline.configure(compact = False)
sklearn2pmml(pipeline, "lgbm.pmml")

Both compacted and non-compacted PMML representations suffer from the issue stated above.

I also want to know how PMML handles missing values for a feature when the training data does contain missing values for that feature.

Missing values are sent to the left or right child node depending on the MASK_DEFAULT_LEFT flag:
https://github.com/jpmml/jpmml-lightgbm/blob/1.3.11/src/main/java/org/jpmml/lightgbm/Tree.java#L136

The question is why LightGBM sets the MASK_DEFAULT_LEFT value differently for dense vs. sparse training datasets. Or perhaps there is some super-flag that overrides the MASK_DEFAULT_LEFT value in special cases.
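
For context, LightGBM packs the default direction and the missing type of each split into a single decision_type byte, which is what JPMML-LightGBM reads here. A rough Python sketch of the decoding; the mask values follow LightGBM's tree.h, so treat the exact constants as assumptions:

# Rough sketch of LightGBM's decision_type bit field (constants assumed
# from LightGBM's tree.h: kCategoricalMask = 1, kDefaultLeftMask = 2).
CATEGORICAL_MASK = 1   # split is categorical rather than numerical
DEFAULT_LEFT_MASK = 2  # JPMML-LightGBM's MASK_DEFAULT_LEFT

def decode_decision_type(decision_type):
    default_left = bool(decision_type & DEFAULT_LEFT_MASK)
    # The missing type lives in bits 2-3: 0 = None, 1 = Zero, 2 = NaN.
    missing_type = {0: "None", 1: "Zero", 2: "NaN"}[(decision_type >> 2) & 3]
    return default_left, missing_type

print(decode_decision_type(2))   # (True, 'None')
print(decode_decision_type(10))  # (True, 'NaN')

A dense-trained model ends up with missing type None, which, as the next comment points out, makes LightGBM ignore the default direction at prediction time.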

@vruusmann
Member

@SamWqc TLDR: If your testing dataset contains missing values, then your training dataset should also contain missing values.

It seems to me a flawed assumption that you can train on dense data only, and then test on both dense AND sparse data. No algorithm is guaranteed to have such generalization powers.

@SamWqc
Author

SamWqc commented Sep 15, 2021

Yes, I think MASK_DEFAULT_LEFT alone is not enough.
LGBM first looks at the missing type. If the missing type is None, missing values are converted to 0 and the missing direction is not used. Missing value handling in LGBM: (microsoft/LightGBM#2921 (comment))
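
That behaviour can be confirmed on the LightGBM side alone. A minimal check, reusing my_model and n_feature from the script above (assumes the dense-trained model, i.e. missing type None):

import numpy as np

# With missing type None, LightGBM converts missing values to 0 at
# prediction time, so NaN inputs and zero-filled inputs score identically.
X_nan = (10 * np.random.randn(1000, n_feature)).astype(np.float32)
X_nan[X_nan < 0] = np.nan
X_zero = np.nan_to_num(X_nan, nan=0.0)

assert np.allclose(my_model.predict_proba(X_nan),
                   my_model.predict_proba(X_zero))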
