
A model that was trained on a dense dataset makes incorrect predictions for sparse datasets #51

Open
SamWqc opened this issue Sep 14, 2021 · 7 comments


@SamWqc

SamWqc commented Sep 14, 2021

Hi,
I found that the prediction results produced by the Python LightGBM model and by the PMML file are different.
It happens when the training data contains no missing values, but the data being predicted does.

Here is an example that shows this case.

@SamWqc SamWqc changed the title Missing value handling when predicts data containing missing value but training data contains no missing value Missing value handling when test data contains missing value but training data contains no missing value Sep 14, 2021
@vruusmann
Member

from pypmml import Model

@SamWqc The JPMML software project is not the place to complain about third-party projects. Your reported results have no relevance here.

If you keep spamming the JPMML software project, you will be blocked.

@SamWqc
Author

SamWqc commented Sep 14, 2021

@vruusmann
I am so sorry to bother you, and I think there may be some misunderstanding. I did not mean to spam the project at all.
But I still see the same problem when using jpmml_evaluator.
I hope you could have a look. Thanks!

#######
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from jpmml_evaluator.py4j import launch_gateway, Py4JBackend
from jpmml_evaluator import make_evaluator

np.random.seed(1)
n_feature = 20
fea_name = ['Fea' + str(i + 1) for i in range(n_feature)]
#### training without missing values
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
Y = np.random.randint(0, 2, 1000)  # np.random.random_integers is deprecated

my_model = lgb.LGBMClassifier(n_estimators=100)
my_model.fit(X, Y, feature_name=fea_name)

mapper = DataFrameMapper([([i], None) for i in fea_name])  

pipeline = PMMLPipeline([
    ('mapper', mapper), 
    ("classifier", my_model)
])

sklearn2pmml(pipeline, "lgb.pmml")

#####load pmml#####
gateway = launch_gateway()
backend = Py4JBackend(gateway)

evaluator = make_evaluator(backend, "lgb.pmml").verify()

# evaluate with missing values
np.random.seed(9999)
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
X[X < 0] = np.nan

X = pd.DataFrame(X, columns=fea_name).replace({np.nan: None})

results_df = evaluator.evaluateAll(X)
Jpmml_model_pred = results_df.to_numpy()[:, 2]
my_model_pred = my_model.predict_proba(X.to_numpy())[:, 1]

res_df = pd.DataFrame({
    'my_model_pred': my_model_pred,
    'Jpmml_model_pred': Jpmml_model_pred
})
res_df['pred_diff'] = abs(res_df['my_model_pred'] - res_df['Jpmml_model_pred'])

print(res_df.sort_values('pred_diff', ascending=False).head(10))
     my_model_pred  Jpmml_model_pred  pred_diff
321       0.869994          0.049991   0.820004
628       0.873887          0.056304   0.817583
704       0.974523          0.169809   0.804715
984       0.924378          0.131011   0.793367
893       0.822017          0.029407   0.792610
682       0.044943          0.826341   0.781398
921       0.903011          0.128266   0.774745
995       0.155294          0.925298   0.770004
844       0.856560          0.089665   0.766896
963       0.938739          0.173073   0.765666
# evaluate with missing values replaced by 0
np.random.seed(999)
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
X[X < 0] = np.nan

X = pd.DataFrame(X, columns=fea_name).replace({np.nan: 0})

results_df = evaluator.evaluateAll(X)
Jpmml_model_pred = results_df.to_numpy()[:, 2]
my_model_pred = my_model.predict_proba(X.to_numpy())[:, 1]

res_df = pd.DataFrame({
    'my_model_pred': my_model_pred,
    'Jpmml_model_pred': Jpmml_model_pred
})
res_df['pred_diff'] = abs(res_df['my_model_pred'] - res_df['Jpmml_model_pred'])

print(res_df.sort_values('pred_diff', ascending=False).head(10))
     my_model_pred  Jpmml_model_pred  pred_diff
0         0.242943          0.242943        0.0
671       0.458703          0.458703        0.0
658       0.807326          0.807326        0.0
659       0.748976          0.748976        0.0
660       0.690734          0.690734        0.0
661       0.608443          0.608443        0.0
662       0.625638          0.625638        0.0
663       0.706605          0.706605        0.0
664       0.855556          0.855556        0.0
665       0.259897          0.259897        0.0

@vruusmann vruusmann transferred this issue from jpmml/sklearn2pmml Sep 14, 2021
@vruusmann vruusmann changed the title Missing value handling when test data contains missing value but training data contains no missing value A model that was trained on a dense dataset (ie. without missing values) makes incorrect predictions for sparse datasets (ie. with missing values) Sep 14, 2021
@vruusmann vruusmann changed the title A model that was trained on a dense dataset (ie. without missing values) makes incorrect predictions for sparse datasets (ie. with missing values) A model that was trained on a dense dataset makes incorrect predictions for sparse datasets Sep 14, 2021
@vruusmann
Member

@SamWqc But I still see the same problem when using jpmml_evaluator.

That's the correct way of doing things!

I moved this issue to the JPMML-LightGBM project, because it looks like an LGBM-to-PMML conversion issue. Specifically, the "default child" instruction is wrong - it is "send missing values to the left", but it should be "send missing values to the right".

This issue manifests itself when the LGBM model was trained on a dataset that DID NOT contain any missing values.

See for yourself: if you insert some missing values into the training dataset, then JPMML-Evaluator predictions will be correct in both cases:

np.random.seed(1)
n_feature = 20
fea_name = ['Fea' + str(i + 1) for i in range(n_feature)]
#### training WITH missing values this time
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
# THIS!
X[X < 5] = np.nan
Y = np.random.randint(0, 2, 1000)

@vruusmann vruusmann reopened this Sep 14, 2021
@SamWqc
Author

SamWqc commented Sep 15, 2021

For LGBM, when predicting with a missing value for a feature that had no missing values during training, it treats the missing value as 0. But in PMML, it seems to return the last prediction. I also want to know how PMML handles missing values for a feature when the training data does contain missing values for that feature.

<TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction">
  <MiningSchema>
    <MiningField name="Fea1"/>
    <MiningField name="Fea2"/>
    <MiningField name="Fea3"/>
    <MiningField name="Fea4"/>
    <MiningField name="Fea6"/>
    <MiningField name="Fea7"/>
    <MiningField name="Fea8"/>
    <MiningField name="Fea9"/>
    <MiningField name="Fea10"/>
    <MiningField name="Fea12"/>
    <MiningField name="Fea14"/>
    <MiningField name="Fea15"/>
    <MiningField name="Fea16"/>
    <MiningField name="Fea18"/>
    <MiningField name="Fea19"/>
    <MiningField name="Fea20"/>
  </MiningSchema>
  <Node score="-0.09482225077425536">
    <True/>
    <Node score="0.1452248008167614">
      <SimplePredicate field="Fea3" operator="greaterThan" value="-21.8774881362915"/>
      <Node score="0.09188101157431317">
        <SimplePredicate field="Fea6" operator="greaterThan" value="-19.934649467468258"/>
        <Node score="-0.10302898758078581">
          <SimplePredicate field="Fea7" operator="greaterThan" value="-12.244786739349363"/>
          <Node score="-0.06281597722878647">
            <SimplePredicate field="Fea4" operator="greaterThan" value="-12.073745250701903"/>
            <Node score="-0.10815819808486733">
              <SimplePredicate field="Fea4" operator="greaterThan" value="2.551533937454224"/>
              <Node score="-0.04641770127647833">
                <SimplePredicate field="Fea8" operator="greaterThan" value="8.414263725280763"/>
              </Node>
              <Node score="-0.07481832980833732">
                <SimplePredicate field="Fea8" operator="greaterThan" value="-7.9984328746795645"/>
                <Node score="0.1636899586314549">
                  <SimplePredicate field="Fea15" operator="greaterThan" value="-14.471662521362303"/>
                  <Node score="0.07521107743604809">
                    <SimplePredicate field="Fea2" operator="greaterThan" value="-4.377085924148559"/>
                    <Node score="0.14189081398910836">
                      <SimplePredicate field="Fea15" operator="greaterThan" value="10.133028984069826"/>
                    </Node>
                    <Node score="0.09664384989953179">
                      <SimplePredicate field="Fea16" operator="greaterThan" value="7.464995622634889"/>
                    </Node>
                    <Node score="0.09794280580640959">
                      <SimplePredicate field="Fea9" operator="greaterThan" value="5.7242188453674325"/>
                    </Node>
                    <Node score="-0.15578658133705325">
                      <SimplePredicate field="Fea9" operator="greaterThan" value="1.9439340829849245"/>
                    </Node>
                    <Node score="-0.06815035615303128">
                      <SimplePredicate field="Fea14" operator="greaterThan" value="0.9443753361701966"/>
                    </Node>
                  </Node>
                  <Node score="-0.022427108230932868">
                    <SimplePredicate field="Fea1" operator="greaterThan" value="-2.777882814407348"/>
                    <Node score="0.1252208798508433">
                      <SimplePredicate field="Fea4" operator="greaterThan" value="7.320906639099122"/>
                    </Node>
                  </Node>
                </Node>
              </Node>
              <Node score="0.019794809895329203">
                <SimplePredicate field="Fea10" operator="greaterThan" value="-1.9369573593139646"/>
              </Node>
            </Node>
            <Node score="0.0652091169530891">
              <SimplePredicate field="Fea9" operator="greaterThan" value="11.225318908691408"/>
              <Node score="-0.08833449262314681">
                <SimplePredicate field="Fea20" operator="greaterThan" value="-0.8956793844699859"/>
              </Node>
            </Node>
            <Node score="-0.03568022357067151">
              <SimplePredicate field="Fea2" operator="greaterThan" value="-15.326576232910154"/>
              <Node score="-0.030809703683317567">
                <SimplePredicate field="Fea18" operator="greaterThan" value="-5.7649521827697745"/>
                <Node score="0.12866983174151886">
                  <SimplePredicate field="Fea14" operator="greaterThan" value="5.901069641113282"/>
                  <Node score="-0.05868613548098403">
                    <SimplePredicate field="Fea18" operator="greaterThan" value="2.781560301780701"/>
                  </Node>
                </Node>
                <Node score="0.1842068006477812">
                  <SimplePredicate field="Fea15" operator="greaterThan" value="8.100379943847658"/>
                </Node>
                <Node score="0.18886971928785534">
                  <SimplePredicate field="Fea16" operator="greaterThan" value="8.023637294769289"/>
                </Node>
                <Node score="0.10299430099982321">
                  <SimplePredicate field="Fea10" operator="greaterThan" value="-6.957179784774779"/>
                </Node>
              </Node>
              <Node score="0.0954853216582624">
                <SimplePredicate field="Fea19" operator="greaterThan" value="0.7611226439476014"/>
              </Node>
            </Node>
          </Node>
          <Node score="0.011865327710640965">
            <SimplePredicate field="Fea12" operator="greaterThan" value="-1.6984871029853819"/>
          </Node>
        </Node>
        <Node score="0.1680864247778106">
          <SimplePredicate field="Fea9" operator="greaterThan" value="6.78233814239502"/>
        </Node>
        <Node score="-0.018285509687264518">
          <SimplePredicate field="Fea9" operator="greaterThan" value="0.5637701749801637"/>
        </Node>
      </Node>
    </Node>
  </Node>
</TreeModel>
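
For illustration, here is a simplified Python sketch of how noTrueChildStrategy="returnLastPrediction" plays out with missing values. This is a sketch of the PMML TreeModel scoring semantics, not JPMML-Evaluator's actual code:

# Simplified sketch; a SimplePredicate over a missing field evaluates to
# UNKNOWN, which is not "true", so the walker skips to the next sibling.
def score_node(node, row):
    # node = {"score": float, "children": [(field, threshold, child), ...]}
    for field, threshold, child in node["children"]:
        value = row.get(field)
        if value is not None and value > threshold:
            return score_node(child, row)
    # No child predicate evaluated to true: return the "last prediction",
    # i.e. the score of the deepest node reached.
    return node["score"]

# With Fea3 missing, the walk never leaves the root of the tree above and
# returns its score, whereas LightGBM itself would map the missing value
# to 0 and keep descending.
tree = {"score": -0.09482225077425536,
        "children": [("Fea3", -21.8774881362915,
                      {"score": 0.1452248008167614, "children": []})]}
print(score_node(tree, {"Fea3": None}))  # -0.09482225077425536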

@vruusmann
Member

vruusmann commented Sep 15, 2021

For LGBM, when predicting with a missing value for a feature that had no missing values during training, it treats the missing value as 0. But in PMML, it seems to return the last prediction.

You can choose between different PMML representations when converting by toggling the compact flag:

pipeline = PMMLPipeline(..)
pipeline.fit(X, y)
# THIS
pipeline.configure(compact = False)
sklearn2pmml(pipeline, "lgbm.pmml")

Both compacted and non-compacted PMML representations suffer from the issue stated above.

I also want to know how PMML handles missing values for a feature when the training data does contain missing values for that feature.

Missing values are sent to the left or right child node depending on the MASK_DEFAULT_LEFT flag:
https://github.com/jpmml/jpmml-lightgbm/blob/1.3.11/src/main/java/org/jpmml/lightgbm/Tree.java#L136

The question is why LightGBM sets the MASK_DEFAULT_LEFT value differently for dense vs. sparse training datasets. Or perhaps there is some super-flag that overrides the MASK_DEFAULT_LEFT value in special cases.
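
For context, LightGBM packs the default direction and the missing type of each split into a single decision_type byte, which is what JPMML-LightGBM reads here. A rough Python sketch of the decoding; the mask values follow LightGBM's tree.h, so treat the exact constants as assumptions:

# Rough sketch of LightGBM's decision_type bit field (constants assumed
# from LightGBM's tree.h: kCategoricalMask = 1, kDefaultLeftMask = 2).
CATEGORICAL_MASK = 1   # split is categorical rather than numerical
DEFAULT_LEFT_MASK = 2  # JPMML-LightGBM's MASK_DEFAULT_LEFT

def decode_decision_type(decision_type):
    default_left = bool(decision_type & DEFAULT_LEFT_MASK)
    # The missing type lives in bits 2-3: 0 = None, 1 = Zero, 2 = NaN.
    missing_type = {0: "None", 1: "Zero", 2: "NaN"}[(decision_type >> 2) & 3]
    return default_left, missing_type

print(decode_decision_type(2))   # (True, 'None')
print(decode_decision_type(10))  # (True, 'NaN')

A dense-trained model ends up with missing type None, which, as the next comment points out, makes LightGBM ignore the default direction at prediction time.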

@vruusmann
Member

@SamWqc TLDR: If your testing dataset contains missing values, then your training dataset should also contain missing values.

It seems to me a flawed assumption that you can train on dense data only, and then test on both dense AND sparse data. No algorithm is guaranteed to have such generalization powers.

@SamWqc
Author

SamWqc commented Sep 15, 2021

Yes, I think MASK_DEFAULT_LEFT alone is not enough.
LGBM first looks at the missing type. If the missing type is None, missing values are converted to 0 and the missing direction is not used. Missing value handling in LGBM: (microsoft/LightGBM#2921 (comment))
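
That behaviour can be confirmed on the LightGBM side alone. A minimal check, reusing my_model and n_feature from the script above (assumes the dense-trained model, i.e. missing type None):

import numpy as np

# With missing type None, LightGBM converts missing values to 0 at
# prediction time, so NaN inputs and zero-filled inputs score identically.
X_nan = (10 * np.random.randn(1000, n_feature)).astype(np.float32)
X_nan[X_nan < 0] = np.nan
X_zero = np.nan_to_num(X_nan, nan=0.0)

assert np.allclose(my_model.predict_proba(X_nan),
                   my_model.predict_proba(X_zero))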
