A model that was trained on a dense dataset makes incorrect predictions for sparse datasets #51
Comments
@SamWqc The JPMML software project is not a place to complain about third-party projects. Your reported results have no relevance here. If you keep spamming the JPMML software project, you will be blocked.
@vruusmann

```python
import lightgbm as lgb
import numpy as np
import pandas as pd

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn_pandas import DataFrameMapper
from jpmml_evaluator import make_evaluator
from jpmml_evaluator.py4j import launch_gateway, Py4JBackend
np.random.seed(1)

n_feature = 20
fea_name = ["Fea" + str(i + 1) for i in range(n_feature)]

# Train on a dense dataset (no missing values)
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
Y = np.random.randint(0, 2, 1000)  # np.random.random_integers is deprecated

my_model = lgb.LGBMClassifier(n_estimators=100)
my_model.fit(X, Y, feature_name=fea_name)

mapper = DataFrameMapper([([i], None) for i in fea_name])
pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("classifier", my_model)
])
sklearn2pmml(pipeline, "lgb.pmml")
# Load the PMML file
gateway = launch_gateway()
backend = Py4JBackend(gateway)
evaluator = make_evaluator(backend, "lgb.pmml") \
    .verify()
# Evaluate with missing values
np.random.seed(9999)
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
X[X < 0] = np.nan
X = pd.DataFrame(X, columns=fea_name).replace({np.nan: None})

results_df = evaluator.evaluateAll(X)
Jpmml_model_pred = results_df.to_numpy()[:, 2]
my_model_pred = my_model.predict_proba(X.to_numpy())[:, 1]

res_df = pd.DataFrame({
    "my_model_pred": my_model_pred,
    "Jpmml_model_pred": Jpmml_model_pred
})
res_df["pred_diff"] = abs(res_df["my_model_pred"] - res_df["Jpmml_model_pred"])
print(res_df.sort_values("pred_diff", ascending=False).head(10))
# Evaluate with missing values imputed as 0
np.random.seed(999)
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
X[X < 0] = np.nan
X = pd.DataFrame(X, columns=fea_name).replace({np.nan: 0})

results_df = evaluator.evaluateAll(X)
Jpmml_model_pred = results_df.to_numpy()[:, 2]
my_model_pred = my_model.predict_proba(X.to_numpy())[:, 1]

res_df = pd.DataFrame({
    "my_model_pred": my_model_pred,
    "Jpmml_model_pred": Jpmml_model_pred
})
res_df["pred_diff"] = abs(res_df["my_model_pred"] - res_df["Jpmml_model_pred"])
print(res_df.sort_values("pred_diff", ascending=False).head(10))
```
@SamWqc

> But I still found the same problem when using jpmml_evaluator.

That's the correct way of doing things!

I moved this issue to the JPMML-LightGBM project, because it looks like a LGBM-to-PMML conversion issue. Specifically, the "default child" instruction is wrong: it is "send missing values to the left", but it should be "send missing values to the right".

This issue manifests itself when the LGBM model was trained on a dataset that DID NOT contain any missing values. See for yourself: if you insert some missing values into the training dataset, then JPMML-Evaluator predictions will be correct in both cases:

```python
np.random.seed(1)

n_feature = 20
fea_name = ["Fea" + str(i + 1) for i in range(n_feature)]

# Training, this time WITH missing values
X = 10 * np.random.randn(1000, n_feature)
X = X.astype(np.float32)
# THIS!
X[X < 5] = np.nan
Y = np.random.randint(0, 2, 1000)
```
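To make the default-direction point concrete, here is a minimal, hypothetical sketch (not LightGBM or JPMML code; the `route` helper is invented for illustration) of how a single split routes values. A model trained without missing values treats NaN like 0.0 at prediction time, so a markup that sends missing values to the wrong side diverges on sparse inputs:

```python
import math

def route(value, threshold, default_left):
    # A missing value follows the split's default direction;
    # a present value is compared against the threshold.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "left" if default_left else "right"
    return "left" if value <= threshold else "right"

# For the negative thresholds in this model, "treat NaN as 0.0" implies the
# default direction must be "right" (because 0.0 > threshold):
print(route(0.0, -21.88, default_left=True))            # right
print(route(float("nan"), -21.88, default_left=False))  # right, agrees with 0.0
print(route(float("nan"), -21.88, default_left=True))   # left, the divergent route
```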
In LGBM, when predicting with a missing value for a feature that had no missing values during training, the missing value is treated as 0. But the PMML markup seems to return the last prediction instead:

```xml
<TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction">
<MiningSchema>
<MiningField name="Fea1"/>
<MiningField name="Fea2"/>
<MiningField name="Fea3"/>
<MiningField name="Fea4"/>
<MiningField name="Fea6"/>
<MiningField name="Fea7"/>
<MiningField name="Fea8"/>
<MiningField name="Fea9"/>
<MiningField name="Fea10"/>
<MiningField name="Fea12"/>
<MiningField name="Fea14"/>
<MiningField name="Fea15"/>
<MiningField name="Fea16"/>
<MiningField name="Fea18"/>
<MiningField name="Fea19"/>
<MiningField name="Fea20"/>
</MiningSchema>
<Node score="-0.09482225077425536">
<True/>
<Node score="0.1452248008167614">
<SimplePredicate field="Fea3" operator="greaterThan" value="-21.8774881362915"/>
<Node score="0.09188101157431317">
<SimplePredicate field="Fea6" operator="greaterThan" value="-19.934649467468258"/>
<Node score="-0.10302898758078581">
<SimplePredicate field="Fea7" operator="greaterThan" value="-12.244786739349363"/>
<Node score="-0.06281597722878647">
<SimplePredicate field="Fea4" operator="greaterThan" value="-12.073745250701903"/>
<Node score="-0.10815819808486733">
<SimplePredicate field="Fea4" operator="greaterThan" value="2.551533937454224"/>
<Node score="-0.04641770127647833">
<SimplePredicate field="Fea8" operator="greaterThan" value="8.414263725280763"/>
</Node>
<Node score="-0.07481832980833732">
<SimplePredicate field="Fea8" operator="greaterThan" value="-7.9984328746795645"/>
<Node score="0.1636899586314549">
<SimplePredicate field="Fea15" operator="greaterThan" value="-14.471662521362303"/>
<Node score="0.07521107743604809">
<SimplePredicate field="Fea2" operator="greaterThan" value="-4.377085924148559"/>
<Node score="0.14189081398910836">
<SimplePredicate field="Fea15" operator="greaterThan" value="10.133028984069826"/>
</Node>
<Node score="0.09664384989953179">
<SimplePredicate field="Fea16" operator="greaterThan" value="7.464995622634889"/>
</Node>
<Node score="0.09794280580640959">
<SimplePredicate field="Fea9" operator="greaterThan" value="5.7242188453674325"/>
</Node>
<Node score="-0.15578658133705325">
<SimplePredicate field="Fea9" operator="greaterThan" value="1.9439340829849245"/>
</Node>
<Node score="-0.06815035615303128">
<SimplePredicate field="Fea14" operator="greaterThan" value="0.9443753361701966"/>
</Node>
</Node>
<Node score="-0.022427108230932868">
<SimplePredicate field="Fea1" operator="greaterThan" value="-2.777882814407348"/>
<Node score="0.1252208798508433">
<SimplePredicate field="Fea4" operator="greaterThan" value="7.320906639099122"/>
</Node>
</Node>
</Node>
</Node>
<Node score="0.019794809895329203">
<SimplePredicate field="Fea10" operator="greaterThan" value="-1.9369573593139646"/>
</Node>
</Node>
<Node score="0.0652091169530891">
<SimplePredicate field="Fea9" operator="greaterThan" value="11.225318908691408"/>
<Node score="-0.08833449262314681">
<SimplePredicate field="Fea20" operator="greaterThan" value="-0.8956793844699859"/>
</Node>
</Node>
<Node score="-0.03568022357067151">
<SimplePredicate field="Fea2" operator="greaterThan" value="-15.326576232910154"/>
<Node score="-0.030809703683317567">
<SimplePredicate field="Fea18" operator="greaterThan" value="-5.7649521827697745"/>
<Node score="0.12866983174151886">
<SimplePredicate field="Fea14" operator="greaterThan" value="5.901069641113282"/>
<Node score="-0.05868613548098403">
<SimplePredicate field="Fea18" operator="greaterThan" value="2.781560301780701"/>
</Node>
</Node>
<Node score="0.1842068006477812">
<SimplePredicate field="Fea15" operator="greaterThan" value="8.100379943847658"/>
</Node>
<Node score="0.18886971928785534">
<SimplePredicate field="Fea16" operator="greaterThan" value="8.023637294769289"/>
</Node>
<Node score="0.10299430099982321">
<SimplePredicate field="Fea10" operator="greaterThan" value="-6.957179784774779"/>
</Node>
</Node>
<Node score="0.0954853216582624">
<SimplePredicate field="Fea19" operator="greaterThan" value="0.7611226439476014"/>
</Node>
</Node>
</Node>
<Node score="0.011865327710640965">
<SimplePredicate field="Fea12" operator="greaterThan" value="-1.6984871029853819"/>
</Node>
</Node>
<Node score="0.1680864247778106">
<SimplePredicate field="Fea9" operator="greaterThan" value="6.78233814239502"/>
</Node>
<Node score="-0.018285509687264518">
<SimplePredicate field="Fea9" operator="greaterThan" value="0.5637701749801637"/>
</Node>
</Node>
</Node>
</Node>
</TreeModel>
```
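To make the divergence concrete, here is a minimal sketch (hypothetical code, not JPMML-Evaluator) of how a markup like the one above is scored under `noTrueChildStrategy="returnLastPrediction"`: a `SimplePredicate` over a missing input does not evaluate to TRUE, so the walk stops early and the current node's score is returned, instead of descending further as LightGBM would:

```python
# Hypothetical mini-evaluator for the tree markup above.
def score(node, row):
    for child in node.get("children", []):
        field, op, value = child["predicate"]
        x = row.get(field)  # None stands for a missing input value
        # A SimplePredicate over a missing value evaluates to UNKNOWN, not TRUE:
        matches = x is not None and (x > value if op == "greaterThan" else x <= value)
        if matches:
            return score(child, row)  # descend into the first true child
    return node["score"]  # no true child: return the last prediction

# The top of the markup above, transcribed by hand:
tree = {"score": -0.09482225077425536, "children": [
    {"predicate": ("Fea3", "greaterThan", -21.8774881362915),
     "score": 0.1452248008167614, "children": []},
]}
print(score(tree, {"Fea3": None}))  # -0.0948...: stops at the root, unlike LightGBM
print(score(tree, {"Fea3": 0.0}))   # 0.1452...: descends normally
```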
You can choose between different PMML representations when converting by toggling the `compact` conversion option:

```python
pipeline = PMMLPipeline(..)
pipeline.fit(X, y)
# THIS
pipeline.configure(compact = False)
sklearn2pmml(pipeline, "lgbm.pmml")
```

Both compacted and non-compacted PMML representations suffer from the above-stated issue.
Missing values are sent to the left or right child node depending on the value of the split's default-direction flag.

The question is why LightGBM is setting the missing value to 0 in this case.
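For reference, the per-split default direction can be inspected on the trained model itself. This is a sketch using LightGBM's standard `Booster.dump_model()` JSON dump (the `walk` helper is invented here; node keys such as `default_left` and `missing_type` are part of that dump):

```python
# Walk the first tree of the trained model and print each split's
# default direction, using the JSON dump of the underlying Booster.
dump = my_model.booster_.dump_model()

def walk(node, depth=0):
    if "split_feature" in node:  # internal node; leaves only carry "leaf_value"
        print("  " * depth,
              "Fea%d" % (node["split_feature"] + 1),  # assumes features named Fea1..Fea20 in order
              "threshold=%s" % node["threshold"],
              "default_left=%s" % node["default_left"],
              "missing_type=%s" % node.get("missing_type"))
        walk(node["left_child"], depth + 1)
        walk(node["right_child"], depth + 1)

walk(dump["tree_info"][0]["tree_structure"])
```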
@SamWqc TLDR: If your testing dataset contains missing values, then your training dataset should also contain missing values. It seems to me like a flawed assumption that you can train with dense data only, and then test both with dense AND sparse data. No algorithm is guaranteed to have such generalization powers.
Hi,
I found that the prediction results produced by the Python LightGBM model and by the PMML file are different.
It happens when the training data did not contain missing values, but the prediction data does.
The example above demonstrates this case.