
prediction of leaf ids #14

Closed · Denisevi4 opened this issue Mar 24, 2018 · 20 comments

@Denisevi4 commented Mar 24, 2018

Awesome project! Thanks!

Could you also add prediction of leaf ids for each tree?

For instance, if I have 10 trees in the model, then for each event I would get a vector of length 10 with the leaf id from each tree.

This function is needed if one wants to get just the partition info for each event.

@hcho3 (Collaborator) commented Mar 25, 2018

I think XGBoost already lets you produce leaf ids. What would be the benefit of having treelite support leaf outputs? The focus of this project is faster prediction performance, and I don't see the point of outputting leaf ids in cases where fast performance is required.

I'm not familiar with what you are trying to achieve here, so a little bit of explanation would be appreciated. Thanks!
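For reference, getting leaf ids out of XGBoost itself is a one-liner via the pred_leaf flag; a minimal sketch, assuming bst is a trained Booster and dtrain an xgboost.DMatrix:

```python
# Each row holds the index of the leaf reached in every tree of the ensemble
leaf_ids = bst.predict(dtrain, pred_leaf=True)  # shape: (num_rows, num_trees)
```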

@Denisevi4 (Author) commented Mar 25, 2018

Sure, no problem.

I'm trying to improve an XGBoost model by doing a linear regression with regularization. The linear model uses as features the ids of the leaves that XGBoost constructed, similar to what Facebook did with their ads in this paper: https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/

I'm planning to do further pruning and maybe repeat the linear regression. Once you prune, the XGBoost model is gone and you have to create your own tree structures; XGBoost just gives you the initial partitioning.

I have my own Python Tree class that I currently use. Once I prune my trees, I can make new predictions, predict new leaf ids, etc. But it's extremely slow, since I'm doing predictions through a Python linked list. I see a great benefit from your package for doing just this.

I could use leaf predictions from XGBoost to feed to your trees. However, I'd have to construct a shared model file for each tree in the model.

So basically, if the predict method could optionally return not just the sum of the predictions from all trees, but the whole sequence of per-tree predictions, that'd be all I need. For example, if I have 200 trees, return the array of predictions from each of them.
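For concreteness, here is a rough sketch of the Facebook-style pipeline described above: XGBoost's pred_leaf output is one-hot encoded and fed to a regularized linear model. All parameter values are illustrative.

```python
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.linear_model import Lasso
from sklearn.preprocessing import OneHotEncoder

X, y = load_boston(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({"max_depth": 3, "objective": "reg:linear"}, dtrain,
                num_boost_round=10)

# leaf_ids[i, t] = index of the leaf that row i falls into in tree t
leaf_ids = bst.predict(dtrain, pred_leaf=True)

# One binary feature per (tree, leaf) pair, as in the Facebook paper
leaf_features = OneHotEncoder().fit_transform(leaf_ids)

# L1-regularized linear regression on top of the leaf features
reg = Lasso(alpha=0.1).fit(leaf_features, y)
```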

@hcho3 (Collaborator) commented Mar 25, 2018

> I'm trying to improve an XGBoost model by doing a linear regression with regularization. The linear model uses as features the ids of the leaves that XGBoost constructed.

I've seen papers in the past that use sparse linear classifiers to prune trees. Is your approach similar to those papers? The Facebook paper you linked appears to use random forests to create a non-linear feature transformer, but for a different purpose.

> Once you prune, the XGBoost model is gone and you have to create your own tree structures; XGBoost just gives you the initial partitioning.

I see now how treelite helps your work here. Treelite has a model builder API with which you can build arbitrary decision trees.
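For reference, a tiny sketch of what the builder API looks like, roughly following the treelite tutorial (a single stump with one numerical split; all values are illustrative):

```python
import treelite

builder = treelite.ModelBuilder(num_feature=2)
tree = treelite.ModelBuilder.Tree()
# Node 0: if feature 0 < 0.5, go to node 1, else node 2; missing values go left
tree[0].set_numerical_test_node(feature_id=0, opname='<', threshold=0.5,
                                default_left=True,
                                left_child_key=1, right_child_key=2)
tree[1].set_leaf_node(-1.0)
tree[2].set_leaf_node(1.0)
tree[0].set_root()
builder.append(tree)
model = builder.commit()  # yields a treelite.Model
```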

@Denisevi4 (Author) commented Mar 25, 2018

Yes, those also seem similar to what I want to do. This idea has been floating around for some time; there is also the RuleFit paper by J. Friedman from 2005: http://statweb.stanford.edu/~jhf/ftp/RuleFit.pdf

And yes, I was going to use your model builder API, and I should be able to do it even now. The only problem is that with the current treelite setup I would have to build a custom model for each tree and make custom leaf id predictions for each of them separately.

@hcho3 (Collaborator) commented Mar 25, 2018

Aha, so if treelite starts supporting leaf id outputs, you could simply use the model builder API in treelite and be done with it.

Are you satisfied with the current performance of treelite?

My only concern is how much engineering effort would be necessary to support leaf id outputs. It might be easier for me to build an exporter that produces an XGBoost model file, which you'd then feed into XGBoost to get leaf ids.

@hcho3 (Collaborator) commented Mar 25, 2018

> It might be easier for me to build an exporter that produces an XGBoost model file.

And this I can do very easily, because I am quite familiar with the XGBoost model format.

@Denisevi4 (Author)

> It might be easier for me to build an exporter that produces an XGBoost model file, which you'd then feed into XGBoost to get leaf ids.

Oh, that's an interesting idea. I think that would work fine too!

@Denisevi4 (Author) commented Mar 25, 2018

> Are you satisfied with the current performance of treelite?

I don't know yet :) I just found it yesterday and it immediately clicked that this is exactly what I need, so I kept thinking about it all day.

But the fact that it generates efficient C++ code makes me think that I will be satisfied with the performance. I'll check that on Monday.

I'm also interested in the current prediction functionality, because I need low-latency model evaluation. I also know that my data has no missing values. Could the missing-value check optionally be removed?

@hcho3 (Collaborator) commented Mar 25, 2018

For now, I'll go ahead and add an experimental feature for exporting XGBoost models. This can be done most easily on my part.

> Could the missing-value check optionally be removed?

For now, I don't think that's possible. I'll go into more detail later if you are interested in how missing values are handled.
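Roughly speaking, each split node stores a default direction and a missing value simply follows it, so the check cannot be dropped without changing the traversal itself. A Python sketch of the logic (node fields are hypothetical):

```python
import math

def predict_one(node, x):
    # Walk down the tree; missing features follow the node's default direction
    while not node.is_leaf:
        value = x[node.split_feature]
        if value is None or math.isnan(value):
            node = node.left if node.default_left else node.right
        else:
            node = node.left if value < node.threshold else node.right
    return node.leaf_value
```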

@hcho3 (Collaborator) commented Mar 25, 2018

@Denisevi4 I've added the exporting feature to the dev branch export_xgboost. Now you should be able to write

```python
# model is of type treelite.Model
model.export_as_xgboost('test.model', name_obj='binary:logistic')
```

(See the API doc for export_as_xgboost().) The parameter name_obj should be set to one of the possible values of objective listed on this page.
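For the leaf-id use case discussed above, the exported file can then be loaded back into XGBoost and queried with pred_leaf; a short sketch, assuming dmat is an xgboost.DMatrix:

```python
import xgboost as xgb

bst = xgb.Booster(model_file='test.model')
leaf_ids = bst.predict(dmat, pred_leaf=True)  # shape: (num_rows, num_trees)
```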

@Denisevi4 (Author)

Beautiful! I will give it a try.

@Denisevi4 (Author) commented Mar 27, 2018

Hm, the exported model predicts zeros. For some reason I can't attach the Jupyter notebook file; maybe this functionality is blocked at my work.

But my code looks like this:

```python
from sklearn.datasets import load_boston
X, y = load_boston(return_X_y=True)

import xgboost
dtrain = xgboost.DMatrix(X, label=y)
params = {"max_depth": 3, "eta": 1, "silent": 1, "objective": "reg:linear",
          "eval_metric": "rmse", "base_score": 0.0}
bst = xgboost.train(params, dtrain, 3, [(dtrain, "train")])

# This predicts well
bst.predict(dtrain)

import treelite
model = treelite.Model.from_xgboost(bst)

toolchain = 'clang'
model.export_lib(toolchain=toolchain, libpath="./mymodel.so", verbose=True)

import treelite.runtime  # runtime module
predictor = treelite.runtime.Predictor("./mymodel.so", verbose=True)

batch = treelite.runtime.Batch.from_npy2d(X, rbegin=0, rend=10)

# This also predicts well
out_pred = predictor.predict(batch, verbose=True)

model.export_as_xgboost('test.model', name_obj="reg:linear")

import xgboost as xgb
bst_new = xgb.Booster(model_file="test.model")

# This one predicts zeros
bst_new.predict(dtrain)
```

@Denisevi4 (Author) commented Mar 27, 2018

Dump of the initial XGBoost model (obtained via bst.get_dump(with_stats=True)):
['0:[f12<9.725] yes=1,no=2,missing=1,gain=18223.5,cover=506\n\t1:[f5<6.941] yes=3,no=4,missing=3,gain=6826.89,cover=212\n\t\t3:[f7<1.48495] yes=7,no=8,missing=7,gain=525.821,cover=142\n\t\t\t7:leaf=40,cover=4\n\t\t\t8:leaf=24.5122,cover=138\n\t\t4:[f5<7.437] yes=9,no=10,missing=9,gain=675.372,cover=70\n\t\t\t9:leaf=32.7439,cover=40\n\t\t\t10:leaf=43.6419,cover=30\n\t2:[f12<16.085] yes=5,no=6,missing=5,gain=2368.79,cover=294\n\t\t5:[f11<116.025] yes=11,no=12,missing=11,gain=106.073,cover=150\n\t\t\t11:leaf=12.2625,cover=7\n\t\t\t12:leaf=20.4667,cover=143\n\t\t6:[f4<0.603] yes=13,no=14,missing=13,gain=624.634,cover=144\n\t\t\t13:leaf=17.358,cover=49\n\t\t\t14:leaf=12.3521,cover=95\n',
'0:[f12<5.23] yes=1,no=2,missing=1,gain=725.987,cover=506\n\t1:[f6<86.7] yes=3,no=4,missing=3,gain=216.132,cover=69\n\t\t3:[f9<270.5] yes=7,no=8,missing=7,gain=199.31,cover=59\n\t\t\t7:leaf=4.40188,cover=29\n\t\t\t8:leaf=0.72745,cover=30\n\t\t4:leaf=7.52351,cover=10\n\t2:[f5<8.589] yes=5,no=6,missing=5,gain=234.058,cover=437\n\t\t5:[f7<4.3607] yes=9,no=10,missing=9,gain=226.039,cover=436\n\t\t\t9:leaf=0.431017,cover=310\n\t\t\t10:leaf=-1.15223,cover=126\n\t\t6:leaf=-10.871,cover=1\n',
'0:[f0<15.718] yes=1,no=2,missing=1,gain=292.434,cover=506\n\t1:[f7<1.3034] yes=3,no=4,missing=3,gain=216.278,cover=480\n\t\t3:[f5<5.257] yes=7,no=8,missing=7,gain=31.0804,cover=5\n\t\t\t7:leaf=0.50845,cover=1\n\t\t\t8:leaf=7.17457,cover=4\n\t\t4:[f10<17.7] yes=9,no=10,missing=9,gain=210.272,cover=475\n\t\t\t9:leaf=1.07691,cover=151\n\t\t\t10:leaf=-0.348035,cover=324\n\t2:[f4<0.6695] yes=5,no=6,missing=5,gain=71.4314,cover=26\n\t\t5:[f0<39.8958] yes=11,no=12,missing=11,gain=2.78149,cover=5\n\t\t\t11:leaf=-0.248847,cover=4\n\t\t\t12:leaf=1.15324,cover=1\n\t\t6:[f4<0.675] yes=13,no=14,missing=13,gain=9.21314,cover=21\n\t\t\t13:leaf=-2.07168,cover=5\n\t\t\t14:leaf=-4.41412,cover=16\n']

Dump of the exported model:
['0:[f12<9.725] yes=1,no=2,missing=1,gain=nan,cover=nan\n\t1:[f5<6.941] yes=3,no=4,missing=3,gain=nan,cover=nan\n\t\t3:[f7<1.48495] yes=7,no=8,missing=7,gain=nan,cover=nan\n\t\t\t7:leaf=40,cover=nan\n\t\t\t8:leaf=24.5122,cover=nan\n\t\t4:[f5<7.437] yes=9,no=10,missing=9,gain=nan,cover=nan\n\t\t\t9:leaf=32.7439,cover=nan\n\t\t\t10:leaf=43.6419,cover=nan\n\t2:[f12<16.085] yes=5,no=6,missing=5,gain=nan,cover=nan\n\t\t5:[f11<116.025] yes=11,no=12,missing=11,gain=nan,cover=nan\n\t\t\t11:leaf=12.2625,cover=nan\n\t\t\t12:leaf=20.4667,cover=nan\n\t\t6:[f4<0.603] yes=13,no=14,missing=13,gain=nan,cover=nan\n\t\t\t13:leaf=17.358,cover=nan\n\t\t\t14:leaf=12.3521,cover=nan\n',
'0:[f12<5.23] yes=1,no=2,missing=1,gain=nan,cover=nan\n\t1:[f6<86.7] yes=3,no=4,missing=3,gain=nan,cover=nan\n\t\t3:[f9<270.5] yes=7,no=8,missing=7,gain=nan,cover=nan\n\t\t\t7:leaf=4.40188,cover=nan\n\t\t\t8:leaf=0.72745,cover=nan\n\t\t4:leaf=7.52351,cover=nan\n\t2:[f5<8.589] yes=5,no=6,missing=5,gain=nan,cover=nan\n\t\t5:[f7<4.3607] yes=9,no=10,missing=9,gain=nan,cover=nan\n\t\t\t9:leaf=0.431017,cover=nan\n\t\t\t10:leaf=-1.15223,cover=nan\n\t\t6:leaf=-10.871,cover=nan\n',
'0:[f0<15.718] yes=1,no=2,missing=1,gain=nan,cover=nan\n\t1:[f7<1.3034] yes=3,no=4,missing=3,gain=nan,cover=nan\n\t\t3:[f5<5.257] yes=7,no=8,missing=7,gain=nan,cover=nan\n\t\t\t7:leaf=0.50845,cover=nan\n\t\t\t8:leaf=7.17457,cover=nan\n\t\t4:[f10<17.7] yes=9,no=10,missing=9,gain=nan,cover=nan\n\t\t\t9:leaf=1.07691,cover=nan\n\t\t\t10:leaf=-0.348035,cover=nan\n\t2:[f4<0.6695] yes=5,no=6,missing=5,gain=nan,cover=nan\n\t\t5:[f0<39.8958] yes=11,no=12,missing=11,gain=nan,cover=nan\n\t\t\t11:leaf=-0.248847,cover=nan\n\t\t\t12:leaf=1.15324,cover=nan\n\t\t6:[f4<0.675] yes=13,no=14,missing=13,gain=nan,cover=nan\n\t\t\t13:leaf=-2.07168,cover=nan\n\t\t\t14:leaf=-4.41412,cover=nan\n']

@Denisevi4 (Author) commented Mar 27, 2018

However, predict with the pred_leaf=True option (prediction of leaf node ids) works for both the original and the exported model and produces the same output. So for the purpose I want export_as_xgboost for, it already works!

The zeros are probably because XGBoost needs all those cover and gain values, though I wouldn't know why; it doesn't make sense to me. Those fields are not stored in treelite.

@Denisevi4 (Author) commented Mar 27, 2018

Also, I tried using both Python 2 and Python 3. For compatibility I had to modify the code a little, because subprocess in Python 2 doesn't have DEVNULL.

In python/treelite/contrib/gcc.py:

```python
try:
    from subprocess import DEVNULL
    compat_subprocess_DEVNULL = DEVNULL
except ImportError:  # Python 2: subprocess has no DEVNULL
    compat_subprocess_DEVNULL = None

def _openmp_supported(toolchain):
    with TemporaryDirectory() as temp_dir:
        sfile = os.path.join(temp_dir, 'test.c')
        output = os.path.join(temp_dir, 'test')
        with open(sfile, 'w') as f:
            f.write('int main() { return 0; }\n')
        retcode = subprocess.call('{} -o {} {} -fopenmp'
                                  .format(toolchain, output, sfile),
                                  shell=True,
                                  stdin=compat_subprocess_DEVNULL,
                                  stdout=compat_subprocess_DEVNULL,
                                  stderr=compat_subprocess_DEVNULL)
```

And the same thing in python/treelite/contrib/util.py:

```python
try:
    from subprocess import DEVNULL
    compat_subprocess_get_DEVNULL = DEVNULL
except ImportError:  # Python 2: subprocess has no DEVNULL
    compat_subprocess_get_DEVNULL = None

def _is_windows():
    return _platform == 'win32'

def _toolchain_exist_check(toolchain):
    if toolchain != 'msvc':
        retcode = subprocess.call('{} --version'.format(toolchain),
                                  shell=True,
                                  stdin=compat_subprocess_get_DEVNULL,
                                  stdout=compat_subprocess_get_DEVNULL,
                                  stderr=compat_subprocess_get_DEVNULL)
        if retcode != 0:
            raise ValueError('Toolchain {} not found. '.format(toolchain) +
                             'Ensure that it is installed and that it is a variant '
                             'of GCC or Clang.')
```

hcho3 added a commit that referenced this issue Mar 30, 2018:

> Previously, the tree_info vector was set to -1, which caused the issue #14 (comment)
@hcho3 (Collaborator) commented Mar 30, 2018

@Denisevi4

> Hm, the exported model predicts zeros.

I've pushed a small commit that fixes this problem. Thanks!

> subprocess in Python 2 doesn't have DEVNULL

This was an oversight on my part. Let me write a fix shortly.

hcho3 added a commit that referenced this issue Mar 30, 2018:

> As reported in #14 (comment), Python 2.7 does not have subprocess.DEVNULL. Use os.devnull instead.

hcho3 added a commit that referenced this issue Mar 30, 2018:

> As reported in #14 (comment), Python 2.7 does not have subprocess.DEVNULL. Use os.devnull instead. Release postfix wheels to remedy this problem.
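For reference, the portable pattern the fix points to is opening os.devnull explicitly, since Python 2.7 lacks subprocess.DEVNULL; a minimal sketch:

```python
import os
import subprocess

# Python 2.7 has no subprocess.DEVNULL; open os.devnull by hand instead
with open(os.devnull, 'wb') as devnull:
    retcode = subprocess.call('gcc --version', shell=True,
                              stdin=devnull, stdout=devnull, stderr=devnull)
```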
@hcho3 (Collaborator) commented Apr 3, 2018

@Denisevi4 The fixed package (0.31.post2) is now available on PyPI. Let me know if there's any other problem.

@Denisevi4 (Author) commented Apr 5, 2018

It works now! The converted XGBoost model (xgboost model -> treelite model -> xgboost model) now makes the same predictions as the original model!

I don't know what changed, because the dump is identical to what I had before, when it was predicting zeros.

Thanks a lot! This is going to be very useful. Do you want to merge it to master?

@Denisevi4 (Author) commented Apr 6, 2018

One more question/suggestion: is it necessary to save the XGBoost model to a file in model.export_as_xgboost? Could you instead return the XGBoost model object? Then I could save it to a file if I wanted to.

In the pruning procedure I'm planning, all the XGBoost models I construct are temporary; they don't need to be saved. All I need a model for is to predict the leaf ids, so saving to disk and then immediately reading the models back from files is an unnecessary step.
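In the meantime, one hypothetical workaround is to round-trip through a temporary file and hand back the Booster (the helper name below is made up):

```python
import os
import tempfile
import xgboost as xgb

def to_booster(tl_model, name_obj):
    # Export the treelite model to a temp file, load it back as an
    # XGBoost Booster, then delete the file.
    fd, path = tempfile.mkstemp(suffix='.model')
    os.close(fd)
    try:
        tl_model.export_as_xgboost(path, name_obj=name_obj)
        return xgb.Booster(model_file=path)
    finally:
        os.remove(path)
```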

@hcho3 (Collaborator) commented Apr 19, 2018

@Denisevi4 I've gone ahead and merged the feature into master. (Keep in mind that this is an experimental feature, so we won't provide any guarantee about its stability in the future.)

If you run into a problem, feel free to open another issue.

As for returning XGBoost handles, let me get back to that later. Saving to disk was easier, so that's what I ended up doing.
