Sparse feature support for xgboost models #248

akshaykumar90 · 2019-10-18T00:30:16Z

We still use a dense feature vector for model evaluation, but reserve Float.MAX_VALUE to represent "missing". We then use this value to take the proper node path during model evaluation.

missingNodeId was already present in the xgboost json parser but was being ignored. Modified the parser to actually make use of it and send it down to NaiveAdditiveDecisionTree.Split constructor.

The missing node Id has to be one of right or left nodes.

Modified the simple_tree format to allow specifying what node to take on missing features. Specified as a boolean—take left on true.

nomoa · 2019-10-18T07:16:56Z

Adding a new condition to check the split might have a perf impact.
Here you decided to use MAX_VALUE but this is not how missing values are encoded in the rest of the plugin, I suspect this is because you have custom query/feature that follow this convention?
We did not find a consensus yet on how to treat/encode missing values in the plugin please see #135.

I think I would prefer a separate Ranker implementation instead of making the assumption that MAX_VALUE will work everywhere.

akshaykumar90 · 2019-10-18T20:59:39Z

@nomoa Thanks for the feedback and pointing out the existing open issue. I would look into the other approach you suggest.

aprudhomme · 2019-10-21T20:08:55Z

@nomoa Yes, we are wrapping some existing features/models and require sparse features evaluation. This was the convention we chose.
It would be best if there was general sparse feature support, but we were trying to do this as a nearer term fix.

Would it be more acceptable to add the 'missing' value as an XGBoost model parameter, like in #247 ?
Providing no value could be interpreted as opting out of missing support, and could be used to avoid the extra condition check during evaluation.

nomoa · 2019-10-22T08:57:35Z

Would it be more acceptable to add the 'missing' value as an XGBoost model parameter, like in #247 ?
Providing no value could be interpreted as opting out of missing support, and could be used to avoid the extra condition check during evaluation.

Yes for me it's totally acceptable to have a param in the model definition that instructs some specific behavior of the ranker. Having a way to tell the XGBoost parser what is the missing value it has to check makes sense to me.
In the actual prediction code (NaiveAdditiveDecisionTree) I would try to use this new param as early as possible so that the is this a missing value? checks are avoided when not needed.

akshaykumar90 · 2019-10-29T13:12:36Z

Closing this pull request. Will open a new one with the proposed changes.

Sparse feature support for xgboost models

50d4474

akshaykumar90 closed this Oct 29, 2019

akshaykumar90 mentioned this pull request Oct 31, 2019

Add new xgboost parameter to represent missing features in feature vector #252

Closed

akshaykumar90 deleted the akshay_support_sparse_feature_set branch December 5, 2019 22:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sparse feature support for xgboost models #248

Sparse feature support for xgboost models #248

akshaykumar90 commented Oct 18, 2019

nomoa commented Oct 18, 2019

akshaykumar90 commented Oct 18, 2019

aprudhomme commented Oct 21, 2019

nomoa commented Oct 22, 2019

akshaykumar90 commented Oct 29, 2019

Sparse feature support for xgboost models #248

Sparse feature support for xgboost models #248

Conversation

akshaykumar90 commented Oct 18, 2019

nomoa commented Oct 18, 2019

akshaykumar90 commented Oct 18, 2019

aprudhomme commented Oct 21, 2019

nomoa commented Oct 22, 2019

akshaykumar90 commented Oct 29, 2019