Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse feature support for xgboost models #248

Conversation

akshaykumar90
Copy link

We still use a dense feature vector for model evaluation, but reserve Float.MAX_VALUE to represent "missing". We then use this value to take the proper node path during model evaluation.

missingNodeId was already present in the xgboost json parser but was being ignored. Modified the parser to actually make use of it and send it down to NaiveAdditiveDecisionTree.Split constructor.

The missing node Id has to be one of right or left nodes.

Modified the simple_tree format to allow specifying what node to take on missing features. Specified as a boolean—take left on true.

@nomoa
Copy link
Collaborator

nomoa commented Oct 18, 2019

Adding a new condition to check the split might have a perf impact.
Here you decided to use MAX_VALUE but this is not how missing values are encoded in the rest of the plugin, I suspect this is because you have custom query/feature that follow this convention?
We did not find a consensus yet on how to treat/encode missing values in the plugin please see #135.

I think I would prefer a separate Ranker implementation instead of making the assumption that MAX_VALUE will work everywhere.

@akshaykumar90
Copy link
Author

@nomoa Thanks for the feedback and pointing out the existing open issue. I would look into the other approach you suggest.

@aprudhomme
Copy link
Contributor

@nomoa Yes, we are wrapping some existing features/models and require sparse features evaluation. This was the convention we chose.
It would be best if there was general sparse feature support, but we were trying to do this as a nearer term fix.

Would it be more acceptable to add the 'missing' value as an XGBoost model parameter, like in #247 ?
Providing no value could be interpreted as opting out of missing support, and could be used to avoid the extra condition check during evaluation.

@nomoa
Copy link
Collaborator

nomoa commented Oct 22, 2019

Would it be more acceptable to add the 'missing' value as an XGBoost model parameter, like in #247 ?
Providing no value could be interpreted as opting out of missing support, and could be used to avoid the extra condition check during evaluation.

Yes for me it's totally acceptable to have a param in the model definition that instructs some specific behavior of the ranker. Having a way to tell the XGBoost parser what is the missing value it has to check makes sense to me.
In the actual prediction code (NaiveAdditiveDecisionTree) I would try to use this new param as early as possible so that the is this a missing value? checks are avoided when not needed.

@akshaykumar90
Copy link
Author

Closing this pull request. Will open a new one with the proposed changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants