-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new xgboost parameter to represent missing features in feature vector #252
Add new xgboost parameter to represent missing features in feature vector #252
Conversation
Thank you! |
@nomoa I am taking a look at the branch. What do you think it would be best to optimize for when implementing this feature? Should we be worried about adding a configured 'missing' value to each split, or would it be better to change the interface to avoid data duplication? |
I'd probably look into trying to move the Split:eval method to its container (NaiveAdditiveDecisionTree) For the missing branch I'd perhaps look if it's possible to make the assumption that the missing branch is always the same, permitting to avoid an extra ref "missing" to the split. |
I am back from my vacation and will be picking up this branch back.
I think we should restrict what can be used as a missing value instead of allowing users to choose any value as the missing value. I think there could only be a handful of values that make sense as a placeholder for missing values. Creating a new class for each of those missing values would obviously not be a good design. I will look into your suggestions to DRY up the code. |
In the latest revision, I have done some refactoring to DRY up the code as was suggested in the comments above. I have also allowed users to specify arbitrary values as the missing value. In addition, there are convenience options (just one for now) which allow specific values (like Float max value) without writing down the exact value in the JSON. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a single param to pass the missing value, having two leads to ambiguous behavior. The fact that the value for use_float_max_for_missing
is ignored and that the order of the params matters is problematic in my opinion.
Thanks!
@@ -138,7 +149,11 @@ void setNormalizer(String objectiveName) { | |||
} | |||
|
|||
void setFloatMaxForMissing(boolean useFloatMaxForMissing) { | |||
this.useFloatMaxForMissing = useFloatMaxForMissing; | |||
this.missingValue = Float.MAX_VALUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
useFloatMaxForMissing is now ignored
// Only one of the following parameter should be specified | ||
// otherwise the behavior is undefined (depends on the order of | ||
// names in JSON). | ||
PARSER.declareFloat(XGBoostDefinition::setMissingValue, new ParseField("missing_value")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of adding a new param it might be more coherent to be able to parse a float and string
this way one could pass:
"missing_value": "max"
or
"missing_value": 0.0
everything else would throw an exception.
See https://github.com/elastic/elasticsearch/blob/237650e9c054149fd08213b38a81a3666c1868e5/server/src/main/java/org/elasticsearch/search/suggest/completion/FuzzyOptions.java#L66 to declare a field that accepts any kind of value
and https://github.com/elastic/elasticsearch/blob/f92ebb2ff909d0083ae988e04ecd398d979e9210/server/src/main/java/org/elasticsearch/common/unit/Fuzziness.java#L160 for how to parse the value
Hey @akshaykumar90 - just thought I'd check in on this PR to see if it's still in progress? |
Sorry for not responding on this PR earlier. At work, we had decided to move ahead with a different approach which did not require changes proposed in this PR. I would not be pushing any changes to this branch. |
This is a follow up from #248.
Added new parameter
use_float_max_for_missing
to indicate missing features can be represented as float max value in the feature vector.cc @aprudhomme