Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new xgboost parameter to represent missing features in feature vector #252

Conversation

akshaykumar90
Copy link

This is a follow up from #248.

Added new parameter use_float_max_for_missing to indicate missing features can be represented as float max value in the feature vector.

cc @aprudhomme

@nomoa
Copy link
Collaborator

nomoa commented Nov 5, 2019

Thank you!
Just a suggestion:
why not let users decide what is the missing value instead of hard coding MAX_VALUE?
You could add a "missing_value" param and have it stored in the SplitWithMissing class, we could even interpret the string "max" as Float.MAX_VALUE so that users don't have to pass the actual java MAX_VALUE (3.4028235E38).

@aprudhomme
Copy link
Contributor

@nomoa I am taking a look at the branch. What do you think it would be best to optimize for when implementing this feature? Should we be worried about adding a configured 'missing' value to each split, or would it be better to change the interface to avoid data duplication?

@nomoa
Copy link
Collaborator

nomoa commented Nov 7, 2019

I'd probably look into trying to move the Split:eval method to its container (NaiveAdditiveDecisionTree)
and add a new evalWithMissing method to it that would be triggered if the missingValue is set on NaiveAdditiveDecisionTree, that would probably help avoid to duplicate it on every split.

For the missing branch I'd perhaps look if it's possible to make the assumption that the missing branch is always the same, permitting to avoid an extra ref "missing" to the split.

@akshaykumar90
Copy link
Author

I am back from my vacation and will be picking up this branch back.

why not let users decide what is the missing value

I think we should restrict what can be used as a missing value instead of allowing users to choose any value as the missing value. I think there could only be a handful of values that make sense as a placeholder for missing values.

Creating a new class for each of those missing values would obviously not be a good design. I will look into your suggestions to DRY up the code.

@akshaykumar90
Copy link
Author

In the latest revision, I have done some refactoring to DRY up the code as was suggested in the comments above.

I have also allowed users to specify arbitrary values as the missing value. In addition, there are convenience options (just one for now) which allow specific values (like Float max value) without writing down the exact value in the JSON.

Copy link
Collaborator

@nomoa nomoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a single param to pass the missing value, having two leads to ambiguous behavior. The fact that the value for use_float_max_for_missing is ignored and that the order of the params matters is problematic in my opinion.

Thanks!

@@ -138,7 +149,11 @@ void setNormalizer(String objectiveName) {
}

void setFloatMaxForMissing(boolean useFloatMaxForMissing) {
this.useFloatMaxForMissing = useFloatMaxForMissing;
this.missingValue = Float.MAX_VALUE;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

useFloatMaxForMissing is now ignored

// Only one of the following parameter should be specified
// otherwise the behavior is undefined (depends on the order of
// names in JSON).
PARSER.declareFloat(XGBoostDefinition::setMissingValue, new ParseField("missing_value"));
Copy link
Collaborator

@nomoa nomoa Dec 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of adding a new param it might be more coherent to be able to parse a float and string
this way one could pass:
"missing_value": "max"
or
"missing_value": 0.0

everything else would throw an exception.

See https://github.com/elastic/elasticsearch/blob/237650e9c054149fd08213b38a81a3666c1868e5/server/src/main/java/org/elasticsearch/search/suggest/completion/FuzzyOptions.java#L66 to declare a field that accepts any kind of value
and https://github.com/elastic/elasticsearch/blob/f92ebb2ff909d0083ae988e04ecd398d979e9210/server/src/main/java/org/elasticsearch/common/unit/Fuzziness.java#L160 for how to parse the value

@softwaredoug
Copy link
Collaborator

Hey @akshaykumar90 - just thought I'd check in on this PR to see if it's still in progress?

@akshaykumar90
Copy link
Author

Sorry for not responding on this PR earlier. At work, we had decided to move ahead with a different approach which did not require changes proposed in this PR. I would not be pushing any changes to this branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants