[ML] Add information about samples per node to the tree #991
Conversation
retest
I've done a pass through and left some minor style comments. However, I have two more significant comments: 1) I'm worried about the cost of computing the number of rows from the row mask where you do (this was the motivation for introducing `leftChildHasFewerRows` in the first place), 2) I wonder if we should use the test rows as well when computing sample counts: this is a change in behaviour, and it seems better to use all available data for these. To me these point to keeping the step that computes and writes them to the tree separate from the main training loop (since they aren't needed there). It could just happen as a single standalone pass at the end of training.
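The standalone pass suggested here is simple to picture: once training is done, route every row down the finished tree and bump a counter on each node it visits. A minimal sketch, with a hypothetical `Node` layout standing in for the real ml-cpp classes:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal node; the real ml-cpp tree lives in CBoostedTreeImpl.
struct Node {
    int splitFeature{-1};       // -1 marks a leaf
    double splitValue{0.0};
    int leftChild{-1};
    int rightChild{-1};
    std::size_t numberSamples{0};
};

// One standalone pass at the end of training: route each row through the
// tree, incrementing the sample count on every node it passes through.
void computeNumberSamples(std::vector<Node>& tree,
                          const std::vector<std::vector<double>>& rows) {
    for (const auto& row : rows) {
        int node = 0;
        for (;;) {
            ++tree[node].numberSamples;
            if (tree[node].splitFeature < 0) {
                break;  // reached a leaf
            }
            node = row[tree[node].splitFeature] < tree[node].splitValue
                       ? tree[node].leftChild
                       : tree[node].rightChild;
        }
    }
}
```

Because this runs once over the data after training, it avoids recomputing counts from the row mask inside the training loop, and it could just as easily be fed all available rows (train and test) if that behaviour change is wanted.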
Good stuff @valeriy42! I've done a second pass through. I think there are some hangover TODOs from the refactoring. Also, it seems like we can exploit having the counts on the node to simplify SHAP code slightly.
…p-850

Conflicts:
lib/maths/CBoostedTreeImpl.cc
Thank you, @tveasey for the review. I addressed your comments and removed unnecessary code. Let me know if everything is ok now.
Thanks for working through the suggestions, looks good!
retest
This PR extends the definition of the tree node by adding information about the number of training samples that passed through the node (`numberSamples` or `number_samples`). The JSON schema for the inference model is adjusted accordingly. Since this changes the schema for persist/restore of the tree implementation, I bumped the version and removed 7.5 and 7.6 from the list of supported versions. My reasoning: restoring from the old schema and setting the number of samples to 0 would break feature importance at inference time. I also adjusted the feature importance computation to use the pre-computed number of samples instead of recomputing it on the fly.
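For illustration, a serialized tree node in the inference model would now carry the sample count alongside its split. The field names besides `number_samples` are sketched from the surrounding description, not taken from the actual schema:

```json
{
  "tree_node": {
    "node_index": 0,
    "split_feature": 0,
    "threshold": 0.5,
    "left_child": 1,
    "right_child": 2,
    "number_samples": 1000
  }
}
```

A model restored from a 7.5/7.6 document would have no value to put here, and defaulting it to 0 would silently zero out the SHAP weights, which is the rationale for dropping those versions from the supported list.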
This adds machine learning model feature importance calculations to the inference processor. The new flag in the configuration matches the analytics parameter name: `num_top_feature_importance_values`.

Example:

```
"inference": {
  "field_mappings": {},
  "model_id": "my_model",
  "inference_config": {
    "regression": {
      "num_top_feature_importance_values": 3
    }
  }
}
```

This will write to the document as follows:

```
"inference" : {
  "feature_importance" : {
    "FlightTimeMin" : -76.90955548511226,
    "FlightDelayType" : 114.13514762158526,
    "DistanceMiles" : 13.731580450792187
  },
  "predicted_value" : 108.33165831875137,
  "model_id" : "my_model"
}
```

This is done by calculating the [SHAP values](https://arxiv.org/abs/1802.03888). It requires that models have populated `number_samples` for each tree node; this is not available for models that were created before 7.7. Additionally, if the inference config requests feature importance and not all nodes have been upgraded yet, it will not allow the pipeline to be created. This is a safeguard in a mixed-version environment where only some ingest nodes have been upgraded.

NOTE: the algorithm is a Java port of the one laid out in ml-cpp: https://github.com/elastic/ml-cpp/blob/master/lib/maths/CTreeShapFeatureImportance.cc

usability blocked by: elastic/ml-cpp#991
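The role the per-node counts play in SHAP can be shown in isolation. When a split's feature is treated as "missing", the algorithm descends both branches and weights each by the fraction of training samples it received, so the expected prediction of a subtree is a sample-weighted average of its leaves. This is a sketch of that one ingredient, not the full TreeSHAP recursion in CTreeShapFeatureImportance.cc, and the `TreeNode` layout is illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative node layout; names mirror the PR, not the real classes.
struct TreeNode {
    int splitFeature{-1};       // -1 marks a leaf
    int leftChild{-1};
    int rightChild{-1};
    double value{0.0};          // leaf value
    std::size_t numberSamples{0};
};

// Expected prediction of the subtree rooted at i when the split feature is
// "missing": average the children weighted by their share of number_samples.
// This is why every node must have the count populated; a zeroed count from
// an old model would make every weight 0/0.
double expectedValue(const std::vector<TreeNode>& tree, int i) {
    const TreeNode& n = tree[i];
    if (n.splitFeature < 0) {
        return n.value;
    }
    double total = static_cast<double>(n.numberSamples);
    return (tree[n.leftChild].numberSamples * expectedValue(tree, n.leftChild) +
            tree[n.rightChild].numberSamples * expectedValue(tree, n.rightChild)) /
           total;
}
```

For a root that saw 4 samples, with a left leaf of value 1.0 (3 samples) and a right leaf of value 5.0 (1 sample), the expected value is (3*1.0 + 1*5.0)/4 = 2.0.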
Closes #850