
[ML] Add information about samples per node to the tree #991

Merged: valeriy42 merged 25 commits into elastic:master from valeriy42:ml-cpp-850 on Feb 17, 2020

Conversation

valeriy42 (Contributor) commented Feb 6, 2020

This PR extends the definition of the tree node by adding information about the number of training samples that passed through the node (`numberSamples` in code, `number_samples` in the persisted form). The JSON schema for the inference model is adjusted accordingly.

Since this changes the persist/restore schema of the tree implementation, I bumped the version and removed 7.5 and 7.6 from the list of supported versions. My reasoning: restoring from the old schema and setting the number of samples to 0 would break feature importance at inference time.

I also adjusted the feature importance computation to use the pre-computed number of samples instead of recomputing it on the fly.
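
For illustration, a minimal sketch of what the extended node could look like. The real node type lives in include/maths/CBoostedTree.h; the member and accessor names below are assumptions, not the actual API:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative sketch only: field and method names are assumptions.
class CTreeNodeSketch {
public:
    //! Number of training samples that passed through this node
    //! (persisted as "number_samples" in the JSON schema).
    std::size_t numberSamples() const { return m_NumberSamples; }
    void numberSamples(std::size_t n) { m_NumberSamples = n; }

private:
    std::int32_t m_SplitFeature{-1};
    double m_SplitValue{0.0};
    std::size_t m_LeftChild{0};
    std::size_t m_RightChild{0};
    std::size_t m_NumberSamples{0}; // the new field this PR adds
};
```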

Closes #850

@valeriy42 valeriy42 changed the title [ML] Add number_samples field to the node [ML] Add information about samples per node to the tree Feb 6, 2020
@valeriy42 valeriy42 requested a review from tveasey February 6, 2020 16:30
@valeriy42 valeriy42 added >enhancement and removed WIP labels Feb 6, 2020
valeriy42 (Contributor, Author) commented:

retest

tveasey (Contributor) left a comment:

I've done a pass through and left some minor style comments. However, I have two more significant comments:

1. I'm worried about the cost of computing the number of rows from the row mask where you do this (this was the motivation for introducing `leftChildHasFewerRows` in the first place).
2. I wonder if we should use the test rows as well when computing the sample counts: this is a change in behaviour, and it seems better to use all available data for these.

To me, these point to keeping the step that computes the counts and writes them to the tree separate from the main training loop (since they aren't needed there). It could just happen as a single standalone pass at the end of training, as in the sketch below.
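
A minimal sketch of what such a standalone pass could look like, assuming simplified node and row types (the names below are illustrative, not the ml-cpp API):

```cpp
#include <cstddef>
#include <vector>

// Illustrative node and row types; these are assumptions for the sketch.
struct SNode {
    int splitFeature{-1};
    double splitValue{0.0};
    std::size_t leftChild{0};
    std::size_t rightChild{0};
    std::size_t numberSamples{0};
    bool isLeaf() const { return splitFeature < 0; }
};

using TTree = std::vector<SNode>;
using TRow = std::vector<double>;

// After training has finished, route every row (training and test) down
// the tree once, incrementing the count of each node the row visits.
void computeNumberSamples(TTree& tree, const std::vector<TRow>& rows) {
    for (const auto& row : rows) {
        std::size_t node = 0; // start at the root
        for (;;) {
            ++tree[node].numberSamples;
            if (tree[node].isLeaf()) {
                break;
            }
            node = row[tree[node].splitFeature] <= tree[node].splitValue
                       ? tree[node].leftChild
                       : tree[node].rightChild;
        }
    }
}
```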

include/maths/CBoostedTree.h (outdated, resolved)
include/maths/CBoostedTree.h (outdated, resolved)
include/maths/CBoostedTree.h (outdated, resolved)
include/maths/CBoostedTree.h (outdated, resolved)
include/maths/CBoostedTree.h (outdated, resolved)
include/maths/CTreeShapFeatureImportance.h (outdated, resolved)
lib/maths/CBoostedTreeImpl.cc (resolved)
lib/maths/CBoostedTreeLeafNodeStatistics.cc (outdated, resolved)
lib/maths/CTreeShapFeatureImportance.cc (outdated, resolved)
tveasey (Contributor) left a comment:

Good stuff @valeriy42! I've done a second pass through. I think there are some hangover TODOs from the refactoring. Also, it seems like we can exploit having the counts on the node to simplify the SHAP code slightly, along the lines of the sketch below.
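
A hedged sketch of the kind of simplification this enables: the fraction of data routed to a child, which the SHAP recursion needs as a weight, becomes a ratio of two stored counts instead of a recount over the row mask (types below are illustrative):

```cpp
#include <cstddef>

// Illustrative node carrying only the stored sample count.
struct SNodeCounts {
    std::size_t numberSamples{0};
};

// Weight of a child branch in the SHAP recursion: the fraction of the
// parent's training samples that were routed to that child.
double childFraction(const SNodeCounts& parent, const SNodeCounts& child) {
    return static_cast<double>(child.numberSamples) /
           static_cast<double>(parent.numberSamples);
}
```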

include/maths/CTreeShapFeatureImportance.h (outdated, resolved)
include/maths/CTreeShapFeatureImportance.h (outdated, resolved)
lib/maths/CBoostedTreeImpl.cc (resolved)
lib/maths/CBoostedTreeImpl.cc (outdated, resolved)
lib/maths/CBoostedTreeImpl.cc (outdated, resolved)
lib/maths/CBoostedTreeLeafNodeStatistics.cc (outdated, resolved)
lib/maths/CTreeShapFeatureImportance.cc (outdated, resolved)
valeriy42 (Contributor, Author) commented:

Thank you, @tveasey, for the review. I addressed your comments and removed the unnecessary code. Let me know if everything is OK now.

tveasey (Contributor) left a comment:

Thanks for working through the suggestions, looks good!

valeriy42 (Contributor, Author) commented:

retest

@valeriy42 valeriy42 merged commit 309db87 into elastic:master Feb 17, 2020
@valeriy42 valeriy42 deleted the ml-cpp-850 branch February 17, 2020 15:32
valeriy42 added a commit to valeriy42/ml-cpp that referenced this pull request Feb 17, 2020
@tveasey tveasey mentioned this pull request Feb 17, 2020
benwtrent added a commit to elastic/elasticsearch that referenced this pull request Feb 21, 2020
This adds machine learning model feature importance calculations to the inference processor. 

The new flag in the configuration matches the analytics parameter name: `num_top_feature_importance_values`
Example:
```
"inference": {
   "field_mappings": {},
   "model_id": "my_model",
   "inference_config": {
      "regression": {
         "num_top_feature_importance_values": 3
      }
   }
}
```

This will write to the document as follows:
```
"inference" : {
   "feature_importance" : { 
      "FlightTimeMin" : -76.90955548511226,
      "FlightDelayType" : 114.13514762158526,
      "DistanceMiles" : 13.731580450792187
   },
   "predicted_value" : 108.33165831875137,
   "model_id" : "my_model"
}
```
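
As a hedged sketch (not the Elasticsearch implementation), a top-N cut like `num_top_feature_importance_values` can be applied by keeping the N features whose SHAP values have the largest magnitude:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using TFeatureImportance = std::pair<std::string, double>;

// Keep the n features with the largest absolute SHAP value; names and
// signature here are illustrative only.
std::vector<TFeatureImportance>
topFeatureImportance(std::vector<TFeatureImportance> shap, std::size_t n) {
    n = std::min(n, shap.size());
    std::partial_sort(shap.begin(), shap.begin() + n, shap.end(),
                      [](const auto& lhs, const auto& rhs) {
                          return std::fabs(lhs.second) > std::fabs(rhs.second);
                      });
    shap.resize(n);
    return shap;
}
```

With `n = 3`, the example output above is consistent with this cut: the three fields shown are those with the largest absolute importance.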

This is done by calculating the [SHAP values](https://arxiv.org/abs/1802.03888).

It requires that models have populated `number_samples` for each tree node. This is not available for models that were created before 7.7.

Additionally, if the inference config requests feature_importance and not all nodes have been upgraded yet, pipeline creation is not allowed. This is a safeguard for mixed-version environments where only some ingest nodes have been upgraded.

NOTE: the algorithm is a Java port of the one laid out in ml-cpp: https://github.com/elastic/ml-cpp/blob/master/lib/maths/CTreeShapFeatureImportance.cc

usability blocked by: elastic/ml-cpp#991

benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Feb 21, 2020 (…c#52218)
benwtrent added a commit to elastic/elasticsearch that referenced this pull request Feb 21, 2020 (#52666)