Add shap command to internal CLI for debugging #1197
Merged
This PR adds a `shap` command to the internal CLI to help explain a specific (per-sql) XGBoost prediction.

Usage:
Example:
The output of the command looks like:
Where:

- […] (`shap_value`), similar to a SHAP waterfall plot.
- `model_rank` shows the feature importance rank on the training set.
- `model_shap_value` shows the feature shap_value on the training set.
- `train_[mean|std|min|max]` show the mean, standard deviation, min and max values of the feature in the training set.
- `train_[25%|50%|75%]` show the feature value at the respective percentile in the training set.
- `feature_value` shows the value of the feature used in prediction (for the indexed row/sqlID).
- `out_of_range` indicates if the `feature_value` used in prediction was outside of the range of values seen in the training set.
- `Shap base value` is the model's average prediction across the entire training set.
- `Shap values sum` is the sum of the `shap_value` column for this indexed instance.
- `Shap prediction` is the sum of `Shap base value` and `Shap values sum`, representing the model's predicted value.
- `exp(prediction)` is the exponential of `Shap prediction`, which represents the predicted speedup (since the XGBoost model currently predicts `log(speedup)`).
- […] (`y_pred` in `per_sql.csv`) is applied to the "supported" durations and combined with the "unsupported" durations to produce a final per-sql speedup (`speedup_pred` in `per_sql.csv`).

Changes
- […] `features.csv` to save the feature values used for prediction.
- […] `shap_values.csv` to `feature_importance.csv` (which is more descriptive of its purpose).
- […] `shap_values.csv` to save all of the shap values per feature per instance/sqlID during prediction.
- […] `model.metrics` file (for each model) during training to store the feature shap values and distribution metrics of the training set.
- […] `model.json.cfg` files to `model.cfg` to avoid the double-suffix.
- […] `compute_feature_importance` and `compute_shapley_values` functions.
- […] `--qual_output` argument.
- […] `shap` command to internal CLI, which joins the prediction shap_values with training shap_values and distribution metrics.

Test
The following commands have been tested:
External Usage:
Internal Usage:
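
For reference, the arithmetic behind the `Shap base value`, `Shap values sum`, `Shap prediction`, `exp(prediction)` and `out_of_range` entries described under "Where:" can be sketched in Python as below. This is only an illustration, not the tool's implementation or output: the feature names, the numbers, and the `out_of_range` helper are hypothetical, and the range check assumes "outside the range" means below the training min or above the training max.

```python
import math

# Hypothetical per-feature SHAP values for one indexed row/sqlID (made-up numbers).
shap_values = {"feature_a": 0.40, "feature_b": -0.15, "feature_c": 0.05}

# Hypothetical base value: the model's average prediction on the training set.
shap_base_value = 0.90

# "Shap values sum": sum of the shap_value column for this instance.
shap_values_sum = sum(shap_values.values())

# "Shap prediction": base value plus the per-feature contributions.
# The model is described as predicting log(speedup), so this is in log space.
shap_prediction = shap_base_value + shap_values_sum

# "exp(prediction)": convert the log-space prediction back to a speedup.
predicted_speedup = math.exp(shap_prediction)


def out_of_range(feature_value, train_min, train_max):
    """Assumed check: True if the value lies outside the training min/max."""
    return feature_value < train_min or feature_value > train_max


print(f"Shap values sum: {shap_values_sum:.4f}")
print(f"Shap prediction: {shap_prediction:.4f}")
print(f"exp(prediction): {predicted_speedup:.4f}")
```

A feature flagged by a check like `out_of_range` is a hint that the model is extrapolating for that row, so its SHAP contribution deserves extra scrutiny.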