This repository contains the code accompanying the master's thesis ‘Calibration in mlr3 - Implementation and Impact on Model Interpretability’. All experimental setups, registries, evaluation scripts and bmr objects are in the Experiments folder. All other required functions and R6 classes, e.g. those needed for calibration, for the experiments or for visualising the results, are in the R folder. All graphics that were created in R and are embedded in the thesis are in the figures folder. The Sandbox folder contains additional code that was either used to test the implementations or consists of older versions of them. It therefore does not contain any scripts needed to reproduce the results, but is kept for transparency.
The implementation of the mlr3 calibration extension and instructions on how to use it can be found in the following repository.
The thesis investigates the following four hypotheses:
- Cross-validated calibration leads to more robust calibrated predictions than holdout calibration.
- Parametric calibration techniques are more effective when applied to smaller data sets, whereas non-parametric approaches tend to yield better results when applied to larger data sets.
- There is no significant difference in model performance whether a pre-tuned learner is calibrated or the entire pipeline is tuned.
- Calibration reduces the deviation of a feature's Partial Dependence Plot from the ground truth of that feature.
Three experiments were designed to answer these research questions. The first two research questions were addressed by a benchmark experiment testing three calibration approaches, six resampling strategies and eight learners on ten different tasks. For the third hypothesis, an experiment was set up comparing two tuning approaches, one tuning the learner before calibration and one tuning around the calibrated learner, for the three calibration approaches and five learners on seven tasks. The last hypothesis, regarding the impact on interpretability, was tested in an experiment with synthetic data in which the influence of the features on the target is known, so that the deviation between partial dependence plots (PDP) and the ground truth can be calculated.
Only a summary of the results regarding the Brier score is presented here. The detailed presentation, including all measures and subcategories, and the discussion of the results can be found in the thesis.
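For readers unfamiliar with the measure, the Brier score for binary classification is the mean squared difference between the predicted probability of the positive class and the observed 0/1 outcome, so lower values are better. The following is a minimal Python sketch for illustration only; the function name `brier_score` is hypothetical and is not part of the thesis code, which is written in R.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and 0/1 labels.

    probs  -- predicted probabilities of the positive class
    labels -- observed outcomes, encoded as 0 or 1
    Hypothetical helper for illustration; not taken from the repository.
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("at least one prediction is required")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)
```

A perfectly confident and correct model scores 0, while a model that always predicts 0.5 scores 0.25 regardless of the labels.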
This Critical Difference (CD) plot shows the average ranks of all resampling strategies with regard to the Brier score. The cross-validation (CV) and union resampling strategies lead to significantly better results than the holdout strategies. In addition, the five-fold CV strategy shows significantly better results than the union and three-fold CV strategies. There is no significant difference among the holdout strategies.
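The average ranks behind a CD plot can be sketched as follows: on each task the competing methods are ranked by their score (rank 1 for the lowest Brier score, ties sharing the mean rank), and the ranks are then averaged over tasks. Two methods are usually considered significantly different if their average ranks differ by more than the critical difference. The Python sketch below assumes the common formula CD = q_alpha * sqrt(k(k+1) / (6N)) for k methods and N tasks; the function names are hypothetical, the critical value `q_alpha` has to be looked up for the chosen post-hoc test, and whether the thesis uses exactly this test is an assumption.

```python
from math import sqrt


def rank_row(scores):
    """Rank scores ascending (1 = lowest); tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def average_ranks(score_table):
    """Average rank per method; score_table has one row of scores per task."""
    n_tasks = len(score_table)
    totals = [0.0] * len(score_table[0])
    for row in score_table:
        for method, rank in enumerate(rank_row(row)):
            totals[method] += rank
    return [total / n_tasks for total in totals]


def critical_difference(q_alpha, n_methods, n_tasks):
    """CD = q_alpha * sqrt(k (k + 1) / (6 N)); q_alpha comes from a table."""
    return q_alpha * sqrt(n_methods * (n_methods + 1) / (6.0 * n_tasks))
```

Because lower Brier scores receive lower ranks, a lower average rank indicates a better method in the plots described here.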
This CD plot shows the average ranks of the three calibration approaches with regard to the Brier score. All three approaches achieve significantly lower average ranks than the uncalibrated models. In addition, isotonic calibration achieves significantly lower ranks than Platt scaling. Beta calibration lies between the other two approaches, but shows no significant difference to either of them.
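To illustrate what a non-parametric calibrator does, isotonic calibration fits a non-decreasing step function from model scores to observed outcomes. It is commonly computed with the pool-adjacent-violators algorithm, sketched below in Python. This is an illustration under that assumption only: the function name `isotonic_fit` is hypothetical, and the sketch is not the implementation used in the mlr3 extension.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators fit of labels on scores.

    Returns the scores in ascending order together with fitted values that
    are non-decreasing in the score. Hypothetical helper for illustration.
    """
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block stores [sum of labels, number of points]
    for _, label in pairs:
        blocks.append([float(label), 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [score for score, _ in pairs], fitted
```

Each merged block predicts the observed positive rate of the points it contains, which is why the method needs enough data per block to give stable estimates.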
This CD plot shows the average ranks of the different tuning approaches with regard to the Brier score. This shows that tuning before calibrating not only saves time, but also achieves better performance.
This CD plot shows the average ranks of the different calibration approaches with regard to the mean squared error (MSE) between the PDP and the ground truth of a feature. Platt scaling achieves a significantly lower MSE than the uncalibrated model, while isotonic calibration significantly increases it. Beta calibration does not differ significantly from the uncalibrated model, as the CD is not exceeded.
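The comparison between a PDP and a known feature effect can be sketched as follows: for each grid value of the feature, the model prediction is averaged over all observations with that feature set to the grid value, and the resulting curve is compared with the ground truth curve via the MSE. The Python sketch below is illustrative only; `partial_dependence` and `pdp_mse` are hypothetical names, and how the thesis constructs the ground truth curve and the grid is not specified here, so both are passed in as plain lists.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over rows with one feature fixed to each grid value.

    predict -- function mapping a list of feature values to a prediction
    rows    -- observations, each a list of feature values
    feature -- index of the feature of interest
    grid    -- values at which the partial dependence is evaluated
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)       # copy so the input rows stay unchanged
            modified[feature] = value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def pdp_mse(pdp_curve, truth_curve):
    """Mean squared error between a PDP and a ground truth curve on one grid."""
    if len(pdp_curve) != len(truth_curve):
        raise ValueError("both curves must be evaluated on the same grid")
    return sum((p - t) ** 2 for p, t in zip(pdp_curve, truth_curve)) / len(pdp_curve)
```

A lower value means the PDP follows the known effect of the feature more closely.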