\chapter{Summary and Outlook}
Within the wide range of algorithms for lightweight integer compression, no single best algorithm exists \cite{Damme2017, Damme2019}. Since they all behave differently depending on hardware and data properties, multiple selection strategies have been proposed \cite{Damme2019, Woltmann2021}. All of them have in common that they can determine the best-fitting algorithm but do not consider possible algorithm parameterizations. Hence, we evaluated a \emph{Learned Selection Strategy for Lightweight Integer Compression Algorithm Parameterizations}, which extends the strategy of Woltmann et~al. \cite{Woltmann2021}.
Since our strategy is an ML-based approach, it was first necessary to generate representative data for training and testing with the La-Ola strategy. From the bwhists generated by the La-Ola generator, we derived 13 features representing an integer sequence. Subsequently, we applied the compression algorithms \emph{StaticBP} and \emph{DynamicBP} and labeled the data set with the compression runtime and the compression rate, which represent the behavior of the algorithms. For this, we used the COLLATE implementation of Hildebrandt et~al. \cite{Hildebrandt2017}. With the labeled data sets, we trained our ML models for every combination of algorithm and target value and tuned their hyperparameters. We used gradient boosting (GB) regression, which requires relatively little time for training and forward passes.
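A minimal sketch of this per-combination training setup, assuming a hypothetical loader \texttt{load\_labeled\_data} that returns the feature matrix and label vector for one combination, could look as follows:

\begin{verbatim}
# Sketch: one GB regression model per (algorithm, target) pair.
# load_labeled_data is a hypothetical loader; the real pipeline
# derives the 13 features from the bwhists and the labels from
# the COLLATE measurements.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300], "max_depth": [3, 5]}

models = {}
for algo in ["StaticBP", "DynamicBP"]:
    for target in ["runtime", "rate"]:
        X, y = load_labeled_data(algo, target)
        search = GridSearchCV(GradientBoostingRegressor(),
                              param_grid, cv=5)
        search.fit(X, y)
        models[(algo, target)] = search.best_estimator_
\end{verbatim}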
After the training phase, we evaluated the quality of our approach in comparison to a baseline strategy that always chooses the simplest algorithm and the parameters covering the largest range of data. A test data set containing generated data as well as real-world data from the Public BI benchmark was used.
To increase the transparency of our ML models, we calculated the impact of each feature on the prediction result using the permutation feature importance approach.
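A sketch of this calculation for one of the models trained above, assuming hypothetical held-out data \texttt{X\_test}, \texttt{y\_test} and a list \texttt{feature\_names}, could use the scikit-learn implementation:

\begin{verbatim}
# Sketch: permutation feature importance for one trained model.
# X_test, y_test, and feature_names are hypothetical placeholders.
from sklearn.inspection import permutation_importance

result = permutation_importance(models[("StaticBP", "runtime")],
                                X_test, y_test,
                                n_repeats=10, random_state=0)
for name, mean in zip(feature_names, result.importances_mean):
    print(f"{name}: {mean:.4f}")
\end{verbatim}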
Our evaluation showed that a learned selection strategy based on machine learning is an effective way to predict suitable parameters in addition to the best-fitting compression algorithm for given input data. Regarding the selection results, our approach outperforms the baseline strategy in almost every case. Hence, applying our ML-based approach leads to faster or better compression of integer values, depending on whether the compression runtime or the compression rate is used as the target value. By analyzing the feature importance, we could derive certain behaviors of our ML models. Knowing these behaviors makes it possible to take them into account when applying the strategy to new data.
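The selection itself can be read as predicting the target value for every valid combination of algorithm and parameterization and choosing the best one. A minimal sketch, assuming a hypothetical \texttt{candidates} list of (algorithm, parameters) pairs and a hypothetical \texttt{build\_features} helper that combines the data features with a candidate parameterization into a single-row feature matrix:

\begin{verbatim}
# Sketch: choose the (algorithm, parameterization) with the best
# predicted target value, assuming a lower value is better for
# the chosen target. build_features is hypothetical.
import numpy as np

def select(models, candidates, data_features, target):
    preds = [models[(algo, target)]
             .predict(build_features(data_features, params))[0]
             for algo, params in candidates]
    return candidates[int(np.argmin(preds))]
\end{verbatim}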
We evaluated our strategy using two different integer compression algorithms. Since our approach is an extension of the black-box strategy of Woltmann et~al. \cite{Woltmann2021}, no information about the algorithm behavior is necessary, which makes it possible to add further algorithms without a completely new training phase. With our strategy, it is first necessary to apply the new algorithm to our generated data. This step yields a new labeled data set, which is used to train the new ML models. In contrast to modeling the problem as a classification task, our regression approach only requires a training phase for the ML models that belong to the new algorithm.
The new models can subsequently be used to predict the behavior of the new algorithm and hence extend the set of possible selection results.
Another aspect is the extension of the considered parameters. Our data generation pipeline generates all valid algorithm parameter combinations. To add a new parameter, it is necessary to generate a new data set based on the new possible combinations. A new data set would also entail a new labeling process and hence a new learning phase for every combination of algorithm and target value. A more efficient way to add new parameters would be an interesting aspect of future work. In our evaluation, we only considered the permutation feature importance. Beyond that, the use of Shapley Additive Explanations (SHAP) is a common approach \cite{Lundberg2017}. While the calculation of SHAP values is more time-consuming, they can further increase the transparency of an ML model, as they reveal whether a feature influences the prediction result positively or negatively. Analyzing the feature importance with SHAP could also be part of future research.
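A sketch of such an analysis with the \texttt{shap} library, assuming the tree-based GB models and the hypothetical \texttt{X\_test} from above:

\begin{verbatim}
# Sketch: SHAP values for one trained GB model; the summary plot
# shows the signed per-feature impact on the prediction result.
import shap

explainer = shap.TreeExplainer(models[("StaticBP", "runtime")])
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
\end{verbatim}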