\chapter{Preliminaries}
The following chapter reviews related work on integer compression algorithms and selection strategies in order to motivate the problem that the strategy presented in this thesis aims to solve.
\section{Lightweight Integer Compression}
Lightweight integer compression algorithms are a subject of current research, which focuses mainly on two objectives. On the one hand, a lot of research aims to optimize algorithm performance with approaches such as FPGA acceleration \cite{Mohsen2020, JahanLisa2019} or novel query processing models \cite{Damme2020}. On the other hand, research focuses on analyzing the behavior of lightweight integer compression algorithms and on developing selection strategies \cite{Abadi2006, Damme2019, Woltmann2021}.
Damme et al. \cite{Damme2017, Damme2019} have shown that among the large variety of lightweight integer compression algorithms, there is no single-best one: they all behave differently depending on the input data and hardware properties.\\
\emph{BitPacking (BP)} is one of the most frequently used algorithms in this field. Its adaptability to different bit widths is one reason for its good compression rates \cite{Damme2019}.\\
The algorithm belongs to the group of null-suppression algorithms, whose basic idea is the omission of leading zeros in the binary representation of integers. The input sequence is partitioned into blocks, and within each block every value is stored with the same reduced bit width, which is determined by the largest value in the block. Consequently, the block sizes differ across the sequence.
\emph{BP} can be divided into \emph{StaticBP}, where the bit width of every block is predefined, and \emph{DynamicBP}, where it is determined dynamically per block.\\
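To illustrate the principle, the following minimal Python sketch packs one block of non-negative integers with the bit width of its largest value, as \emph{DynamicBP} would; the function names and the example block are illustrative assumptions, not taken from the cited works.
\begin{verbatim}
def pack_block(values):
    # Bit width required by the largest value in the block.
    width = max(v.bit_length() for v in values) or 1
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)   # append v in 'width' bits
    return width, packed

def unpack_block(width, packed, count):
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]

block = [3, 7, 1, 5]        # largest value 7 -> 3 bits per value
width, packed = pack_block(block)
assert unpack_block(width, packed, len(block)) == block
\end{verbatim}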
Being able to analyse, compare, and select lightweight integer compression algorithms in a specific context requires a non-technical abstraction. Hildebrandt et al. \cite{Hildebrandt2017} developed the COLLATE metamodel and the description language COALA for lightweight integer compression algorithms and showed that every such algorithm can be modeled with them. Their work resulted in a software framework that allows the implementation of algorithms in the COALA language, their application to input data, and the measurement of properties like compression rate or (de)compression runtime. Furthermore, the algorithms can be defined with data-dependent and data-independent parameters which change their behavior. The framework offers the technical functionality to observe the performance of an algorithm; for the selection of the best-fitting one, however, all possible candidates have to be tested manually with different parameter combinations and compared according to compression rate or (de)compression runtime.
\section{Algorithm Selection}
As there is no single-best algorithm, the one that fits best for the given input data and hardware properties has to be determined, which requires an adequate selection strategy.
In the literature, there are three major concepts. The rule-based strategy by Abadi et al. \cite{Abadi2006} is based on a decision tree for compression schemes: certain data properties lead to specific compression algorithms. As Woltmann et al. \cite{Woltmann2021} have analysed, an advantage of this approach is the small tree size. The effort for traversing the tree is low, but its static nature makes it difficult to extend to the current landscape of algorithms.\\
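The following hypothetical miniature (in Python) illustrates the shape of such a rule-based selection; the thresholds and the choice of schemes are invented for illustration and do not reproduce the actual decision tree of Abadi et al.
\begin{verbatim}
def select_rule_based(avg_run_length, num_distinct, is_sorted):
    # Long runs favour run-length encoding.
    if avg_run_length > 4:
        return "RLE"
    # Few distinct values favour dictionary compression.
    if num_distinct < 50:
        return "DICT"
    return "BP" if is_sorted else "UNCOMPRESSED"
\end{verbatim}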
Damme et al. \cite{Damme2019} developed a cost-based approach which relies on calibration measurements for each algorithm in order to capture the impact of data and hardware characteristics. This strategy was also part of the analysis by Woltmann et al. \cite{Woltmann2021}, who noted that, on the one hand, it can take a wide range of algorithms into account, but that, on the other hand, adding a new algorithm always requires manual effort.\\
Based on their analysis, Woltmann et al. \cite{Woltmann2021} proposed a third strategy combining the advantages of the rule-based and the cost-based approach. Their main objective was to reduce the manual effort required to include a new algorithm in the selection set. Therefore, they designed their strategy as a black-box approach which does not need any information about the behavior or characteristics of an algorithm. To implement this, Machine Learning was used to train a model for each algorithm on generated training data. After the training phase, the model is able to predict the cost of the algorithm when applied to certain input data.\\
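A minimal sketch of this scheme, assuming scikit-learn and an already extracted feature vector per data set (the function and variable names are illustrative, not the implementation of Woltmann et al.): one regression model per algorithm predicts its cost, and the algorithm with the lowest predicted cost is selected.
\begin{verbatim}
from sklearn.ensemble import GradientBoostingRegressor

def train_cost_models(training_sets):
    # training_sets: {algorithm: (feature_matrix, measured_costs)}
    models = {}
    for algo, (X, y) in training_sets.items():
        models[algo] = GradientBoostingRegressor().fit(X, y)
    return models

def select_algorithm(models, features):
    # Black-box selection: only the models are queried,
    # never the algorithms themselves.
    costs = {algo: m.predict([features])[0]
             for algo, m in models.items()}
    return min(costs, key=costs.get)
\end{verbatim}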
As described above, lightweight integer compression algorithms behave differently depending on their input parameters. The learned selection strategy of Woltmann et al. \cite{Woltmann2021} performed better than the cost-based one by Damme et al. \cite{Damme2019}, but it can only select the best-fitting algorithm and does not take parameterizations into account.
\section{Machine Learning}
Machine Learning (ML) has become a popular technique in database systems for managing and analysing data; tasks like data sorting \cite{Kristo2020} or cardinality estimation \cite{Kipf} are especially suitable for it. Considering algorithm selection, only a few ML approaches exist. Jin et al. \cite{Jin2019} proposed a selection strategy that treats algorithm selection as a classification problem. During the training phase, they aggregate data with similar properties into data blocks and label them with the algorithm behavior. Afterwards, they predict the best-fitting algorithm by retrieving the data block with the smallest distance to the test data. Due to the use of classification, adding a new algorithm would require a completely new training phase.\\
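A sketch of the prediction step under these assumptions (NumPy-based, with invented names; the actual feature aggregation of Jin et al. is more involved):
\begin{verbatim}
import numpy as np

def predict_nearest_block(block_features, block_labels, test_features):
    # block_features: one aggregated feature vector per data block
    #                 (a 2D NumPy array),
    # block_labels:   best-performing algorithm per block.
    distances = np.linalg.norm(block_features - test_features, axis=1)
    return block_labels[int(np.argmin(distances))]
\end{verbatim}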
Boissier and Jenduk \cite{Boissier} developed a selection strategy based on Linear Regression and Gradient Boosting (GB) for the Hyrise database. Even though the GB model showed the best results, they left out crucial validation aspects in their work.\\
Woltmann et al. \cite{Woltmann2021} also used GB regression to train their models for the selection of the best-fitting algorithm. GB is an ensemble method consisting of several weak learners, mostly decision trees, where every decision tree is trained on the residuals of its predecessor. In comparison to Neural Networks, the times for training and forward passes are relatively low.\\
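The core of GB regression for a squared-error loss fits into a few lines; the following from-scratch sketch (with illustrative hyperparameters, not the cited implementation) shows how every tree is trained on the residuals of the ensemble built so far.
\begin{verbatim}
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gb_fit(X, y, n_trees=100, lr=0.1):
    # X, y: NumPy feature matrix and target vector.
    base = float(np.mean(y))            # initial constant model
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction      # negative gradient of L2 loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        prediction += lr * tree.predict(X)
        trees.append(tree)
    return base, trees

def gb_predict(base, trees, X, lr=0.1):
    return base + lr * sum(t.predict(X) for t in trees)
\end{verbatim}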
The correct interpretation of an ML model requires the evaluation of feature importances. This aspect was not part of the work of Woltmann et al. \cite{Woltmann2021}, but it would lead to a better understanding of the model's decisions, which is crucial if a larger number of algorithms is considered.\\
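With scikit-learn's GB implementation, for instance, impurity-based feature importances can be read off the trained models directly; continuing the selection sketch above (the feature names are an assumption):
\begin{verbatim}
feature_names = ["min", "max", "distinct_values", "sortedness"]
for algo, model in models.items():
    ranked = sorted(zip(feature_names, model.feature_importances_),
                    key=lambda p: p[1], reverse=True)
    print(algo, ranked[:3])   # three most influential features
\end{verbatim}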
All existing ML approaches have in common that they only aim to select the best-fitting algorithm; none of them considers the selection of algorithm parameterizations as well.