\chapter{Legacy Data Sets for the Interactive Workload}
\label{sec:legacy-data-sets}
The Interactive workload uses the legacy version of the data sets. These can be generated using the Hadoop-based \datagen hosted at \url{https://github.com/ldbc/ldbc_snb_datagen_hadoop/}. This chapter documents these data sets.
The SNB data sets are available in the SURF/CWI LDBC SNB data repository~\cite{cwi:snb} at \url{https://repository.surfsara.nl/datasets/cwi/ldbc-snb-interactive-v1-datagen-v100}.
\begin{itemize}
\item Serializers:
\texttt{csv\_basic},
\texttt{csv\_basic-longdateformatter},
\texttt{csv\_composite},
\texttt{csv\_composite-longdateformatter},
\texttt{csv\_composite\_merge\_foreign},
\texttt{csv\_composite\_merge\_foreign-longdateformatter},
\texttt{csv\_merge\_foreign},
\texttt{csv\_merge\_foreign-longdateformatter},
\texttt{ttl}
\item Partition numbers: $2^k$ (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024) and $6 \times 2^k$ (24, 48, 96, 192, 384, 768).
\end{itemize}
The \textbf{key differences from the latest (BI and Interactive v2) data sets} are the following:
\begin{itemize}
\item DateTime values follow the format \texttt{yyyy-mm-ddTHH:MM:ss.sss+0000}, \ie their offset string is \texttt{0000} instead of \texttt{00:00}. This means they do not comply with the recommendations of RFC-3339\footnote{\url{https://tools.ietf.org/html/rfc3339}} (a parsing sketch follows this list).
\item The Forum-hasModerator-Person edge type has an \emph{exactly one} cardinality on the Person's end.
\end{itemize}
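The legacy offset format can still be parsed with standard library tooling; a minimal Python sketch (the timestamp value is illustrative):
\begin{verbatim}
from datetime import datetime

# The legacy offset "+0000" has no colon; Python's %z accepts this form.
s = "2012-11-28T00:00:00.000+0000"
dt = datetime.strptime(s, "%Y-%m-%dT%H:%M:%S.%f%z")
print(dt.isoformat())  # 2012-11-28T00:00:00+00:00
\end{verbatim}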
\section{Output Data}
For each scale factor, \datagen produces three different artefacts:
\begin{itemize}
\item \textbf{Dataset:} The dataset to be bulk loaded by the SUT. In the Interactive workload, it corresponds to roughly 90\% of the total generated network.
\item \textbf{Update Streams:} A set of update streams containing update
operations, which are used by the driver to generate the update queries of the
workloads. These update
streams correspond to the remaining 10\% of the generated network.
\item \textbf{Substitution Parameters:} A set of files containing the
different parameter bindings that will be used by the driver to generate the
read queries of the workloads.
\end{itemize}
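On disk, these artefacts are organized as follows (an illustrative sketch; exact file names depend on the serializer and the number of threads used):
\begin{verbatim}
social_network/
  static/                      # Dataset, static part
  dynamic/                     # Dataset, dynamic part
  updateStream_*_person.csv    # Update Streams (Person inserts)
  updateStream_*_forum.csv     # Update Streams (Forum-related inserts)
substitution_parameters/
  interactive_<id>_param.txt   # Substitution Parameters
  bi_<id>_param.txt
\end{verbatim}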
\subsection{Scale Factors}
\label{sec:legacy-scale-factors}
\ldbcsnb defines a set of scale factors (SFs), targeting systems of different sizes and budgets.
SFs are computed based on the ASCII size in Gibibytes of the generated output files using the CsvMergeForeign serializer (see \autoref{sec:legacy-serializers}) and default settings, \ie both the 90\% initial data and the 10\% update streams count towards the total size.
For example, SF1 takes roughly 1~GiB in CSV format, SF3 weighs roughly 3~GiB, and so on.
It is important to note that for a given scale factor, data sets generated using different serializers contain exactly the same data; the only difference is in how they are represented.%
\footnote{Naturally, there are slight differences in the disk usage of the data sets created with different serializers. For example, for a given scale factor, the disk usage of the data set serialized with the CsvBasic serializer is expected to be higher, while with the CsvCompositeMergeForeign serializer, it is expected to be lower.}
The provided SFs are the following: 1, 3, 10, 30, 100, 300, and 1000.
Additionally, two small data sets, SF0.1 and SF0.3, are provided to help initial validation efforts.
The Test Sponsor may select the SF that best fits their needs, by properly configuring the \datagen, as described in \autoref{sec:data_generation}.
The size of the resulting dataset is mainly affected by the following configuration parameters: the number of persons and the number of years simulated.
By default, all SFs are defined over a period of three years, starting from 2010, and SFs are computed by scaling the number of Persons in the network.
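For instance, selecting SF1 with the Hadoop \datagen is a matter of setting the scale factor property in its \texttt{params.ini} configuration file; the sketch below is illustrative, and the exact property names should be checked against the \datagen repository:
\begin{verbatim}
# params.ini (illustrative sketch)
ldbc.snb.datagen.generator.scaleFactor:snb.interactive.1
\end{verbatim}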
\autoref{tab:snsize-interactive} shows some metrics of SFs 0.1, \ldots, 1000 data sets.
\input{tables/legacy/table-snsize-interactive}
\input{tables/legacy/table-update-stream-statistics}
\input{tables/legacy/table-csv-serializers}
\autoref{tab:legacy-csv-serializers} shows how each CSV serializer handles attributes/edges of different cardinalities. The data shows why CsvBasic produces the most files and CsvCompositeMergeForeign the fewest.
\input{tables/legacy/table-csv-basic}
\input{tables/legacy/table-csv-merge-foreign}
\input{tables/legacy/table-csv-composite}
\input{tables/legacy/table-csv-composite-merge-foreign}
\subsection{Serializers}
\label{sec:legacy-serializers}
The datasets are generated in the \texttt{social\_network/} directory, split into static and dynamic parts (\autoref{fig:schema}).
The filenames (without the extension) end in \texttt{\_i\_j} where \texttt{i} is the block id and \texttt{j} is the partition id (set by \texttt{numThreads}).
Of the generated artefacts, the SUT only has to bulk load the Dataset. Using \texttt{NULL} values for storing optional values is allowed.
Datagen's CSV (Comma Separated Values) serializers produce text files which use the pipe character ``\texttt{|}'' as the primary field separator and the semicolon character ``\texttt{;}'' as a separator for multi-valued attributes (for the composite serializers).
The CSV files are stored in two subdirectories: \texttt{static/} and \texttt{dynamic/}.
Depending on the number of threads used for generating the dataset, the number of files varies, since there is a file generated per thread. We indicate this with ``\texttt{*}'' in the specification.
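A minimal Python sketch for splitting such a line (the sample row and its column layout are illustrative, not an actual schema):
\begin{verbatim}
# "|" separates fields; ";" separates the values of a
# multi-valued attribute (composite serializers only).
line = "933|Mahinda|Perera|[email protected];[email protected]\n"
fields = line.rstrip("\n").split("|")
emails = fields[-1].split(";")
print(fields[:3], emails)
\end{verbatim}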
The following CSV variants are supported:
\begin{itemize}
\item \textbf{CsvBasic:}
Each entity, relation, and attribute with a cardinality larger than one (including attributes \texttt{Person.email} and \texttt{Person.speaks}) is output in a separate file.
Generated files are summarized in \autoref{table:csv_basic}.
\item \textbf{CsvMergeForeign:}
This serializer is similar to CsvBasic, but relations that have a cardinality of 1-to-N are merged into the entity files as foreign keys (an illustrative sketch follows this list).
There are 13~such relations in total:
\begin{itemize}
\item comment\_hasCreator\_person, comment\_isLocatedIn\_place, comment\_replyOf\_comment, comment\_replyOf\_post (merged to Comment);
\item forum\_hasModerator\_person (merged to Forum);
\item organisation\_isLocatedIn\_place (merged to Organisation);
\item person\_isLocatedIn\_place (merged to Person);
\item place\_isPartOf\_place (merged to Place);
\item post\_hasCreator\_person, post\_isLocatedIn\_place, forum\_containerOf\_post (merged to Post);
\item tag\_hasType\_tagclass (merged to Tag);
\item tagclass\_isSubclassOf\_tagclass (merged to TagClass)
\end{itemize}
Generated files are summarized in \autoref{table:csv_merge_foreign}.
\item \textbf{CsvComposite:}
Similar to the CsvBasic format, but only each entity and each relation with a cardinality larger than one is output in a separate file.
Multi-valued attributes (\texttt{Person.email} and \texttt{Person.speaks}) are stored as composite values.
Generated files are summarized in \autoref{table:csv_composite}.
\item \textbf{CsvCompositeMergeForeign:}
Has the traits of both the CsvComposite and the CsvMergeForeign formats.
Multi-valued attributes (\texttt{Person.email} and \texttt{Person.speaks}) are stored as composite values.
Generated files are summarized in \autoref{table:csv_composite_merge_foreign}.
\end{itemize}
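As an illustration of the CsvBasic/CsvMergeForeign difference, consider the Forum--hasModerator--Person relation; the headers below are a sketch, and the authoritative file layouts are given in the referenced tables:
\begin{verbatim}
# CsvBasic: the relation is a separate edge file
forum_*.csv:                      id|title|creationDate
forum_hasModerator_person_*.csv:  Forum.id|Person.id

# CsvMergeForeign: the moderator becomes a foreign key column
forum_*.csv:                      id|title|creationDate|moderator
\end{verbatim}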
Additionally, the Hadoop \datagen can generate files in Turtle format (\texttt{.ttl}).
\paragraph{Inheritance}
The inheritance hierarchies in the schema have two important characteristics:
(1)~all subclasses use the same id space, \eg there cannot be a Comment and a Post with id 1 at the same time,
(2)~they are serialized to CSVs using either the \emph{map hierarchy to single table} or \emph{map each concrete class to its own table} strategies\footnote{\url{http://www.agiledata.org/essays/mappingObjects.html}}:
\begin{description}
\item[Message = Comment | Post]
\emph{Map each concrete class to its own table} is used, \ie separate CSV files are used for the Comment and the Post classes.
\item[Place = City | Country | Continent]
\emph{Map hierarchy to single table} is used, \ie all Place nodes are serialized in a single file (illustrative rows are shown after this list). A discriminator attribute ``type'' is used with the value denoting the concrete class, \eg ``Country''.
\item[Organisation = Company | University]
\emph{Map hierarchy to single table} is used, \ie all Organisation nodes are serialized in a single file. A discriminator attribute ``type'' is used with the value denoting the concrete class, \eg ``Company''.
\end{description}
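For example, under the \emph{map hierarchy to single table} strategy, the Place file contains rows like the following (the values and header names are illustrative):
\begin{verbatim}
id|name|url|type
0|India|http://dbpedia.org/resource/India|Country
5|Chennai|http://dbpedia.org/resource/Chennai|City
\end{verbatim}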
\subsection{Update Streams}
\input{tables/legacy/table-update-streams}
The generic schema is given in \autoref{table:update_stream_generic_schema}, while the concrete schema of each insert operation is given in \autoref{table:update_stream_schemas}.
The update stream files are generated in the \texttt{social\_network/} directory and are grouped as follows:
\begin{itemize}
\item \texttt{updateStream\_*\_person.csv} files contain update operation 1: \queryRefCard{insert-01}{INS}{1}
\item \texttt{updateStream\_*\_forum.csv} files contain update operations 2--8: %
\queryRefCard{insert-02}{INS}{2}
\queryRefCard{insert-03}{INS}{3}
\queryRefCard{insert-04}{INS}{4}
\queryRefCard{insert-05}{INS}{5}
\queryRefCard{insert-06}{INS}{6}
\queryRefCard{insert-07}{INS}{7}
\queryRefCard{insert-08}{INS}{8}
\end{itemize}
Remark: update streams in \interactivevone contain only inserts. Delete operations are being designed and will be released later.
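A minimal Python sketch for consuming an update stream file, assuming the generic layout of \autoref{table:update_stream_generic_schema} (scheduled time, dependency time, event type, then operation-specific columns; the exact column semantics here are assumptions):
\begin{verbatim}
import csv

with open("social_network/updateStream_0_0_forum.csv", newline="") as f:
    for row in csv.reader(f, delimiter="|"):
        scheduled_ms, dependency_ms = row[0], row[1]
        event_type = int(row[2])   # 2--8 for the Forum-related inserts
        payload = row[3:]          # operation-specific columns
        print(event_type, len(payload))
\end{verbatim}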
\subsection{Substitution Parameters}
The substitution parameters are generated in the \texttt{substitution\_parameters/} directory.
Each parameter file is named \texttt{\{interactive|bi\}\_<id>\_param.txt}, corresponding to an operation of
Interactive complex reads (\queryRefCard{interactive-complex-read-01}{IC}{1}--\queryRefCard{interactive-complex-read-14-v1}{IC}{14v1}) and
BI reads (\queryRefCard{bi-read-01}{BI}{1}--\queryRefCard{bi-read-20}{BI}{20}).
The schemas of these files are defined by the operator, \eg the schema of \queryRefCard{interactive-complex-read-01}{IC}{1} is ``\texttt{personId|firstName}''.
\textbf{Warning.} The substitution parameter files use UNIX epoch timestamps, which should be converted to datetime values in GMT+0.
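For example, a Python sketch for this conversion (assuming millisecond precision, as used elsewhere in the generated data; the timestamp value is illustrative):
\begin{verbatim}
from datetime import datetime, timezone

epoch_ms = 1354060800000  # illustrative value
dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
print(dt.isoformat())  # 2012-11-28T00:00:00+00:00
\end{verbatim}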