\chapter{Introduction}
\label{sec:introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Motivation for the Benchmark}
The new era of the data economy, based on large, distributed, and complexly
structured datasets, has brought new challenges to the field of
data management and analytics. These datasets, usually modeled as large
graphs, have attracted both industry and academia due to the new research and
innovation opportunities they offer. This situation has also
opened the door for new companies offering non-relational, graph-like
technologies that are expected to play a significant role in the upcoming
years.

The change in the data paradigm calls for new benchmarks to test these
emerging technologies. Benchmarks set a framework in which different systems
can compete and be compared fairly, they let technology providers identify the
bottlenecks and gaps of their systems, and, in general, they drive the research
and development of new information technology solutions. Without benchmarks,
the uptake of these technologies is at risk, as the industry lacks clear,
user-driven targets for performance and functionality.

The Linked Data Benchmark Council's~\cite{DBLP:journals/corr/abs-2307-04350} Social Network Benchmark (\ldbcsnb) aims at being a comprehensive
benchmark by setting the rules for the evaluation of graph-like data management
technologies. \ldbcsnb is designed to be a plausible look-alike of all the
aspects of operating a social network site, as one of the most representative
and relevant use cases of modern graph-like applications.

\ldbcsnb includes the Interactive
workload~\cite{DBLP:conf/sigmod/ErlingALCGPPB15}, which consists of user-centric
transactional-like interactive queries, and the Business Intelligence workload,
which includes analytic queries to respond to business-critical questions.
Initially, a graph analytics workload was also included in the roadmap of
\ldbcsnb, but this was finally delegated to the Graphalytics benchmark
project~\cite{DBLP:journals/pvldb/IosupHNHPMCCSAT16,DBLP:journals/corr/abs-2011-15028}, which was adopted as an official LDBC graph
analytics benchmark. Combined, \ldbcsnb and Graphalytics target a broad range
of systems of different nature and characteristics. Both benchmarks
aim to capture the essential features of these scenarios while
abstracting away the details of specific business deployments.

This document contains the definition of the Interactive workload and the first
draft of the Business Intelligence workload. It includes a detailed
explanation of the data used in the \ldbcsnb benchmark, a detailed description
of all queries, and instructions on how to generate the data and run the
benchmark with the provided software.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Relevance to the Industry}
\ldbcsnb is intended to provide the following value to different stakeholders:
\begin{itemize}
\item For \textbf{end users} facing graph processing tasks, \ldbcsnb provides
a recognizable scenario against which it is possible to compare the merits of
different products and technologies. By covering a wide variety of scales
and price points, \ldbcsnb can serve as an aid to technology selection.
\item For \textbf{vendors} of graph database technology, \ldbcsnb provides a
checklist of features and performance characteristics that helps in
product positioning and can serve to guide new development.
\item For \textbf{researchers}, both industrial and academic, the \ldbcsnb
dataset and workloads provide interesting challenges in multiple
choke point areas, such as query optimization, (distributed) graph
analysis, and transactional throughput, and provide a way to objectively
compare the effectiveness and efficiency of new and existing technology in
these areas.
\end{itemize}
The technological scope of \ldbcsnb comprises all systems that one might
conceivably use to perform social network data management tasks:
\begin{itemize}
\item \textbf{Graph database management systems} (\eg Neo4j, TigerGraph, AWS Neptune)
are novel technologies aimed at storing property graphs,
\ie graphs with labels and properties (attributes) on nodes and edges.
They support graph traversals, typically by means of APIs, though
some of them also support dedicated graph-oriented query languages (\eg
Neo4j's Cypher and TigerGraph's GSQL, as well as the GQL and SQL/PGQ
standards); a brief illustrative comparison with SQL follows this list.
These systems' internal structures are typically designed
to store dynamic graphs that change over time. They offer support for
transactional queries with some degree of consistency, and value-based
indices to quickly locate nodes and edges. Finally, their architecture is
typically single-machine (non-cluster). These systems can
potentially implement all three workloads, though they will presumably be most
competitive on the Interactive and Business Intelligence workloads.
\item \textbf{Graph processing frameworks} (\eg Giraph, Signal/Collect,
GraphLab, Green-Marl) are designed to perform global graph
computations, executed in parallel or in a lockstep fashion. These computations
typically have long latencies, involve many nodes and edges, and often compute
approximate answers to NP-complete problems. These systems expose an API,
sometimes following a vertex-centric paradigm, and their architecture targets
both single-machine and cluster systems. These systems will likely implement
the Graph Analytics workload.
\item \textbf{RDF database systems} (\eg OWLIM, Virtuoso, Stardog, AWS Neptune)
are systems that implement the SPARQL~1.1 query
language, similar in complexity to \mbox{SQL-92}, which allows for structured
queries and simple traversals. RDF database systems often come with
additional support for simple reasoning (sameAs, subClass), text search, and
geospatial predicates. They generally support transactions, though not always
with full concurrency and serializability; their strength is often claimed to
be the integration of multiple data sources (\eg
DBpedia). Their architecture is both single-machine and clustered, and
they will likely target the Interactive and Business Intelligence workloads.
\item \textbf{Relational database systems} (\eg PostgreSQL, MySQL, Oracle, IBM Db2,
Microsoft SQL Server, Virtuoso, MonetDB, Vectorwise, Vertica, and DuckDB, but
also Hive and Impala) treat graph data relationally, and queries are formulated
in SQL and/or PL/SQL. Both single-machine and cluster systems exist. They do
not normally support recursion or stateful recursive algorithms, which makes
them ill-suited for the Graph Analytics workload.
\end{itemize}
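To make this contrast concrete, the sketch below shows how a simple ``friends
of a given person'' lookup might be expressed in Cypher and in SQL over an
edge table. The labels, table names, and attribute names used here are purely
illustrative and are not part of the normative schema, which is defined in
\autoref{sec:data}.
\begin{verbatim}
// Cypher: match the person and traverse KNOWS edges in both directions
MATCH (p:Person {id: $personId})-[:KNOWS]-(friend:Person)
RETURN friend.firstName, friend.lastName

-- SQL: join an illustrative edge table with the Person table
-- (assuming each friendship is stored in both directions)
SELECT f.firstName, f.lastName
FROM Person p
JOIN Person_knows_Person k ON k.Person1Id = p.id
JOIN Person f ON f.id = k.Person2Id
WHERE p.id = :personId;
\end{verbatim}
The graph formulation expresses the traversal as a path pattern, while the
relational formulation spells out the joins over the edge table explicitly;
for multi-hop or recursive traversals this difference becomes more pronounced.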
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{General Benchmark Overview}
\ldbcsnb aims at being a complete benchmark, designed with the following goals in mind:
\begin{itemize}
\item \textbf{Rich coverage.} \ldbcsnb is intended to cover most demands
encountered in the management of complexly structured data.
\item \textbf{Modularity.} \ldbcsnb is broken into parts that can be
individually addressed. In this manner \ldbcsnb
stimulates innovation without imposing an overly high threshold for
participation.
\item \textbf{Reasonable implementation cost.} For a product offering relevant
functionality, the effort of obtaining initial results with SNB should be
small, on the order of days.
\item \textbf{Relevant selection of challenges.} Benchmarks are known to
direct product development in certain directions. \ldbcsnb is informed by
the state-of-the-art in database research so as to offer optimization
challenges for years to come while not having a prohibitively high
threshold for entry.
\item \textbf{Reproducibility and documentation of results.} \ldbcsnb
will specify the rules for full disclosure of benchmark execution and for
auditing of benchmark runs in accordance with the LDBC Byelaws~\cite{ldbc_byelaws}.
The workloads may be run on any equipment
but the exact configuration and price of the hardware and software must be
disclosed.
\end{itemize}
The \ldbcsnb benchmark is modeled around the operation of a real social network
site. A social network site represents a relevant use case for the following
reasons:
\begin{itemize}
\item It is simple to understand for a large audience, as it is
arguably present in our everyday life in different shapes and forms.
\item It allows testing a complete range of interesting
challenges, by means of different workloads targeting systems of
different nature and characteristics.
\item A social network can be scaled, allowing the design of a
scalable benchmark targeting systems of different sizes and budgets.
\end{itemize}
\autoref{sec:benchmark-specification} summarizes LDBC's benchmark design philosophy.
In \autoref{sec:data}, we define the schema of the data used in
the benchmark. The schema represents a realistic social network, including
people and their activities in the social network during a period of time.
Each person's profile information, such as their name, birthday, interests,
and the places where they work or study, is included. A person's activity is
represented in the form of friendship relationships and content sharing (\ie
messages and pictures). \ldbcsnb provides a scalable synthetic data generator
based on the MapReduce paradigm, which produces networks conforming to the
described schema, with distributions and correlations similar to those expected
in a real social network. Furthermore, the data generator is designed to be
user-friendly. The data schema is shared by all workloads, both those currently
defined and those to be proposed in the future.

In \autoref{sec:workloads}, an overview of the workloads is provided.
All SNB workloads are designed to mimic
the different usage scenarios found in operating a real social network site, and
each of them targets one or more types of systems. Each workload defines a set
of queries and query mixes, designed to stress the SUTs in different choke point
areas, while being credible and realistic. The Interactive workload reproduces the
interaction between the users of the social network by including lookups and
transactions, which update small portions of the database. These queries are
designed to be interactive and target systems capable of responding to such queries
with low latency for multiple concurrent users. The Business Intelligence
workload represents the analytic queries a social network company would
like to perform on the social network to take advantage of its data and to
discover new business opportunities. This workload explores moderate to large
portions of the graph, starting from different entities, and performs more
resource-intensive operations.

All workloads are accompanied by an execution test driver, which is responsible
for executing the workloads and gathering the results. The driver is designed
with simplicity and portability in mind, so that it can be implemented on
systems of different nature and characteristics at a low cost. Furthermore, it
automatically handles the validation of the queries by means of a validation
dataset provided by LDBC. The overall philosophy of \ldbcsnb is to provide
the necessary software tools to run the benchmark and therefore to lower the
benchmark's barrier to entry as much as possible.

\autoref{sec:update-operations} defines the update operations used in the SNB workloads. \autoref{sec:interactive-v1}, \autoref{sec:interactive-v2}, and \autoref{sec:bi} define the SNB \interactivevone, \interactivevtwo, and BI workloads, respectively.
\autoref{sec:auditing} contains the SNB auditing policies.
\autoref{sec:acid-test-suite} defines the ACID test suite.
\autoref{sec:related-work} summarizes the related work on graph processing benchmarks.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Related Projects}
Alongside the Social Network Benchmark, LDBC~\cite{DBLP:journals/sigmod/AnglesBLF0ENMKT14} provides other benchmarks as well:
\begin{itemize}
\item The Semantic Publishing Benchmark (SPB)~\cite{DBLP:conf/semweb/SpasicJP16} measures the performance of \emph{semantic databases} operating on RDF datasets.
\item The Graphalytics benchmark~\cite{DBLP:journals/pvldb/IosupHNHPMCCSAT16} measures the performance of \emph{graph analysis} operations (\eg PageRank, local clustering coefficient).
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Participation of Industry and Academia}
The institutions that take part in the definition and development
of \ldbcsnb include relevant actors from both industry and academia in
the field of linked data management. All the participants have contributed
their experience and expertise in the field, resulting in a credible and
relevant benchmark that meets the desired requirements.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Technical Report}
This technical report is available on arXiv~\cite{DBLP:journals/corr/abs-2001-02299} and is updated upon new releases of the SNB.