6.824 2018 Lecture 13: Naiad Case Study
Naiad: A Timely Dataflow System
Murray, McSherry, Isaacs, Isard, Barham, Abadi
SOSP 2013
Today: streaming and incremental computation
Case study: Naiad
Why are we reading Naiad?
elegant design
impressive performance
open-source and still being developed
recap: Spark improved performance for iterative computations
i.e., ones where the same data is used again and again
but what if the input data changes?
may happen, e.g., because crawler updates pages (new links)
or because an application appends new record to a log file
Spark (as in last lecture's paper) will actually have to start over!
run all PageRank iterations again, even if just one link changed
(but: Spark Streaming -- similar to Naiad, but atop Spark)
Incremental processing
input vertices
could be a file (as in Spark/MR)
or a stream of events, e.g. web requests, tweets, sensor readings...
inject (key, value, epoch) records into graph
fixed data-flow
records flow into vertices, which may emit zero or more records in response
stateful vertices (long-lived)
a bit like cached RDDs
but mutable: state changes in response to new records arriving at the vertex
ex: GroupBy with sum() stores one record for every group
new record updates current aggregation value
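to make this concrete, a minimal self-contained Rust sketch (hypothetical names, not Naiad's actual API) of such a stateful GroupBy-sum vertex:

    use std::collections::HashMap;

    // A stateful GroupBy-sum vertex: keeps one running total per key
    // and emits the updated total whenever a new record arrives,
    // instead of recomputing the aggregate from scratch.
    struct GroupBySum {
        totals: HashMap<String, i64>,
    }

    impl GroupBySum {
        fn new() -> Self {
            GroupBySum { totals: HashMap::new() }
        }

        // Called once per incoming (key, value) record; returns the
        // record to emit downstream (the updated aggregate).
        fn on_recv(&mut self, key: &str, value: i64) -> (String, i64) {
            let total = self.totals.entry(key.to_string()).or_insert(0);
            *total += value;
            (key.to_string(), *total)
        }
    }

    fn main() {
        let mut v = GroupBySum::new();
        // The second record for "a" *updates* the existing aggregate.
        assert_eq!(v.on_recv("a", 2), ("a".to_string(), 2));
        assert_eq!(v.on_recv("a", 3), ("a".to_string(), 5));
        assert_eq!(v.on_recv("b", 7), ("b".to_string(), 7));
    }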
Iteration with data-flow cycles
contrast: Spark has no cycles (DAG), iterates by adding new RDDs
no RDD can depend on its own child
good for algorithms with loops
ex: PageRank sends rank updates back to start of computation
"root" streaming context
special, as it introduces inputs with epochs (specified by the program)
loop contexts
can nest arbitrarily
cannot partially overlap
[Figure 3 example w/ nested loop]
Problem: ordering required!
[example: delayed PR rank update, next iteration reads old rank]
intuition: avoid time-travelling updates
vertex sees update from past iteration when it already moved on
vertex emits update from a future iteration it's not at yet
key ingredient: hierarchical timestamps
timestamp: (epoch, [c1, ..., ck]) where ci is the ith-level loop context
ingress, egress, feedback vertices modify timestamps
Ingress: append new loop counter at end (start into loop)
Egress: drop loop counter from end (break out of loop)
Feedback: increment loop counter at end (iterate) -- sketched in code after the trace below
[Figure 3 timestamp propagation: lecture question]
epoch 1:
(a, 2): goes around the loop three times (value 4 -> 8 -> 16)
A: (1, []), I: (1, [0]), ... F: (1, [1]), ... F: (1, [2]), ... E: (1, [])
(b, 6): goes around the loop only once (value 6 -> 12)
A: (1, []), I: (1, [0]), ... F: (1, [1]), ... E: (1, [])
epoch 2:
(a, 5): dropped at A due to DISTINCT.
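a minimal Rust sketch (my names, not Naiad's C# API) of these hierarchical timestamps and the three vertex operations, replaying the epoch-1 trace for (a, 2) above:

    // Hierarchical timestamp: (epoch, [c1, ..., ck]).
    #[derive(Debug)]
    struct Timestamp {
        epoch: u64,
        counters: Vec<u64>, // one counter per enclosing loop context
    }

    impl Timestamp {
        // Ingress: entering a loop appends a fresh counter.
        fn ingress(mut self) -> Timestamp {
            self.counters.push(0);
            self
        }
        // Egress: leaving a loop drops the innermost counter.
        fn egress(mut self) -> Timestamp {
            self.counters.pop();
            self
        }
        // Feedback: going around the loop increments the innermost counter.
        fn feedback(mut self) -> Timestamp {
            *self.counters.last_mut().expect("not in a loop") += 1;
            self
        }
    }

    fn main() {
        // Replay epoch 1, record (a, 2) from the trace above:
        let t = Timestamp { epoch: 1, counters: vec![] }; // at A: (1, [])
        let t = t.ingress();  // at I:      (1, [0])
        let t = t.feedback(); // through F: (1, [1])
        let t = t.feedback(); // through F: (1, [2])
        let t = t.egress();   // at E:      (1, [])
        assert_eq!(t.counters, Vec::<u64>::new());
        println!("final timestamp: ({}, {:?})", t.epoch, t.counters);
    }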
timestamps form a partial order
(1, [5, 7]) < (1, [5, 8]) < (1, [6, 2]) < (2, [1, 1])
but also: (1, [5, 7]) < (1, [5]) -- lexicographic ordering on loop counters
and, for two top-level loops: (1, lA [5]) doesn't compare to (1, lB [2])
allows vertices to order updates
and to decide when it's okay to release them
so a downstream vertex gets consistent outputs, in order
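one way to encode that order in Rust (a sketch of the rules stated above; the loop-context labels that make sibling loops incomparable are omitted for brevity):

    use std::cmp::Ordering;

    // Order on hierarchical timestamps (epoch, [c1, ..., ck]), per the
    // notes above. Sketch only: comparing timestamps from two sibling
    // loop contexts would also need the loop-context path.
    fn cmp_ts(a: &(u64, Vec<u64>), b: &(u64, Vec<u64>)) -> Ordering {
        if a.0 != b.0 {
            return a.0.cmp(&b.0); // earlier epoch comes first
        }
        for (x, y) in a.1.iter().zip(b.1.iter()) {
            if x != y {
                return x.cmp(y); // lexicographic on loop counters
            }
        }
        // One is a prefix of the other: the deeper timestamp (more
        // counters) is still inside the loop, so it is *earlier*,
        // e.g. (1, [5, 7]) < (1, [5]).
        b.1.len().cmp(&a.1.len())
    }

    fn main() {
        assert_eq!(cmp_ts(&(1, vec![5, 7]), &(1, vec![5, 8])), Ordering::Less);
        assert_eq!(cmp_ts(&(1, vec![5, 8]), &(1, vec![6, 2])), Ordering::Less);
        assert_eq!(cmp_ts(&(1, vec![6, 2]), &(2, vec![1, 1])), Ordering::Less);
        assert_eq!(cmp_ts(&(1, vec![5, 7]), &(1, vec![5])), Ordering::Less);
    }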
Programmer API
good news: end users mostly don't worry about the timestamps
timely dataflow is low-level infrastructure
allows experts to build special-purpose implementations for specific uses
[Figure 2: runtime, timely dataflow, differential dataflow, other libs]
high-level APIs provided by libraries built on the timely dataflow abstraction
e.g.: LINQ (SQL-like), BSP (Pregel-ish), Datalog
similar trend towards special-purpose libraries in Spark ecosystem
SparkSQL, GraphX, MLlib, Streaming, ...
Low-level vertex API
need to explicitly deal with timestamps
notifications
"you'll no longer receive records with this timestamp or any earlier one"
vertex may, e.g., decide to compute the final value and release it
two callbacks invoked by the runtime on vertex:
OnRecv(edge, msg, timestamp) -- here are some messages
OnNotify(timestamp) -- no more messages <= timestamp coming
two API calls available to vertex code:
SendBy(edge, msg, timestamp) -- send a message at current or future time
NotifyAt(timestamp) -- call me back when at timestamp
allows different strategies
incrementally release records for a time, finish up with notification
buffer all records for a time, then release on notification
[code example: Distinct Count vertex]
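a hedged Rust sketch of that distinct-count vertex, showing the buffer-then-notify pattern (the real API is C#; send_by/notify_at are stubs standing in for the runtime's SendBy/NotifyAt, and timestamps are simplified to u64):

    use std::collections::HashMap;

    // Output edge 1: each record, the first time it is seen at time t.
    // Output edge 2: final per-record counts, released on notification.
    struct DistinctCount {
        counts: HashMap<u64, HashMap<String, usize>>, // time -> record -> count
    }

    impl DistinctCount {
        // OnRecv: a record with timestamp `time` has arrived.
        fn on_recv(&mut self, msg: String, time: u64) {
            let first = !self.counts.entry(time).or_default().contains_key(&msg);
            if first {
                self.notify_at(time); // call me back once `time` is complete
                self.send_by(1, msg.clone(), time);
            }
            *self.counts.entry(time).or_default().entry(msg).or_insert(0) += 1;
        }

        // OnNotify: no more records with timestamp <= `time` will arrive,
        // so the counts for `time` are final and can be released.
        fn on_notify(&mut self, time: u64) {
            if let Some(seen) = self.counts.remove(&time) {
                for (msg, count) in seen {
                    self.send_by(2, format!("{}: {}", msg, count), time);
                }
            }
        }

        // Stubs; a real vertex would call into the timely dataflow runtime.
        fn send_by(&self, edge: u32, msg: String, time: u64) {
            println!("edge {}, time {}: {}", edge, time, msg);
        }
        fn notify_at(&self, _time: u64) {}
    }

    fn main() {
        let mut v = DistinctCount { counts: HashMap::new() };
        v.on_recv("apple".to_string(), 1);
        v.on_recv("apple".to_string(), 1);
        v.on_recv("pear".to_string(), 1);
        v.on_notify(1); // delivered by the runtime's progress tracking
    }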
Progress tracking
protocol to figure out when to deliver notifications to vertices
intuition
ok to deliver notification once it's *impossible* for predecessors to
generate records with earlier timestamp
single-threaded example
"pointstamps": just a timestamp + location (edge or vertex)
[lattice diagram of vertices/edges on x-axis, times on y-axis]
arrows indicate could-result-in
follow partial order on timestamps
ex: (1, [2]) at B in the example could-result-in (1, [3]), but also (1, [])
=> not ok to finish (1, []) until everything has left the loop
remove active pointstamp once no events for it are left
i.e., occurrence count (OC) = 0
frontier: pointstamp without incoming arrows
keeps moving down the time axis, but speed differs for locations
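a toy illustration of the single-threaded case (heavily simplified: one location and totally ordered u64 times, so could-result-in collapses to "<="; the real protocol tracks (timestamp, location) pointstamps under the partial order):

    use std::collections::BTreeMap;

    struct Progress {
        occurrence: BTreeMap<u64, i64>, // active pointstamp -> occurrence count
    }

    impl Progress {
        // An undelivered event (message or notification) at `t` was created.
        fn inc(&mut self, t: u64) {
            *self.occurrence.entry(t).or_insert(0) += 1;
        }
        // ...and was delivered; OC = 0 retires the pointstamp.
        fn dec(&mut self, t: u64) {
            let oc = self.occurrence.get_mut(&t).expect("unknown pointstamp");
            *oc -= 1;
            if *oc == 0 {
                self.occurrence.remove(&t);
            }
        }
        // The frontier: the earliest still-active pointstamp.
        fn frontier(&self) -> Option<u64> {
            self.occurrence.keys().next().copied()
        }
        // A notification at `t` may be delivered once nothing active
        // can still produce work at or before `t`.
        fn deliverable(&self, t: u64) -> bool {
            self.frontier().map_or(true, |f| f > t)
        }
    }

    fn main() {
        let mut p = Progress { occurrence: BTreeMap::new() };
        p.inc(1);                   // a message at time 1 is in flight
        assert!(!p.deliverable(1)); // must hold the notification back
        p.dec(1);                   // message consumed, OC(1) drops to 0
        assert!(p.deliverable(1));  // now safe to notify at time 1
    }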
distributed version
strawman 1: send all events to a single coordinator
slow, as need to wait for this coordinator
this is the "global frontier" -- coordinator has all information
strawman 2: process events locally, then inform all other workers
broadcast!
workers now maintain "local frontier", which approximates global one
workers can only be *behind*: progress update may still be in the network
local frontier can never advance past global frontier
hence safe, as will only deliver notifications late
problem: sends enormous amounts of progress chatter!
~10 GB for WCC on 60 computers ("None", Figure 6(c))
solution: aggregate events locally before broadcasting
each event changes OC by +1 or -1 => can combine
means we may wait a little longer for an update (always safe)
only sends updates when set of active pointstamps changes
global aggregator merges updates from different workers
like global coordinator in strawman, but far fewer messages
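a small sketch of the aggregation idea (invented names, pointstamps simplified to u64):

    use std::collections::HashMap;

    // Each worker buffers +1/-1 occurrence-count deltas and broadcasts
    // only the net changes, instead of one message per event.
    struct DeltaBuffer {
        deltas: HashMap<u64, i64>, // pointstamp -> net OC delta
    }

    impl DeltaBuffer {
        fn record(&mut self, t: u64, delta: i64) {
            let d = self.deltas.entry(t).or_insert(0);
            *d += delta;
            if *d == 0 {
                self.deltas.remove(&t); // +1/-1 cancelled: nothing to send
            }
        }
        // Drain the surviving net changes for broadcast to other workers.
        fn flush(&mut self) -> Vec<(u64, i64)> {
            self.deltas.drain().collect()
        }
    }

    fn main() {
        let mut buf = DeltaBuffer { deltas: HashMap::new() };
        buf.record(3, 1);  // event created at time 3...
        buf.record(3, -1); // ...and consumed before we flushed
        buf.record(4, 1);
        assert_eq!(buf.flush(), vec![(4, 1)]); // only time 4's change is sent
    }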
Fault tolerance
perhaps the weakest part of the paper
option 1: write globally synchronous, coordinated checkpoints
recovery loads last checkpoint (incl. timestamps, progress info)
then starts processing from there, possibly repeating inputs
induces pause times while making checkpoints (cf. tail latency, Fig 7c)
option 2: log all messages to disk before sending
no need to checkpoint
recovery can resume from any point
but high common-case overhead (i.e., pay price even when there's no failure)
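a sketch of what option 2 looks like, and where the common-case cost comes from (invented names; the sync on every send is the overhead):

    use std::fs::{File, OpenOptions};
    use std::io::{BufWriter, Write};

    // Write-ahead logging of every outgoing message: the fsync before
    // the network send is what lets recovery resume from any point,
    // and also what makes the failure-free case expensive.
    struct LoggingSender {
        log: BufWriter<File>,
    }

    impl LoggingSender {
        fn new(path: &str) -> std::io::Result<Self> {
            let file = OpenOptions::new().create(true).append(true).open(path)?;
            Ok(LoggingSender { log: BufWriter::new(file) })
        }

        fn send(&mut self, dest: u32, msg: &str) -> std::io::Result<()> {
            // Log first, force to disk, only then send on the network.
            writeln!(self.log, "{} {}", dest, msg)?;
            self.log.flush()?;
            self.log.get_ref().sync_all()?; // the expensive part
            // network_send(dest, msg) would go here
            Ok(())
        }
    }

    fn main() -> std::io::Result<()> {
        let mut s = LoggingSender::new("/tmp/naiad-msg-log")?;
        s.send(1, "rank update: (a, 0.15)")?;
        Ok(())
    }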
Q: why not use a recomputation strategy like Spark's?
A: difficult to do with fine-grained updates. Up to what point do we recompute?
cleverer scheme developed after this paper
some vertices checkpoint, some log messages
can still recover the joint graph
see Falkirk Wheel paper -- https://arxiv.org/abs/1503.08877
Performance results
system optimized from the ground up, as the microbenchmarks illustrate (Fig 6)
impressive PageRank performance (<1s per iteration on twitter_rv graph)
very good in 2013, still pretty good now!
Naiad matches or beats specialized systems in several different domains
iterative batch: PR, SCC, WCC &c vs. DB, DryadLINQ -- >10x improvement
graph processing: PR on twitter vs. PowerGraph -- ~10x improvement
iterative ML: logistic regression vs. Vowpal Wabbit -- ~40% improvement
References:
Differential Dataflow: http://cidrdb.org/cidr2013/Papers/CIDR13_Paper111.pdf
Rust re-implementations:
* https://github.com/frankmcsherry/timely-dataflow
* https://github.com/frankmcsherry/differential-dataflow