6.824 2018 Lecture 4: Primary/Backup Replication
Today
Primary/Backup Replication for Fault Tolerance
Case study of VMware FT, an extreme version of the idea
Fault tolerance
we'd like a service that continues despite failures
some ideal properties:
available: still useable despite [some class of] failures
strongly consistent: looks just like a single server to clients
transparent to clients
transparent to server software
efficient
What failures will we try to cope with?
Fail-stop failures
Independent failures
Network drops some/all packets
Network partition
But not:
Incorrect execution
Correlated failures
Configuration errors
Malice
Behaviors
Available (e.g. if one server halts)
Wait (e.g. if network totally fails)
Stop forever (e.g. if multiple servers crash)
Malfunction (e.g. if h/w computes incorrectly, or software has a bug)
Core idea: replication
*Two* servers (or more)
Each replica keeps state needed for the service
If one replica fails, others can continue
Example: fault-tolerant MapReduce master
lab 1 workers are already fault-tolerant, but not master
master is a "single point of failure"
can we have two masters, in case one fails?
[diagram: M1, M2, workers]
state:
worker list
which jobs done
which workers idle
TCP connection state
program memory and stack
CPU registers
Big Questions:
What state to replicate?
Does primary have to wait for backup?
When to cut over to backup?
Are anomalies visible at cut-over?
How to bring a replacement up to speed?
Two main approaches:
State transfer
"Primary" replica executes the service
Primary sends [new] state to backups
Replicated state machine
All replicas execute all operations
If same start state,
same operations,
same order,
deterministic,
then same end state
State transfer is simpler
But state may be large, slow to transfer
VM-FT uses replicated state machine
Replicated state machine can be more efficient
If operations are small compared to data
But complex to get right
Labs 2/3/4 use replicated state machines
At what level to define a replicated state machine?
K/V put and get?
"application-level" RSM
usually requires server and client modifications
can be efficient; primary only sends high-level operations to backup
x86 instructions?
might allow us to replicate any existing server w/o modification!
but requires much more detailed primary/backup synchronization
and we have to deal with interrupts, DMA, weird x86 instructions
The Design of a Practical System for Fault-Tolerant Virtual Machines
Scales, Nelson, and Venkitachalam, SIGOPS OSR Vol 44, No 4, Dec 2010
Very ambitious system:
Goal: fault-tolerance for existing server software
Goal: clients should not notice a failure
Goal: no changes required to client or server software
Very ambitious!
Overview
[diagram: app, O/S, VM-FT underneath, shared disk, network, clients]
words:
hypervisor == monitor == VMM (virtual machine monitor)
app and O/S are "guest" running inside a virtual machine
two machines, primary and backup
shared disk for persistent storage
shared so that bringing up a new backup is faster
primary sends all inputs to backup over logging channel
Why does this idea work?
It's a replicated state machine
Primary and backup boot with same initial state (memory, disk files)
Same instructions, same inputs -> same execution
All else being equal, primary and backup will remain identical
What sources of divergence must we guard against?
Many instructions are guaranteed to execute exactly the same on primary and backup.
As long as memory+registers are identical, which we're assuming by induction.
When might execution on primary differ from backup?
Inputs from external world (the network).
Data read from storage server.
Timing of interrupts.
Instructions that aren't pure functions of state, such as cycle counter.
Races.
Examples of divergence?
They all sound like "if primary fails, clients will see inconsistent story from backup."
Lock server grants lock to client C1, rejects later request from C2.
Primary and backup had better agree on input order!
Otherwise, primary fails, backup now tells clients that C2 holds the lock.
Lock server revokes lock after one minute.
Suppose C1 holds the lock, and the minute is almost exactly up.
C2 requests the lock.
Primary might see C2's request just before timer interrupt, reject.
Backup might see C2's request just after timer interrupt, grant.
So: backup must see same events, in same order, at same point in instruction stream.
Example: timer interrupts
Goal: primary and backup should see interrupt at exactly the same point in execution
i.e. between the same pair of executed instructions
Primary:
FT fields the timer interrupt
FT reads instruction number from CPU
FT sends "timer interrupt at instruction X" on logging channel
FT delivers interrupt to primary, and resumes it
(this relies on special support from CPU to count instructions, interrupt after X)
Backup:
ignores its own timer hardware
FT sees log entry *before* backup gets to instruction X
FT tells CPU to interrupt at instruction X
FT mimics a timer interrupt, resumes backup
Example: disk/network input
Primary and backup *both* ask h/w to read
FT intercepts, ignores on backup, gives to real h/w on primary
Primary:
FT tells the h/w to DMA data into FT's private "bounce buffer"
At some point h/w does DMA, then interrupts
FT gets the interrupt
FT pauses the primary
FT copies the bounce buffer into the primary's memory
FT simulates an interrupt to primary, resumes it
FT sends the data and the instruction # to the backup
Backup:
FT gets data and instruction # from log stream
FT tells CPU to interrupt at instruction X
FT copies the data during interrupt
Why the bounce buffer?
I.e. why wait until primary/backup aren't executing before copying the data?
We want the data to appear in memory at exactly the same point in
execution of the primary and backup.
Otherwise they may diverge.
Note that the backup must lag by one event (one log entry)
Suppose primary gets an interrupt, or input, after instruction X
If backup has already executed past X, it cannot handle the input correctly
So backup FT can't start executing at all until it sees the first log entry
Then it executes just to the instruction # in that log entry
And waits for the next log entry before restarting backup
Example: non-functional instructions
even if primary and backup have same memory/registers,
some instructions still execute differently
e.g. reading the current time or cycle count or processor serial #
Primary:
FT sets up the CPU to interrupt if primary executes such an instruction
FT executes the instruction and records the result
sends result and instruction # to backup
Backup:
backup also interrupts when it tries to execute that instruction
FT supplies value that the primary got
What about disk/network output?
Primary and backup both execute instructions for output
Primary's FT actually does the output
Backup's FT discards the output
But: the paper's Output Rule (Section 2.2) says primary must
tell backup when it produces output, and delay the output until the
backup says it has received the log entry.
Why the Output Rule?
Suppose there was no Output Rule.
The primary emits output immediately.
Suppose the primary has seen inputs I1 I2 I3, then emits output.
The backup has received I1 and I2 on the log.
The primary crashes and the packet for I3 is lost by the network.
Now the backup will go live without having processed I3.
But some client has seen output reflecting the primary having executed I3.
So that client may see anomalous state if it talks to the service again.
So: the primary doesn't emit output until it knows that the backup
has seen all inputs up to that output.
The Output Rule is a big deal
Occurs in some form in all replication systems
A serious constraint on performance
An area for application-specific cleverness
Eg. maybe no need for primary to wait before replying to read-only operation
FT has no application-level knowledge, must be conservative
Q: What if the primary crashes just after getting ACK from backup,
but before the primary emits the output?
Does this mean that the output won't ever be generated?
A: Here's what happens when the primary fails and the backup takes over.
The backup got some log entries from the primary.
The backup continues executing those log entries WITH OUTPUT SUPPRESSED.
After the last log entry, the backup starts emitting output
In our example, the last log entry is I3
So after input I3, the backup will start emitting outputs
And thus it will emit the output that the primary failed to emit
Q: But what if the primary crashed *after* emitting the output?
Will the backup emit the output a *second* time?
A: Yes.
OK for TCP, since receivers ignore duplicate sequence numbers.
OK for writes to shared disk, since backup will write same data to same block #.
Duplicate output at cut-over is pretty common in replication systems
Not always possible for clients &c to ignore duplicates
For example, if output is vending money from an ATM
Q: Does FT cope with network partition -- could it suffer from split brain?
E.g. if primary and backup both think the other is down.
Will they both "go live"?
A: The shared disk breaks the tie.
Shared disk server supports atomic test-and-set.
Only one of primary/backup can successfully test-and-set.
If only one is alive, it will win test-and-set and go live.
If both try, one will lose, and halt.
Shared storage is single point of failure
If shared storage is down, service is down
Maybe they have in mind a replicated storage system
Q: Why don't they support multi-core?
Performance (table 1)
FT/Non-FT: impressive!
little slow down
Logging bandwidth
Directly reflects disk read rate + network input rate
18 Mbit/s for MySQL
These numbers seem low to me
Applications can read a disk at at least 400 megabits/second
So their applications aren't very disk-intensive
When might FT be attractive?
Critical but low-intensity services, e.g. name server.
Services whose software is not convenient to modify.
What about replication for high-throughput services?
People use application-level replicated state machines for e.g. databases.
The state is just the DB, not all of memory+disk.
The events are DB commands (put or get), not packets and interrupts.
Result: less fine-grained synchronization, less overhead.
GFS uses application-level replication, as do Lab 2 &c
Summary:
Primary-backup replication
VM-FT: clean example
How to cope with partition without single point of failure?
Next lecture
How to get better performance?
Application-level replicated state machines
----
VMware KB (#1013428) talks about multi-CPU support. VM-FT may have switched
from the replicated state machine approach to the state transfer approach,
but it's unclear whether that is the case.
http://www.wooditwork.com/2014/08/26/whats-new-vsphere-6-0-fault-tolerance/
http://www.tomsitpro.com/articles/vmware-vsphere-6-fault-tolerance-multi-cpu,1-2439.html
https://labs.vmware.com/academic/publications/retrace