record: ensure block.written is accessed atomically #1719
Conversation
Currently, a block's `written` field is intended to be used atomically. There are existing call sites where the field is read from or written to non-atomically, which can result in a data race (as observed in cockroachdb/cockroach#81511). Make reads and writes of `block.written` atomic.

This results in up to an 8% reduction in latency and increase in throughput, depending on the size of the record, due to a change to use a stack variable rather than dereferencing a field on the struct. However, it should be noted that this particular benchmark does not touch the filesystem (it makes use of `io.Discard`). In practice the savings would be dwarfed by the filesystem operation (sync / write).

```
name                       old time/op    new time/op    delta
RecordWrite/size=8-16        42.9ns ± 5%    40.7ns ± 2%  -5.24%  (p=0.000 n=10+10)
RecordWrite/size=16-16       42.7ns ± 4%    40.7ns ± 1%  -4.89%  (p=0.000 n=10+10)
RecordWrite/size=32-16       44.0ns ± 6%    40.4ns ± 1%  -8.24%  (p=0.000 n=10+10)
RecordWrite/size=64-16       44.1ns ± 1%    41.6ns ± 0%  -5.58%  (p=0.000 n=10+10)
RecordWrite/size=256-16      67.7ns ± 4%    65.2ns ± 1%  -3.71%  (p=0.000 n=10+9)
RecordWrite/size=1028-16      124ns ± 2%     119ns ± 1%  -4.04%  (p=0.000 n=10+10)
RecordWrite/size=4096-16      306ns ± 5%     296ns ± 1%  -3.25%  (p=0.000 n=10+9)
RecordWrite/size=65536-16    4.54µs ± 1%    4.52µs ± 2%    ~     (p=0.247 n=10+10)

name                       old speed      new speed      delta
RecordWrite/size=8-16       186MB/s ± 5%   197MB/s ± 2%  +5.49%  (p=0.000 n=10+10)
RecordWrite/size=16-16      374MB/s ± 3%   394MB/s ± 1%  +5.12%  (p=0.000 n=10+10)
RecordWrite/size=32-16      727MB/s ± 5%   792MB/s ± 1%  +8.91%  (p=0.000 n=10+10)
RecordWrite/size=64-16     1.45GB/s ± 1%  1.54GB/s ± 0%  +5.91%  (p=0.000 n=10+10)
RecordWrite/size=256-16    3.78GB/s ± 4%  3.92GB/s ± 1%  +3.82%  (p=0.000 n=10+9)
RecordWrite/size=1028-16   8.31GB/s ± 2%  8.66GB/s ± 1%  +4.21%  (p=0.000 n=10+10)
RecordWrite/size=4096-16   13.4GB/s ± 5%  13.8GB/s ± 1%  +3.30%  (p=0.000 n=10+9)
RecordWrite/size=65536-16  14.4GB/s ± 1%  14.5GB/s ± 2%    ~     (p=0.247 n=10+10)

name                       old alloc/op   new alloc/op   delta
RecordWrite/size=8-16         0.00B          0.00B         ~     (all equal)
RecordWrite/size=16-16        0.00B          0.00B         ~     (all equal)
RecordWrite/size=32-16        0.00B          0.00B         ~     (all equal)
RecordWrite/size=64-16        0.00B          0.00B         ~     (all equal)
RecordWrite/size=256-16       0.00B          0.00B         ~     (all equal)
RecordWrite/size=1028-16      0.00B          0.00B         ~     (all equal)
RecordWrite/size=4096-16      0.00B          0.00B         ~     (all equal)
RecordWrite/size=65536-16     2.00B ± 0%     2.00B ± 0%    ~     (all equal)

name                       old allocs/op  new allocs/op  delta
RecordWrite/size=8-16          0.00           0.00         ~     (all equal)
RecordWrite/size=16-16         0.00           0.00         ~     (all equal)
RecordWrite/size=32-16         0.00           0.00         ~     (all equal)
RecordWrite/size=64-16         0.00           0.00         ~     (all equal)
RecordWrite/size=256-16        0.00           0.00         ~     (all equal)
RecordWrite/size=1028-16       0.00           0.00         ~     (all equal)
RecordWrite/size=4096-16       0.00           0.00         ~     (all equal)
RecordWrite/size=65536-16      0.00           0.00         ~     (all equal)
```

Fixes cockroachdb#1692.
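As an illustration of the change described above, here is a minimal, hedged sketch of the pattern (the `block` type, sizes, and function names are invented for this example, not the actual `record` package internals): the offset is loaded once into a stack variable with an atomic load, and the new offset is published with an atomic store.

```go
// Sketch only: illustrative stand-ins for the block / written fields.
package main

import (
	"fmt"
	"sync/atomic"
)

const blockSize = 32 << 10

type block struct {
	written int32 // number of bytes written; accessed atomically
	buf     [blockSize]byte
}

// appendToBlock runs on the single writer goroutine. It loads the
// current offset into a stack variable once, instead of dereferencing
// the struct field repeatedly, then publishes the new offset atomically.
func appendToBlock(b *block, p []byte) int {
	written := atomic.LoadInt32(&b.written)
	n := copy(b.buf[written:], p)
	atomic.StoreInt32(&b.written, written+int32(n))
	return n
}

// pending runs on a concurrent flushing goroutine; the atomic load is
// what makes the concurrent access to b.written race-free.
func pending(b *block, flushed int32) []byte {
	written := atomic.LoadInt32(&b.written)
	return b.buf[flushed:written]
}

func main() {
	var b block
	appendToBlock(&b, []byte("hello"))
	fmt.Printf("%q\n", pending(&b, 0)) // "hello"
}
```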
This is just one approach to solving this issue - honoring the contract that the struct field comment implies re: atomicity. It looks like there are also some comments around using the commit pipeline mutex for external synchronization (for example), though evidently something is slipping through the cracks, as we're seeing the race in Cockroach. For context, in cockroachdb/cockroach#81133, parallel writes are made to each engine's WAL as part of liveness checking, which can race if the context is cancelled and the log is closed. It seems like this code in Pebble has been this way since inception, and we're only tickling the issue now due to the nature of the concurrent access introduced in the Cockroach PR.
There are some gaps in the comments which make it difficult to tease out how this is intended to work. I believe its synchronization is designed for a concurrent single-consumer, single-producer scheme. Each log writer has a single long-lived flushing goroutine (the consumer). The commit pipeline (`d.commit.mu`) ensures at most a single writer is adding pending data to be flushed at a time (the producer).

Since we're guaranteed to have at most 1 writer appending data to be flushed, methods that are only used in the writing code path do not need to use atomic loads: there's no potential for another goroutine to write to `b.written` concurrently, because only the goroutine with `d.commit.mu` is permitted to write to `b.written`. However, the flushing goroutine needs to be able to access `b.written` concurrently with writes, so it uses atomic loads, and the writing goroutine must use atomic stores.
I think the data race is caused by the omission of a commit-pipeline lock in `DB.Close`. Closing the WAL writer normally requires holding the commit pipeline mutex (eg, `makeRoomForWrite` always holds `d.commit.mu`).
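For concreteness, a toy sketch of that suggestion (an invented `db` type, not the actual `DB.Close`): the close path takes the commit mutex before tearing down the WAL writer, so it cannot overlap a producer that is appending.

```go
// Sketch only: close holds the commit mutex while closing the writer.
package main

import "sync"

type db struct {
	commitMu sync.Mutex // stand-in for the commit pipeline mutex
	wal      []byte
	walOpen  bool
}

// apply appends to the WAL while holding the commit mutex, as the
// normal write path would.
func (d *db) apply(p []byte) {
	d.commitMu.Lock()
	defer d.commitMu.Unlock()
	if !d.walOpen {
		return // the writer is already closed
	}
	d.wal = append(d.wal, p...)
}

// closeDB locks the commit pipeline first (as makeRoomForWrite does),
// so closing the WAL writer cannot race with a producer.
func (d *db) closeDB() {
	d.commitMu.Lock()
	defer d.commitMu.Unlock()
	d.walOpen = false
	d.wal = nil
}

func main() {
	d := &db{walOpen: true}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); d.apply([]byte("record")) }()
	go func() { defer wg.Done(); d.closeDB() }() // serialized by commitMu
	wg.Wait()
}
```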
I think there's still a problem though. The `(*pebble.DB)` contract is that calling any method concurrently with `Close` is prohibited:
```go
// Close closes the DB.
//
// It is not safe to close a DB until all outstanding iterators are closed
// or to call Close concurrently with any other DB method. It is not valid
// to call any of a DB's methods after the DB has been closed.
```
If we were to make the adjustment to lock the commit pipeline at the beginning of `DB.Close`, the sync write batch application will panic when it acquires the mutex:
```go
func (d *DB) Apply(batch *Batch, opts *WriteOptions) error {
	if err := d.closed.Load(); err != nil {
		panic(err)
	}
```
It seems like we could change this into an ordinary error return, but I'm not sure whether that solves the problem. What happens in cockroachdb/cockroach#81133 if the liveness write returns an error? Can we avoid the race between engine close and the liveness write? cc @erikgrinaker
> If we were to make the adjustment to lock the commit pipeline at the beginning of `DB.Close`, the sync write batch application will panic when it acquires the mutex.
This is not quite right. There's already the risk of a panic if `Close()` completes its atomic store before the liveness goroutine enters `Apply`.
It's hard to reason about what a concurrent liveness write would do, because the internal state of the database is ~undefined during Close. I think we should find a way to avoid the batch application racing with the closing of the database above Pebble, and then adjust `Close` to make concurrent operations a little easier to catch, e.g. by changing the first line of `Close()` to swap `d.closed`'s value to `ErrClosed`, increasing the chances that a concurrent operation observes `ErrClosed`.
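A toy sketch of that idea (invented types, not the actual `pebble.DB`): `Close` stores `ErrClosed` into the closed marker as its very first step, and `Apply` reports it as an ordinary error, so a concurrent caller is more likely to observe `ErrClosed` than to race with teardown.

```go
// Sketch only: publish ErrClosed before teardown; return it from Apply.
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var ErrClosed = errors.New("pebble: closed")

type DB struct {
	closed atomic.Value // stores an error once the DB is closed
}

// Apply returns ErrClosed instead of panicking if the DB was closed.
func (d *DB) Apply(batch []byte) error {
	if err, _ := d.closed.Load().(error); err != nil {
		return err
	}
	// ... apply the batch ...
	return nil
}

// Close publishes ErrClosed first, before any internal teardown, to
// maximize the chance that a racing operation observes it.
func (d *DB) Close() error {
	d.closed.Store(ErrClosed)
	// ... tear down internal state ...
	return nil
}

func main() {
	d := &DB{}
	fmt.Println(d.Apply([]byte("k=v"))) // <nil>
	_ = d.Close()
	fmt.Println(d.Apply([]byte("k=v"))) // pebble: closed
}
```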
> I believe its synchronization is designed for a concurrent single-consumer, single-producer scheme.
This was my read too. The commentary was a little confusing though.
While this patch fixes this particular issue, we're still getting a race (when used in Cockroach) between the following two call sites when we close the writer:

Read (line 677 in bb2c150): `b := w.block`

Write (line 573 in bb2c150): `w.block = nextBlock`
Rather than sprinkling mutexes and atomics around the place in Pebble to cover use-cases above Pebble, I wonder if there's a way we can continue to honor the existing contract and adapt what we're doing above Pebble:
> I think we should find a way to avoid the batch application racing with the closing of the database above Pebble,
Basically +1 on this part.
I'm going to close this out as a signal that the code shouldn't be reviewed, but we can continue the conversation here.
> Rather than sprinkling mutexes and atomics around the place in Pebble to cover use-cases above Pebble, I wonder if there's a way we can continue to honor the existing contract and adapt what we're doing above Pebble:
> > I think we should find a way to avoid the batch application racing with the closing of the database above Pebble,
> Basically +1 on this part.
Ok, I'll add some synchronization in CRDB.
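A hedged sketch of what that CRDB-side synchronization might look like (an invented wrapper, not CockroachDB's actual code): writers take a read lock and check a closed flag, and engine close takes the write lock, so a close never overlaps an in-flight write and Pebble's `Close` contract is preserved.

```go
// Sketch only: synchronize engine close with writes above Pebble.
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errEngineClosed = errors.New("engine: closed")

type engine struct {
	mu     sync.RWMutex
	closed bool
	apply  func([]byte) error // stand-in for the underlying pebble.DB.Apply
}

// Apply holds a read lock for the duration of the write, so Close
// cannot begin while the write is in flight.
func (e *engine) Apply(data []byte) error {
	e.mu.RLock()
	defer e.mu.RUnlock()
	if e.closed {
		return errEngineClosed
	}
	return e.apply(data)
}

// Close waits for in-flight writes to drain before closing the
// underlying store, honoring the contract that Close must not run
// concurrently with any other method.
func (e *engine) Close() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.closed = true
	// ... close the underlying pebble.DB here ...
}

func main() {
	e := &engine{apply: func([]byte) error { return nil }}
	fmt.Println(e.Apply([]byte("liveness heartbeat"))) // <nil>
	e.Close()
	fmt.Println(e.Apply(nil)) // engine: closed
}
```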