storage: add WAL failover configuration #120509
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from a20b128 to d62c7ae.
Reviewed 11 of 11 files at r1, 8 of 8 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal and @jbowens)
pkg/server/config.go
line 761 at r1 (raw file):
defer storeEnvs.CloseAll()
walFailoverConfig := storage.WALFailover(cfg.WALFailover, storeEnvs)

I don't see storage.WALFailover on master or in this commit.
pkg/storage/open.go
line 292 at r2 (raw file):
UnhealthyOperationLatencyThreshold: func() time.Duration { return walFailoverUnhealthyOpThreshold.Get(&cfg.Settings.SV) },
We should probably reduce FailoverOptions.UnhealthySamplingInterval to 25ms, given the default for the cluster setting is 100ms; polling every 25ms is cheap enough. Also, FailoverOptions.ElevatedWriteStallThresholdLag doesn't seem to have a default in Pebble (oversight?); something like 60s may be ok. And the HealthyInterval default of 2min is too long, I think; something like 15s seems more reasonable. Of course, these are all guesses without experimental validation under real conditions.
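For concreteness, here is a minimal, self-contained sketch of the values being proposed. The struct below only mirrors the Pebble FailoverOptions fields named in this thread rather than importing Pebble itself, and the 25ms/60s/15s numbers are the unvalidated guesses from the comment above, not verified defaults.

```go
package main

import (
	"fmt"
	"time"
)

// failoverOptions mirrors, for illustration only, the Pebble FailoverOptions
// fields discussed in this review; it is not Pebble's actual type.
type failoverOptions struct {
	// How often WAL health is sampled when deciding whether to fail over.
	UnhealthySamplingInterval time.Duration
	// Latency above which a single WAL operation is considered unhealthy. In
	// CockroachDB this is read from the storage.wal_failover.unhealthy_op_threshold
	// cluster setting (default 100ms), hence a func rather than a constant.
	UnhealthyOperationLatencyThreshold func() time.Duration
	// Extra write-stall headroom tolerated while running on the secondary.
	ElevatedWriteStallThresholdLag time.Duration
	// How long the primary must look healthy before failing back to it.
	HealthyInterval time.Duration
}

func main() {
	opts := failoverOptions{
		UnhealthySamplingInterval:          25 * time.Millisecond, // polling every 25ms is cheap enough
		UnhealthyOperationLatencyThreshold: func() time.Duration { return 100 * time.Millisecond },
		ElevatedWriteStallThresholdLag:     60 * time.Second, // no Pebble default; 60s suggested above
		HealthyInterval:                    15 * time.Second, // instead of the 2min default
	}
	fmt.Printf("sampling=%s healthy=%s\n", opts.UnhealthySamplingInterval, opts.HealthyInterval)
}
```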
This turned out very clean!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal and @jbowens)
Force-pushed from b3df667 to 96de542.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal and @sumeerbhola)
pkg/server/config.go
line 761 at r1 (raw file):
Previously, sumeerbhola wrote…
I don't see storage.WALFailover on master or in this commit.

you found it, right?
pkg/storage/open.go
line 292 at r2 (raw file):
Previously, sumeerbhola wrote…
We should probably reduce FailoverOptions.UnhealthySamplingInterval to 25ms, given the default for the cluster setting is 100ms; polling every 25ms is cheap enough. Also, FailoverOptions.ElevatedWriteStallThresholdLag doesn't seem to have a default in Pebble (oversight?); something like 60s may be ok. And the HealthyInterval default of 2min is too long, I think; something like 15s seems more reasonable. Of course, these are all guesses without experimental validation under real conditions.

Put up cockroachdb/pebble#3413 to change them there.
Reviewed 5 of 8 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal and @jbowens)
pkg/server/config.go
line 761 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
you found it, right?

yep. But isn't it in the wrong commit, in that CockroachDB won't build after the first commit?
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal and @sumeerbhola)
pkg/server/config.go
line 761 at r1 (raw file):
Previously, sumeerbhola wrote…
yep. But isn't it in the wrong commit, in that CockroachDB won't build after the first commit?

Whoops, sorry, I somehow missed this. Moved it into the second commit and verified that both commits build.
bors r=sumeerbhola
Build failed (retrying...)
Build failed (retrying...)
bors r-
This looks like merge skew to me.
Canceled.
I believe the other commit within the batch (from #119885) contains the skew. Edit: They both did.

bors r=sumeerbhola
Add reference counting to fs.Envs, so that multiple Engines can take on references to an Env. The underlying Env's resources won't be released until all references are released. This will be used by WAL failover to ensure that an fs.Env used as a failover destination isn't closed prematurely.

Epic: none
Release note: none
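To illustrate the reference-counting scheme this commit message describes, here is a hypothetical sketch; the type and method names are invented for this example and are not CockroachDB's actual fs.Env API. Each Engine, including a WAL-failover destination, holds a reference, and the Env's resources are released only when the last reference is dropped.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// refCountedEnv is a hypothetical stand-in for a reference-counted fs.Env.
// Engines call Ref before using the Env and Unref when they close; the
// underlying resources are released only when the count reaches zero.
type refCountedEnv struct {
	refs    atomic.Int32
	release func() // releases the underlying resources
}

func newRefCountedEnv(release func()) *refCountedEnv {
	e := &refCountedEnv{release: release}
	e.refs.Store(1) // the creator holds the initial reference
	return e
}

// Ref adds a reference, e.g. when a WAL-failover destination adopts this Env.
func (e *refCountedEnv) Ref() { e.refs.Add(1) }

// Unref drops a reference and releases resources once no Engine holds one.
func (e *refCountedEnv) Unref() {
	if e.refs.Add(-1) == 0 {
		e.release()
	}
}

func main() {
	env := newRefCountedEnv(func() { fmt.Println("env resources released") })
	env.Ref()   // a second engine (the failover destination) takes a reference
	env.Unref() // first engine closes; resources stay live
	env.Unref() // last reference dropped; resources released now
}
```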
Introduce support for configuring a multi-store CockroachDB node to failover a store's write-ahead log (WAL) to another store's data directory. Failing over the write-ahead log may allow some operations against a store to continue to complete despite temporary unavailability of the underlying storage.

Customers must opt into WAL failover by passing `--wal-failover=among-stores` to `cockroach start` or setting the env var `COCKROACH_WAL_FAILOVER=among-stores`. On start, cockroach will assign each store another store to be its failover destination. Cockroach will begin monitoring the latency of all WAL writes. If latency to the WAL exceeds the value of the storage.wal_failover.unhealthy_op_threshold cluster setting, Cockroach will attempt to write WAL entries to its secondary store's volume. If a user wishes to disable WAL failover, they must restart the node setting `--wal-failover=disabled`.

Close cockroachdb#119418.
Informs cockroachdb/pebble#3230

Epic: CRDB-35401

Release note (ops change): Introduces a new start option (--wal-failover or COCKROACH_WAL_FAILOVER env var) to opt into failing over WALs between stores in multi-store nodes. Introduces a new storage.wal_failover.unhealthy_op_threshold cluster setting for configuring the latency threshold at which a WAL write is considered unhealthy.
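As a rough, self-contained illustration of the monitoring behavior the commit message describes, the sketch below observes WAL write latencies and switches to the secondary once a write exceeds the unhealthy-op threshold. All names here are hypothetical; the real logic lives in Pebble's WAL failover machinery, not in a struct like this.

```go
package main

import (
	"fmt"
	"time"
)

// walMonitor is a hypothetical illustration: if the latency of a WAL write
// exceeds the configured threshold (the storage.wal_failover.unhealthy_op_threshold
// cluster setting, 100ms by default), writes fail over to the secondary
// store's volume.
type walMonitor struct {
	unhealthyOpThreshold time.Duration
	usingSecondary       bool
}

// observeWrite records the latency of one WAL write and decides whether to
// fail over to the secondary.
func (m *walMonitor) observeWrite(latency time.Duration) {
	if !m.usingSecondary && latency > m.unhealthyOpThreshold {
		m.usingSecondary = true
		fmt.Printf("WAL write took %s (> %s); failing over to secondary\n",
			latency, m.unhealthyOpThreshold)
	}
}

func main() {
	m := &walMonitor{unhealthyOpThreshold: 100 * time.Millisecond}
	m.observeWrite(2 * time.Millisecond)   // healthy
	m.observeWrite(450 * time.Millisecond) // unhealthy: triggers failover
}
```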
Canceled.
bors r=sumeerbhola |
120783: storage: support WAL failover to an explicit path r=sumeerbhola a=jbowens

This commit expands on #120509, introducing a WAL failover mode that allows an operator of a node with a single store to configure WAL failover to failover to a particular path (rather than another store's directory). This is configured via the --wal-failover flag:

--wal-failover=path=/mnt/data2

When disabling or changing the path, the operator is required to pass the previous path. Eg,

--wal-failover=path=/mnt/data3,prev_path=/mnt/data2

or

--wal-failover=disabled,prev_path=/mnt/data2

Informs #119418.
Informs cockroachdb/pebble#3230

Epic: CRDB-35401

Release note (ops change): Adds an additional option to the new (in 24.1) --wal-failover CLI flag allowing an operator to specify an explicit path for WAL failover for single-store nodes.

Co-authored-by: Jackson Owens <[email protected]>
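The flag syntax described in that follow-up (among-stores, disabled, path=&lt;dir&gt;, plus the prev_path=&lt;dir&gt; qualifier required when changing or disabling an explicit path) could be parsed along the lines of the sketch below. The type, function, and error messages are illustrative assumptions, not CockroachDB's actual CLI code.

```go
package main

import (
	"fmt"
	"strings"
)

// walFailoverConfig is an illustrative representation of the --wal-failover
// flag values described above; it is not CockroachDB's actual config type.
type walFailoverConfig struct {
	Mode     string // "among-stores", "disabled", or "path"
	Path     string // set when Mode == "path"
	PrevPath string // required when disabling or changing an explicit path
}

// parseWALFailover parses values such as:
//
//	among-stores
//	path=/mnt/data2
//	path=/mnt/data3,prev_path=/mnt/data2
//	disabled,prev_path=/mnt/data2
func parseWALFailover(v string) (walFailoverConfig, error) {
	var cfg walFailoverConfig
	for i, part := range strings.Split(v, ",") {
		switch {
		case i == 0 && (part == "among-stores" || part == "disabled"):
			cfg.Mode = part
		case i == 0 && strings.HasPrefix(part, "path="):
			cfg.Mode, cfg.Path = "path", strings.TrimPrefix(part, "path=")
		case i > 0 && strings.HasPrefix(part, "prev_path="):
			cfg.PrevPath = strings.TrimPrefix(part, "prev_path=")
		default:
			return cfg, fmt.Errorf("malformed --wal-failover value %q", v)
		}
	}
	return cfg, nil
}

func main() {
	for _, v := range []string{
		"among-stores",
		"path=/mnt/data2",
		"disabled,prev_path=/mnt/data2",
	} {
		cfg, err := parseWALFailover(v)
		fmt.Println(v, "=>", cfg, err)
	}
}
```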