wal: reader batch corruption #3865

Closed
2 tasks done
jbowens opened this issue Aug 19, 2024 · 3 comments · Fixed by #3893
Assignees
Labels
A-storage · C-bug (Something isn't working) · GA-blocker · P-0 (Issues/test failures with a fix SLA of 2 weeks) · T-storage

Comments

@jbowens
Collaborator

jbowens commented Aug 19, 2024

While stitching together physical WAL segments, the WAL reader reads each record's contents into a temporary buffer. There's a bug where, if an error is encountered mid-copy of a record (e.g., because the record is large enough to be split across multiple chunks), the buffer is not reset before attempting to read the next file's record. This corruption could prevent replay of the WAL.
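
For illustration, here's a minimal, self-contained Go sketch of the failure mode. This is not Pebble's actual wal.Reader code; readRecord and failingReader are hypothetical stand-ins. An error cuts a copy short, the buffer keeps the stale partial bytes, and the next segment's record lands after them:

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// failingReader simulates a segment that errors mid-record, e.g. a WAL
// segment cut short by a crash during failover.
type failingReader struct{}

func (failingReader) Read([]byte) (int, error) {
	return 0, errors.New("mid-record read error")
}

// readRecord copies a record's contents from r into buf. On error, buf
// is left holding whatever was copied before the failure.
func readRecord(r io.Reader, buf *bytes.Buffer) error {
	_, err := io.Copy(buf, r)
	return err
}

func main() {
	var buf bytes.Buffer

	// Segment 1: a large record split across chunks, cut off mid-copy.
	seg1 := io.MultiReader(bytes.NewReader([]byte("partial-record|")), failingReader{})
	if err := readRecord(seg1, &buf); err != nil {
		fmt.Println("segment 1:", err)
	}

	// BUG: without a buf.Reset() here, segment 2's record is appended
	// after the stale partial contents.
	_ = readRecord(bytes.NewReader([]byte("next-record")), &buf)
	fmt.Printf("replayed record: %q\n", buf.String()) // "partial-record|next-record"
}
```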

  • fix wal.Reader
  • add randomized test of WAL failover and recovery

Jira issue: PEBBLE-244

@jbowens jbowens added the C-bug (Something isn't working) label Aug 19, 2024
@jbowens jbowens self-assigned this Aug 19, 2024
jbowens added a commit to jbowens/pebble that referenced this issue Aug 19, 2024
While stitching together physical WAL segments, the WAL reader reads each
record's contents into a temporary buffer. Previously, if an error was
encountered mid-copy of a record (e.g., because the record was large enough to
be split across multiple chunks), the buffer was not reset before attempting to
read the next file's record. This corruption could prevent replay of the WAL.

This commit fixes the bug by moving the buffer reset to immediately precede its
use.

Informs cockroachdb#3865.
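
As a minimal sketch of the pattern the fix describes (assuming a hypothetical nextRecord helper, not the actual wal.Reader diff): with the Reset immediately preceding the buffer's use, no earlier mid-copy failure can leak bytes into the next record.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// nextRecord copies the next record from r into buf and returns its
// contents. The Reset immediately precedes the buffer's use, so a failed
// copy on an earlier call cannot contaminate this record.
func nextRecord(r io.Reader, buf *bytes.Buffer) ([]byte, error) {
	buf.Reset() // the fix: reset at the point of use, not after a successful read
	if _, err := io.Copy(buf, r); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	var buf bytes.Buffer
	for _, seg := range []string{"record-a", "record-b"} {
		rec, err := nextRecord(strings.NewReader(seg), &buf)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q\n", rec) // "record-a", then "record-b"; never concatenated
	}
}
```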
@exalate-issue-sync exalate-issue-sync bot added the P-0 (Issues/test failures with a fix SLA of 2 weeks) label Aug 20, 2024
jbowens added a commit that referenced this issue Aug 20, 2024
jbowens added a commit to jbowens/pebble that referenced this issue Aug 20, 2024
jbowens added a commit to jbowens/pebble that referenced this issue Aug 20, 2024
@dshjoshi

dshjoshi commented Aug 20, 2024

What sequence of user actions led to this issue being uncovered? We should add a roachtest operation for this scenario:
"WAL failover is enabled and the disk-stall duration exceeds 20 seconds (the max sync duration); the node should then be crashed by the disk-stall detector, followed by self-recovery of the node."

@dshjoshi dshjoshi moved this from Incoming to In Progress (this milestone) in [Deprecated] Storage Aug 20, 2024
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Aug 20, 2024
129331: go.mod: bump Pebble to 3d14906a0e0c r=itsbilal a=jbowens

Changes:

 * [`3d14906a`](cockroachdb/pebble@3d14906a) wal: fix reader buffer reuse
 * [`c2749cc7`](cockroachdb/pebble@c2749cc7) colblk: use crlib/crbytes.CommonPrefix
 * [`f90b350e`](cockroachdb/pebble@f90b350e) docs/rfc: add RFC for point tombstone density compaction heuristic
 * [`1a5295f8`](cockroachdb/pebble@1a5295f8) block: move block handle encoding and decoding
 * [`0e8a11a5`](cockroachdb/pebble@0e8a11a5) metamorphic: no synthetic suffix with overlapping RangeKeySets

Release note (bug fix): Fix bug in the public preview WAL failover feature that could prevent a node from starting if it crashed during a failover. Epic: none.

Informs cockroachdb/pebble#3865.

Co-authored-by: Jackson Owens <[email protected]>
jbowens added a commit that referenced this issue Aug 20, 2024
jbowens added a commit that referenced this issue Aug 20, 2024
jbowens added a commit to jbowens/cockroach that referenced this issue Aug 20, 2024
Changes:

 * [`02bb64a5`](cockroachdb/pebble@02bb64a5) wal: fix reader buffer reuse
 * [`72e9205d`](cockroachdb/pebble@72e9205d) db: export MakeTrailer
 * [`37311795`](cockroachdb/pebble@37311795) internal/keyspan: export Truncate in keyspan.Fragmenter.

Release note (bug fix): Fix bug in the public preview WAL failover feature that
could prevent a node from starting if it crashed during a failover.
Epic: none.
Informs cockroachdb/pebble#3865.
jbowens added a commit to jbowens/cockroach that referenced this issue Aug 20, 2024
Changes:

 * [`211dce02`](cockroachdb/pebble@211dce02) wal: fix reader buffer reuse
 * [`cd809634`](cockroachdb/pebble@cd809634) db: export MakeTrailer
 * [`f318ad84`](cockroachdb/pebble@f318ad84) internal/keyspan: export Truncate in keyspan.Fragmenter.

Release note (bug fix): Fix bug in the public preview WAL failover feature that could prevent a node from starting if it crashed during a failover.
Epic: none.
Informs cockroachdb/pebble#3865.
jbowens added a commit to jbowens/pebble that referenced this issue Aug 27, 2024
Add a randomized test of WAL failover and recovery of a DB from failover WAL
logs.

Close cockroachdb#3865.
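
The real test lives in the Pebble PR; as a toy, self-contained analogue of the idea, one can randomly truncate the first segment mid-record, as a crash during failover would, and assert that the record read from the next segment comes back uncontaminated:

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"math/rand"
)

// errReader simulates a segment cut off by a crash.
type errReader struct{}

func (errReader) Read([]byte) (int, error) {
	return 0, errors.New("simulated crash mid-record")
}

// nextRecord resets buf immediately before use (the fixed pattern above)
// and copies the record from r.
func nextRecord(r io.Reader, buf *bytes.Buffer) ([]byte, error) {
	buf.Reset()
	if _, err := io.Copy(buf, r); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	var buf bytes.Buffer
	for i := 0; i < 1000; i++ {
		recA := bytes.Repeat([]byte{'a'}, 1+rand.Intn(64))
		recB := bytes.Repeat([]byte{'b'}, 1+rand.Intn(64))

		// Segment 1 is truncated at a random point inside recA.
		seg1 := io.MultiReader(bytes.NewReader(recA[:rand.Intn(len(recA))]), errReader{})
		if _, err := nextRecord(seg1, &buf); err == nil {
			panic("expected a mid-record error")
		}

		// Segment 2 must read back exactly recB, with no leaked bytes.
		got, err := nextRecord(bytes.NewReader(recB), &buf)
		if err != nil || !bytes.Equal(got, recB) {
			panic(fmt.Sprintf("iteration %d: corrupted record %q", i, got))
		}
	}
	fmt.Println("ok")
}
```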
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Aug 28, 2024
…on open

This also ensures we don't leak a file descriptor.

Informs cockroachdb#3865
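
The commit title is truncated above, so the following is only a hedged, generic sketch of the descriptor-leak concern, not the actual Pebble change; openSegment is a hypothetical name. Any error path taken after a successful open must close the file, or a descriptor leaks on every failed open:

```go
package main

import (
	"fmt"
	"os"
)

// openSegment opens a WAL segment and validates it. If any step after
// os.Open fails, the file must be closed before returning; otherwise the
// process leaks one descriptor per failed open.
func openSegment(path string) (*os.File, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	info, err := f.Stat()
	if err != nil || info.Size() == 0 {
		f.Close() // avoid leaking the descriptor on the error path
		if err == nil {
			err = fmt.Errorf("%s: empty segment", path)
		}
		return nil, err
	}
	return f, nil
}

func main() {
	if _, err := openSegment("/nonexistent"); err != nil {
		fmt.Println("open failed:", err)
	}
}
```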
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Aug 28, 2024
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Aug 28, 2024
jbowens added a commit to jbowens/pebble that referenced this issue Aug 28, 2024
sumeerbhola added a commit that referenced this issue Aug 28, 2024
jbowens added a commit to jbowens/pebble that referenced this issue Aug 28, 2024
jbowens added a commit that referenced this issue Aug 28, 2024
@github-project-automation github-project-automation bot moved this from In Progress (this milestone) to Done in [Deprecated] Storage Aug 28, 2024
jbowens pushed a commit to jbowens/pebble that referenced this issue Aug 28, 2024
jbowens pushed a commit to jbowens/pebble that referenced this issue Aug 28, 2024
jbowens pushed a commit that referenced this issue Aug 28, 2024
jbowens pushed a commit that referenced this issue Aug 28, 2024
@dshjoshi

dshjoshi commented Sep 3, 2024

@jbowens Have we already added the "add randomized test of WAL failover and recovery" task?

@jbowens
Collaborator Author

jbowens commented Sep 3, 2024

Yes
