Tests and fixes for oplog corruption bug #962

Merged
merged 1 commit into from
Sep 23, 2024

vigoo commented Sep 23, 2024

The bug occurred when all of the following conditions were met:

  • All entries of the oplog had been moved from the primary oplog to archive layers
  • All in-memory oplog-related instances for the given worker had already been dropped from all caches
  • The worker was reopened by a new invocation or update request

In that case, the newly opened oplog could use a wrong view of the "last oplog index" (looking only at the primary layer), so the next written entry got the wrong identifier (1). Because this index (and many of the following ones) was already used and stored in one of the archive layers, the result was a corrupt oplog, leading to many unexpected issues.
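The difference between the two views of the "last oplog index" can be illustrated with a small, self-contained model. This is only a sketch under assumptions: `Layer`, `LayeredOplog`, and all method names are hypothetical and are not the types used in this repository; indices are assumed to start at 1, as the description suggests.

```rust
use std::collections::BTreeMap;

/// Hypothetical storage layer (primary or archive); keys are oplog indices.
#[derive(Default)]
pub struct Layer {
    entries: BTreeMap<u64, String>,
}

impl Layer {
    /// Highest index stored in this layer, or `None` if the layer is empty.
    pub fn last_index(&self) -> Option<u64> {
        self.entries.keys().next_back().copied()
    }
}

/// Hypothetical oplog made of one primary layer and any number of archive layers.
pub struct LayeredOplog {
    pub primary: Layer,
    pub archives: Vec<Layer>,
}

impl LayeredOplog {
    /// Problematic view: only the primary layer is consulted, so a fully
    /// archived oplog looks empty and reports 0.
    pub fn last_index_primary_only(&self) -> u64 {
        self.primary.last_index().unwrap_or(0)
    }

    /// Corrected view: the maximum index over the primary and all archive layers.
    pub fn last_index_all_layers(&self) -> u64 {
        std::iter::once(&self.primary)
            .chain(self.archives.iter())
            .filter_map(|layer| layer.last_index())
            .max()
            .unwrap_or(0)
    }
}

/// Builds a sample where entries 1..=3 have all been moved to an archive layer.
pub fn fully_archived_sample() -> LayeredOplog {
    let mut archive = Layer::default();
    for idx in 1..=3u64 {
        archive.entries.insert(idx, format!("entry {idx}"));
    }
    LayeredOplog {
        primary: Layer::default(),
        archives: vec![archive],
    }
}
```

With `fully_archived_sample()`, the primary-only view yields a next index of 1, which collides with an archived entry, while the all-layers view yields 4.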

This pull request:

  • Provides a test that reproduces the issue and proves the fix
  • Fixes the root cause
  • Reduces the impact of opening such a corrupt oplog: instead of panicking, it is treated as if the worker had failed.
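
The last point, failing the worker instead of panicking, could look roughly like the following. This is a hedged sketch, not the code in this pull request: `OplogError`, `WorkerState`, `validate`, and `open_worker` are hypothetical names, and it assumes that corruption shows up as an index that breaks the 1, 2, 3, ... sequence when entries are read back.

```rust
/// Hypothetical error describing a break in the expected index sequence.
#[derive(Debug, PartialEq, Eq)]
pub enum OplogError {
    UnexpectedIndex { expected: u64, found: u64 },
}

/// Hypothetical outcome of opening a worker from its stored oplog.
#[derive(Debug, PartialEq, Eq)]
pub enum WorkerState {
    /// The oplog is consistent; `next_index` is where the next entry goes.
    Ready { next_index: u64 },
    /// The oplog is corrupt; the worker is reported as failed.
    Failed(OplogError),
}

/// Checks that the indices read back are exactly 1, 2, 3, ... and returns
/// the next index to write, or an error at the first mismatch.
pub fn validate(indices: &[u64]) -> Result<u64, OplogError> {
    let mut expected = 1u64;
    for &found in indices {
        if found != expected {
            return Err(OplogError::UnexpectedIndex { expected, found });
        }
        expected += 1;
    }
    Ok(expected)
}

/// Opens a worker without panicking: a corrupt oplog becomes a failed worker.
pub fn open_worker(indices: &[u64]) -> WorkerState {
    match validate(indices) {
        Ok(next_index) => WorkerState::Ready { next_index },
        Err(err) => WorkerState::Failed(err),
    }
}
```

Returning a value rather than calling `panic!` keeps the failure local to the affected worker in this model; the caller decides how to surface it.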

(Also updates wasm-rpc to 1.0.3 as it contains some important stub generator fixes.)


noise64 commented Sep 23, 2024

Can getting the last oplog index and opening the oplog be in a race condition?

vigoo commented Sep 23, 2024

> Can getting the last oplog index and opening the oplog be in a race condition?

Only in split-brain scenarios (if another executor is writing the oplog).

Normally there is only a single instance of Oplog per worker, and that instance is the only way to append to the oplog.

vigoo merged commit 08eedaa into main on Sep 23, 2024
17 checks passed
vigoo deleted the oplog-fix branch on September 23, 2024, 11:47