Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix storage worker #3820

Merged
merged 6 commits into from
Mar 31, 2021
Merged

Fix storage worker #3820

merged 6 commits into from
Mar 31, 2021

Conversation

jberci
Copy link
Contributor

@jberci jberci commented Mar 29, 2021

Avoid database corruption in certain cases.

  • Root existence checking in Finalize
  • Detect Finalize failures in the worker and don't update syncing metadata
  • Handle Apply failures during sync (+byzantine test for it)
  • Sync heartbeat independent of incoming blocks

@jberci jberci force-pushed the jberci/fix/storage branch from 24a416e to 488f863 Compare March 29, 2021 16:04
@jberci jberci force-pushed the jberci/fix/storage branch from 20b00a9 to f6a7f15 Compare March 30, 2021 13:28
@codecov
Copy link

codecov bot commented Mar 30, 2021

Codecov Report

Merging #3820 (2b80e98) into master (cd0686f) will decrease coverage by 0.04%.
The diff coverage is 96.82%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3820      +/-   ##
==========================================
- Coverage   67.27%   67.22%   -0.05%     
==========================================
  Files         404      404              
  Lines       40889    40952      +63     
==========================================
+ Hits        27508    27530      +22     
- Misses       9512     9541      +29     
- Partials     3869     3881      +12     
Impacted Files Coverage Δ
go/storage/client/client.go 74.66% <83.33%> (+0.12%) ⬆️
go/worker/storage/committee/node.go 78.05% <97.64%> (+1.65%) ⬆️
go/oasis-node/cmd/node/node.go 53.66% <100.00%> (+0.10%) ⬆️
go/storage/api/context.go 100.00% <100.00%> (ø)
go/storage/mkvs/db/badger/badger.go 68.10% <100.00%> (-0.19%) ⬇️
go/worker/common/committee/node.go 70.76% <100.00%> (+0.15%) ⬆️
go/worker/common/worker.go 89.74% <100.00%> (+0.26%) ⬆️
go/worker/storage/committee/utils.go 100.00% <100.00%> (ø)
go/oasis-node/cmd/common/metrics/disk.go 65.38% <0.00%> (-19.24%) ⬇️
go/oasis-node/cmd/common/metrics/resource.go 78.94% <0.00%> (-10.53%) ⬇️
... and 30 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cd0686f...2b80e98. Read the comment docs.

@jberci jberci force-pushed the jberci/fix/storage branch from f6a7f15 to e89db28 Compare March 30, 2021 14:06
@jberci jberci marked this pull request as ready for review March 30, 2021 14:13
go/oasis-test-runner/scenario/e2e/runtime/byzantine.go Outdated Show resolved Hide resolved
// watcherState is the (persistent) watcher state.
type watcherState struct {
LastBlock blockSummary `json:"last_block"`
}

// Node watches blocks for storage changes.
type Node struct { // nolint: maligned
hostNode control.ControlledNode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this rather be part of the common committee node? It feels like it belongs there instead.

@jberci jberci force-pushed the jberci/fix/storage branch from e89db28 to 9631a2a Compare March 31, 2021 11:05
jberci and others added 4 commits March 31, 2021 13:14
Before this change, rounds were considered successfully finalized even
if the operation itself actually failed. This would technically lead a
corrupt database state.

On the other hand, the worker doesn't have much recourse to handle the
situation, so the effect with this fix is to stop syncing (since no
further rounds will be able to be finalized).
Errors reported by Apply were ignored and the worker just carried on
with a corrupt database.
@jberci jberci force-pushed the jberci/fix/storage branch from 9631a2a to 0439098 Compare March 31, 2021 11:15
// It can happen that for this test, the client will keep selecting the byzantine
// node instead of the other storage nodes and so would never be able to fully sync.
// It can take up to two minutes for storage-0 to sync in this scenario with the
// current shuffling method in the storage client. See also #1815.
Copy link
Member

@ptrus ptrus Mar 31, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe comment on the issue #1815 to remove this after it's done.

@jberci jberci force-pushed the jberci/fix/storage branch from 0439098 to 2b80e98 Compare March 31, 2021 11:45
@jberci jberci merged commit e2f6301 into master Mar 31, 2021
@jberci jberci deleted the jberci/fix/storage branch March 31, 2021 12:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants