Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go/worker/common: Refresh node descriptors mid-epoch #2584

Merged
merged 4 commits into from
Jan 28, 2020

Conversation

kostko
Copy link
Member

@kostko kostko commented Jan 22, 2020

Fixes #1794.
Fixes #2193

Previously node descriptors were only refreshed on an epoch transition which
meant that any later updates were ignored. This caused stale RAKs to stay in
effect when runtime restarts happened.

Enabling mid-epoch refresh also makes having more ephemeral keys easier.

TODO

  • Watch node descriptors of committee members.
  • Remove NodeInfo from commitment pool.
  • Add a hook to get notified of node descriptor updates for things like gRPC policy updates.
  • Add general node/committee watcher interfaces.
  • Finish gRPC committee client.
  • Replace custom watchers in storage/runtime clients with generalized interfaces.
  • Replace custom watchers in keymanager clients with generalized interfaces.
  • Tests for the new general interface.

@codecov
Copy link

codecov bot commented Jan 22, 2020

Codecov Report

Merging #2584 into master will decrease coverage by 0.09%.
The diff coverage is 73.61%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #2584     +/-   ##
=========================================
- Coverage   63.47%   63.38%   -0.1%     
=========================================
  Files         357      359      +2     
  Lines       33682    33841    +159     
=========================================
+ Hits        21381    21449     +68     
- Misses       9693     9764     +71     
- Partials     2608     2628     +20
Impacted Files Coverage Δ
go/consensus/tendermint/apps/roothash/roothash.go 72.6% <ø> (+0.03%) ⬆️
go/worker/common/worker.go 88.14% <ø> (-0.26%) ⬇️
go/storage/init.go 83.33% <ø> (-0.46%) ⬇️
go/oasis-node/cmd/node/node.go 57.42% <ø> (+0.38%) ⬆️
go/storage/api/grpc.go 71% <0%> (+0.21%) ⬆️
go/runtime/client/client.go 67.04% <0%> (-2.31%) ⬇️
go/worker/compute/merge/committee/node.go 76.5% <100%> (+1.2%) ⬆️
go/runtime/tagindexer/tagindexer.go 72.82% <100%> (+3.86%) ⬆️
go/worker/compute/executor/committee/node.go 62.84% <100%> (+0.07%) ⬆️
.../consensus/tendermint/apps/roothash/state/round.go 85.71% <100%> (ø) ⬆️
... and 49 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update baa8497...cc98774. Read the comment docs.

@kostko kostko force-pushed the kostko/feature/group-node-refresh branch 7 times, most recently from 7f3f147 to c607f08 Compare January 24, 2020 10:12
@kostko kostko force-pushed the kostko/feature/group-node-refresh branch 8 times, most recently from f11ae03 to 08bfcc4 Compare January 28, 2020 13:52
@kostko kostko marked this pull request as ready for review January 28, 2020 14:05
@kostko kostko requested a review from peterjgilbert as a code owner January 28, 2020 14:05
@kostko kostko force-pushed the kostko/feature/group-node-refresh branch from 08bfcc4 to d52b3da Compare January 28, 2020 16:53
Previously node descriptors were only refreshed on an epoch transition which
meant that any later updates were ignored. This caused stale RAKs to stay in
effect when runtime restarts happened.

Enabling mid-epoch refresh also makes having more ephemeral keys easier.
Commit 2fdfd28 introduced a possibility of the
storage round syncing process to deadlock when an unfortunate series of events
occur in specific order.

Assume the following happens, in order:

1. Round X has just completed syncing and has been applied, so
   lastFullyAppliedRound has been updated to X and all metadata about round X
   has been removed from syncingRounds and hashCache. Since the round has not
   yet been finalized, cachedLastRound still points to X-1.

2. A new incoming block for round X+1 appears.

3. The code checks what needs to be synced and starts with cachedLastRound which
   still points to round X-1 (because round X has not yet been applied). Because
   sync metadata of round X have been removed, it assumes that round X is not
   yet synced so it starts syncing it by queuing sync requests and adding new
   metadata to the top of the syncingRounds heap.

4. Round X is finalized, so cachedLastRound is updated to X.

Since the top of the sync management loop checks if lastFullyAppliedRound + 1 is
at the top of the heap this leads to a deadlock. The top item contains metadata
for round X while the management loop is waiting for X+1.

This commit fixes the issue by using lastFullyAppliedRound in the incoming block
handler instead of cachedLastRound.
Also use new generalized interfaces for managing gRPC connections to committee
members (with support for mid-epoch refresh).
@kostko kostko force-pushed the kostko/feature/group-node-refresh branch from d52b3da to cc98774 Compare January 28, 2020 16:54
Copy link
Contributor

@Yawning Yawning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@kostko kostko merged commit a48cd81 into master Jan 28, 2020
@kostko kostko deleted the kostko/feature/group-node-refresh branch January 28, 2020 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

key manager: client mixes together all nodes' certs Support for node re-registration
2 participants