
Fix P-chain validator set lookup race condition #2672

Merged
merged 3 commits into master from fix-validator-set-lookup-race on Jan 28, 2024

Conversation

StephenButtolph
Contributor

Why this should be merged

The P-chain p2p network currently fetches the primary network validator set without first grabbing the P-chain context lock. This can produce racy (incorrect) validator set calculations that are then cached, causing persistently incorrect validator set lookups at the P-chain heights where the race occurred.

How this works

  • Wraps the validators.State passed into the p2p network with the context lock.
  • Exports a number of utility functions from the p2p library to facilitate testing.

How this was tested

  • New unit test (fails on master)
  • Ran a node on Fuji

Comment on lines +208 to +211
validators.NewLockedState(
	&chainCtx.Lock,
	validatorManager,
),
Contributor Author

This is the actual fix. chainCtx.ValidatorState was replaced with validatorManager because chainCtx.ValidatorState is actually equal to vm here, which I found very confusing. Additionally, this makes testing much easier because we no longer need to mock the ValidatorState.

@@ -2218,6 +2225,61 @@ func TestSubnetValidatorSetAfterPrimaryNetworkValidatorRemoval(t *testing.T) {
require.NoError(err)
}

func TestValidatorSetRaceCondition(t *testing.T) {
Contributor Author

I couldn't find a cleaner way to write this test... Very open to suggestions.

Comment on lines +14 to +59
func MarshalAppRequest(filter, salt []byte) ([]byte, error) {
	request := &sdk.PullGossipRequest{
		Filter: filter,
		Salt:   salt,
	}
	return proto.Marshal(request)
}

func ParseAppRequest(bytes []byte) (*bloom.ReadFilter, ids.ID, error) {
	request := &sdk.PullGossipRequest{}
	if err := proto.Unmarshal(bytes, request); err != nil {
		return nil, ids.Empty, err
	}

	salt, err := ids.ToID(request.Salt)
	if err != nil {
		return nil, ids.Empty, err
	}

	filter, err := bloom.Parse(request.Filter)
	return filter, salt, err
}

func MarshalAppResponse(gossip [][]byte) ([]byte, error) {
	return proto.Marshal(&sdk.PullGossipResponse{
		Gossip: gossip,
	})
}

func ParseAppResponse(bytes []byte) ([][]byte, error) {
	response := &sdk.PullGossipResponse{}
	err := proto.Unmarshal(bytes, response)
	return response.Gossip, err
}

func MarshalAppGossip(gossip [][]byte) ([]byte, error) {
	return proto.Marshal(&sdk.PushGossip{
		Gossip: gossip,
	})
}

func ParseAppGossip(bytes []byte) ([][]byte, error) {
	msg := &sdk.PushGossip{}
	err := proto.Unmarshal(bytes, msg)
	return msg.Gossip, err
}
Contributor Author

Technically we don't need to export all of these utilities, but I felt it made the code easier to understand. If we'd prefer to marshal the proto messages directly, I'll revert these. @joshua-kim thoughts?
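The exported helpers give tests a marshal/parse round trip without touching the wire types directly. A self-contained sketch of that round-trip shape is below; it uses `encoding/json` and a stand-in struct purely so the example runs without the sdk proto package (the real code uses `proto.Marshal` on `sdk.PullGossipRequest`):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pullGossipRequest is a JSON stand-in for sdk.PullGossipRequest; the
// real code serializes with protobuf, not JSON.
type pullGossipRequest struct {
	Filter []byte `json:"filter"`
	Salt   []byte `json:"salt"`
}

// marshalAppRequest mirrors the shape of the exported MarshalAppRequest.
func marshalAppRequest(filter, salt []byte) ([]byte, error) {
	return json.Marshal(&pullGossipRequest{Filter: filter, Salt: salt})
}

// parseAppRequest mirrors the shape of the exported ParseAppRequest.
func parseAppRequest(b []byte) ([]byte, []byte, error) {
	req := &pullGossipRequest{}
	if err := json.Unmarshal(b, req); err != nil {
		return nil, nil, err
	}
	return req.Filter, req.Salt, nil
}

func main() {
	b, err := marshalAppRequest([]byte{0x01}, []byte{0x02})
	if err != nil {
		panic(err)
	}
	filter, salt, err := parseAppRequest(b)
	if err != nil {
		panic(err)
	}
	fmt.Println(filter[0], salt[0]) // prints 1 2
}
```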

Contributor

Strongly prefer this

Contributor

This is better. It also helps because we frequently re-write this logic in tests.

Comment on lines +408 to +409
handlerStr := strconv.FormatUint(handlerID, 10)

Contributor Author

We could reduce the diff here, but I don't think there is any good reason to do this with the lock held.
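The point is that `strconv.FormatUint` touches no shared state, so the formatting can happen before the lock is acquired, shrinking the critical section. A minimal sketch of the pattern, with a hypothetical `handlerRegistry` (not the actual p2p network struct):

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// handlerRegistry is a hypothetical mutex-guarded map of handler IDs.
type handlerRegistry struct {
	mu       sync.Mutex
	handlers map[string]bool
}

func (r *handlerRegistry) register(handlerID uint64) string {
	// Format the ID before taking the lock: FormatUint reads no shared
	// state, so doing it here keeps the critical section minimal.
	handlerStr := strconv.FormatUint(handlerID, 10)

	r.mu.Lock()
	defer r.mu.Unlock()
	r.handlers[handlerStr] = true
	return handlerStr
}

func main() {
	r := &handlerRegistry{handlers: map[string]bool{}}
	fmt.Println(r.register(408)) // prints 408
}
```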

network/p2p/client.go (review thread resolved)
@@ -277,13 +277,14 @@ func defaultVM(t *testing.T, fork activeFork) (*VM, database.Database, *mutableS
return nil
}

dynamicConfigBytes := []byte(`{"network":{"max-validator-set-staleness":0}}`)
Contributor Author

There might be other configs worth defaulting to here, but this one disables the caching optimization so that we re-calculate the validator set aggressively.

@@ -205,7 +205,10 @@ func (vm *VM) Initialize(
chainCtx.Log,
chainCtx.NodeID,
chainCtx.SubnetID,
chainCtx.ValidatorState,
validators.NewLockedState(
Contributor

Are there other VMs that provide the raw state here?

Probably makes sense to double-check C/X-chain initialization.

Contributor Author

This is a P-chain specific bug. In the C-chain and the X-chain the ctx.ValidatorState is already locked.

@StephenButtolph StephenButtolph added this pull request to the merge queue Jan 28, 2024
Merged via the queue into master with commit 68980eb Jan 28, 2024
17 checks passed
@StephenButtolph StephenButtolph deleted the fix-validator-set-lookup-race branch January 28, 2024 15:38
Labels
bug (Something isn't working), incident response

4 participants