Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

miner: Add block building interruption on payload resolution (getPayload) #186

Merged
merged 14 commits into from
Dec 19, 2023

Conversation

sebastianst
Copy link
Member

@sebastianst sebastianst commented Nov 16, 2023

Description

  • Block building is interrupted on a getPayload engine api call.
  • Empty block is now only built for NoTxPool FCU calls, and as before, FCU blocks on building the empty block. This has the advantage that if generating a lot of empty blocks, getPayload can immediately be called after forkchoiceUpdated to return that empty block.
  • If using the tx pool, FCU doesn't block on any block building but returns immediately after creating the payload promise.
    • Resolution of the payload promise (getPayload) now waits for the first full block to be built, as there is no empty block to return. If at least one block has already been built, as before, it is returned immediately, while block building is interrupted in the background.

Tests

Extended the TestBuildPayload test to cover both cases of NoTxPool.

Added an ugly global bool to indicate that a test is running, which guarantees one full block is built and block building is not interrupted. This seemed to be the least-invasive way to make all tests still pass. This was necessary because a couple of test call Resolve or getPayload immediately after a FCU, but this would now interrupts the block building process, which made tests fail.

Additional context

Metadata

@sebastianst sebastianst force-pushed the seb/getpayload-interrupt branch 2 times, most recently from f07b924 to c4a92d2 Compare November 17, 2023 14:18
miner/worker.go Outdated Show resolved Hide resolved
Copy link
Member Author

@sebastianst sebastianst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notes from call with proto

miner/payload_building.go Outdated Show resolved Hide resolved
miner/payload_building.go Outdated Show resolved Hide resolved
miner/payload_building.go Show resolved Hide resolved
miner/payload_building.go Show resolved Hide resolved
miner/payload_building.go Show resolved Hide resolved
miner/payload_building.go Outdated Show resolved Hide resolved
miner/payload_building.go Outdated Show resolved Hide resolved
miner/worker.go Outdated Show resolved Hide resolved
@sebastianst sebastianst force-pushed the seb/getpayload-interrupt branch 3 times, most recently from 01831f8 to 8381705 Compare December 7, 2023 11:31
@sebastianst sebastianst marked this pull request as ready for review December 7, 2023 11:36
@sebastianst sebastianst force-pushed the seb/getpayload-interrupt branch from 8381705 to 552cd8a Compare December 8, 2023 09:27
Copy link
Contributor

@ajsutton ajsutton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When talking about returning the "full" block, are we literally waiting for it to be packed with as many tx as possible, or does the interrupt stop adding tx to the block?

The ideal behaviour when getPayload is called is basically:

  • If a valid block has already been built and is ready, return that and discard the in-progress block.
  • Otherwise, stop adding tx from the pool immediately and finalise the block in progress so it can be returned.

Potentially you could be even smarter and compare the value of the existing block to the value of the block in progress and finalized the block in progress if it's already worth more.

That may mean that we return a half-full block because that's all we had time to build but that's a lot better than being late delivering the block. I wonder if that potentially simplifies some of the logic here as well since FCU would always defer building to the worker thread and getPayload would always have a block to return.

eth/catalyst/api_test.go Outdated Show resolved Hide resolved
@sebastianst
Copy link
Member Author

sebastianst commented Dec 11, 2023

When talking about returning the "full" block, are we literally waiting for it to be packed with as many tx as possible, or does the interrupt stop adding tx to the block?

The interrupt stops adding to that "full" block that is currently being built.

The ideal behaviour when getPayload is called is basically:

  • If a valid block has already been built and is ready, return that and discard the in-progress block.
  • Otherwise, stop adding tx from the pool immediately and finalise the block in progress so it can be returned.

@ajsutton It is implemented like this 👍🏻

I now realize that there's one thing that can be optimized: in worker.generateWork, return immediately after fillTransactions if an interrupt is set and one block is already available. The current implementation still calls FinalizeAndAssemble even if the result is gonna be discarded. Will add this.

Potentially you could be even smarter and compare the value of the existing block to the value of the block in progress and finalized the block in progress if it's already worth more.

That may mean that we return a half-full block because that's all we had time to build but that's a lot better than being late delivering the block. I wonder if that potentially simplifies some of the logic here as well since FCU would always defer building to the worker thread and getPayload would always have a block to return.

This is exactly the implementation that I had up until last week. We then discussed last week that we rather want getPayload to instantly return if at least one block is already done, instead of it blocking on the currently ongoing block building process. This convinced me because my investigations showed that applying a single transaction, or calling FinalizeAndAssemble, might itself take several seconds in some instances. This isn't something that can easily be interrupted, since this is blocking somewhere down in some database access path.

The process could also happen with a timeout. Say, wait up to 10ms for the interrupted block building process to complete and the comparison to be done, and then return the existing block already if this times out. However, I remember that we discussed that an instant getPayload return is preferred.

If we want to revisit that decision, I suggest to do that in a follow-up.

Copy link
Contributor

@ajsutton ajsutton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, went through this carefully and it looks good to me. Only question is really around the global var and as mentioned, I think we can avoid it but you may have seen something I missed.

miner/payload_building.go Outdated Show resolved Hide resolved
eth/catalyst/api_test.go Outdated Show resolved Hide resolved
@sebastianst
Copy link
Member Author

@ajsutton thanks for review, I'm happy with your test workaround as well, will merge it into this pr

sebastianst and others added 9 commits December 14, 2023 22:58
We only build the empty block if we don't use the tx pool.
So if we use the tx pool, a forkchoiceUpdated call would miss
the implicit validation that's happening during empty block building,
so we need to add it back.
This commit changes the way the block builder/update routine and the
resolution functions Resolve and ResolveFull synchronize.
Resolve(Full) now signal the payload builder to pause and set the
interrupt signal in case any block building is ongoing. They then
wait for the interrupted block building to complete.

This allowed to simplify the Payload implementation somewhat because the
builder routine is now guaranteed to return before the resulting fields
(full, fullFees etc) are read, and closing of the `stop` channel is now
synchronized with a sync.Once. So the mutex and conditional variable
could be removed and we only use two simple signalling channels
`stop` and `done` for synchronization.
Some test in the miner and catalyst package assume that getPayload
can be immediately called after forkchoiceUpdated and then to return
some built block. Because of the new behavior of payload resolution to
interrupt any ongoing payload building process, this creates a race
condition on block building.

The new testing mode, which can be enabled by setting the package variable
IsPayloadBuildingTest to true, guarantees that always at least one
full block is built.

It's hacky, but seems to be the easiest and less-intrusive way to enable
the new behavior of payload resolution while still keeping all tests
happy.
- Priotize stop signal over recommit
- Don't start payload building update if last update duration
  doesn't fit until slot timeout.
When resolving, we don't want to wait for the latest update.
If a full block is available, we just return that one, as before.
Payload building is still interrupted, but exits in the background.
Use a longer wait in tests for the payload to build.
@sebastianst sebastianst force-pushed the seb/getpayload-interrupt branch from 9104cc0 to d880c81 Compare December 14, 2023 21:59
@protolambda
Copy link
Collaborator

protolambda commented Dec 15, 2023

Reviewed the payload-interrupt code, and the discussion about the half-full block, but I think we are not interrupting correctly on the first payload currently:

  • If resolve(false), and if there's no block yet, it currently does not interrupt the block building at all: it waits for the payload with payload.cond.Wait(), and only then calls payload.stopBuilding().
  • If an interrupt is received by the worker, during fillTransactions, then it doesn't seal the work, and returns an error instead. The error is propagated to payload.update(), and would set the payload.cond, but due to order of operations it'll be waiting for this cond before firing the interrupt.

The above two things combined mean we can't fire stopBuilding() early (due to lack of sealing on interrupt error)), and are forcing ourselves to build the full 30M gas block (due to lack of interrupt, since we wait for the cond if it's the first block).

I think we should add a flag to generateParams that defines whether or not we want to seal the half-full block after interrupt, or return an interrupt error.
And then run stopBuilding() before running the payload.cond.Wait(), to get that half-full block result, rather than not interrupting the first block. (This should work fine, since the interrupt is clean (we stop adding more txs, it doesn't leave the block-building state in an inconsistent place).

That way we don't get hung up on large 30M block building attempts if it's the first block building attempt.

Edit: I missed the isUpdate flag, it's already there. So it's mostly just the stopBuilding call we may want to move up, before the cond.Wait.

@sebastianst
Copy link
Member Author

sebastianst commented Dec 15, 2023

Thanks Proto, I moved the interrupt now before potentially waiting on the conditional, so we interrupt even the first block building routine. It was kind of a deliberate decision at the time to build the first block fully, but it's indeed better to even interrupt the first block, because even that one may take a long time.

As you wrote in the edit, I already added the isUpdate flag to generateParams which signals whether we seal an interrupted block or not.

I added another test that confirms that interruption actually works.

I now had to add more sleeps to the eth/catalyst/api_test.go tests, which now lets the tests run for ~143 s (measured with time go test ./eth/catalyst -count 1) instead of 2.7s! This was the reason in the first place why I just disabled interrupts completely in tests with the global var hack, because all geth tests assume that resolve doesn't interrupt. I personally would still favor just disabling it for the tests in eth/catalyst rather than adding sleeps.

Also added interrupt test.
Had to add sleep to make non-interrupt test work.
@sebastianst sebastianst force-pushed the seb/getpayload-interrupt branch from fd3c8d7 to bf9ea34 Compare December 15, 2023 15:19
@ajsutton
Copy link
Contributor

hmm, so there's a failure in TestSimulatedBeaconSendWithdrawals which is using GetPayload straight after starting block building (

var random [32]byte
rand.Read(random[:])
fcResponse, err := c.engineAPI.ForkchoiceUpdatedV2(c.curForkchoiceState, &engine.PayloadAttributes{
Timestamp: tstamp,
SuggestedFeeRecipient: feeRecipient,
Withdrawals: withdrawals,
Random: random,
})
if err != nil {
return err
}
if fcResponse == engine.STATUS_SYNCING {
return errors.New("chain rewind prevented invocation of payload creation")
}
envelope, err := c.engineAPI.getPayload(*fcResponse.PayloadID, true)
if err != nil {
return err
}
payload := envelope.ExecutionPayload
) so that's another place we'd need to wait for the building to complete.

I've exposed a couple of methods to deterministically wait for the block building to complete rather than needing a fixed timeout in #204 and hooked that up in the simulated beacon as well. We may need to add appropriate calls to wait in more tests yet which will be an annoying source of flakiness for a while but I think this is the right approach to deterministically wait. Definitely open to better ways to expose the ability to wait though.

ajsutton and others added 2 commits December 18, 2023 13:51
Also fix a bug in TestNilWithdrawals where the withdrawals weren't added
to the ephemeral BuildPayloadArgs instance for re-calculating the
payload id.
@sebastianst
Copy link
Member Author

Refined @ajsutton's solution and fixed a buggy test (TestNilWithdrawals)

@sebastianst sebastianst requested a review from ajsutton December 18, 2023 12:52
Copy link
Collaborator

@protolambda protolambda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Sorry I missed a few minor things in my last review, the PR looks ready otherwise

miner/worker.go Outdated Show resolved Hide resolved
miner/payload_building.go Outdated Show resolved Hide resolved
Also always stop interrupt timer after fillTransactions in generateWork.
@sebastianst sebastianst merged commit 8e15470 into optimism Dec 19, 2023
5 checks passed
@sebastianst sebastianst deleted the seb/getpayload-interrupt branch December 19, 2023 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants