Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

op-batcher: adjust error handling on pending-channels after close #7683

Merged
merged 7 commits into from
Dec 18, 2023

Conversation

protolambda
Copy link
Contributor

@protolambda protolambda commented Oct 14, 2023

Description

This PR depends on #7682

This fixes a CI flake by adjusting the error reporting of pending-channels.

CI would previously fail when there is remaining pending channel-data that does not fit all in one channel, due to the compressor-full status.


Here's the rabbit-hole I went down to find this issue:

ChannelOut.Close() is being called twice on a closed state, which should never happen.

CI Log snippet:

    driver.go:155:              INFO [10-12|17:46:15.226] Stopping Batch Submitter                 role=batcher
    driver.go:313:              ERROR[10-12|17:46:15.226] error closing the channel manager        role=batcher      err="creating frames with channel builder: closing channel out: already closed"

Batcher :

  • ChannelOut.Close() is called as part of channelBuilder.closeAndOutputAllFrames(), closing channel out is logged.
  • channelBuilder.closeAndOutputAllFrames() is called as part of outputFrames(), creating frames with channel builder is logged.
  • outputFrames() is called both as part of channelManager.Close() and channelManager.TxData()
  • channelManager.Close() is only called like BatchSubmitter.state.Close(), mutually exclusive driver loop events:
    • Option 1: closed and then cleared (reopened) on reorg
    • Option 2: closed on shutdown. error closing the channel manager is logged.
  • channelManager.TxData() is only called publishTxToL1()
  • publishTxToL1 is called in a go-routine by publishStateToL1, but caller always blocks till it is done publishing.
  • publishStateToL1 is called after closing the state and logging closing error on shutdown, in the driver loop.
  • publishStateToL1 is also called in more driver-loop places.

Given that publishStateToL1 waits till it is done, there seems to be no concurrency issue, and it's called many times over in regular operation.
But closeAndOutputAllFrames followed by Close seems to be the issue: only then we can hit the log from CI fail.

On channelManager.Close(), the channelBuilder.Close happens, but this only really marks it as "full", with ErrTerminated
However, the ChannelOut of the channelBuilder should have been removed if it were completely unutilized, and have nothing to submit.

The test-cases, when not dropping the error, then indicate the exact issue: the compressor is "full", and when an existing pending channel is "fully" submitted, it actually does not get fully submitted, and will appear already-closed.


Changes

  • Do not make the channel-manager Close func try to output frames: this happens as part of the later TxData calls anyway. Edit(seb): Only Close a channel and outputFrames in channelManager.Close if the channel wasn't already full. This was causing the double-closed error, when things didn't fit in a single channel.
  • Return an error that channel data remains to be submitted.
  • Add clear logging on channel-manager close on what is being dropped as never-submitted and what is still pending after close
  • Adjust the error log to a warn-log if pending data remains to be submitted: the driver is capable of draining that by submitting it all to L1 before final shutdown.
  • Include channel full-cause in error on non-err output of channel frames, so we can better understand issues like the previous already-closed error.
  • Comments to explain what's happening, because it was not clear.
  • Minor cleanup of error naming/duplication.

Tests

Close errors in the channel-manager tests are no longer dropped.

ChannelManagerClosePendingChannel tested the exact case of remaining pending-channels work after channel-manager close, but ignored the error.

Added new NonCompressor that can be used in testing. When writing to it, it first flushes data from any previous write. This makes it behave more predicable which is better suited for tests.

Using this one, added a new test ChannelManager_Close_PartiallyPendingChannel that shows that the outputFrames call is necessary in channelManager.Close in case there are unflushed blocks in the compressor.

@protolambda protolambda requested a review from a team as a code owner October 14, 2023 15:18
@semgrep-app
Copy link
Contributor

semgrep-app bot commented Oct 14, 2023

Semgrep found 3 todos_require_linear findings:

  • op-proposer/metrics/metrics.go: L87
  • op-challenger/metrics/metrics.go: L136
  • op-batcher/batcher/service.go: L309

Please create a Linear ticket for this TODO.

Ignore this finding from todos_require_linear.

@protolambda protolambda force-pushed the batcher-driver-close branch 2 times, most recently from c57fc21 to ea42a17 Compare October 14, 2023 15:45
@protolambda
Copy link
Contributor Author

fixed semgrep todo issues in base PR, semgrep comment is stale

op-batcher/batcher/driver.go Outdated Show resolved Hide resolved
op-batcher/batcher/driver.go Outdated Show resolved Hide resolved
op-batcher/batcher/channel_manager.go Outdated Show resolved Hide resolved
Base automatically changed from batcher-lifecycle to develop October 24, 2023 19:04
@codecov codecov bot requested a review from ajsutton as a code owner October 24, 2023 19:04
op-batcher/batcher/channel_manager.go Outdated Show resolved Hide resolved
op-batcher/batcher/driver.go Outdated Show resolved Hide resolved
op-batcher/batcher/driver.go Outdated Show resolved Hide resolved
op-batcher/batcher/driver.go Outdated Show resolved Hide resolved
op-batcher/batcher/driver.go Outdated Show resolved Hide resolved
op-batcher/batcher/driver.go Outdated Show resolved Hide resolved
op-batcher/batcher/channel_manager.go Outdated Show resolved Hide resolved
@semgrep-app
Copy link
Contributor

semgrep-app bot commented Oct 30, 2023

Semgrep found 1 nil-check-after-call finding:

Potential p2pmetrics nil dereference when NewBandwidthCounter is called

Ignore this finding from nil-check-after-call.

Semgrep found 4 todos_require_linear findings:

  • op-node/rollup/derive/channel_out.go: L84, L71
  • op-batcher/batcher/config.go: L64
  • .circleci/config.yml: L1215

Please create a Linear ticket for this TODO.

Ignore this finding from todos_require_linear.

Copy link
Contributor

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Copy link
Contributor

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Copy link
Contributor

coderabbitai bot commented Dec 13, 2023

Walkthrough

Walkthrough

The updates across various Go files in the op-batcher and op-node packages involve enhancing error handling, introducing new test cases, refining method behaviors, and expanding functionality with new types and constants. The changes focus on improving the robustness of channel builders and managers, adding a non-compressor option, and standardizing error responses for closed channels.

Changes

File(s) Change Summary
op-batcher/batcher/channel_builder.go Updated error handling in OutputFrames method; added conditional checks and detailed comments.
op-batcher/batcher/channel_manager.go Updated Close method with enhanced error handling and detailed comments.
op-batcher/batcher/channel_builder_test.go, op-batcher/batcher/channel_manager_test.go Added new test functions to validate the updated behaviors of methods.
op-batcher/batcher/driver.go Enhanced logging and error handling in the loop function of BatchSubmitter.
op-batcher/compressor/compressors.go, op-batcher/compressor/non_compressor.go, op-batcher/compressor/non_compressor_test.go Introduced NoneKind compressor and NonCompressor type with associated methods and tests.
op-node/rollup/derive/channel_out.go, op-node/rollup/derive/span_channel_out.go Added ErrChannelOutAlreadyClosed error variable and replaced direct error creation with this variable.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on X ?


Tips

Chat with CodeRabbit Bot (@coderabbitai)

  • You can reply to a review comment made by CodeRabbit.
  • You can tag CodeRabbit on specific lines of code or files in the PR by tagging @coderabbitai in a comment.
  • You can tag @coderabbitai in a PR comment and ask one-off questions about the PR and the codebase. Use quoted replies to pass the context for follow-up questions.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • The JSON schema for the configuration file is available here.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

@sebastianst sebastianst dismissed their stale review December 13, 2023 23:07

Dismissing my review since I took over this PR.

@sebastianst sebastianst requested a review from ajsutton December 13, 2023 23:07
op-batcher/compressor/non_compressor_test.go Show resolved Hide resolved
op-batcher/compressor/non_compressor_test.go Show resolved Hide resolved
op-batcher/compressor/non_compressor_test.go Show resolved Hide resolved
op-batcher/compressor/non_compressor.go Outdated Show resolved Hide resolved
op-batcher/compressor/non_compressor.go Outdated Show resolved Hide resolved
op-batcher/batcher/channel_manager.go Outdated Show resolved Hide resolved
Copy link

codecov bot commented Dec 13, 2023

Codecov Report

Merging #7683 (0a9ab58) into develop (6e371b4) will decrease coverage by 14.44%.
Report is 50 commits behind head on develop.
The diff coverage is n/a.

Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #7683       +/-   ##
============================================
- Coverage    34.61%   20.18%   -14.44%     
============================================
  Files          167       88       -79     
  Lines         7162     2101     -5061     
  Branches      1212      481      -731     
============================================
- Hits          2479      424     -2055     
+ Misses        4532     1649     -2883     
+ Partials       151       28      -123     
Flag Coverage Δ
cannon-go-tests ?
chain-mon-tests ?
common-ts-tests ?
contracts-bedrock-tests 20.18% <ø> (-0.15%) ⬇️
contracts-ts-tests ?
core-utils-tests ?
sdk-next-tests ?
sdk-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

see 82 files with indirect coverage changes

Copy link
Contributor

semgrep-app bot commented Dec 13, 2023

Semgrep found 1 import-text-template finding:

  • op-bindings/gen/main.go: L14

When working with web applications that involve rendering user-generated content, it's important to properly escape any HTML content to prevent Cross-Site Scripting (XSS) attacks. In Go, the text/template package does not automatically escape HTML content, which can leave your application vulnerable to these types of attacks. To mitigate this risk, it's recommended to use the html/template package instead, which provides built-in functionality for HTML escaping. By using html/template to render your HTML content, you can help to ensure that your web application is more secure and less susceptible to XSS vulnerabilities.

Ignore this finding from import-text-template.

op-batcher/batcher/channel_manager.go Outdated Show resolved Hide resolved
op-batcher/batcher/channel_manager.go Outdated Show resolved Hide resolved
protolambda and others added 6 commits December 14, 2023 14:21
Test added that validates that in rare circumstances this is needed.
This happens in scenarios where a block is written to the compressor,
but not flushed yet to the output buffer. If we don't call outputFrames
in channelManager.Close, this test fails.
@sebastianst sebastianst self-assigned this Dec 14, 2023
Copy link
Contributor

@ajsutton ajsutton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just a couple of suggestions to avoid log messages sending me down a rabbit hole in the future. :)

op-batcher/batcher/channel_manager.go Outdated Show resolved Hide resolved
op-batcher/batcher/driver.go Outdated Show resolved Hide resolved
- clarify that pending channels will be submitted
- use same key "id" for channel ids everywhere
@sebastianst sebastianst added this pull request to the merge queue Dec 18, 2023
Comment on lines 284 to +289
if err != nil {
l.Log.Error("error closing the channel manager", "err", err)
if errors.Is(err, ErrPendingAfterClose) {
l.Log.Warn("Closed channel manager on shutdown with pending channel(s) remaining - submitting")
} else {
l.Log.Error("Error closing the channel manager on shutdown", "err", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duplicate error handling seems to indicate that closing the channel manager with pending work isn't really an error at all, right?

Can Close return how much remaining work it has along with the error? That way its signal can be handled without these errors.Is calls

var ErrPendingAfterClose = errors.New("pending channels remain after closing channel-manager")

// Close clears any pending channels that are not in-flight already, to leave a clean derivation state.
// Close then marks the remaining current open channel, if any, as "full" so it can be submitted as well.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we're marking non-full channels as full specifically so they'll be handled in a certain way. Is that an appropriate use of the marking? I see elsewhere that we emit errors about a "error while closing full channel", and worry that will be confusing if the channel isn't actually full.

Merged via the queue into develop with commit 82d9f11 Dec 18, 2023
62 checks passed
@sebastianst sebastianst deleted the batcher-driver-close branch December 18, 2023 18:10
roberto-bayardo pushed a commit to roberto-bayardo/optimism that referenced this pull request Dec 19, 2023
…hereum-optimism#7683)

* op-batcher: adjust error handling on pending-channels after close

* op-batcher: fix comment

* Capitalize start of log messages

Co-authored-by: Adrian Sutton <[email protected]>

* op-batcher: Add NonCompressor for testing purposes

* op-node/rollup/derive: Return ErrChannelOutAlreadyClosed in SpanChannelOut

* op-batcher: Add back outputFrames call to channelManager.Close

Test added that validates that in rare circumstances this is needed.
This happens in scenarios where a block is written to the compressor,
but not flushed yet to the output buffer. If we don't call outputFrames
in channelManager.Close, this test fails.

* op-batcher: Improve logging

- clarify that pending channels will be submitted
- use same key "id" for channel ids everywhere

---------

Co-authored-by: Sebastian Stammler <[email protected]>
Co-authored-by: Adrian Sutton <[email protected]>
roberto-bayardo pushed a commit to roberto-bayardo/optimism that referenced this pull request Dec 21, 2023
…hereum-optimism#7683)

* op-batcher: adjust error handling on pending-channels after close

* op-batcher: fix comment

* Capitalize start of log messages

Co-authored-by: Adrian Sutton <[email protected]>

* op-batcher: Add NonCompressor for testing purposes

* op-node/rollup/derive: Return ErrChannelOutAlreadyClosed in SpanChannelOut

* op-batcher: Add back outputFrames call to channelManager.Close

Test added that validates that in rare circumstances this is needed.
This happens in scenarios where a block is written to the compressor,
but not flushed yet to the output buffer. If we don't call outputFrames
in channelManager.Close, this test fails.

* op-batcher: Improve logging

- clarify that pending channels will be submitted
- use same key "id" for channel ids everywhere

---------

Co-authored-by: Sebastian Stammler <[email protected]>
Co-authored-by: Adrian Sutton <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants