Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: Helidon HTTP/2 support for rst_stream does not notify the gRPC layer that a Server Streaming connection ended #341

Open
Tracked by #395
mattp-swirldslabs opened this issue Nov 13, 2024 · 2 comments
Assignees
Labels
Bug A error that causes the feature to behave differently than what was expected based on design docs Helidon Helidon

Comments

@mattp-swirldslabs
Copy link
Contributor

mattp-swirldslabs commented Nov 13, 2024

Description

Helidon does not signal the gRPC layer when it receives a rst_stream header from a client on a Server Streaming connection.

Steps to reproduce

  1. Start a bidirectional publishBlockStream connection either with producer.sh 1 100 or with the simulator
  2. Start a server streaming subscribeBlockStream connection with Postman
  3. Start the dashboard and observe that there's 1 producer and 2 consumers (1st consumer is the stream persistence handler).
  4. Now, hit the cancel button in Postman. This emits a rst_stream header to Helidon.
  5. Repeat steps 2 and 4 several times
  6. Note in the dashboard that the number of consumers continues to increment and there are no exceptions thrown in the logs. The mediator ring buffer continues to stream block items to these zombie consumer instances and they continue to pass the block items down through their respective PBJ streamWriter instances. I watched for about 5 mins. Perhaps the tcp socket eventually times out? However, I did not observe it timing out. This is a concerning leak of resources.

This helidon code gets triggered when a rst_stream header is sent. That method will set the state to CLOSED but data streaming from the application through the Http2StreamWriter will continue to work.

There's a "TODO" comment in that Helidon method that may indicate they know there's more work to be done here.

Additional context

image

  • I first found the defect when testing the Helidon PBJ plugin integration
  • However, I confirmed the same issue exists when using grpc.io as the gRPC layer instead
  • Unfortunately, since grpc.io does not have a workaround, there's not likely anything we can do to fix the issue in Block Node. We will require a fix in Helidon.
  • I left an estimate of "3" based on the estimated complexity to integrate with a new version of Helidon that has a fix.

Version

Helidon v4.1.1

Operating system

None

@mattp-swirldslabs mattp-swirldslabs self-assigned this Nov 13, 2024
@mattp-swirldslabs mattp-swirldslabs added the Bug A error that causes the feature to behave differently than what was expected based on design docs label Nov 13, 2024
@mattp-swirldslabs
Copy link
Contributor Author

I filed a ticket with Helidon about the issue here: helidon-io/helidon#9496

@mattp-swirldslabs
Copy link
Contributor Author

@rbair23 I discovered this issue today. It affects both grpc.io and the PBJ Helidon plugin. I imagine we'll need a fix before we go to production since a downstream client could easily tie up hundreds of consumer instances and DoS a BN. The Helidon ticket is here: helidon-io/helidon#9496

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug A error that causes the feature to behave differently than what was expected based on design docs Helidon Helidon
Projects
None yet
Development

No branches or pull requests

1 participant