Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Client loses connection with Event Hub and does not recover #24224

Closed
3 tasks done
MickeyMouseS opened this issue Sep 21, 2021 · 19 comments
Closed
3 tasks done

[BUG] Client loses connection with Event Hub and does not recover #24224

MickeyMouseS opened this issue Sep 21, 2021 · 19 comments
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs pillar-reliability The issue is related to reliability, one of our core engineering pillars. (includes stress testing) question The issue doesn't require a change to the product in order to be resolved. Most issues start as that

Comments

@MickeyMouseS
Copy link

MickeyMouseS commented Sep 21, 2021

Describe the bug
Suddenly, after a period of activity, a consumer client connected to the Event Hub loses connection and never recovers, even though keeps on running.

Exception or Stack Trace

01:25:26.986  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.ReceiveLinkHandler     : onLinkRemoteClose connectionId[MF_db4a99_1631152692131], linkName[1_5fce48_1631152692132], errorCondition[null], errorDescription[null]
01:25:26.986  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.ReceiveLinkHandler     : processOnClose connectionId[MF_db4a99_1631152692131], linkName[1_5fce48_1631152692132], errorCondition[null], errorDescription[null]
01:25:26.986  INFO 1 --- [ctor-executor-3] c.a.c.a.implementation.ReactorSession    : linkName[1_5fce48_1631152692132] entityPath[yyyyyyy/ConsumerGroups/zzzzzz/Partitions/1]: Complete. Removing receive link.
01:25:26.986  INFO 1 --- [ctor-executor-3] c.a.m.e.i.AmqpReceiveLinkProcessor       : Receive link endpoint states are closed. Requesting another.
01:25:26.986  INFO 1 --- [ctor-executor-3] c.a.m.e.i.AmqpReceiveLinkProcessor       : Requesting a new AmqpReceiveLink from upstream.
01:25:26.987  INFO 1 --- [ctor-executor-3] c.a.m.e.i.AmqpReceiveLinkProcessor       : linkName[1_5fce48_1631152692132] entityPath[yyyyyyy/ConsumerGroups/zzzzzz/Partitions/1]. Setting next AMQP receive link.
01:25:26.987  INFO 1 --- [ctor-executor-3] c.a.m.e.i.AmqpReceiveLinkProcessor       : Receive link endpoint states are closed. Requesting another.
01:25:26.987  INFO 1 --- [ctor-executor-3] c.a.m.e.i.AmqpReceiveLinkProcessor       : Requesting a new AmqpReceiveLink from upstream.
01:25:26.987  INFO 1 --- [ctor-executor-3] c.a.m.e.EventHubConsumerAsyncClient      : connectionId[MF_db4a99_1631152692131] linkName[1_5fce48_1631152692132]: Creating receive consumer for partition '1'
01:25:26.988  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.ReceiveLinkHandler     : onLinkLocalClose connectionId[MF_db4a99_1631152692131], linkName[1_5fce48_1631152692132], errorCondition[null], errorDescription[null]
01:25:26.989  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.ReceiveLinkHandler     : onLinkFinal connectionId[MF_db4a99_1631152692131], linkName[1_5fce48_1631152692132]
01:25:26.989  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.SessionHandler         : onSessionRemoteClose connectionId[yyyyyyy/ConsumerGroups/zzzzzz/Partitions/1], entityName[MF_db4a99_1631152692131], condition[Error{condition=null, description='null', info=null}]
01:25:26.990  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.SessionHandler         : onSessionRemoteClose closing a local session for connectionId[MF_db4a99_1631152692131], entityName[yyyyyyy/ConsumerGroups/zzzzzz/Partitions/1], condition[null], description[null]
01:25:26.990  INFO 1 --- [ctor-executor-3] c.a.c.a.i.ReactorConnection              : connectionId[MF_db4a99_1631152692131] sessionName[yyyyyyy/ConsumerGroups/zzzzzz/Partitions/1]: Error occurred. Removing and disposing session.
01:25:26.990  INFO 1 --- [ctor-executor-3] c.a.c.a.implementation.ReactorSession    : connectionId[MF_db4a99_1631152692131], sessionId[yyyyyyy/ConsumerGroups/zzzzzz/Partitions/1], errorCondition[n/a]: Disposing of session.
01:25:26.990  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.SendLinkHandler        : onLinkRemoteClose connectionId[MF_db4a99_1631152692131], linkName[cbs:sender], errorCondition[null], errorDescription[null]
01:25:26.990  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.SendLinkHandler        : processOnClose connectionId[MF_db4a99_1631152692131], linkName[cbs:sender], errorCondition[null], errorDescription[null]
01:25:26.990  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.ReceiveLinkHandler     : onLinkRemoteClose connectionId[MF_db4a99_1631152692131], linkName[cbs:receiver], errorCondition[null], errorDescription[null]
01:25:26.990  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.ReceiveLinkHandler     : processOnClose connectionId[MF_db4a99_1631152692131], linkName[cbs:receiver], errorCondition[null], errorDescription[null]
01:25:26.990  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.SessionHandler         : onSessionRemoteClose connectionId[cbs-session], entityName[MF_db4a99_1631152692131], condition[Error{condition=null, description='null', info=null}]
01:25:26.990  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.SessionHandler         : onSessionRemoteClose closing a local session for connectionId[MF_db4a99_1631152692131], entityName[cbs-session], condition[null], description[null]
01:25:26.990  INFO 1 --- [ctor-executor-3] c.a.c.a.i.ReactorConnection              : connectionId[MF_db4a99_1631152692131] sessionName[cbs-session]: Error occurred. Removing and disposing session.
01:25:26.990  INFO 1 --- [ctor-executor-3] c.a.c.a.implementation.ReactorSession    : connectionId[MF_db4a99_1631152692131], sessionId[cbs-session], errorCondition[n/a]: Disposing of session.
01:25:26.990  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.ConnectionHandler      : onConnectionRemoteClose hostname[iothub-ns-xxxxxxxxx.servicebus.windows.net], connectionId[MF_db4a99_1631152692131], errorCondition[null], errorDescription[null]
01:25:26.991  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.SendLinkHandler        : onLinkLocalClose connectionId[MF_db4a99_1631152692131], linkName[cbs:sender], errorCondition[null], errorDescription[null]
01:25:26.991  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.ReceiveLinkHandler     : onLinkLocalClose connectionId[MF_db4a99_1631152692131], linkName[cbs:receiver], errorCondition[null], errorDescription[null]
01:25:26.991  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.ConnectionHandler      : onConnectionLocalClose hostname[iothub-ns-xxxxxxxxx.servicebus.windows.net], connectionId[MF_db4a99_1631152692131], errorCondition[null], errorDescription[null]
01:25:26.991  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.ConnectionHandler      : onConnectionUnbound hostname[iothub-ns-xxxxxxxxx.servicebus.windows.net], connectionId[MF_db4a99_1631152692131], state[CLOSED], remoteState[CLOSED]
01:25:26.991  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.SessionHandler         : onSessionFinal connectionId[MF_db4a99_1631152692131], entityName[yyyyyyy/ConsumerGroups/zzzzzz/Partitions/1], condition[null], description[null]
01:25:26.991  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.SendLinkHandler        : onLinkFinal connectionId[MF_db4a99_1631152692131], linkName[cbs:sender]
01:25:26.991  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.ReceiveLinkHandler     : onLinkFinal connectionId[MF_db4a99_1631152692131], linkName[cbs:receiver]
01:25:26.991  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.SessionHandler         : onSessionFinal connectionId[MF_db4a99_1631152692131], entityName[cbs-session], condition[null], description[null]
01:25:26.991  INFO 1 --- [ctor-executor-3] c.a.c.a.i.handler.ConnectionHandler      : onConnectionFinal hostname[iothub-ns-xxxxxxxxx.servicebus.windows.net], connectionId[MF_db4a99_1631152692131], errorCondition[null], errorDescription[null]
01:25:26.994  INFO 1 --- [ctor-executor-3] c.a.m.e.i.EventHubConnectionProcessor    : namespace[iothub-ns-xxxxxxxxx.servicebus.windows.net] entityPath[yyyyyyy]: Channel is closed.
01:25:26.994  INFO 1 --- [ctor-executor-3] c.a.m.e.i.EventHubReactorAmqpConnection  : connectionId[MF_db4a99_1631152692131]: Disposing of connection.
01:25:26.994  INFO 1 --- [ctor-executor-3] c.a.c.a.i.ReactorConnection              : connectionId[MF_db4a99_1631152692131], errorCondition[n/a]: Disposing of ReactorConnection.

...

01:26:26.994  INFO 1 --- [ctor-executor-3] c.a.c.a.implementation.ReactorExecutor   : Unable to acquire dispose reactor semaphore within timeout.
01:26:26.994  INFO 1 --- [ctor-executor-3] c.a.c.a.i.AmqpExceptionHandler           : Shutdown received: ReactorExecutor.close() was called., isTransient[false], initiatedByClient[true]
01:26:26.996  WARN 1 --- [ctor-executor-3] c.a.c.a.implementation.ReactorExecutor   : connectionId[MF_db4a99_1631152692131], message[Scheduling reactor failed because the scheduler has been shut down.]
01:26:26.996  INFO 1 --- [ctor-executor-3] c.a.c.a.implementation.ReactorExecutor   : connectionId[MF_db4a99_1631152692131], message[Stopping the reactor because thread was interrupted or the reactor has no more events to process.]

To Reproduce
It looks a random issue after a period of activity for some hours or days. It would be a big help to know how to produce that the event hub decides to closes the link, session and connection.

Code Snippet
It is the normal connection from examples:

    BlobContainerAsyncClient blobContainerAsyncClient = new BlobContainerClientBuilder()
        .connectionString(_STORAGE_CONNECTION_STRING_)
        .containerName(_STORAGE_CONTAINER_NAME_)
        .buildAsyncClient();

    EventProcessorClientBuilder eventProcessorClientBuilder = new EventProcessorClientBuilder()
        .connectionString(_EH_NAMESPACE_CONNECTION_STRING_)
        .consumerGroup(_CONSUMERGROUP_NAME_)
        .processEvent(_PARTITION_PROCESSOR_)
        .processError(_ERROR_HANDLER_)
        .checkpointStore(new BlobCheckpointStore(blobContainerAsyncClient));

Expected behavior
I would expect not to receive such events from the Event Hub (close of link, session and connection) and, if received, the client should recover opening a new connection. Or at least being able to manage this error. I would like also to have a description telling why the Event Hub decides to close the connection.

Setup (please complete the following information):

  • OS: Linux RHEL
  • Library/Libraries: com.azure:azure-messaging-eventhubs:5.5.0
  • Java version: 8
  • App Server/Environment: openjdk:8 docker base image
  • Frameworks: Spring Boot

Additional context
It is actually the built-in Event Hub for the IoT Hub service.

Information Checklist
Kindly make sure that you have added all the following information above and checkoff the required fields otherwise we will treat the issuer as an incomplete report

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added
@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Sep 21, 2021
@joshfree joshfree added Client This issue points to a problem in the data-plane of the library. Event Hubs pillar-reliability The issue is related to reliability, one of our core engineering pillars. (includes stress testing) labels Sep 21, 2021
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Sep 21, 2021
@joshfree joshfree added needs-author-feedback Workflow: More information is needed from author to address the issue. needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. labels Sep 21, 2021
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Sep 21, 2021
@joshfree
Copy link
Member

Hi @MickeyMouseS, 5.5.0 is a fairly old version from February 2021. Could you try upgrading to 5.10.0, which has a significant number of performance and reliability improvements in its azure-core-amqp dependency made since February; and let us know if you're still experiencing any issues?

Here are more details on the changes included:

https://github.com/Azure/azure-sdk-for-java/blob/azure-messaging-eventhubs_5.10.1/sdk/eventhubs/azure-messaging-eventhubs/CHANGELOG.md

https://github.com/Azure/azure-sdk-for-java/blob/azure-messaging-eventhubs_5.10.1/sdk/core/azure-core-amqp/CHANGELOG.md

/cc @conniey

@MickeyMouseS
Copy link
Author

Thank you @joshfree for your prompt answer.

We will definitely update the library and let you know if we encounter any issue.

Additionally, it would be really helpful if you can clarify these points:

  • Why Event Hub is closing the connection? We have many consumers in Azure Cloud and these ones don't present any issue, but the ones deployed on-premise experience problems with the connection. We can assure is not any firewall issue, as we have checked. Also, as it can be seen in the log, the Event Hub starts the closure of the connection, but not reason given (timeout or similar).
  • Is there any recommended mechanism for us to double check for any issues at runtime? So we are able to capture the failure and restart the application.

@ghost ghost added needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-author-feedback Workflow: More information is needed from author to address the issue. labels Sep 21, 2021
@anuchandy
Copy link
Member

Hi @MickeyMouseS, from the SDK side, we're not sure why the service would initiate closure of the endpoints (link, session, connection) without any "error". I would suggest opening a eventhub support ticket so that service dev can analyze the service log specific to your subscription/resource and clarify why this is happening.

That said, you're right that SDK should retry/recover when such closure happens. There has been a couple of reliability fixes since 5.5.0, so like Josh pointed, please update to the latest to get those fixes; we specifically had fixes related to the close path. There is also a fix for close merged last week that will be released in Oct mid.

@lidkowiak
Copy link

We have same issue on azure-messaging-eventhubs 5.10.0. It happens multiple times per day.

@dekel-upstream
Copy link

Also happens in 5.10.0, same time for two process running on different consumer groups

@conniey conniey added needs-author-feedback Workflow: More information is needed from author to address the issue. and removed needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team labels Jan 14, 2022
@conniey
Copy link
Member

conniey commented Jan 14, 2022

If you could provide some INFO/DEBUG logs +/- 5 minutes after you experience the issue, it would help us narrow this down or a way to reproduce your issue/environment. :(

We have a few stress infrastructure runs that have been running for ~2 weeks and have not experienced the issue.

@gondar
Copy link

gondar commented Jan 14, 2022 via email

@ghost ghost added the no-recent-activity There has been no recent activity on this issue. label Jan 22, 2022
@ghost
Copy link

ghost commented Jan 22, 2022

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

@gondar
Copy link

gondar commented Jan 22, 2022 via email

@ghost ghost removed the no-recent-activity There has been no recent activity on this issue. label Jan 22, 2022
@ghost ghost added the no-recent-activity There has been no recent activity on this issue. label Jan 29, 2022
@ghost
Copy link

ghost commented Jan 29, 2022

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

@gondar
Copy link

gondar commented Jan 30, 2022 via email

@ghost ghost removed the no-recent-activity There has been no recent activity on this issue. label Jan 30, 2022
@ghost ghost added the no-recent-activity There has been no recent activity on this issue. label Feb 6, 2022
@ghost
Copy link

ghost commented Feb 6, 2022

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

@gondar
Copy link

gondar commented Feb 6, 2022 via email

@ghost ghost removed the no-recent-activity There has been no recent activity on this issue. label Feb 6, 2022
@ghost ghost added the no-recent-activity There has been no recent activity on this issue. label Feb 13, 2022
@ghost
Copy link

ghost commented Feb 13, 2022

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

@gondar
Copy link

gondar commented Feb 15, 2022 via email

@ghost ghost removed the no-recent-activity There has been no recent activity on this issue. label Feb 15, 2022
@ghost ghost added the no-recent-activity There has been no recent activity on this issue. label Feb 22, 2022
@ghost
Copy link

ghost commented Feb 22, 2022

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

@gondar
Copy link

gondar commented Mar 4, 2022 via email

@ghost ghost removed the no-recent-activity There has been no recent activity on this issue. label Mar 4, 2022
@conniey conniey removed the needs-author-feedback Workflow: More information is needed from author to address the issue. label Mar 4, 2022
@conniey
Copy link
Member

conniey commented Mar 4, 2022

Hey @gondar, apologies for the bot bother, it doesn't seem to automatically take off the label. I was unable to catch your email because it was ***'d out. I looked into the issue "Tracking#2201040050000678" and the issue I have access to is a Service Bus related issue about overexcessive logging... I don't think that's the one you're referring to here.

Since some automation censored out your email, you can reach out to me. My email is conniey [at] microsoft.com

@dev-nimbeo
Copy link

Hi all,

Finally this issue has been resolved with the last library updates. Now the clients are able to recover AMQP communication after a disconnection event. The issue is resolved since version 5.11.1 (https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/CHANGELOG.md#5111-2022-03-17).

The "random" disconnection events are due to maintenance reasons in Azure Event Hub service.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Event Hubs pillar-reliability The issue is related to reliability, one of our core engineering pillars. (includes stress testing) question The issue doesn't require a change to the product in order to be resolved. Most issues start as that
Projects
None yet
Development

No branches or pull requests

8 participants