[core-amqp] onTransportError does not retry to get AMQP Connection #21156
Comments
We experienced a similar error last week on our production environment on Azure. We see an onTransportError in the logs, and then the app cannot recover the AMQP connection. We are using azure-messaging-servicebus version 7.3.0. |
@hemanttanwar Any update on this issue? We experienced this issue as well on azure-messaging-servicebus 7.4.0, and the SDK failed to recover the AMQP connection. |
We experienced the same issue with azure-messaging-eventhubs 5.10.0 yesterday:
com.azure.core.amqp.exception.AmqpException: An existing connection was forcibly closed by the remote host, errorContext[NAMESPACE: [REMOVED].servicebus.windows.net. ERROR CONTEXT: N/A] An existing connection was forcibly closed by the remote host, errorContext[NAMESPACE: [REMOVED].windows.net. ERROR CONTEXT: N/A] |
@steven-cardini @lidkowiak Do you have INFO-level or DEBUG-level logs for this issue? They would help us understand what was happening on the link beforehand. The dropped error comes from another Reactor operator. |
Here is what we are seeing in our logs, granted it is not at INFO or DEBUG level.
It always seems to fail to recover from this. We are seeing this pretty often on our end, but there does not seem to be much of a pattern. |
@conniey Unfortunately no, we have the following logger config. We have 6 consumer groups that process events from two Event Hubs, each configured with 32 partitions. Generally we're observing two patterns:
All consumer groups are deployed as Azure VMs within a single vnet. Since the library upgrade we have been facing this issue quite frequently (almost every day). On the legacy library (azure-eventhubs-eph 3.2.0) it happened once or twice a month. I've seen that a couple of other issues have been raised with similar behaviour. |
"An existing connection was forcibly closed by the remote host." means the Event Hubs service disconnected the service; this is something we don't control. I would need more logs to understand where it's hanging (ie. failing to recover). I've tested this locally and I've seen it recover... so I'd need some DEBUG logs to understand this more. |
@SheldonB is that where your log ends? If you have more logs, could you upload them? (There are a lot of components that try to recover behind the scenes, so we'd want to understand where it is hanging.) |
@conniey Yes, that is where it ends. I have more logs that are exactly the same, but they are only ERROR logs. It never reconnects after that exception. I am going to do my best to get you a DEBUG log, but like I said, the issue does not seem to have a pattern to it. |
@conniey The problem is that the issue is nondeterministic so I can't predict which consumer group will be affected. EH/amqp logs are very chatty so we go with WARN / ERROR levels. |
Morning, I work with @lidkowiak on this issue in our company. https://gist.githubusercontent.com/piotr-napadlek/70878388eaf0626c878190fcab46ab67/raw/208db4b561ce9955b9c9bee32a488a29812b0ece/eh_logs1.txt As an example, the library stopped receiving messages on partition 20 on 2021-10-01 06:25:26,004 (message on topic "rapa" with sequence number 4329039). From the sender app logs we can confirm that the next message on partition 20 was enqueued at 2021-10-01T06:27:49.5790000Z with sequence number 4329040. There is no trace of it in the receiver app until we restarted the client some half hour afterwards, with the .stop() and .start() invocation on the EventProcessorClient instance. Additionally our regular application logs emitted warning/errors below within the timerange: Note that our app also connects to servicebus to emit some events, so you can spot some exceptions of the service bus sender. Please let us know if you need any other details. |
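The restart workaround mentioned above (calling .stop() and then .start() on the EventProcessorClient once consumption has stalled) could be automated along these lines. This is only a sketch under stated assumptions: `StalledProcessorWatchdog` and its `Restartable` interface are hypothetical names, not part of the SDK, and the interface merely stands in for the client's `start()`/`stop()` methods so the example compiles without the Azure libraries. Note that a partition with no traffic at all looks exactly like a stalled one, so the silence threshold has to be chosen with the expected message rate in mind.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper (not SDK code): restarts a processor that has been silent for too long. */
final class StalledProcessorWatchdog {

    /** Minimal stand-in for the processor; an EventProcessorClient would be adapted to this. */
    interface Restartable {
        void start();
        void stop();
    }

    private final Restartable processor;
    private final Duration maxSilence;
    private final AtomicReference<Instant> lastEvent;

    StalledProcessorWatchdog(Restartable processor, Duration maxSilence, Instant now) {
        this.processor = processor;
        this.maxSilence = maxSilence;
        this.lastEvent = new AtomicReference<>(now);
    }

    /** Call from the event handler each time an event is processed. */
    void recordEvent(Instant when) {
        lastEvent.set(when);
    }

    /**
     * Call periodically (for example from a scheduled executor).
     * Restarts the processor if no event was recorded within maxSilence.
     * Returns true if a restart was triggered.
     */
    boolean checkAndRestart(Instant now) {
        Instant last = lastEvent.get();
        if (Duration.between(last, now).compareTo(maxSilence) <= 0) {
            return false; // still within the allowed quiet period
        }
        processor.stop();
        processor.start();
        lastEvent.set(now); // give the restarted processor a full window before the next check
        return true;
    }
}
```

In an application, `recordEvent(Instant.now())` would be called from the processEvent callback and `checkAndRestart(Instant.now())` from a timer. This only masks the symptom until a fixed library version is in use.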
@conniey I can provide the following logs from a recent occurrence of the problem (this is on behalf of @steven-cardini ). This covers the timeframe from 2021-09-21T22:26:30.000Z until 2021-09-21T22:26:40.000Z. In the meantime, we have also upgraded to azure-messaging-servicebus 7.4.0, and this problem still occurs, leading frequently to production problems where we have to restart the affected services. Please let me know if you need more information.
On a related note, we have also seen some of the following exceptions. Do you think this is a related issue or something that would need separate investigation?
|
Thanks for sending this log. I'm taking a look at it. |
Hello @conniey, we have experienced this issue 4 separate times over the past couple weeks with two of our App Services (using azure-messaging-eventhubs 5.10.0). These apps are simple Event Hub consumers (1 partition, 1 consumer group) and publishers. What is somewhat interesting is that even though these two apps are completely independent of each other, they both lost their ability to consume messages 2 times over the past two weeks and at about the same time as each other (probably just a coincidence, but interesting nonetheless). The thing that is concerning about this issue is that it is completely silent to our apps. Even though we have proper error handling implemented in our EventProcessorClient, when this issue occurs, no errors whatsoever are propagated up the stack to our app. Only by querying our logs in log analytics do we then see all the com.azure.core.amqp related errors. I have opened TrackingID#2110020040000392 with Microsoft support and if you access that ticket you will find a write up I did related to the 4 incidents and all the log files leading up to and after each of the incidents. Please let me know if you need any further details. |
Hello @conniey, Could you please provide a quick status/summary on the issue(s) reported recently with respect to Event Hub consumers stopping to read events from partitions? The reason I ask is that as I mentioned above, I had opened a ticket with support for this very same issue and they are reporting that the issue is fixed via PR #24141. However, after reading that and seeing issue #24426, I am not totally convinced that all is resolved and that we are now just waiting for the fix to be released. Therefore, could you please clarify where things are at? Thank you so much! |
@my3sons Hey. The end result of most bugs is the same: "It stops receiving". However, the cause of these bugs is often different. Without your logs, I can't say for sure which issue you're talking about. For the issue fixed in #24141, you would see in your logs a graceful closure of the links, where errorCondition is null, and a mismatch in RequestResponseChannel's message ids that are pending versus scheduled. There are other issues, such as the CBS node never terminating because it scheduled a close on an already closed link, so you wouldn't see the completion of it. But the end result of both these bugs is the same: it stops receiving. |
Hello @conniey, I have attached the logs from a couple of our consumers which experienced the same issue about 24 hours apart. The logs provide you all the details just prior to and after it stopped consuming events (9:15:41 on 9/30/2021 and 3:02:25 PM on 10/01/2021). Based on what you see here, is this consistent with #24141? |
@piotr-napadlek: @anuchandy and I analysed your logs. The fix for this is in our October release. In your logs, there are 4 connections that are gracefully closed. And on each of these connections, there are CBS requests that are never settled, so it hangs forever waiting.
2021-10-01 06:27:09,050 INFO [com.azure.core.amqp.implementation.handler.ConnectionHandler] (reactor-executor-662) onConnectionRemoteClose connectionId[MF_e5e6a7_1633064490013] hostname[our-eventhub-namespace.servicebus.windows.net] errorCondition[null] errorDescription[null]
2021-10-01 06:27:09,081 INFO [com.azure.core.amqp.implementation.handler.ConnectionHandler] (reactor-executor-660) onConnectionRemoteClose connectionId[MF_3d0325_1633064489966] hostname[our-eventhub-namespace.servicebus.windows.net] errorCondition[null] errorDescription[null]
2021-10-01 06:27:09,206 INFO [com.azure.core.amqp.implementation.handler.ConnectionHandler] (reactor-executor-683) onConnectionRemoteClose connectionId[MF_25236f_1633064491873] hostname[our-eventhub-namespace.servicebus.windows.net] errorCondition[null] errorDescription[null]
2021-10-01 06:27:09,300 INFO [com.azure.core.amqp.implementation.handler.ConnectionHandler] (reactor-executor-650) onConnectionRemoteClose connectionId[MF_73109b_1633064488857] hostname[our-eventhub-namespace.servicebus.windows.net] errorCondition[null] errorDescription[null]
CBS requests that are never settled:
2021-10-01 06:27:09,019 DEBUG [com.azure.core.amqp.implementation.RequestResponseChannel] (reactor-executor-662) connectionId[MF_e5e6a7_1633064490013], linkName[cbs]: Scheduling on dispatcher. MessageId[6]
2021-10-01 06:27:08,988 DEBUG [com.azure.core.amqp.implementation.RequestResponseChannel] (reactor-executor-660) connectionId[MF_3d0325_1633064489966], linkName[cbs]: Scheduling on dispatcher. MessageId[6]
2021-10-01 06:27:09,097 DEBUG [com.azure.core.amqp.implementation.RequestResponseChannel] (reactor-executor-683) connectionId[MF_25236f_1633064491873], linkName[cbs]: Scheduling on dispatcher. MessageId[6]
2021-10-01 06:27:09,144 DEBUG [com.azure.core.amqp.implementation.RequestResponseChannel] (reactor-executor-650) connectionId[MF_73109b_1633064488857], linkName[cbs]: Scheduling on dispatcher. MessageId[6] |
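For anyone checking their own logs for the same signature, the correlation described in this comment (a connection closed with errorCondition[null] that also had a CBS request scheduled on it) can be sketched roughly as follows. The class name `CbsHangLogScan` is made up for illustration, and the matching relies only on the substrings visible in the log lines quoted in this thread. It does not check whether the CBS request was later settled, because the thread does not show what a settlement log line looks like, so a match is a hint to investigate rather than proof of the bug.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical log scanner (not SDK code) for the signature discussed in this thread. */
final class CbsHangLogScan {

    // Matches connectionId[MF_e5e6a7_1633064490013] and captures the id inside the brackets.
    private static final Pattern CONNECTION_ID = Pattern.compile("connectionId\\[([^\\]]+)\\]");

    /**
     * Returns the ids of connections that were closed by the remote side with
     * errorCondition[null] AND had a request scheduled on their cbs link.
     */
    static Set<String> suspectConnections(List<String> lines) {
        Set<String> gracefullyClosed = new LinkedHashSet<>();
        Set<String> cbsScheduled = new LinkedHashSet<>();
        for (String line : lines) {
            Matcher m = CONNECTION_ID.matcher(line);
            if (!m.find()) {
                continue; // line does not mention a connection
            }
            String id = m.group(1);
            if (line.contains("onConnectionRemoteClose") && line.contains("errorCondition[null]")) {
                gracefullyClosed.add(id);
            } else if (line.contains("linkName[cbs]") && line.contains("Scheduling on dispatcher")) {
                cbsScheduled.add(id);
            }
        }
        gracefullyClosed.retainAll(cbsScheduled); // keep only ids seen in both roles
        return gracefullyClosed;
    }
}
```

Feeding it the lines of a DEBUG-level log file (for example via `Files.readAllLines`) yields the connection ids worth a closer look. Without DEBUG logging the "Scheduling on dispatcher" lines are absent and the result will be empty.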
@my3sons : From your logs, we saw the graceful closure of the links where it does not recover. Unfortunately, DEBUG logs were not enabled, which contains the AM","onLinkRemoteClose connectionId[MF_06aefc_1633006667393] linkName[0_310f17_1633006667393], errorCondition[null] errorDescription[null]" |
We've investigated the |
Is this issue resolved? I am also facing the same issue. |
When an AMQP connection gets a transport error, the client does not retry to get a new AMQP connection.
Issue observed in: core-amqp 2.0.4
This error shows up in both Service Bus and Event Hubs.
Replicating the issue:
A transport error can have many different causes, but one way I have been generating it is to start an async receiver and disconnect from the internet for 5 minutes.
Here is what we see in the logs:
com.azure.core.amqp.implementation.handler.ConnectionHandler - onTransportError hostname[eh-test-t2.servicebus.windows.net], connectionId[MF_ca8f0f_1619737351426], error[An existing connection was forcibly closed by the remote host]
com.azure.core.amqp.implementation.ReactorConnection - onConnectionShutdown connectionId[MF_ca8f0f_1619737351426], hostName[eh-test-t2.servicebus.windows.net], message[Shutting down], shutdown signal[false]
Commit point just before the AMQP connection fixes in April: a8a39f8
My observation is that the code at the above commit treats this error as transient and retries, since it calls the endpointsState/onError consumer here:
https://github.com/Azure/azure-sdk-for-java/blob/master/sdk/core/azure-core-amqp/src/main/java/com/azure/core/amqp/implementation/AmqpChannelProcessor.java#L104
but the latest core-amqp 2.0.4 calls the endpointsState/onComplete consumer here:
https://github.com/Azure/azure-sdk-for-java/blob/master/sdk/core/azure-core-amqp/src/main/java/com/azure/core/amqp/implementation/AmqpChannelProcessor.java#L107
Attached files:
Intellij-eventhubs-disconnect-2minutes-event-consumer_core-amqp-version2.0.4.txt : shows this issue with the latest Event Hubs library.
core-amqp-version-before-amqp-connection-issuefix-wifi-considered-as-tranisient-error.txt : shows the older version of core-amqp treating it as a transient error.
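The difference described above, between the endpoint-state stream ending through onError and ending through onComplete, can be pictured with a toy model. This is not the SDK's AmqpChannelProcessor code and makes no claim about its actual logic; `ToyChannelProcessor`, its method names and its action strings are all hypothetical, and it only encodes the reporter's observation: an error signal is treated as transient and leads to a request for a new connection, while a completion signal ends the processor without any retry.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not SDK code) of a processor reacting to the end of a connection's endpoint states. */
final class ToyChannelProcessor {
    private final int maxRetries;
    private final List<String> actions = new ArrayList<>();
    private int attempts;
    private boolean terminated;

    ToyChannelProcessor(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** Endpoint states ended with an error: treated as transient until retries run out. */
    void onEndpointError(Throwable error) {
        if (terminated) {
            return;
        }
        if (attempts < maxRetries) {
            attempts++;
            actions.add("request-new-connection");
        } else {
            terminated = true;
            actions.add("give-up");
        }
    }

    /** Endpoint states completed: in this model nothing asks for a replacement connection. */
    void onEndpointComplete() {
        if (terminated) {
            return;
        }
        terminated = true;
        actions.add("terminated-without-retry");
    }

    /** Returns the recorded actions in order, for inspection. */
    List<String> actions() {
        return actions;
    }
}
```

In this model, a transport error delivered through `onEndpointError` yields "request-new-connection", whereas the same failure surfacing through `onEndpointComplete` yields "terminated-without-retry", which matches the receiver that never reconnects in the attached 2.0.4 log.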