Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure Event Hub binding stops working if storage act returns 503 #3573

Closed
MattCosturos opened this issue Oct 18, 2024 · 3 comments · Fixed by #3614
Closed

Azure Event Hub binding stops working if storage act returns 503 #3573

MattCosturos opened this issue Oct 18, 2024 · 3 comments · Fixed by #3614
Assignees
Labels
kind/bug Something isn't working
Milestone

Comments

@MattCosturos
Copy link
Contributor

Expected Behavior

Azure Event Hub binding will continue working after receiving a 503 error from the storage act used for checkpointing.

Actual Behavior

The event hub binding dies after receiving a 503 error from table storage. The component does not appear to retry the calls to the storage act, and just gives up. Logs become filled with Error from event processor: the Processor has been stopped. Create a new instance to start processing again

We are using a standard storage act and not a premium act. I could switch to a premium act for better throughput, but I would still have concerns that the component could still fail to process messages if it experiences any sort of transient error.

Can we somehow configure a retry policy to restart this component instead of filling the logs with an error?

time="2024-10-17T19:48:21.031356514Z" level=warning msg="The default value for 'spec.metric.http.increasedCardinality' will change to 'false' in Dapr 1.15 or later" app_id=component-telemetry instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.runtime.diagnostics type=log ver=1.14.4
time="2024-10-17T19:48:24.243756578Z" level=warning msg="The default value for 'spec.metric.http.increasedCardinality' will change to 'false' in Dapr 1.15 or later" app_id=component-telemetry instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.runtime.http type=log ver=1.14.4
time="2024-10-18T01:25:52.104204839Z" level=error msg="Error processing events from partition client: error receiving events: client has been closed by user" app_id=component-telemetry component="component-telemetry-1 (bindings.azure.eventhubs/v1)" instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.contrib type=log ver=1.14.4
time="2024-10-18T01:25:56.922083584Z" level=error msg="Error processing events from partition client: error receiving events: client has been closed by user" app_id=component-telemetry component="component-telemetry-0 (bindings.azure.eventhubs/v1)" instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.contrib type=log ver=1.14.4
time="2024-10-18T01:26:02.932347006Z" level=error msg="Error from event processor: GET https://my-storage-act.blob.core.windows.net/dapr-component-telemetry\n--------------------------------------------------------------------------------\nRESPONSE 503: 503 The server is busy.\nERROR CODE: ServerBusy\n--------------------------------------------------------------------------------\n\ufeff<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>ServerBusy</Code><Message>The server is busy.\nRequestId:9e9a9f71-c01e-0015-15fc-202deb000000\nTime:2024-10-18T01:25:52.0107247Z</Message></Error>\n--------------------------------------------------------------------------------\n" app_id=component-telemetry component="component-telemetry-1 (bindings.azure.eventhubs/v1)" instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.contrib type=log ver=1.14.4
time="2024-10-18T01:26:07.643955727Z" level=error msg="Error from event processor: GET https://my-storage-act.blob.core.windows.net/dapr-component-telemetry\n--------------------------------------------------------------------------------\nRESPONSE 503: 503 The server is busy.\nERROR CODE: ServerBusy\n--------------------------------------------------------------------------------\n\ufeff<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>ServerBusy</Code><Message>The server is busy.\nRequestId:9e9aad9c-c01e-0015-75fc-202deb000000\nTime:2024-10-18T01:25:56.8329114Z</Message></Error>\n--------------------------------------------------------------------------------\n" app_id=component-telemetry component="component-telemetry-0 (bindings.azure.eventhubs/v1)" instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.contrib type=log ver=1.14.4
time="2024-10-18T01:26:07.933024175Z" level=error msg="Error from event processor: the Processor has been stopped. Create a new instance to start processing again" app_id=component-telemetry component="component-telemetry-1 (bindings.azure.eventhubs/v1)" instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.contrib type=log ver=1.14.4
time="2024-10-18T01:26:12.644981572Z" level=error msg="Error from event processor: the Processor has been stopped. Create a new instance to start processing again" app_id=component-telemetry component="component-telemetry-0 (bindings.azure.eventhubs/v1)" instance=component-telemetry-microservice-54454fbdd7-mxr8s scope=dapr.contrib type=log ver=1.14.4

Steps to Reproduce the Problem

Unable to reproduce on demand. Leave the component running long enough and this can happen.

@MattCosturos MattCosturos added the kind/bug Something isn't working label Oct 18, 2024
@oising
Copy link

oising commented Nov 14, 2024

This just happened to us -- on DAPR 1.14.4, except it was failure to connect to the storage account that triggered it.

time="2024-11-12T01:47:38.209787491Z" level=error msg="Error from event processor: Get \"https://REDACTED.blob.core.windows.net/checkpoints?comp=list&include=metadata&prefix=REDACTED.servicebus.windows.net%2Fdevice-out%2Fdispatcher%2Fownership%2F&restype=container\": dial tcp:REDACTED:443: connect: connection refused" app_id=dispatcher component="binding-device-out-input (bindings.azure.eventhubs/v1)" instance=dispatcher-848b9dd7fd-2cg2w scope=dapr.contrib type=log ver=1.14.4

@MattCosturos
Copy link
Contributor Author

Ok Did some research now that I have a basic understanding of Golang.

eventhubs.go creates the Event Hub Processor on Line 246

The main loop is declared further down on Line 304

This loop does try to resume from errors that get returned by the call to processor.Run.
However, the Azure SDK only returns unrecoverable errors.

The docs state

// Run handles the load balancing loop, blocking until the passed in context is cancelled
// or it encounters an unrecoverable error. On cancellation, it will return a nil error.
//
// This function should run for the lifetime of your application, or for as long as you want
// to continue to claim and process partitions.
//
// Once a Processor has been stopped it cannot be restarted and a new instance must
// be created.

The main loop has a 5 second sleep time before retrying, which I believe lined up with the timestamps in my logs.

This component has to be handle the fact that the errors returned by processor.Run are not recoverable, and this component needs to create a new event hub processor.
It might be as simple as moving the call to aeh.getProcessorForTopic into the subscriberLoop func

@yaron2 yaron2 self-assigned this Nov 25, 2024
@yaron2 yaron2 added this to the v1.15 milestone Nov 25, 2024
@yaron2
Copy link
Member

yaron2 commented Nov 25, 2024

Assigned to 1.15

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
None yet
3 participants