Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ServiceBus] Fix an issue where user error handler is not called #19189

Merged
merged 17 commits into from
Feb 7, 2022

Conversation

jeremymeng
Copy link
Member

@jeremymeng jeremymeng commented Dec 10, 2021

when subscribing to a valid topic but an invalid or disabled subscription. Issue #18798

The problem is that in these cases, the service accepts our client's cbs claim negotiation, and the receiver link is
created. The client receiver enters
the retryForeverFn to call _initAndAddCreditOperation(), where the following

  await this._messageHandlers().postInitialize();
  this._receiverHelper.addCredit(this.maxConcurrentCalls);

will return successfully, despite that the fired-and-forgot addCredit() call would later leads to
MessagingEntityDisabled or MessagingEntityNotFound errors in the underlying link. Since there's no errors thrown in
our retry-forever loop, the onErrors() callback
is not invoked. It is one place where the user error handler is called.

Because of the error, the link is detatched. We have code to re-establish the link when errors happen, so we call
_subscribeImpl() where the retryForeverFn() is called again. This goes on in an endless loop.

We used to invoke user error handler and would not attempt to re-establish connections when errors are considered
non-retryable. In PR #11973 we removed the classification of errors that subscribe() used and instead continues to
retry infinitely. We also removed the code to invoke user error handler. This PR adds code to invoke the user error
handler in _onAmqpError() so users have a chance to stop the infinite loop.

There's another problem. When users call close() on the subscription in their error handler,
this._receiverHelper.suspend() is called to suspend the receiver. However, when re-connecting we call
this._receiverHelper.resume() again in _subscribeImpl(). This essentially reset the receiver to be active and we
will not abort the attempt to initialize connection

if (this._receiverHelper.isSuspended()) {
// user has suspended us while we were initializing
// the connection. Abort this attempt - if they attempt
// resubscribe we'll just reinitialize.
throw new AbortError("Receiver was suspended during initialization.");
}

To fix it, this PR moves the resume() call out. It is supposed to only called to enable receiver before
subscribing. It should not be called when we try to re-connect.

and change `testError()` to only verify non empty `entityPath`, thus allow verifying
error messages with different format.
and invoke user error handler in that case.
@jeremymeng jeremymeng changed the title [ServiceBus] don't attempt to reconnect on non-retryable errors [ServiceBus] don't attempt to reconnect on non-retryable errors in streaming receiver. Dec 10, 2021
@jeremymeng
Copy link
Member Author

/azp run js - service-bus - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jeremymeng
Copy link
Member Author

Known browser test failure after removing "assert" dependency

deyaaeldeen
deyaaeldeen previously approved these changes Dec 11, 2021
Copy link
Member

@deyaaeldeen deyaaeldeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@ramya-rao-a
Copy link
Contributor

I believe it was intentional to keep retrying in the streaming receiver even for non retryable errors as long as the user was notified of it. Please check with @richardpark-msft and @JoshLove-msft regarding past discussions on this.

If so, then for #18798, we may want to specifcally look for MessagingEntityNotFound to break out of the loop.

when subscribing to a valid topic but an invalid or disabled subscription.

The problem is that in these two cases, the service accepts our client's cbs
claim negotiation, and the receiver link is created. The [client receiver
enters](https://github.com/Azure/azure-sdk-for-js/blob/14099039a8009d6d9687daf65d22a998e3ad7920/sdk/servicebus/service-bus/src/core/streamingReceiver.ts#L539)
the `retryForeverFn` to call `_initAndAddCreditOperation()`, where the following

```ts
  await this._messageHandlers().postInitialize();
  this._receiverHelper.addCredit(this.maxConcurrentCalls);
```

will return successfully, despite that the fired-and-forgot `addCredit()` call
would later leads to `MessagingEntityDisabled` or `MessagingEntityNotFound`
errors in the underlying link. Since there's no errors thrown in our
retry-forever loop, the `onErrors()`
[callback](https://github.com/Azure/azure-sdk-for-js/blob/14099039a8009d6d9687daf65d22a998e3ad7920/sdk/servicebus/service-bus/src/core/streamingReceiver.ts#L541)
is not invoked. It is where the user error handler is called.

Because of the error, the link is detatched. We have code to re-initializing the
link when errors happen, so we call `_subscribeImpl()` where the
`retryForeverFn()` is called again. This goes on in an endless loop.

This PR adds code to invoke the user error handler in `_onAmqpError()` when the
error code is `MessagingEntityDisabled` or `MessagingEntityNotFound` so users
have a chance to stop the infinite loop.

There's another problem. When users call `close()` on the subscription in their
error handler, `this._receiverHelper.suspend()` is called to suspend the
receiver. However, when re-connecting we call `this._receiverHelper.resume()`
again in `_subscribeImpl()`. This essentially reset the receiver to be active
and we will not abort the attempt to initialize connection

https://github.com/Azure/azure-sdk-for-js/blob/14099039a8009d6d9687daf65d22a998e3ad7920/sdk/servicebus/service-bus/src/core/streamingReceiver.ts#L578

To fix it, this PR moves the `resume()` call out. It is supposed to only called
to enable receiver before subscribing. It should not be called when we try to
re-connect.
@jeremymeng jeremymeng changed the title [ServiceBus] don't attempt to reconnect on non-retryable errors in streaming receiver. [ServiceBus] Fix an issue where user error handler is not called Dec 18, 2021
@jeremymeng jeremymeng dismissed deyaaeldeen’s stale review December 18, 2021 01:15

A different fix now. Please re-review

@@ -203,6 +203,17 @@ export class StreamingReceiver extends MessageReceiver {
sbError,
`${this.logPrefix} 'receiver_error' event occurred. The associated error is`
);
if (
sbError?.code &&
["MessagingEntityDisabled", "MessagingEntityNotFound"].includes(sbError.code)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only targeting these two errors now to limit the impact.

@jeremymeng
Copy link
Member Author

/azp run js - service-bus - tests

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jeremymeng
Copy link
Member Author

@ramya-rao-a @deyaaeldeen please have another look.

deyaaeldeen
deyaaeldeen previously approved these changes Dec 21, 2021
Copy link
Member

@deyaaeldeen deyaaeldeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me based on the PR description but I would wait for @ramya-rao-a's approval. I left a couple nitpicky comments but feel free to ignore them.

I wonder if anything here is applicable to Event Hubs though.

@ramya-rao-a
Copy link
Contributor

The fix looks good to me, I only have 2 concerns

  • Looks like all errors on the AMQP link and session after the receiver is created have been logged but ignored as far as the end user is concerned. We are now including 2 types of errors as "worthy" of being reported to the end user. We should have a discussion with the feature crew and the service team to see if there are any other that are worth being reported. Or maybe we report all and document which of these errors should be a show stopper for the end user?
  • If we are indeed restricting this change for just the 2 error types, the same should be done across languages. Therefore, this merits a discussion as well

@ramya-rao-a
Copy link
Contributor

I wonder if anything here is applicable to Event Hubs though.

Event Hubs does not have a counterpart to Service Bus's topic-subscription model. So, I would imagine any such errors should occur when the receiver is created just like how it happens with Service Bus's queues. Event Hubs does have the concept of a consumer group and partitions though. So, you can try the combination of

  • valid event hub name - invalid consumer group
  • valid event hub name - valid consumer group - invalid partitionId

@jeremymeng
Copy link
Member Author

I have tried a .NET test earlier and their user error handler was called for this error. Let me take another look at what .NET does differently.

@deyaaeldeen
Copy link
Member

@ramya-rao-a sounds good! I created #19610.

@zoladkow
Copy link

zoladkow commented Jan 4, 2022

If I may add to what @ramya-rao-a suggests

As a client I would like to be able to make informed decisions and close any connections/stop connection attempts gracefully.

If this would require to expose internal errors (which also seems most flexible), then this would require good docs on what's what. Seeing what are those "worthy" errors it's obvious this option is the correct one, or at least consistent with how current samples are written.
While automated retry is nice and could be the default, as a client I would like to have the option to add a listener/handler and interrupt the process at my discretion. It could be a separate handler just for the retry phase, which would add an extra information I could use - at what phase it occurred.

If however some internal errors are effectively equal, it would be nice to wrap them in higher level errors - just to hide the complexity.

@ramya-rao-a
Copy link
Contributor

Thanks for the input @zoladkow! We will take it into consideration in the follow up discussions for this issue.

Some extra context: I believe the general rule of thumb has been that any need of deeper control than automatic retries is made available by the other APIs on the receiver like the receiver.receiveMessages() or receiver.getMessageIterator. Both of these methods surface each and every error seen when receiving messages. The receiver.subscribe() method was meant to be a convenience option with automatic retries in a forever loop that exits only on a small set of errors.

@zoladkow
Copy link

zoladkow commented Jan 5, 2022

@ramya-rao-a

The receiver.subscribe() method was meant to be a convenience option with automatic retries in a forever loop that exits only on a small set of errors.

Wait, when you say "convenience" do you mean it's actually using those other API methods underneath? Isn't subscribe() actually streaming, where client handlers are called directly (more or less) from websocket handlers, but actually a hidden loop of receiveMessages()? I was under impression it's the former, but also I didn't go as far as reading @azure/core-amqp code... That loop, if not done using timeout/interval would be a bad thing in browser land, I think.

Or do you mean "convenience" but only that retries are handled by default?

@jeremymeng
Copy link
Member Author

Or do you mean "convenience" but only that retries are handled by default?

@zoladkow You are right. no, we are not calling receiveMessages() underneath.

@jeremymeng
Copy link
Member Author

  • We are now including 2 types of errors as "worthy" of being reported to the end user. We should have a discussion with the feature crew and the service team to see if there are any other that are worth being reported.

I check with .NET and Python for this error

  • .NET was able to invoke the user error handler as the exception is caught in their retry policy.
  • Python aborts for not-found errors (though it's understandable that the Python lib is different from other languages.)

I think this PR is fixing a bug where this error is not reported to the user in JS because we have a different way of handling error then reconnecting in JS and no error is thrown in our retry code. To end users there are no two types of errors.

@jeremymeng
Copy link
Member Author

jeremymeng commented Jan 25, 2022

Or maybe we report all and document which of these errors should be a show stopper for the end user?

yes @richardpark-msft also mentioned that there's no reason not doing this for all errors. I was a bit conservative but I can remove the limit on just these two error kinds.

@zoladkow
Copy link

zoladkow commented Feb 3, 2022

@jeremymeng
Hi, I encountered another issue with error not being exposed to client code via handlers. I wonder if perhaps it might be covered in this fix as well...

The problem is this:
image

In fact, the ASB resource which is being used in this case has IP restrictions applied, which is done by corporate policy, however the IP of ASB service is still public, which is why the browser can connect using my public outbound IP.

In any case, that is the root cause, but still would be nice to have a chance to act upon such situation in the client code. Again I can see two situations:

  • one when we're just initiating the connection and perhaps would want to apply fallback in case of errors,
  • second, when the connection was already established and retry makes sense, because the situation might just be a temporary misconfiguration of the resource.

Final AMQP messages from the above screenshot (copied as hex):

414d515000010000

000002e202000000005310d0000002d20000000aa12331653333636530613032346134653735626439356537396338343030393764645f4733407000010000601387700001d4c040404040d10000029200000002a318636f6d2e6d6963726f736f66743a6f70656e2d6572726f7200531dd00000026c00000003a318616d71703a756e617574686f72697a65642d616363657373b100000248497020686173206265656e2070726576656e74656420746f20636f6e6e65637420746f2074686520656e64706f696e742e0d0a2020202020202020202020466f72206d6f726520696e666f726d6174696f6e207365653a0d0a20202020202020202020205669727475616c204e6574776f726b207365727669636520656e64706f696e74733a0d0a20202020202020202020202020204576656e7420487562733a2068747470733a2f2f676f2e6d6963726f736f66742e636f6d2f66776c696e6b2f3f6c696e6b69643d323034343139320d0a202020202020202020202020202053657276696365204275733a2068747470733a2f2f676f2e6d6963726f736f66742e636f6d2f66776c696e6b2f3f6c696e6b69643d323034343233350d0a202020202020202020202049502046696c746572733a0d0a20202020202020202020202020204576656e7420487562733a202068747470733a2f2f676f2e6d6963726f736f66742e636f6d2f66776c696e6b2f3f6c696e6b69643d323034343432380d0a202020202020202020202020202053657276696365204275733a2068747470733a2f2f676f2e6d6963726f736f66742e636f6d2f66776c696e6b2f3f6c696e6b69643d323034343138330d0a2020202020547261636b696e6749643a62323630633436662d646161662d343531382d383833632d3930616232643131623166375f47332c2053797374656d547261636b65723a416d71704761746577617950726f76696465722c2054696d657374616d703a323032322d30322d30335431313a31373a323040

0000028802000000005318d0000002780000000100531dd00000026c00000003a318616d71703a756e617574686f72697a65642d616363657373b100000248497020686173206265656e2070726576656e74656420746f20636f6e6e65637420746f2074686520656e64706f696e742e0d0a2020202020202020202020466f72206d6f726520696e666f726d6174696f6e207365653a0d0a20202020202020202020205669727475616c204e6574776f726b207365727669636520656e64706f696e74733a0d0a20202020202020202020202020204576656e7420487562733a2068747470733a2f2f676f2e6d6963726f736f66742e636f6d2f66776c696e6b2f3f6c696e6b69643d323034343139320d0a202020202020202020202020202053657276696365204275733a2068747470733a2f2f676f2e6d6963726f736f66742e636f6d2f66776c696e6b2f3f6c696e6b69643d323034343233350d0a202020202020202020202049502046696c746572733a0d0a20202020202020202020202020204576656e7420487562733a202068747470733a2f2f676f2e6d6963726f736f66742e636f6d2f66776c696e6b2f3f6c696e6b69643d323034343432380d0a202020202020202020202020202053657276696365204275733a2068747470733a2f2f676f2e6d6963726f736f66742e636f6d2f66776c696e6b2f3f6c696e6b69643d323034343138330d0a2020202020547261636b696e6749643a62323630633436662d646161662d343531382d383833632d3930616232643131623166375f47332c2053797374656d547261636b65723a416d71704761746577617950726f76696465722c2054696d657374616d703a323032322d30322d30335431313a31373a323040

Please do let me know should I create a seprate issue.

@jeremymeng
Copy link
Member Author

@zoladkow my current plan is to expose all errors via user error handlers. You might see more types of errors in error handler after my change, but it should cover the other error you were seeing too.

* This handler will be called for any error that occurs in the receiver when
* - receiving the message, or
* - executing your `processMessage` callback, or
* - the receiver automatically completes or abandons the message.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* - the receiver automatically completes or abandons the message.
* - receiver is completing the message on your behalf after successfully running your `processMessage` callback and `autoCompleteMessages` is enabled
* - receiver is abandoning the message on your behalf if running your `processMessage` callback fails and `autoCompleteMessages` is enabled
* - receiver is renewing the lock on your behalf due to auto lock renewal feature being enabled

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add something to reflect that the receiver will automatically retry on all errors?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry behavior has not been changed in this PR though since the non-retryable classification was removed. We could add some clarification on subscribe() ref doc?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the behavior around retry has not changed. When I see docs around the different error scenarios, my next thought is around what should I do about them. Now that we are clarifying the kind of errors, it would also help in clarifying that we do indeed retry on all the errors. And perhaps what the user can do is log them on their side if needed. And maybe point them to the ref docs on ServiceBusError to see if there are any errors for which they might want to stop recieving altogether

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeremymeng jeremymeng dismissed deyaaeldeen’s stale review February 3, 2022 22:49

new changes since then

and link to ref doc on service bus errors.
@check-enforcer
Copy link

check-enforcer bot commented Feb 4, 2022

This pull request is protected by Check Enforcer.

What is Check Enforcer?

Check Enforcer helps ensure all pull requests are covered by at least one check-run (typically an Azure Pipeline). When all check-runs associated with this pull request pass then Check Enforcer itself will pass.

Why am I getting this message?

You are getting this message because Check Enforcer did not detect any check-runs being associated with this pull request within five minutes. This may indicate that your pull request is not covered by any pipelines and so Check Enforcer is correctly blocking the pull request being merged.

What should I do now?

If the check-enforcer check-run is not passing and all other check-runs associated with this PR are passing (excluding license-cla) then you could try telling Check Enforcer to evaluate your pull request again. You can do this by adding a comment to this pull request as follows:
/check-enforcer evaluate
Typically evaulation only takes a few seconds. If you know that your pull request is not covered by a pipeline and this is expected you can override Check Enforcer using the following command:
/check-enforcer override
Note that using the override command triggers alerts so that follow-up investigations can occur (PRs still need to be approved as normal).

What if I am onboarding a new service?

Often, new services do not have validation pipelines associated with them, in order to bootstrap pipelines for a new service, you can issue the following command as a pull request comment:
/azp run prepare-pipelines
This will run a pipeline that analyzes the source tree and creates the pipelines necessary to build and validate your pull request. Once the pipeline has been created you can trigger the pipeline using the following comment:
/azp run js - [service] - ci

@@ -524,10 +537,6 @@ export class StreamingReceiver extends MessageReceiver {
*/
private async _subscribeImpl(caller: "detach" | "subscribe"): Promise<void> {
try {
// this allows external callers (ie: ServiceBusReceiver) to prevent concurrent `subscribe` calls
// by not starting new receiving options while this one has started.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we retain this comment for resume()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Re-attached it to the resume() call

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reads strange. Is this for the whole _subscribeImpl() method instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, am not sure what part of the code here is preventing concurrent subscribe calls...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chat with @richardpark-msft offline. We agreed that the comments can be removed

Also fix linting issue. The linter insists "@" in ts-doc should be escaped.

The url still works after adding "\"
@jeremymeng jeremymeng merged commit 2b2b9a4 into Azure:main Feb 7, 2022
@jeremymeng jeremymeng deleted the sb/fix-inf-retry branch February 7, 2022 19:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants