Fix for EH receiver timeout while opening #21324

Merged
JamesBirdsall merged 19 commits into Azure:main from the timeout300 branch on Jun 22, 2021

JamesBirdsall
Contributor

The CBSChannel is a shared resource on the MessagingFactory, used by all senders and receivers opened on the same EventHubClient. If a sender or receiver tries to send a token while the existing session to the $cbs node is closing, a race can occur that leaves the MessagingFactory in a bad state. New senders and receivers then time out while trying to open because the auth step stalls:

  1. The existing CBS session is closing.
  2. CBSChannel.sendToken is called, which calls CBSChannel.innerSendToken, which calls FaultTolerantObject.runOnOpenedObject.
  3. FaultTolerantObject.runOnOpenedObject detects that the RequestResponseChannel it wraps is in state CLOSING, so it decides to create a new one and sets this.creatingNewInnerObject to true. It then calls RequestResponseOpener.run, which sees that RequestResponseOpener.isOpened is still true and short-circuits, doing nothing at all.
  4. Later, RequestResponseOpener gets the close callback from RequestResponseChannel and sets RequestResponseOpener.isOpened to false.
  5. Still later, the user tries to create a new receiver on the same EventHubClient.
  6. MessageReceiver tries to create a receive link.
    a. The first step is to send a token via CBSChannel.sendToken, which chains down to FaultTolerantObject.runOnOpenedObject.
    b. FaultTolerantObject.runOnOpenedObject sees that this.creatingNewInnerObject is true (left over from step 3) and just queues the action, assuming it will be handled when the channel is finally opened. But nothing is opening the channel, so the queued action never runs.
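
The sequence above can be replayed in a small single-threaded model. This is only a sketch: `StallSketch`, `Opener`, `FaultTolerant`, `pending`, and `simulate` are hypothetical stand-ins, not the SDK's classes, and the model assumes that the action in step 3 is queued in the same way as the one in step 6b. It shows the two flags drifting apart, so that no open is ever attempted and the queued token sends never run.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of the race; not the real SDK classes.
final class StallSketch {
    enum ChannelState { OPEN, CLOSING, CLOSED }

    // Stand-in for RequestResponseOpener (before the fix).
    static final class Opener {
        boolean isOpened = true; // stays true until the close callback arrives
        int openAttempts = 0;

        void run() {
            if (isOpened) {
                return; // short-circuit: believes a channel is already open
            }
            openAttempts++; // a real opener would start opening a channel here
        }

        void onChannelClosed() {
            isOpened = false; // step 4: arrives after run() already gave up
        }
    }

    // Stand-in for FaultTolerantObject.
    static final class FaultTolerant {
        final Opener opener = new Opener();
        ChannelState innerState = ChannelState.OPEN;
        boolean creatingNewInnerObject = false;
        final Queue<Runnable> pending = new ArrayDeque<>();

        void runOnOpenedObject(Runnable action) {
            if (innerState == ChannelState.OPEN) {
                action.run();
                return;
            }
            pending.add(action); // assumption: actions wait for the next open
            if (!creatingNewInnerObject) {
                creatingNewInnerObject = true; // never reset in this scenario
                opener.run();
            }
        }
    }

    // Returns {openAttempts, tokensSent, actionsStillQueued}.
    static int[] simulate() {
        FaultTolerant cbs = new FaultTolerant();
        int[] tokensSent = {0};

        cbs.innerState = ChannelState.CLOSING;         // step 1
        cbs.runOnOpenedObject(() -> tokensSent[0]++);  // steps 2-3
        cbs.opener.onChannelClosed();                  // step 4
        cbs.innerState = ChannelState.CLOSED;
        cbs.runOnOpenedObject(() -> tokensSent[0]++);  // steps 5-6

        return new int[] { cbs.opener.openAttempts, tokensSent[0], cbs.pending.size() };
    }
}
```

In this model `simulate()` ends with zero open attempts, zero tokens sent, and two actions still queued, which corresponds to the open timeout seen by the new receiver.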

This is similar to, but not the same as, a previous race condition that was caused by tracking the same state in two different places and letting the two get out of sync. In this case, RequestResponseOpener.isOpened tracks more than just the state of the inner RequestResponseChannel, so we don't want to switch to using only the state of the RequestResponseChannel. The proposed fix is for RequestResponseOpener.run to also check the state of the inner RequestResponseChannel: if the state is mixed (isOpened is still true but the RequestResponseChannel is CLOSING or CLOSED), it uses a continuation to replay the call to run() once the close callback for the existing channel has finished cleanup and set isOpened back to false.
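
A minimal sketch of the continuation idea, under the same simplifications as the description: `FixSketch`, `Opener`, `replayAfterClose`, `onOpened`, and `simulate` are hypothetical names, opening is modeled as synchronous, and this is not the code merged in this PR. It only illustrates deferring `run()` while the state is mixed and replaying it from the close callback.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of the proposed fix; not the merged code.
final class FixSketch {
    enum ChannelState { OPEN, CLOSING, CLOSED }

    // Stand-in for RequestResponseOpener with the extra state check.
    static final class Opener {
        ChannelState channelState = ChannelState.OPEN;
        boolean isOpened = true;
        int openAttempts = 0;
        private Runnable replayAfterClose;  // continuation for the mixed state
        private final Runnable onOpened;    // invoked once a new channel is open

        Opener(Runnable onOpened) {
            this.onOpened = onOpened;
        }

        void run() {
            if (isOpened) {
                if (channelState != ChannelState.OPEN) {
                    // Mixed state: isOpened is stale, so remember to retry
                    // after the close callback instead of doing nothing.
                    replayAfterClose = this::run;
                }
                return;
            }
            openAttempts++;
            channelState = ChannelState.OPEN; // synchronous open, for brevity
            isOpened = true;
            onOpened.run();
        }

        void onChannelClosed() {
            channelState = ChannelState.CLOSED;
            isOpened = false; // cleanup finished
            Runnable replay = replayAfterClose;
            replayAfterClose = null;
            if (replay != null) {
                replay.run(); // replay the deferred call to run()
            }
        }
    }

    // Returns {attemptsBeforeClose, attemptsAfterClose, tokensSent, actionsStillQueued}.
    static int[] simulate() {
        Queue<Runnable> pending = new ArrayDeque<>();
        int[] tokensSent = {0};
        Opener opener = new Opener(() -> {
            Runnable next;
            while ((next = pending.poll()) != null) {
                next.run(); // drain actions that were waiting for the open
            }
        });

        opener.channelState = ChannelState.CLOSING; // existing session closing
        pending.add(() -> tokensSent[0]++);         // token send is queued
        opener.run();                               // deferred, not dropped
        int attemptsBeforeClose = opener.openAttempts;
        opener.onChannelClosed();                   // close callback replays run()

        return new int[] { attemptsBeforeClose, opener.openAttempts, tokensSent[0], pending.size() };
    }
}
```

Here the open attempt happens only after the close callback has reset `isOpened`, and the queued token send is then drained rather than left waiting.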

@ghost ghost added the Event Hubs label May 12, 2021
@JamesBirdsall JamesBirdsall self-assigned this May 12, 2021
@JamesBirdsall JamesBirdsall requested a review from sjkwak May 12, 2021 00:48

check-enforcer bot commented Jun 8, 2021

This pull request is protected by Check Enforcer.

What is Check Enforcer?

Check Enforcer helps ensure all pull requests are covered by at least one check-run (typically an Azure Pipeline). When all check-runs associated with this pull request pass then Check Enforcer itself will pass.

Why am I getting this message?

You are getting this message because Check Enforcer did not detect any check-runs associated with this pull request within five minutes. This may indicate that your pull request is not covered by any pipelines, in which case Check Enforcer is correctly blocking the pull request from being merged.

What should I do now?

If the check-enforcer check-run is not passing and all other check-runs associated with this PR are passing (excluding license-cla) then you could try telling Check Enforcer to evaluate your pull request again. You can do this by adding a comment to this pull request as follows:
/check-enforcer evaluate
Typically evaluation only takes a few seconds. If you know that your pull request is not covered by a pipeline and this is expected, you can override Check Enforcer using the following command:
/check-enforcer override
Note that using the override command triggers alerts so that follow-up investigations can occur (PRs still need to be approved as normal).

What if I am onboarding a new service?

Often, new services do not have validation pipelines associated with them. To bootstrap pipelines for a new service, you can issue the following command as a pull request comment:
/azp run prepare-pipelines
This will run a pipeline that analyzes the source tree and creates the pipelines necessary to build and validate your pull request. Once the pipeline has been created, you can trigger it using the following comment:
/azp run java - [service] - ci

@JamesBirdsall JamesBirdsall merged commit 7335011 into Azure:main Jun 22, 2021
@JamesBirdsall JamesBirdsall deleted the timeout300 branch June 22, 2021 23:12