[Internal] Per Partition Automatic Failover: Fixes Metadata Requests Retry Policy #4205
…gateway timeouts.
Microsoft.Azure.Cosmos/src/HttpClient/HttpTimeoutPolicyDefault.cs
LGTM except for a few small issues
…quests_on_gateway_timeouts
Microsoft.Azure.Cosmos/src/HttpClient/HttpTimeoutPolicyMetadataRead.cs
Waiting on the policy behavior that marks the region as unavailable on 503s. Our public documentation explicitly states that we do not do this: https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/troubleshoot-sdk-availability#transient-connectivity-issues-on-tcp-protocol
Unblocking the PR as changes are coming
…ation index. Addressed review comments.
…quests_on_gateway_timeouts
LGTM now
Microsoft.Azure.Cosmos/src/MetadataRequestThrottleRetryPolicy.cs
2f427e3
LGTM, thanks
LGTM - Thanks
Pull Request Template
Description
Background:
The Cosmos .NET V3 SDK should retry on another region when fetching the collection information (Read Collection call) or the partition key range information (Get PkRanges call) if the master partition of the primary region is in complete quorum loss. However, this does not happen today, because the request to the routing gateway takes more than 65 seconds to respond, which times out the SDK request. The SDK makes 3 retries, each of which times out within 65 seconds. Today, the .NET v3 SDK doesn't retry on gateway timeouts (on TaskCanceledException), so if the metadata information is not retrieved, the SDK is stuck in initialization.
Account Setup For 3 Regions: Create a Cosmos account with 3 regions, P1 (Write), P2 (Read) and P3 (Read). The PPAF configuration from the BE is to fail over to P2 in case P1 is unavailable.
Scenario: While creating the Cosmos client, provide P1, P2 and P3 as the application preferred regions.
Before initializing the Cosmos client, use the Service Fabric commands to trigger a "full quorum loss" on the master partition.
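The client setup for this scenario might look roughly like the sketch below. It assumes the public CosmosClientOptions API of the .NET V3 SDK. The endpoint, key, and the three region names are placeholders standing in for P1, P2 and P3, not values taken from this PR, and must be replaced with real account values before running.

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Cosmos;

// Placeholder values (assumptions): substitute a real account endpoint and key.
string accountEndpoint = "https://example-account.documents.azure.com:443/";
string accountKey = "REPLACE_WITH_ACCOUNT_KEY";

CosmosClientOptions options = new CosmosClientOptions
{
    // The change in this PR is scoped to clients running in Direct mode.
    ConnectionMode = ConnectionMode.Direct,

    // Placeholder regions standing in for P1 (write), P2 (read) and P3 (read).
    ApplicationPreferredRegions = new List<string>
    {
        Regions.WestUS2,     // P1
        Regions.EastUS2,     // P2
        Regions.NorthEurope, // P3
    },
};

// Metadata (collection and partition key range information) is fetched
// through the gateway once the client starts serving requests.
using CosmosClient client = new CosmosClient(accountEndpoint, accountKey, options);
```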
Current Behavior: The SDK keeps retrying on region P1 to read the collection information and eventually times out. To understand this better, take a look at the diagnostics below:
Diagnostics Snippet - Scenario: Master Partitions are in Complete Quorum Loss.
Expected Behavior/ Acceptance Criteria:
Ideally, the above setup should have worked, and the SDK should have retried on region P2 to get the collection information from the gateway. Note that this behavior is expected regardless of whether per partition automatic failover is enabled.
Scope:
CosmosClient set in Direct mode.
CosmosClient, when all the metadata information needs to be fetched and cached.
High Level Changes:
TaskCanceledException, which is eventually wrapped in a CosmosException from the CosmosHttpClient. The idea is to extend the retry policy to retry on CosmosExceptions.
MetadataRequestThrottleRetryPolicy, which is a wrapper around the RequestThrottleRetryPolicy and specifically handles all of the metadata requests. The purpose is to mark an endpoint unavailable for reads when a gateway timeout occurs, so that the next retry can happen on another region.
Design Approach:
Current Flow:
Sample Diagnostics with Current Flow:
Scenario: Master Partitions are in Complete Quorum Loss.
Proposed Flow:
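As a rough illustration of the idea described under High Level Changes (mark the endpoint unavailable for reads on a gateway timeout, then retry on the next preferred region), here is a minimal self-contained sketch. It does not use the SDK: RegionTracker, MetadataRetrySketch and ReadMetadataWithFailover are hypothetical names, and a plain TimeoutException stands in for the gateway timeout that the SDK surfaces as a CosmosException. It is not the actual MetadataRequestThrottleRetryPolicy implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the SDK's endpoint bookkeeping: remembers which
// preferred regions have been marked unavailable for reads.
public sealed class RegionTracker
{
    private readonly List<string> preferredRegions;
    private readonly HashSet<string> unavailableForRead = new HashSet<string>();

    public RegionTracker(IEnumerable<string> preferredRegions)
    {
        this.preferredRegions = new List<string>(preferredRegions);
    }

    public void MarkUnavailableForRead(string region)
    {
        this.unavailableForRead.Add(region);
    }

    // Picks the first preferred region that is not marked unavailable.
    public bool TryResolveReadRegion(out string region)
    {
        foreach (string candidate in this.preferredRegions)
        {
            if (!this.unavailableForRead.Contains(candidate))
            {
                region = candidate;
                return true;
            }
        }

        region = string.Empty;
        return false;
    }
}

public static class MetadataRetrySketch
{
    // Simulates a metadata read (for example Read Collection). The delegate
    // throws TimeoutException to model a gateway timeout in a given region.
    public static string ReadMetadataWithFailover(
        RegionTracker tracker,
        Func<string, string> readFromRegion,
        int maxAttempts)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            if (!tracker.TryResolveReadRegion(out string region))
            {
                break; // No region left to try.
            }

            try
            {
                return readFromRegion(region);
            }
            catch (TimeoutException)
            {
                // Gateway timeout: mark the region unavailable for reads so
                // that the next attempt resolves to another region.
                tracker.MarkUnavailableForRead(region);
            }
        }

        throw new InvalidOperationException("Metadata read failed in all attempted regions.");
    }
}
```

With preferred regions P1, P2 and P3 and a delegate that times out only for P1, the first attempt marks P1 unavailable and the second attempt returns the value read from P2.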
Complete Diagnostics After the Above Design Changes:
Scenario: Master Partitions are in Complete Quorum Loss.
Type of change
Please delete options that are not relevant.
Closing issues
To automatically close an issue: closes #4181