Retry get_from_translog during relocations #104579
Conversation
Pinging @elastic/es-distributed (Team:Distributed)
@@ -194,44 +200,91 @@ private void handleGetOnUnpromotableShard(GetRequest request, IndexShard indexShard
    refreshRequest,
    listener.delegateFailureAndWrap((l, replicationResponse) -> super.asyncShardOperation(request, shardId, l))
);
} else if (request.realtime()) {
This branch has been moved as-is to tryGetFromTranslog
    r.segmentGeneration()
);
if (r.segmentGeneration() == -1) {
    // Nothing to wait for (no previous unsafe generation), just handle the Get locally.
I wonder if there is a corner case here where we might "lose" the previous unsafe generation during a relocation. For example, right after a switch from unsafe to safe, we flush, record the generation, and the commit is uploaded; then the indexing shard is moved to a new node. The GET hits the old node and has to be retried on the new node, which is not aware of any unsafe generation and would return -1. Is there anything preventing the search shard from handling the get locally if, for whatever reason, the commit that was created on the old node is still not on the search shard? It is probably a bit of an extreme corner case. If it is possible, one way to handle it would be to conservatively set lastUnsafeSegmentGenerationForGets to the current generation upon a relocation (or, in general, not start from -1 but set it to the generation we recover from). But that might also be overkill.
I need some time to think over this corner case. But it does remind me of a higher-level question: why do we need to retry the RTG on a relocating shard? I don't think we do that for Stateful, i.e. the list of shards for GET is computed once and any subsequent cluster changes do not affect it. In serverless, if we have an index with 1 primary shard and 1 replica, we do not retry a "non"-RTG if the search shard is relocating. If that is true, why should we retry an RTG when the primary shard is relocating?
I think the separation of the index/search tiers is the difference. An RTG is ALWAYS cross-tier in stateless, and therefore things like relocation shouldn't simply disrupt the request; it is an internal corner case that we should cover.
As for the corner case I've mentioned above, I can try to verify it in a test. For now, we could maybe ignore it.
As for the corner case I've mentioned above, I can try to verify it in a test. For now, we could maybe ignore it.
I believe this could also happen today w/o the retry logic of get_from_translog. If a get (not a retried one, but a new one) hits the new indexing shard (freshly relocated), and the search shard hasn't finished receiving/processing the new commit that comes out of the relocation-flush (and is on a generation that is less than the lastUnsafeGeneration), the search shard could get -1 back from get_from_translog, and handle the get locally. As mentioned, this is a bit extreme, but might be possible.
So I think we shouldn't tie that corner-case to this PR. I'll open a separate PR with at least a test to verify it and a fix if necessary.
I believe this could also happen today
Right. I also came to realise the same thing during my morning time walk, fresh air helped :)
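To make the flow under discussion concrete, here is a rough sketch of how the get_from_translog response could be handled on the search shard side. This is illustrative only: it assumes the response exposes its GetResult via getResult(), and handleLocally and waitForSegmentGeneration are assumed helper names, not the actual production methods.

void handleTranslogResponse(TransportGetFromTranslogAction.Response r, GetRequest request, ShardId shardId, ActionListener<GetResponse> listener) {
    if (r.getResult() != null) {
        // The indexing shard answered the real-time get from its translog.
        listener.onResponse(new GetResponse(r.getResult()));
    } else if (r.segmentGeneration() == -1) {
        // Nothing to wait for (no previous unsafe generation): handle the get locally.
        handleLocally(request, shardId, listener); // assumed helper
    } else {
        // Wait until the search shard has processed the returned generation, then get locally.
        waitForSegmentGeneration( // assumed helper
            shardId,
            r.segmentGeneration(),
            listener.delegateFailureAndWrap((l, ignored) -> handleLocally(request, shardId, l))
        );
    }
}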
Just had a first pass and left a few questions to start with. Thanks!
server/src/main/java/org/elasticsearch/action/get/TransportGetAction.java (resolved)
tryGetFromTranslog(request, indexShard, state, listener.delegateResponse((l, e) -> {
    final var cause = ExceptionsHelper.unwrapCause(e);
    logger.debug("get_from_translog failed", cause);
    if (cause instanceof ShardNotFoundException || cause instanceof IndexNotFoundException) {
I am not familiar with what errors we can run into and should retry on in this situation. Are these two exceptions sufficient to cover all cases? I noticed there is a TransportActions#isShardNotAvailableException method. Some of those exceptions do not seem to be retryable, e.g. AlreadyClosedException. But I am not sure whether we want to consider some of the others in this list?
These are the two that I think typically happen if the request hits a node that the shard was relocated away from (at least based on the IT).

I noticed there is a TransportActions#isShardNotAvailableException method.

Yes, not all of those exceptions seem relevant here. Unfortunately, I don't think these exceptions are used consistently everywhere, and there might be some overlap between them. I've added NoShardAvailableActionException and UnavailableShardsException, which seem to be relevant for relocation. Thanks for the suggestion. As for testing, I think not all of those are straightforward to trigger in an IT, but I'll add a simple test where these are returned randomly to exercise the retry.
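For clarity, a rough sketch of the retry test being discussed (illustrative only; note that later in this thread the list is narrowed back to just the first two exceptions):

final var cause = ExceptionsHelper.unwrapCause(e);
final boolean retryable = cause instanceof ShardNotFoundException
    || cause instanceof IndexNotFoundException
    || cause instanceof NoShardAvailableActionException
    || cause instanceof UnavailableShardsException;
if (retryable == false) {
    // Not a "the shard moved" style failure: surface it to the caller instead of waiting on cluster state changes.
    l.onFailure(e);
    return;
}
// Otherwise fall through to the ClusterStateObserver-based retry shown below.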
@Override
public void onTimeout(TimeValue timeout) {
    l.onFailure(new ElasticsearchException("cs observer timed out", cause));
Nit: this error message may end up being user visible? If so, I think we should make it a bit more descriptive (the same argument goes for the error message above).
public void onNewClusterState(ClusterState state) {
    getFromTranslog(request, indexShard, state, observer, l);
}
IIUC, this method runs on the cluster state thread. In that case, we should dispatch the transportService.sendRequest call.
I am not sure that is strictly necessary when we are not doing any heavy computation. I see quite a few places where we just send a request in onNewClusterState.
Thanks for making the extra effort to confirm this is not an issue.
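Putting the pieces of this thread together, a rough sketch of the observer-driven retry (illustrative only; it reuses the names from the quoted snippets, and the onTimeout message is reworded per the nit above):

observer.waitForNextChange(new ClusterStateObserver.Listener() {
    @Override
    public void onNewClusterState(ClusterState newState) {
        // Re-send the get_from_translog request; cheap enough to do inline on the cluster state thread.
        getFromTranslog(request, indexShard, newState, observer, l);
    }

    @Override
    public void onClusterServiceClose() {
        l.onFailure(new NodeClosedException(clusterService.localNode()));
    }

    @Override
    public void onTimeout(TimeValue timeout) {
        // A more descriptive, potentially user-visible message than "cs observer timed out".
        l.onFailure(new ElasticsearchException(
            "timed out after [" + timeout + "] waiting for a cluster state change to retry get_from_translog", cause));
    }
});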
final var cause = ExceptionsHelper.unwrapCause(e);
logger.debug("get_from_translog failed", cause);
if (cause instanceof ShardNotFoundException || cause instanceof IndexNotFoundException) {
    observer.waitForNextChange(new ClusterStateObserver.Listener() {
Do we want to have a state predicate instead of retrying on every state change?
Not sure if that is necessary or what the guideline is on this. This shouldn't be a very common occurrence, so I assume retrying on the next change should be fine. Unless there is a good reason not to, I'd rather keep it simple. I also see quite a few other places where we retry simply on the next state.
Thanks for explaining. IIUC, this can potentially retry for an extended period of time: if the cluster state changes within 60 seconds and the triggered retry still fails, a new retry will be scheduled. If the cluster state keeps having at least one change within the next 60 seconds, it will then retry again. This is likely fine, but I'd like to ensure that we explicitly agree on this. Or please let me know if my understanding is off.
@ywangd (just in case this helps with the review) this whole pattern of how we retry and chase the primary is the same one used for TransportSendRecoveryCommitRegistrationAction. The exception is that while doing this, I realized that at least IndexNotFoundException should also be considered, since depending on shard allocations we could get either IndexNotFoundException or ShardNotFoundException. (I need to open a PR to adjust that in TransportSendRecoveryCommitRegistrationAction too.)
If the cluster state keeps having at least one change within the next 60 seconds, it will then retry again.

Yes, that's true. Although since we set the parent task ID on the get_from_translog requests, I assume/expect that at some point the get request's timeout would cancel this if it is set, or it would stop once explicitly cancelled.
The GET request does not seem to have a timeout, nor is it cancellable. I think we want to make it at least cancellable in the future.
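A minimal sketch of what that cancellability idea could look like (purely illustrative; it assumes the parent get task, here called getTask, is made a CancellableTask, which is not what this PR does):

// Before scheduling another retry on the next cluster state change:
if (getTask instanceof CancellableTask cancellable && cancellable.isCancelled()) {
    l.onFailure(new TaskCancelledException("parent get request was cancelled, not retrying get_from_translog"));
    return;
}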
    r.segmentGeneration()
);
if (r.segmentGeneration() == -1) {
    // Nothing to wait for (no previous unsafe generation), just handle the Get locally.
LGTM
observer.waitForNextChange(new ClusterStateObserver.Listener() {
    @Override
    public void onNewClusterState(ClusterState state) {
        getFromTranslog(request, indexShard, state, observer, l);
We could add a log message for the retry. If this is rare enough, it can probably be INFO; otherwise, we could go with DEBUG.
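For example, something along these lines (a sketch; DEBUG is chosen on the assumption that relocation-driven retries may not be rare enough for INFO):

// Log each observer-driven retry so relocations that trigger it are visible in the logs.
logger.debug("retrying get_from_translog for [{}][{}] on cluster state version [{}]", shardId, request.id(), state.version());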
final var observer = new ClusterStateObserver(
    state,
    clusterService,
    TimeValue.timeValueSeconds(60),
I wonder how we decided on the 60-second timeout. Is it to match the default timeout from ReplicationRequest, or the default global timeout for ClusterStateObserver?
It is to match the default of the ClusterStateObserver itself, but I have to use a different constructor, so I pass it explicitly.
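In other words, roughly this (a sketch; the logger and thread context are assumed to be whatever the transport action already has at hand):

final var observer = new ClusterStateObserver(
    state,
    clusterService,
    TimeValue.timeValueSeconds(60), // matches the observer's own default, but this constructor requires it explicitly
    logger,
    threadPool.getThreadContext()
);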
            );
        }
    }
}), TransportGetFromTranslogAction.Response::new, getExecutor(request, shardId))
Not for this PR: I think we should make sure TransportGetFromTranslogAction uses the system thread pool for system indices, similar to how it is done for TransportGetAction. Currently it always uses the regular GET threadpool:

elasticsearch/server/src/main/java/org/elasticsearch/action/get/TransportGetFromTranslogAction.java, line 55 in 35cc9e1:
super(NAME, transportService, actionFilters, Request::new, transportService.getThreadPool().executor(ThreadPool.Names.GET));

A follow-up on this would be useful.
Good catch. I'll follow this up in a new PR.
@ywangd Checking this now, I don't think this is an issue. In TransportGetAction we also use the GET threadpool in the constructor, so I don't think any change is necessary here. The getExecutor is enough.
The getExecutor here ensures the response is handled with the right thread pool. It does nothing for request handling, i.e. TransportGetFromTranslogAction always handles the request with the GET thread pool.

TransportGetAction is indeed instantiated with the GET thread pool, but it does not just use it: it forks to the right thread pool with either asyncGet or asyncShardOperation. Before handleGetOnUnpromotableShard was added, TransportGetAction always forked. With handleGetOnUnpromotableShard, it sends transport actions without forking, which is fine since we have established that these are cheap enough to perform inline. My point is that the request handling on the other side, in this case TransportGetFromTranslogAction, should fork to use the relevant thread pool, which overall matches the behaviour before handleGetOnUnpromotableShard was introduced.
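A simplified sketch of the kind of executor selection being asked for on the TransportGetFromTranslogAction side (illustrative only; the real TransportGetAction goes through its executor-selection plumbing rather than a helper like this):

// Pick the thread pool to fork request handling to: system indices get the
// dedicated system read pool, everything else the regular GET pool.
private static String executorNameFor(ClusterState clusterState, ShardId shardId) {
    final var indexMetadata = clusterState.metadata().index(shardId.getIndex());
    return indexMetadata != null && indexMetadata.isSystem()
        ? ThreadPool.Names.SYSTEM_READ
        : ThreadPool.Names.GET;
}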
    ClusterStateObserver observer,
    ActionListener<GetResponse> listener
) {
    tryGetFromTranslog(request, indexShard, state, listener.delegateResponse((l, e) -> {
It is a little tricky readability-wise that this (IIUC) only works because tryGetFromTranslog throws a NoShardAvailableActionException rather than passing it through the listener when the shard is unassigned (we do not want to wait in that case). I also think it may not work if this fails on the retry, since the exception will then be thrown back into onNewClusterState.

Can we refine this?
@henningandersen I guess you mean getCurrentNodeOfPrimary. That's true. I'll open a PR so it will not throw but rather return null, so we can feed that back to the listener.

As for the case where there is currently no node, i.e. the primary is unassigned, I thought we want to retry on that. Why not retry with a new state in case the shard gets assigned?
My knowledge is limited in the allocation/relocation area, so please correct me if I am wrong here.

In my understanding, the PR is to retry when the primary shard is relocating. When a shard is relocating, can getCurrentNodeOfPrimary throw? It throws either when the index has a null primary (what does this mean? unassigned?) or when the primary is neither started nor relocating. So it seems to me that when getCurrentNodeOfPrimary throws, we should stop retrying, because the primary is not actually relocating. When reviewing the PR, I thought this was intentional. But maybe I am wrong?
I think failing fast is the expected behaviour here. That is at least my reading of the regular get action and how we normally handle reads.

Notice that the 60s delay is also not user specified here. I am not suggesting to make it user specified, but if we wait, gets during unassigned shards would halt for 60s and then report an error, which is not ideal.
So making sure getCurrentNodeOfPrimary's failure goes through the listener rather than being thrown is just good practice, and I simply missed it in the first place. I'll do that anyway.

The larger issue is about retries of (m)getFromTranslog. When I prepared the PR I considered only two exceptions, IndexNotFoundException and ShardNotFoundException. As we discussed, it seemed that there are more exceptions to retry on, and those are not necessarily relocation-caused, I think.

So I guess my question goes back to those exceptions: should we retry on NoShardAvailableActionException and UnavailableShardsException? To me it seems reasonable to do so. If not, I can take them out of the list.
It seems we were replying at the same time. Didn't see yours! :)

Thanks for the clarification. Then I'll open a PR to fix getCurrentNodeOfPrimary and change the exceptions to retry only on the first two, which is what I had initially.
Yes, that looks reasonable to me.
* Change release version lookup to an instance method (#104902)
* Upgrade to Lucene 9.9.2 (#104753): This commit upgrades to Lucene 9.9.2.
* Improve `CANNOT_REBALANCE_CAN_ALLOCATE` explanation (#104904): Clarify that in this situation there is a rebalancing move that would improve the cluster balance, but there's some reason why rebalancing is not happening. Also points at the `can_rebalance_cluster_decisions` as well as the node-by-node decisions, since the action needed could be described in either place.
* Get from translog fails with large dense_vector (#104700): This change fixes the engine to apply the current codec when retrieving documents from the translog. We need to use the same codec as the main index in order to ensure that all the source data is indexable. The internal codec treats some fields differently than the default one, for instance dense_vectors are limited to 1024 dimensions. This PR ensures that these customizations are applied when indexing documents for translog retrieval. Closes #104639. Co-authored-by: Elastic Machine <[email protected]>
* [Connector Secrets] Add delete API endpoint (#104815)
  * Add DELETE endpoint for /_connector/_secret/{id}
  * Add endpoint to write_connector_secrets cluster privilege
* Merge Aggregations into InternalAggregations (#104896): This commit merges Aggregations into InternalAggregations in order to remove the unnecessary hierarchy.
* [Profiling] Simplify cost calculation (#104816)
  * [Profiling] Add the number of cores to HostMetadata
  * Update AWS pricelist (remove cost_factor, add usd_per_hour)
  * Switch cost calculations from 'cost_factor' to 'usd_per_hour'
  * Remove superfluous CostEntry.toXContent()
  * Check for Number type in CostEntry.fromSource()
  * Add comment
* Retry get_from_translog during relocations (#104579): During a promotable relocation, a `get_from_translog` sent by the unpromotable shard to handle a real-time get might encounter `ShardNotFoundException` or `IndexNotFoundException`. In these cases, we should retry. This is just for `GET`; a second PR will follow for `mGET`. The relevant IT is in the Stateless PR. Relates ES-5727
* indicating fix for 8.12.1 for int8_hnsw (#104912)
* Removing the assumption from some tests that the request builder's request() method always returns the same object (#104881)
* [DOCS] Adds get setting and update settings asciidoc files to security API index (#104916)
  * [DOCS] Adds get setting and update settings asciidoc files to security API index.
  * [DOCS] Fixes references in docs.
* Reuse APMMeterService of APMTelemetryProvider (#104906)
* Mute more tests that tend to leak searchhits (#104922)
* ESQL: Fix SearchStats#count(String) to count values not rows (#104891): SearchStats#count incorrectly counts the number of documents (or rows) in which a value appears instead of the actual number of values. This PR fixes this by looking at the term frequency instead of the doc count. Fix #104795
* Adding request source for cohere (#104926)
* Fixing a broken javadoc comment in ReindexDocumentationIT (#104930): This fixes a javadoc comment that was broken by #104881
* Fix enabling / disabling of APM agent "recording" in APMAgentSettings (#104324)
* Add `type` parameter support, for sorting, to the Query API Key API (#104625): This adds support for the `type` parameter, for sorting, to the Query API key API. The type for an API Key can currently be either `rest` or `cross_cluster`. This was overlooked in #103695 when support for the `type` parameter was first introduced only for querying.
* Apply publish plugin to es-opensaml-security-api project (#104933)
* Support `match` for the Query API Key API (#104594): This adds support for the `match` query type to the Query API key Information API. Note that since string values associated to API Keys are mapped as `keywords`, a `match` query with no analyzer parameter is effectively equivalent to a `term` query for such fields (e.g. `name`, `username`, `realm_name`). Relates: #101691
* [Connectors API] Relax strict response parsing for get/list operations (#104909)
* Limit concurrent shards per node for ESQL (#104832): Today, we allow ESQL to execute against an unlimited number of shards concurrently on each node. This can lead to cases where we open and hold too many shards, equivalent to opening too many file descriptors or using too much memory for FieldInfos in ValuesSourceReaderOperator. This change limits the number of concurrent shards to 10 per node. This number was chosen based on the _search API, which limits it to 5. Besides the primary reason stated above, this change has other implications: we might execute fewer shards for queries with LIMIT only, leading to scenarios where we execute only some high-priority shards then stop. For now, we don't have a partial reduce at the node level, but if we introduce one in the future, it might not be as efficient as executing all shards at the same time. There are pauses between batches because batches are executed sequentially one by one. However, I believe the performance of queries executing against many shards (after can_match) is less important than resiliency. Closes #103666
* [DOCS] Support for nested functions in ES|QL STATS...BY (#104788)
  * Document nested expressions for stats
  * More docs
  * Apply suggestions from review:
    - count-distinct.asciidoc: content restructured, moving the section about approximate counts to the end of the doc.
    - count.asciidoc: clarified that omitting the `expression` parameter in `COUNT` is equivalent to `COUNT(*)`, which counts the number of rows.
    - percentile.asciidoc: moved the note about `PERCENTILE` being approximate and non-deterministic to the end of the doc.
    - stats.asciidoc: clarified the `STATS` command; added a note indicating that individual `null` values are skipped during aggregation
  * Comment out mentioning a buggy behavior
  * Update sum with inline function example, update test file
  * Fix typo, delete line, simplify wording, fix conflict
  Co-authored-by: Liam Thompson <[email protected]>
  Co-authored-by: Liam Thompson <[email protected]>
* [ML] Passing input type through to cohere request (#104781)
  * Pushing input type through to cohere request
  * Switching logic to allow request to always override
  * Fixing failure
  * Removing getModelId calls
  * Addressing feedback
  * Switching to enumset
* [Transform] Unmute 2 remaining continuous tests: HistogramGroupByIT and TermsGroupByIT (#104898)
* Adding ActionRequestLazyBuilder implementation of RequestBuilder (#104927): This introduces a second implementation of RequestBuilder (#104778). As opposed to ActionRequestBuilder, ActionRequestLazyBuilder does not create its request until the request() method is called, and does not hold onto that request (so each call to request() gets a new request instance). This PR also updates BulkRequestBuilder to inherit from ActionRequestLazyBuilder as an example of its use.
* Update versions to skip after backport to 8.12 (#104953)
* Update/Cleanup references to old tracing.apm.* legacy settings in favor of the telemetry.* settings (#104917)
* Exclude tests that do not work in a mixed cluster scenario (#104935)
* ES|QL: Improve type validation in aggs for UNSIGNED_LONG and better support for VERSION (#104911)
* [Connector API] Make update configuration action non-additive (#104615)
* Save allocating enum values array in two hot spots (#104952): Our readEnum code instantiates/clones enum value arrays on read. Normally this doesn't matter much, but the two spots adjusted here are visibly hot during bulk indexing, causing GBs of allocations during e.g. the http_logs indexing run.
* ESQL: Correct out-of-range filter pushdowns (#99961): Fix pushed-down filters for binary comparisons that compare a byte/short/int/long with an out-of-range value, like WHERE some_int_field < 1E300.
* [DOCS] Dense vector element type should be float for OpenAI (#104966)
* Fix test assertions (#104963)
* Move functions that generate lucene geometries under a utility class (#104928): We have functions that generate lucene geometries scattered in different places of the code. This commit moves everything under a utility class.
* fixing index versions

Co-authored-by: Simon Cooper <[email protected]>
Co-authored-by: Chris Hegarty <[email protected]>
Co-authored-by: David Turner <[email protected]>
Co-authored-by: Jim Ferenczi <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
Co-authored-by: Navarone Feekery <[email protected]>
Co-authored-by: Ignacio Vera <[email protected]>
Co-authored-by: Tim Rühsen <[email protected]>
Co-authored-by: Pooya Salehi <[email protected]>
Co-authored-by: Keith Massey <[email protected]>
Co-authored-by: István Zoltán Szabó <[email protected]>
Co-authored-by: Moritz Mack <[email protected]>
Co-authored-by: Costin Leau <[email protected]>
Co-authored-by: Jonathan Buttner <[email protected]>
Co-authored-by: Albert Zaharovits <[email protected]>
Co-authored-by: Mark Vieira <[email protected]>
Co-authored-by: Jedr Blaszyk <[email protected]>
Co-authored-by: Nhat Nguyen <[email protected]>
Co-authored-by: Abdon Pijpelink <[email protected]>
Co-authored-by: Liam Thompson <[email protected]>
Co-authored-by: Liam Thompson <[email protected]>
Co-authored-by: Przemysław Witek <[email protected]>
Co-authored-by: Joe Gallo <[email protected]>
Co-authored-by: Lorenzo Dematté <[email protected]>
Co-authored-by: Luigi Dell'Aquila <[email protected]>
Co-authored-by: Armin Braun <[email protected]>
Co-authored-by: Alexander Spies <[email protected]>
Co-authored-by: David Kyle <[email protected]>
Limit RTG exceptions to retry on and do not throw in `getCurrentNodeOfPrimary`. Follow-up to #104579 (comment). Relates ES-5727
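A rough sketch of the getCurrentNodeOfPrimary change described in that follow-up (illustrative only; the exact routing-table lookups in the real code may differ):

// Return null instead of throwing when the primary is not active, so the caller
// can fail the listener (e.g. with a NoShardAvailableActionException) instead of
// having the exception escape into places like onNewClusterState.
static DiscoveryNode getCurrentNodeOfPrimary(ClusterState clusterState, ShardId shardId) {
    final var shardRoutingTable = clusterState.routingTable().shardRoutingTable(shardId);
    final var primaryShard = shardRoutingTable.primaryShard();
    if (primaryShard == null || primaryShard.active() == false) {
        return null;
    }
    return clusterState.nodes().get(primaryShard.currentNodeId());
}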
During a promotable relocation, a `get_from_translog` sent by the unpromotable shard to handle a real-time get might encounter `ShardNotFoundException` or `IndexNotFoundException`. In these cases, we should retry.

This is just for `GET`. I'll open a second PR for `mGET`. The relevant IT is in the Stateless PR.

Relates ES-5727