Shard cannot be relocated after setting node exclusion. #57708
Pinging @elastic/es-distributed (:Distributed/Recovery)
Strange indeed @howardhuanghua. Can you share the output of
@DaveCTurner Thanks for checking this issue. Since it's the customer's production environment, we triggered retry_failed and the shard relocated successfully. It's a little bit hard to reproduce this issue. We have tried the same process in our test environment several times and can't reproduce it so far. But we did hit this issue several times when upgrading from 6.8 to 7.5.
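For reference, the retry mentioned above is the reroute API's retry_failed flag; a minimal sketch of that call (the exact invocation used on the customer's cluster isn't shown in the thread):

```
POST /_cluster/reroute?retry_failed=true
```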
Noted. Would still be useful to see those outputs if the customer is ok with that, especially
The original cluster no longer exists. I have re-created a cluster with the same version/configuration, and got the
The error message here makes it sound suspiciously like a bug we just fixed a few days ago: https://github.com/elastic/elasticsearch/pull/57187/files#r431071766 (the linked PR fixes another issue, but while @dnhatn was adding more tests, he uncovered that under certain edge conditions we were not properly setting the recovery stage). I think we can close this issue, and reopen it if this still occurs on newer versions that include the above bug fix.
If the recovery source is on an old node (before 7.2), then the recovery target won't have the safe commit after phase1 because the recovery source does not send the global checkpoint in the clean_files step. And if the recovery fails and retries, then the recovery stage won't transition properly. If a sync_id is used in peer recovery, then the clean_files step won't be executed to move the stage to TRANSLOG. This issue was addressed in #57187, but not forward-ported to 8.0. Closes #57708
If the recovery source is on an old node (before 7.2), then the recovery target won't have the safe commit after phase1 because the recovery source does not send the global checkpoint in the clean_files step. And if the recovery fails and retries, then the recovery stage won't transition properly. If a sync_id is used in peer recovery, then the clean_files step won't be executed to move the stage to TRANSLOG. Relates #57187 Closes #57708
We have hit a shard relocation issue after setting a node exclusion. In our case the original cluster is 6.8.2; to upgrade it we add the same number of new 7.5.1 nodes and exclude the 6.8.2 nodes.
However, after adding the 7.5.1 nodes and excluding the 6.8.2 nodes in the cluster settings, one shard of the single empty .kibana index cannot be relocated successfully. We have hit this issue several times.
Here is the node list after adding the new nodes; there are four 6.8.2 nodes and four 7.5.1 nodes:
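A cat nodes call along these lines is what shows the mixed-version topology (the column selection here is an assumption, not the exact command from the issue):

```
GET _cat/nodes?v&h=name,version,node.role
```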
And we apply this cluster setting to exclude the 6.8.2 data nodes:
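The exact settings request isn't reproduced here; node exclusion is typically applied with an allocation filter such as the following sketch (the attribute used, _ip vs. _name, and the addresses are assumptions):

```
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "10.0.0.1,10.0.0.2,10.0.0.3,10.0.0.4"
  }
}
```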
The cluster is empty and only contains the Kibana index. We can see the single internal .kibana_1 system index, and it contains no documents:
Finally, the shard 0 replica cannot be relocated to the new node:
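The stuck replica can be inspected with calls like these (a sketch; only the .kibana_1 index name is taken from the issue itself):

```
GET _cat/shards/.kibana_1?v&h=index,shard,prirep,state,node

GET /_cluster/allocation/explain
{
  "index": ".kibana_1",
  "shard": 0,
  "primary": false
}
```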
On the master and target node we can see this exception; there is no exception on the source node:
The cluster is in green status after the relocation fails; the shard just cannot be relocated and remains on the excluded node. This issue cannot be easily reproduced.
The key log message is:
can't move recovery to stage [FINALIZE]. current stage: [INDEX] (expected [TRANSLOG])
It seems there is some gap in the recovery process between 6.8 and 7.5.
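To see which stage a stuck recovery is sitting in (the [INDEX] vs. [TRANSLOG] mismatch from the log above), a recovery listing along these lines can help (the column selection and the active_only filter are assumptions):

```
GET _cat/recovery/.kibana_1?v&h=index,shard,type,stage,source_node,target_node&active_only=true
```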