Snapshots on large indices fail on some shards when master election occurs #35229
Comments
Pinging @elastic/es-distributed
Thanks @clandry94, I hope we can reproduce this on an index that's smaller than 8TB 😁 Could you share the server logs from around the time of the issue? It'd be useful to see the full stack traces from those exceptions, as well as any other messages that were logged around the time of the master re-election.
Yeah, I will try to update with more info throughout the week 👍
A few quick observations here:
I set up some monitoring to keep track of our cluster's master elections and determined that this is related to the issues @ywelsch linked rather than to master election. I'm still working on figuring out why this happened after we switched from data nodes acting as master to dedicated master nodes. I can make a PR to update the GCS client dependency in the meantime.
Please do so. Thank you.
* Closes elastic#35459
* Closes elastic#35229
Elasticsearch version (bin/elasticsearch --version): 6.4.1
Plugins installed: [analysis-kuromoji, analysis-icu, repository-gcs]
JVM version (java -version): 10.0.2
OS version (uname -a if on a Unix-like system): Linux 4.14.56+
Description of the problem including expected versus actual behavior:
Expected: Creating a snapshot for a large index (> 8TB) with a high number of shards (> 800) on a cluster that has dedicated master nodes completes successfully even if a master election occurs mid-snapshot.
Actual: Creating a snapshot for a large index (> 8TB) with a high number of shards (> 800) on a cluster that has dedicated master nodes fails on some shards with IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: FileAlreadyExistsException if a master election occurs during the snapshot.
I think that this issue may be known and handled in the case that an election occurs during the finalizeSnapshot step, as seen here: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java#L550. This seems like a similar bug. Perhaps the new master also attempts to create a snapshot of the shard, overwriting the successful snapshot with the failed one?
Steps to reproduce:
Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query etc. The easier you make for
us to reproduce it, the more likely that somebody will take the time to look at it.
Provide logs (if relevant):
Snapshot status in _cat/snapshots/<repo name>:
2018-10-16t19-57-07 PARTIAL 1539719828 19:57:08 1539722642 20:44:02 46.8m 70 2716 4 2720
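For anyone reading the status line above, here is a minimal Python sketch that splits it into named fields. The column names are an assumption based on the default _cat/snapshots column order (id, status, start/end epoch and time, duration, indices, successful/failed/total shards); the helper name parse_cat_snapshot_line is hypothetical and not part of any Elasticsearch client.

```python
from typing import Dict

# Assumed default column order of _cat/snapshots output (not confirmed in this thread).
COLUMNS = [
    "id", "status", "start_epoch", "start_time", "end_epoch", "end_time",
    "duration", "indices", "successful_shards", "failed_shards", "total_shards",
]


def parse_cat_snapshot_line(line: str) -> Dict[str, str]:
    """Split one whitespace-separated _cat/snapshots row into named fields."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


# The row reported in this issue.
row = parse_cat_snapshot_line(
    "2018-10-16t19-57-07 PARTIAL 1539719828 19:57:08 "
    "1539722642 20:44:02 46.8m 70 2716 4 2720"
)

failed = int(row["failed_shards"])
total = int(row["total_shards"])
# Wall-clock duration derived from the two epoch columns, in seconds.
elapsed_seconds = int(row["end_epoch"]) - int(row["start_epoch"])

print(f"{row['id']}: {row['status']}, {failed} of {total} shards failed, "
      f"{elapsed_seconds} s elapsed")
```

Read this way, the row says the snapshot ended PARTIAL with 4 of 2720 shards failed across 70 indices.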
Snippet of errors seen in logs from viewing snapshot status in API