Snapshots on large indices fail on some shards when master election occurs #35229

Closed
clandry94 opened this issue Nov 3, 2018 · 6 comments
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@clandry94

Elasticsearch version (bin/elasticsearch --version): 6.4.1

Plugins installed: [analysis-kuromoji, analysis-icu, repository-gcs]

JVM version (java -version): 10.0.2

OS version (uname -a if on a Unix-like system): Linux 4.14.56+

Description of the problem including expected versus actual behavior:
Expected: Creating a snapshot of a large index (> 8TB) with a high number of shards (> 800) on a cluster with dedicated master nodes completes successfully, even if a master election occurs mid-snapshot.

Actual: If a master election occurs during the snapshot, creating a snapshot of a large index (> 8TB) with a high number of shards (> 800) on a cluster with dedicated master nodes fails on some shards with IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: FileAlreadyExistsException.

I think that this issue may be known and handled in the case that an election occurs during the finalizeSnapshot step as seen here https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java#L550. This seems like a similar bug. Perhaps the new master attempts to also create a snapshot of the shard, overwriting the successful snapshot with the failed?

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query etc. The easier you make for
us to reproduce it, the more likely that somebody will take the time to look at it.

  1. Create a cluster with dedicated master nodes
  2. Register a GCS snapshot repository and create a large index that exceeds the sizes above (>8TB index size and >800 shards)
  3. Perform a snapshot (snapshots at this size will take between 30m and 2 hours)
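
For reference, the repository and snapshot steps above can be sketched as REST requests. This is only an illustration: the repository, bucket, snapshot, and index names are placeholders that do not come from our cluster, and the helper just builds the requests rather than sending them.

```python
import json


def build_snapshot_requests(repo, bucket, snapshot, index):
    """Return (method, path, body) tuples for the reproduction steps.

    All names passed in are hypothetical placeholders. Nothing is sent
    over the network; send the requests with any HTTP client you like.
    """
    # Register a GCS repository (requires the repository-gcs plugin).
    register_repo = (
        "PUT",
        f"/_snapshot/{repo}",
        {"type": "gcs", "settings": {"bucket": bucket}},
    )
    # Start the snapshot without blocking, since it can run for a long time.
    create_snapshot = (
        "PUT",
        f"/_snapshot/{repo}/{snapshot}?wait_for_completion=false",
        {"indices": index},
    )
    # Poll per-shard progress while the snapshot is running.
    check_status = ("GET", f"/_snapshot/{repo}/{snapshot}/_status", None)
    return [register_repo, create_snapshot, check_status]


if __name__ == "__main__":
    requests_to_send = build_snapshot_requests(
        "my_gcs_repo", "my-bucket", "snapshot-1", "my-large-index"
    )
    for method, path, body in requests_to_send:
        print(method, path)
        if body is not None:
            print(json.dumps(body, indent=2))
```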

Provide logs (if relevant):
Snapshot status in _cat/snapshots/<repo name>
2018-10-16t19-57-07 PARTIAL 1539719828 19:57:08 1539722642 20:44:02 46.8m 70 2716 4 2720
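
To make the line above easier to read, here is a small parsing sketch. It assumes the default `_cat/snapshots` column order (id, status, start_epoch, start_time, end_epoch, end_time, duration, indices, successful_shards, failed_shards, total_shards); the class and function names are made up for illustration.

```python
from typing import NamedTuple


class CatSnapshotRow(NamedTuple):
    """One row of _cat/snapshots output, assuming the default columns."""
    id: str
    status: str
    start_epoch: int
    start_time: str
    end_epoch: int
    end_time: str
    duration: str
    indices: int
    successful_shards: int
    failed_shards: int
    total_shards: int


def parse_cat_snapshot_line(line):
    """Split a whitespace-separated _cat/snapshots row into named fields."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, got {len(fields)}")
    return CatSnapshotRow(
        fields[0], fields[1],
        int(fields[2]), fields[3],
        int(fields[4]), fields[5],
        fields[6],
        int(fields[7]), int(fields[8]), int(fields[9]), int(fields[10]),
    )


row = parse_cat_snapshot_line(
    "2018-10-16t19-57-07 PARTIAL 1539719828 19:57:08 "
    "1539722642 20:44:02 46.8m 70 2716 4 2720"
)
print(row.status, row.failed_shards, "of", row.total_shards, "shards failed")
```

Read this way, the snapshot covered 70 indices and ended as PARTIAL because 4 of 2720 shards failed.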

Snippet of errors seen in logs from viewing snapshot status in API

{
          "index" : "<index_name>",
          "index_uuid" : "<index_name>",
          "shard_id" : 178,
          "reason" : "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: FileAlreadyExistsException[indices/Y0qyoa00TuaVx0iYLpVbXw/178/__1oc: Precondition Failed]; ",
          "node_id" : "lkwyayn5QVGh8okWvRIJpg",
          "status" : "INTERNAL_SERVER_ERROR"
        },
        {
          "index" : "<index_name>",
          "index_uuid" : "<index_name>",
          "shard_id" : 45,
          "reason" : "IndexShardSnapshotFailedException[com.google.cloud.storage.StorageException: Error writing request body to server]; nested: StorageException[Error writing request body to server]; nested: IOException[Error writing request body to server]; ",
          "node_id" : "xlJM3LeIS2SlitmN71R6fA",
          "status" : "INTERNAL_SERVER_ERROR"
        },
}
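
With this many shards it helps to group the failures by their innermost nested exception. The sketch below assumes the failure entries have already been loaded as a list of dicts with the `shard_id` and `reason` fields shown above; the regex and function names are my own and only handle the `Name[...]; nested: Name[...]` format seen in these two entries.

```python
import re
from collections import Counter

# Matches an exception class name that is immediately followed by '['.
EXCEPTION_NAME = re.compile(r"([A-Za-z_][\w.]*(?:Exception|Error))\[")


def innermost_exception(reason):
    """Return the last exception named in a 'nested:' chain, or 'unknown'."""
    names = EXCEPTION_NAME.findall(reason)
    # Drop any package prefix so only the simple class name remains.
    return names[-1].rsplit(".", 1)[-1] if names else "unknown"


def summarize_failures(failures):
    """Count shard failures by innermost exception class."""
    return Counter(innermost_exception(f["reason"]) for f in failures)


failures = [
    {
        "shard_id": 178,
        "reason": "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; "
                  "nested: FileAlreadyExistsException[indices/Y0qyoa00TuaVx0iYLpVbXw/178/__1oc: Precondition Failed]; ",
    },
    {
        "shard_id": 45,
        "reason": "IndexShardSnapshotFailedException[com.google.cloud.storage.StorageException: "
                  "Error writing request body to server]; nested: StorageException[Error writing request body to server]; "
                  "nested: IOException[Error writing request body to server]; ",
    },
]

summary = summarize_failures(failures)
for name, count in summary.items():
    print(name, count)
```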
@clandry94 changed the title from "Snapshot" to "Snapshots on large indices fail on some shards when master election occurs" Nov 3, 2018
@dnhatn dnhatn added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Nov 3, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor

Thanks @clandry94, I hope we can reproduce this on an index that's smaller than 8TB 😁 Could you share the server logs from around the time of the issue? It'd be useful to see the full stack traces from those exceptions, as well as any other messages that were logged around the time of the master re-election.

@clandry94
Author

Yeah I will try to update with more info throughout the week 👍

@ywelsch
Contributor

ywelsch commented Nov 5, 2018

A few quick observations here:

@clandry94
Author

clandry94 commented Nov 11, 2018

I set up some monitoring to keep track of our cluster's master elections and determined that this is related to the issues @ywelsch linked rather than to master elections. I'm still working on figuring out why this started happening after we switched from data nodes acting as master to dedicated master nodes. I can make a PR to update the GCS client dependency in the meantime.

@ywelsch
Contributor

ywelsch commented Nov 12, 2018

> I can make a PR to update the GCS client dependency in the meantime.

Please do so. Thank you.
