Snapshots on large indices fail on some shards when master election occurs #35229

clandry94 · 2018-11-03T17:18:24Z

Elasticsearch version (bin/elasticsearch --version): 6.4.1

Plugins installed: [analysis-kuromoji, analysis-icu, repository-gcs]

JVM version (java -version): 10.0.2

OS version (uname -a if on a Unix-like system): Linux 4.14.56+

Description of the problem including expected versus actual behavior:
Expected: Creating a snapshot of a large index (> 8TB) with a high number of shards (> 800) on a cluster with dedicated master nodes completes successfully, even if a master election occurs mid-snapshot.

Actual: If a master election occurs during the snapshot, creating a snapshot of a large index (> 8TB) with a high number of shards (> 800) on a cluster with dedicated master nodes fails on some shards with IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: FileAlreadyExistsException.

I think this issue may already be known and handled for the case where an election occurs during the finalizeSnapshot step, as seen here: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java#L550. This seems like a similar bug. Perhaps the new master also attempts to create a snapshot of the shard, overwriting the successful snapshot with the failed one?

Steps to reproduce:

1. Create a cluster with dedicated master nodes.
2. Create a large index that exceeds the sizes above (>8TB index size and >800 shards) and a GCS repository to snapshot it to.
3. Perform a snapshot (snapshots at this size take between 30 minutes and 2 hours).
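
The reproduction steps above might translate into the REST calls below. This is only a sketch: the index, repository, snapshot, and bucket names are hypothetical placeholders, the shard count is simply one above the 800 mentioned in the report, and loading more than 8TB of data into the index is not shown. The script builds and prints the requests rather than sending them to a cluster.

```python
import json

# Hypothetical names; none of these appear in the original report.
INDEX = "big-index"
REPO = "gcs-repo"
SNAPSHOT = "snapshot-1"
BUCKET = "my-snapshot-bucket"


def build_repro_requests(shards=801):
    """Return (method, path, body) tuples for the reproduction steps."""
    return [
        # Register a GCS repository (requires the repository-gcs plugin).
        ("PUT", f"/_snapshot/{REPO}",
         {"type": "gcs", "settings": {"bucket": BUCKET}}),
        # Create the index with a high shard count.
        ("PUT", f"/{INDEX}",
         {"settings": {"index": {"number_of_shards": shards}}}),
        # Start the snapshot without blocking on completion.
        ("PUT", f"/_snapshot/{REPO}/{SNAPSHOT}?wait_for_completion=false",
         {"indices": INDEX}),
        # Poll per-shard progress while the snapshot runs.
        ("GET", f"/_snapshot/{REPO}/{SNAPSHOT}/_status", None),
    ]


requests_to_send = build_repro_requests()
for method, path, body in requests_to_send:
    print(method, path, json.dumps(body) if body is not None else "")
```

The requests can be replayed with any HTTP client against a test cluster; the dedicated master nodes from step 1 are a matter of node configuration and are outside this sketch.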

Provide logs (if relevant):
Snapshot status in _cat/snapshots/<repo name>
2018-10-16t19-57-07 PARTIAL 1539719828 19:57:08 1539722642 20:44:02 46.8m 70 2716 4 2720
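
For readers unfamiliar with the output above, here is a small sketch that labels the fields of that row. It assumes the row uses the default `_cat/snapshots` column order (id, status, start_epoch, start_time, end_epoch, end_time, duration, indices, successful_shards, failed_shards, total_shards); the report does not include a header line, so that mapping is an assumption.

```python
from typing import NamedTuple


class SnapshotRow(NamedTuple):
    id: str
    status: str
    start_epoch: int
    start_time: str
    end_epoch: int
    end_time: str
    duration: str
    indices: int
    successful_shards: int
    failed_shards: int
    total_shards: int


def parse_cat_snapshots_row(line: str) -> SnapshotRow:
    """Split one whitespace-separated row into named fields."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, got {len(fields)}")
    (sid, status, start_epoch, start_time, end_epoch, end_time,
     duration, indices, ok, failed, total) = fields
    return SnapshotRow(sid, status, int(start_epoch), start_time,
                       int(end_epoch), end_time, duration,
                       int(indices), int(ok), int(failed), int(total))


row = parse_cat_snapshots_row(
    "2018-10-16t19-57-07 PARTIAL 1539719828 19:57:08 "
    "1539722642 20:44:02 46.8m 70 2716 4 2720"
)
print(f"{row.failed_shards} of {row.total_shards} shards failed "
      f"across {row.indices} indices ({row.status})")
```

Read this way, the row says that 4 of 2720 shards failed, which is why the snapshot is reported as PARTIAL rather than FAILED.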

Snippet of errors seen in logs from viewing snapshot status in API

{
          "index" : "<index_name>",
          "index_uuid" : "<index_name>",
          "shard_id" : 178,
          "reason" : "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: FileAlreadyExistsException[indices/Y0qyoa00TuaVx0iYLpVbXw/178/__1oc: Precondition Failed]; ",
          "node_id" : "lkwyayn5QVGh8okWvRIJpg",
          "status" : "INTERNAL_SERVER_ERROR"
        },
        {
          "index" : "<index_name>",
          "index_uuid" : "<index_name>",
          "shard_id" : 45,
          "reason" : "IndexShardSnapshotFailedException[com.google.cloud.storage.StorageException: Error writing request body to server]; nested: StorageException[Error writing request body to server]; nested: IOException[Error writing request body to server]; ",
          "node_id" : "xlJM3LeIS2SlitmN71R6fA",
          "status" : "INTERNAL_SERVER_ERROR"
        },
}
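
With several thousand shards it can help to group failures by their innermost exception before reading individual entries. The following is a minimal sketch that does this for the two failures quoted above; the helper names `exception_chain` and `root_cause` are made up for illustration, and the parsing relies only on the `Name[message]; nested: Name[message]` shape visible in the `reason` strings.

```python
from collections import Counter
from typing import List

# The two "reason" strings from the shard failures quoted in the report.
failures = [
    {"shard_id": 178,
     "reason": "IndexShardSnapshotFailedException[Failed to perform snapshot "
               "(index files)]; nested: FileAlreadyExistsException[indices/"
               "Y0qyoa00TuaVx0iYLpVbXw/178/__1oc: Precondition Failed]; "},
    {"shard_id": 45,
     "reason": "IndexShardSnapshotFailedException[com.google.cloud.storage."
               "StorageException: Error writing request body to server]; "
               "nested: StorageException[Error writing request body to "
               "server]; nested: IOException[Error writing request body to "
               "server]; "},
]


def exception_chain(reason: str) -> List[str]:
    """Return the exception class names in the order they are nested."""
    parts = reason.split("nested:")
    return [p.split("[", 1)[0].strip() for p in parts if "[" in p]


def root_cause(reason: str) -> str:
    """Return the innermost (last) exception name in the chain."""
    return exception_chain(reason)[-1]


by_cause = Counter(root_cause(f["reason"]) for f in failures)
for cause, count in by_cause.items():
    print(cause, count)
```

For the two entries shown, this separates the `FileAlreadyExistsException` failure from the `IOException` one, which are discussed separately later in the thread.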

elasticmachine · 2018-11-03T19:35:40Z

Pinging @elastic/es-distributed

DaveCTurner · 2018-11-04T18:09:42Z

Thanks @clandry94, I hope we can reproduce this on an index that's smaller than 8TB 😁 Could you share the server logs from around the time of the issue? It'd be useful to see the full stack traces from those exceptions, as well as any other messages that were logged around the time of the master re-election.

clandry94 · 2018-11-04T19:22:50Z

Yeah I will try to update with more info throughout the week 👍

ywelsch · 2018-11-05T13:55:50Z

A few quick observations here:

The exceptions you're seeing here are coming from the data nodes, not the master node. What makes you think that these have to do with master failovers?
The first exception, FileAlreadyExistsException[indices/Y0qyoa00TuaVx0iYLpVbXw/178/__1oc: Precondition Failed], comes from writing a data file that appears to already exist. I wonder if this is triggered by a retry from the GCS client.
The second exception, IOException[Error writing request body to server], is also reported in googleapis/google-cloud-java#3410 ("Problems uploading files to Cloud Storage: com.google.cloud.storage.StorageException: Error writing request body to server"), and it looks like the Google clients team has worked on a fix in googleapis/google-cloud-java#3433 ("Retry IOException: Error writing request body to server"). We should probably update our GCS client dependency.
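
To illustrate how a client-side retry could produce the first exception, here is a toy model. It is not the GCS client or the Elasticsearch repository code: `InMemoryBucket`, `create_if_absent`, and `upload_with_retry` are hypothetical names, and the scenario (the write is committed but the response is lost, so the client retries a create-only upload) is an assumption that mirrors the hypothesis above rather than a confirmed root cause.

```python
class PreconditionFailed(Exception):
    """Raised when a create-only write targets a blob that already exists."""


class ConnectionDropped(Exception):
    """Raised when the response is lost after the write was committed."""


class InMemoryBucket:
    def __init__(self, drop_responses=0):
        self.blobs = {}
        self._drop_responses = drop_responses

    def create_if_absent(self, name, data):
        # Create-only semantics: refuse to overwrite an existing blob.
        if name in self.blobs:
            raise PreconditionFailed(f"{name}: Precondition Failed")
        self.blobs[name] = data
        # Simulate a connection that dies after the server stored the blob.
        if self._drop_responses > 0:
            self._drop_responses -= 1
            raise ConnectionDropped("response lost after write committed")


def upload_with_retry(bucket, name, data, attempts=2):
    """Naively retry the whole create-only upload on connection errors."""
    for attempt in range(attempts):
        try:
            bucket.create_if_absent(name, data)
            return
        except ConnectionDropped:
            if attempt == attempts - 1:
                raise


bucket = InMemoryBucket(drop_responses=1)
name = "indices/Y0qyoa00TuaVx0iYLpVbXw/178/__1oc"
try:
    upload_with_retry(bucket, name, b"segment bytes")
    outcome = "ok"
except PreconditionFailed as exc:
    outcome = f"failed: {exc}"
print(outcome)
```

In this model the blob is stored correctly by the first attempt, yet the caller still sees a failure because the retry trips over its own earlier write. A retry that tolerates an existing blob with identical content would avoid the false failure, though whether that applies to the real client is not established in this thread.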

clandry94 · 2018-11-11T14:55:58Z

I set up some monitoring to keep track of our cluster's master elections and determined that this is related to the issues @ywelsch linked rather than to master elections. I'm still working on figuring out why this started happening after we switched from data nodes acting as masters to dedicated master nodes. I can make a PR to update the GCS client dependency in the meantime.

ywelsch · 2018-11-12T08:56:17Z

> I can make a PR to update the GCS client dependency in the meantime.

Please do so. Thank you.

clandry94 changed the title ~~Snapshot~~ Snapshots on large indices fail on some shards when master election occurs Nov 3, 2018

dnhatn added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Nov 3, 2018

clandry94 mentioned this issue Nov 12, 2018

Increment google cloud apis to 1.52 #35459

Closed

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Dec 14, 2018

SNAPSHOTS: Upgrade GCS Dependencies to 1.55.0

e9703d9

* Closes elastic#35459 * Closes elastic#35229

original-brownbear mentioned this issue Dec 14, 2018

SNAPSHOTS: Upgrade GCS Dependencies to 1.55.0 #36634

Merged

original-brownbear closed this as completed in #36634 Dec 14, 2018

original-brownbear added a commit that referenced this issue Dec 14, 2018

SNAPSHOTS: Upgrade GCS Dependencies to 1.55.0 (#36634)

5df9321

* Closes #35459 * Closes #35229

original-brownbear added a commit that referenced this issue Dec 21, 2018

SNAPSHOTS: Upgrade GCS Dependencies to 1.55.0 (#36634)

aadebed

* Closes #35459 * Closes #35229
