GCS repository snapshot fails intermittently on some shards "Failed to check if blob exists" java.io.IOException: insufficient data written #26636

Closed
hoffoo opened this issue Sep 13, 2017 · 11 comments
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments


hoffoo commented Sep 13, 2017

Elasticsearch version (bin/elasticsearch --version): 5.5.1

Plugins installed: [repository-gcs discovery-gce]

JVM version (java -version): 1.8.0_131

OS version (uname -a if on a Unix-like system): Linux XXX 4.10.0-27-generic #30~16.04.2-Ubuntu SMP Thu Jun 29 16:07:46 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Creating a snapshot fails on certain shards. Retrying with a new snapshot works. For me it fails on about 10% of shards: testing with 51 shards, 4 failed on the last test, 2 on the retry, and 0 on the third try.

The exception is IndexShardSnapshotFailedException[BlobStoreException[Failed to check if blob [__79.part4] exists]; nested: SocketTimeoutException[Read timed out];]; nested: BlobStoreException[Failed to check if blob [__79.part4] exists]; nested: SocketTimeoutException[Read timed out];

This is using GCS cold storage.

I see that there are further options I can give the plugin, mainly http.connect_timeout and http.read_timeout, but I'm not sure whether they are relevant to the exception in the logs below: java.io.IOException: insufficient data written

I wouldn't mind this failing if I could detect it and retry. Could I do this by deleting the snapshot and recreating it? As I understand it, the shards that were backed up successfully would not be deleted if I did this. Is that right?
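
One way to detect and retry, sketched under assumptions: `es` is taken to be an elasticsearch-py client, the create call is made with `wait_for_completion=True`, and the response is assumed to carry `snapshot.state` and `snapshot.shards.failed`. The helper name `snapshot_with_retry` and the `-tryN` naming are hypothetical. Each attempt uses a fresh snapshot name rather than deleting the partial one, so the question of what a delete removes does not arise here; cleaning up partial snapshots is left to the caller.

```python
import time


def snapshot_with_retry(es, repository, base_name, max_attempts=3, pause_seconds=0):
    """Create a snapshot, retrying under a new name while any shard fails.

    Assumes `es` is an elasticsearch-py client and that the response to a
    blocking create call reports per-snapshot shard counts.
    """
    for attempt in range(1, max_attempts + 1):
        name = "{}-try{}".format(base_name, attempt)
        response = es.snapshot.create(
            repository=repository,
            snapshot=name,
            wait_for_completion=True,
        )
        info = response["snapshot"]
        # A snapshot with failed shards is expected to come back as PARTIAL.
        if info["state"] == "SUCCESS" and info["shards"]["failed"] == 0:
            return name, attempt
        if attempt < max_attempts:
            time.sleep(pause_seconds)
    raise RuntimeError(
        "snapshot still had failed shards after {} attempts".format(max_attempts)
    )
```

With the failure pattern described above (4 failed shards, then 2, then 0), this would return on the third attempt.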

Steps to reproduce:

  1. Create a gcs snapshot with these settings {"gcs":{"type":"gcs","settings":{"bucket":"XXXX","compress":"true"}}}
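
As a sketch only, the repository body from step 1 could be extended with the two timeout options mentioned in the description. The setting names `http.read_timeout` and `http.connect_timeout` come from the report; the `60s` values and the helper name `gcs_repository_body` are hypothetical, and this thread does not establish whether these timeouts have any effect on the `insufficient data written` error.

```python
def gcs_repository_body(bucket, read_timeout="60s", connect_timeout="60s"):
    """Build a GCS repository definition with explicit HTTP timeouts.

    The timeout values are placeholders, not recommendations.
    """
    return {
        "type": "gcs",
        "settings": {
            "bucket": bucket,
            "compress": "true",
            # Assumed to be accepted as repository settings in this version.
            "http.read_timeout": read_timeout,
            "http.connect_timeout": connect_timeout,
        },
    }
```

The resulting dictionary would be passed as the body when registering the repository.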

Provide logs (if relevant):

org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Failed to perform snapshot (index files)
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1377) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:972) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.5.1.jar:5.5.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.5.1.jar:5.5.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.io.IOException: insufficient data written
	at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.close(HttpURLConnection.java:3540) ~[?:?]
	at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:81) ~[?:?]
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972) ~[?:?]
	at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequestWithoutGZip(MediaHttpUploader.java:545) ~[?:?]
	at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:417) ~[?:?]
	at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336) ~[?:?]
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:427) ~[?:?]
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352) ~[?:?]
Contributor

imotov commented Sep 14, 2017

@tlrx this looks very GCE-specific and I am traveling until the end of the week. Could you take a look?

Contributor

imotov commented Sep 19, 2017

According to googleapis/google-http-java-client#333 the error message insufficient data written that we are seeing hides the real issue. It seems the situation was improved in google-http-java-client v1.24. We are still on 1.20.0, which is almost 2 years old. @tlrx, should we upgrade the plugin to the latest?

Member

tlrx commented Sep 20, 2017

@imotov Sorry, I didn't manage to find the time to look at this. I think it makes sense to upgrade the dependency but we'll have to wait for 1.24 to be released.

Contributor

imotov commented Sep 20, 2017

You are right. For some reason I thought they already had, but it looks like we will have to wait quite a bit.

tlrx added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Sep 25, 2017
Contributor

imotov commented Sep 25, 2017

@tlrx can you think of any way to move this forward besides waiting for the 1.24 release? The patch was merged almost a year ago and it still hasn't made it into any release.

Member

tlrx commented Sep 26, 2017

@imotov They recently updated the development version of the lib so I asked here if they will release 1.23 soon. Let's wait a bit for an answer, ok?


bw2 commented Nov 5, 2017

I'm currently also blocked by this.


bw2 commented Nov 5, 2017

Using the following jar versions

bash-4.3# ls -1 repository-gcs/
commons-codec-1.10.jar
commons-logging-1.1.3.jar
google-api-client-1.21.0.jar
google-api-services-storage-v1-rev66-1.21.0.jar
google-http-client-1.21.0.jar
google-http-client-jackson2-1.21.0.jar
google-oauth-client-1.21.0.jar
httpclient-4.5.2.jar
httpcore-4.4.5.jar
repository-gcs-5.6.3.jar
...


bw2 commented Nov 6, 2017

Increasing max_snapshot_bytes_per_sec as below seems to improve the percentage of shards that snapshot successfully, but I still get the error for 1 or 2 shards on large indices (300gb+):

# es is an elasticsearch-py client; bucket, base_path and snapshot_repo are defined elsewhere
body = {
    "type": "gcs",
    "settings": {
        "bucket": bucket,
        "base_path": base_path,
        "compress": True,
        "chunk_size": "100mb",
        "max_snapshot_bytes_per_sec": "1tb",
    },
}
es.snapshot.create_repository(repository=snapshot_repo, body=body)
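
To see which shards still fail on the large indices, the failures reported for a snapshot can be grouped by index. This is a minimal sketch that assumes the snapshot info returned by Elasticsearch contains a `failures` list whose entries carry `index` and `shard_id` fields; the helper name `failed_shards_by_index` is hypothetical.

```python
def failed_shards_by_index(snapshot_info):
    """Group failed shard ids by index name.

    `snapshot_info` is assumed to be the per-snapshot dictionary from a
    snapshot create or get response, with an optional `failures` list.
    """
    failed = {}
    for failure in snapshot_info.get("failures", []):
        failed.setdefault(failure["index"], []).append(failure["shard_id"])
    # Sort shard ids so the output is stable between runs.
    return {index: sorted(shards) for index, shards in failed.items()}
```

An empty dictionary means no shard failures were reported for that snapshot.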

imotov removed their assignment Nov 6, 2017
tlrx added a commit to tlrx/elasticsearch that referenced this issue Nov 14, 2017
This commit updates the google-api-client library to version 1.23.

Closes elastic#26636
tlrx added a commit that referenced this issue Nov 15, 2017
This commit updates the google-api-client library to version 1.23.0.

Related to #26636
Member

tlrx commented Nov 15, 2017

Thanks @bw2 and @hoffoo for your feedback.

A new version (1.23.0) of google-http-client was released in October 2017. I updated the versions used in the repository-gcs and discovery-gce plugins in #27381. Version 1.23.0 includes the change from googleapis/google-http-java-client#333, so the underlying exception should bubble up instead of being hidden by the insufficient data written exception.

At this stage, I suggest closing this issue for now and waiting for more tests and feedback on plugins that use the new version of google-http-java-client. This will be released in Elasticsearch 6.0.1 (and potentially in 5.6.5 if this version is released).
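
To check whether an installed plugin already carries the newer library, the jar names in the plugin directory (as in the listing posted earlier in this thread) can be compared against 1.23.0. This is a sketch, not part of the plugin: the helper names are hypothetical, and it assumes the jar follows the `google-http-client-X.Y.Z.jar` naming seen in that listing.

```python
import re

# Version of google-http-client that includes the error-reporting change.
FIXED_IN = (1, 23, 0)


def google_http_client_version(jar_names):
    """Return the google-http-client version as a tuple, or None if absent."""
    pattern = r"google-http-client-(\d+)\.(\d+)\.(\d+)\.jar"
    for jar in jar_names:
        # fullmatch skips sibling jars such as google-http-client-jackson2.
        match = re.fullmatch(pattern, jar)
        if match:
            return tuple(int(part) for part in match.groups())
    return None


def has_error_reporting_fix(jar_names):
    """True if the bundled google-http-client is at least FIXED_IN."""
    version = google_http_client_version(jar_names)
    return version is not None and version >= FIXED_IN
```

For the 5.6.3 listing above, which bundles google-http-client-1.21.0.jar, the check reports that the change is not yet included.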

tlrx closed this as completed Nov 15, 2017

bw2 commented Nov 15, 2017

Thanks @tlrx

4 participants