NullPointerException on snapshot #29052

Closed
gkozyryatskyy opened this issue Mar 14, 2018 · 11 comments
Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@gkozyryatskyy

Elasticsearch version: 5.6.8 (Docker image docker.elastic.co/elasticsearch/elasticsearch:5.6.8)

Plugins installed:

ingest-geoip
ingest-user-agent
repository-s3
x-pack

JVM version (java -version):

openjdk version "1.8.0_161"
OpenJDK Runtime Environment (build 1.8.0_161-b14)
OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)

OS version (uname -a if on a Unix-like system):

Linux b5f28c65ef45 4.9.60-linuxkit-aufs #1 SMP Mon Nov 6 16:00:12 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
I have an S3 snapshot repository:

curl localhost:9200/_snapshot/*?pretty
{
  "weavo-backup" : {
    "type" : "s3",
    "settings" : {
      "bucket" : "...",
      "region" : "us-east-1",
      "base_path" : "dev/weavo-backup"
    }
  }
}

When I try to create a snapshot, I get a java.lang.NullPointerException:

curl -XPUT 'localhost:9200/_snapshot/weavo-backup/snapshot_1?wait_for_completion=true'
{
  "error":{
    "root_cause":[
      {
        "type":"null_pointer_exception",
        "reason":null
      }
    ],
    "type":"null_pointer_exception",
    "reason":null
  },
  "status":500
}

When I try to delete a snapshot, I also get a java.lang.NullPointerException:

curl -XDELETE localhost:9200/_snapshot/weavo-backup/curator-20180224000000?pretty
{
  "error" : {
    "root_cause" : [
      {
        "type" : "null_pointer_exception",
        "reason" : null
      }
    ],
    "type" : "null_pointer_exception",
    "reason" : null
  },
  "status" : 500
}

Provide logs (if relevant):
Snapshot error logs:

[2018-03-14T08:17:29,967][INFO ][o.e.s.SnapshotShardsService] [iatu_s1] snapshot [weavo-backup:snapshot_1/ueYAa0XASNu1RK_2i2rTZQ] is done
[2018-03-14T08:17:30,086][WARN ][o.e.s.SnapshotsService   ] [iatu_s1] [weavo-backup:snapshot_1/ueYAa0XASNu1RK_2i2rTZQ] failed to finalize snapshot
java.lang.NullPointerException: null
	at org.elasticsearch.repositories.RepositoryData.snapshotsToXContent(RepositoryData.java:357) ~[elasticsearch-5.6.8.jar:5.6.8]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.writeIndexGen(BlobStoreRepository.java:838) ~[elasticsearch-5.6.8.jar:5.6.8]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.finalizeSnapshot(BlobStoreRepository.java:568) ~[elasticsearch-5.6.8.jar:5.6.8]
	at org.elasticsearch.snapshots.SnapshotsService$5.run(SnapshotsService.java:978) [elasticsearch-5.6.8.jar:5.6.8]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) [elasticsearch-5.6.8.jar:5.6.8]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
[2018-03-14T08:17:30,088][WARN ][r.suppressed             ] path: /_snapshot/weavo-backup/snapshot_1, params: {repository=weavo-backup, wait_for_completion=true, snapshot=snapshot_1}
java.lang.NullPointerException: null
	at org.elasticsearch.repositories.RepositoryData.snapshotsToXContent(RepositoryData.java:357) ~[elasticsearch-5.6.8.jar:5.6.8]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.writeIndexGen(BlobStoreRepository.java:838) ~[elasticsearch-5.6.8.jar:5.6.8]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.finalizeSnapshot(BlobStoreRepository.java:568) ~[elasticsearch-5.6.8.jar:5.6.8]
	at org.elasticsearch.snapshots.SnapshotsService$5.run(SnapshotsService.java:978) ~[elasticsearch-5.6.8.jar:5.6.8]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) ~[elasticsearch-5.6.8.jar:5.6.8]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]

Delete error logs:

[2018-03-14T01:00:02,957][WARN ][r.suppressed             ] path: /_snapshot/weavo-backup/curator-20180224000000, params: {repository=weavo-backup, snapshot=curator-20180224000000}
java.lang.NullPointerException: null
	at org.elasticsearch.repositories.RepositoryData.snapshotsToXContent(RepositoryData.java:357) ~[elasticsearch-5.6.8.jar:5.6.8]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.writeIndexGen(BlobStoreRepository.java:838) ~[elasticsearch-5.6.8.jar:5.6.8]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.deleteSnapshot(BlobStoreRepository.java:445) ~[elasticsearch-5.6.8.jar:5.6.8]
	at org.elasticsearch.snapshots.SnapshotsService.lambda$deleteSnapshotFromRepository$6(SnapshotsService.java:1309) ~[elasticsearch-5.6.8.jar:5.6.8]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:575) ~[elasticsearch-5.6.8.jar:5.6.8]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
@gkozyryatskyy gkozyryatskyy changed the title from "NullPointerException on snaapshot" to "NullPointerException on snapshot" Mar 14, 2018
@colings86 colings86 added >bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Mar 14, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@bleskes
Contributor

bleskes commented Mar 14, 2018

@tlrx have you seen this before?

@gkozyryatskyy
Author

If this helps: all the snapshots I'm trying to remove were taken with Elasticsearch 5.4.0.

Also, I have 4 different environments with the same Elasticsearch version and configs, but I experienced this problem in only one of them. So I think the problem is in a specific snapshot.

@imotov
Contributor

imotov commented Mar 14, 2018

It reminds me of #26127, but that was fixed in 5.6.0, and in #26127 it failed because the failure reason was null, whereas here it looks like it fails because snapshotId is null. So it is probably a different issue.

@tlrx
Member

tlrx commented Mar 14, 2018

@gkozyryatskyy Is it possible that, at some point, two environments accessed the same repository to write a snapshot?

@gkozyryatskyy
Author

gkozyryatskyy commented Mar 14, 2018

@tlrx Theoretically, yes. This is a dev environment =( Someone could bring up a few environments with the same snapshot config.

@tlrx
Member

tlrx commented Mar 15, 2018

@gkozyryatskyy I've seen a similar issue when two different clusters access the same S3 repository (more precisely, the same S3 bucket): one environment is creating a snapshot while another is deleting one. This is quite rare, since the creation and the deletion must run at exactly the same time, but it can still happen, especially when a lot of indices/documents are involved in the snapshot.

@gkozyryatskyy
Author

@tlrx
Thanks a lot for your responses!

  • Is there any way to tell whether this is my case?
  • How can I apply a "hotfix" for this? Should I delete all the snapshots and create a new one? Or can I delete a specific one? Or can I clean something up in the S3 bucket?
  • In any case, I think it would be nice to fix the code so that this does not cause a NullPointerException. But that is up to you. =)

@tlrx
Member

tlrx commented Mar 15, 2018

Is there any way to tell whether this is my case?

That's not easy. Do you think your case is similar to the situation I explained in #29052 (comment)? If so, then we have an explanation.

How can I apply a "hotfix" for this? Should I delete all the snapshots and create a new one? Or can I delete a specific one? Or can I clean something up in the S3 bucket?

I think the best fix would be to create a new repository in a different S3 bucket (or in a sub-path of the same bucket; see the base_path option), let a single cluster write snapshots to it, and register it on the other clusters with the readonly option.
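
A minimal sketch of that layout, built as plain request bodies so nothing here talks to a cluster. It assumes the repository-s3 setting is spelled `readonly`; please verify that against the plugin documentation for your version. The bucket name, the base path, and the helper `s3_repository_body` are hypothetical and only serve the illustration.

```python
import json


def s3_repository_body(bucket, base_path, read_only):
    """Build the JSON body for PUT /_snapshot/<repository_name>.

    `readonly` is the assumed setting name; check the repository-s3
    documentation for the Elasticsearch version in use.
    """
    settings = {"bucket": bucket, "base_path": base_path}
    if read_only:
        settings["readonly"] = True
    return {"type": "s3", "settings": settings}


# Hypothetical names: one cluster writes, every other cluster only reads.
writer_body = s3_repository_body("example-bucket", "dev/weavo-backup-v2", read_only=False)
reader_body = s3_repository_body("example-bucket", "dev/weavo-backup-v2", read_only=True)

print(json.dumps(writer_body, indent=2))
print(json.dumps(reader_body, indent=2))
```

The writer body would be registered on the single cluster that takes snapshots, and the reader body on every other cluster, each as the body of a PUT to `_snapshot/<repository name>`.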

In any case, I think it would be nice to fix the code so that this does not cause a NullPointerException. But that is up to you. =)

I'd love to have a fix for S3 and concurrent access :) But S3 is not a filesystem: it's a replicated, distributed, consistent-after-write blob storage system. We can't really implement locks or atomic writes with it, so we cannot give any strong guarantees, except that an uploaded file will appear (after some undefined time) in the S3 bucket. If you need strong guarantees, then you should consider using a real filesystem.

@gkozyryatskyy
Author

@tlrx
Thanks a lot for your responses!

If this info helps, I'm able to delete from and snapshot to this repo with Elasticsearch 5.4.0, so it is not just an S3 problem. Theoretically, the snapshot logic could be reverted to the 5.4.0 version and it would work.

So for now, I'm thinking of deleting everything made with the older DB version and creating new snapshots with the new DB version. Changing the bucket or base path is not an option right now, because it is bound to the environment name, and it would mean renaming or changing the environment just because of DB snapshots.

@gkozyryatskyy
Author

gkozyryatskyy commented Mar 16, 2018

@tlrx
OK, here is what I did:

  • I created a temporary FS repository and took a backup there.
  • I manually deleted everything under the backup path prefix in the backup bucket.
  • I ran a new, first snapshot in the same repo/bucket/path prefix, and it succeeded.
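
The steps above can be sketched as an ordered list of requests. This is only an outline under assumptions: the temporary repository name `temp-fs` and the location `/mnt/es-backup` are hypothetical, and the helper `workaround_steps` only describes the calls, it does not send them. An `fs` repository location also has to be listed under `path.repo` on every node.

```python
def workaround_steps(broken_repo, temp_repo, temp_location, snapshot):
    """Return the workaround as (method, path, body) tuples, in order.

    A method of "MANUAL" marks a step done outside Elasticsearch.
    """
    wait = "?wait_for_completion=true"
    return [
        # 1. Register a temporary shared-filesystem repository.
        ("PUT", "/_snapshot/" + temp_repo,
         {"type": "fs", "settings": {"location": temp_location}}),
        # 2. Take a safety snapshot into it.
        ("PUT", "/_snapshot/%s/%s%s" % (temp_repo, snapshot, wait), None),
        # 3. Empty the broken repository's bucket prefix by hand.
        ("MANUAL", "delete every object under the repository base_path", None),
        # 4. Take a fresh first snapshot in the original repository.
        ("PUT", "/_snapshot/%s/%s%s" % (broken_repo, snapshot, wait), None),
    ]


steps = workaround_steps("weavo-backup", "temp-fs", "/mnt/es-backup", "snapshot_1")
for method, path, body in steps:
    print(method, path, body)
```

Note that step 3 discards every snapshot previously stored under that prefix, so it only makes sense once the safety snapshot from step 2 has completed.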

@tlrx tlrx closed this as completed in 63148dd Apr 27, 2018
tlrx added a commit that referenced this issue Apr 27, 2018
A NullPointerException is thrown when trying to create or delete
a snapshot in a repository that has been written to by an older
Elasticsearch after writing to it with a newer Elasticsearch version.

This is because the way snapshots are formatted in the repository
snapshots index file changed in #24477.

This commit changes the parsing of the repository index file so that
it now detects a corrupted index file and fails the snapshot
operation early.

closes #29052
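
To illustrate the "detect and fail early" idea from the commit message, here is a rough Python sketch. It is not the actual Elasticsearch change: the index layout below is a simplified, assumed shape, and `CorruptedRepositoryIndex` and `validate_repository_index` are hypothetical names. The point is only that an unresolvable snapshot reference is rejected at parse time with a readable message, instead of surfacing later as a null reference while the index is written back.

```python
class CorruptedRepositoryIndex(Exception):
    """Raised when the repository index references an unknown snapshot."""


def validate_repository_index(index_data):
    """Check a parsed, simplified repository index before using it.

    Assumed shape (hypothetical, not the real file format):
      {"snapshots": [{"name": ..., "uuid": ...}],
       "indices": {index_name: {"snapshots": [uuid, ...]}}}
    Returns a dict mapping snapshot uuid to snapshot name.
    """
    known = {}
    for entry in index_data.get("snapshots", []):
        if not isinstance(entry, dict) or not entry.get("name") or not entry.get("uuid"):
            raise CorruptedRepositoryIndex("malformed snapshot entry: %r" % (entry,))
        known[entry["uuid"]] = entry["name"]

    for index_name, meta in index_data.get("indices", {}).items():
        for ref in meta.get("snapshots", []):
            # Fail here with a clear message rather than carrying a missing
            # snapshot forward and hitting a null reference at write time.
            if not isinstance(ref, str) or ref not in known:
                raise CorruptedRepositoryIndex(
                    "index [%s] references unknown snapshot %r" % (index_name, ref))
    return known


good = {
    "snapshots": [{"name": "snapshot_1", "uuid": "u1"}],
    "indices": {"logs": {"snapshots": ["u1"]}},
}
# Same data, but with the per-index reference written in a different style.
mixed = {
    "snapshots": [{"name": "snapshot_1", "uuid": "u1"}],
    "indices": {"logs": {"snapshots": [{"name": "snapshot_1", "uuid": "u1"}]}},
}

print(validate_repository_index(good))
try:
    validate_repository_index(mixed)
except CorruptedRepositoryIndex as exc:
    print("rejected:", exc)
```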