Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating a snapshot fails at the very last stage with NullPointerException #29649

Closed
redlus opened this issue Apr 22, 2018 · 9 comments
Closed
Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@redlus
Copy link

redlus commented Apr 22, 2018

Hi,

Creating a snapshot of a single index reports it finished correctly through _snapshot/_status ("state": "SUCCESS"), but a NullPointerException is thrown and the snapshot eventually does not show up in the repository.

Elasticsearch version:
6.2.3

Plugins installed:
ingest-attachment
ingest-geoip
mapper-murmur3
mapper-size
repository-azure
repository-gcs
repository-s3

JVM version:
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

OS version:
Linux 4.13.0-1011-azure #14-Ubuntu SMP 2018 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
We've recently upgraded our elasticsearch cluster from v5.2.2 to v6.2.3, running both clusters in parallel and used the snapshots feature on Azure blob storage to replicate the data to the new cluster. After migration, new snapshot creation tasks are executed daily for backup from the new v6.2.3 cluster onto the same repositories. Initiating the create snapshots succeeds, taking the expected amount of time to copy the data to Azure and even reporting "state": "SUCCESS" on _snapshot/_status:

{
"snapshots": [
{
"snapshot": "2_newlogs_20180328",
"repository": "2_newlogs",
"uuid": "ds3cBPQOSpqIOt53-ImKBw",
"state": "SUCCESS",
"include_global_state": false,
"shards_stats": {
"initializing": 0,
"started": 0,
"finalizing": 0,
"done": 2,
"failed": 0,
"total": 2
},
"stats": {
"number_of_files": 147,
"processed_files": 147,
"total_size_in_bytes": 12541757401,
"processed_size_in_bytes": 12541757401,
"start_time_in_millis": 1524411697845,
"time_in_millis": 462463
},
"indices": {
"2_newlogs_20180328-01": {
"shards_stats": {
"initializing": 0,
"started": 0,
"finalizing": 0,
"done": 2,
"failed": 0,
"total": 2
},
"stats": {
"number_of_files": 147,
"processed_files": 147,
"total_size_in_bytes": 12541757401,
"processed_size_in_bytes": 12541757401,
"start_time_in_millis": 1524411697845,
"time_in_millis": 462463
},
"shards": {
"0": {
"stage": "DONE",
"stats": {
"number_of_files": 63,
"processed_files": 63,
"total_size_in_bytes": 6270745558,
"processed_size_in_bytes": 6270745558,
"start_time_in_millis": 1524411697845,
"time_in_millis": 439319
}
},
"1": {
"stage": "DONE",
"stats": {
"number_of_files": 84,
"processed_files": 84,
"total_size_in_bytes": 6271011843,
"processed_size_in_bytes": 6271011843,
"start_time_in_millis": 1524411697863,
"time_in_millis": 462445
}
}
}
}
}
}
]
}

After the create snapshot finishes (_snapshot/_status returns an empty array), the snapshot does not show up in the repository and a NullPointerException is thrown in the master node logs:

[2018-04-22T15:49:20,567][WARN ][o.e.s.SnapshotsService ] [prod-elasticsearch-master-003] [2_newlogs:2_newlogs_20180328/ds3cBPQOSpqIOt53-ImKBw] failed to finalize snapshot
java.lang.NullPointerException: null
at org.elasticsearch.repositories.RepositoryData.snapshotsToXContent(RepositoryData.java:341) ~[elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.writeIndexGen(BlobStoreRepository.java:674) ~[elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.finalizeSnapshot(BlobStoreRepository.java:470) ~[elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]
at org.elasticsearch.snapshots.SnapshotsService.lambda$endSnapshot$3(SnapshotsService.java:974) ~[elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.4-SNAPSHOT.jar:6.2.4-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]

This has been replicated with multiple create snapshots operations on multiple repositories.
Additionally, deleting the repository (without individually deleting the snapshots it contains) and re-creating it (thereby loading the available snapshots from 5.2.2) did not solve the problem.

I'd be happy to provide any additional information as needed.

Thanks!

@tlrx
Copy link
Member

tlrx commented Apr 23, 2018

Thanks @redlus. It looks similar to other issue we saw. I'm going to look closer at this.

@redlus
Copy link
Author

redlus commented Apr 23, 2018

Thanks. Is there an open issue for what you mentioned?

@javanna javanna added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Apr 23, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@javanna javanna added the >bug label Apr 23, 2018
@redlus
Copy link
Author

redlus commented Apr 23, 2018

@tlrx a quick update:
I've deleted all the repositories from elasticsearch, and created new (empty) repositories with the same name but pointing to a different container name within the Azure blob storage account. Snapshots are created, restored and deleted without errors.

This seems to indicate that repositories and snapshots created with elasticsearch 5.2.2 have some collision with create / restore / delete snapshot commands run on the same repos from elasticsearch 6.2.3.

@tlrx
Copy link
Member

tlrx commented Apr 23, 2018

This seems to indicate that repositories and snapshots created with elasticsearch 5.2.2 have some collision with create / restore / delete snapshot commands run on the same repos from elasticsearch 6.2.3.

Just to double check, does it mean that two clusters in different versions where writing to the same repository (Azure container) at the same time?

@redlus
Copy link
Author

redlus commented Apr 23, 2018

There were two phases. First, v5.2.2 successfully created snapshots and v6.2.3 successfully restored snapshots. Then the old v5.2.2 cluster was taken down, and v6.2.3 was now trying to create snapshots onto the same repo - resulting in the NullPointerException above.

@tlrx
Copy link
Member

tlrx commented Apr 24, 2018

This seems to indicate that repositories and snapshots created with elasticsearch 5.2.2 have some collision with create / restore / delete snapshot commands run on the same repos from elasticsearch 6.2.3.

Thanks for your responses.

In your case, you snapshotted data from 5.2.2 and restored them on 6.2.3. We usually advise to upgrade to the latest 5.x version first and then move to the next major version (see https://www.elastic.co/guide/en/elasticsearch/reference/6.2/setup-upgrade.html).

Also, according to the the documentation:

The snapshot format can change across major versions, so if you have clusters on different major versions trying to write the same repository, new snapshots written by one version will not be visible to the other. While setting the repository to readonly on all but one of the clusters should work with multiple clusters differing by one major version, it is not a supported configuration.

It means that the repository created using the 5.2.2 cluster must only be accessed by this 5.2.2 cluster for creating or deleting snapshots.

The cluster 6.2.3 can register the 5.2.2 repository too but it must be read-only. This way the 6.2.3 cluster can restore data from 5.2.2 but cannot create or delete snapshots.

If you want to create snapshots of the 6.2.3 cluster then you have to register a new repository in a different container/bucket/folder, as you did.

Is there an open issue for what you mentioned?

Finally found it :) See #29052 where a user had a similar issue than yours and the pull request #26127 that fixed the issue #25878.

I'm going to close this issue in favor of #29052.

@redlus
Copy link
Author

redlus commented Apr 24, 2018

Ok, thanks for the update.

@tlrx
Copy link
Member

tlrx commented Apr 25, 2018

Thanks @redlus, the description of your problem helped me to find a reproducible scenario where such NPE is thrown. See #30140 for some boring information :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs
Projects
None yet
Development

No branches or pull requests

5 participants