Curator Snapshot/delete action is failing which points to snapshot_missing_exception. #1697
Comments
I know it's easiest to blame Curator, but Curator only makes Elasticsearch API calls. That means you would get the exact same result if you made the same API calls from a terminal or from the Dev Tools Console in Kibana. The error you're seeing is at the Elasticsearch level: Curator is only logging that the Elasticsearch API call it made generated an error on the Elasticsearch side. The error suggests that a file copy did not pass a checksum test. Additionally, you might want to verify your node's Elasticsearch version. You report that you're running 7.17.12, but the stack trace you shared indicates that the node is running a different version.
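For illustration, the equivalent calls can be made directly from the Dev Tools Console; the repository and snapshot names below are placeholders:

```
# Same kind of snapshot call Curator issues on your behalf
PUT _snapshot/my_backup_repo/my_snapshot-001?wait_for_completion=true

# Check which Elasticsearch version each node is actually running
GET _cat/nodes?v&h=name,version
```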
Thanks for the clarification @untergeek. But do you have any other suggestions to tackle this problem?
Is it an actual, local filesystem? Or is it mounted from something over the network?
We have an Elasticsearch pod running on Kubernetes, and the backup repository is mounted directly from the local filesystem.
Persistent Volume? If so, what is backing it?
Yes, at the moment it's a persistent volume with infra-level backups taken at regular intervals (which I know is not a good practice). We are also planning to move to NFS, but right now we are really concerned about the problem happening here.
I'm concerned that the local file mount is partially to blame, but I'd need to know more about what storage class backs your PV. Elasticsearch works fine with NFS mounts, S3, Google Storage, Azure Blob, etc. But a PV can be backed by so many different storage classes that it could have something to do with how Kubernetes handles it.
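If it helps, the storage class and backing volume of the PV can be checked with kubectl, for example (namespace and PV name are placeholders):

```sh
# List claims and the PVs bound to them
kubectl get pvc -n <namespace>

# Inspect spec.storageClassName and the backing volume type (hostPath, nfs, csi, ...)
kubectl get pv <pv-name> -o yaml
```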
Adding further details (sorry, I missed including this earlier):
What kind of drive is the hostPath? Local drive? NVMe? SATA? Network mount? |
Local Drive |
We have a single-node ES cluster hosting around 15 aliases, and it is working perfectly.
We have developed a backup and restore solution using Curator actions, which was working perfectly fine.
The solution works as follows:
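The individual steps aren't reproduced here, but as a rough sketch, a Curator snapshot action for this kind of workflow could look like the following (the repository name my_backup_repo and the index prefix my-index- are placeholders, not the actual configuration):

```yaml
actions:
  1:
    action: snapshot
    description: >-
      Snapshot the selected indices into the registered fs repository.
    options:
      repository: my_backup_repo
      name: curator-%Y%m%d%H%M%S
      wait_for_completion: True
      ignore_empty_list: True
    filters:
    - filtertype: pattern
      kind: prefix
      value: my-index-
```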
Suddenly, we started seeing failures in the snapshot operation, as below:
On checking further, we found the below errors in the ES logs:
On checking the snapshot repo with the below curl:
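The exact curl command isn't reproduced here, but repository-level checks of this kind can be done with calls such as the following (my_backup_repo is a placeholder):

```
# List all snapshots registered in the repository, with their state
GET _snapshot/my_backup_repo/_all

# Verify that every node can read and write to the repository
POST _snapshot/my_backup_repo/_verify
```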
There is a workaround to make the snapshot operation work again: deregister the repo, delete the data from the backup mount, then recreate the backup repo and register it. After this it returns to business as usual, but after a certain interval it ends up with the same behavior, where snapshot operations start failing.
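For reference, the deregister/recreate cycle described above corresponds to API calls roughly like these (the repository name and location are placeholders; for an fs repository the location must be listed under path.repo in elasticsearch.yml):

```
# Deregister the repository (this does not remove the files on disk by itself)
DELETE _snapshot/my_backup_repo

# After clearing the backup mount, register the repository again
PUT _snapshot/my_backup_repo
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups"
  }
}
```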
Expected Behavior
The above-mentioned process, where the snapshot and delete actions are used.
We run the snapshot-creation and old-snapshot-deletion operations every 6 hours.
The expectation is that it keeps taking the latest snapshot and also maintains the latest two snapshots in the repository, as sketched below.
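Retaining only the two most recent snapshots is typically done with a Curator delete_snapshots action and a count filter, along these lines (repository name is a placeholder, not the actual configuration):

```yaml
actions:
  1:
    action: delete_snapshots
    description: >-
      Keep only the two most recent snapshots in the repository.
    options:
      repository: my_backup_repo
      retry_interval: 120
      retry_count: 3
    filters:
    - filtertype: count
      count: 2
      use_age: True
      source: creation_date
```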
Actual Behavior
The snapshot process mentioned above works for a while, then one day it starts failing with a partial snapshot error:
Failed to complete action: snapshot. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: Snapshot PARTIAL completed with state: PARTIAL
Error in ES logs:
As per the above logs, it seems Elasticsearch was not able to find an index-related file. But this is happening in production, and the files were never removed manually, so I suspect something is going wrong here.
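The per-shard failure reasons for a PARTIAL snapshot can also be pulled directly from the snapshot APIs, for example (repository and snapshot names are placeholders):

```
# Per-shard state and failure details for one snapshot
GET _snapshot/my_backup_repo/<snapshot_name>/_status

# Quick overview of all snapshots and their state (SUCCESS / PARTIAL / FAILED)
GET _cat/snapshots/my_backup_repo?v
```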
Specifications
ES version: 7.16.3
Curator version: 5.8.4
Context (Environment)
This is causing issues in production, where we end up with a corrupted backup repository, and the only solution we have is to recreate it, losing all existing backups. In the worst case the backups become useless, because there is no way to restore the data from a corrupted backup repo.