Curator Snapshot/delete action is failing which points to snapshot_missing_exception. #1697
Comments
I know it's easiest to blame Curator, but Curator only makes Elasticsearch API calls. That means you would get the exact same result if you made the same API calls from a terminal or from the Dev Tools Console in Kibana. The error you're seeing is at the Elasticsearch level: Curator is only logging that the Elasticsearch API call it made generated an error on the Elasticsearch side. The error suggests that a file copy did not pass a checksum test. Additionally, you might want to verify your node's Elasticsearch version. You report that you're running 7.17.12, but the stack trace you shared indicates that the node is running a different version.
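For illustration, the equivalent calls can be made directly from the Dev Tools Console; the repository and snapshot names below are placeholders:

```
# Same kind of snapshot call Curator issues on your behalf
PUT _snapshot/my_backup_repo/my_snapshot-001?wait_for_completion=true

# Check which Elasticsearch version each node is actually running
GET _cat/nodes?v&h=name,version
```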
Thanks for the clarification @untergeek. But do you have any other suggestions to tackle this problem?
Is it an actual, local filesystem? Or is it mounted from something over the network?
We have an Elasticsearch pod running on Kubernetes, and the backup repository is mounted directly from the local filesystem.
Persistent Volume? If so, what is backing it?
Yes, at the moment it's a persistent volume with infra-level backups taken at regular intervals (which I know is not a good practice). We are also planning to move to NFS, but right now we are really concerned about the problem happening here.
I'm concerned that the local file mount is partially to blame, but I'd need to know more about what storage class backs your PV. Elasticsearch works fine with NFS mounts, S3, Google Storage, Azure Blob, etc. But a PV can be backed by so many different storage classes that it could have something to do with how Kubernetes handles it.
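If it helps, the storage class and backing volume of the PV can be checked with kubectl, for example (namespace and PV name are placeholders):

```sh
# List claims and the PVs bound to them
kubectl get pvc -n <namespace>

# Inspect spec.storageClassName and the backing volume type (hostPath, nfs, csi, ...)
kubectl get pv <pv-name> -o yaml
```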
Adding further details (sorry, I missed including this earlier):
What kind of drive is the hostPath? Local drive? NVMe? SATA? Network mount? |
Local Drive |
We have a single-node ES cluster hosting around 15 aliases, and it is working perfectly.
We have developed a backup and restore solution using Curator actions, which was working perfectly fine.
The solution works as follows:
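The individual steps aren't reproduced here, but as a rough sketch, a Curator snapshot action for this kind of workflow could look like the following (the repository name my_backup_repo and the index prefix my-index- are placeholders, not the actual configuration):

```yaml
actions:
  1:
    action: snapshot
    description: >-
      Snapshot the selected indices into the registered fs repository.
    options:
      repository: my_backup_repo
      name: curator-%Y%m%d%H%M%S
      wait_for_completion: True
      ignore_empty_list: True
    filters:
    - filtertype: pattern
      kind: prefix
      value: my-index-
```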
Suddenly, we started seeing failures in the snapshot operation, as below:
On checking further, we found the below errors in the ES logs:
On checking the snapshot repo with the below curl:
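The exact curl command isn't reproduced here, but repository-level checks of this kind can be done with calls such as the following (my_backup_repo is a placeholder):

```
# List all snapshots registered in the repository, with their state
GET _snapshot/my_backup_repo/_all

# Verify that every node can read and write to the repository
POST _snapshot/my_backup_repo/_verify
```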
There is a workaround to make the snapshot operation work again: deregister the repo, delete the data from the backup mount, then recreate the backup repo and register it. After this it returns to business as usual, but after a certain interval it ends up with the same behavior, where snapshot operations start failing.
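For reference, the deregister/recreate cycle described above corresponds to API calls roughly like these (the repository name and location are placeholders; for an fs repository the location must be listed under path.repo in elasticsearch.yml):

```
# Deregister the repository (this does not remove the files on disk by itself)
DELETE _snapshot/my_backup_repo

# After clearing the backup mount, register the repository again
PUT _snapshot/my_backup_repo
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups"
  }
}
```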
Expected Behavior
The above-mentioned process, where the snapshot and delete actions are used.
We run the snapshot-creation and old-snapshot-deletion operations every 6 hours.
The expectation is that it keeps taking the latest snapshot and also maintains the latest two snapshots in the repository, as sketched below.
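Retaining only the two most recent snapshots is typically done with a Curator delete_snapshots action and a count filter, along these lines (repository name is a placeholder, not the actual configuration):

```yaml
actions:
  1:
    action: delete_snapshots
    description: >-
      Keep only the two most recent snapshots in the repository.
    options:
      repository: my_backup_repo
      retry_interval: 120
      retry_count: 3
    filters:
    - filtertype: count
      count: 2
      use_age: True
      source: creation_date
```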
Actual Behavior
The snapshot process mentioned above works for a while, then one day it starts failing with a partial snapshot error:
Failed to complete action: snapshot. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: Snapshot PARTIAL completed with state: PARTIAL
Error in ES logs:
As per the above logs, it seems Elasticsearch was not able to find an index-related file. But this is happening in production, and the files were never removed manually, so I suspect something is going wrong here.
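The per-shard failure reasons for a PARTIAL snapshot can also be pulled directly from the snapshot APIs, for example (repository and snapshot names are placeholders):

```
# Per-shard state and failure details for one snapshot
GET _snapshot/my_backup_repo/<snapshot_name>/_status

# Quick overview of all snapshots and their state (SUCCESS / PARTIAL / FAILED)
GET _cat/snapshots/my_backup_repo?v
```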
Specifications
ES version: 7.16.3
Curator version: 5.8.4
Context (Environment)
This is causing issues in production, where we end up with a corrupted backup repository, and the only solution we have is to recreate it, losing all existing backups. In the worst case the backups become useless, because there is no way to restore the data from a corrupted backup repo.