Add troubleshooting guide for ciphertext verification failed issue (#187)

Signed-off-by: Emruz Hossain <[email protected]>
Emruz Hossain authored Jan 4, 2022
1 parent e7e49a0 commit aa2f022
Showing 6 changed files with 80 additions and 35 deletions.
@@ -0,0 +1,45 @@
---
title: Ciphertext Verification Failed | Stash
description: Troubleshooting "ciphertext verification failed" issue
menu:
docs_{{ .version }}:
identifier: troubleshooting-ciphertext-verification-failed
name: Ciphertext verification failed
parent: troubleshooting
weight: 40
product_name: stash
menu_name: docs_{{ .version }}
section_menu_id: guides
---

# Troubleshooting "ciphertext verification failed" issue

Sometimes, the backup starts failing after a few days with an error indicating `ciphertext verification failed`. In this guide, we are going to explain what might cause the issue and how to solve it.

## Identifying the issue

Typically, the backup runs successfully for some time and then suddenly starts failing; any subsequent backups fail too. If you describe the respective `BackupSession` or view the log from the respective backup sidecar/job, you should see the following error:

```bash
Fatal: ciphertext verification failed
```
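The exact resource and pod names depend on your setup; as a sketch (placeholders in angle brackets), you can check both places like this:

```bash
# Describe the BackupSession to see the failure condition and related events
kubectl describe backupsession <backupsession name> -n <namespace>

# View the log of the backup sidecar (workload backup) or the backup job's pod
kubectl logs -n <namespace> <backup pod name> --all-containers
```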

## Possible reasons

This can happen if the backed-up data gets corrupted for any of the following reasons:

- Someone deleted some files/folders from the backend manually.
- The respective bucket has a policy configured to delete the old data automatically.

## Solution

First, check whether the bucket has any policy configured to delete old data automatically. If it does, remove that policy and rely only on the retention policy provided by the Stash `Repository` to clean up old data.

For example, if you are using a GCS bucket, you can check for such a policy in the `LIFECYCLE` tab.

<figure align="center">
<img alt="Object deletion policy in GCS" src="images/gcs_lifecycle_policy.png">
<figcaption align="center">Fig: Object deletion policy in GCS</figcaption>
</figure>
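If you prefer the command line over the console, you can inspect and clear the bucket's lifecycle rules with `gsutil`. This is only a sketch, assuming the Google Cloud SDK is installed; the bucket name is a placeholder:

```bash
# Show the current lifecycle configuration of the bucket
gsutil lifecycle get gs://<your-bucket>

# Remove all lifecycle rules by applying an empty configuration
echo '{"rule": []}' > empty-lifecycle.json
gsutil lifecycle set empty-lifecycle.json gs://<your-bucket>
```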

Unfortunately, there is no known way to repair a corrupted repository. You have to delete all the corrupted data from the backend; only then will subsequent backups succeed again.
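For example, with a GCS backend you could remove the repository directory using `gsutil`. The following is only a sketch (the bucket and prefix are placeholders) and it permanently deletes the backed-up data, so double-check the path against the `spec.backend` section of your `Repository` before running it:

```bash
# Permanently delete the corrupted repository data from the bucket
gsutil -m rm -r gs://<your-bucket>/<repository prefix>
```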
18 changes: 9 additions & 9 deletions docs/guides/latest/troubleshooting/how-to-troubleshooot/index.md
@@ -14,15 +14,15 @@ section_menu_id: guides

# How to Troubleshoot Stash Issues

This guide will give you an overview of how you can gather necessary information to identify the issue that causes the backup/restore failure.
This guide will give you an overview of how you can gather the necessary information to identify the issue that causes the backup/restore failure.

## Troubleshoot Backup Issues

In this section, we are going to explain how to troubleshoot backup issues.

### Backup was Never Triggered

If you have created the desired `BackupConfiguration` but the respective backup triggering CronJob was never created or any `BackupSession` was not created in the scheduled time, in this case follow the following steps:
If you have created the desired `BackupConfiguration` but the respective backup-triggering CronJob was never created, or no `BackupSession` was created at the scheduled time, follow these steps:

#### Describe the `BackupConfiguration`

@@ -34,13 +34,13 @@ kubectl describe backupconfiguration <backupconfiguration name> -n <namespace>

Now, check the `Status` section of `BackupConfiguration`. Make sure all the `conditions` are `True`. If there is any issue during backup setup, you should see the error in the respective condition.

Also, check the event to see if there is any indication of error.
Also, check the event to see if there is any indication of an error.
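If you want the raw data instead of the `describe` output, a sketch (names are placeholders):

```bash
# Print only the status conditions of the BackupConfiguration
kubectl get backupconfiguration <backupconfiguration name> -n <namespace> -o jsonpath='{.status.conditions}'

# List events related to the BackupConfiguration
kubectl get events -n <namespace> --field-selector involvedObject.kind=BackupConfiguration,involvedObject.name=<backupconfiguration name>
```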

#### Check Stash operator log

If you don't notice any error on the previous step, you should check the Stash operator log.
If you don't notice any error in the previous step, you should check the Stash operator log.

Run the following command to view Stash operator log:
Run the following command to view the Stash operator log:

```bash
# Identify the Stash operator pod
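# NOTE: the actual commands are collapsed in this diff view. A sketch of the usual flow
# (the operator is commonly installed in the kube-system namespace; names are placeholders):
kubectl get pods --all-namespaces | grep stash

# View the operator log from all containers of the operator pod
kubectl logs -n <namespace> <stash operator pod name> --all-containers
```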
@@ -68,7 +68,7 @@ Also, check the `Events` section. Sometimes, it can be helpful to identify the i

#### View Backup Job/Sidecar log

If you don't see any error in the previous step, you should try checking log of the respective backup job / sidecar.
If you don't see any error in the previous step, you should try checking the log of the respective backup job/sidecar.

If you are trying to back up a workload, run the following command to inspect the log:
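The exact command is collapsed in this diff view; as a sketch, assuming the Stash sidecar container injected into the workload pod is named `stash` (placeholders in angle brackets):

```bash
# View the log of the stash sidecar injected into the workload's pod
kubectl logs -n <namespace> <workload pod name> -c stash
```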

@@ -86,7 +86,7 @@ kubectl get pods -n <namespace> | grep stash-backup
kubectl logs -n <namespace> <backup pod name> --all-containers
```

Inspect the log carefully. You should notice the respective error that lead to backup failure.
Inspect the log carefully. You should notice the respective error that led to the backup failure.

## Troubleshoot Restore Issues

@@ -104,7 +104,7 @@ Also, check the `Events` section. Sometimes, it can be helpful to identify the i

#### View Restore Job/Init-Container log

If you don't see any error in the previous step, you should try checking log of the respective restore job / init-container.
If you don't see any error in the previous step, you should try checking the log of the respective restore job / init-container.

If you are trying to restore a workload, run the following command to inspect the log:
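As with backup, the command itself is collapsed here; a sketch, assuming the restore init-container is named `stash-init`:

```bash
# View the log of the stash init-container that performed the restore
kubectl logs -n <namespace> <workload pod name> -c stash-init
```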

@@ -122,4 +122,4 @@ kubectl get pods -n <namespace> | grep stash-restore
kubectl logs -n <namespace> <restore pod name> --all-containers
```

Inspect the log carefully. You should notice the respective error that lead to restore failure.
Inspect the log carefully. You should notice the respective error that led to the restore failure.
22 changes: 11 additions & 11 deletions docs/guides/latest/troubleshooting/permission-denided/index.md
@@ -14,15 +14,15 @@ section_menu_id: guides

# Troubleshooting `"permission denied"` issue

Sometimes the backup or restore fails due to permission issue. This can happen for various reasons. In this guide, we are going to explain the known scenarios when this issue can arise and what you can do to solve it.
Sometimes the backup or restore fails due to permission issues. This can happen for various reasons. In this guide, we are going to explain the known scenarios when this issue can arise and what you can do to solve it.

## Identifying the issue

If you describe the respective `BackupSession` / `RestoreSession` or view the log from respective backup/restore sidecar/job, you should see a message pointing to `permission denied` error.
If you describe the respective `BackupSession` / `RestoreSession` or view the log from the respective backup/restore sidecar/job, you should see a message pointing to the `permission denied` error.

## Possible reasons

The issue can happen during both backup and restore. Here, are few possible scenarios when you can face the issue.
The issue can happen during both backup and restore. Here are a few possible scenarios where you may face this issue.

### During Backup

@@ -38,7 +38,7 @@ If you are using an addon that needs `interimVolume` for storing the data tempor

### During Restore

You may see the permission issue during restore process in the following scenarios.
You may see the permission issue during the restore process in the following scenarios.

### Backup was taken as a particular user

@@ -50,11 +50,11 @@ If you are using an addon that needs `interimVolume` for storing the data tempor

## Solutions

Here, are few actions you can take to solve the issue in the scenarios mentioned above.
Here are a few actions you can take to solve the issue in the scenarios mentioned above.

### For local volume as backend

If you are facing the issue while using local volume as backend, you can take any of the following actions to solve the issue.
If you are facing the issue while using local volume as a backend, you can take any of the following actions to solve the issue.

#### Run the backup/restore as `root` user

@@ -93,7 +93,7 @@ spec:
prune: true
```
Here, is an example of running restore as `root` user:
Here is an example of running restore as the `root` user:

```yaml
apiVersion: stash.appscode.com/v1beta1
@@ -149,15 +149,15 @@ spec:
prune: true
```

> If you are taking backup of workload (i.e. StatefulSet, Deployment etc.) volumes, you have to provide the `fsGroup` in your workload spec instead of `BackupConfiguration` / `RestoreSession`.
> If you are taking a backup of workload (e.g., StatefulSet, Deployment) volumes, you have to provide the `fsGroup` in your workload spec instead of the `BackupConfiguration` / `RestoreSession`.
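For reference, a minimal sketch of setting `fsGroup` in a Deployment's pod template (the group ID is illustrative; use the one that owns your data):

```yaml
spec:
  template:
    spec:
      securityContext:
        fsGroup: 65534   # illustrative group ID
```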

#### Give read,write permissions to all users
#### Give read, write permissions to all users

You can also use `chmod` to give read, write permissions to all users for the directory you are using as the backend.
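A minimal sketch (the path is a placeholder for the directory backing your local volume); the capital `X` keeps directories traversable while only adding read/write to files:

```bash
# Give all users read/write access to the backend directory; X makes directories traversable
chmod -R a+rwX /path/to/local/backend
```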

### For using InterimVolume

If you are facing the issue for using `interimVolume` in your backup/restore process, you can either run the backup/restore process as root user or you can provide the storage access permission to Stash using `fsGroup`.
If you are facing this issue because of using `interimVolume` in your backup/restore process, you can either run the backup/restore process as the `root` user or provide the storage access permission to Stash using `fsGroup`.

Here is an example of running backup as the `root` user:

@@ -234,7 +234,7 @@ spec:

### For user id mismatch during restore

If your restore fails because it does not have necessary permission to read backed up data from the repository, you have to run the restore process as the same user as the backup process or `root` user using the `runtimeSettings.container.securityContext` section.
If your restore fails because it does not have the necessary permission to read backed up data from the repository, you have to run the restore process as the same user as the backup process or `root` user using the `runtimeSettings.container.securityContext` section.

Here is an example of running restore as a particular user:
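The full manifest is collapsed in this view; the relevant portion looks roughly like the following sketch (the user/group IDs are placeholders and should match the IDs the backup ran with):

```yaml
spec:
  runtimeSettings:
    container:
      securityContext:
        runAsUser: 1000    # placeholder: the same UID the backup process used
        runAsGroup: 1000   # placeholder
```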

22 changes: 11 additions & 11 deletions docs/guides/latest/troubleshooting/repo-locked/index.md
@@ -14,11 +14,11 @@ section_menu_id: guides

# Troubleshooting `"repository is already locked "` issue

Sometimes, the backend repository get locked and subsequent backup fail. In this guide, we are going to explain why this can happen and what you can do to solve the issue.
Sometimes, the backend repository gets locked and the subsequent backup fails. In this guide, we are going to explain why this can happen and what you can do to solve the issue.

## Identifying the issue

If the repository get locked, new backup will fail. If you describe the `BackupSession`, you should see error message indicating that the repository is already locked by other process.
If the repository gets locked, the new backup will fail. If you describe the `BackupSession`, you should see an error message indicating that the repository is already locked by another process.

```bash
kubectl describe -n <namespace> backupsession <backupsession name>
@@ -36,25 +36,25 @@ kubectl logs -n <namespace> <backup job's pod name> --all-containers
## Possible reasons
A restic process that modify the repository, create a lock at the beginning it's operation. When it completes the operation, it remove the lock so that other restic process can use the repository. Now, if the process is killed unexpectedly, it can not remove the lock. As a result, the repository remains locked and become unusable for other process.
A restic process that modifies the repository creates a lock at the beginning of its operation. When it completes the operation, it removes the lock so that other restic processes can use the repository. Now, if the process is killed unexpectedly, it cannot remove the lock. As a result, the repository remains locked and becomes unusable for other processes.
### Possible scenarios when a repository can get locked
The repository can get locked in the following scenarios.
#### 1. The backup job/pod containing sidecar has been terminated.
#### 1. The backup job/pod containing the sidecar has been terminated.
If the workload pod that has `stash` sidecar or backup job's pod get terminated while a backup is running, the repository can get locked. In this case, you have to find out why the pod was terminated.
If the workload pod that has the `stash` sidecar or backup job's pod gets terminated while a backup is running, the repository can get locked. In this case, you have to find out why the pod was terminated.

#### 2. The temp-dir is set too low

Stash uses an `emptyDir` as temporary volume where it store cache for improving backup performance. By default the `emptyDir` does not have any limit on size. However, if you set the limit manually using `spec.tempDir` section of `BackupConfiguration` make sure you have set it to a reasonable size based on your targeted data size. If the `tempDir` limit is too low, cache size may cross the limit resulting the backup pod get evicted by Kubernetes. This is a tricky case because you may not notice that the backup pod has been evicted. You can describe the respective workload/job to check if it was the case.
Stash uses an `emptyDir` as a temporary volume where it stores cache for improving backup performance. By default, the `emptyDir` does not have any size limit. However, if you set the limit manually using the `spec.tempDir` section of the `BackupConfiguration`, make sure you set it to a reasonable size based on your targeted data size. If the `tempDir` limit is too low, the cache size may cross the limit, resulting in the backup pod getting evicted by Kubernetes. This is a tricky case because you may not notice that the backup pod has been evicted. You can describe the respective workload/job to check if that was the case.

In such scenario, make sure that you have set the `tempDir` size to a reasonable amount. You can also disable caching by setting `spec.tempDir.disableCaching: true`. However, this might impact the backup performance significantly.
In such a scenario, make sure that you have set the `tempDir` size to a reasonable amount. You can also disable caching by setting `spec.tempDir.disableCaching: true`. However, this might impact the backup performance significantly.
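A sketch of what that might look like, assuming your Stash version exposes a `sizeLimit` field under `spec.tempDir` alongside the `disableCaching` field mentioned above (the size value is illustrative):

```yaml
spec:
  tempDir:
    sizeLimit: 2Gi          # size this based on your targeted data
    # disableCaching: true  # alternative: trade backup performance for no cache
```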

## Solutions

If your repository get locked, you have to unlock it manually. You can use one of the following methods.
If your repository gets locked, you have to unlock it manually. You can use one of the following methods.

### Use Stash kubectl plugin

@@ -66,9 +66,9 @@ Then, run the following command to unlock the repository:
kubectl stash unlock <repository name> --namespace=<namespace>
```

### Delete the locks folder from backend
### Delete the locks folder from the backend

If you are using a cloud bucket that provide a UI to browse the storage, you can go to the repository directory and delete the `locks` folder.
If you are using a cloud bucket that provides a UI to browse the storage, you can go to the repository directory and delete the `locks` folder.
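If your backend is a GCS bucket and you prefer the CLI over the web console, a sketch (the bucket and prefix are placeholders):

```bash
# Delete only the locks folder of the restic repository; the backed-up data stays intact
gsutil -m rm -r gs://<your-bucket>/<repository prefix>/locks
```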

<figure align="center">
<img alt="Locks in the backend repository" src="images/repo_lock.png">
@@ -77,4 +77,4 @@ If you are using a cloud bucket that provide a UI to browse the storage, you can

## Further Action

Once you have found the issue why the repository got locked in the fist place, take necessary measure to prevent it from occurring in future.
Once you have found out why the repository got locked in the first place, take the necessary measures to prevent it from occurring again in the future.
@@ -14,7 +14,7 @@ section_menu_id: guides

# Troubleshooting `"failed to read all source data"` issue

Sometime backup fails due to Stash being unable to read the targeted data. In this guide, we are going to explain the possible scenario when this error can happen and what you can do to solve the issue.
Sometimes, the backup fails due to Stash being unable to read the targeted data. In this guide, we are going to explain the possible scenario when this error can happen and what you can do to solve the issue.

## Identifying the issue

@@ -26,13 +26,13 @@ Warning: failed to read all source data during backup

## Possible reason

By default, Stash runs backup as non-root user. If the target data directory is not readable to all users, then Stash will fail to read the targeted data.
By default, Stash runs backup as a non-root user. If the target data directory is not readable to all users, then Stash will fail to read the targeted data.

## Solution

Run the backup process as same user as the targeted application or run the backup process as root user.
Run the backup process as the same user as the targeted application or run the backup process as the root user.

Here is an example of `BackupConfiguration` for running backup as root user:
Here is an example of `BackupConfiguration` for running backup as the root user:

```yaml
apiVersion: stash.appscode.com/v1beta1
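# NOTE: the rest of this manifest is collapsed in the diff view. A trimmed sketch of the
# relevant fields (names are placeholders; target, repository, and schedule are omitted):
kind: BackupConfiguration
metadata:
  name: <backupconfiguration name>
  namespace: <namespace>
spec:
  runtimeSettings:
    container:
      securityContext:
        runAsUser: 0
        runAsGroup: 0
```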