Add troubleshooting guide for ciphertext verification failed issue (#187)

Signed-off-by: Emruz Hossain <[email protected]>
Emruz Hossain authored Jan 4, 2022
1 parent e7e49a0 commit aa2f022
Showing 6 changed files with 80 additions and 35 deletions.
@@ -0,0 +1,45 @@
---
title: Ciphertext Verification Failed | Stash
description: Troubleshooting "ciphertext verification failed" issue
menu:
docs_{{ .version }}:
identifier: troubleshooting-ciphertext-verification-failed
name: Ciphertext verification failed
parent: troubleshooting
weight: 40
product_name: stash
menu_name: docs_{{ .version }}
section_menu_id: guides
---

# Troubleshooting "ciphertext verification failed" issue

Sometimes, the backup starts failing after a few days with an error indicating `ciphertext verification failed`. In this guide, we are going to explain what might cause the issue and how to solve it.

## Identifying the issue

Typically, the backup runs successfully for some time and then suddenly starts failing; any subsequent backups fail too. If you describe the respective `BackupSession` or view the log from the respective backup sidecar/job, you should see the following error:

```bash
Fatal: ciphertext verification failed
```
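The exact resource and pod names depend on your setup; as a sketch (placeholders in angle brackets), you can check both places like this:

```bash
# Describe the BackupSession to see the failure condition and related events
kubectl describe backupsession <backupsession name> -n <namespace>

# View the log of the backup sidecar (workload backup) or the backup job's pod
kubectl logs -n <namespace> <backup pod name> --all-containers
```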

## Possible reasons

This can happen if the backed-up data gets corrupted for any of the following reasons:

- Someone deleted some files/folders from the backend manually.
- The respective bucket has a policy configured to delete the old data automatically.

## Solution

First, check whether the bucket has any policy configured to delete old data automatically. If it does, remove that policy and rely only on the retention policy provided by the Stash `Repository` to clean up old data.

For example, if you are using a GCS bucket, you can check for such a policy in the `LIFECYCLE` tab.

<figure align="center">
<img alt="Object deletion policy in GCS" src="images/gcs_lifecycle_policy.png">
<figcaption align="center">Fig: Object deletion policy in GCS</figcaption>
</figure>
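If you prefer the command line over the console, you can inspect and clear the bucket's lifecycle rules with `gsutil`. This is only a sketch, assuming the Google Cloud SDK is installed; the bucket name is a placeholder:

```bash
# Show the current lifecycle configuration of the bucket
gsutil lifecycle get gs://<your-bucket>

# Remove all lifecycle rules by applying an empty configuration
echo '{"rule": []}' > empty-lifecycle.json
gsutil lifecycle set empty-lifecycle.json gs://<your-bucket>
```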

Unfortunately, there is no known way to repair a corrupted repository. You have to delete all the corrupted data from the backend; only then will subsequent backups succeed again.
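For example, with a GCS backend you could remove the repository directory using `gsutil`. The following is only a sketch (the bucket and prefix are placeholders) and it permanently deletes the backed-up data, so double-check the path against the `spec.backend` section of your `Repository` before running it:

```bash
# Permanently delete the corrupted repository data from the bucket
gsutil -m rm -r gs://<your-bucket>/<repository prefix>
```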
18 changes: 9 additions & 9 deletions docs/guides/latest/troubleshooting/how-to-troubleshooot/index.md
@@ -14,15 +14,15 @@ section_menu_id: guides

# How to Troubleshoot Stash Issues

This guide will give you an overview of how you can gather necessary information to identify the issue that causes the backup/restore failure.
This guide will give you an overview of how you can gather the necessary information to identify the issue that causes the backup/restore failure.

## Troubleshoot Backup Issues

In this section, we are going to explain how to troubleshoot backup issues.

### Backup was Never Triggered

If you have created the desired `BackupConfiguration` but the respective backup triggering CronJob was never created or any `BackupSession` was not created in the scheduled time, in this case follow the following steps:
If you have created the desired `BackupConfiguration` but the respective backup-triggering CronJob was never created, or no `BackupSession` was created at the scheduled time, follow these steps:

#### Describe the `BackupConfiguration`

@@ -34,13 +34,13 @@ kubectl describe backupconfiguration <backupconfiguration name> -n <namespace>

Now, check the `Status` section of `BackupConfiguration`. Make sure all the `conditions` are `True`. If there is any issue during backup setup, you should see the error in the respective condition.

Also, check the event to see if there is any indication of error.
Also, check the event to see if there is any indication of an error.
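If you want the raw data instead of the `describe` output, a sketch (names are placeholders):

```bash
# Print only the status conditions of the BackupConfiguration
kubectl get backupconfiguration <backupconfiguration name> -n <namespace> -o jsonpath='{.status.conditions}'

# List events related to the BackupConfiguration
kubectl get events -n <namespace> --field-selector involvedObject.kind=BackupConfiguration,involvedObject.name=<backupconfiguration name>
```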

#### Check Stash operator log

If you don't notice any error on the previous step, you should check the Stash operator log.
If you don't notice any error in the previous step, you should check the Stash operator log.

Run the following command to view Stash operator log:
Run the following command to view the Stash operator log:

```bash
# Identify the Stash operator pod
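# NOTE: the actual commands are collapsed in this diff view. A sketch of the usual flow
# (the operator is commonly installed in the kube-system namespace; names are placeholders):
kubectl get pods --all-namespaces | grep stash

# View the operator log from all containers of the operator pod
kubectl logs -n <namespace> <stash operator pod name> --all-containers
```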
@@ -68,7 +68,7 @@ Also, check the `Events` section. Sometimes, it can be helpful to identify the i

#### View Backup Job/Sidecar log

If you don't see any error in the previous step, you should try checking log of the respective backup job / sidecar.
If you don't see any error in the previous step, you should try checking the log of the respective backup job/sidecar.

If you are trying to back up a workload, run the following command to inspect the log:
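The exact command is collapsed in this diff view; as a sketch, assuming the Stash sidecar container injected into the workload pod is named `stash` (placeholders in angle brackets):

```bash
# View the log of the stash sidecar injected into the workload's pod
kubectl logs -n <namespace> <workload pod name> -c stash
```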

@@ -86,7 +86,7 @@ kubectl get pods -n <namespace> | grep stash-backup
kubectl logs -n <namespace> <backup pod name> --all-containers
```

Inspect the log carefully. You should notice the respective error that lead to backup failure.
Inspect the log carefully. You should notice the respective error that led to the backup failure.

## Troubleshoot Restore Issues

@@ -104,7 +104,7 @@ Also, check the `Events` section. Sometimes, it can be helpful to identify the i

#### View Restore Job/Init-Container log

If you don't see any error in the previous step, you should try checking log of the respective restore job / init-container.
If you don't see any error in the previous step, you should try checking the log of the respective restore job / init-container.

If you are trying to restore a workload, run the following command to inspect the log:
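As with backup, the command itself is collapsed here; a sketch, assuming the restore init-container is named `stash-init`:

```bash
# View the log of the stash init-container that performed the restore
kubectl logs -n <namespace> <workload pod name> -c stash-init
```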

@@ -122,4 +122,4 @@ kubectl get pods -n <namespace> | grep stash-restore
kubectl logs -n <namespace> <restore pod name> --all-containers
```

Inspect the log carefully. You should notice the respective error that lead to restore failure.
Inspect the log carefully. You should notice the respective error that led to the restore failure.
22 changes: 11 additions & 11 deletions docs/guides/latest/troubleshooting/permission-denided/index.md
@@ -14,15 +14,15 @@ section_menu_id: guides

# Troubleshooting `"permission denied"` issue

Sometimes the backup or restore fails due to permission issue. This can happen for various reasons. In this guide, we are going to explain the known scenarios when this issue can arise and what you can do to solve it.
Sometimes the backup or restore fails due to permission issues. This can happen for various reasons. In this guide, we are going to explain the known scenarios when this issue can arise and what you can do to solve it.

## Identifying the issue

If you describe the respective `BackupSession` / `RestoreSession` or view the log from respective backup/restore sidecar/job, you should see a message pointing to `permission denied` error.
If you describe the respective `BackupSession` / `RestoreSession` or view the log from the respective backup/restore sidecar/job, you should see a message pointing to the `permission denied` error.

## Possible reasons

The issue can happen during both backup and restore. Here, are few possible scenarios when you can face the issue.
The issue can happen during both backup and restore. Here are a few possible scenarios where you may face this issue.

### During Backup

@@ -38,7 +38,7 @@ If you are using an addon that needs `interimVolume` for storing the data tempor

### During Restore

You may see the permission issue during restore process in the following scenarios.
You may see the permission issue during the restore process in the following scenarios.

### Backup was taken as a particular user

@@ -50,11 +50,11 @@ If you are using an addon that needs `interimVolume` for storing the data tempor

## Solutions

Here, are few actions you can take to solve the issue in the scenarios mentioned above.
Here are a few actions you can take to solve the issue in the scenarios mentioned above.

### For local volume as backend

If you are facing the issue while using local volume as backend, you can take any of the following actions to solve the issue.
If you are facing the issue while using local volume as a backend, you can take any of the following actions to solve the issue.

#### Run the backup/restore as `root` user

@@ -93,7 +93,7 @@ spec:
prune: true
```
Here, is an example of running restore as `root` user:
Here is an example of running restore as the `root` user:

```yaml
apiVersion: stash.appscode.com/v1beta1
@@ -149,15 +149,15 @@ spec:
prune: true
```

> If you are taking backup of workload (i.e. StatefulSet, Deployment etc.) volumes, you have to provide the `fsGroup` in your workload spec instead of `BackupConfiguration` / `RestoreSession`.
> If you are taking a backup of workload (e.g., StatefulSet, Deployment) volumes, you have to provide the `fsGroup` in your workload spec instead of the `BackupConfiguration` / `RestoreSession`.
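For reference, a minimal sketch of setting `fsGroup` in a Deployment's pod template (the group ID is illustrative; use the one that owns your data):

```yaml
spec:
  template:
    spec:
      securityContext:
        fsGroup: 65534   # illustrative group ID
```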

#### Give read,write permissions to all users
#### Give read, write permissions to all users

You can also use `chmod` to give read, write permissions to all users for the directory you are using as the backend.
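A minimal sketch (the path is a placeholder for the directory backing your local volume); the capital `X` keeps directories traversable while only adding read/write to files:

```bash
# Give all users read/write access to the backend directory; X makes directories traversable
chmod -R a+rwX /path/to/local/backend
```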

### For using InterimVolume

If you are facing the issue for using `interimVolume` in your backup/restore process, you can either run the backup/restore process as root user or you can provide the storage access permission to Stash using `fsGroup`.
If you are facing this issue because of using `interimVolume` in your backup/restore process, you can either run the backup/restore process as the `root` user or provide the storage access permission to Stash using `fsGroup`.

Here is an example of running backup as the `root` user:

@@ -234,7 +234,7 @@ spec:

### For user id mismatch during restore

If your restore fails because it does not have necessary permission to read backed up data from the repository, you have to run the restore process as the same user as the backup process or `root` user using the `runtimeSettings.container.securityContext` section.
If your restore fails because it does not have the necessary permission to read backed up data from the repository, you have to run the restore process as the same user as the backup process or `root` user using the `runtimeSettings.container.securityContext` section.

Here is an example of running restore as a particular user:
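The full manifest is collapsed in this view; the relevant portion looks roughly like the following sketch (the user/group IDs are placeholders and should match the IDs the backup ran with):

```yaml
spec:
  runtimeSettings:
    container:
      securityContext:
        runAsUser: 1000    # placeholder: the same UID the backup process used
        runAsGroup: 1000   # placeholder
```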

22 changes: 11 additions & 11 deletions docs/guides/latest/troubleshooting/repo-locked/index.md
@@ -14,11 +14,11 @@ section_menu_id: guides

# Troubleshooting `"repository is already locked "` issue

Sometimes, the backend repository get locked and subsequent backup fail. In this guide, we are going to explain why this can happen and what you can do to solve the issue.
Sometimes, the backend repository gets locked and the subsequent backup fails. In this guide, we are going to explain why this can happen and what you can do to solve the issue.

## Identifying the issue

If the repository get locked, new backup will fail. If you describe the `BackupSession`, you should see error message indicating that the repository is already locked by other process.
If the repository gets locked, the new backup will fail. If you describe the `BackupSession`, you should see an error message indicating that the repository is already locked by another process.

```bash
kubectl describe -n <namespace> backupsession <backupsession name>
@@ -36,25 +36,25 @@ kubectl logs -n <namespace> <backup job's pod name> --all-containers
## Possible reasons
A restic process that modify the repository, create a lock at the beginning it's operation. When it completes the operation, it remove the lock so that other restic process can use the repository. Now, if the process is killed unexpectedly, it can not remove the lock. As a result, the repository remains locked and become unusable for other process.
A restic process that modifies the repository creates a lock at the beginning of its operation. When it completes the operation, it removes the lock so that other restic processes can use the repository. Now, if the process is killed unexpectedly, it cannot remove the lock. As a result, the repository remains locked and becomes unusable for other processes.
### Possible scenarios when a repository can get locked
The repository can get locked in the following scenarios.
#### 1. The backup job/pod containing sidecar has been terminated.
#### 1. The backup job/pod containing the sidecar has been terminated.
If the workload pod that has `stash` sidecar or backup job's pod get terminated while a backup is running, the repository can get locked. In this case, you have to find out why the pod was terminated.
If the workload pod that has the `stash` sidecar or backup job's pod gets terminated while a backup is running, the repository can get locked. In this case, you have to find out why the pod was terminated.

#### 2. The temp-dir is set too low

Stash uses an `emptyDir` as temporary volume where it store cache for improving backup performance. By default the `emptyDir` does not have any limit on size. However, if you set the limit manually using `spec.tempDir` section of `BackupConfiguration` make sure you have set it to a reasonable size based on your targeted data size. If the `tempDir` limit is too low, cache size may cross the limit resulting the backup pod get evicted by Kubernetes. This is a tricky case because you may not notice that the backup pod has been evicted. You can describe the respective workload/job to check if it was the case.
Stash uses an `emptyDir` as a temporary volume where it stores cache for improving backup performance. By default, the `emptyDir` does not have any size limit. However, if you set the limit manually using the `spec.tempDir` section of the `BackupConfiguration`, make sure you set it to a reasonable size based on your targeted data size. If the `tempDir` limit is too low, the cache size may cross the limit, resulting in the backup pod getting evicted by Kubernetes. This is a tricky case because you may not notice that the backup pod has been evicted. You can describe the respective workload/job to check if that was the case.

In such scenario, make sure that you have set the `tempDir` size to a reasonable amount. You can also disable caching by setting `spec.tempDir.disableCaching: true`. However, this might impact the backup performance significantly.
In such a scenario, make sure that you have set the `tempDir` size to a reasonable amount. You can also disable caching by setting `spec.tempDir.disableCaching: true`. However, this might impact the backup performance significantly.
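A sketch of what that might look like, assuming your Stash version exposes a `sizeLimit` field under `spec.tempDir` alongside the `disableCaching` field mentioned above (the size value is illustrative):

```yaml
spec:
  tempDir:
    sizeLimit: 2Gi          # size this based on your targeted data
    # disableCaching: true  # alternative: trade backup performance for no cache
```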

## Solutions

If your repository get locked, you have to unlock it manually. You can use one of the following methods.
If your repository gets locked, you have to unlock it manually. You can use one of the following methods.

### Use Stash kubectl plugin

@@ -66,9 +66,9 @@ Then, run the following command to unlock the repository:
kubectl stash unlock <repository name> --namespace=<namespace>
```

### Delete the locks folder from backend
### Delete the locks folder from the backend

If you are using a cloud bucket that provide a UI to browse the storage, you can go to the repository directory and delete the `locks` folder.
If you are using a cloud bucket that provides a UI to browse the storage, you can go to the repository directory and delete the `locks` folder.
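If your backend is a GCS bucket and you prefer the CLI over the web console, a sketch (the bucket and prefix are placeholders):

```bash
# Delete only the locks folder of the restic repository; the backed-up data stays intact
gsutil -m rm -r gs://<your-bucket>/<repository prefix>/locks
```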

<figure align="center">
<img alt="Locks in the backend repository" src="images/repo_lock.png">
@@ -77,4 +77,4 @@ If you are using a cloud bucket that provide a UI to browse the storage, you can

## Further Action

Once you have found the issue why the repository got locked in the fist place, take necessary measure to prevent it from occurring in future.
Once you have found out why the repository got locked in the first place, take the necessary measures to prevent it from occurring again in the future.
@@ -14,7 +14,7 @@ section_menu_id: guides

# Troubleshooting `"failed to read all source data"` issue

Sometime backup fails due to Stash being unable to read the targeted data. In this guide, we are going to explain the possible scenario when this error can happen and what you can do to solve the issue.
Sometimes, the backup fails due to Stash being unable to read the targeted data. In this guide, we are going to explain the possible scenario when this error can happen and what you can do to solve the issue.

## Identifying the issue

@@ -26,13 +26,13 @@ Warning: failed to read all source data during backup

## Possible reason

By default, Stash runs backup as non-root user. If the target data directory is not readable to all users, then Stash will fail to read the targeted data.
By default, Stash runs backup as a non-root user. If the target data directory is not readable to all users, then Stash will fail to read the targeted data.

## Solution

Run the backup process as same user as the targeted application or run the backup process as root user.
Run the backup process as the same user as the targeted application or run the backup process as the root user.

Here is an example of `BackupConfiguration` for running backup as root user:
Here is an example of `BackupConfiguration` for running backup as the root user:

```yaml
apiVersion: stash.appscode.com/v1beta1
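# NOTE: the rest of this manifest is collapsed in the diff view. A trimmed sketch of the
# relevant fields (names are placeholders; target, repository, and schedule are omitted):
kind: BackupConfiguration
metadata:
  name: <backupconfiguration name>
  namespace: <namespace>
spec:
  runtimeSettings:
    container:
      securityContext:
        runAsUser: 0
        runAsGroup: 0
```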