From e38a1aa5ff8f4b224c3202fdf38391dbee25a5ed Mon Sep 17 00:00:00 2001
From: ykadowak
Date: Mon, 3 Jul 2023 05:11:37 +0000
Subject: [PATCH] add broken index backup doc

---
 docs/user-guides/backup-configuration.md | 62 ++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/docs/user-guides/backup-configuration.md b/docs/user-guides/backup-configuration.md
index 3a148c5637..0f8872f38f 100644
--- a/docs/user-guides/backup-configuration.md
+++ b/docs/user-guides/backup-configuration.md
@@ -247,3 +247,65 @@ Agent Sidecar tries to get the backup file from S3, unpacks it, and starts index
 
 In using both the PV and S3 case, the backup file used for restoration will prioritize the file on PV.
 If the backup file does not exist on the PV, the backup file will be retrieved from S3 via the Vald Agent Sidecar and restored.
+
+## Broken index backup
+
+If the backup file of an index is corrupted, the Vald Agent fails to load the index file, and that file is identified as a broken index.
+
+> Possible causes of a broken index include an agent crash during a save index operation and partial storage corruption.
+
+When an index is broken, the default behavior is to discard it and continue running the Pod. This saves storage space, but you may need to inspect the contents of a broken index later. When the `broken index backup` feature is enabled, the broken index is backed up instead of deleted before the Pod continues running, which can help you investigate the cause of the corruption.
+
+### Settings
+
+To enable this feature, set `agent.ngt.broken_index_history_limit` to 1 or more (default: 0). The system keeps backups of broken indexes up to the number of generations specified by this value. When a new backup would exceed the limit, the system deletes the oldest backup.
+
+```
+agent:
+  ngt:
+    ...
+    broken_index_history_limit: 3
+    ...
+```
+
+### Backup location
+
+Backups are stored under `${index_path}/broken`. Each directory name is the Unix time in nanoseconds at which the agent attempted to read the broken index.
+
+```
+${index_path}/
+    origin/
+        ngt-meta.kvsdb
+        ngt-timestamp.kvsdb
+        metadata.json
+        prf
+        grp
+        tre
+        obj
+    broken/
+        1611271735938403848/
+            ngt-meta.kvsdb
+            ...
+        1611271749583028942/
+            ngt-meta.kvsdb
+            ...
+        1611271759849304593/
+            ngt-meta.kvsdb
+            ...
+```
+
+### Restore
+
+#### CoW: disabled
+
+If an index file exists under `${index_path}/origin`, the agent attempts to restore from that index file. If the restore fails, the index file is backed up as a broken index and the agent starts in its initial state.
+
+#### CoW: enabled
+
+If an index file exists under `${index_path}/origin`, the agent attempts to restore from that index file. If the restore fails, `${index_path}/origin` is backed up as a broken index at that point. The agent then attempts to restore from the index file in `${index_path}/backup` (the index file from one generation earlier). If that restore also fails, the agent starts in its initial state.
+
+### Metrics
+
+The number of generations of broken indexes currently stored is exposed as the metric `agent_core_ngt_broken_index_store_count`.
+
+Reference: [vald/k8s/metrics/grafana/dashboards/01-vald-agent.yaml](https://github.com/vdaas/vald/blob/main/k8s/metrics/grafana/dashboards/01-vald-agent.yaml)