Add index correction document #2217

Merged
merged 4 commits into from
Oct 27, 2023
37 changes: 37 additions & 0 deletions docs/user-guides/index-correction.md
@@ -0,0 +1,37 @@
# Index Correction

In a Vald cluster, each index is replicated to multiple agents according to the `index_replica` setting. However, replicas can become inconsistent when a pod is evicted or killed by the OOM killer during vector insertion. For example:

1. The timestamp of an index differs between agents (some agents hold an old index that has not been updated).
2. The number of replicas does not meet the value set in `index_replica`.

To resolve these inconsistencies, you can use the `Index Correction` feature.

`Index Correction` is implemented as a [`CronJob`](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/) that periodically checks the consistency between replicas and resolves any inconsistencies it finds.
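
To make the two kinds of inconsistency concrete, here is a minimal sketch in Python of how they could be detected. It is not Vald's implementation: the function name `find_inconsistencies`, the per-agent `{vector_id: timestamp}` layout, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def find_inconsistencies(agents, index_replica):
    """Report stale and under-replicated vectors.

    agents: hypothetical mapping of agent name -> {vector_id: timestamp}.
    index_replica: desired number of replicas per vector.
    """
    # Invert the per-agent view into vector_id -> {agent: timestamp}.
    holders = defaultdict(dict)
    for agent, index in agents.items():
        for vec_id, ts in index.items():
            holders[vec_id][agent] = ts

    stale = {}             # vector_id -> agents holding an older timestamp
    under_replicated = {}  # vector_id -> number of missing replicas
    for vec_id, by_agent in holders.items():
        newest = max(by_agent.values())
        old = sorted(a for a, ts in by_agent.items() if ts < newest)
        if old:
            stale[vec_id] = old
        if len(by_agent) < index_replica:
            under_replicated[vec_id] = index_replica - len(by_agent)
    return stale, under_replicated


# Illustrative data: "b" is outdated on agent-2, "c" has only one replica.
agents = {
    "agent-0": {"a": 100, "b": 200},
    "agent-1": {"a": 100, "c": 300},
    "agent-2": {"b": 150},
}
stale, under = find_inconsistencies(agents, index_replica=2)
print(stale)   # {'b': ['agent-2']}
print(under)   # {'c': 1}
```

The sketch only reports problems. An actual correction step would also have to update the stale replicas and create the missing ones.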

## Settings

- enabled
  Turns the index correction feature on or off.
- schedule
  Sets the job start schedule in cron notation (the default value is `3 6 * * *`, which means 6:03 AM every day).
- suspend
  [Temporary suspension setting](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#schedule-suspension) for the CronJob.

```yaml
manager:
  index:
    corrector:
      enabled: true
      schedule: "3 6 * * *"
      suspend: false
```
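
Cron notation lists the minute first and the hour second, so `3 6 * * *` starts the job at 06:03 each day. The Python sketch below reads the fields and computes the next start time. The helpers `parse_cron` and `next_daily_run` are hypothetical, handle only simple daily schedules, and are not part of Vald or Kubernetes.

```python
from datetime import datetime, timedelta

# Standard five-field cron layout.
FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expr):
    """Split a five-field cron expression into named fields."""
    parts = expr.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def next_daily_run(expr, now):
    """Next start time for a schedule with a fixed minute and hour and '*' elsewhere."""
    f = parse_cron(expr)
    if (f["day_of_month"], f["month"], f["day_of_week"]) != ("*", "*", "*"):
        raise ValueError("this sketch only handles daily schedules")
    candidate = now.replace(
        hour=int(f["hour"]), minute=int(f["minute"]), second=0, microsecond=0
    )
    # If today's slot has already passed, the next run is tomorrow.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


fields = parse_cron("3 6 * * *")
run = next_daily_run("3 6 * * *", datetime(2023, 10, 27, 12, 0))
print(fields["hour"], fields["minute"])  # 6 3
print(run)                               # 2023-10-28 06:03:00
```

The times are naive datetimes. In a real cluster the schedule is evaluated in the controller's time zone unless the CronJob sets one explicitly.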

## Important Notes

- Processing time
  With 10 million vectors and 10 agent replicas, correction takes about 10 to 20 minutes. The process is O(MN), where M is the number of vectors and N is the number of agent replicas.
- concurrencyPolicy
  `Forbid` is set internally, so a new job is not created while an existing job is running. In other words, if a job does not finish within the interval specified by the schedule, the next scheduled run is skipped.
- Index operations during correction
  Vector operations performed after the index correction job starts are not considered in that job.
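
Because the cost is O(MN), a rough run-time estimate can be extrapolated from the reference point of 10 million vectors and 10 replicas taking 10 to 20 minutes. The Python sketch below assumes strictly linear scaling in M × N, which is an idealization; actual times depend on hardware and cluster load. The function names are hypothetical. It also checks whether the worst-case estimate fits inside the schedule interval, which matters because `Forbid` prevents a new job from starting while one is still running.

```python
# Reference point: 10 million vectors, 10 replicas, 10 to 20 minutes.
BASE_VECTORS = 10_000_000
BASE_REPLICAS = 10
BASE_MINUTES = (10, 20)


def estimate_minutes(vectors, replicas):
    """Scale the reference timing linearly in vectors * replicas (an assumption)."""
    scale = (vectors * replicas) / (BASE_VECTORS * BASE_REPLICAS)
    return tuple(m * scale for m in BASE_MINUTES)


def fits_in_interval(vectors, replicas, interval_minutes):
    """True if the worst-case estimate finishes before the next scheduled start."""
    _, worst = estimate_minutes(vectors, replicas)
    return worst < interval_minutes


low, high = estimate_minutes(50_000_000, 10)
print(low, high)                                         # 50.0 100.0
print(fits_in_interval(50_000_000, 10, 24 * 60))         # True
print(fits_in_interval(1_000_000_000, 10, 24 * 60))      # False
```

If the estimate approaches the schedule interval, lengthening the interval in `schedule` is one way to avoid skipped runs.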