remove vcache for vald agent due to vcache delete timing control failure and time ordered concurrent vector queue called vqueue #1028

Collaborator

@kpango kpango commented Feb 19, 2021

Signed-off-by: kpango [email protected]

Description:

We are currently seeing a bug in which the Vald agent cannot guarantee the consistency of vector Delete and Insert operations when Upsert and Delete are issued at high frequency in a multi-threaded environment.
The Vald agent used a vector cache (vcache) to keep the overhead of FFI calls out of the user request path and to aggregate FFI calls during CreateIndex.

However, vcache uses a sync.Map-like implementation, which does not support ordered iteration, so depending on timing a stale vcache entry may survive deletion and be loaded during the next CreateIndex.
In addition, the Insert Vector Cache (IVC) and Delete Vector Cache (DVC) compare the timestamps of execution plans that share a UUID and delete the unnecessary cache entries to avoid duplicate FFI calls.
I believe this mutual cache deletion logic let even more garbage data accumulate, and as a result some Delete and Insert operations were never executed.
To solve this problem, we need to remove vcache and add a cache layer that guarantees ordering.

So I implemented TOCVQ (Time Ordered Concurrent Vector Queue), which always computes the execution plan from the head of the queue (the oldest entries first) and then makes the FFI calls, so that the results are strongly consistent.
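
The idea of a time-ordered concurrent queue can be illustrated with a minimal sketch. This is not the vqueue implementation from this PR: the `op` and `timeQueue` types, their field names, and the mutex-plus-sort approach are all assumptions chosen for brevity. It only shows that operations pushed concurrently and out of order can still be drained oldest first.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// op is a hypothetical queued operation; the field names are illustrative.
type op struct {
	uuid   string
	vector []float32
	date   int64 // request timestamp, e.g. UnixNano
}

// timeQueue is a minimal time-ordered queue guarded by a mutex.
type timeQueue struct {
	mu  sync.Mutex
	ops []op
}

// Push appends an operation; callers may push concurrently and out of order.
func (q *timeQueue) Push(o op) {
	q.mu.Lock()
	q.ops = append(q.ops, o)
	q.mu.Unlock()
}

// PopAll drains the queue and returns the operations oldest first.
func (q *timeQueue) PopAll() []op {
	q.mu.Lock()
	out := q.ops
	q.ops = nil
	q.mu.Unlock()
	sort.SliceStable(out, func(i, j int) bool { return out[i].date < out[j].date })
	return out
}

func main() {
	q := &timeQueue{}
	var wg sync.WaitGroup
	for _, o := range []op{{uuid: "b", date: 30}, {uuid: "a", date: 10}, {uuid: "c", date: 20}} {
		wg.Add(1)
		go func(o op) {
			defer wg.Done()
			q.Push(o)
		}(o)
	}
	wg.Wait()
	// Regardless of goroutine scheduling, the drain order follows the timestamps.
	for _, o := range q.PopAll() {
		fmt.Println(o.uuid, o.date)
	}
}
```

Sorting at drain time keeps `Push` cheap on the request path; a real implementation may choose a different trade-off.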

Related Issue:

How Has This Been Tested?:

Environment:

  • Go Version: 1.16
  • Docker Version: 19.03.8
  • Kubernetes Version: 1.18.2
  • NGT Version: 1.12.3

Types of changes:

  • Bug fix [type/bug]
  • New feature [type/feature]
  • Add tests [type/test]
  • Security related changes [type/security]
  • Add documents [type/documentation]
  • Refactoring [type/refactoring]
  • Update dependencies [type/dependency]
  • Update benchmarks and performances [type/bench]
  • Update CI [type/ci]

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

Checklist:

  • I have read the CONTRIBUTING document.
  • I have checked open Pull Requests for similar features or fixes.
  • I have added tests and benchmarks to cover my changes.
  • I have ensured all new and existing tests passed.
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly.

@vdaas-ci
Collaborator

[CHATOPS:HELP] ChatOps commands.

  • 🙆‍♀️ /approve - approve
  • 💌 /changelog - replace the PR body by changelog details
  • 🍱 /format - format codes and add licenses
  • /gen-test - generate test codes
  • 🏷️ /label - add labels
  • /rebase - rebase master
  • 🔚 2️⃣ 🔚 /label actions/e2e-deploy - run E2E deploy & integration test

@kpango kpango force-pushed the bugfix/agent/remove-vcache-due-to-timing-control-failure-and-add-concurrent-queue branch 2 times, most recently from 4dcc4a1 to 2d94bcb Compare February 19, 2021 21:08
@codecov

codecov bot commented Feb 19, 2021

Codecov Report

Merging #1028 (facbd8d) into master (f12c2b6) will decrease coverage by 0.59%.
The diff coverage is 0.74%.

@@            Coverage Diff             @@
##           master    #1028      +/-   ##
==========================================
- Coverage   15.01%   14.42%   -0.60%     
==========================================
  Files         497      495       -2     
  Lines       28703    28398     -305     
==========================================
- Hits         4310     4096     -214     
+ Misses      24123    24046      -77     
+ Partials      270      256      -14     
Impacted Files                                         Coverage Δ
...ternal/observability/metrics/agent/core/ngt/ngt.go  0.00% <0.00%> (ø)
pkg/agent/core/ngt/handler/grpc/handler.go             0.00% <0.00%> (ø)
pkg/agent/core/ngt/service/ngt.go                      0.00% <0.00%> (ø)
pkg/agent/core/ngt/service/vqueue/option.go            0.00% <0.00%> (ø)
pkg/agent/core/ngt/service/vqueue/queue.go             0.00% <0.00%> (ø)
internal/config/ngt.go                                 100.00% <100.00%> (ø)
internal/worker/worker.go                              83.33% <0.00%> (ø)
internal/net/dialer.go
internal/net/option.go
internal/net/net.go
... and 4 more

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f12c2b6...78997d6. Read the comment docs.

@kpango kpango force-pushed the bugfix/agent/remove-vcache-due-to-timing-control-failure-and-add-concurrent-queue branch from 3e44b48 to 051e19d Compare February 21, 2021 16:10
…ure and time ordered concurrent vector queue called vqueue

Signed-off-by: kpango <[email protected]>
@kpango kpango force-pushed the bugfix/agent/remove-vcache-due-to-timing-control-failure-and-add-concurrent-queue branch from e364393 to f4a5911 Compare February 27, 2021 21:01
pkg/agent/core/ngt/service/vqueue/queue.go Outdated
pkg/agent/core/ngt/service/vqueue/queue.go Outdated
pkg/agent/core/ngt/service/vqueue/queue.go
pkg/agent/core/ngt/service/vqueue/queue.go
pkg/agent/core/ngt/service/vqueue/queue.go
pkg/agent/core/ngt/service/vqueue/queue_test.go Outdated
pkg/agent/core/ngt/service/vqueue/queue_test.go Outdated
pkg/agent/core/ngt/service/vqueue/queue_test.go Outdated
WithInsertBufferSize(100),
}

func WithErrGroup(eg errgroup.Group) Option {
Contributor

🚫 [golangci] reported by reviewdog 🐶
exported function WithErrGroup should have comment or be unexported (golint)

}
}

func WithInsertBufferSize(size int) Option {
Contributor

🚫 [golangci] reported by reviewdog 🐶
exported function WithInsertBufferSize should have comment or be unexported (golint)

}
}

func WithDeleteBufferSize(size int) Option {
Contributor

🚫 [golangci] reported by reviewdog 🐶
exported function WithDeleteBufferSize should have comment or be unexported (golint)

}
}

func WithInsertBufferPoolSize(size int) Option {
Contributor

🚫 [golangci] reported by reviewdog 🐶
exported function WithInsertBufferPoolSize should have comment or be unexported (golint)

}
}

func WithDeleteBufferPoolSize(size int) Option {
Contributor

🚫 [golangci] reported by reviewdog 🐶
exported function WithDeleteBufferPoolSize should have comment or be unexported (golint)
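
The golint findings above are normally resolved by giving each exported option a doc comment that starts with the function name. The sketch below shows the shape of such comments on a functional-options API. It is not the code from this PR: `queueConfig`, `newQueueConfig`, the default of 100 for the delete buffer, and the rule that non-positive sizes are ignored are all assumptions made for illustration.

```go
package main

import "fmt"

// queueConfig is a hypothetical stand-in for the queue's settings.
type queueConfig struct {
	insertBufferSize int
	deleteBufferSize int
}

// Option configures a queueConfig.
type Option func(*queueConfig) error

// WithInsertBufferSize returns an Option that sets the insert buffer size.
// Non-positive sizes are ignored so that the default stays in effect.
func WithInsertBufferSize(size int) Option {
	return func(c *queueConfig) error {
		if size > 0 {
			c.insertBufferSize = size
		}
		return nil
	}
}

// WithDeleteBufferSize returns an Option that sets the delete buffer size.
// Non-positive sizes are ignored so that the default stays in effect.
func WithDeleteBufferSize(size int) Option {
	return func(c *queueConfig) error {
		if size > 0 {
			c.deleteBufferSize = size
		}
		return nil
	}
}

// newQueueConfig applies opts on top of assumed defaults.
func newQueueConfig(opts ...Option) (*queueConfig, error) {
	c := &queueConfig{insertBufferSize: 100, deleteBufferSize: 100}
	for _, opt := range opts {
		if err := opt(c); err != nil {
			return nil, err
		}
	}
	return c, nil
}

func main() {
	c, err := newQueueConfig(WithInsertBufferSize(256))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(c.insertBufferSize, c.deleteBufferSize)
}
```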

})
dup := make(map[string]bool, len(uii)/2)
dl := make([]int, 0, len(uii)/2)
for i, idx := range uii {
Contributor

🚫 [golangci] reported by reviewdog 🐶
only one cuddle assignment allowed before range statement (wsl)

dup[idx.uuid] = true
}
}
sort.Sort(sort.Reverse(sort.IntSlice(dl)))
Contributor

🚫 [golangci] reported by reviewdog 🐶
expressions should not be cuddled with blocks (wsl)
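
The fragments quoted in the two findings above (a `dup` map, a `dl` slice of positions, and a reverse sort of `dl`) suggest a dedup-then-delete pattern. The sketch below is one self-contained way to write that pattern, with the blank lines the wsl linter asks for. The `index` type, the `dedupNewest` name, and the newest-first sort direction are assumptions; the PR's actual sort order is not visible in the quoted hunk.

```go
package main

import (
	"fmt"
	"sort"
)

// index is a hypothetical queue entry; only uuid appears in the quoted hunk.
type index struct {
	uuid string
	date int64
}

// dedupNewest keeps the newest entry per UUID and drops the rest.
// It reorders and reuses the backing array of uii.
func dedupNewest(uii []index) []index {
	// Newest first, so the first occurrence of each UUID is the one to keep.
	sort.SliceStable(uii, func(i, j int) bool { return uii[i].date > uii[j].date })

	dup := make(map[string]bool, len(uii)/2)
	dl := make([]int, 0, len(uii)/2)

	for i, idx := range uii {
		if dup[idx.uuid] {
			dl = append(dl, i) // older duplicate: remember its position
			continue
		}
		dup[idx.uuid] = true
	}

	// Delete from the highest position down so that removing one element
	// does not shift the positions still waiting to be removed.
	sort.Sort(sort.Reverse(sort.IntSlice(dl)))

	for _, i := range dl {
		uii = append(uii[:i], uii[i+1:]...)
	}

	return uii
}

func main() {
	out := dedupNewest([]index{{"a", 1}, {"b", 2}, {"a", 3}, {"b", 1}})
	for _, idx := range out {
		fmt.Println(idx.uuid, idx.date)
	}
}
```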

func (v *vqueue) RangePopInsert(ctx context.Context, f func(uuid string, vector []float32) bool) {
if v.finalizing.Load().(bool) {
for !v.finalizing.Load().(bool) {
time.Sleep(time.Millisecond * 100)
Contributor

🚫 [golangci] reported by reviewdog 🐶
mnd: Magic number: 100, in detected (gomnd)

//

// Package vqueue manages the vector cache layer for reducing FFI overhead for fast Agent processing.
package vqueue
Contributor

🚫 [golangci] reported by reviewdog 🐶
package should be vqueue_test instead of vqueue (testpackage)

//

// Package vqueue manages the vector cache layer for reducing FFI overhead for fast Agent processing.
package vqueue
Contributor

🚫 [golangci] reported by reviewdog 🐶
package should be vqueue_test instead of vqueue (testpackage)

@rinx
Contributor

rinx commented Mar 3, 2021

Seems okay to merge, but please ask the others.

@kpango
Collaborator Author

kpango commented Mar 3, 2021

@vankichi @hlts2 @kevindiu @datelier can you please review this PR?

vankichi
vankichi previously approved these changes Mar 3, 2021
log.Info("create index delete phase finished")
n.gc()
log.Info("create index insert phase started")
n.vq.RangePopInsert(ctx, func(uuid string, vector []float32) bool {
Contributor

@kevindiu kevindiu Mar 3, 2021

I am not sure whether this is correct:
The current logic processes all the delete requests first and the insert requests afterwards.

For example, suppose we have two requests, reqA and reqB:
reqA: insert vecA request, time t+0
reqB: delete vecA request, time t+10

If they operate on the same vector, then even though reqB arrives later than reqA, the final result will be that vecA is inserted, which may not be correct.

Am I correct?

Collaborator Author

Contributor

I see... I think it is good.
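
The scenario raised in this thread can be made concrete with a small sketch. It does not reproduce the vqueue logic or the reply given here. It assumes each queued request carries a timestamp and that an insert is skipped when a delete for the same UUID has a later timestamp; `request` and `pendingInserts` are hypothetical names.

```go
package main

import "fmt"

// request is a hypothetical queued insert or delete.
type request struct {
	uuid string
	date int64
}

// pendingInserts returns the inserts that are not superseded by a later
// delete for the same UUID.
func pendingInserts(inserts, deletes []request) []request {
	// Record the latest delete time seen for each UUID.
	latestDelete := make(map[string]int64, len(deletes))
	for _, d := range deletes {
		if t, ok := latestDelete[d.uuid]; !ok || d.date > t {
			latestDelete[d.uuid] = d.date
		}
	}

	out := make([]request, 0, len(inserts))
	for _, in := range inserts {
		if t, ok := latestDelete[in.uuid]; ok && t > in.date {
			continue // a delete arrived after this insert, so skip it
		}
		out = append(out, in)
	}

	return out
}

func main() {
	inserts := []request{{uuid: "vecA", date: 0}}  // reqA at t+0
	deletes := []request{{uuid: "vecA", date: 10}} // reqB at t+10

	fmt.Println(len(pendingInserts(inserts, deletes)))
}
```

Under these assumptions, running the delete phase before the insert phase still leaves vecA deleted, because reqA is filtered out before the insert phase starts.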

hlts2
hlts2 previously approved these changes Mar 3, 2021
Collaborator

@hlts2 hlts2 left a comment

LGTM

kevindiu
kevindiu previously approved these changes Mar 3, 2021
Contributor

@kevindiu kevindiu left a comment

LGTM

@kpango kpango dismissed stale reviews from kevindiu, hlts2, and vankichi via 78997d6 March 3, 2021 08:19
@kpango kpango merged commit 972f77d into master Mar 3, 2021
@kpango kpango deleted the bugfix/agent/remove-vcache-due-to-timing-control-failure-and-add-concurrent-queue branch March 3, 2021 08:19
@vdaas-ci vdaas-ci mentioned this pull request Mar 17, 2021