fix(kuma-cp) bug with lost update of Dataplane #1313
Conversation
Signed-off-by: Ilya Lobkov <[email protected]>
		return reconciler.Clear(&proxyID)
	}
	return err
}
To "port" my comment: I think it won't solve the problem. You either have to use the Dataplane that was used to build the hash, or use ResourceManager() (non-cached).
We are not doing ResolveAddress after we fetch it.
Do we even need to resolve it twice? Can we skip the first fetch?
If we do:
- check the mesh hash
- fetch the dataplane
instead of
- fetch the dataplane
- check the mesh hash
we are reducing A LOT of queries to the database instead of introducing A LOT more queries because of the two fetches.
Also, I'm wondering now: what happens when the Dataplane is disconnecting and we don't have the reconciler.Clear(&proxyID) line? It is not guaranteed that reconciler.Clear(&proxyID) will be executed. Is the snapshot for a given dataplane stuck indefinitely in the cache (essentially a memory leak), or is it cleaned up?
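The reordering suggested above can be sketched as follows. All names here (meshHash, fetchDataplane, onTick, the in-memory store) are hypothetical stand-ins, not Kuma's actual API; the point is only that checking the hash first skips the Dataplane fetch entirely when nothing changed:

```go
package main

import "fmt"

// Hypothetical sketch of the suggested ordering: compute the mesh hash
// first and fetch the Dataplane only when the hash actually changed.
type dataplane struct{ name string }

// store stands in for the resource store / database.
var store = map[string]string{"mesh-hash": "v1"}

func meshHash() string { return store["mesh-hash"] }

var fetches int // counts expensive fetches, to show the saving

func fetchDataplane() dataplane {
	fetches++
	return dataplane{name: "backend-1"}
}

func onTick(prevHash *string) {
	h := meshHash()
	if h == *prevHash {
		return // hash unchanged: skip the expensive fetch entirely
	}
	dp := fetchDataplane() // fetch only when reconciliation is needed
	*prevHash = h
	fmt.Println("reconciled", dp.name, "at hash", h)
}

func main() {
	prev := ""
	for i := 0; i < 5; i++ {
		onTick(&prev) // only the first tick fetches
	}
	store["mesh-hash"] = "v2" // mesh changed
	onTick(&prev)             // one more fetch
	fmt.Println("fetches:", fetches)
}
```

With fetch-first ordering every tick would hit the database; hash-first makes the steady state (nothing changed) a single cheap hash check.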
Signed-off-by: Ilya Lobkov <[email protected]>
Signed-off-by: Ilya Lobkov <[email protected]> # Conflicts: # go.sum
Signed-off-by: Ilya Lobkov <[email protected]>
Signed-off-by: Ilya Lobkov <[email protected]>
pkg/xds/cache/cla/cache.go
	defer c.keysMux.RUnlock()
	for _, key := range c.keys {
		c.cache.Delete(key)
	}
shouldn't we also delete all the keys?
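A minimal sketch of the fix this comment asks for, with hypothetical types (not Kuma's actual CLACache): after deleting the cached entries, the tracked key slice must be reset too, otherwise later cleanups iterate over stale keys. Since the cleanup mutates state, a write lock (rather than the RLock in the diff above) also seems appropriate:

```go
package main

import (
	"fmt"
	"sync"
)

// claCache is an illustrative cache that tracks its keys in a separate
// slice so it can wipe its entries in bulk.
type claCache struct {
	keysMux sync.RWMutex
	keys    []string
	cache   map[string]interface{}
}

// CleanUp deletes every tracked entry AND resets the key list, so the next
// cleanup does not walk over keys that no longer exist.
func (c *claCache) CleanUp() {
	c.keysMux.Lock() // write lock: both the map and the slice are mutated
	defer c.keysMux.Unlock()
	for _, key := range c.keys {
		delete(c.cache, key)
	}
	c.keys = nil // the point of the review comment: also drop the keys
}

func main() {
	c := &claCache{
		keys:  []string{"mesh-1:backend", "mesh-1:web"},
		cache: map[string]interface{}{"mesh-1:backend": 1, "mesh-1:web": 2},
	}
	c.CleanUp()
	fmt.Println(len(c.cache), len(c.keys)) // both empty after cleanup
}
```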
pkg/xds/server/components.go
	if err == nil {
		prevError = nil
	}
	return err
Instead of introducing prevErr, can we do this?
err = reconciler.Reconcile(envoyCtx, &proxy)
if err != nil {
	return err
}
prevHash = snapshotHash
return nil
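A sketch of why this suggestion removes the need for a prevErr flag (reconcile, tick, and the hash values below are illustrative stand-ins, not the actual components.go code): prevHash is advanced only after a successful Reconcile, so a failed tick leaves it stale and the next tick naturally retries.

```go
package main

import (
	"errors"
	"fmt"
)

// failFirst makes the first reconcile fail, simulating a transient error.
var failFirst = true

func reconcile() error {
	if failFirst {
		failFirst = false
		return errors.New("transient error")
	}
	return nil
}

var reconciles int // how many times Reconcile was actually attempted

func tick(prevHash *string, snapshotHash string) error {
	if snapshotHash == *prevHash {
		return nil // nothing changed since the last successful reconcile
	}
	reconciles++
	if err := reconcile(); err != nil {
		return err // prevHash stays stale, so the next tick retries
	}
	*prevHash = snapshotHash // advance only on success
	return nil
}

func main() {
	prev := ""
	fmt.Println(tick(&prev, "h1")) // first attempt fails
	fmt.Println(tick(&prev, "h1")) // retried and succeeds
	fmt.Println(tick(&prev, "h1")) // no-op: hash already applied
	fmt.Println("reconciles:", reconciles)
}
```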
pkg/xds/server/components.go
			return nil
		}
		log.V(1).Info("snapshot hash updated, reconcile", "prev", prevHash, "current", snapshotHash)
		prevHash = snapshotHash

		mesh := core_mesh.NewMeshResource()
		if err := rt.ReadOnlyResourceManager().Get(ctx, mesh, core_store.GetByKey(proxyID.Mesh, core_model.NoMesh)); err != nil {
		// hacky way to be sure that Cluster Load Assignment is up to date
I think this can be better explained: why wasn't it up to date?
TBH, at first glance this cache invalidation is almost the same as removing the cache altogether. If we want to do it this way, we should immediately work on moving EDS to LinearCache...
The alternative: build better invalidation based on the hash.
Signed-off-by: Ilya Lobkov <[email protected]>
(cherry picked from commit d595909) Co-authored-by: Ilya Lobkov <[email protected]>
Signed-off-by: Ilya Lobkov <[email protected]>
Summary
Follow up on the discussion https://github.com/kumahq/kuma/pull/1254/files#r540057668
Problem
In general, the problem is this: sometimes we build the hash based on fresher resources than the ones we later use for Envoy config generation. The problem shows up as a lost update.
Full changelog
- Added an OnStop method to SimpleWatchdog, to clean the Snapshot when a Dataplane is disconnected
- Clean the CLACache right after we compare hashes and decide that reconciliation is needed
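The OnStop hook from the changelog can be sketched like this (all names are illustrative, modeled on the description above, not Kuma's actual SimpleWatchdog API): the cleanup is registered with defer, so the snapshot is removed whenever the watchdog loop exits, whether the Dataplane disconnects or a tick fails.

```go
package main

import "fmt"

// SimpleWatchdog ticks while the stream is open and runs OnStop exactly
// once when the loop exits, guaranteeing snapshot cleanup on disconnect.
type SimpleWatchdog struct {
	OnTick func() error
	OnStop func()
}

func (w *SimpleWatchdog) Start(stop <-chan struct{}) {
	if w.OnStop != nil {
		defer w.OnStop() // cleanup runs on disconnect AND on tick error
	}
	for {
		select {
		case <-stop:
			return
		default:
			if err := w.OnTick(); err != nil {
				return
			}
		}
	}
}

func main() {
	// snapshots stands in for the xDS snapshot cache.
	snapshots := map[string]string{"proxy-1": "snapshot-v1"}
	stop := make(chan struct{})
	ticks := 0
	w := &SimpleWatchdog{
		OnTick: func() error {
			ticks++
			if ticks == 3 {
				close(stop) // simulate the Dataplane disconnecting
			}
			return nil
		},
		OnStop: func() {
			delete(snapshots, "proxy-1") // the reconciler.Clear equivalent
		},
	}
	w.Start(stop)
	fmt.Println("snapshots left:", len(snapshots))
}
```

This is exactly the guarantee questioned in the review thread: without such a hook, nothing ensures reconciler.Clear runs on disconnect and the snapshot can linger in the cache.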