fix(kuma-cp) bug with lost update of Dataplane #1313
Conversation
Signed-off-by: Ilya Lobkov <[email protected]>
		return reconciler.Clear(&proxyID)
	}
	return err
}
To "port" my comment: I think it won't solve the problem. You either have to use the Dataplane that was used to build the hash, or use ResourceManager() (non-cached).
We are not doing ResolveAddress after we fetch it.
Do we even need to resolve it twice? Can we skip the first fetch?
If we do:
- check the mesh hash
- fetch the dataplane
instead of
- fetch the dataplane
- check the mesh hash
we are reducing A LOT of queries to the database instead of introducing A LOT more queries because of the two fetches.
Also, I'm wondering now: what happens when the Dataplane is disconnecting and we don't have the reconciler.Clear(&proxyID) line? It is not guaranteed that reconciler.Clear(&proxyID) will be executed. Is the snapshot for a given dataplane stuck indefinitely in the cache (essentially a memory leak), or is it cleaned up?
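The reordering suggested above can be sketched as follows. All names here (meshHash, fetchDataplane, onTick, the in-memory store) are hypothetical stand-ins, not Kuma's actual API; the point is only that checking the hash first skips the Dataplane fetch entirely when nothing changed:

```go
package main

import "fmt"

// Hypothetical sketch of the suggested ordering: compute the mesh hash
// first and fetch the Dataplane only when the hash actually changed.
type dataplane struct{ name string }

// store stands in for the resource store / database.
var store = map[string]string{"mesh-hash": "v1"}

func meshHash() string { return store["mesh-hash"] }

var fetches int // counts expensive fetches, to show the saving

func fetchDataplane() dataplane {
	fetches++
	return dataplane{name: "backend-1"}
}

func onTick(prevHash *string) {
	h := meshHash()
	if h == *prevHash {
		return // hash unchanged: skip the expensive fetch entirely
	}
	dp := fetchDataplane() // fetch only when reconciliation is needed
	*prevHash = h
	fmt.Println("reconciled", dp.name, "at hash", h)
}

func main() {
	prev := ""
	for i := 0; i < 5; i++ {
		onTick(&prev) // only the first tick fetches
	}
	store["mesh-hash"] = "v2" // mesh changed
	onTick(&prev)             // one more fetch
	fmt.Println("fetches:", fetches)
}
```

With fetch-first ordering every tick would hit the database; hash-first makes the steady state (nothing changed) a single cheap hash check.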
Signed-off-by: Ilya Lobkov <[email protected]>
Signed-off-by: Ilya Lobkov <[email protected]> # Conflicts: # go.sum
Signed-off-by: Ilya Lobkov <[email protected]>
Signed-off-by: Ilya Lobkov <[email protected]>
pkg/xds/cache/cla/cache.go
	defer c.keysMux.RUnlock()
	for _, key := range c.keys {
		c.cache.Delete(key)
	}
shouldn't we also delete all the keys?
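A minimal sketch of the fix this comment asks for, with hypothetical types (not Kuma's actual CLACache): after deleting the cached entries, the tracked key slice must be reset too, otherwise later cleanups iterate over stale keys. Since the cleanup mutates state, a write lock (rather than the RLock in the diff above) also seems appropriate:

```go
package main

import (
	"fmt"
	"sync"
)

// claCache is an illustrative cache that tracks its keys in a separate
// slice so it can wipe its entries in bulk.
type claCache struct {
	keysMux sync.RWMutex
	keys    []string
	cache   map[string]interface{}
}

// CleanUp deletes every tracked entry AND resets the key list, so the next
// cleanup does not walk over keys that no longer exist.
func (c *claCache) CleanUp() {
	c.keysMux.Lock() // write lock: both the map and the slice are mutated
	defer c.keysMux.Unlock()
	for _, key := range c.keys {
		delete(c.cache, key)
	}
	c.keys = nil // the point of the review comment: also drop the keys
}

func main() {
	c := &claCache{
		keys:  []string{"mesh-1:backend", "mesh-1:web"},
		cache: map[string]interface{}{"mesh-1:backend": 1, "mesh-1:web": 2},
	}
	c.CleanUp()
	fmt.Println(len(c.cache), len(c.keys)) // both empty after cleanup
}
```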
pkg/xds/server/components.go
	if err == nil {
		prevError = nil
	}
	return err
Instead of introducing prevErr, can we do this?
err = reconciler.Reconcile(envoyCtx, &proxy)
if err != nil {
	return err
}
prevHash = snapshotHash
return nil
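A sketch of why this suggestion removes the need for a prevErr flag (reconcile, tick, and the hash values below are illustrative stand-ins, not the actual components.go code): prevHash is advanced only after a successful Reconcile, so a failed tick leaves it stale and the next tick naturally retries.

```go
package main

import (
	"errors"
	"fmt"
)

// failFirst makes the first reconcile fail, simulating a transient error.
var failFirst = true

func reconcile() error {
	if failFirst {
		failFirst = false
		return errors.New("transient error")
	}
	return nil
}

var reconciles int // how many times Reconcile was actually attempted

func tick(prevHash *string, snapshotHash string) error {
	if snapshotHash == *prevHash {
		return nil // nothing changed since the last successful reconcile
	}
	reconciles++
	if err := reconcile(); err != nil {
		return err // prevHash stays stale, so the next tick retries
	}
	*prevHash = snapshotHash // advance only on success
	return nil
}

func main() {
	prev := ""
	fmt.Println(tick(&prev, "h1")) // first attempt fails
	fmt.Println(tick(&prev, "h1")) // retried and succeeds
	fmt.Println(tick(&prev, "h1")) // no-op: hash already applied
	fmt.Println("reconciles:", reconciles)
}
```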
pkg/xds/server/components.go
			return nil
		}
		log.V(1).Info("snapshot hash updated, reconcile", "prev", prevHash, "current", snapshotHash)
		prevHash = snapshotHash

		mesh := core_mesh.NewMeshResource()
		if err := rt.ReadOnlyResourceManager().Get(ctx, mesh, core_store.GetByKey(proxyID.Mesh, core_model.NoMesh)); err != nil {
		// hacky way to be sure that Cluster Load Assignment is up to date
I think this can be better explained: why wasn't it up to date?
TBH, at first glance this cache invalidation is almost the same as removing the cache altogether. If we want to do it this way, we should immediately work on moving EDS to LinearCache...
The alternative: build better invalidation based on the hash.
Signed-off-by: Ilya Lobkov <[email protected]>
(cherry picked from commit d595909) Co-authored-by: Ilya Lobkov <[email protected]>
Signed-off-by: Ilya Lobkov <[email protected]>
Summary
Follow up on the discussion https://github.com/kumahq/kuma/pull/1254/files#r540057668
Problem
In general, the problem is this: sometimes we build the hash based on fresher resources than the ones we later use for Envoy config generation. The problem shows up as a lost update.
Full changelog
- Added an OnStop method to SimpleWatchdog, to clean the Snapshot when a Dataplane is disconnected
- Clean the CLACache right after we compare hashes and decide that reconciliation is needed
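The OnStop hook from the changelog can be sketched like this (all names are illustrative, modeled on the description above, not Kuma's actual SimpleWatchdog API): the cleanup is registered with defer, so the snapshot is removed whenever the watchdog loop exits, whether the Dataplane disconnects or a tick fails.

```go
package main

import "fmt"

// SimpleWatchdog ticks while the stream is open and runs OnStop exactly
// once when the loop exits, guaranteeing snapshot cleanup on disconnect.
type SimpleWatchdog struct {
	OnTick func() error
	OnStop func()
}

func (w *SimpleWatchdog) Start(stop <-chan struct{}) {
	if w.OnStop != nil {
		defer w.OnStop() // cleanup runs on disconnect AND on tick error
	}
	for {
		select {
		case <-stop:
			return
		default:
			if err := w.OnTick(); err != nil {
				return
			}
		}
	}
}

func main() {
	// snapshots stands in for the xDS snapshot cache.
	snapshots := map[string]string{"proxy-1": "snapshot-v1"}
	stop := make(chan struct{})
	ticks := 0
	w := &SimpleWatchdog{
		OnTick: func() error {
			ticks++
			if ticks == 3 {
				close(stop) // simulate the Dataplane disconnecting
			}
			return nil
		},
		OnStop: func() {
			delete(snapshots, "proxy-1") // the reconciler.Clear equivalent
		},
	}
	w.Start(stop)
	fmt.Println("snapshots left:", len(snapshots))
}
```

This is exactly the guarantee questioned in the review thread: without such a hook, nothing ensures reconciler.Clear runs on disconnect and the snapshot can linger in the cache.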