Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(kuma-cp) bug with lost update of Dataplane #1313

Merged
merged 6 commits into from
Dec 18, 2020
Merged

Conversation

lobkovilya
Copy link
Contributor

@lobkovilya lobkovilya commented Dec 14, 2020

Signed-off-by: Ilya Lobkov [email protected]

Summary

Follow up on the discussion https://github.com/kumahq/kuma/pull/1254/files#r540057668

Problem

In general, the problem sounds like this: sometimes we build hash based on fresher resources than ones we use later for Envoy config generation. The problem shows up like a lost update:

  • Dataplane model has new outbound, but Envoy doesn't have a corresponding cluster
  • Ingress is presented in the list of Endpoints for specific service, but there is no Ingress in EDS
  • ...

Full changelog

  • Add OnStop method to SimpleWatchdog to clean Snapshot in case Dataplane is disconnected
  • Invalidate CLACache right after compare hashes and decide that reconciliation is needed

return reconciler.Clear(&proxyID)
}
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to "port" my comment. I think it won't solve the problem you either have to use dataplane that was used to build hash or use ResourceManager() (non cached)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not doing ResolveAddress after we fetch it.

Do we even need to resolve it twice? Can we skip first fetch?
If we do:

  1. check hash mesh
  2. fetch the dataplane

instead of

  1. fetch the dataplane
  2. check hash mesh

We are reducing A LOT of queries to the database instead of introducing A LOT more queries because of 2 fetches.

Also I'm wondering now. What is happening when we are disconnecting the Dataplane and we don't have reconciler.Clear(&proxyID) line. This is not guaranteed that reconciler.Clear(&proxyID) will be executed. Is the snapshot for given dataplane stuck indefinitely in the cache (essentially a memory leak) or is it cleaned up?

return reconciler.Clear(&proxyID)
}
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not doing ResolveAddress after we fetch it.

Do we even need to resolve it twice? Can we skip first fetch?
If we do:

  1. check hash mesh
  2. fetch the dataplane

instead of

  1. fetch the dataplane
  2. check hash mesh

We are reducing A LOT of queries to the database instead of introducing A LOT more queries because of 2 fetches.

Also I'm wondering now. What is happening when we are disconnecting the Dataplane and we don't have reconciler.Clear(&proxyID) line. This is not guaranteed that reconciler.Clear(&proxyID) will be executed. Is the snapshot for given dataplane stuck indefinitely in the cache (essentially a memory leak) or is it cleaned up?

Signed-off-by: Ilya Lobkov <[email protected]>
@lobkovilya lobkovilya marked this pull request as draft December 16, 2020 09:29
@lobkovilya lobkovilya marked this pull request as ready for review December 17, 2020 08:36
defer c.keysMux.RUnlock()
for _, key := range c.keys {
c.cache.Delete(key)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we also delete all the keys?

if err == nil {
prevError = nil
}
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of introducing prevErr. Can we do this

err = reconciler.Reconcile(envoyCtx, &proxy)
if err != nil {
  return err;
}
prevHash = snapshotHash
return nil

return nil
}
log.V(1).Info("snapshot hash updated, reconcile", "prev", prevHash, "current", snapshotHash)
prevHash = snapshotHash

mesh := core_mesh.NewMeshResource()
if err := rt.ReadOnlyResourceManager().Get(ctx, mesh, core_store.GetByKey(proxyID.Mesh, core_model.NoMesh)); err != nil {
// hacky way to be sure that Cluster Load Assignment is up to date
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can better explained. Why it wasn't up to date.

TBH, on the first look - this cache invalidation will be almost the same as we remove the cache altogether. If we want to do it this way, we should immediately work on moving EDS to LinearCache...
the alternative - build better invalidation based on the hash.

@lobkovilya lobkovilya merged commit d595909 into master Dec 18, 2020
@lobkovilya lobkovilya deleted the fix/lost-dp-update branch December 18, 2020 11:48
mergify bot pushed a commit that referenced this pull request Dec 18, 2020
nickolaev pushed a commit that referenced this pull request Dec 18, 2020
(cherry picked from commit d595909)

Co-authored-by: Ilya Lobkov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants