Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More robust system Tier creation / update #6696

Conversation

antoninbas
Copy link
Contributor

@antoninbas antoninbas commented Sep 27, 2024

System Tier initialization should ideally wait for the Tier informer's cache to sync before using the lister.
Otherwise we may think that the system Tiers have not been created yet when in fact it is just the informer that has not completed the initial List operation.

Additionally, we improve the logic in system Tier initialization:

  • After every failed API call, we should probably check the state of the informer cache again (via the lister) and decide which step is needed next (creation / update / nothing). In case of a failed create or update, this gives us the opportunity to check again if the Tier actually exists, and if yes, what its priority is.
  • AlreadyExists is no longer treated differently for create. The reason is that if the Tier already exists and for some reason it was not returned by the Lister, it will probably be the validation webhook that fails (overlapping priority), and the error won't match AlreadyExists.

Related to #6694

@antoninbas antoninbas force-pushed the more-robust-system-tier-creation-update branch from 5564a1c to c23379f Compare September 27, 2024 01:08
Comment on lines 372 to 374
// The apiserver has dependencies on multiple informer listers, which are obtained
// from the networkPolicyController object when configuring the apiserver. Before
// starting serving requests, it makes sense to wait for the informers to have
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a potential issue but in theory the full sync of informers may not represent the readiness of the API handlers as the workers may be still processing the items.

Does it make sense to run the apiserver and let each API handlers to check if their own dependecies are ready? Then for InitializeTiers, it could just wait for its only dependency, TierInformer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, but I was mostly concerned about handlers for webhooks, which use the listers directly. In this case, there is no worker, and if the informer cache is synced, the lister should return "up-to-date" (or rather up-to-date enough) data.

You will see below that I already included some logic for InitializeTiers to wait for TierInformer (so it's a bit duplicate with the "generic" logic here). Since this patch is kind of specific to Tiers, I am ok with removing the changes to this file. What do you think?

I still think that the logic here is a positive improvement, but it could be done in a separate PR.

Does it make sense to run the apiserver and let each API handlers to check if their own dependecies are ready?

This was the approach I was looking at initially, but I realized it would be quite cumbersome. And because I was specifically targeting webhooks that consume listers directly, this seemed much easier. I am not sure there is a good way to wait for some condition in handlers either, whereas it is easy to do here, before starting the APIserver.

Besides validation webhooks, do you think there are handlers that would benefit from that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this patch is kind of specific to Tiers, I am ok with removing the changes to this file. What do you think?

I had two concerns with the waiting:

  • I wasn't sure if it would introduce a dependency loop between kube-apiserver and antrea-controller (as it provides CRD webhooks), but thinking it more, List requests should not rely on webhooks.
  • The apiserver would never run if one of the CRDs whose informers have been created is not installed, but that should be a programming error or an installation error, perhaps not make sense as a factor here.

This was the approach I was looking at initially, but I realized it would be quite cumbersome. And because I was specifically targeting webhooks that consume listers directly, this seemed much easier. I am not sure there is a good way to wait for some condition in handlers either, whereas it is easy to do here, before starting the APIserver.

Doing a cache sync check when handling every request does sound inefficient and unnecessary. Doing it once before running apiserver makes sense from this perspective. However, shouldn't we remove the check in InitializeTiers? Otherwise people might follow it to add such checks, which would be unwanted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll remove the check in InitializeTiers as it is redundant, and I'll improve the comment here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure if it would introduce a dependency loop between kube-apiserver and antrea-controller (as it provides CRD webhooks), but thinking it more, List requests should not rely on webhooks.

That's true for validation webhooks, but maybe not for conversion webhooks? For example, Tiers could be stored in an older version. In order to List the Tiers with the "current" version, the Controller has a dependency on itself. So it seems that your point was actually valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to confirm that this is indeed an issue for conversion webhooks. I used the following steps to reproduce:

  1. Install Antrea v1.15.1, with AntreaIPAM enabled
  2. Create an IPPool CR using version v1alpha2
  3. Upgrade Antrea to my dev version, which includes this patch

The antrea-controller will now fail to start:

E0930 20:31:11.963897       1 reflector.go:158] "Unhandled Error" err="k8s.io/[email protected]/tools/cache/reflector.go:243: Failed to watch *v1beta1.IPPool: failed to list *v1beta1.IPPool: conversion webhook for crd.antrea.io/v1alpha2, Kind=IPPool failed: Post \"https://antrea.kube-system.svc:443/convert/ippool?timeout=30s\": dial tcp 10.96.20.239:443: connect: connection refused"
W0930 20:31:28.532298       1 reflector.go:561] k8s.io/[email protected]/tools/cache/reflector.go:243: failed to list *v1beta1.IPPool: conversion webhook for crd.antrea.io/v1alpha2, Kind=IPPool failed: Post "https://antrea.kube-system.svc:443/convert/ippool?timeout=30s": dial tcp 10.96.20.239:443: connect: connection refused
E0930 20:31:28.532622       1 reflector.go:158] "Unhandled Error" err="k8s.io/[email protected]/tools/cache/reflector.go:243: Failed to watch *v1beta1.IPPool: failed to list *v1beta1.IPPool: conversion webhook for crd.antrea.io/v1alpha2, Kind=IPPool failed: Post \"https://antrea.kube-system.svc:443/convert/ippool?timeout=30s\": dial tcp 10.96.20.239:443: connect: connection refused"

The controller will exit and get into CrashLoopBackOff.

This is an edge case but it highlights the fact that it is tricky to introduce this dependency. Therefore I will update my patch to focus on Tier initialization. It seems that the best approach long term would be to find a way to declare and enforce dependencies for each handler, even though that means checking a condition before serving every request.

@antoninbas antoninbas force-pushed the more-robust-system-tier-creation-update branch from c23379f to 6b90470 Compare September 30, 2024 20:51
tnqn
tnqn previously approved these changes Oct 8, 2024
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

pkg/controller/networkpolicy/tier.go Outdated Show resolved Hide resolved
System Tier initialization should ideally wait for the Tier informer's
cache to sync before using the lister.  Otherwise we may think that the
system Tiers have not been created yet when in fact it is just the
informer that has not completed the initial List operation.

Additionally, we improve the logic in system Tier initialization:
* After every failed API call, we should probably check the state of the
  informer cache again (via the lister) and decide which step is needed
  next (creation / update / nothing). In case of a failed create or
  update, this gives us the opportunity to check again if the Tier
  actually exists, and if yes, what its priority is.
* AlreadyExists is no longer treated differently for create. The reason
  is that if the Tier already exists and for some reason it was not
  returned by the Lister, it will probably be the validation webhook
  that fails (overlapping priority), and the error won't match
  AlreadyExists.

Related to antrea-io#6694

Signed-off-by: Antonin Bas <[email protected]>

Don't block apiserver when caches are not synced

Signed-off-by: Antonin Bas <[email protected]>
@antoninbas
Copy link
Contributor Author

/test-all

@antoninbas antoninbas added kind/bug Categorizes issue or PR as related to a bug. action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. labels Oct 8, 2024
@antoninbas antoninbas added this to the Antrea v2.2 release milestone Oct 8, 2024
@antoninbas antoninbas requested a review from tnqn October 8, 2024 19:50
@antoninbas
Copy link
Contributor Author

Thanks for the reviews!

@antoninbas antoninbas merged commit 4edb5ec into antrea-io:main Oct 8, 2024
61 of 66 checks passed
@antoninbas antoninbas deleted the more-robust-system-tier-creation-update branch October 8, 2024 21:20
antoninbas added a commit to antoninbas/antrea that referenced this pull request Oct 8, 2024
System Tier initialization should ideally wait for the Tier informer's
cache to sync before using the lister.  Otherwise we may think that the
system Tiers have not been created yet when in fact it is just the
informer that has not completed the initial List operation.

Additionally, we improve the logic in system Tier initialization:
* After every failed API call, we should probably check the state of the
  informer cache again (via the lister) and decide which step is needed
  next (creation / update / nothing). In case of a failed create or
  update, this gives us the opportunity to check again if the Tier
  actually exists, and if yes, what its priority is.
* AlreadyExists is no longer treated differently for create. The reason
  is that if the Tier already exists and for some reason it was not
  returned by the Lister, it will probably be the validation webhook
  that fails (overlapping priority), and the error won't match
  AlreadyExists.

Related to antrea-io#6694

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit to antoninbas/antrea that referenced this pull request Oct 8, 2024
System Tier initialization should ideally wait for the Tier informer's
cache to sync before using the lister.  Otherwise we may think that the
system Tiers have not been created yet when in fact it is just the
informer that has not completed the initial List operation.

Additionally, we improve the logic in system Tier initialization:
* After every failed API call, we should probably check the state of the
  informer cache again (via the lister) and decide which step is needed
  next (creation / update / nothing). In case of a failed create or
  update, this gives us the opportunity to check again if the Tier
  actually exists, and if yes, what its priority is.
* AlreadyExists is no longer treated differently for create. The reason
  is that if the Tier already exists and for some reason it was not
  returned by the Lister, it will probably be the validation webhook
  that fails (overlapping priority), and the error won't match
  AlreadyExists.

Related to antrea-io#6694

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit to antoninbas/antrea that referenced this pull request Oct 8, 2024
System Tier initialization should ideally wait for the Tier informer's
cache to sync before using the lister.  Otherwise we may think that the
system Tiers have not been created yet when in fact it is just the
informer that has not completed the initial List operation.

Additionally, we improve the logic in system Tier initialization:
* After every failed API call, we should probably check the state of the
  informer cache again (via the lister) and decide which step is needed
  next (creation / update / nothing). In case of a failed create or
  update, this gives us the opportunity to check again if the Tier
  actually exists, and if yes, what its priority is.
* AlreadyExists is no longer treated differently for create. The reason
  is that if the Tier already exists and for some reason it was not
  returned by the Lister, it will probably be the validation webhook
  that fails (overlapping priority), and the error won't match
  AlreadyExists.

Related to antrea-io#6694

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit to antoninbas/antrea that referenced this pull request Oct 8, 2024
System Tier initialization should ideally wait for the Tier informer's
cache to sync before using the lister.  Otherwise we may think that the
system Tiers have not been created yet when in fact it is just the
informer that has not completed the initial List operation.

Additionally, we improve the logic in system Tier initialization:
* After every failed API call, we should probably check the state of the
  informer cache again (via the lister) and decide which step is needed
  next (creation / update / nothing). In case of a failed create or
  update, this gives us the opportunity to check again if the Tier
  actually exists, and if yes, what its priority is.
* AlreadyExists is no longer treated differently for create. The reason
  is that if the Tier already exists and for some reason it was not
  returned by the Lister, it will probably be the validation webhook
  that fails (overlapping priority), and the error won't match
  AlreadyExists.

Related to antrea-io#6694

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit that referenced this pull request Oct 9, 2024
System Tier initialization should ideally wait for the Tier informer's
cache to sync before using the lister.  Otherwise we may think that the
system Tiers have not been created yet when in fact it is just the
informer that has not completed the initial List operation.

Additionally, we improve the logic in system Tier initialization:
* After every failed API call, we should probably check the state of the
  informer cache again (via the lister) and decide which step is needed
  next (creation / update / nothing). In case of a failed create or
  update, this gives us the opportunity to check again if the Tier
  actually exists, and if yes, what its priority is.
* AlreadyExists is no longer treated differently for create. The reason
  is that if the Tier already exists and for some reason it was not
  returned by the Lister, it will probably be the validation webhook
  that fails (overlapping priority), and the error won't match
  AlreadyExists.

Related to #6694

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit to antoninbas/antrea that referenced this pull request Oct 9, 2024
System Tier initialization should ideally wait for the Tier informer's
cache to sync before using the lister.  Otherwise we may think that the
system Tiers have not been created yet when in fact it is just the
informer that has not completed the initial List operation.

Additionally, we improve the logic in system Tier initialization:
* After every failed API call, we should probably check the state of the
  informer cache again (via the lister) and decide which step is needed
  next (creation / update / nothing). In case of a failed create or
  update, this gives us the opportunity to check again if the Tier
  actually exists, and if yes, what its priority is.
* AlreadyExists is no longer treated differently for create. The reason
  is that if the Tier already exists and for some reason it was not
  returned by the Lister, it will probably be the validation webhook
  that fails (overlapping priority), and the error won't match
  AlreadyExists.

Related to antrea-io#6694

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit to antoninbas/antrea that referenced this pull request Oct 9, 2024
System Tier initialization should ideally wait for the Tier informer's
cache to sync before using the lister.  Otherwise we may think that the
system Tiers have not been created yet when in fact it is just the
informer that has not completed the initial List operation.

Additionally, we improve the logic in system Tier initialization:
* After every failed API call, we should probably check the state of the
  informer cache again (via the lister) and decide which step is needed
  next (creation / update / nothing). In case of a failed create or
  update, this gives us the opportunity to check again if the Tier
  actually exists, and if yes, what its priority is.
* AlreadyExists is no longer treated differently for create. The reason
  is that if the Tier already exists and for some reason it was not
  returned by the Lister, it will probably be the validation webhook
  that fails (overlapping priority), and the error won't match
  AlreadyExists.

Related to antrea-io#6694

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit to antoninbas/antrea that referenced this pull request Oct 9, 2024
System Tier initialization should ideally wait for the Tier informer's
cache to sync before using the lister.  Otherwise we may think that the
system Tiers have not been created yet when in fact it is just the
informer that has not completed the initial List operation.

Additionally, we improve the logic in system Tier initialization:
* After every failed API call, we should probably check the state of the
  informer cache again (via the lister) and decide which step is needed
  next (creation / update / nothing). In case of a failed create or
  update, this gives us the opportunity to check again if the Tier
  actually exists, and if yes, what its priority is.
* AlreadyExists is no longer treated differently for create. The reason
  is that if the Tier already exists and for some reason it was not
  returned by the Lister, it will probably be the validation webhook
  that fails (overlapping priority), and the error won't match
  AlreadyExists.

Related to antrea-io#6694

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit that referenced this pull request Oct 10, 2024
System Tier initialization should ideally wait for the Tier informer's
cache to sync before using the lister.  Otherwise we may think that the
system Tiers have not been created yet when in fact it is just the
informer that has not completed the initial List operation.

Additionally, we improve the logic in system Tier initialization:
* After every failed API call, we should probably check the state of the
  informer cache again (via the lister) and decide which step is needed
  next (creation / update / nothing). In case of a failed create or
  update, this gives us the opportunity to check again if the Tier
  actually exists, and if yes, what its priority is.
* AlreadyExists is no longer treated differently for create. The reason
  is that if the Tier already exists and for some reason it was not
  returned by the Lister, it will probably be the validation webhook
  that fails (overlapping priority), and the error won't match
  AlreadyExists.

Related to #6694

Signed-off-by: Antonin Bas <[email protected]>
antoninbas added a commit that referenced this pull request Oct 10, 2024
System Tier initialization should ideally wait for the Tier informer's
cache to sync before using the lister.  Otherwise we may think that the
system Tiers have not been created yet when in fact it is just the
informer that has not completed the initial List operation.

Additionally, we improve the logic in system Tier initialization:
* After every failed API call, we should probably check the state of the
  informer cache again (via the lister) and decide which step is needed
  next (creation / update / nothing). In case of a failed create or
  update, this gives us the opportunity to check again if the Tier
  actually exists, and if yes, what its priority is.
* AlreadyExists is no longer treated differently for create. The reason
  is that if the Tier already exists and for some reason it was not
  returned by the Lister, it will probably be the validation webhook
  that fails (overlapping priority), and the error won't match
  AlreadyExists.

Related to #6694

Signed-off-by: Antonin Bas <[email protected]>
hangyan pushed a commit to hangyan/antrea that referenced this pull request Oct 29, 2024
System Tier initialization should ideally wait for the Tier informer's
cache to sync before using the lister.  Otherwise we may think that the
system Tiers have not been created yet when in fact it is just the
informer that has not completed the initial List operation.

Additionally, we improve the logic in system Tier initialization:
* After every failed API call, we should probably check the state of the
  informer cache again (via the lister) and decide which step is needed
  next (creation / update / nothing). In case of a failed create or
  update, this gives us the opportunity to check again if the Tier
  actually exists, and if yes, what its priority is.
* AlreadyExists is no longer treated differently for create. The reason
  is that if the Tier already exists and for some reason it was not
  returned by the Lister, it will probably be the validation webhook
  that fails (overlapping priority), and the error won't match
  AlreadyExists.

Related to antrea-io#6694

Signed-off-by: Antonin Bas <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants