Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reinitialize ARP/NDP responders when the transport interface changes #6700

Merged

Conversation

xliuxu
Copy link
Contributor

@xliuxu xliuxu commented Sep 29, 2024

For secondary-network scenarios, the transport interface can be changed after the agent is started. The ARP/NDP responders should be restarted after the initialization of secondary-network to bind to the transport interface of the new index.

Besides, this change also addresses the following issues:

  • NDP responder may fail to bind to the new interface due to the Duplicate Address Detection process.
  • Golang caches the zone index for the interface, which may result in NDP responder binding on the stale interface

Fixes: #6623

Comment on lines 995 to 1010
if egressController != nil {
go egressController.Run(stopCh)
}

if externalIPController != nil {
go externalIPController.Run(stopCh)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern with changing the order is that it is a bit arbitrary, we can introduce new unexpected issues, and it limits what we can do in the future. For example, we could in the future want to introduce a dependency of flowRestoreCompleteWait on the realization of Egress policies. It would make sense: delay the removal of flow-restore-wait until Egress policy flows have been installed, in order to provide a more consistent datapath on (re)start. See #6342 for more context.

However, we know that there is already a dependency of SecondaryNetwork initialization on flowRestoreCompleteWait. This dependency is important and AFAIK cannot be broken. So with the change described above, we would end up with a circular dependency:
EgressController before flowRestoreCompleteWait before SecondaryNetwork initialization before EgressController.

I would rather avoid "introducing" this new dependency (or rather enforcing this new dependency).

cc @tnqn for his opinion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a valid concern. we can check/watch for interface changes in the responders to avoid the hard dependencies. Waiting for Quan's insights.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not introducing the dependency makes sense to me. Actually I'm considering something similar (check/watch for interface changes) to support #6547, for which we might add an externalInterface configuration and it could happen that not all nodes have the interface (and it's a valid case because user can select certain nodes as egress nodes, then raising error because the interface doesn't exist on non egress nodes doesn't make sense). If we can handle interface change in egress controller, it would solve two problems.

@luolanzone luolanzone added the kind/bug Categorizes issue or PR as related to a bug. label Oct 17, 2024
@luolanzone luolanzone added this to the Antrea v2.2 release milestone Oct 17, 2024
@xliuxu xliuxu force-pushed the xliuxu/delay-initialize-responders branch 3 times, most recently from 1945b5b to 586807b Compare October 23, 2024 09:56
Comment on lines 19 to 31
type Interface interface {
LinkExists(linkName string) bool

// Run starts the detector.
Run(stopCh <-chan struct{})

// AddEventHandler registers an eventHandler of link updates. It's not thread-safe and should be called before
// starting the detector.
AddEventHandler(handler LinkEventHandler, linkName ...string)

// HasSynced returns true if the cache has been initialized with the existing links.
HasSynced() bool
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tnqn, do you think this interface is appropriate to address the issue and #6547 as well? I did not cache the netlink.Link structs in the detector for simplicity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tnqn, could you comment on this? Thanks a lot!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't thought about how to leverage it in #6547 yet but it looks good to me for now.
Please design a minimum interface in a generic way that is best for addressing your current issue, I could add new methods or update it when using.

@luolanzone
Copy link
Contributor

@tnqn @antoninbas I have added this to the 2.2 release log, can you prioritize this PR's review as well? We'd probaboly better to include it considering it's a bug fix. Let me know if you have a different view. Thanks.

cmd/antrea-agent/agent.go Outdated Show resolved Hide resolved
go.mod Outdated Show resolved Hide resolved
pkg/agent/ipassigner/linkdetector/link_detector.go Outdated Show resolved Hide resolved
pkg/agent/ipassigner/linkdetector/link_detector.go Outdated Show resolved Hide resolved
pkg/agent/ipassigner/linkdetector/link_detector_linux.go Outdated Show resolved Hide resolved
pkg/agent/ipassigner/linkdetector/link_detector_linux.go Outdated Show resolved Hide resolved
Comment on lines 26 to 35
var (
// map of transportInterfaceName to ARP responder
arpResponders = make(map[string]*arpResponder)
// map of transportInterfaceName to NDP responder
ndpResponders = make(map[string]*ndpResponder)
)

// NewARPResponder creates a new ARP responder if it does not exist for the given transportInterfaceName.
// This function is not thread-safe.
func NewARPResponder(transportInterfaceName string, linkDetector linkdetector.Interface) *arpResponder {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels like this is independent of this change itself

I see this in the PR description:

NDP responder may fail to bind to the new interface due to the Duplicate Address Detection process

I assume it is related, but I am still not clear what the issue was

I do agree that having multiple ARP responders for the same interface seems unnecessary, but was it actually an issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is not described in the description. Having multiple responders will not cause issues.

The change of

NDP responder may fail to bind to the new interface due to the Duplicate Address Detection process

is to wrap the binding/accepting into wait.Until to retry on binding errors.

@antoninbas
Copy link
Contributor

@tnqn @antoninbas I have added this to the 2.2 release log, can you prioritize this PR's review as well? We'd probaboly better to include it considering it's a bug fix. Let me know if you have a different view. Thanks.

I started reviewing the code. It may take a few days to get it merged based on the size. One concern right now is that it depends on an upstream change to the ndp library, and it seems that @xliuxu hasn't heard back from the maintainer in a while. We could make a temporary fork of ndp under the antrea-io organization; that would be slightly better than relying on a personal fork.

@xliuxu xliuxu force-pushed the xliuxu/delay-initialize-responders branch from 586807b to 27541cf Compare October 30, 2024 09:58
@xliuxu xliuxu changed the title Delay the initialization of ARP/NDP responders Reinitialize ARP/NDP responders when the transport interface changes Nov 5, 2024
pkg/agent/ipassigner/responder/arp_responder.go Outdated Show resolved Hide resolved
pkg/agent/ipassigner/responder/arp_responder.go Outdated Show resolved Hide resolved
pkg/agent/ipassigner/responder/arp_responder.go Outdated Show resolved Hide resolved
@xliuxu xliuxu force-pushed the xliuxu/delay-initialize-responders branch from 27541cf to 204c6c0 Compare November 6, 2024 10:02
pkg/agent/ipassigner/linkmonitor/link_monitor_linux.go Outdated Show resolved Hide resolved
pkg/agent/ipassigner/linkmonitor/link_monitor_linux.go Outdated Show resolved Hide resolved
pkg/agent/ipassigner/responder/arp_responder.go Outdated Show resolved Hide resolved
pkg/agent/ipassigner/responder/ndp_responder.go Outdated Show resolved Hide resolved
pkg/agent/ipassigner/responder/ndp_responder.go Outdated Show resolved Hide resolved
pkg/agent/ipassigner/responder/ndp_responder.go Outdated Show resolved Hide resolved
go.mod Outdated Show resolved Hide resolved
@xliuxu xliuxu force-pushed the xliuxu/delay-initialize-responders branch 3 times, most recently from 68227ed to 4515385 Compare November 8, 2024 02:27
tnqn
tnqn previously approved these changes Nov 8, 2024
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antoninbas will you take another look?

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all minor comments except for 2

Comment on lines 73 to 75
go wait.NonSlidingUntil(func() {
d.listAndWatchLinks(stopCh)
}, 5*time.Second, stopCh)

<-stopCh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this could just be:

wait.NonSlidingUntil(func() { d.listAndWatchLinks(stopCh) }, 5*time.Second, stopCh)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

func (d *linkMonitor) Run(stopCh <-chan struct{}) {
klog.InfoS("Starting localLinkMonitor")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no other reference to "localLinkMonitor", why did we add "local" here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed "local".

type linkEventHandler struct {
watchLinkNames []string
receivedEvents []string
lock sync.Mutex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename lock to mutex

func Test_linkMonitor_listAndWatchLinks(t *testing.T) {
tests := []struct {
name string
eventHandlers []*linkEventHandler
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you made this a list but all the test cases I see only use a single event handler?
either you want to test the case where 2 event handlers are registered, and you should add a test case for that, or you don't think it is necessary to test that case specifically, and the code can be improved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added test cases for two event handlers.

name string
eventHandlers []*linkEventHandler
linkEvents []netlink.LinkUpdate
expectNetlinkCalls func(netlink *netlinktesting.MockInterfaceMockRecorder)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should just make this initialLinkList []netlink.Link and set the expectation in the test body. You always expect the same netlink call and the only difference is the list of links being returned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

func (r *arpResponder) onLinkUpdate(linkName string) {
klog.V(4).InfoS("Received link update event", "name", linkName)
select {
case r.linkEventCh <- struct{}{}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment here:
// if an event is already present in the channel, we can drop this new one as we only monitor one link

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

continue
}
if transportInterface.Index != newTransportInterface.Index {
klog.InfoS("Transport interface index changed. Restarting ARP responder", "name", transportInterface.Name, "oldIndex", transportInterface.Index, "newIndex", newTransportInterface.Index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: make it a single sentence

Transport interface index changed, restarting ARP responder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 25 to 28
var (
// IPv6AllNodes is the IPv6 link-local multicast address for all nodes.
IPv6AllNodes = netip.MustParseAddr("ff02::1")
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

defer r.mutex.Unlock()
if !r.assignedIPs.Has(ip.String()) {
target, _ := netip.AddrFromSlice(ip)
if !r.isIPAssigned(target) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really feel good about the fact that we check this separately from the logic below. Usually when you have this pattern:

  1. acquire lock
  2. check condition
  3. release lock
  4. acquire lock
  5. perform action which depends on above condition
  6. release lock

it can be prone to races
it feels like we could hold the lock for the duration of the function

cc @tnqn

Copy link
Member

@tnqn tnqn Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I didn't notice the change. This is not a very frequent operation that worths extreme performance optimization, we should keep the previous model.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted the changes of the lock

For secondary-network scenarios, the transport interface can be
changed after the agent is started. The ARP/NDP responders should
be started after the initialization of secondary-network to bind
to the transport interface of the new index.

Besides, this change also addresses the following issues:
- NDP responder may fail to bind to the new interface due to the
Duplicate Address Detection process.
- Golang caches the zone index for the interface, which may result
in NDP responder binding on the stale interface

Fixes: antrea-io#6623

Signed-off-by: Xu Liu <[email protected]>
@xliuxu xliuxu force-pushed the xliuxu/delay-initialize-responders branch from ae3d98f to 0004915 Compare November 8, 2024 06:30
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for making this change

@antoninbas
Copy link
Contributor

/test-all

@antoninbas antoninbas merged commit 8b514d3 into antrea-io:main Nov 9, 2024
56 of 59 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SecondaryNetwork breaks ServiceExternalIP feature
4 participants