Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

route: fix IPv6 ecmp route deleted nexthop matching #382

Closed
wants to merge 1 commit into from

Conversation

KanjiMonster
Copy link
Contributor

@KanjiMonster KanjiMonster commented Apr 30, 2024

When the kernel sends a ECMP route update with just the deleted nexthop, the nexthop will have no associated weight, and its flags may indicate that it is dead:

    route_update: RTM_DELROUTE
    new route:
    inet6 default table main type unicast <DEAD,>
        scope global priority 0x400 protocol 0x9
        nexthop via fe80::b226:28ff:fe62:8841 dev port4 <dead,>
    old route:
    inet6 default table main type unicast
        scope global priority 0x400 protocol 0x9
        nexthop via fe80::b226:28ff:fe62:8841 dev port4 weight 0 <>
        nexthop via fe80::fa8e:a1ff:fee0:8344 dev port49 weight 0 <>
        nexthop via fe80::b226:28ff:fe62:d400 dev port3 weight 0 <>
        nexthop via fe80::fa8e:a1ff:fee0:8349 dev port54 weight 0 <>

Since we are comparing the nexthops strictly with all attributes, we can never match the deleted nexthop. This causes libnl to fail to remove the deleted nexthop from the route, and consequently send out a nop-update and a desync of the route in the cache and in the kernel.

Fix this by ignoring NH_ATTR_FLAGS (0x1) and NH_ATTR_WEIGHT (0x2) when comparing nexthops to properly match the deleted one.

Fixes: 29b7137 ("route cache: Fix handling of ipv6 multipath routes")

@KanjiMonster
Copy link
Contributor Author

The output were from some strategically places debug prints at the top of route_update(). No idea if this is a changed behavior in recent kernels, or if it was always broken. This is with kernel 6.6 FWIW.

I did not touch adding a nexthop, since I didn't observe this case while testing/debugging, and I was unsure if it is fixable, since we would be then potentially adding a nexthop with no associated weight (since new "added" route isn't a multipath route?).

All added nexthops had routes with all nexthops in new, so I'm not even sure if this is a case that can happen (anymore).

@thom311
Copy link
Owner

thom311 commented May 6, 2024

Fixes: 5c3aea338b37 ("route: fix check for ecmp route nexthop deletion")

I don't find this reference. Do you have a better one?

I think it should be just ~3. It should be ~((uint32_t) (NH_ATTR_FLAGS | NH_ATTR_WEIGHT)).

Note that there are many problems with caching of the routes in libnl3. In general, and specifically to IPv6 ECMP routes. Anyway, the patch doesn't make it worse, so ok!

Thank you.

@thom311
Copy link
Owner

thom311 commented May 6, 2024

I think it should be just ~3. It should be ~((uint32_t) (NH_ATTR_FLAGS | NH_ATTR_WEIGHT)).

It also would warrant a code comment, about why those are ignored.

@KanjiMonster KanjiMonster force-pushed the jogo_fix_ecmp_route_nh_del branch from 43da423 to 8b0008b Compare May 6, 2024 10:22
@KanjiMonster
Copy link
Contributor Author

Fixes: 5c3aea338b37 ("route: fix check for ecmp route nexthop deletion")

I don't find this reference. Do you have a better one?

Oops, looks like I copied the wrong commit hash. Updated with the correct one.

I think it should be just ~3. It should be ~((uint32_t) (NH_ATTR_FLAGS | NH_ATTR_WEIGHT)).

These are currently non-accessible in lib/route/nexthop.c, which is why I opted for a magic value to keep the change small. We could move the NH_ATTR_* defines into lib/netlink/route/nexhop.h though. Should I do that?

Regardless, I agree it needs a comment explaining the reasoning.

Note that there are many problems with caching of the routes in libnl3. In general, and specifically to IPv6 ECMP routes. Anyway, the patch doesn't make it worse, so ok!

Do you have any specific issues in mind? I'm open to taking a stab at them, since I'm currently working on trying to work with IPv6 ECMP routes, so knowing about the issues (and maybe even fixing them) would be nice.

@thom311
Copy link
Owner

thom311 commented May 6, 2024

These are currently non-accessible in lib/route/nexthop.c, which is why I opted for a magic value to keep the change small. We could move the NH_ATTR_* defines into lib/netlink/route/nexhop.h though. Should I do that?

Ah right. nexthop.h is a public header, so better not there. lib/route/nl-route.h could be the right place (and rename to NL_ROUTE_NEXTHOP_ATTR_*).

But arguably, a code comment also suffices explaining what the 3 stands for.

Do you have any specific issues in mind?

for one, that the identity operator of routes is wrong. See for example oo_compare and oo_id_attrs in route_obj_ops. It would be closer to what NetworkManager does here and here.

Then, handling ip route replace (NLM_F_REPLACE) is not correct. The flag only tells that another route was replaced, but we don't immeditely know which one. Theoretically, we would need to keep the routes sorted in the order in which we receive them (ip route order) and make sure that events properly APPEND/PREPEND in that order. For NetworkManager, I gave up on that, it's basically impossible to get this right based on the rtnetlink events. If you get it wrong, the cache is inconsistent. It's better on every replace to assume we don't know the cache's content anymore, and re-sync. It's expensive, but correct.

Finally, as you have noticed, IPv6 ECMP routes are represented quite oddly on the rtnetlink API. In NetworkManger's cache, a IPv6 route is always a singlehop route. The twist is only that a single RTM_NEWROUTE/RTM_DELROUTE can contain multiple next hops, which however are interpreted as as if receiving a list of multiple single-hop events here.

All in all, it is IMO not possible to maintain a consistent cache based on the current events from kernel. NetworkManager gets it mostly right, by triggering expensive sync-all. It also probably won't fail badly in face of a cache inconsistancy.

Patch welcome. But it will require a large effort.

@thom311
Copy link
Owner

thom311 commented May 6, 2024

also, there are cases where kernel will remove a route and not send a RTM_DELROUTE event. Because the user is supposed to understand that from circumstantial events.

NetworkManager doesn't want to to do that (make such circumstantial decisions based on potentially invalid cache content). Instead, in such cases it also triggers a re-sync (here. But that doesn't really work for libnl, because the libnl cache is about routes only. Somehow a even for addresses would trigger a resync for the route cache.

@KanjiMonster
Copy link
Contributor Author

These are currently non-accessible in lib/route/nexthop.c, which is why I opted for a magic value to keep the change small. We could move the NH_ATTR_* defines into lib/netlink/route/nexhop.h though. Should I do that?

Ah right. nexthop.h is a public header, so better not there. lib/route/nl-route.h could be the right place (and rename to NL_ROUTE_NEXTHOP_ATTR_*).

But arguably, a code comment also suffices explaining what the 3 stands for.

I'll add a comment explaining it.

Though I feel moving the _ATTR_ defines to the public API would be arguable, since we expose a _compare() function that takes a mask, but we never provide any definition for the values you can pass there, so ~0 is the only value option you currently pass.

But if we do that, it should be done for all objects, not just nexthops.

Do you have any specific issues in mind?

for one, that the identity operator of routes is wrong. See for example oo_compare and oo_id_attrs in route_obj_ops. It would be closer to what NetworkManager does here and here.

Then, handling ip route replace (NLM_F_REPLACE) is not correct. The flag only tells that another route was replaced, but we don't immeditely know which one. Theoretically, we would need to keep the routes sorted in the order in which we receive them (ip route order) and make sure that events properly APPEND/PREPEND in that order. For NetworkManager, I gave up on that, it's basically impossible to get this right based on the rtnetlink events. If you get it wrong, the cache is inconsistent. It's better on every replace to assume we don't know the cache's content anymore, and re-sync. It's expensive, but correct.

Finally, as you have noticed, IPv6 ECMP routes are represented quite oddly on the rtnetlink API. In NetworkManger's cache, a IPv6 route is always a singlehop route. The twist is only that a single RTM_NEWROUTE/RTM_DELROUTE can contain multiple next hops, which however are interpreted as as if receiving a list of multiple single-hop events here.

All in all, it is IMO not possible to maintain a consistent cache based on the current events from kernel. NetworkManager gets it mostly right, by triggering expensive sync-all. It also probably won't fail badly in face of a cache inconsistancy.

Patch welcome. But it will require a large effort.

Ah, that indeed sounds like a lot of work.

also, there are cases where kernel will remove a route and not send a RTM_DELROUTE event. Because the user is supposed to understand that from circumstantial events.

Right, I already noticed that deleting a nexthop created via the nexthop API will silently delete all (non-ECMP) routes using it.

To fix that, we would need to allow nexthop updates to update the route cache. And probably similar triggers for other events that causes silent route deletion.

So far our main use case (for ECMP routes) is listening to netlink and apply routes created by FRR to a switch ASIC, and FRR is nice enough to delete routes before deleting nexthops, for which the current implementation is sufficient.

@thom311
Copy link
Owner

thom311 commented May 6, 2024

Though I feel moving the ATTR defines to the public API would be arguable, since we expose a _compare() function that takes a mask, but we never provide any definition for the values you can pass there, so ~0 is the only value option you currently pass.

True, but IMO that is more of an API ugliness of the compare functions. I don't think that all ATTR flags should become public API.

It's partly because libnl-3-route.so uses libnl-3.so as a shared library, so it needs to be able to specify those flags. But external users, don't really have anything to do with them.

@thom311
Copy link
Owner

thom311 commented May 6, 2024

To fix that, we would need to allow nexthop updates to update the route cache. And probably similar triggers for other events that causes silent route deletion.

Oh, I wasn't aware that changes to nexthops can cause (silent) changes to the routes. That's seems bad. NetworkManager doesn't watch the nexthops and would miss such changes. Oh well...

When the kernel sends a ECMP route update with just the deleted nexthop,
the nexthop will have no associated weight, and its flags may indicate
that it is dead:

    route_update: RTM_DELROUTE
    new route:
    inet6 default table main type unicast <DEAD,>
        scope global priority 0x400 protocol 0x9
        nexthop via fe80::b226:28ff:fe62:8841 dev port4 <dead,>
    old route:
    inet6 default table main type unicast
        scope global priority 0x400 protocol 0x9
        nexthop via fe80::b226:28ff:fe62:8841 dev port4 weight 0 <>
        nexthop via fe80::fa8e:a1ff:fee0:8344 dev port49 weight 0 <>
        nexthop via fe80::b226:28ff:fe62:d400 dev port3 weight 0 <>
        nexthop via fe80::fa8e:a1ff:fee0:8349 dev port54 weight 0 <>

Since we are comparing the nexthops strictly with all attributes, we can
never match the deleted nexthop. This causes libnl to fail to remove the
deleted nexthop from the route, and consequently send out a nop-update
and a desync of the route in the cache and in the kernel.

Fix this by ignoring NH_ATTR_FLAGS (0x1) and NH_ATTR_WEIGHT (0x2) when
comparing nexthops to properly match the deleted one.

Fixes: 29b7137 ("route cache: Fix handling of ipv6 multipath routes")
Signed-off-by: Jonas Gorski <[email protected]>
@KanjiMonster KanjiMonster force-pushed the jogo_fix_ecmp_route_nh_del branch from 8b0008b to 07d80fc Compare May 7, 2024 07:12
@KanjiMonster
Copy link
Contributor Author

Added a comment, and changed the ~3 to a ~0x3 to make it more clear this is a bitmask.

Oh, I wasn't aware that changes to nexthops can cause (silent) changes to the routes. That's seems bad. NetworkManager doesn't watch the nexthops and would miss such changes. Oh well...

IIRC this only applies to IPv4, not IPv6, but if you create a nexthop, then add a route via this nexthop (via nhid), then delete the nexthop, you will get a netlink notification about then nexthop being deleted, but not the route itself.

True, but IMO that is more of an API ugliness of the compare functions. I don't think that all ATTR flags should become public API.

There's another place where having access to the ATTR flags would be benefical: when using the callbacks for cache changes, libnl passes a diff bitmask to the callback on NL_ACT_CHANGE notifications. Having access to ATTR flags would allow users to know which values changed without having to compare each attribute between the old and new object themselves (again).

@thom311
Copy link
Owner

thom311 commented May 7, 2024

There's another place where having access to the ATTR flags would be benefical

Yes, I am not saying it's useless.

But it would mean to expose (and commit) to quite a large new API (all the attributes), which limits future changes. It's high effort to expose this, and to maintain. Unless there is a strong request, better don't...

thom311 pushed a commit that referenced this pull request May 7, 2024
When the kernel sends a ECMP route update with just the deleted nexthop,
the nexthop will have no associated weight, and its flags may indicate
that it is dead:

    route_update: RTM_DELROUTE
    new route:
    inet6 default table main type unicast <DEAD,>
        scope global priority 0x400 protocol 0x9
        nexthop via fe80::b226:28ff:fe62:8841 dev port4 <dead,>
    old route:
    inet6 default table main type unicast
        scope global priority 0x400 protocol 0x9
        nexthop via fe80::b226:28ff:fe62:8841 dev port4 weight 0 <>
        nexthop via fe80::fa8e:a1ff:fee0:8344 dev port49 weight 0 <>
        nexthop via fe80::b226:28ff:fe62:d400 dev port3 weight 0 <>
        nexthop via fe80::fa8e:a1ff:fee0:8349 dev port54 weight 0 <>

Since we are comparing the nexthops strictly with all attributes, we can
never match the deleted nexthop. This causes libnl to fail to remove the
deleted nexthop from the route, and consequently send out a nop-update
and a desync of the route in the cache and in the kernel.

Fix this by ignoring NH_ATTR_FLAGS (0x1) and NH_ATTR_WEIGHT (0x2) when
comparing nexthops to properly match the deleted one.

Fixes: 29b7137 ("route cache: Fix handling of ipv6 multipath routes")
Signed-off-by: Jonas Gorski <[email protected]>

#382
@KanjiMonster
Copy link
Contributor Author

But it would mean to expose (and commit) to quite a large new API (all the attributes), which limits future changes. It's high effort to expose this, and to maintain. Unless there is a strong request, better don't...

Right, so if I would propose doing it, I would only do so with an actual user. Our product currently does all the comparisons itself, and would profit from having access to the meaning of the diff, but we currently don't have the resources for the rather large conversion to change this without an actual gain in functionality, just efficiency.

@thom311
Copy link
Owner

thom311 commented May 7, 2024

merged. Thank you!!!

@thom311 thom311 closed this May 7, 2024
@thom311
Copy link
Owner

thom311 commented May 7, 2024

Right, so if I would propose doing it, I would only do so with an actual user. Our product currently does all the comparisons itself, and would profit from having access to the meaning of the diff, but we currently don't have the resources for the rather large conversion to change this without an actual gain in functionality, just efficiency.

It might not be that much effort. Just move the defines to public headers, and give them a consistent, good name. E.g. NL_OBJATTR_. The main burden comes from in the future not breaking that API (but that's doable). If you wish, patch welcome (start small).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants