
[BUG] VLANs don't work on IOL/IOLL2 #1381

Closed
ipspace opened this issue Oct 16, 2024 · 49 comments
@ipspace (Owner) commented Oct 16, 2024

A topology using VLANs on IOL/IOLL2 crashes during "netlab initial". The initial configuration template tries to include platform-specific VLAN configuration, and those files don't exist for iol/ioll2. We could either create symlinks or change the include logic.

Sample test scenario: tests/integration/vlan/01-vlan-bridge-single.yml

ipspace added the bug label Oct 16, 2024
DanPartelly self-assigned this Oct 16, 2024
@DanPartelly (Collaborator)

This is a more serious bug. Symlinking iosvl2 results in the configuration being deployed successfully, but the interfaces come up with "no switchport". I'm still at work; I will look into it later.

@ipspace (Owner) commented Oct 16, 2024

Symlinking initial/iosvl2.vlan.j2 into initial/ioll2.vlan.j2 and vlan/iosvl2.j2 into vlan/ioll2.j2 resulted in a working 01-vlan-bridge-simple.yml test. I will run the full set of VLAN integration tests once the BGP plugin ones finish.

IOL is a different story. It does not have the VLAN database, but it also does not work with the IOS bridging configuration. You can't even configure IEEE STP (which is a huge red flag). I would suggest we declare VLANs unsupported on IOL unless you really want to figure out how to make them work ;)

@DanPartelly (Collaborator)

Indeed, it does work. And no, I do not have an immediate itch to figure this out. I'd rather spend the time I have learning more about netlab internals and exploring the test suite. I have learned a lot these past few days, and your comments were very useful, but there is much more left.

@ipspace (Owner) commented Oct 16, 2024

So, I ran the VLAN integration tests for IOLL2 and all the more complex ones failed. The results are here:

https://tests.netlab.tools/_html/ioll2-clab-vlan

Unfortunately, there's not much one can do to validate the VLAN setups apart from end-to-end pings, so the errors are not particularly enlightening. If you want to fix stuff, it's best if you spin up one of the failing scenarios, figure out what's wrong, fix the config, and repeat.

@ipspace (Owner) commented Oct 16, 2024

I created the ioll2_vlan branch with the initial changes. You could start from there, do additional configuration tweaks for IOLL2, and then submit the PR, either against the ioll2_vlan branch or the dev branch.

@ipspace (Owner) commented Oct 16, 2024

I think I found the root cause: all IOLL2 instances have the same base MAC address (STP system ID), so the trunk ports go into blocking because the switches think they hear themselves.

No idea how to change that on IOLL2 :(

@DanPartelly (Collaborator)

How the heck did you figure that out? Anyway, I will look into options. It might be possible to change it at image startup. There are NETMAP IOL startup file options to dig into, or env vars. I'll ask the containerlab guy who did the IOL integration if he knows the full NETMAP format.

@jbemmel (Collaborator) commented Oct 17, 2024

Apparently VIRL can do it: https://learningnetwork.cisco.com/s/question/0D53i00000KszBMCAZ/change-switch-base-mac-in-virl-and-remove-management-ports-from-stp-evaluation

Otherwise, we could start with supporting at most 1 node per topology

@ipspace (Owner) commented Oct 17, 2024

> How the heck did you figure that out?

The trunk port was not in the list of active VLAN ports, so I started investigating. It was blocking, so STP was the culprit. STP claimed the device was the root bridge, so I started looking at STP details and found that both devices use the same system ID.

> Anyway, I will look into options. It might be possible to change it at image startup. There are NETMAP IOL startup file options to dig into, or env vars. I'll ask the containerlab guy who did the IOL integration if he knows the full NETMAP format.

It's definitely possible (or VIRL wouldn't be able to do it), but I couldn't figure out how.

Anyway, looking at the GNS3 code, it looks like IOL can take a node ID, and the GNS3 code has "512 + id" in https://github.com/GNS3/gns3-server/blob/225779bc11a0d5a5af6aeb2c9a7642639cf3da06/gns3server/compute/iou/iou_vm.py#L776, and there's a hard-coded 513 in https://github.com/hellt/vrnetlab/blob/master/cisco/iol/docker/entrypoint.sh#L14 so... 🤔

@ipspace (Owner) commented Oct 17, 2024

> Otherwise, we could start with supporting at most 1 node per topology

@DanPartelly: I would start with a very strong caveat saying "bridge domains don't work on IOL, so we disabled VLANs, and all IOL-L2 nodes use the same System ID, so you can have only one IOL-L2 node in the bridging domain". I can also add the same caveat to integration tests.

Without that, the current state of IOL-L2 is a release show-stopper. We can't release broken functionality that is not described in the caveats.

@kaelemc commented Oct 17, 2024

> How the heck did you figure that out?
>
> The trunk port was not in the list of active VLAN ports, so I started investigating. It was blocking, so STP was the culprit. STP claimed the device was the root bridge, so I started looking at STP details and found that both devices use the same system ID.
>
> Anyway, I will look into options. It might be possible to change it at image startup. There are NETMAP IOL startup file options to dig into, or env vars. I'll ask the containerlab guy who did the IOL integration if he knows the full NETMAP format.
>
> It's definitely possible (or VIRL wouldn't be able to do it), but I couldn't figure out how.
>
> Anyway, looking at the GNS3 code, it looks like IOL can take a node ID, and the GNS3 code has "512 + id" in https://github.com/GNS3/gns3-server/blob/225779bc11a0d5a5af6aeb2c9a7642639cf3da06/gns3server/compute/iou/iou_vm.py#L776, and there's a hard-coded 513 in https://github.com/hellt/vrnetlab/blob/master/cisco/iol/docker/entrypoint.sh#L14 so... 🤔

@ipspace Hey, I did the integration for IOL in Containerlab, and I'm currently working on a fix for this. I discussed it with @DanPartelly in the Containerlab Discord.

To sum it up, the system base MAC is set by the PID the IOL binary launches as. You have to set a PID when executing the IOL binary; the entrypoint script for the container statically sets the PID to 1.

NETMAP uses the PID to bind the IOL process's interfaces to UDP ports, and IOUYAP then binds the UDP ports to the Linux container interfaces (eth0, eth1, etc.).

It should be easy enough to signal a PID to the entrypoint script when launching the container in containerlab; the problem is just making sure each IOL node has a unique PID that persists across reboots.

VIRL/CML launches IOL in LXCs and has some mechanism to increment the PID that IOL launches with to make sure there are no overlaps between the nodes.
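
As a rough illustration of the point above (this is not vrnetlab or containerlab code, and the exact way IOL folds the application ID into the base MAC is an assumption here; only the aabb.cc00 prefix is known), a hard-coded ID gives every node the same STP system ID, while a per-node offset such as the "512 + id" logic in the GNS3 code linked earlier keeps them unique:

```c
#include <stdio.h>

/* Hypothetical derivation, for illustration only: real IOL may map the
 * application ID into the MAC differently. */
static void print_base_mac(int app_id)
{
    printf("app id %3d -> base MAC aa:bb:cc:00:%02x:00\n", app_id, app_id & 0xff);
}

int main(void)
{
    print_base_mac(513);                /* hard-coded ID: first node             */
    print_base_mac(513);                /* hard-coded ID: second node, same MAC  */

    for (int node = 1; node <= 3; node++)
        print_base_mac(512 + node);     /* GNS3-style per-node IDs stay unique   */
    return 0;
}
```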

@kaelemc commented Oct 17, 2024

FYI, @ipspace Big fan of your blog and your work.

I see in a recent commit that the docs have been edited to say Catalyst 8000v doesn't support MPLS.

Maybe you are already aware of this, but you just have to upgrade the boot license to 'advantage' or 'premier' for MPLS/SRv6 support. vrnetlab already does this with:

license boot level network-premier addon dna-premier

Since we initially boot the node during the container build process, the license is applied in the bootstrap config. Then, when the node is booted in a containerlab topology, the license will already have been applied.

@DanPartelly (Collaborator) commented Oct 17, 2024

I totally agree. Documentation has always been a first-class citizen in netlab; few tools are so well documented.

When do you want to release the next version? If it is not right around the corner, maybe we can give it a few days. The weekend is a day away and we can work on it. I will keep in touch with @kaelemc on this issue, with his permission.

> @DanPartelly: I would start with a very strong caveat saying "bridge domains don't work on IOL, so we disabled VLANs, and all IOL-L2 nodes use the same System ID, so you can have only one IOL-L2 node in the bridging domain". I can also add the same caveat to integration tests.
>
> Without that, the current state of IOL-L2 is a release show-stopper. We can't release broken functionality that is not described in the caveats.

@kaelemc commented Oct 17, 2024

I've submitted the PRs which fix this.

Even in CML, the base of the MAC is aabb.cc00. Sadly, I don't think we can change that, but this should be enough.

srl-labs/containerlab#2239
hellt/vrnetlab#270


@ipspace (Owner) commented Oct 17, 2024

> I've submitted the PRs which fix this.

That was fast, thanks a million.

@DanPartelly: I would suggest we still add that caveat explaining what's going on (so we can push out a new release at any time), and once the new containerlab version comes out, I'll run the integration tests, change the containerlab release in the installation script, and we'll revise the caveats. OK?

@ipspace (Owner) commented Oct 17, 2024

> When do you want to release the next version?

No rush; we don't have any major feature to push out (but we have accumulated enough stuff that I'm not comfortable with a -post1 release). I just like to have my Ts crossed ;)

@ipspace (Owner) commented Oct 17, 2024

> FYI, @ipspace Big fan of your blog and your work.

Thank you!

> I see in a recent commit that the docs have been edited to say Catalyst 8000v doesn't support MPLS.

> Maybe you are already aware of this, but you just have to upgrade the boot license to 'advantage' or 'premier' for MPLS/SRv6 support. vrnetlab already does this with

Thanks a million; I will add it to the initial configuration script (in case someone is running a Cat8K VM) and run the tests.

@kaelemc commented Oct 17, 2024

@ipspace No problem. Netlab looks really cool and could be of some use to me. I'm currently a heavy user of IOS-XR, but XRv runs too old a software version (6.x) and XRv9k is, well... too heavy.

I'm curious, how much effort do you think it would be for me to integrate XRd support into netlab?

I would say XRd is almost on par with the containerised IOL: fast boot, instant commits, and 90% feature parity with the full-fat XR VMs.

I assume it's not that much work, as XR support somewhat exists already with XRv/9k? Maybe just adding the relevant provider 'stuff'? (Sorry, I'm not too familiar with the project code.)

@ipspace (Owner) commented Oct 17, 2024

> I'm curious, how much effort do you think it would be for me to integrate XRd support into netlab?

I think it's working: https://netlab.tools/platforms/#supported-virtualization-providers

I never tried it myself, but someone submitted XRv patches and claimed it was running for him.

@kaelemc commented Oct 17, 2024

@ipspace I meant XRd, as in the containerised version of IOS-XR; it would only work with the containerlab provider (I assume, unless someone built a VM which runs the container...).

Not the virtualised ones like XRv or XRv9k.

Unless you are saying this is already supported?

@ipspace (Owner) commented Oct 17, 2024

I'm saying this should already be supported. It uses ios-xr/xrd-control-plane:7.11.1 image (obviously that can be changed) and containerlab provider.

@kaelemc commented Oct 17, 2024

> I'm saying this should already be supported. It uses ios-xr/xrd-control-plane:7.11.1 image (obviously that can be changed) and containerlab provider.

Awesome, thanks. I'll give it a shot 😊. Sorry for clouding this issue with XR stuff.

@DanPartelly (Collaborator) commented Oct 17, 2024

Yes, we should add the caveats.

1. It needs not only the next containerlab version; it also needs the master branch of vrnetlab (after the PR lands). But that's what people use anyway.
2. Furthermore, the bridge ID is now built with the help of a variable that increases by one for each node. Nodes are sorted alphabetically, so if the topology changes and nodes are added, or node names are changed, the internal index will change, and so will the bridge ID for the device.

I think point 2 should be documented as a caveat too.

I'll run more tests this evening with the more complex netlab VLAN topologies. I've run a simple test and I have RSTP up.

> I would suggest we still add that caveat explaining what's going on (so we can push out a new release at any time) ...

@DanPartelly (Collaborator) commented Oct 17, 2024

@ipspace I've run almost all of the VLAN test battery. The first five tests (prefixes 01 through 23) all succeed. The second trunking was involved (starting with test prefix 31), everything went south; nothing worked anymore. If anyone has any fast ideas, I'm all ears.

@ipspace (Owner) commented Oct 17, 2024

> @ipspace I've run almost all of the VLAN test battery. The first five tests (prefixes 01 through 23) all succeed. The second trunking was involved (starting with test prefix 31), everything went south; nothing worked anymore. If anyone has any fast ideas, I'm all ears.

Yes, the moment you add the second IOLL2 node the "duplicate STP system ID" kicks in. We have to wait for the vrnetlab/containerlab fixes.

@ipspace (Owner) commented Oct 17, 2024

> Furthermore, the bridge ID is now built with the help of a variable that increases by one for each node. Nodes are sorted alphabetically, so if the topology changes and nodes are added, or node names are changed, the internal index will change, and so will the bridge ID for the device.
>
> I think point 2 should be documented as a caveat too.

Of course we should document it (give me a day or so), but this just makes it more like real life where you never know who the root bridge will be after you add a node to the network (unless you set bridge priorities). Nonetheless, if you don't rename IOLL2 nodes, their relative order will not change, and the node with the highest MAC address will stay the same.

@DanPartelly (Collaborator) commented Oct 17, 2024

> Yes, the moment you add the second IOLL2 node the "duplicate STP system ID" kicks in. We have to wait for the vrnetlab/containerlab fixes.

Both of them are using the new PR branches. I have different STP IDs on all nodes now. Ports do go through learning and end up in forwarding state in 31-xxxx_xxx, where I spent some time.

@ipspace (Owner) commented Oct 17, 2024

> Both of them are using the new PR branches. I have different STP IDs on all nodes now. Ports do go through learning and end up in forwarding state in 31-xxxx_xxx, where I spent some time.

Oh, so it's worse than I thought. No further ideas at the moment, will wait for the new releases. I could rebuild the IOL container, but would setting the environment variable for the container be enough? Looking at the containerlab code, it seems it's doing more than that.

@DanPartelly (Collaborator)

In theory yes, you could do that and pass a unique PID to the image, but you probably have more important things to do.
I'll spend more time on it over the weekend, and we can safely wait until the next release.

> but would setting the environment variable for the container be enough? Looking at the containerlab code, it seems it's doing more than that.

@kaelemc commented Oct 17, 2024

> Both of them are using the new PR branches. I have different STP IDs on all nodes now. Ports do go through learning and end up in forwarding state in 31-xxxx_xxx, where I spent some time.
>
> Oh, so it's worse than I thought. No further ideas at the moment, will wait for the new releases. I could rebuild the IOL container, but would setting the environment variable for the container be enough? Looking at the containerlab code, it seems it's doing more than that.

Yeah, containerlab generates the NETMAP and IOUYAP files. NETMAP needs to know the PID of the IOL container so that it can do its IOL-to-container interface binding magic (with IOUYAP).

Manually changing the PID will not get you connectivity into IOL, and the ports won't work.

You can always use the GitHub Actions build artifacts (Containerlab artifact download).


@ipspace (Owner) commented Oct 18, 2024

The baseline settings and caveats are in #1390. We should merge that one to stop 'netlab initial' crashes and to disable VLANs on IOL.

@DanPartelly (Collaborator) commented Oct 21, 2024

I merged that one the day you put it up for review.

As a side note, the only way I could get spanning tree to work on IOLL2 was in MSTP mode, with the changes to containerlab and image generation that give the nodes different bridge IDs. But there is still no connectivity in the advanced labs. In all other modes, BPDUs are sent by interfaces E0/1 (s1-s2) but not received, according to show spanning-tree detail.

I then stopped looking into it, as I wanted to do netlab exec.

> The baseline settings and caveats are in #1390. We should merge that one to stop 'netlab initial' crashes and to disable VLANs on IOL.

@ipspace (Owner) commented Oct 21, 2024

> As a side note, the only way I could get spanning tree to work on IOLL2 was in MSTP mode.

When you decide that's good enough, please add the necessary configuration commands to the VLAN configuration module (so we'll have a working config), document the caveat, and submit a PR.

> But there is still no connectivity in the advanced labs.

If we can't get a two-switch network with a single trunk to work, we might as well call it a day and disable VLANs for IOLL2 (or drop IOLL2 support -- what good is it without VLANs?)

@DanPartelly (Collaborator) commented Oct 22, 2024

At this stage, I'm tempted to drop it altogether, at least for the foreseeable future. I'll stash all IOLL2 changes in a backup branch until at least all the VLAN tests pass; then we can put it back. Are you OK with this move?

About IOL L3, are there integration tests that still need to run?

> But there is still no connectivity in the advanced labs.
>
> If we can't get a two-switch network with a single trunk to work, we might as well call it a day and disable VLANs for IOLL2 (or drop IOLL2 support -- what good is it without VLANs?)

@ipspace (Owner) commented Oct 22, 2024

> At this stage, I'm tempted to drop it altogether, at least for the foreseeable future.

As we have IOLL2 mentioned in so many places, I'd just write a caveat. It would also be evident from the integration tests that things don't work. Maybe it will trigger someone to chime in ;) OK?

@DanPartelly (Collaborator)

Ok sure. It will be done.

@DanPartelly (Collaborator)

So, I've been going about this the wrong way. Today, after capturing packets from 31-vlan-bridge-trunk again, what I saw did not make much sense, so I decided to replicate the config in GNS3. I dumped the device configs from netlab, copy-pasted them, and lo and behold, everything works.

The key difference is that the utility used to tunnel IOU UDP in GNS3 is a newer one, called ubridge. The dockerized version uses a utility that has been unmaintained for years, iouyap.

This result is something tractable we can follow up on. It points towards a possible bug in the utility used in the container. Finally, some light.

@kaelemc commented Oct 23, 2024

@DanPartelly I saw uBridge but didn't give it any thought after seeing it supported plenty of other non-IOL-relevant things. My take was to keep the IOL container somewhat lean, and considering IOL just 'worked' in my limited testing, I didn't need to pursue uBridge.

I tried to make some modifications to iouyap from source, but I couldn't even get it to build after mucking around with it for a while, so I decided it wasn't worth my time and to just live with whatever issues we have. It's also the reason we use the prepackaged iouyap from apt: I couldn't get it to build.

We could give uBridge a shot and see if you get the desired behaviour 🙂.

@ipspace (Owner) commented Oct 24, 2024

> The key difference is that the utility used to tunnel IOU UDP in GNS3 is a newer one, called ubridge. The dockerized version uses a utility that has been unmaintained for years, iouyap.

Thanks a million for figuring this out. I added an IOL L2 caveat warning users that multi-node topologies won't work, and will remove it whenever we manage to solve this.

@ipspace (Owner) commented Oct 24, 2024

> @DanPartelly I saw uBridge but didn't give it any thought after seeing it supported plenty of other non-IOL-relevant things. My take was to keep the IOL container somewhat lean, and considering IOL just 'worked' in my limited testing, I didn't need to pursue uBridge.

There seems to be an Alpine ubridge package (https://pkgs.alpinelinux.org/package/edge/community/x86/ubridge) and an RPM (https://rpmfind.net/linux/RPM/fedora/devel/rawhide/x86_64/u/ubridge-0.9.18-13.fc41.x86_64.html) so maybe it's as simple as installing a different package and creating a more complex config file?

I know I'm kibitzing ;)

@DanPartelly (Collaborator)

There is a package for ubridge in the GNS3 PPA that the Dockerfile uses to install iouyap. So yeah, fingers crossed: we install it, generate a different config on the containerlab side, and test.

> There seems to be an Alpine ubridge package (https://pkgs.alpinelinux.org/package/edge/community/x86/ubridge) and an RPM (https://rpmfind.net/linux/RPM/fedora/devel/rawhide/x86_64/u/ubridge-0.9.18-13.fc41.x86_64.html) so maybe it's as simple as installing a different package and creating a more complex config file?
>
> I know I'm kibitzing ;)

@DanPartelly (Collaborator)

@ipspace Unfortunately, finger crossing did not work out ;) It seems uBridge at this time can't create an IOL-type bridge from an INI file, and you need special logic to multiplex/demultiplex IOL packets.

In GNS3 they control the creation of bridges using telnet. That's it: the bridge utility opens a local port, you connect to this port with telnet, and you send instructions about what type of bridge to create and what its members are. So that's that.

@DanPartelly (Collaborator) commented Oct 25, 2024

Another hour of debugging, and some light struck. An update:

When you capture traffic using an AF_PACKET raw socket in Linux, it appears that VLAN tags are always stripped and stored away in a kernel data structure. The tag is later made available to the raw socket (if the relevant setsockopt() options are set) in a special control message. This is by design [1].

I looked into the libpcap source code and indeed, they are using cmsg() to rebuild the VLAN data. This behavior explains all the oddities I have seen with Cisco L2 IOL devices: ports ending up in a broken state in tests that had a native VLAN on the trunk, PV-RSTP not negotiating over trunks that have no native VLAN, and no multicasts reaching the switch interface (as reported by Cisco IOS on a trunk without a native VLAN) even though they show up on the veth interface in Wireshark, with symmetric behavior on the RX path.

I will try to confirm this by dumping the raw socket data, but even at this stage I feel pretty strongly that we will need to fix iouyap by one of the following:

  • capturing the ancillary control messages and rebuilding the packet data,
  • switching to libpcap, which rebuilds the VLAN data (maybe less debugging needed with this option), or
  • rewriting it (I don't feel like it).

Chime in with your opinions!

[1] https://lore.kernel.org/all/[email protected]/T/
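
For reference, here is a minimal sketch of the mechanism described above (the standard Linux PACKET_AUXDATA ancillary-data API, roughly what libpcap does), not the actual iouyap patch; the helper names are made up for illustration and error handling is kept to a minimum:

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

/* Opt in to per-packet auxiliary data; call once after creating the AF_PACKET socket. */
static int enable_auxdata(int fd)
{
    int one = 1;
    return setsockopt(fd, SOL_PACKET, PACKET_AUXDATA, &one, sizeof(one));
}

/* Receive one frame into 'out' (caller must provide room for the frame plus
 * 4 bytes) and splice the 802.1Q tag back in if the kernel stripped one. */
static ssize_t recv_with_vlan(int fd, unsigned char *out, size_t outlen)
{
    unsigned char frame[2048];
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(struct tpacket_auxdata))];
    } ctrl;
    struct iovec iov = { .iov_base = frame, .iov_len = sizeof(frame) };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };

    ssize_t n = recvmsg(fd, &msg, 0);
    if (n < ETH_HLEN || (size_t)(n + 4) > outlen)
        return -1;

    for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
        if (cm->cmsg_level != SOL_PACKET || cm->cmsg_type != PACKET_AUXDATA)
            continue;
        struct tpacket_auxdata *aux = (struct tpacket_auxdata *)CMSG_DATA(cm);
        /* Older kernels lack TP_STATUS_VLAN_VALID; libpcap also falls back to
         * checking tp_vlan_tci != 0 there. */
        if (!(aux->tp_status & TP_STATUS_VLAN_VALID))
            break;                                  /* frame was never tagged */
        /* Rebuild: dst MAC | src MAC | TPID 0x8100 | TCI | original EtherType/payload */
        memcpy(out, frame, 2 * ETH_ALEN);
        out[12] = 0x81;
        out[13] = 0x00;
        out[14] = (unsigned char)(aux->tp_vlan_tci >> 8);
        out[15] = (unsigned char)(aux->tp_vlan_tci & 0xff);
        memcpy(out + 16, frame + 2 * ETH_ALEN, (size_t)n - 2 * ETH_ALEN);
        return n + 4;
    }

    memcpy(out, frame, (size_t)n);                  /* untagged: copy as-is */
    return n;
}
```

Presumably the real iouyap fix does the equivalent on its raw-socket receive path.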

@ipspace (Owner) commented Oct 25, 2024

> In GNS3 they control the creation of bridges using telnet. That's it: the bridge utility opens a local port, you connect to this port with telnet, and you send instructions about what type of bridge to create and what its members are. So that's that.

Most other vrnetlab integrations connect to the VM console port and run an equivalent of an expect script. Taking that and extracting the relevant bits should be good enough to do the telnet stuff. It would definitely be simpler than rewriting iouyap.

@DanPartelly (Collaborator) commented Oct 26, 2024

@ipspace As my C is good, it took me half an hour to understand iouyap and another half to fix the bugs in it and switch to the ubridge version of rebuilding the VLAN headers (which is more or less copied from libpcap). I'll upload the code to my GitHub.
As I am posting this, I pass 31-vlan-bridge-trunk and enjoy great functionality of IOLL2.

However, the following considerations apply to a potential new iouyap release:

1. The code must be cleaned up and refactored a bit.
2. The code must be run through AddressSanitizer, MemorySanitizer, and LeakSanitizer at a minimum.
3. It must be built statically against musl libc.
4. The makefile must be brought up to speed.
5. Someone who knows DevOps should help me release a signed archive of it; I won't do packages, a static build of a single executable will do it.
6. Ask Roman and Kaelem to check it and switch to it in the vrnetlab Docker image.
7. I have to decide whether I want to maintain it (although it would not be much work once the sanitizers pass and any bugs they find are fixed).

@ipspace (Owner) commented Oct 26, 2024

> @ipspace As my C is good, it took me half an hour to understand iouyap and another half to fix the bugs in it ...

Wow. Congratulations!! Now we can only hope the rest of your todo list gets implemented.

@DanPartelly (Collaborator)

Thanks. The code is up on GitHub now for those adventurous enough to compile it and use it in their images. I've run it all day while doing tests and working on enabling more layer-2 features, and it was stable. (The makefile needs work; always run make clean if you modify anything.)

> Wow. Congratulations!! Now we can only hope the rest of your todo list gets implemented.

@DanPartelly (Collaborator)

Can we close this now, or does the documentation still need some PRs?

@ipspace (Owner) commented Oct 26, 2024

> Can we close this now, or does the documentation still need some PRs?

If you wish, you could mention something or add a pointer to this issue in caveats.md. Right now, the caveats describe the current state (without your changes), so I'm OK with the way the documentation is right now.

ipspace closed this as completed Oct 26, 2024