Create veth interfaces to mirror VF details for userspace mode #37
One of the use cases we have is to send all control packets into the kernel via veth and use netlink to listen for route updates. Having the feature where veth pairs are created automatically for each VF, with one end named with what was passed as |
@krsna1729 Can you give us a link to the DPDK sample app? |
The actual app is internal for now. Will see if I can create a sample. |
Here is a sample - setup.sh. I haven't passed traffic to test it, but that's the idea. |
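For illustration, a minimal sketch of the idea (all interface names, namespace names and addresses below are hypothetical; a real implementation would take the MAC and IPAM result from the CNI call):

```bash
#!/bin/bash
# Minimal sketch of the veth-mirror idea (hypothetical names/addresses).
# The kernel-facing end carries the VF's MAC and the IPAM result so the
# kernel can answer ARP/ping; the peer end is left for the DPDK app.
set -e

NETNS="pod-netns"            # container network namespace (placeholder)
VF_MAC="aa:bb:cc:dd:ee:01"   # MAC captured from the VF (assumed known)
IPAM_CIDR="10.10.1.5/24"     # address returned by IPAM (assumed known)
IFNAME="net0"                # interface name passed in the CNI call

# Create the pair and move both ends into the pod's namespace.
ip link add "${IFNAME}" type veth peer name "${IFNAME}-dpdk"
ip link set "${IFNAME}" netns "${NETNS}"
ip link set "${IFNAME}-dpdk" netns "${NETNS}"

# Mirror the VF details on the kernel-facing end and apply the IPAM result.
ip netns exec "${NETNS}" ip link set "${IFNAME}" address "${VF_MAC}"
ip netns exec "${NETNS}" ip addr add "${IPAM_CIDR}" dev "${IFNAME}"
ip netns exec "${NETNS}" ip link set "${IFNAME}" up
ip netns exec "${NETNS}" ip link set "${IFNAME}-dpdk" up
```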
I think there will be implications to creating a dummy interface with the same IP/MAC config as the VF itself. Most likely this will mess up the forwarding table and ARP. We need to think of a different approach for capturing IPAM information and then passing it on to the userspace driver. |
@ahalim-intel if the VF is bound to uio/vfio drivers then it would not appear in the Linux stack and so would not cause an issue, right? |
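(For anyone checking this on a node: one quick way to see which driver a VF is bound to is via sysfs; the PCI address below is a placeholder.)

```bash
# Print the driver a VF is currently bound to (placeholder PCI address).
VF_PCI="0000:03:10.0"
readlink -f "/sys/bus/pci/devices/${VF_PCI}/driver"
# If this resolves to .../drivers/vfio-pci or uio_pci_generic, the VF has
# no kernel netdev and will not show up in the Linux networking stack.
```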
We would not do this if VFs are in |
Internally we have tested this model and it works when passing traffic. The sample linked above demonstrates the concept. Let me know if you see any other issues. It can be gated by a conf option for backward compatibility reasons if need be. |
@krsna1729 so to clarify: the Linux kernel within the namespace can respond to ARP/ping etc. on the programmed end? |
@mcastelino That is correct! |
+1 for this. I'd like to test the sample app and will give it a try. One more question: can this feature or plugin work as a chained plugin instead of being in the SR-IOV CNI? The plugin would get the previous results and work on the container namespace itself. I'd like to decouple the features and complexity as much as possible; in the CNI community, we usually use the chaining mechanism. |
@rkamudhan yup, it can be done as a separate plugin instead of complicating sriov-cni. I guess if you are using the device plugin in usermode then one should not be calling sriov-cni today. Since the devices are already bound to vfio-pci, and it seems non-trivial to retrieve the MAC address of the VF in that case, I was thinking - |
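(One possible way around the MAC problem, sketched below with a placeholder PF name and VF index: the PF's kernel driver still reports per-VF MACs via netlink even when the VF itself is bound to vfio-pci, as long as a MAC is actually set/known on the PF side.)

```bash
# Read a VF's MAC from its parent PF even when the VF is bound to vfio-pci
# (PF name and VF index are placeholders).
PF="ens785f0"
VF_INDEX=0
ip link show "${PF}" \
    | awk -v idx="${VF_INDEX}" '$1 == "vf" && $2 == idx' \
    | grep -oE '([0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}' | head -1
# Output layout varies between iproute2 versions, hence the generic MAC
# pattern match. This prints 00:00:00:00:00:00 unless a MAC was set
# administratively, e.g. "ip link set ${PF} vf ${VF_INDEX} mac <addr>".
```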
@krsna1729 the CNI result field now supports reporting the MAC address, but it is optional! https://github.com/containernetworking/cni/blob/master/SPEC.md#result However, that means the SR-IOV plugin should report the MAC address. @rkamudhan does the SR-IOV plugin populate the MAC address field of the result? |
@mcastelino you are exactly right. The CNI spec supports the MAC address in the result. There are two ways to do it: either through the current SR-IOV CNI and the chaining mechanism, or through the device plugin as @krsna1729 explained. But it is tricky because the MAC address is not properly populated on a few NICs. |
@rkamudhan I meant in the CNI itself. This does not touch the device plugin. Sorry if I muddied the flow/explanation. |
@rkamudhan Updated setup.sh to show the different stages. |
@krsna1729 in your bash plugin, you need to pass the result through, i.e. what you get as input you need to pass through. I assume you plan to chain this? |
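(If it were chained, the pass-through would roughly look like the sketch below; this assumes a CNI 0.3.x-style conf with `prevResult` on stdin and uses `jq`, and is not the actual vfioveth code.)

```bash
#!/bin/bash
# Sketch of a chained bash CNI plugin that passes prevResult through
# (assumes a 0.3.x conf containing "prevResult"; requires jq).
set -e
CONF="$(cat /dev/stdin)"

case "${CNI_COMMAND}" in
ADD)
    # ... real work would happen here (e.g. create the veth/dummy mirror) ...
    # then echo the previous result unchanged so the chain keeps working.
    echo "${CONF}" | jq -c '.prevResult'
    ;;
DEL)
    # cleanup would go here
    ;;
VERSION)
    echo '{"cniVersion":"0.3.1","supportedVersions":["0.3.0","0.3.1"]}'
    ;;
esac
```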
@mcastelino chaining may not be needed. In the case of userspace SR-IOV we just need the above plugin. |
Well, shouldn't we perhaps revisit this and make the CNI plugin bind to the userspace driver as needed? This could be controlled by additional args passed to the ADD command (so you'll have a ...). So the flow would look the same way as for |
@booxter I think the reason for moving the binding logic out of the scope of the CNI is primarily for proper support of Kata. Kata today would not work out of the box if you want to run a DPDK app in userspace mode (inside the Kata VM); the device would end up as a netdevice inside the VM. For Kata they bind every physical netdevice in the NetNS to vfio-pci because they need to pass it to QEMU, so there are no two codepaths. @amshinde correct me. |
@krsna1729 understood the reasoning for why it's not really the CNI plugin's job to bind (you need /dev/vfio/NN to be registered with the container, which is known to the DP only). But then maybe this binding logic should live with the device plugin. I understand the appeal of having the DP run unprivileged, but it doesn't get rid of complexity; it merely pushes it to a separate on-deployment component that will handle device registration, with the additional cost of not being able to use the same pool / PF for two different workloads. |
@booxter then the device plugin would have to access the network object referenced in the podSpec and determine whether the driver to be bound is igb_uio or vfio, or whether to leave the device with its Linux driver. If we were to keep it entirely in the device plugin, then how would two pods consume/request devices bound to different drivers? The only way today is to have different pool names. This is me recollecting from my minimal knowledge of how the DP works with kubelet; do provide an example and any correction to my explanation of how we can make this happen. |
@krsna1729 I guess a significant chunk of the pressure towards pool reuse for different types of workloads would go away if we could split / exclude VFs that belong to the same PF: k8snetworkplumbingwg/sriov-network-device-plugin#80. It would still not be optimal since one may need a |
But then we are limiting allocation of those resources. If we split 10 VFs into two pools of five, what happens when we need 6 netdev and 4 vfio, or 10/0, etc.? We have some code lying around for exclusion, as we have a need to prevent one of the VFs from each PF from being assigned (a comma-separated list of PCI IDs, I think). But I think multiple pools for the same PF will be problematic or less than useful. |
@booxter I think mixed usage should be solved at the cluster level. Every node/card is fully vfio or netdevice. You can start with a 50/50 split of nodes/cards. Then, if you use up one type, have an operator looking out for pods waiting on resources take action to move a node/card to the required driver to satisfy the pending request, as long as any potentially evicted pods can be accommodated with resources of the other type somewhere else. |
@krsna1729 while this would probably work, at least in larger clusters, it seems like pushing complexity from the DP / CNI level into the cluster manager (which may well not exist). That being said, I see no easy way out of the situation. Getting to the original point of the issue (exposing IPAM information inside a pod with a userspace SR-IOV device), veth is probably one option. Has anyone considered the "VF representors": https://netdevconf.org/1.2/session.html?or-gerlitz for the same? Seems like this is the newly endorsed way of getting a netlink representation of a VF that is actually registered with a userspace driver. It requires a relatively fresh kernel but perhaps that's ok. |
With the new config approach, one way to achieve this is to have a new selector type, num, that allows specifying the max number of devices taken from a certain resource exposed by the device plugin, rather than specifying rootDevices at the PCI address level. |
@zshi-redhat What does max "num of devices" mean? Say I have a PF with 8 VFs configured. If I somehow chose to add only 6 of them using max "num of devices", what happens with the other 2 VFs? How will the rest of the VFs be used? If you're not going to specify the PF address, then how do you choose which VFs to exclude and which ones to include? I am trying to understand why we need this exclusion. |
6 VF devices would be chosen from the selected PF (via selectors), either randomly or in order by VF index. If the rest of the VFs can be discovered by a different set of selectors (for example, the remaining VFs use a different driver), they will be advertised as a different resource. |
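(To make the "different set of selectors" idea concrete, a config for the SR-IOV device plugin could look roughly like the sketch below, splitting one PF's VFs into two resources by driver. The PF name, resource names and config path are placeholders, and the exact selector schema should be checked against the device plugin docs.)

```bash
# Hypothetical sriov-network-device-plugin config: VFs of the same PF end up
# in different resource pools depending on which driver they are bound to.
# Field names/path are assumptions - verify against the plugin documentation.
cat > /etc/pcidp/config.json <<'EOF'
{
  "resourceList": [
    {
      "resourceName": "sriov_netdevice",
      "selectors": { "pfNames": ["ens785f0"], "drivers": ["iavf"] }
    },
    {
      "resourceName": "sriov_vfio",
      "selectors": { "pfNames": ["ens785f0"], "drivers": ["vfio-pci"] }
    }
  ]
}
EOF
```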
@krsna1729 do you have some code to try? Do you plan to work on this feature? My main concern right now is not about passing IPAM information per se but about the CNI plugin gracefully handling the case where the device is registered with VFIO (in which case all the namespace business is skipped; we would still call IPAM and return it in the results dict). I think these two matters are separate: one is to make the existing CNI plugin successfully handle a request for VFIO binding (by doing nothing) and another is to expose IPAM information using a fake device (which is the goal of this issue). I think the first part should be tracked separately (perhaps as a dependency for this issue). I've created #63 to track this base step. |
@booxter I have a quick and dirty bash CNI I have been using as a band-aid, and a helper systemd script that binds VFs to vfio-pci after capturing their MAC. |
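(The helper boils down to something like the following; PF name, VF index, PCI address and paths are placeholders, and driver_override is just one way to do the bind - dpdk-devbind.py would work too.)

```bash
#!/bin/bash
# Rough shape of a "capture MAC, then bind to vfio-pci" helper
# (placeholder names/paths; could be run at boot from a systemd unit).
set -e

PF="ens785f0"
VF_INDEX=0
VF_PCI="0000:03:10.0"
STATE_DIR="/var/run/vf-macs"

mkdir -p "${STATE_DIR}"

# 1. Record the VF MAC as reported by the PF so a CNI plugin can use it later.
ip link show "${PF}" \
    | awk -v idx="${VF_INDEX}" '$1 == "vf" && $2 == idx' \
    | grep -oE '([0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}' | head -1 \
    > "${STATE_DIR}/${VF_PCI}"

# 2. Unbind the VF from its current driver and bind it to vfio-pci.
modprobe vfio-pci
if [ -e "/sys/bus/pci/devices/${VF_PCI}/driver" ]; then
    echo "${VF_PCI}" > "/sys/bus/pci/devices/${VF_PCI}/driver/unbind"
fi
echo vfio-pci > "/sys/bus/pci/devices/${VF_PCI}/driver_override"
echo "${VF_PCI}" > /sys/bus/pci/drivers_probe
```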
@krsna1729 thanks for the links. Do you plan to work on PR for SR-IOV CNI plugin to do the same? If not, maybe I could take it over. |
@booxter go ahead! |
For use-case testing I set up 2 instances of bess-router, which internally is this pipeline, and did a ping test. |
@booxter @krsna1729 This is outside of the SR-IOV CNI plugin, using the CNI chaining mechanism, right? I went through @krsna1729's CNI version of vfioveth; it is really impressive. I think that is a really good way forward. |
@rkamudhan Not chaining at the moment in my version. This is a standalone CNI, deployed as follows. Example network |
Shameless plug - I think this is a nice quickstart for anyone trying out Multus and SR-IOV. |
If a PCI device allocated to a container by the SR-IOV device plugin is of vfio type then it doesn't have a netlink representation, so moving the interface into the container namespace is useless and actually breaks the CNI plugin. We could perhaps go with a separate noop CNI plugin to do the same, but there are plans to, in the future, allocate IPAM info for the device and expose it by one means or another (it could be a fake veth pair to carry the information: see issue #37; or some other mechanism). Fixes #63
Sorry folks, I know I said I would take it over, but I couldn't find a slot to work on it. So if anyone can pick it up instead, please do. Thanks. |
@krsna1729 do you have an example of how a DPDK app communicates with one end of the veth pair inside the pod? Does the DPDK app require a separate IP allocated from IPAM? |
@zshi-redhat https://github.com/krsna1729/bess-router/blob/master/router.bess#L53 Basically both ends of the veth stay within the network namespace. One end is used by DPDK as shown above; the other end is the one the IPAM result is applied to. No extra IP - just whatever was returned as part of the CNI call, exactly as in the Linux mode of SR-IOV. The pipeline of the DPDK app would be like this |
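(To give a rough idea of how "one end is used by dpdk" could look outside of BESS: a DPDK app can open the second veth end through the af_packet PMD while the VF itself stays on vfio-pci. The app name, core list, PCI address and interface name below are all placeholders.)

```bash
# Illustrative only: attach a DPDK app to the second veth end via the
# af_packet vdev; the vfio-bound VF is passed in on the PCI allow list.
# (-a is the allow-list flag in recent DPDK releases; older ones use -w.)
./dpdk-app -l 0-1 \
    -a 0000:03:10.0 \
    --vdev=net_af_packet0,iface=net0-dpdk
```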
We really find this idea to be the way to go, so I actually went ahead and implemented it in DANM. I used the "dummy" interface though instead of a veth; I think it better fits the scenario. |
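(The dummy variant is even simpler; inside the pod namespace it is roughly the following, again with placeholder values.)

```bash
# Sketch of the dummy-interface variant: a single interface that only
# carries the VF's MAC and the IPAM result (placeholder values).
ip link add net0 type dummy
ip link set net0 address aa:bb:cc:dd:ee:01
ip addr add 10.10.1.5/24 dev net0
ip link set net0 up
```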
@zshi-redhat @ahalim-intel does this still look like a good solution for the cni? |
This issue was discussed in the NPWG network & resource management meeting on 07.12.2020 and it was decided that it should not be handled in the SR-IOV CNI. If there are any objections, please re-open the issue and join us at the bi-weekly meeting (see the CONTRIBUTING doc for more info). Thanks! |
@ahalim-intel @rkamudhan
When using the device plugin in userspace mode, it would be nice to have sriov-cni create dummy interfaces in the network namespace with MAC details matching the VF and the IPAM results applied. This way DPDK apps can look up the information in a generic, non-hostpath/file-sharing way, which would greatly increase the usability of userspace mode.
https://intel-corp-team.slack.com/archives/C4C5RSEER/p1547232544029500