
Create veth interfaces to mirror VF details for userspace mode #37

Closed
krsna1729 opened this issue Jan 11, 2019 · 45 comments
Labels
stale: This issue did not have any activity nor a conclusion in the past 90 days

Comments

@krsna1729

@ahalim-intel @rkamudhan

When using the device plugin in userspace mode, it would be nice to have sriov-cni create dummy interfaces in the network namespace with the MAC details matching the VF and the IPAM results applied. This way dpdk apps can look up the information in a generic way, without hostpath/file-sharing workarounds, which would greatly increase the usability of userspace mode.

https://intel-corp-team.slack.com/archives/C4C5RSEER/p1547232544029500

@krsna1729
Author

/cc @amshinde @mcastelino @egernst

@krsna1729
Author

One of the use cases we have is to send all control packets into the kernel via veth and use netlink to listen for route updates. Having veth pairs created automatically for each VF, with one end named after whatever was passed as interface: in the podspec annotation (CNI_IFNAME), would be ideal. The other end of the veth would be used by the dpdk app to enqueue the control packets.
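
A rough sketch of that idea with the vishvananda/netlink library, assuming it runs inside the container network namespace (the helper and its names are illustrative, not plugin code; entering the namespace and applying the IPAM result are omitted):

```go
package vethsketch

import (
	"net"

	"github.com/vishvananda/netlink"
)

// Hypothetical helper, not actual plugin code: inside the container network
// namespace, create a veth pair, name one end after CNI_IFNAME and give it
// the VF's MAC so the kernel answers ARP for the right address. The peer end
// is left for the DPDK app to attach to.
func createMirrorVeth(ifName, peerName string, vfMAC net.HardwareAddr) error {
	if err := netlink.LinkAdd(&netlink.Veth{
		LinkAttrs: netlink.LinkAttrs{Name: ifName},
		PeerName:  peerName,
	}); err != nil {
		return err
	}

	kernelEnd, err := netlink.LinkByName(ifName)
	if err != nil {
		return err
	}
	if err := netlink.LinkSetHardwareAddr(kernelEnd, vfMAC); err != nil {
		return err
	}
	if err := netlink.LinkSetUp(kernelEnd); err != nil {
		return err
	}

	dpdkEnd, err := netlink.LinkByName(peerName)
	if err != nil {
		return err
	}
	// The IPAM result (addresses, routes) would then be applied to kernelEnd,
	// just as for a VF left in netdev mode.
	return netlink.LinkSetUp(dpdkEnd)
}
```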

@rkamudhan
Member

@krsna1729 Can you give us the link to the dpdk sample app?

@krsna1729
Author

The actual app is internal for now. Will see if I can create a sample.

@krsna1729 changed the title from "Create dummy interfaces to mirror VF details for userspace mode" to "Create veth interfaces to mirror VF details for userspace mode" on Jan 11, 2019
@krsna1729
Author

krsna1729 commented Jan 12, 2019

Here is a sample -

setup.sh
router.bess
pipeline.txt

Haven't passed traffic to test but that's the idea

@ahalimx86
Collaborator

I think there will be implications to creating a dummy interface with the same IP/MAC config as the VF itself. Most likely this will mess up the forwarding table and ARP. We need to think of a different approach for capturing the IPAM information and then passing it on to the userspace driver.

@krsna1729
Author

@ahalim-intel if the VF is bound to a uio/vfio driver then it would not appear in the Linux stack and would not cause an issue, right?

@krsna1729
Author

We would not do this if the VFs are in netdev mode, since the control messages are already being answered by the kernel.

@krsna1729
Author

krsna1729 commented Jan 15, 2019

Internally we have tested this model and it works when passing traffic. The sample linked above demonstrates the concept. Let me know if you see any other issues. It can be gated by a conf option for backward compatibility reasons if need be.

@mcastelino

@krsna1729 so to clarify.

  1. Both ends of the veth will be in the container namespace itself
  2. One end of the veth is programmed with the IP and MAC of the VF
  3. The VF is bound to VFIO/DPDK
  4. The other end of the veth is connected to the DPDK app

So the Linux kernel within the namespace can respond to ARP, ping, etc. on the programmed end, and the DPDK app can delegate some functionality back to the Linux kernel stack.

@krsna1729
Author

@mcastelino That is correct!

@rkamudhan
Member

rkamudhan commented Jan 22, 2019

+1 for this. I would like to test the sample app and will give it a try. One more question: could this feature or plugin work in the chained model instead of living in SRIOV CNI? The plugin would receive the previous results and work on the container namespace itself. I would like to decouple the features and complexity as much as possible; in the CNI community we usually use the chaining mechanism.

@krsna1729
Author

krsna1729 commented Jan 22, 2019

@rkamudhan yup, it can be done as a separate plugin instead of complicating sriov-cni. I guess if you are using the device plugin in userspace mode then you should not be calling sriov-cni today.

Since the devices are already bound to vfio-pci, and it seems non-trivial to retrieve the MAC address of the VF in that case, I was thinking:

  1. let the sysadmin set the MAC address of each VF using some logic in this script
  2. the CNI figures out the VF index using the deviceID passed by multus (read from a checkpoint file shared with the device plugin)
  3. use the netlink equivalent of the command below to retrieve the MAC (see the sketch after this list)
$ ip link show dev ens785f0 | grep 'vf 0'
  vf 0 MAC de:ad:be:ef:00:00, spoof checking off, link-state enable, trust off
  4. set up the veth pair with that MAC (could be a separate plugin)
  5. call ipam
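
A minimal sketch of step 3, assuming the vishvananda/netlink library (which exposes the per-VF info on the PF's link attributes); pfName and vfIndex stand in for whatever is derived from the deviceID:

```go
package macsketch

import (
	"fmt"
	"net"

	"github.com/vishvananda/netlink"
)

// Sketch of step 3: read the administrative MAC of a VF from its parent PF,
// the netlink equivalent of parsing "ip link show dev ens785f0". pfName and
// vfIndex stand in for whatever the CNI derives from the deviceID.
func vfMAC(pfName string, vfIndex int) (net.HardwareAddr, error) {
	pf, err := netlink.LinkByName(pfName)
	if err != nil {
		return nil, err
	}
	for _, vf := range pf.Attrs().Vfs {
		if vf.ID == vfIndex {
			return vf.Mac, nil
		}
	}
	return nil, fmt.Errorf("vf %d not found on %s", vfIndex, pfName)
}
```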

@mcastelino

mcastelino commented Jan 22, 2019

@krsna1729 CNI results field now supports reporting mac address but it is optional!

https://github.com/containernetworking/cni/blob/master/SPEC.md#result

However, that means the SRIOV plugin should report the MAC address.

@rkamudhan does the SRIOV plugin populate the mac address field of the result?
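
For illustration, a result carrying the MAC would look roughly like this under the 0.3.x spec (all values made up):

```json
{
  "cniVersion": "0.3.1",
  "interfaces": [
    { "name": "net1", "mac": "de:ad:be:ef:00:00", "sandbox": "/var/run/netns/example" }
  ],
  "ips": [
    { "version": "4", "interface": 0, "address": "198.18.0.1/30" }
  ]
}
```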

@rkamudhan
Member

@mcastelino you are exactly right. The CNI spec supports a MAC address in the result. There are two ways to do it: either through the current SRIOV CNI and the chaining mechanism, or through the device plugin as @krsna1729 explained. But it is tricky because the MAC address is not properly populated on a few NICs.

@krsna1729
Author

@rkamudhan I meant in the CNI itself. This does not touch the device plugin. Sorry if I muddied the flow/explanation.

@krsna1729
Author

@rkamudhan Updated setup.sh to show the different stages

$ ./setup.sh
bess
38e6c194bf0377081922b8b84062157760810f1c5221e942018c2afaac5ab462
3: foo@foo-vdev: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether de:ad:be:ef:00:00 brd ff:ff:ff:ff:ff:ff
default via 172.17.0.1 dev eth0
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2
198.18.0.0/30 dev foo proto kernel scope link src 198.18.0.1
198.19.0.0/30 dev foo proto kernel scope link src 198.19.0.1

@krsna1729
Author

@mcastelino

@krsna1729 in your bash plugin you need to pass the result through, i.e. what you get as input you need to pass through. I assume you plan to chain this?

@krsna1729
Author

@mcastelino chaining may not be needed. In the case of userspace sriov we just need the above plugin.

@booxter
Contributor

booxter commented Feb 7, 2019

Since the devices are already bound to vfio-pci and it seems non-trivial to retrieve mac address of the VF

Well, shouldn't we perhaps revisit that and make the CNI plugin bind to the userspace driver as needed? This could be controlled by additional args passed to the ADD command (so you'd have a vfio arg specified in your multus network CRD). It should help with other things, like reusing the same VFs for different binding types (both netdevice and vfio). It also reduces the number of preparation steps for VFIO enablement, since the CNI plugin would now take care of it (as well as of registering the device back on DEL). This is the same thing that Kata does for VFIO.

So the flow would look the same as for netdevice except for the last step, where we would, inside the container, unbind the netlink device and register it with the userspace driver. Just before doing so, we would also create a fake veth carrying the IPAM and L2 settings that originally belonged to the VF, which is gone by the time the plugin exits.
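
A rough sketch of that rebind step using the kernel's standard sysfs driver_override mechanism; the function itself is hypothetical, only the sysfs paths are the kernel's own:

```go
package vfiosketch

import (
	"fmt"
	"os"
	"path/filepath"
)

// Hypothetical sketch of the rebind step: detach the VF from its current
// netdev driver and let vfio-pci claim it, using the kernel's standard
// sysfs driver_override mechanism. pciAddr is e.g. "0000:02:10.2".
func bindToVfio(pciAddr string) error {
	dev := filepath.Join("/sys/bus/pci/devices", pciAddr)

	// Unbind from the current driver, if one is bound.
	unbind := filepath.Join(dev, "driver", "unbind")
	if err := os.WriteFile(unbind, []byte(pciAddr), 0200); err != nil && !os.IsNotExist(err) {
		return fmt.Errorf("unbind %s: %w", pciAddr, err)
	}

	// Tell the PCI core which driver should claim the device next...
	if err := os.WriteFile(filepath.Join(dev, "driver_override"), []byte("vfio-pci"), 0200); err != nil {
		return err
	}
	// ...and ask it to probe the device again.
	return os.WriteFile("/sys/bus/pci/drivers_probe", []byte(pciAddr), 0200)
}
```

On DEL, the same mechanism could clear the override (or write the original driver name) and re-probe to hand the device back to the kernel.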

@krsna1729
Author

@booxter I think the reason for moving the binding logic out of the scope of the CNI is primarily the proper implementation of the device concept, where the device plugin sets the permissions and also mounts the appropriate files into the container, without leaking mount details into the podSpec or requiring privileged: true. I am really happy about that.
@ahalim-intel @rkamudhan correct me.

Kata today would not work out of the box if you want to run a DPDK app in userspace mode (inside the kata VM); the device would end up as a netdevice inside the VM. For kata they bind every physical netdevice in the NetNS to vfio-pci because they need to pass it to QEMU. There are not two codepaths. @amshinde correct me.

@booxter
Contributor

booxter commented Feb 11, 2019

@krsna1729 understood the reasoning for why it's not really the CNI plugin's job to bind (you need /dev/vfio/NN to be registered with the container, which is known to the DP only). But then maybe this binding logic should live with the device plugin. I understand the appeal of having the DP run unprivileged, but it doesn't get rid of the complexity; it merely pushes it to a separate on-deployment component that will handle device registration. This comes at the additional cost of not being able to use the same pool / PF for two different workloads.

@krsna1729
Author

@booxter then the device plugin would have to access the network object referenced in the podSpec and determine whether the driver to be bound is igb_uio, vfio, or left with the Linux driver.

If we were to keep it entirely in the device plugin, then how would two pods consume/request devices bound to different drivers? The only way today is to have different pool names, intel.com/sriov_netdevice: X and intel.com/sriov_vfio: X, and to back them both with the same PF the device plugin needs to advertise 2 different device types for the same pool of resources. So every time someone asks for _netdevice: X, we would also have to decrement X from _vfio. Where it gets tricky is that the actual allocation is done in the kubelet. So unless we do something twisted where we mark X from vfio as unhealthy whenever kubelet asks to provision X of _netdevice, I don't know how we can correctly handle it.

This is me recollecting from my minimal knowledge of how the DP works with kubelet. Do provide an example and any correction to my explanation of how we can make this happen.

@booxter
Contributor

booxter commented Feb 11, 2019

@krsna1729 I guess a significant chunk of the pressure towards pool reuse for different types of workloads would go away if we could split / exclude VFs that belong to the same PF: k8snetworkplumbingwg/sriov-network-device-plugin#80. It would still not be optimal, since one may need a vfio device but have only netdevice available, but at least having a single SR-IOV card on a node wouldn't be a blocker for mixed use cases.

@greg-bock

I guess significant chunk of pressure towards pool reuse for different types of workloads would go away if we could split / exclude VFs that belong to the same PF

But then we are limiting allocation of those resources. If we split 10 VFs into two pools of five, what happens when we need 6 netdev and 4 vfio, or 10/0, etc.?

We have some code lying around for exclusion, as we have a need to keep one of the VFs from each PF from being assigned (a comma-separated list of PCI IDs, I think). But I think multiple pools for the same PF will be problematic or less than useful.

@krsna1729
Author

@booxter I think mixed usage should be solved at the cluster level. Every node/card is fully vfio or netdevice. You can start with a 50/50 split of nodes/cards. Then, if you use up one type, have an operator looking out for pods waiting on resources take action to move a node/card to the required driver to satisfy the pending request, as long as any potentially evicted pods can be accommodated with resources of the other type somewhere else.

@booxter
Contributor

booxter commented Feb 12, 2019

@krsna1729 while this may well work, at least in larger clusters, it seems like pushing complexity from the DP / CNI level into the cluster manager (which may well not exist). That being said, I see no easy way out of the situation.

Getting back to the original point of the issue (exposing IPAM information inside a pod with a userspace SR-IOV device), veth is probably one option. Has anyone considered the "VF representors" (https://netdevconf.org/1.2/session.html?or-gerlitz) for the same? This seems to be the newly endorsed way of getting a netlink representation of a VF that is actually registered with a userspace driver. It requires a relatively fresh kernel, but perhaps that's ok.

@zshi-redhat
Collaborator

@krsna1729 I guess significant chunk of pressure towards pool reuse for different types of workloads would go away if we could split / exclude VFs that belong to the same PF: intel/sriov-network-device-plugin#80.

With the new config approach, one way to achieve this is to have a new selector type, num, that allows specifying the max number of devices taken from a certain resource exposed by the device plugin, rather than specifying rootDevices at the PCI address level.

@ahalimx86
Collaborator

@zshi-redhat What does this max "num of devices" mean? Say I have a PF with 8 VFs configured. If I somehow chose to add only 6 of them using the max "num of devices", what happens with the other 2 VFs? How will the rest of the VFs be used? If you're not going to specify the PF address, then how do you choose which VFs to exclude and which ones to include? I am trying to understand why we need this exclusion.

@zshi-redhat
Collaborator

@zshi-redhat What does this max "num of devices" mean? Say I have a PF with 8 VFs configured. If I somehow chose to add only 6 of them using the max "num of devices", what happens with the other 2 VFs?

The 6 VF devices will be chosen randomly, or in order, from the selected PF (via selectors).

How will the rest of the VFs be used?

If the rest of the VFs can be discovered by a different set of selectors (for example, the remaining VFs use a different driver), they will be advertised as a different resource.
If the rest of the VFs are discovered by the same set of selectors, they will be left unadvertised, meaning they can be used for other purposes.

If you're not going to specify the PF address, then how do you choose which VFs to exclude and which ones to include? I am trying to understand why we need this exclusion.

Randomly chosen, or using the VF index number in order.

@booxter
Contributor

booxter commented Feb 14, 2019

@krsna1729 do you have some code to try? Do you plan to work on this feature?

My main concern right now is not about passing IPAM information per se, but about the CNI plugin gracefully handling the case where the device is registered with VFIO (in which case all the namespace business is skipped; we would still call IPAM and return it in the results dict). I think these two matters are separate: one is to make the existing CNI plugin successfully handle a request for VFIO binding (by doing nothing), and another is to expose IPAM information using a fake device (which is the goal of this issue). I think the first part should be tracked separately (perhaps as a dependency for this issue). I've created #63 to track this base step.

@krsna1729
Author

krsna1729 commented Feb 14, 2019

@booxter
Contributor

booxter commented Feb 15, 2019

@krsna1729 thanks for the links. Do you plan to work on a PR for the SR-IOV CNI plugin to do the same? If not, maybe I could take it over.

@krsna1729
Author

@booxter go ahead!

@krsna1729
Author

For use-case testing I set up 2 instances of bess-router, which internally is this pipeline, and did a ping test.

krsna1729/bess-router@1e5d7af

@rkamudhan
Member

@booxter @krsna1729 This is outside the SRIOV CNI plugin, using the CNI chaining mechanism, right? I went through @krsna1729's CNI version, vfioveth; it is really impressive. I think that is a really good way forward.

@krsna1729
Author

Shameless plug - I think this is a nice quickstart for anyone trying out Multus, SR-IOV
https://github.com/clearlinux/cloud-native-setup/tree/master/clr-k8s-examples/9-multi-network

booxter added a commit to booxter/sriov-cni that referenced this issue Feb 19, 2019
If a pci device allocated to a container by SR-IOV device plugin is of
vfio type then it doesn't have a netlink representation, so moving the
interface into container namespace is useless and actually breaks the
CNI plugin.

We could perhaps go with a separate noop CNI plugin to do the same but
there are plans to, in the future, allocate IPAM info for the device
and expose it by one means or another (it could be a fake veth pair to
carry the information: see issue k8snetworkplumbingwg#37; or some other mechanism).

Fixes k8snetworkplumbingwg#63
zshi-redhat pushed a commit that referenced this issue Feb 24, 2019
If a pci device allocated to a container by SR-IOV device plugin is of
vfio type then it doesn't have a netlink representation, so moving the
interface into container namespace is useless and actually breaks the
CNI plugin.

We could perhaps go with a separate noop CNI plugin to do the same but
there are plans to, in the future, allocate IPAM info for the device
and expose it by one means or another (it could be a fake veth pair to
carry the information: see issue #37; or some other mechanism).

Fixes #63
@booxter
Contributor

booxter commented Mar 21, 2019

Sorry folks, I know I said I would take it over, but I couldn't find a slot to work on it. So if anyone can pick it up instead, please do. Thanks.

@zshi-redhat
Collaborator

@krsna1729 do you have an example of how the dpdk app communicates with one end of the veth pair inside the pod? Does the dpdk app require a separate IP allocated from IPAM?

@krsna1729
Author

@zshi-redhat https://github.com/krsna1729/bess-router/blob/master/router.bess#L53

Basically, both ends of the veth stay within the network namespace. One end is used by dpdk as shown above. The other end is the one the IPAM result is applied to. No extra IP; just whatever was returned as part of the CNI call, exactly as in the Linux (netdev) mode of sriov.

The pipeline of the dpdk app would look like this:
https://github.com/krsna1729/bess-router/blob/master/pipeline.txt

@Levovar

Levovar commented Sep 27, 2019

We really think this idea is the way to go, so I actually went ahead and implemented it in DANM. I used the "dummy" interface type, though, instead of a veth; I think it better fits the scenario.
Details can be found in https://github.com/nokia/danm/pull/155/files if anyone is interested.
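
For comparison with the veth sketch earlier in the thread, a minimal sketch of the dummy-interface variant, assuming the vishvananda/netlink library (names and the address are illustrative; see the DANM PR for the real implementation):

```go
package dummysketch

import (
	"net"

	"github.com/vishvananda/netlink"
)

// Sketch of the dummy-interface variant: a single kernel-visible interface
// carrying the VF's MAC and the IPAM result, with no peer end.
func createMirrorDummy(ifName, cidr string, vfMAC net.HardwareAddr) error {
	dummy := &netlink.Dummy{
		LinkAttrs: netlink.LinkAttrs{Name: ifName, HardwareAddr: vfMAC},
	}
	if err := netlink.LinkAdd(dummy); err != nil {
		return err
	}
	addr, err := netlink.ParseAddr(cidr) // e.g. "198.18.0.1/30" from IPAM
	if err != nil {
		return err
	}
	if err := netlink.AddrAdd(dummy, addr); err != nil {
		return err
	}
	return netlink.LinkSetUp(dummy)
}
```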

@killianmuldoon
Collaborator

@zshi-redhat @ahalim-intel does this still look like a good solution for the cni?

@adrianchiris added the "stale" label (This issue did not have any activity nor a conclusion in the past 90 days) on Nov 24, 2020
@adrianchiris
Contributor

This issue was discussed in the NPWG network & resource management meeting on 07.12.2020, and it was decided that this issue should not be handled in SR-IOV CNI.

If there are any objections, please re-open the issue and join us at the bi-weekly meeting (see the CONTRIBUTING doc for more info).

Thanks!
