
Create veth interfaces to mirror VF details for userspace mode #37

Closed
krsna1729 opened this issue Jan 11, 2019 · 45 comments
Labels
stale: This issue did not have any activity nor a conclusion in the past 90 days

Comments

@krsna1729

@ahalim-intel @rkamudhan

When using the device plugin in userspace mode, it would be nice to have sriov-cni create dummy interfaces in the network namespace with the MAC details matching the VF and the IPAM results applied. This way dpdk apps can look up the information in a generic way, without hostpath/file-sharing workarounds, which would greatly increase the usability of userspace mode.

https://intel-corp-team.slack.com/archives/C4C5RSEER/p1547232544029500

@krsna1729
Author

/cc @amshinde @mcastelino @egernst

@krsna1729
Author

One of the use cases we have is to send all control packets into the kernel via veth and use netlink to listen for route updates. Having veth pairs created automatically for each VF, with one end named after whatever was passed as interface: in the podspec annotation (CNI_IFNAME), would be ideal. The other end of the veth would be used by the dpdk app to enqueue the control packets.
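
A rough sketch of that idea with the vishvananda/netlink library, assuming it runs inside the container network namespace (the helper and its names are illustrative, not plugin code; entering the namespace and applying the IPAM result are omitted):

```go
package vethsketch

import (
	"net"

	"github.com/vishvananda/netlink"
)

// Hypothetical helper, not actual plugin code: inside the container network
// namespace, create a veth pair, name one end after CNI_IFNAME and give it
// the VF's MAC so the kernel answers ARP for the right address. The peer end
// is left for the DPDK app to attach to.
func createMirrorVeth(ifName, peerName string, vfMAC net.HardwareAddr) error {
	if err := netlink.LinkAdd(&netlink.Veth{
		LinkAttrs: netlink.LinkAttrs{Name: ifName},
		PeerName:  peerName,
	}); err != nil {
		return err
	}

	kernelEnd, err := netlink.LinkByName(ifName)
	if err != nil {
		return err
	}
	if err := netlink.LinkSetHardwareAddr(kernelEnd, vfMAC); err != nil {
		return err
	}
	if err := netlink.LinkSetUp(kernelEnd); err != nil {
		return err
	}

	dpdkEnd, err := netlink.LinkByName(peerName)
	if err != nil {
		return err
	}
	// The IPAM result (addresses, routes) would then be applied to kernelEnd,
	// just as for a VF left in netdev mode.
	return netlink.LinkSetUp(dpdkEnd)
}
```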

@rkamudhan
Member

@krsna1729 Can you give us the link to the dpdk sample app?

@krsna1729
Author

The actual app is internal for now. Will see if I can create a sample.

@krsna1729 changed the title from "Create dummy interfaces to mirror VF details for userspace mode" to "Create veth interfaces to mirror VF details for userspace mode" on Jan 11, 2019
@krsna1729
Author

krsna1729 commented Jan 12, 2019

Here is a sample -

setup.sh
router.bess
pipeline.txt

Haven't passed traffic to test but that's the idea

@ahalimx86
Collaborator

I think there will be implications to creating a dummy interface with the same IP/MAC config as the VF itself. Most likely this will mess up the forwarding table and ARP. We need to think of a different approach for capturing the IPAM information and then passing it on to the userspace driver.

@krsna1729
Author

@ahalim-intel if the VF is bound to a uio/vfio driver then it would not appear in the Linux stack and would not cause an issue, right?

@krsna1729
Author

We would not do this if the VFs are in netdev mode, since the control messages are already being answered by the kernel.

@krsna1729
Author

krsna1729 commented Jan 15, 2019

Internally we have tested this model and it works when passing traffic. The sample linked above demonstrates the concept. Let me know if you see any other issues. It can be gated by a conf option for backward compatibility reasons if need be.

@mcastelino

@krsna1729 so to clarify.

  1. Both ends of the veth will be in the container namespace itself
  2. One end of the veth is programmed with the IP and MAC of the VF
  3. The VF is bound to VFIO/DPDK
  4. The other end of the veth is connected to the DPDK app

So the Linux kernel within the namespace can respond to ARP, ping, etc. on the programmed end, and the DPDK app can delegate some functionality back to the Linux kernel stack.

@krsna1729
Author

@mcastelino That is correct!

@rkamudhan
Member

rkamudhan commented Jan 22, 2019

+1 for this. I would like to test the sample app and will give it a try. One more question: could this feature or plugin work in the chained model instead of living in SRIOV CNI? The plugin would receive the previous results and work on the container namespace itself. I would like to decouple the features and complexity as much as possible; in the CNI community we usually use the chaining mechanism.

@krsna1729
Author

krsna1729 commented Jan 22, 2019

@rkamudhan yup, it can be done as a separate plugin instead of complicating sriov-cni. I guess if you are using the device plugin in userspace mode then you should not be calling sriov-cni today.

Since the devices are already bound to vfio-pci, and it seems non-trivial to retrieve the MAC address of the VF in that case, I was thinking:

  1. let the sysadmin set the MAC address of each VF using some logic in this script
  2. the CNI figures out the VF index using the deviceID passed by multus (read from a checkpoint file shared with the device plugin)
  3. use the netlink equivalent of the command below to retrieve the MAC (see the sketch after this list)
$ ip link show dev ens785f0 | grep 'vf 0'
  vf 0 MAC de:ad:be:ef:00:00, spoof checking off, link-state enable, trust off
  4. set up the veth pair with that MAC (could be a separate plugin)
  5. call ipam
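
A minimal sketch of step 3, assuming the vishvananda/netlink library (which exposes the per-VF info on the PF's link attributes); pfName and vfIndex stand in for whatever is derived from the deviceID:

```go
package macsketch

import (
	"fmt"
	"net"

	"github.com/vishvananda/netlink"
)

// Sketch of step 3: read the administrative MAC of a VF from its parent PF,
// the netlink equivalent of parsing "ip link show dev ens785f0". pfName and
// vfIndex stand in for whatever the CNI derives from the deviceID.
func vfMAC(pfName string, vfIndex int) (net.HardwareAddr, error) {
	pf, err := netlink.LinkByName(pfName)
	if err != nil {
		return nil, err
	}
	for _, vf := range pf.Attrs().Vfs {
		if vf.ID == vfIndex {
			return vf.Mac, nil
		}
	}
	return nil, fmt.Errorf("vf %d not found on %s", vfIndex, pfName)
}
```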

@mcastelino

mcastelino commented Jan 22, 2019

@krsna1729 CNI results field now supports reporting mac address but it is optional!

https://github.com/containernetworking/cni/blob/master/SPEC.md#result

However, that means the SRIOV plugin should report the MAC address.

@rkamudhan does the SRIOV plugin populate the mac address field of the result?
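
For illustration, a result carrying the MAC would look roughly like this under the 0.3.x spec (all values made up):

```json
{
  "cniVersion": "0.3.1",
  "interfaces": [
    { "name": "net1", "mac": "de:ad:be:ef:00:00", "sandbox": "/var/run/netns/example" }
  ],
  "ips": [
    { "version": "4", "interface": 0, "address": "198.18.0.1/30" }
  ]
}
```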

@rkamudhan
Member

@mcastelino you are exactly right. The CNI spec supports a MAC address in the result. There are two ways to do it: either through the current SRIOV CNI and the chaining mechanism, or through the device plugin as @krsna1729 explained. But it is tricky because the MAC address is not properly populated on a few NICs.

@krsna1729
Author

@rkamudhan I meant in the CNI itself. This does not touch the device plugin. Sorry if I muddied the flow/explanation.

@krsna1729
Author

@rkamudhan Updated setup.sh to show the different stages

$ ./setup.sh
bess
38e6c194bf0377081922b8b84062157760810f1c5221e942018c2afaac5ab462
3: foo@foo-vdev: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether de:ad:be:ef:00:00 brd ff:ff:ff:ff:ff:ff
default via 172.17.0.1 dev eth0
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.2
198.18.0.0/30 dev foo proto kernel scope link src 198.18.0.1
198.19.0.0/30 dev foo proto kernel scope link src 198.19.0.1

@krsna1729
Author

@mcastelino

@krsna1729 in your bash plugin you need to pass the result through, i.e. what you get as input you need to pass through. I assume you plan to chain this?

@krsna1729
Author

@mcastelino chaining may not be needed. In the case of userspace sriov we just need the above plugin.

@booxter
Contributor

booxter commented Feb 7, 2019

Since the devices are already bound to vfio-pci and it seems non-trivial to retrieve mac address of the VF

Well, shouldn't we perhaps revisit that and make the CNI plugin bind to the userspace driver as needed? This could be controlled by additional args passed to the ADD command (so you'd have a vfio arg specified in your multus network CRD). It should help with other things, like reusing the same VFs for different binding types (both netdevice and vfio). It also reduces the number of preparation steps for VFIO enablement, since the CNI plugin would now take care of it (as well as of registering the device back on DEL). This is the same thing that Kata does for VFIO.

So the flow would look the same as for netdevice except for the last step, where we would, inside the container, unbind the netlink device and register it with the userspace driver. Just before doing so, we would also create a fake veth carrying the IPAM and L2 settings that originally belonged to the VF, which is gone by the time the plugin exits.
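
A rough sketch of that rebind step using the kernel's standard sysfs driver_override mechanism; the function itself is hypothetical, only the sysfs paths are the kernel's own:

```go
package vfiosketch

import (
	"fmt"
	"os"
	"path/filepath"
)

// Hypothetical sketch of the rebind step: detach the VF from its current
// netdev driver and let vfio-pci claim it, using the kernel's standard
// sysfs driver_override mechanism. pciAddr is e.g. "0000:02:10.2".
func bindToVfio(pciAddr string) error {
	dev := filepath.Join("/sys/bus/pci/devices", pciAddr)

	// Unbind from the current driver, if one is bound.
	unbind := filepath.Join(dev, "driver", "unbind")
	if err := os.WriteFile(unbind, []byte(pciAddr), 0200); err != nil && !os.IsNotExist(err) {
		return fmt.Errorf("unbind %s: %w", pciAddr, err)
	}

	// Tell the PCI core which driver should claim the device next...
	if err := os.WriteFile(filepath.Join(dev, "driver_override"), []byte("vfio-pci"), 0200); err != nil {
		return err
	}
	// ...and ask it to probe the device again.
	return os.WriteFile("/sys/bus/pci/drivers_probe", []byte(pciAddr), 0200)
}
```

On DEL, the same mechanism could clear the override (or write the original driver name) and re-probe to hand the device back to the kernel.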

@krsna1729
Author

@booxter I think the reason for moving the binding logic out of the scope of the CNI is primarily the proper implementation of the device concept, where the device plugin sets the permissions and also mounts the appropriate files into the container, without leaking mount details into the podSpec or requiring privileged: true. I am really happy about that.
@ahalim-intel @rkamudhan correct me.

Kata today would not work out of the box if you want to run a DPDK app in userspace mode (inside the kata VM); the device would end up as a netdevice inside the VM. For kata they bind every physical netdevice in the NetNS to vfio-pci because they need to pass it to QEMU. There are not two codepaths. @amshinde correct me.

@booxter
Contributor

booxter commented Feb 11, 2019

@krsna1729 understood the reasoning for why it's not really the CNI plugin's job to bind (you need /dev/vfio/NN to be registered with the container, which is known to the DP only). But then maybe this binding logic should live with the device plugin. I understand the appeal of having the DP run unprivileged, but it doesn't get rid of the complexity; it merely pushes it to a separate on-deployment component that will handle device registration. This comes at the additional cost of not being able to use the same pool / PF for two different workloads.

@krsna1729
Author

@booxter then the device plugin would have to access the network object referenced in the podSpec and determine whether the driver to be bound is igb_uio, vfio, or left with the Linux driver.

If we were to keep it entirely in the device plugin, then how would two pods consume/request devices bound to different drivers? The only way today is to have different pool names, intel.com/sriov_netdevice: X and intel.com/sriov_vfio: X, and to back them both with the same PF the device plugin needs to advertise 2 different device types for the same pool of resources. So every time someone asks for _netdevice: X, we would also have to decrement X from _vfio. Where it gets tricky is that the actual allocation is done in the kubelet. So unless we do something twisted where we mark X from vfio as unhealthy whenever kubelet asks to provision X of _netdevice, I don't know how we can correctly handle it.

This is me recollecting from my minimal knowledge of how the DP works with kubelet. Do provide an example and any correction to my explanation of how we can make this happen.

@booxter
Contributor

booxter commented Feb 11, 2019

@krsna1729 I guess a significant chunk of the pressure towards pool reuse for different types of workloads would go away if we could split / exclude VFs that belong to the same PF: k8snetworkplumbingwg/sriov-network-device-plugin#80. It would still not be optimal, since one may need a vfio device but have only netdevice available, but at least having a single SR-IOV card on a node wouldn't be a blocker for mixed use cases.

@greg-bock

I guess significant chunk of pressure towards pool reuse for different types of workloads would go away if we could split / exclude VFs that belong to the same PF

But then we are limiting allocation of those resources. If we split 10 VFs into two pools of five, what happens when we need 6 netdev and 4 vfio, or 10/0, etc.?

We have some code lying around for exclusion, as we have a need to keep one of the VFs from each PF from being assigned (a comma-separated list of PCI IDs, I think). But I think multiple pools for the same PF will be problematic or less than useful.

@krsna1729
Author

@booxter I think mixed usage should be solved at the cluster level. Every node/card is fully vfio or netdevice. You can start with a 50/50 split of nodes/cards. Then, if you use up one type, have an operator looking out for pods waiting on resources take action to move a node/card to the required driver to satisfy the pending request, as long as any potentially evicted pods can be accommodated with resources of the other type somewhere else.

@booxter
Contributor

booxter commented Feb 12, 2019

@krsna1729 while this may well work, at least in larger clusters, it seems like pushing complexity from the DP / CNI level into the cluster manager (which may well not exist). That being said, I see no easy way out of the situation.

Getting back to the original point of the issue (exposing IPAM information inside a pod with a userspace SR-IOV device), veth is probably one option. Has anyone considered the "VF representors" (https://netdevconf.org/1.2/session.html?or-gerlitz) for the same? This seems to be the newly endorsed way of getting a netlink representation of a VF that is actually registered with a userspace driver. It requires a relatively fresh kernel, but perhaps that's ok.

@zshi-redhat
Collaborator

@krsna1729 I guess significant chunk of pressure towards pool reuse for different types of workloads would go away if we could split / exclude VFs that belong to the same PF: intel/sriov-network-device-plugin#80.

With the new config approach, one way to achieve this is to have a new selector type, num, that allows specifying the max number of devices taken from a certain resource exposed by the device plugin, rather than specifying rootDevices at the PCI address level.

@ahalimx86
Collaborator

@zshi-redhat What does this max "num of devices" mean? Say I have a PF with 8 VFs configured. If I somehow chose to add only 6 of them using the max "num of devices", what happens with the other 2 VFs? How will the rest of the VFs be used? If you're not going to specify the PF address, then how do you choose which VFs to exclude and which ones to include? I am trying to understand why we need this exclusion.

@zshi-redhat
Collaborator

@zshi-redhat What does this max "num of devices" mean? Say I have a PF with 8 VFs configured. If I somehow chose to add only 6 of them using the max "num of devices", what happens with the other 2 VFs?

The 6 VF devices will be chosen randomly, or in order, from the selected PF (via selectors).

How will the rest of the VFs be used?

If the rest of the VFs can be discovered by a different set of selectors (for example, the remaining VFs use a different driver), they will be advertised as a different resource.
If the rest of the VFs are discovered by the same set of selectors, they will be left unadvertised, meaning they can be used for other purposes.

If you're not going to specify the PF address, then how do you choose which VFs to exclude and which ones to include? I am trying to understand why we need this exclusion.

Randomly chosen, or using the VF index number in order.

@booxter
Contributor

booxter commented Feb 14, 2019

@krsna1729 do you have some code to try? Do you plan to work on this feature?

My main concern right now is not about passing IPAM information per se, but about the CNI plugin gracefully handling the case where the device is registered with VFIO (in which case all the namespace business is skipped; we would still call IPAM and return it in the results dict). I think these two matters are separate: one is to make the existing CNI plugin successfully handle a request for VFIO binding (by doing nothing), and another is to expose IPAM information using a fake device (which is the goal of this issue). I think the first part should be tracked separately (perhaps as a dependency for this issue). I've created #63 to track this base step.

@krsna1729
Author

krsna1729 commented Feb 14, 2019

@booxter
Contributor

booxter commented Feb 15, 2019

@krsna1729 thanks for the links. Do you plan to work on a PR for the SR-IOV CNI plugin to do the same? If not, maybe I could take it over.

@krsna1729
Author

@booxter go ahead!

@krsna1729
Author

For use-case testing I set up 2 instances of bess-router, which internally is this pipeline, and did a ping test.

krsna1729/bess-router@1e5d7af

@rkamudhan
Member

@booxter @krsna1729 This is outside the SRIOV CNI plugin, using the CNI chaining mechanism, right? I went through @krsna1729's CNI version, vfioveth; it is really impressive. I think that is a really good way forward.

@krsna1729
Author

Shameless plug - I think this is a nice quickstart for anyone trying out Multus, SR-IOV
https://github.com/clearlinux/cloud-native-setup/tree/master/clr-k8s-examples/9-multi-network

booxter added a commit to booxter/sriov-cni that referenced this issue Feb 19, 2019
If a pci device allocated to a container by SR-IOV device plugin is of
vfio type then it doesn't have a netlink representation, so moving the
interface into container namespace is useless and actually breaks the
CNI plugin.

We could perhaps go with a separate noop CNI plugin to do the same but
there are plans to, in the future, allocate IPAM info for the device
and expose it by one means or another (it could be a fake veth pair to
carry the information: see issue k8snetworkplumbingwg#37; or some other mechanism).

Fixes k8snetworkplumbingwg#63
zshi-redhat pushed a commit that referenced this issue Feb 24, 2019
If a pci device allocated to a container by SR-IOV device plugin is of
vfio type then it doesn't have a netlink representation, so moving the
interface into container namespace is useless and actually breaks the
CNI plugin.

We could perhaps go with a separate noop CNI plugin to do the same but
there are plans to, in the future, allocate IPAM info for the device
and expose it by one means or another (it could be a fake veth pair to
carry the information: see issue #37; or some other mechanism).

Fixes #63
@booxter
Contributor

booxter commented Mar 21, 2019

Sorry folks, I know I said I would take it over, but I couldn't find a slot to work on it. So if anyone can pick it up instead, please do. Thanks.

@zshi-redhat
Collaborator

@krsna1729 do you have an example of how the dpdk app communicates with one end of the veth pair inside the pod? Does the dpdk app require a separate IP allocated from IPAM?

@krsna1729
Author

@zshi-redhat https://github.com/krsna1729/bess-router/blob/master/router.bess#L53

Basically, both ends of the veth stay within the network namespace. One end is used by dpdk as shown above. The other end is the one the IPAM result is applied to. No extra IP; just whatever was returned as part of the CNI call, exactly as in the Linux (netdev) mode of sriov.

The pipeline of the dpdk app would look like this:
https://github.com/krsna1729/bess-router/blob/master/pipeline.txt

@Levovar

Levovar commented Sep 27, 2019

We really think this idea is the way to go, so I actually went ahead and implemented it in DANM. I used the "dummy" interface type, though, instead of a veth; I think it better fits the scenario.
Details can be found in https://github.com/nokia/danm/pull/155/files if anyone is interested.
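
For comparison with the veth sketch earlier in the thread, a minimal sketch of the dummy-interface variant, assuming the vishvananda/netlink library (names and the address are illustrative; see the DANM PR for the real implementation):

```go
package dummysketch

import (
	"net"

	"github.com/vishvananda/netlink"
)

// Sketch of the dummy-interface variant: a single kernel-visible interface
// carrying the VF's MAC and the IPAM result, with no peer end.
func createMirrorDummy(ifName, cidr string, vfMAC net.HardwareAddr) error {
	dummy := &netlink.Dummy{
		LinkAttrs: netlink.LinkAttrs{Name: ifName, HardwareAddr: vfMAC},
	}
	if err := netlink.LinkAdd(dummy); err != nil {
		return err
	}
	addr, err := netlink.ParseAddr(cidr) // e.g. "198.18.0.1/30" from IPAM
	if err != nil {
		return err
	}
	if err := netlink.AddrAdd(dummy, addr); err != nil {
		return err
	}
	return netlink.LinkSetUp(dummy)
}
```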

@killianmuldoon
Collaborator

@zshi-redhat @ahalim-intel does this still look like a good solution for the cni?

@adrianchiris added the "stale" label (This issue did not have any activity nor a conclusion in the past 90 days) on Nov 24, 2020
@adrianchiris
Contributor

This issue was discussed in the NPWG network & resource management meeting on 07.12.2020, and it was decided that this issue should not be handled in SR-IOV CNI.

If there are any objections, please re-open the issue and join us at the bi-weekly meeting (see the CONTRIBUTING doc for more info).

Thanks!
