VF passthrough not working inside a VM #127
Comments
@pperiyasamy Can you provide the net-attach-def CRD for sriov-net-a? |
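(For reference, a net-attach-def of the kind being requested typically looks roughly like the following sketch; the IPAM subnet is an assumption, and the resource name is borrowed from the error later in this thread.)

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net-a
  annotations:
    # ties this network to a device-plugin resource pool
    k8s.v1.cni.cncf.io/resourceName: intel.com/sriov_net_A
spec:
  config: '{
      "type": "sriov",
      "cniVersion": "0.3.1",
      "name": "sriov-net-a",
      "ipam": {
        "type": "host-local",
        "subnet": "10.56.217.0/24"
      }
    }'
```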
Looks like the sriov-network-device-plugin is not advertising the correct number of devices to kubelet. Could you also paste the logs of kube-sriov-device-plugin-amd64-xxrjp? |
@ahalim-intel @zshi-redhat Thanks for the prompt response. Here is the requested info:
|
Your VFs are marked as unhealthy. That's the reason no VFs are available to be allocated. Possible cause: the associated PF is down? |
Hi @ahalim-intel You are correct. Those VFs were in admin-down state; I brought them up and they are Healthy again.
I removed physfn from the path @ https://github.com/intel/sriov-cni/blob/master/pkg/utils/utils.go#L77, but now it's looking for the sriov_numvfs file, like below:
I just had the "sriovMode": false option in config.json. Shouldn't that be sufficient for binding VFs to pods? |
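(For illustration, a config.json of that shape, deployed via the device plugin's ConfigMap, might look like the following sketch; the PCI address is an assumption taken from the error quoted later in the thread.)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
      "resourceList": [{
        "resourceName": "sriov_net_A",
        "rootDevices": ["0000:05:02.0"],
        "sriovMode": false
      }]
    }
```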
@pperiyasamy As @zshi-redhat mentioned, sriov-cni won't be able to run inside a VM. It needs access to all PF resources on the host to be able to set up the VF in the Pod. |
@ahalim-intel @zshi-redhat Does it mean rootDevices in config.json should always contain only PF devices and not VFs? |
It won't work like that. To set up a VF in a Pod, PF information (and some other info) is needed. You cannot simply remove a line and expect everything to work. If sriov-cni fails to get any of that required info for any reason, it will result in an error. We do not support this mode of setup yet. The SR-IOV networking solution is for baremetal clusters only. |
@ahalim-intel I've seen the following comment @ https://github.com/intel/sriov-network-device-plugin#assumptions : "If "sriovMode": true is given for a resource config then plugin will look for virtual functions (VFs) for all the devices listed in "rootDevices" and export the discovered VFs as allocatable extended resource list. Otherwise, plugin will export the root devices themselves as the allocatable extended resources." So I just configured the VFs in rootDevices and am trying to attach one to a Pod. |
I'd say this statement is still true, as the sr-iov network device plugin can report VFs as resources to kubelet; it's just the SR-IOV CNI which cannot configure the VF properly without accessing the PF.
|
But it's looking for the sriov_numvfs file in the VF's device directory (as per the error below) and not in the PF's device directory. Even on BM, there is no sriov_numvfs file for VFs. Shouldn't this be a bug in the SRIOV CNI plugin?
Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "277056a4567f034d104bb1e1f5f07371da03f532cb99f06622e3cf5e9d8a7d58" network for pod "testpod1": NetworkPlugin cni failed to set up pod "testpod1_default" network: Multus: Err adding pod to network "sriov-net-a": Multus: error in invoke Delegate add - "sriov": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information:
The enp5s2 interface belongs to VF device 0000:05:02.0, which is configured in the rootDevices in config.json.
The same issue exists even after applying the below patches: host-device: containernetworking/plugins#300 |
When using
Then
|
Thanks @zshi-redhat, it works now. So both the host-device and sriov-network device plugins can be used to manage a pool of VFs as net-device passthrough inside a k8s pod. I hope k8s.v1.cni.cncf.io/resourceName is local to the k8s worker node, while being attached to the NetworkAttachmentDefinition, which is a kind of global configuration for the entire k8s cluster. |
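(A sketch of the working combination being described: a host-device net-attach-def tied to a device-plugin resource. The subnet is an assumption, and the intel_sriov_hostdevice name is borrowed from a later comment; with the patched host-device plugin, Multus passes the allocated device's PCI address to the delegate.)

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: hostdev-net
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_hostdevice
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "host-device",
      "ipam": {
        "type": "host-local",
        "subnet": "10.56.217.0/24"
      }
    }'
```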
Glad to know it works!
Yes, it is. Multus queries each net-attach-def CR associated with the pod and identifies if there is a k8s.v1.cni.cncf.io/resourceName annotation on it.
May I ask what's the network topology of your test environment? How do you run the VMs? Can you describe your use case a little bit, if possible? |
The topology is to run k8s pods inside the (tenant) VMs of OpenStack image-based deployments. So Kubernetes uses OpenStack's dataplane (example: OpenDaylight with Open vSwitch) as the underlay for pod-to-pod communication. This is the overall use case. |
For dpdk passthrough of VF devices (e.g. a VF bound to vfio-pci), we already support it with sriov-network-device-plugin. All you need to do is bind the VF to vfio-pci and use sriov-network-device-plugin to advertise the vfio-pci devices. Then a pod can request such a device in its spec. In this case, sriov-cni and Multus are not necessary, because you don't need an IP address for the dpdk interface, and there is no need to configure VF properties in the VM on OpenStack (OpenStack already takes care of the VF configuration). |
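(A sketch of that flow; the resource name and image are assumptions. The pod requests the vfio-pci-backed resource directly, with no networks annotation at all.)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-testpod
spec:
  containers:
  - name: dpdk-app
    image: dpdk-app:latest          # assumption: an image with DPDK tooling
    resources:
      requests:
        intel.com/intel_sriov_dpdk: '1'
      limits:
        intel.com/intel_sriov_dpdk: '1'
```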
That's great. I just tried this option with the below configuration and dpdk passthrough works fine.
Now the question is how to bind/unbind the vfio-pci/igb_uio drivers on the fly during the pod's lifecycle. Shouldn't it be part of the CNI plugin? Should we enhance the host-device plugin for it? |
Bind/unbind should happen before the sriov device plugin is launched, so that it can discover vfio/uio devices first and then pass the necessary container runtime configuration (/dev/vfio, etc.) to the pod via the device plugin API. It's too late for CNI to bind the device and change the container runtime configuration when it gets invoked during pod creation. The sriov device plugin doesn't support bind/unbind; it shall be done by another system config tool (for example, in OpenShift, we'll use an operator to bind/unbind vfio or uio devices).
In theory, the device plugin can expose different types (dpdk, net-device) of devices under the same resourceName, but then the user cannot tell which type of device will be allocated to the pod; to request both a net-device and a dpdk device in one pod, you will need to expose them as different resourceNames and request both resourceNames in the pod spec. |
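(With the selector-based config format mentioned later in this thread, splitting the two pools by driver might look roughly like this sketch; the vendor ID and resource names are assumptions.)

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
      "resourceList": [
        {
          "resourceName": "intel_sriov_netdevice",
          "selectors": { "vendors": ["8086"], "drivers": ["iavf", "i40evf"] }
        },
        {
          "resourceName": "intel_sriov_dpdk",
          "selectors": { "vendors": ["8086"], "drivers": ["vfio-pci"] }
        }
      ]
    }
```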
I saw the sriov-cni plugin supports bind/unbind of dpdk drivers at run time with a configuration in the net-attach definition like this:
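(The original snippet was not captured here; the legacy intel/sriov-cni dpdk section looked roughly like the following sketch, with the interface name, driver names, and tool path as assumptions.)

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-dpdk-net
spec:
  config: '{
      "type": "sriov",
      "if0": "enp5s2",
      "dpdk": {
        "kernel_driver": "i40evf",
        "dpdk_driver": "igb_uio",
        "dpdk_tool": "/opt/dpdk/usertools/dpdk-devbind.py"
      }
    }'
```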
Are you saying this is not possible when we use both the CNI and the device plugin?
This is just to avoid creating unnecessary resourceNames to enable the dpdk driver, and it might also go into the same existing network (example: sriov-net-a). As per the number of k8s.v1.cni.cncf.io/networks in the pod definition, create that many net devices inside the pod, and the other devices should go in as dpdk devices. This provides transparency and better manageability: all the VFs are under one resource name, the administrator doesn't need to distinguish between net and dpdk drivers on the VFs, and everything is abstracted inside the plugins. |
Yes, going forward we will only support sriov with CNI + Device Plugin mode.
We can discuss this. |
Yes @zshi-redhat, exactly. I'm asking for something similar to the selectors configuration, for driver selection based on the network and resourceName configuration in the pod. But it looks like we can go up to only two drivers on a resource pool per pod with this design. |
@zshi-redhat: Trying to clarify the intended use case a bit further. As Peri has explained, we run K8s in VMs on OpenStack and need to support applications that require high-performance access to secondary (private) networks in their Pods. These applications use DPDK or kernel drivers on those secondary interfaces.
The envisaged solution is to set up VLAN provider networks in OpenStack and, for each VLAN, add a number of VLAN-tagged SR-IOV VFs to the K8s worker VMs. Inside the K8s workers, all VFs of a given VLAN are equivalent from a connectivity perspective and should be pooled in the sriov device plugin, so there would be one pool per VLAN (spanning all worker nodes). An application should be able to request a VF from that pool (VLAN) and at the same time specify whether it should be passed as a DPDK interface or a normal netdev with IPAM. Hence the binding of the driver should happen at the time the VF is passed into the Pod.
In essence, we would like to decouple the orthogonal aspects of underlying network connectivity and the driver/interface configuration inside the Pod. Without this, we would have to further partition the set of VFs of a PF for each VLAN into two pools, preconfigured for the DPDK or kernel driver respectively. As the number of VFs per PF is quite limited, this would further reduce the flexibility of a K8s deployment.
The ideal solution would of course even avoid the pre-configuration of VLANs on the SR-IOV VFs and hot-plug a VLAN-tagged VF into the K8s worker VM when requested by a Pod. But that would require a Kuryr-like integration of the CNI with the underlying OpenStack. We also want to provide an equivalent solution for K8s deployed on bare metal. Ideally, the K8s orchestration interface for the applications should not see any difference (i.e. request a VF bound to a certain VLAN and specify whether to pass it as a DPDK or kernel interface). |
I understand the request: you don't want to pre-define the type of VF devices inside the VM, because a VF may be used either as a dpdk or a kernel interface depending on the application running in the Pod.
Yes, for vlan configuration in an OpenStack environment, Kuryr CNI might be a solution; sriov-cni cannot talk to Neutron, and it only supports configuring vlan/mac/spoof check/trusted vf etc. on baremetal.
May I ask how the application is going to consume the devices inside the pod? For example, how does the application discover the pod interface? |
Shouldn't it (i.e. returning the required volume mounts in the case of a dpdk interface) be done as part of the device plugin itself, in the AllocateResponse for the Allocate rpc invocation? It looks like right now uioPool.go returns an empty list for volume mounts. |
Yes, it should be. That's why bind/unbind is done before the SR-IOV Device Plugin gets launched.
It mounts host devices (uio, vfio, etc.) via the devSpec config instead of using mounts, which are for general host volumes. |
Can't we just make the bind/unbind logic also part of the Allocate rpc implementation and mount the host device via devSpec?
Ok, thanks for the info. |
That's indeed an option; there is even a bind hook in the device plugin, but it is not implemented.
|
Thanks @zshi-redhat for the details. I think a Pod would request either a kernel network device or a dpdk device, not both (or multiple devices). This is because there would be only one application running in the pod, which is of either the dpdk or the kernel type. Hence, based on the annotations section of the pod definition, the device plugin can decide which driver to bind on the PCI device, like below:
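(The original example was not captured; something along these lines, where sriov.intel.com/driver is a purely hypothetical annotation key, not an existing API, and resource and image names are assumptions.)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-net-a
    sriov.intel.com/driver: vfio-pci   # hypothetical: tells the device plugin which driver to bind
spec:
  containers:
  - name: app
    image: app:latest                  # assumption
    resources:
      requests:
        intel.com/sriov_net_A: '1'
      limits:
        intel.com/sriov_net_A: '1'
```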
Ok, thanks for the info. So should we make use of ListAndWatch for this? |
One application may require multiple interfaces inside a pod; for example, the application could be a network router that uses one interface as the uplink and the other for the radio network. Here is a live demo talk from Open Infrastructure Summit Denver that shows how this kind of application works (at the 32:10 mark, in the final-thoughts section, there is a mention that it's running on a Kubernetes cluster in an OpenStack VM and uses Multus + SR-IOV components).
There was a proposal called the Resource Class API which I thought might be able to solve this problem, but it requires a lot more upstream work. With the Resource Class API, a user can request a device with detailed properties (key-value pairs supported by the device plugin), just like requesting a device or cpu or memory, and the device plugin could configure the device with a specific driver based on the allocation call, which contains a device ID that maps to a key-value pair such as driver:dpdk. |
Okay, I understand. It has to be supported then.
In the case of dpdk binding for device pools, sriov-cni is not used. So let's say the pod definition is like below; then attach one interface as a kernel device and the other as a dpdk device (based on the number of networks and the resources requests/limits parameters), considering the whole pod definition data is available to the device plugin.
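(The original pod definition was not captured; a sketch of the shape being described, with resource, network, and image names as assumptions.)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mixed-pod
  annotations:
    # one kernel interface configured by sriov-cni; the dpdk device needs no CNI
    k8s.v1.cni.cncf.io/networks: sriov-net-a
spec:
  containers:
  - name: app
    image: app:latest                         # assumption
    resources:
      requests:
        intel.com/intel_sriov_netdevice: '1'  # kernel interface
        intel.com/intel_sriov_dpdk: '1'       # dpdk device
      limits:
        intel.com/intel_sriov_netdevice: '1'
        intel.com/intel_sriov_dpdk: '1'
```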
|
Yes, I just use this command to get to know the available dpdk devices in the system. What is the command you use to find out which device(s) from the pool are attached inside the pod? |
Cool. It works with vfio-pci. Is there any known issue with the igb_uio driver? |
The device ID information can be found via container environment variables; please refer to here for the naming convention of the environment var. Also, please let us know if there is other information the dpdk application would like to get within the container.
I didn't try igb_uio, but looking at the code, there seems to be an issue with igb_uio.
|
Good, the device id is set in the env variable PCIDEVICE_INTEL_COM_<RESOURCE_NAME>.
Yes, it's with the latest master branch and selector-based configuration.
Yes, the device is bound to the igb_uio driver.
|
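(To see the allocated device IDs in practice, a throwaway pod can simply dump its environment; a sketch, with the resource name as an assumption.)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: env-check
spec:
  containers:
  - name: check
    image: busybox
    # prints e.g. PCIDEVICE_INTEL_COM_SRIOV_NET_A=0000:05:02.0
    command: ["sh", "-c", "env | grep PCIDEVICE_; sleep 3600"]
    resources:
      requests:
        intel.com/sriov_net_A: '1'
      limits:
        intel.com/sriov_net_A: '1'
```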
Ok, I think this might be a bug. The issue is that we don't recognize
@ahalim-intel ^^ looks like an issue with using igb_uio type interfaces; did you hit this error before? |
@zshi-redhat Does the selectors logic still have support for the host-device plugin too?
This is with master Multus and the host-device plugin (with your fixes), and config.json contains the intel_sriov_hostdevice resource which is being used by the host-device network.
|
From the log message, it seems it's not using the latest host-device plugin.
|
Let me bring up this topic (i.e. dynamic driver binding, global resource pool) again.
Case 1 needs the same net-attach-def CRD object sharing the same resource pool. The device driver type for each interface would come from the pod definition, I guess. |
The decision of which device to allocate is made by kubelet, and plugins like the device plugin or CNI cannot make suggestions to kubelet on which device to allocate. I guess the question is how the device driver type would be configured in the pod definition and interpreted by kubelet to request a device?
Having multiple net-attach-def objects share the same resource pool is supported: a user can define multiple net-attach-def objects and choose to use either one of them in the pod spec for a device.
|
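(A sketch of that second point: two net-attach-def objects sharing one pool simply carry the same resourceName annotation; all names are assumptions.)

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net-a
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_netdevice
spec:
  config: '{ "cniVersion": "0.3.1", "type": "sriov",
             "ipam": { "type": "host-local", "subnet": "10.56.217.0/24" } }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net-b
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_netdevice
spec:
  config: '{ "cniVersion": "0.3.1", "type": "sriov" }'
```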
Yes, I think kubelet should pass the pod name (needs changes in the AllocateRequest API?) into the device plugin, so that the device plugin can read the pod definition and bind the appropriate driver at run time. For example, the annotations section in the pod definition would look like below for a pod having one interface per device type from the same network.
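(The original annotation example was not captured; a purely hypothetical shape. The JSON-list form of the networks annotation is real Multus syntax, but the "driver" key is invented for illustration, as are the names.)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  annotations:
    # "driver" is a hypothetical per-interface key, not an existing API
    k8s.v1.cni.cncf.io/networks: '[
      { "name": "sriov-net-a", "driver": "netdevice" },
      { "name": "sriov-net-a", "driver": "vfio-pci" }
    ]'
spec:
  containers:
  - name: app
    image: app:latest   # assumption
    resources:
      requests:
        intel.com/sriov_net_A: '2'
      limits:
        intel.com/sriov_net_A: '2'
```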
Okay, that's good to hear. I've tried to associate two networks (net and dpdk) with the same resource, but when a pod is created with the dpdk network, I can see dpdk devices inside the pod but am not able to run a dpdk application on them. I'm seeing the following error while running the testpmd dpdk application.
|
Update: the testpmd application just works if it runs on cores 1-2 instead of 0-1, hence it's not an issue with the device plugin. |
May I know how testpmd gets to know how many hugepages can be used and which cpus to pin within the container? |
I'm not sure which physical cores are pinned inside the pod. Though we provide cpu: '8' in the requests/limits section, I can see all physical cores inside the pod via /proc/cpuinfo. But testpmd doesn't work with cores 0-1, whereas it worked with cores 1-2; I found that by trial and error. Is there a way to figure out the pinned cores inside the pod?
Also, at times I can see a net device attached inside the pod (without ip configuration) for the dpdk network. Shouldn't kubelet always use the device IDs which are bound to the dpdk driver? |
Would you please check and update
Yes, that's correct, because kubelet will randomly choose a device from the resource pool; it's not aware of which device is in kernel mode or dpdk mode. But sriov-cni has the ability to detect the given device and configure it no matter which mode is used. |
These files contain 0-55 and 0 respectively, which means CPUs are not allocated exclusively even though CPU requests/limits are specified in the pod definition, isn't it? Is there any reason behind this? Do we need to update these files manually so that the pod can run on dedicated cores?
But we might need to choose a dpdk device for a particular pod to run a dpdk application on it. This is why we need dynamic binding of the appropriate driver inside the device plugin at pod bring-up time, by reading the pod and net-attach-def definitions. |
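(On the CPU pinning question above: with the default CPU manager policy "none", the container cpuset spans all host cores, hence the 0-55, regardless of requests/limits. Exclusive cores require the kubelet static CPU manager policy plus a Guaranteed-QoS pod with integer CPU requests; a minimal KubeletConfiguration sketch:)

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
kubeReserved:
  cpu: "1"   # the static policy requires some CPU reserved for system/kube daemons
```

With that in place, /sys/fs/cgroup/cpuset/cpuset.cpus inside a Guaranteed pod should list only the exclusively assigned cores.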
Looks like I don't have |
I feel this feature might be similar to what you want: the ability to pass a flag (could be an annotation) indicating the device usage to the device plugin, so that the device plugin can do the binding dynamically. |
@pperiyasamy there was a fix for the igb_uio driver by @ahalim-intel; can you help verify whether it works for you? Thanks! |
@zshi-redhat @ahalim-intel In fact, I tried the same change a few days back, but it didn't work for me; it looks like something more needs to be done. Or did it work for you? |
The issue with the "igb_uio" driver selector that didn't export the 'uio*' device file should be resolved. Not being able to run testpmd is another story, as running any dpdk application requires many other dependencies to be satisfied. Here's how I've tested the above fix: Device plugin config map:
Net attach CRD:
Sample PodSpec:
We can see that the associated |
Yes @ahalim-intel, I can also see the igb_uio device inside the pod, but testpmd is not working on it (error shown below).
Here is the pod definition:
Whereas testpmd works fine with a vfio-pci device, as shown below.
|
@pperiyasamy
Note that for igb_uio devices, higher privileges are required, and the host /sys volume needs to be mounted with write permission in the container. |
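(A sketch of a pod spec meeting those two requirements; network, resource, and image names are assumptions.)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-uio-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-dpdk-net
spec:
  containers:
  - name: dpdk-app
    image: dpdk-app:latest     # assumption
    securityContext:
      privileged: true         # igb_uio needs elevated privileges
    volumeMounts:
    - name: sys
      mountPath: /sys
      readOnly: false          # host /sys mounted writable, per the note above
    resources:
      requests:
        intel.com/intel_sriov_dpdk: '1'
      limits:
        intel.com/intel_sriov_dpdk: '1'
  volumes:
  - name: sys
    hostPath:
      path: /sys
```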
Yes @ahalim-intel, the above pod spec for the igb_uio driver works for a POD running on a VM, but I still saw the same issue with a POD running on BM. I will check it again and let you know. |
Hi, on my host I bind vfio-pci to the VF, but I get:
2019-09-25T01:50:37.556993Z qemu-system-x86_64: -drive file=/dev/vfio/vfio,format=raw,if=none,id=drive-scsi0-0-0-1: Could not refresh total sector count: Illegal seek')
Is this the correct device ID of the VF?
|
Closing this issue as supporting VF in VM has been addressed in PR |
Hi,
I've created 3 VFs on the PF of a 40/10G ethernet interface (i40e), did passthrough of those VFs into a VM (i.e. the Kubernetes worker node), and am now trying to use sriov-network-device-plugin to manage those VFs for Kubernetes PODs.
But when I try to create a pod using the deployments/kernel-net-demo examples, the following error is seen in the k8s pod event logs: "0/1 nodes are available: 1 Insufficient intel.com/sriov_net_A".
Please have a look at it and let me know what's going wrong.