
VF passthrough not working inside a VM #127

Closed
pperiyasamy opened this issue May 14, 2019 · 64 comments

@pperiyasamy
Contributor

Hi,

I've created 3 VFs on a 40/10G Ethernet PF (i40e), passed those VFs through into a VM (i.e. a Kubernetes worker node), and am now trying to use sriov-network-device-plugin to manage those VFs for Kubernetes pods.
But when I try to create a pod using the deployments/kernel-net-demo examples, the following error is seen in the pod event logs: "0/1 nodes are available: 1 Insufficient intel.com/sriov_net_A".
Please have a look and let me know what's going wrong.

sriov-network-device-plugin = master
sriov-cni = master
multus-cni  = master
[root@my-centos kernel-net-demo]# kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
local-volume-provisioner-gl8jp     1/1     Running   27         166d
nfs-provisioner-5f8b9959b6-npbtd   1/1     Running   33         166d
testpod1                           0/1     Pending   0          85s
[root@my-centos kernel-net-demo]# kubectl get events --namespace=default
LAST SEEN   TYPE      REASON             KIND   MESSAGE
27m         Warning   FailedScheduling   Pod    0/1 nodes are available: 1 Insufficient intel.com/sriov_net_A.
12m         Warning   FailedScheduling   Pod    0/1 nodes are available: 1 Insufficient intel.com/sriov_net_A.
6s          Warning   FailedScheduling   Pod    0/1 nodes are available: 1 Insufficient intel.com/sriov_net_A.
[root@my-centos kernel-net-demo]# kubectl get nodes
NAME        STATUS   ROLES    AGE    VERSION
my-centos   Ready    master   166d   v1.12.2-1+619f4e6f7f010f
[root@my-centos kernel-net-demo]# kubectl get node my-centos -o json | jq '.status.allocatable'
{
  "cpu": "8",
  "ephemeral-storage": "33802581141",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "4000Mi",
  "intel.com/sriov_net_A": "0",
  "memory": "7939216Ki",
  "pods": "110"
}
[root@my-centos kernel-net-demo]# pwd
/root/cnis/sriov-network-device-plugin/deployments/kernel-net-demo
[root@my-centos kernel-net-demo]# ls
config.json  config-node1.yaml  crdnetwork.yaml  pod-tc1.yaml  pod-tc2.yaml  pod-tc3.yaml  pod-tc4.yaml  pod.yaml  sriov-net-a.yaml  sriov-net-b.yaml
[root@my-centos kernel-net-demo]# cat /etc/pcidp/config.json
{
    "resourceList":
    [
        {
            "resourceName": "sriov_net_A",
            "rootDevices": ["0000:05:01.0", "0000:05:02.0", "0000:05:03.0"],
            "sriovMode": false,
            "deviceType": "netdevice"
        }
    ]
}
[root@my-centos kernel-net-demo]# dpdk-devbind.py -s

Network devices using DPDK-compatible driver
============================================
<none>

Network devices using kernel driver
===================================
0000:00:10.0 'Virtio network device 1000' if=eth0 drv=virtio-pci unused= *Active*
0000:00:11.0 'Virtio network device 1000' if=eth1 drv=virtio-pci unused=
0000:05:01.0 'Ethernet Virtual Function 700 Series 154c' if=enp5s1 drv=i40evf unused=
0000:05:02.0 'Ethernet Virtual Function 700 Series 154c' if=enp5s2 drv=i40evf unused=
0000:05:03.0 'Ethernet Virtual Function 700 Series 154c' if=enp5s3 drv=i40evf unused=

Other Network devices
=====================
<none>

Crypto devices using DPDK-compatible driver
===========================================
<none>

Crypto devices using kernel driver
==================================
<none>

Other Crypto devices
====================
<none>

Eventdev devices using DPDK-compatible driver
=============================================
<none>

Eventdev devices using kernel driver
====================================
<none>

Other Eventdev devices
======================
<none>

Mempool devices using DPDK-compatible driver
============================================
<none>

Mempool devices using kernel driver
===================================
<none>

Other Mempool devices
=====================
<none>
[root@my-centos kernel-net-demo]# kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
local-volume-provisioner-gl8jp     1/1     Running   27         166d
nfs-provisioner-5f8b9959b6-npbtd   1/1     Running   33         166d
testpod1                           0/1     Pending   0          18m
[root@my-centos kernel-net-demo]# kubectl describe pod testpod1
Name:               testpod1
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             env=test
Annotations:        k8s.v1.cni.cncf.io/networks: sriov-net-a, sriov-net-a
Status:             Pending
IP:
Containers:
  appcntr1:
    Image:      repo-pmd:v2
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      -c
      --
    Args:
      while true; do sleep 300000; done;
    Limits:
      intel.com/sriov_net_A:  2
    Requests:
      intel.com/sriov_net_A:  2
    Environment:              <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-slj7q (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-slj7q:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-slj7q
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  3m20s (x93 over 18m)  default-scheduler  0/1 nodes are available: 1 Insufficient intel.com/sriov_net_A.
[root@my-centos kernel-net-demo]# kubectl get events --namespace=kube-system
LAST SEEN   TYPE      REASON             KIND        MESSAGE
33m         Normal    Pulled             Pod         Container image "armdocker.rnd.ericsson.se/proj_kds/erikube/coredns:1.2.2" already present on machine
3m44s       Warning   BackOff            Pod         Back-off restarting failed container
18m         Normal    Pulled             Pod         Container image "armdocker.rnd.ericsson.se/proj_kds/erikube/coredns:1.2.2" already present on machine
3m54s       Warning   BackOff            Pod         Back-off restarting failed container
50m         Warning   BackOff            Pod         Back-off restarting failed container
42m         Normal    Scheduled          Pod         Successfully assigned kube-system/kube-sriov-device-plugin-amd64-xxrjp to my-centos
42m         Normal    Pulling            Pod         pulling image "nfvpe/sriov-device-plugin:latest"
42m         Normal    Pulled             Pod         Successfully pulled image "nfvpe/sriov-device-plugin:latest"
42m         Normal    Created            Pod         Created container
42m         Normal    Started            Pod         Started container
42m         Normal    SuccessfulCreate   DaemonSet   Created pod: kube-sriov-device-plugin-amd64-xxrjp
[root@my-centos kernel-net-demo]# kubectl get pods --namespace=kube-system
NAME                                       READY   STATUS             RESTARTS   AGE
calico-kube-controllers-69dbbb4457-nlnr8   1/1     Running            34         166d
calico-node-q26w4                          2/2     Running            69         166d
coredns-6cbf6d9bfb-5bgqf                   0/1     CrashLoopBackOff   3620       166d
coredns-6cbf6d9bfb-z98tj                   0/1     CrashLoopBackOff   3680       166d
dex-686486d96-pjt8x                        1/1     Running            36         166d
dex-686486d96-ts5f5                        1/1     Running            33         166d
kube-apiserver-my-centos                   1/1     Running            34         166d
kube-controller-manager-my-centos          1/1     Running            39         166d
kube-multus-ds-amd64-jc6qx                 1/1     Running            34         166d
kube-proxy-gq675                           1/1     Running            34         166d
kube-scheduler-my-centos                   1/1     Running            38         166d
kube-sriov-device-plugin-amd64-xxrjp       1/1     Running            0          42m
kubernetes-dashboard-c6c4dc898-pld2p       2/2     Running            68         166d
metrics-server-d4cd445b-hl679              1/1     Running            36         166d
node-feature-discovery-jtmxg               1/1     Running            36         166d
tiller-deploy-55bc9b8f75-9wwbc             1/1     Running            36         166d
[root@my-centos kernel-net-demo]#
@ahalimx86
Collaborator

@pperiyasamy Can you provide the net-attach-def CR for sriov-net-a?

@zshi-redhat
Collaborator

It looks like the sriov-network-device-plugin is not advertising the correct number of devices to the kubelet; could you also paste the logs of kube-sriov-device-plugin-amd64-xxrjp?
By the way, running the SR-IOV CNI inside a VM may fail because it has no access to the PF from inside the VM.

@pperiyasamy
Contributor Author

@ahalim-intel @zshi-redhat Thanks for the prompt response.

Here is the requested info:

[root@my-centos kernel-net-demo]# kubectl -n kube-system logs kube-sriov-device-plugin-amd64-xxrjp
I0514 08:09:51.960557      13 main.go:44] resource manager reading configs
I0514 08:09:51.961579      13 main.go:59] Initializing resource servers
I0514 08:09:51.963078      13 factory.go:53] Resource pool type: netdevice
I0514 08:09:51.963133      13 server.go:155] initializing sriov_net_A device pool
I0514 08:09:51.963157      13 pool_stub.go:56] Discovering devices with config: &{ResourceName:sriov_net_A RootDevices:[0000:05:01.0 0000:05:02.0 0000:05:03.0] DeviceType:netdevice SriovMode:false}
I0514 08:09:51.963183      13 pool_stub.go:96] Discovered Devices: map[0000:05:01.0:&Device{ID:0000:05:01.0,Health:Healthy,} 0000:05:02.0:&Device{ID:0000:05:02.0,Health:Healthy,} 0000:05:03.0:&Device{ID:0000:05:03.0,Health:Healthy,}]
I0514 08:09:51.963256      13 main.go:65] Starting all servers...
I0514 08:09:51.963294      13 server.go:166] starting sriov_net_A device plugin endpoint at: sriov_net_A.sock
number of config: 1
Resource name: &{ResourceName:sriov_net_A RootDevices:[0000:05:01.0 0000:05:02.0 0000:05:03.0] DeviceType:netdevice SriovMode:false}
I0514 08:09:51.964540      13 server.go:190] sriov_net_A device plugin endpoint started serving
I0514 08:09:51.966573      13 server.go:80] sriov_net_A device plugin registered with Kubelet
I0514 08:09:51.966619      13 main.go:70] All servers started.
I0514 08:09:51.966635      13 main.go:71] Listening for term signals
I0514 08:09:51.969823      13 server.go:101] ListAndWatch(sriov_net_A) invoked
I0514 08:09:51.969840      13 server.go:109] ListAndWatch(sriov_net_A): send devices &ListAndWatchResponse{Devices:[&Device{ID:0000:05:02.0,Health:Unhealthy,} &Device{ID:0000:05:03.0,Health:Unhealthy,} &Device{ID:0000:05:01.0,Health:Unhealthy,}],}
I0514 08:09:51.969961      13 server.go:126] ListAndWatch(sriov_net_A): device health changed!
I0514 08:09:51.969979      13 server.go:132] ListAndWatch(sriov_net_A): send updated devices &ListAndWatchResponse{Devices:[&Device{ID:0000:05:02.0,Health:Unhealthy,} &Device{ID:0000:05:03.0,Health:Unhealthy,} &Device{ID:0000:05:01.0,Health:Unhealthy,}],}

[root@my-centos kernel-net-demo]# kubectl get net-attach-def sriov-net-a -o yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/sriov_net_A
  creationTimestamp: 2019-05-14T08:31:25Z
  generation: 1
  name: sriov-net-a
  namespace: default
  resourceVersion: "1037653"
  selfLink: /apis/k8s.cni.cncf.io/v1/namespaces/default/network-attachment-definitions/sriov-net-a
  uid: a9a34d94-7622-11e9-884e-525400dab310
spec:
  config: '{ "type": "sriov", "vlan": 1000, "ipam": { "type": "host-local", "subnet":
    "10.56.217.0/24", "rangeStart": "10.56.217.171", "rangeEnd": "10.56.217.181",
    "routes": [{ "dst": "0.0.0.0/0" }], "gateway": "10.56.217.1" } }'

@ahalimx86
Collaborator

I0514 08:09:51.969840 13 server.go:109] ListAndWatch(sriov_net_A): send devices &ListAndWatchResponse{Devices:[&Device{ID:0000:05:02.0,Health:Unhealthy,} &Device{ID:0000:05:03.0,Health:Unhealthy,} &Device{ID:0000:05:01.0,Health:Unhealthy,}],}
I0514 08:09:51.969961 13 server.go:126] ListAndWatch(sriov_net_A): device health changed!
I0514 08:09:51.969979 13 server.go:132] ListAndWatch(sriov_net_A): send updated devices &ListAndWatchResponse{Devices:[&Device{ID:0000:05:02.0,Health:Unhealthy,} &Device{ID:0000:05:03.0,Health:Unhealthy,} &Device{ID:0000:05:01.0,Health:Unhealthy,}],}

Your VFs are marked as unhealthy. That's why no VFs are available to be allocated. Possible cause: is the associated PF down?
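For reference, a minimal sketch of checking and bringing up the VF links inside the VM with standard iproute2 commands (the interface names are the ones listed by dpdk-devbind.py above):

# Show the admin/oper state of the interfaces inside the VM
ip -br link show

# Bring a VF netdev administratively up so the plugin can report it as Healthy
ip link set dev enp5s1 up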

@pperiyasamy
Contributor Author

Hi @ahalim-intel

You are correct. Those VFs were in admin-down state; I brought them up and they are reported Healthy again.
Now when the pod is created, the following error is seen:

Warning FailedCreatePodSandBox 46s kubelet, my-centos Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "cad5daa186d34352ace48c49ea87f30e09ce65499063c9602206ef93e7bb2f31" network for pod "testpod1": NetworkPlugin cni failed to set up pod "testpod1_default" network: Multus: Err adding pod to network "sriov-net-a": Multus: error in invoke Delegate add - "sriov": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: "lstat /sys/bus/pci/devices/0000:05:01.0/physfn/net: no such file or directory", failed to clean up sandbox container "cad5daa186d34352ace48c49ea87f30e09ce65499063c9602206ef93e7bb2f31" network for pod "testpod1": NetworkPlugin cni failed to teardown pod "testpod1_default" network: Multus: error in invoke Delegate del - "sriov": error reading cached NetConf in /var/lib/cni/sriov with name cad5daa186d34352ace48c49ea87f30e09ce65499063c9602206ef93e7bb2f31-net2 / Multus: error in invoke Delegate del - "sriov": error reading cached NetConf in /var/lib/cni/sriov with name cad5daa186d34352ace48c49ea87f30e09ce65499063c9602206ef93e7bb2f31-net1]

I removed physfn from the path at https://github.com/intel/sriov-cni/blob/master/pkg/utils/utils.go#L77, but now it's looking for the sriov_numvfs file, as shown below:

Events:
  Type     Reason                  Age               From                Message
  ----     ------                  ----              ----                -------
  Normal   Scheduled               23s               default-scheduler   Successfully assigned default/testpod1 to my-centos
  Warning  FailedCreatePodSandBox  16s               kubelet, my-centos  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "d67316011b3e3c61f7171c2e0907bf10d6c32730aab9b21df0ea114d3e331e4b" network for pod "testpod1": NetworkPlugin cni failed to set up pod "testpod1_default" network: Multus: Err adding pod to network "sriov-net-a": Multus: error in invoke Delegate add - "sriov": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: "failed to open the sriov_numfs of device \"enp5s1\": lstat /sys/class/net/enp5s1/device/sriov_numvfs: no such file or directory", failed to clean up sandbox container "d67316011b3e3c61f7171c2e0907bf10d6c32730aab9b21df0ea114d3e331e4b" network for pod "testpod1": NetworkPlugin cni failed to teardown pod "testpod1_default" network: Multus: error in invoke Delegate del - "sriov": error reading cached NetConf in /var/lib/cni/sriov with name d67316011b3e3c61f7171c2e0907bf10d6c32730aab9b21df0ea114d3e331e4b-net2 / Multus: error in invoke Delegate del - "sriov": error reading cached NetConf in /var/lib/cni/sriov with name d67316011b3e3c61f7171c2e0907bf10d6c32730aab9b21df0ea114d3e331e4b-net1]

I just had the "sriovMode": false option in config.json. Shouldn't that be sufficient for binding VFs to pods?

@ahalimx86
Collaborator

@pperiyasamy As @zshi-redhat mentioned, sriov-cni won't be able to run inside a VM. It needs access to all PF resources on the host to be able to set up the VF in the pod.

@pperiyasamy
Contributor Author

@ahalim-intel @zshi-redhat Does that mean rootDevices in config.json should always contain only PF devices and not VFs?
VFs don't have a sriov_numvfs file in their device directory, which causes the error above (/sys/class/net/enp5s1/device/sriov_numvfs: no such file or directory).

@ahalimx86
Collaborator

I removed physfn from the path at https://github.com/intel/sriov-cni/blob/master/pkg/utils/utils.go#L77, but now it's looking for the sriov_numvfs file, as shown below:

It won't work like that. To set up a VF in a pod, PF information (and some other info) is needed. You cannot simply remove a line and expect everything to work; if sriov-cni fails to get any of the required information for any reason, it results in an error. We do not support this mode of setup yet. The SR-IOV networking solution is for bare-metal clusters only.

@pperiyasamy
Contributor Author

@ahalim-intel

I've seen the following comment at https://github.com/intel/sriov-network-device-plugin#assumptions:

If "sriovMode": true is given for a resource config then the plugin will look for virtual functions (VFs) for all the devices listed in "rootDevices" and export the discovered VFs as the allocatable extended resource list. Otherwise, the plugin will export the root devices themselves as the allocatable extended resources.

So I just configured the VFs in rootDevices and tried to attach them to the pod.

@zshi-redhat
Collaborator

@ahalim-intel

I've seen the following comment at https://github.com/intel/sriov-network-device-plugin#assumptions:

If "sriovMode": true is given for a resource config then the plugin will look for virtual functions (VFs) for all the devices listed in "rootDevices" and export the discovered VFs as the allocatable extended resource list. Otherwise, the plugin will export the root devices themselves as the allocatable extended resources.

I'd say this statement is still true: the SR-IOV network device plugin can report VFs as a resource to the kubelet; it's just the SR-IOV CNI that cannot configure the VF properly without access to the PF.
For the VM case, one possible solution is to use the host-device CNI if you don't need to configure VF attributes (VLAN, MAC address, etc.); it allows you to configure basic IPAM on the VF and move the VF into the pod network namespace.
To use host-device, two patches need to be applied, in Multus and host-device respectively, as they are not yet merged:
host-device: containernetworking/plugins#300
Multus: k8snetworkplumbingwg/multus-cni#307


@pperiyasamy
Contributor Author

@zshi-redhat

But it's looking for the sriov_numvfs file in the VF's device directory (per the error below), not in the PF's device directory. Even on bare metal, there is no sriov_numvfs file for VFs. Isn't that a bug in the SR-IOV CNI plugin?

Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "277056a4567f034d104bb1e1f5f07371da03f532cb99f06622e3cf5e9d8a7d58" network for pod "testpod1": NetworkPlugin cni failed to set up pod "testpod1_default" network: Multus: Err adding pod to network "sriov-net-a": Multus: error in invoke Delegate add - "sriov": SRIOV-CNI failed to load netconf: LoadConf(): failed to get VF information: "failed to open the sriov_numfs of device \"enp5s2\": lstat /sys/class/net/enp5s2/device/sriov_numvfs: no such file or directory"

The enp5s2 interface belongs to VF device 0000:05:02.0 which is configured in the rootDevices in config.json.

[root@my-centos kernel-net-demo]# cat /etc/pcidp/config.json
{
    "resourceList":
    [
        {
            "resourceName": "sriov_net_A",
            "rootDevices": ["0000:05:01.0", "0000:05:02.0", "0000:05:03.0"],
            "sriovMode": false,
            "deviceType": "netdevice"
        }
    ]
}

The same issue exists even after applying the below patches.

host-device: containernetworking/plugins#300
Multus: k8snetworkplumbingwg/multus-cni#307

@zshi-redhat
Collaborator

@zshi-redhat

But it's looking for the sriov_numvfs file in the VF's device directory (per the error below), not in the PF's device directory. Even on bare metal, there is no sriov_numvfs file for VFs. Isn't that a bug in the SR-IOV CNI plugin?

When using the host-device CNI plugin, we don't need sriov-cni any more; that is, when creating the net-attach-def, replace the CNI type sriov with host-device, for example:

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/sriov
spec:
  config: '{
  "type": "host-device",
  "name": "host-device-network",
  "ipam": {
    "type": "host-local",
    "subnet": "10.56.217.0/24",
    "routes": [{
      "dst": "0.0.0.0/0"
    }],
    "gateway": "10.56.217.1"
  }
}'

Then the host-device CNI will be called during pod creation, and it will not read sriov_numvfs.
Also make sure the host-device CNI plugin binary is copied to /opt/cni/bin/.
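For example, a rough sketch of building the patched host-device plugin and installing it on a worker node (the build script name and the patch step are assumptions; adjust to your checkout):

# Fetch the CNI plugins source and apply the pending host-device change (PR #300)
git clone https://github.com/containernetworking/plugins.git
cd plugins
# ... check out or apply the PR #300 changes here ...

# Build the plugins and copy host-device into the CNI binary directory
./build_linux.sh
cp bin/host-device /opt/cni/bin/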


@pperiyasamy
Contributor Author

Thanks @zshi-redhat, it works now. So both the host-device CNI and the SR-IOV network device plugin can be used to manage a pool of VFs for net-device passthrough into a k8s pod. I hope k8s.v1.cni.cncf.io/resourceName is local to the k8s worker node, even though it is attached to the NetworkAttachmentDefinition, which is a kind of global configuration for the entire k8s cluster.

@zshi-redhat
Collaborator

zshi-redhat commented May 16, 2019

Thanks @zshi-redhat, it works now. So both the host-device CNI and the SR-IOV network device plugin can be used to manage a pool of VFs for net-device passthrough into a k8s pod.

Glad to know it works!

I hope k8s.v1.cni.cncf.io/resourceName is local to the k8s worker node, even though it is attached to the NetworkAttachmentDefinition, which is a kind of global configuration for the entire k8s cluster.

Yes, it is. Multus queries each net-attach-def CR associated with the pod and checks whether it has a k8s.v1.cni.cncf.io/resourceName annotation. This is how SR-IOV works today.

May I ask what the network topology of your test environment is? How do you run the VMs? Can you describe your use case a bit, if possible?

@pperiyasamy
Contributor Author

May I ask what the network topology of your test environment is? How do you run the VMs? Can you describe your use case a bit, if possible?

The topology is to run k8s pods inside the (tenant) VMs of an OpenStack image-based deployment, so Kubernetes uses OpenStack's dataplane (for example, OpenDaylight with Open vSwitch) as the underlay for pod-to-pod communication. That is the overall use case.
We run Telco CFs inside the pods, and they have certain performance requirements, so we need SR-IOV VF passthrough to the k8s pods either as a net device or as a DPDK device.
Per the results above, we can achieve net-device passthrough with Multus, the host-device CNI plugin and sriov-network-device-plugin.
But it seems DPDK passthrough is not feasible with Multus, the SR-IOV CNI plugin and sriov-network-device-plugin. We can do it with the SR-IOV CNI plugin with some tweaks in the code, but it can only attach one device per net-attach-def CR, whereas we wanted to use sriov-network-device-plugin to manage pools, as with the net-device passthrough above. We have also discussed this issue at k8snetworkplumbingwg/sriov-cni#23. Do you have plans to support this use case too?

@zshi-redhat
Collaborator

May I ask what the network topology of your test environment is? How do you run the VMs? Can you describe your use case a bit, if possible?

The topology is to run k8s pods inside the (tenant) VMs of an OpenStack image-based deployment, so Kubernetes uses OpenStack's dataplane (for example, OpenDaylight with Open vSwitch) as the underlay for pod-to-pod communication. That is the overall use case.
We run Telco CFs inside the pods, and they have certain performance requirements, so we need SR-IOV VF passthrough to the k8s pods either as a net device or as a DPDK device.
Per the results above, we can achieve net-device passthrough with Multus, the host-device CNI plugin and sriov-network-device-plugin.
But it seems DPDK passthrough is not feasible with Multus, the SR-IOV CNI plugin and sriov-network-device-plugin. We can do it with the SR-IOV CNI plugin with some tweaks in the code, but it can only attach one device per net-attach-def CR, whereas we wanted to use sriov-network-device-plugin to manage pools, as with the net-device passthrough above. We have also discussed this issue at intel/sriov-cni#23. Do you have plans to support this use case too?

DPDK passthrough of VF devices (e.g. VFs bound to vfio-pci) is already supported with sriov-network-device-plugin: all you need to do is bind the VF to vfio-pci and use sriov-network-device-plugin to advertise the vfio-pci devices; a pod can then request such a device in its spec. In this case sriov-cni and Multus are not necessary, because you don't need an IP address for the DPDK interface, and there is no need to configure VF properties in the VM on OpenStack (OpenStack already takes care of the VF configuration).
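For example, a minimal sketch of that binding step inside the VM, using the dpdk-devbind.py tool shown earlier (the PCI addresses are the VFs from this setup; run this before the device plugin discovers the devices):

# Load vfio-pci (a VM without a vIOMMU may also need
# /sys/module/vfio/parameters/enable_unsafe_noiommu_mode set to Y)
modprobe vfio-pci

# Rebind one of the VFs from the i40evf kernel driver to vfio-pci
dpdk-devbind.py --bind=vfio-pci 0000:05:01.0

# Verify which driver each device now uses
dpdk-devbind.py -s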

@pperiyasamy
Contributor Author

pperiyasamy commented May 16, 2019

DPDK passthrough of VF devices (e.g. VFs bound to vfio-pci) is already supported with sriov-network-device-plugin: all you need to do is bind the VF to vfio-pci and use sriov-network-device-plugin to advertise the vfio-pci devices; a pod can then request such a device in its spec. In this case sriov-cni and Multus are not necessary, because you don't need an IP address for the DPDK interface, and there is no need to configure VF properties in the VM on OpenStack (OpenStack already takes care of the VF configuration).

That's great. I just tried this option with the below configuration and dpdk passthrough works fine.

[root@my-centos kernel-net-demo]# cat /etc/pcidp/config.json
{
    "resourceList":
    [
        {
            "resourceName": "sriov_net_A",
            "rootDevices": ["0000:05:01.0", "0000:05:02.0", "0000:05:03.0"],
            "sriovMode": false,
            "deviceType": "uio"
        }
    ]
}
[root@my-centos kernel-net-demo]# cat pod1.yml
apiVersion: v1
kind: Pod
metadata:
  name: testpmd-1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/sriov_net_A
spec:
  containers:
  - name: testpmd-1
    image: repo-pmd:v2
    imagePullPolicy: Never
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
    securityContext:
      allowPrivilegeEscalation: true
      privileged: true
      runAsUser: 0
    resources:
      limits:
        hugepages-2Mi: 1200Mi
        memory: 1200Mi
        intel.com/sriov_net_A: '1'
      requests:
        memory: 1200Mi
        intel.com/sriov_net_A: '1'
    volumeMounts:
    - mountPath: /usr/src/dpdk
      name: dpdk
    - mountPath: /lib/modules
      name: modules
    - mountPath: /sys/kernel/mm/hugepages
      name: kernelhp
    - mountPath: /run:shared
      name: run
    - mountPath: /sys
      name: node
    - mountPath: /usr/src/kernels
      name: linuxheaders
    - mountPath: /usr/lib64
      name: libnuma
    - mountPath: /dev/hugepages
      name: hpmount
  volumes:
  - name: dpdk
    hostPath:
     path: /usr/src/dpdk
     type: Directory
  - name: modules
    hostPath:
     path: /lib/modules
     type: Directory
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: kernelhp
    hostPath:
     path: /sys/kernel/mm/hugepages
     type: Directory
  - name: run
    hostPath:
     path: /run
     type: Directory
  - name: node
    hostPath:
     path: /sys
     type: Directory
  - name: linuxheaders
    hostPath:
     path: /usr/src/kernels
     type: Directory
  - name: libnuma
    hostPath:
     path: /usr/lib64
     type: Directory
  - name: hpmount
    hostPath:
     path: /dev/hugepages
     type: Directory

Now the question is how to bind/unbind the vfio-pci/igb_uio drivers on the fly during the pod's lifecycle. Shouldn't that be part of the CNI plugin? Should we enhance the host-device plugin for it?
I also want to know how we can achieve the same device pool being used both as net devices and as DPDK devices across pods.

@zshi-redhat
Collaborator

For dpdk passthrough of VF devices (e.g VF bind to vfio-pci), we already support it with sriov-network-device-plugin, all you need to do is to bind VF to vfio-pci and use sriov-network-device-plugin to advertise the vfio-pci devices. then a pod can request such device in its spec. In this case, sriov-cni or multus are not necessary because you don't need an ip address for the dpdk interface, and there is no need to configure VF property in VM on openstack (openstack already takes care of the VF configuration).

That's great. I just tried this option with the below configuration and dpdk passthrough works fine.


Now the question is how to bind/unbind the vfio-pci/igb_uio drivers on the fly during the pod's lifecycle. Shouldn't that be part of the CNI plugin? Should we enhance the host-device plugin for it?

Bind/unbind should happen before the SR-IOV device plugin is launched, so that it can discover the vfio/uio devices first and then pass the necessary container runtime configuration (/dev/vfio, etc.) to the pod via the device plugin API. It's too late for a CNI to bind the device and change the container runtime configuration by the time it is invoked during pod creation. The SR-IOV device plugin doesn't support bind/unbind; it should be done by another system configuration tool (for example, in OpenShift we'll use an operator to bind/unbind vfio or uio devices).
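To illustrate that ordering, a sketch (the daemonset pod name is the one from this cluster; deleting it simply makes the daemonset restart the plugin so it rediscovers the devices):

# 1. Bind the VFs to the userspace driver first
dpdk-devbind.py --bind=vfio-pci 0000:05:02.0 0000:05:03.0

# 2. Then restart the device plugin so discovery happens with the new driver
#    and the /dev/vfio device specs can be returned in Allocate responses
kubectl -n kube-system delete pod kube-sriov-device-plugin-amd64-xxrjp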

I also want to know how we can achieve the same device pool being used both as net devices and as DPDK devices across pods.

In theory, the device plugin can expose different types of devices (dpdk, net-device) under the same resourceName, but then the user cannot tell which type of device will be allocated to the pod. To request both a net device and a DPDK device in one pod, you need to expose them as different resourceNames and request both resourceNames in the pod spec.
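For example, a hypothetical pod spec along those lines, where intel.com/sriov_net_A and intel.com/sriov_dpdk_A stand for two separately advertised pools (the second name is a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: testpod-mixed
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-net-a
spec:
  containers:
  - name: appcntr1
    image: repo-pmd:v2
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        intel.com/sriov_net_A: '1'
        intel.com/sriov_dpdk_A: '1'
      limits:
        intel.com/sriov_net_A: '1'
        intel.com/sriov_dpdk_A: '1'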

@pperiyasamy
Contributor Author

pperiyasamy commented May 16, 2019

Bind/unbind should happen before the SR-IOV device plugin is launched, so that it can discover the vfio/uio devices first and then pass the necessary container runtime configuration (/dev/vfio, etc.) to the pod via the device plugin API. It's too late for a CNI to bind the device and change the container runtime configuration by the time it is invoked during pod creation. The SR-IOV device plugin doesn't support bind/unbind; it should be done by another system configuration tool (for example, in OpenShift we'll use an operator to bind/unbind vfio or uio devices).

I saw that the sriov-cni plugin supports bind/unbind of DPDK drivers at runtime, with a configuration like the one below in the net-attach definition:

    "dpdk": {
        "pci_addr": "0000:00:12.0",
        "ifname": "eth2",
        "vfid": 6,
        "kernel_driver":"virtio-pci",
        "dpdk_driver":"igb_uio",
        "dpdk_tool":"/usr/src/dpdk/install/share/dpdk/usertools/dpdk-devbind.py"
    }

Are you saying this is not possible when we use both the CNI and the device plugin?

In theory, the device plugin can expose different types of devices (dpdk, net-device) under the same resourceName, but then the user cannot tell which type of device will be allocated to the pod. To request both a net device and a DPDK device in one pod, you need to expose them as different resourceNames and request both resourceNames in the pod spec.

This is just to avoid creating unnecessary resourceNames to enable the DPDK driver, and the DPDK device might also go onto the same existing network (example: sriov-net-a). Based on the number of k8s.v1.cni.cncf.io/networks entries in the pod definition, that many net devices would be created inside the pod, and the remaining requested devices would go in as DPDK devices. This gives transparency and better manageability: all the VFs are under one resource name, the administrator doesn't need to distinguish between net and DPDK drivers on the VFs, and everything is abstracted inside the plugins.

@zshi-redhat
Collaborator

zshi-redhat commented May 16, 2019

Bind/unbind should happen before the SR-IOV device plugin is launched, so that it can discover the vfio/uio devices first and then pass the necessary container runtime configuration (/dev/vfio, etc.) to the pod via the device plugin API. It's too late for a CNI to bind the device and change the container runtime configuration by the time it is invoked during pod creation. The SR-IOV device plugin doesn't support bind/unbind; it should be done by another system configuration tool (for example, in OpenShift we'll use an operator to bind/unbind vfio or uio devices).

I saw that the sriov-cni plugin supports bind/unbind of DPDK drivers at runtime, with a configuration like the one below in the net-attach definition:

    "dpdk": {
        "pci_addr": "0000:00:12.0",
        "ifname": "eth2",
        "vfid": 6,
        "kernel_driver":"virtio-pci",
        "dpdk_driver":"igb_uio",
        "dpdk_tool":"/usr/src/dpdk/install/share/dpdk/usertools/dpdk-devbind.py"
    }

Are you saying this is not possible when we use both the CNI and the device plugin?

Yes, going forward we will only support sriov with CNI + Device Plugin mode.

In theory, the device plugin can expose different types of devices (dpdk, net-device) under the same resourceName, but then the user cannot tell which type of device will be allocated to the pod. To request both a net device and a DPDK device in one pod, you need to expose them as different resourceNames and request both resourceNames in the pod spec.

This is just to avoid creating unnecessary resourceNames to enable the DPDK driver, and the DPDK device might also go onto the same existing network (example: sriov-net-a). Based on the number of k8s.v1.cni.cncf.io/networks entries in the pod definition, that many net devices would be created inside the pod, and the remaining requested devices would go in as DPDK devices. This gives transparency and better manageability: all the VFs are under one resource name, the administrator doesn't need to distinguish between net and DPDK drivers on the VFs, and everything is abstracted inside the plugins.

We can discuss this.
Currently resourceName is the only way to identify devices with different properties, such as whether a device is a netdevice or a DPDK interface. This is why we are working on a selector-based configuration to allow the device plugin to expose various types of devices with different attributes (vendorID, deviceID, driver, PF name, etc.) as different resourceNames; please take a look at the device_selectors branch for reference code. I think you're asking for exposing devices with the same network connection under the same resourceName, no matter whether it's a dpdk or netdevice interface; then the PF name selector might work, wdyt?
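For illustration, such a selector entry could group every VF of one PF under a single resourceName regardless of its current driver, simply by omitting the drivers selector (a sketch only; the resourceName and PF name are placeholders, and the schema follows the selector-based config shown later in this thread):

{
    "resourceList": [{
        "resourceName": "sriov_vlan_a",
        "selectors": {
            "vendors": ["8086"],
            "devices": ["154c"],
            "pfNames": ["ens3f2"]
        }
    }]
}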

@pperiyasamy
Contributor Author

I think you're asking for exposing devices with the same network connection under the same resourceName, no matter whether it's a dpdk or netdevice interface; then the PF name selector might work, wdyt?

Yes @zshi-redhat, exactly. I'm asking for something similar to the selectors configuration, with the driver selection based on the network and resourceName configuration in the pod. But it looks like we can only go up to two drivers on a resource pool per pod with this design.

@JScheurich

@zshi-redhat: Trying to clarify the intended use case a bit further. As Peri has explained, we run K8s in VMs on OpenStack and need to support applications that require high-performance access to secondary (private) networks in their Pods. These applications use DPDK or kernel drivers on those secondary interfaces.

The envisaged solution is to set up VLAN provider networks in OpenStack and, for each VLAN, add a number of VLAN-tagged SR-IOV VFs to the K8s worker VMs. Inside the K8s workers, all VFs of a given VLAN are equivalent from a connectivity perspective and should be pooled in the sriov device plugin, so there would be one pool per VLAN (spanning across all worker nodes).

An application should be able to request a VF from that pool (VLAN) and at the same time specify whether it should be passed as a DPDK interface or a normal netdev with IPAM. Hence the binding of the driver should happen at the time the VF is passed into the Pod. In essence we would like to decouple the orthogonal aspects of underlying network connectivity and the driver/interface configuration inside the Pod.

Without this we would have to further partition the set of VFs of a PF for each VLAN into two pools preconfigured for DPDK or kernel driver, respectively. As the number of VFs per PF is quite limited, this would further reduce the flexibility of a K8s deployment.

The ideal solution would of course even avoid the pre-configuration of VLANs on the SR-IOV VFs and hot-plug a VLAN tagged VF into the K8s worker VM when requested by a Pod. But that would require a Kuryr-like integration of the CNI with the underlying OpenStack.

We also want to provide an equivalent solution for K8s deployed on bare metal. Ideally the K8s orchestration interface for the applications should not see any difference (i.e. request a VF bound to a certain VLAN and specify whether to pass as DPDK or kernel interface).

@zshi-redhat
Collaborator

@zshi-redhat: Trying to clarify the intended use case a bit further. As Peri has explained, we run K8s in VMs on OpenStack and need to support applications that require high-performance access to secondary (private) networks in their Pods. These applications use DPDK or kernel drivers on those secondary interfaces.

The envisaged solution is to set up VLAN provider networks in OpenStack and, for each VLAN, add a number of VLAN-tagged SR-IOV VFs to the K8s worker VMs. Inside the K8s workers, all VFs of a given VLAN are equivalent from a connectivity perspective and should be pooled in the sriov device plugin, so there would be one pool per VLAN (spanning across all worker nodes).

An application should be able to request a VF from that pool (VLAN) and at the same time specify whether it should be passed as a DPDK interface or a normal netdev with IPAM. Hence the binding of the driver should happen at the time the VF is passed into the Pod. In essence we would like to decouple the orthogonal aspects of underlying network connectivity and the driver/interface configuration inside the Pod.

I understand the request: you don't want to pre-define the type of the VF devices inside the VM, because a VF may be used either as a DPDK or a kernel interface depending on the application running in the pod.
The main reason we didn't do the bind in sriov-cni is that for a VF working in DPDK mode, a few container runtime configurations need to be passed to the pod, including devSpec (e.g. /dev/vfio, /dev/) and any other configs listed in the device plugin AllocateResponse. These options cannot be properly configured within sriov-cni, because it's too late for the CNI to attach the config to the pod, which is already launched by the time the CNI is invoked.

Without this we would have to further partition the set of VFs of a PF for each VLAN into two pools preconfigured for DPDK or kernel driver, respectively. As the number of VFs per PF is quite limited, this would further reduce the flexibility of a K8s deployment.

The ideal solution would of course even avoid the pre-configuration of VLANs on the SR-IOV VFs and hot-plug a VLAN tagged VF into the K8s worker VM when requested by a Pod. But that would require a Kuryr-like integration of the CNI with the underlying OpenStack.

Yes, for VLAN configuration in an OpenStack environment, Kuryr CNI might be a solution; sriov-cni cannot talk to Neutron, and it only supports configuring VLAN/MAC/spoof check/trusted VF etc. on bare metal.

We also want to provide an equivalent solution for K8s deployed on bare metal. Ideally the K8s orchestration interface for the applications should not see any difference (i.e. request a VF bound to a certain VLAN and specify whether to pass as DPDK or kernel interface).

May I ask how the application is going to consume the devices inside the pod? For example, how does the application discover the pod interface?

@pperiyasamy
Contributor Author

The main reason we didn't do the bind in sriov-cni is that for a VF working in DPDK mode, a few container runtime configurations need to be passed to the pod, including devSpec (e.g. /dev/vfio, /dev/) and any other configs listed in the device plugin AllocateResponse. These options cannot be properly configured within sriov-cni, because it's too late for the CNI to attach the config to the pod, which is already launched by the time the CNI is invoked.

Shouldn't that (i.e. returning the required volume mounts in the case of a DPDK interface) be done as part of the device plugin itself, in the AllocateResponse for the Allocate RPC invocation? It looks like uioPool.go currently returns an empty list for volume mounts.

@zshi-redhat
Collaborator

The main reason we didn't do the bind in sriov-cni is that for a VF working in DPDK mode, a few container runtime configurations need to be passed to the pod, including devSpec (e.g. /dev/vfio, /dev/) and any other configs listed in the device plugin AllocateResponse. These options cannot be properly configured within sriov-cni, because it's too late for the CNI to attach the config to the pod, which is already launched by the time the CNI is invoked.

Shouldn't that (i.e. returning the required volume mounts in the case of a DPDK interface) be done as part of the device plugin itself, in the AllocateResponse for the Allocate RPC invocation?

Yes, it should be. That's why bind/unbind is done before SR-IOV Device Plugin gets launched.

It looks like uioPool.go currently returns an empty list for volume mounts.

It mounts the host device (uio, vfio, etc.) via the devSpec config instead of using mounts, which are for general host volumes.

@pperiyasamy
Contributor Author

Yes, it should be. That's why bind/unbind is done before SR-IOV Device Plugin gets launched.

Can't we just make the bind/unbind logic part of the Allocate RPC implementation as well, and mount the host device via devSpec?

It mounts the host device (uio, vfio, etc.) via the devSpec config instead of using mounts, which are for general host volumes.

Ok, thanks for the info.

@zshi-redhat
Collaborator

zshi-redhat commented May 22, 2019

Yes, it should be. That's why bind/unbind is done before SR-IOV Device Plugin gets launched.

Can't we just make the bind/unbind logic part of the Allocate RPC implementation as well, and mount the host device via devSpec?

That's indeed an option; there is even a bind hook in the device plugin, but it is not implemented.
Several thoughts on doing the bind inside the device plugin:

  1. If we expose all the devices as equal kernel VF devices and rely on the device plugin to do the binding during the allocation call, then there needs to be a way for the device plugin to know whether a pod is requesting a DPDK device or a kernel network device. This may imply that the device plugin needs to first know which pod is asking for devices, then query the Kubernetes API server to get the pod spec or its net-attach-def custom resource for the VF dpdk/kernel config. Previously there was a proposal upstream to add pod ID information to the native device plugin API, but it was not accepted. You can also see the effort here to add pod-spec awareness to the device plugin Allocate call, which is also closed.
  2. However, it's possible to get pod information without native device plugin API support, for example via the kubelet checkpoint file or the kubelet pod-resources gRPC service. But the question is how to make the request in the pod spec: shall we use the net-attach-def custom resource to indicate that the network associated with a pod is requesting a DPDK interface, or shall we add another field in the pod annotations to pass the same driver info? And if there are multiple devices requested by a single pod, how do we make sure the devices are bound in the correct order?
  3. The kubelet doesn't inform the device plugin when a pod gets deleted, which means the device plugin would need to monitor pod deletion and unbind the DPDK interface.

It mounts the host device (uio, vfio, etc.) via the devSpec config instead of using mounts, which are for general host volumes.

Ok, thanks for the info.

@pperiyasamy
Contributor Author

Thanks @zshi-redhat for the details. I think a pod would request either a kernel network device or a DPDK device, not both (or multiple devices), because there would be only one application running in the pod, and it is either a DPDK or a kernel application.

Hence, based on the annotations section of the pod definition, the device plugin could decide which driver to bind to the PCI device, like below:

If k8s.v1.cni.cncf.io/networks is specified, choose the kernel net device.
If k8s.v1.cni.cncf.io/resourceName is specified, choose the DPDK device.

3. The kubelet doesn't inform the device plugin when a pod gets deleted, which means the device plugin would need to monitor pod deletion and unbind the DPDK interface.

Ok, thanks for the info. So should we make use of ListAndWatch for this?
It looks like there was a proposal for Deallocate, but it was closed due to inactivity.

@zshi-redhat
Collaborator

zshi-redhat commented May 22, 2019

Thanks @zshi-redhat for the details. I think a pod would request either a kernel network device or a DPDK device, not both (or multiple devices), because there would be only one application running in the pod, and it is either a DPDK or a kernel application.

One application may require multiple interfaces inside the pod, for example if the application is a network router which uses one interface as an uplink and the other for the radio network; there is a live demo talk from Open Infrastructure Summit Denver that shows how this kind of application works (at the 32:10 mark, in the final-thoughts section, it's mentioned that it runs in a Kubernetes cluster in an OpenStack VM and uses Multus + SR-IOV components).

Hence, based on the annotations section of the pod definition, the device plugin could decide which driver to bind to the PCI device, like below:

If k8s.v1.cni.cncf.io/networks is specified, choose the kernel net device.
If k8s.v1.cni.cncf.io/resourceName is specified, choose the DPDK device.

k8s.v1.cni.cncf.io/resourceName is currently used in the net-attach-def CR annotation field to indicate that the network shall be configured on a device from that resourceName. It is now used for both kernel and DPDK interfaces when the network is defined to be configured on VF devices. Multus now inspects the resourceName to decide whether or not to pass device information to sriov-cni.

The k8s.v1.cni.cncf.io/networks annotation comes from the Network Plumbing Working Group definition; it's mainly used to associate a network with a pod interface. A pod DPDK interface may also require such a config (for example, see the sriov-cni issue), so it cannot be used to distinguish between a kernel and a DPDK device.

  1. The kubelet doesn't inform the device plugin when a pod gets deleted, which means the device plugin would need to monitor pod deletion and unbind the DPDK interface.

Ok, thanks for the info. So should we make use of ListAndWatch for this?
It looks like there was a proposal for Deallocate, but it was closed due to inactivity.

There was a proposal called the Resource Class API which I thought might be able to solve the problem, but it requires a lot more upstream work. With the Resource Class API, a user can request a device with detailed properties (key-value pairs supported by the device plugin) just like requesting a device or CPU or memory, and the device plugin could configure the device with a specific driver based on the allocation call, which contains a device ID that maps to a key-value pair such as driver:dpdk.

@pperiyasamy
Contributor Author

One application may require multiple interfaces inside the pod, for example if the application is a network router which uses one interface as an uplink and the other for the radio network; there is a live demo talk from Open Infrastructure Summit Denver that shows how this kind of application works (at the 32:10 mark, in the final-thoughts section, it's mentioned that it runs in a Kubernetes cluster in an OpenStack VM and uses Multus + SR-IOV components).

Okay, I understand. It has to be supported then.

The k8s.v1.cni.cncf.io/networks annotation comes from the Network Plumbing Working Group definition; it's mainly used to associate a network with a pod interface. A pod DPDK interface may also require such a config (for example, see the sriov-cni issue), so it cannot be used to distinguish between a kernel and a DPDK device.

In the case of DPDK binding for device pools, sriov-cni is not used. So let's say the pod definition is like the one below: then attach one interface as a kernel device and the other as a DPDK device (based on the number of networks and the resources requests/limits parameters), assuming the whole pod definition is available to the device plugin.
With this there is no need for a net-attach definition for DPDK interfaces, so the same device pool can be used for both. Is this possible?

apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  labels:
    env: test
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-net-a
spec:
  containers:
  - name: appcntr1
    image: repo-pmd:v2
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        intel.com/sriov_net_A: '2'
      limits:
        intel.com/sriov_net_A: '2'
  restartPolicy: "Never"

@pperiyasamy
Contributor Author

Ok, so you're just showing the dpdk interface, not trying to bind inside pod.

Yes, I just use this command to find out the available DPDK devices in the system. What command do you use to find out which device(s) from the pool are attached inside the pod?

Would you mind replacing igb_uio driver with vfio-pci and try?

Cool, it works with vfio-pci. Is there any known issue with the igb_uio driver?

@zshi-redhat
Collaborator

Ok, so you're just showing the dpdk interface, not trying to bind inside pod.

Yes, I just use this command to find out the available DPDK devices in the system. What command do you use to find out which device(s) from the pool are attached inside the pod?

The device ID information can be found via container environment variables; please refer to here for the naming convention of the environment variable. Also, please let us know if there is other information the DPDK application would like to get within the container.

Would you mind replacing igb_uio driver with vfio-pci and try?

Cool, it works with vfio-pci. Is there any known issue with the igb_uio driver?

I didn't try igb_uio, but looking at the code, there seems to be an issue with igb_uio.
Could you please help check:

  1. Which version of sriov-network-device-plugin is used? Are you using the latest selector-based configuration? Can you paste the config here?
  2. What is the value of /sys/bus/pci/devices/<vf-pci-address>/driver? Is it uio or igb_uio?

@pperiyasamy
Contributor Author

pperiyasamy commented Jun 6, 2019

The device ID information can be found via container environment variables; please refer to here for the naming convention of the environment variable. Also, please let us know if there is other information the DPDK application would like to get within the container.

Good, the device ID is set in the environment variable PCIDEVICE_INTEL_COM_<RESOURCE_NAME>.
Currently this info is sufficient to run a DPDK application on a particular interface.
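For example, a minimal sketch of consuming that variable inside the container (the variable name assumes the intel_sriov_dpdk_device resource from the config below; testpmd's -w PCI whitelist option is illustrative and depends on the DPDK release):

# The plugin injects the allocated PCI address(es); the value may be a
# comma-separated list when more than one device is allocated
echo $PCIDEVICE_INTEL_COM_INTEL_SRIOV_DPDK_DEVICE

# Hand the allocated VF to a DPDK application, e.g. testpmd
testpmd -w $PCIDEVICE_INTEL_COM_INTEL_SRIOV_DPDK_DEVICE -- -i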

which version of sriov-network-device-plugin is used? are you using the latest selector based configuration? can you paste the config here?

Yes, It's with latest master branch and selector based configuration.
Here is the config.json:

{
    "resourceList": [{
            "resourceName": "intel_sriov_netdevice",
            "selectors": {
                "vendors": ["8086"],
                "devices": ["154c"],
                "drivers": ["i40evf"]
            }
        },
        {
            "resourceName": "intel_sriov_dpdk_device",
            "selectors": {
                "vendors": ["8086"],
                "devices": ["154c"],
                "drivers": ["igb_uio"],
                "pfNames": ["ens3f2"]
            }
        }
    ]
}

what is the value of /sys/bus/pci/devices/<vf-pci-address>/driver? is it uio or igb_uio?

Yes, the device is bound to igb_uio driver.

root@dl380-006-ECCD-SUT:/sys/bus/pci/devices/0000:08:06.0# ls -lrth /sys/bus/pci/devices/0000:08:06.0/driver
lrwxrwxrwx 1 root root 0 Jun  6 08:41 /sys/bus/pci/devices/0000:08:06.0/driver -> ../../../../bus/pci/drivers/igb_uio

@zshi-redhat
Collaborator


Ok, I think this might be a bug. The issue is that we don't recognize igb_uio as a provider type: only uio and vfio drivers are matched here, so when a pciNetDevice with an igb_uio driver gets initialized and allocated, the corresponding deviceSpec is not attached to the container.

@ahalim-intel ^^, looks like an issue of using igb_uio type interfaces, did you hit this error before?
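
For reference, on the host you can confirm whether an igb_uio-bound VF actually has its uio device node created, which is what the plugin would need to mount into the container (PCI address taken from the example above):

ls /sys/bus/pci/devices/0000:08:06.0/uio    # uio-based drivers create a uioN entry here
ls -l /dev/uio*                             # the corresponding device file under /dev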

@pperiyasamy
Contributor Author

@zshi-redhat Does the selectors logic still support the host-device plugin too?
I ask because I've tried to create a device pool to be used by a host-device network (on a BM server), but pod bringup fails with the following error.

Warning FailedCreatePodSandBox 35m kubelet, dl380-006-eccd-sut Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "04c2489dde5c11667d0d093a0be70f3386d685bb7699c2363376af117946f408" network for pod "testpod1": NetworkPlugin cni failed to set up pod "testpod1_default" network: Multus: Err adding pod to network "host-device-network": Multus: error in invoke Delegate add - "host-device": specify either "device", "hwaddr" or "kernelpath", failed to clean up sandbox container "04c2489dde5c11667d0d093a0be70f3386d685bb7699c2363376af117946f408" network for pod "testpod1": NetworkPlugin cni failed to teardown pod "testpod1_default" network: Multus: error in invoke Delegate del - "host-device": specify either "device", "hwaddr" or "kernelpath"]

This is with master multus and the host-device plugin (with your fixes); config.json contains the intel_sriov_hostdevice resource, which is used by the host-device network.

{
    "resourceList": [{
            "resourceName": "intel_sriov_netdevice",
            "selectors": {
                "vendors": ["8086"],
                "devices": ["154c"],
                "drivers": ["i40evf"]
            }
        },
        {
            "resourceName": "intel_sriov_dpdk_device",
            "selectors": {
                "vendors": ["8086"],
                "devices": ["154c"],
                "drivers": ["vfio-pci"],
                "pfNames": ["ens3f2"]
            }
        },
        {
            "resourceName": "intel_sriov_hostdevice",
            "selectors": {
                "vendors": ["8086"],
                "devices": ["154c"],
                "pfNames": ["ens3f1"]
            }
        }
    ]
}

@zshi-redhat
Collaborator

zshi-redhat commented Jun 6, 2019

@zshi-redhat Does selectors logic still have support for host-device plugin too ?
Because I've tried to create a device pool to be used by a host-device network (on BM server), but pod bringup fails with the following error.

Warning FailedCreatePodSandBox 35m kubelet, dl380-006-eccd-sut Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "04c2489dde5c11667d0d093a0be70f3386d685bb7699c2363376af117946f408" network for pod "testpod1": NetworkPlugin cni failed to set up pod "testpod1_default" network: Multus: Err adding pod to network "host-device-network": Multus: error in invoke Delegate add - "host-device": specify either "device", "hwaddr" or "kernelpath", failed to clean up sandbox container "04c2489dde5c11667d0d093a0be70f3386d685bb7699c2363376af117946f408" network for pod "testpod1": NetworkPlugin cni failed to teardown pod "testpod1_default" network: Multus: error in invoke Delegate del - "host-device": specify either "device", "hwaddr" or "kernelpath"]

From the log message, it seems you are not using the latest host-device plugin.
With the latest host-device, it should prompt this error (pciBusID is the config field that host-device looks for) if the device ID is not passed to the host-device CNI from Multus.
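
A quick way to check which host-device build is installed on the node (a rough check; /opt/cni/bin is the usual CNI binary directory, adjust if yours differs):

grep -aq pciBusID /opt/cni/bin/host-device \
  && echo "host-device understands pciBusID" \
  || echo "older host-device build"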

This is with master multus and host-device plugin (with your fixes) and config.json contains intel_sriov_hostdevice resource which is being used by host device network.

@pperiyasamy
Contributor Author

From the log message, it seems not using latest host-device plugin.

Oops. I was using the host-device plugin from multus, which doesn't have your fix yet. But this one works fine.

@pperiyasamy
Contributor Author

pperiyasamy commented Jun 7, 2019

Yes, it is possible with assumptions that device plugin can query the pod definition & network-attachment-def custom resource and there is no need to cover the sriov-cni issue here.

Let me bring up this topic (i.e. dynamic driver binding, global resource pool) again.
Can we start exploring how the net-attach-def CRD and pod definition could be enhanced so that the same resource pool can accommodate net, dpdk and other devices?

  • Case-1: PODs running both DPDK application(s) and kernel type application(s) on same network.

  • Case-2: POD running both DPDK application(s) and kernel type application(s) on different network.

Case 1 needs the same net-attach-def CRD object sharing the same resource pool. The device driver type for each interface would come from the pod definition, I guess.
Case 2 needs multiple net-attach-def CRD objects sharing the same resource pool. Here the decision on the device driver type can be derived from some custom attribute of the net-attach-def CRD.
Do you see any attribute in net-attach-def that can be used for this purpose, and likewise in the pod definition for Case 1?

@zshi-redhat
Collaborator

Yes, it is possible with assumptions that device plugin can query the pod definition & network-attachment-def custom resource and there is no need to cover the sriov-cni issue here.

Let me bring up this topic (i.e. dynamic driver binding, global resource pool) again.
Can we start exploring how net-attach-def crd and pod def to be enhanced for having same resource pool to accommodate net, dpdk and other devices.

  • Case-1: PODs running both DPDK application(s) and kernel type application(s) on same network.
  • Case-2: POD running both DPDK application(s) and kernel type application(s) on different network.

Case 1 need same net-attach-def crd object sharing the same resource pool. The device driver type for each interface would come from pod definition, i guess.

The decision of which device to allocate is made by kubelet, and plugins like the device plugin or CNI cannot make suggestions to kubelet on which device to allocate. I guess the question is how the device driver type would be configured in the pod definition and interpreted by kubelet when requesting a device?

Case 2 need multiple net-attach-def crd objects sharing the same resource pool. Here the decision for device driver type can be derived from some custom attribute from net-attach-def crd.

Having multiple net-attach-def objects share the same resource pool is supported; the user can define multiple net-attach-def objects and choose either one of them in the pod spec for a device.
The current SR-IOV CNI plugin can detect whether the interface is in kernel mode or userspace mode and only applies the ipam config to a kernel interface. We used to have a field called dpdk in the SR-IOV CNI config options to indicate that this is a dpdk userspace interface, but it was later removed as it's detectable.
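
For example, two net-attach-def objects can reference the same resource pool through the resourceName annotation; something like the sketch below (object names and the ipam subnet are illustrative), with the pod then selecting sriov-net-kernel or sriov-net-dpdk per interface:

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net-kernel
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_netdevice
spec:
  config: '{ "type": "sriov", "ipam": { "type": "host-local", "subnet": "10.56.217.0/24" } }'
---
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net-dpdk
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_netdevice
spec:
  config: '{ "type": "sriov" }'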

Did you see any attribute can be used from net-attach-def for this purpose and also in pod definition for Case 1 ?

@pperiyasamy
Contributor Author

The decision of which device to allocate is made by kubelet, and plugins like device plugin or CNI cannot make suggestion to kubelet on which device to allocate. I guess the question is how the device driver type be configured in pod definition and be interpreted by kubelet to request a device?

Yes, I think kubelet should pass the pod name (would that need changes in the AllocateRequest API?) to the device plugin so that the device plugin can read the pod definition and bind the appropriate driver at run time. For example, the annotation section in the pod definition would look like the below for a pod having one interface per device type from the same network.

  annotations:
    k8s.v1.cni.cncf.io/net: 1
    k8s.v1.cni.cncf.io/dpdk: 1
    k8s.v1.cni.cncf.io/networks: sriov-net1, sriov-net1

To have multiple net-attach-def objects sharing the same resource pool is supported, user can define multiple net-attach-def objects and choose to use either one of them in pod spec for a device.
In current sr-iov cni plugin, it can detect whether the interface is in kernel mode or userspace mode, only apply ipam config to kernel interface. We used to have a field called dpdk in SR-IOV CNI config options to indicated this is a dpdk userspace interface, but it was later removed as it's detectable.

Okay, that's good to hear. I've tried to associate two networks (net and dpdk) with the same resource, but when the pod is created with the dpdk network, I can see the dpdk devices inside the pod but am not able to run the DPDK application on them. I'm seeing the following error while running the testpmd DPDK application.

root@pod-dpdk:/usr/src/dpdk-stable-17.11.3# testpmd -l 0-1 -w 0000:08:0a.1 --file-prefix i --socket-mem 256,256 -- -i --nb-cores=1 --coremask=0x2 --rxq=1 --txq=1
EAL: Detected 56 lcore(s)
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:08:0a.1 on NUMA socket 0
EAL:   probe driver: 8086:154c net_i40e_vf
EAL:   using IOMMU type 1 (Type 1)
i40evf_dev_init(): Init vf failed
EAL: Releasing pci mapped resource for 0000:08:0a.1
EAL: Calling pci_unmap_resource for 0000:08:0a.1 at 0x7f5ac0000000
EAL: Calling pci_unmap_resource for 0000:08:0a.1 at 0x7f5ac0010000
EAL: Requested device 0000:08:0a.1 cannot be used
EAL: No probed ethernet devices
Interactive-mode selected
USER1: create a new mbuf pool <mbuf_pool_socket_0>: n=155456, size=2176, socket=0
USER1: create a new mbuf pool <mbuf_pool_socket_1>: n=155456, size=2176, socket=1
Done
testpmd> show port info 0
Invalid port 0
The valid ports array is [ ]
root@dl380-006-ECCD-SUT:~/cnis/sriov-network-device-plugin/deployments# kubectl describe net-attach-def sriov-net-b
Name:         sriov-net-b
Namespace:    default
Labels:       <none>
Annotations:  k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_netdevice
API Version:  k8s.cni.cncf.io/v1
Kind:         NetworkAttachmentDefinition
Metadata:
  Creation Timestamp:  2019-06-11T07:25:58Z
  Generation:          1
  Resource Version:    1132974
  Self Link:           /apis/k8s.cni.cncf.io/v1/namespaces/default/network-attachment-definitions/sriov-net-b
  UID:                 28477af0-8c1a-11e9-b174-3cfdfe9eac40
Spec:
  Config:  { "type": "sriov" }
Events:    <none>

root@dl380-006-ECCD-SUT:~/cnis/sriov-network-device-plugin/deployments# kubectl describe pod pod-dpdk
Name:               pod-dpdk
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               dl380-006-eccd-sut/10.85.4.61
Start Time:         Tue, 11 Jun 2019 07:50:28 +0000
Labels:             <none>
Annotations:        k8s.v1.cni.cncf.io/networks: sriov-net-b, sriov-net-b
                    k8s.v1.cni.cncf.io/networks-status:
                      [{
                          "name": "k8s-pod-network",
                          "ips": [
                              "192.168.162.246"
                          ],
                          "default": true,
                          "dns": {}
                      },{
                          "name": "sriov-net-b",
                          "dns": {}
                      },{
                          "name": "sriov-net-b",
                          "dns": {}
                      }]
Status:             Running
IP:                 192.168.162.246
Containers:
  appcntr3:
    Container ID:  docker://5f55654684949b3d2005c9bffaf56406f21c242542ec3120e8068f080c53c204
    Image:         repo-pmd:v3
    Image ID:      docker://sha256:f7618a406bfc22e65692b4bb4369698b14c0ec12d8dc533420087161d68c3728
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      --
    Args:
      while true; do sleep 300000; done;
    State:          Running
      Started:      Tue, 11 Jun 2019 07:50:35 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                              8
      hugepages-1Gi:                    10Gi
      intel.com/intel_sriov_netdevice:  2
      memory:                           100Mi
    Requests:
      cpu:                              8
      hugepages-1Gi:                    10Gi
      intel.com/intel_sriov_netdevice:  2
      memory:                           100Mi
    Environment:                        <none>
    Mounts:
      /dev/hugepages from hugepage (rw)
      /usr/src/dpdk from dpdk (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bf5h4 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  hugepage:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:  HugePages
  dpdk:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/src/dpdk
    HostPathType:  Directory
  default-token-bf5h4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-bf5h4
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From                         Message
  ----    ------     ----  ----                         -------
  Normal  Scheduled  30m   default-scheduler            Successfully assigned default/pod-dpdk to dl380-006-eccd-sut
  Normal  Pulled     30m   kubelet, dl380-006-eccd-sut  Container image "repo-pmd:v3" already present on machine
  Normal  Created    30m   kubelet, dl380-006-eccd-sut  Created container
  Normal  Started    30m   kubelet, dl380-006-eccd-sut  Started container

@pperiyasamy
Contributor Author

I've tried to associate two networks (net and dpdk) with the same resource, but when pod is created with dpdk network, i could see dpdk devices inside the pod, but not able to run the dpdk application on it. I'm seeing the following error while running testpmd dpdk application.

Update: the testpmd application works if it runs on cores 1-2 instead of 0-1, hence it's not an issue with the device plugin.

@zshi-redhat
Collaborator

I've tried to associate two networks (net and dpdk) with the same resource, but when pod is created with dpdk network, i could see dpdk devices inside the pod, but not able to run the dpdk application on it. I'm seeing the following error while running testpmd dpdk application.

Update: testpmd application just works if it runs on cores 1-2 instead of 0-1. hence it's not an issue with device plugin.

May I know how testpmd gets to know how many hugepages can be used and which CPUs to pin within the container?

@pperiyasamy
Contributor Author

May I know how the testpmd gets to know how many hugepages can be used and which cpu to pin within container?

I'm not sure which physical cores are pinned inside the pod. Though we provide cpu: '8' in the requests/limits section, I can see all physical cores inside the pod in /proc/cpuinfo. But testpmd doesn't work with cores 0-1, whereas it worked with cores 1-2; I just found that by trial and error. Is there a way to figure out the pinned cores inside the pod?
There are also two 1 GB files created for the testpmd application under the pod's /dev/hugepages directory.

root@pod-dpdk:/usr/src/dpdk-stable-17.11.3# ls -lrth /dev/hugepages
total 2.0G
-rw------- 1 root root 1.0G Jun 11 12:40 imap_0
-rw------- 1 root root 1.0G Jun 11 12:40 imap_1

Also, at times I can see a net device attached inside the pod (without IP configuration) for the dpdk network. Shouldn't kubelet always pick the device IDs which are bound to the dpdk driver?

@zshi-redhat
Collaborator

May I know how the testpmd gets to know how many hugepages can be used and which cpu to pin within container?

I'm not sure what are the physical cores are pinned inside the pod. Though we provide cpu: '8' in requests/limits section, i can see all physical cores inside the pod with /proc/cpuinfo command. But testpmd doesn't work with cores 0-1 whereas it worked with 1-2 cores. I just did it with trial and error method. is there a way to figure out the pinned cores inside the pod ?

Would you please check /sys/fs/cgroup/cpuset/cpuset.cpus and /sys/fs/cgroup/cpuset/cpuset.cpu_exclusive inside the container and let us know? Does that give any hint on which CPU cores to use? I don't know whether the exclusive CPUs allocated to a container appear with the same CPU numbers as on the host; for example, if you get exclusive CPUs 0,1,2, do they map to host CPUs 0,1,2?
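
For example (pod name taken from the earlier describe output):

# the container's cgroup view of its cpuset
kubectl exec pod-dpdk -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
kubectl exec pod-dpdk -- cat /sys/fs/cgroup/cpuset/cpuset.cpu_exclusive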

There are also two 1 GB files created for the testpmd application under pod's /dev/hugepages directory.

root@pod-dpdk:/usr/src/dpdk-stable-17.11.3# ls -lrth /dev/hugepages
total 2.0G
-rw------- 1 root root 1.0G Jun 11 12:40 imap_0
-rw------- 1 root root 1.0G Jun 11 12:40 imap_1

Also at times, i can see net device attached inside pod (without ip configuration) for dpdk network. shouldn't kubelet always use the device id which are bound with dpdk driver ?

Yes, that's correct: kubelet will randomly choose a device from the resource pool and is not aware of which device is in kernel mode or dpdk mode; but sriov-cni has the ability to detect the given device and configure it no matter which mode is used.

@pperiyasamy
Contributor Author

pperiyasamy commented Jun 12, 2019

Would you please check and update /sys/fs/cgroup/cpuset/cpuset.cpus and /sys/fs/cgroup/cpuset/cpuset.cpu_exclusive inside container? does that give any hint on which cpu core to use?

These files contain 0-55 and 0 respectively, which means CPUs are not allocated exclusively even though CPU requests/limits are specified in the pod definition, isn't it? Is there any reason behind this? Are we expected to update these files manually so that the pod can run on dedicated cores?

because kubelet will randomly choose a device from the resource pool, it's not ware of which device is in kernel mode or dpdk mode; but sriov-cni has the ability to detect the given device and config it no matter which mode is used.

But we might need to choose a dpdk device for a particular pod in order to run a DPDK application on it. This is why we need dynamic binding of the appropriate driver inside the device plugin at pod bring-up time, by reading the pod and net-attach-def definitions.

@pperiyasamy
Contributor Author

These files contain 0-55 and 0 respectively which means CPUs are not allocated exclusively though we provide CPU requests/limits are specified in pod definition. isn't it ? Is there any reason behind this ? Do you want to update these files manually so that pod can be run on the dedicated cores ?

Looks like I don't have the CPUManager feature gate enabled and cpu-manager-policy set to static in my test environment. Let me enable them and get back to you.
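
For reference, this is roughly what that means on the worker kubelet (a sketch from memory; the exact flags and reserved values depend on the kubelet version and setup):

# kubelet flags
--feature-gates=CPUManager=true --cpu-manager-policy=static --kube-reserved=cpu=1,memory=1Gi

# changing the policy requires wiping the old state file before restarting kubelet
rm /var/lib/kubelet/cpu_manager_state
systemctl restart kubelet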

@zshi-redhat
Collaborator

Would you please check and update /sys/fs/cgroup/cpuset/cpuset.cpus and /sys/fs/cgroup/cpuset/cpuset.cpu_exclusive inside container? does that give any hint on which cpu core to use?

These files contain 0-55 and 0 respectively which means CPUs are not allocated exclusively though we provide CPU requests/limits are specified in pod definition. isn't it ? Is there any reason behind this ? Do you want to update these files manually so that pod can be run on the dedicated cores ?

because kubelet will randomly choose a device from the resource pool, it's not ware of which device is in kernel mode or dpdk mode; but sriov-cni has the ability to detect the given device and config it no matter which mode is used.

But we might need to choose dpdk device on a particular pod to run dpdk application on it. This is why we need dynamic binding of appropriate driver inside the device plugin at pod bringup time by reading pod and net-attach-def definition.

I feel this feature might be similar to what you want: the ability to pass a flag (which could be an annotation) indicating the device usage to the device plugin, so that the device plugin can do the binding dynamically.

@zshi-redhat
Collaborator

@pperiyasamy there was a fix for igb_uio driver by @ahalim-intel , can you help to verify if it works for you? Thanks!

@pperiyasamy
Contributor Author

@zshi-redhat @ahalim-intel In fact, I had already tried the same change a few days back, but it didn't work for me; it looks like something more needs to be done. Or did it work for you?

@ahalimx86
Collaborator

The issue with the "igb_uio" driver selector not exporting the 'uio*' device file should be resolved. Not being able to run testpmd is another story, as running any DPDK application requires many other dependencies to be satisfied.

Here's how I've tested the above fix:

Device plugin config map:

{
    "resourceList": [{
            "resourceName": "intel_sriov_netdevice",
            "selectors": {
                "vendors": ["8086"],
                "devices": ["154c", "10ed"],
                "drivers": ["i40evf", "ixgbevf"]
            }
        },
        {
            "resourceName": "intel_sriov_dpdk",
            "selectors": {
                "vendors": ["8086"],
                "devices": ["154c", "10ed"],
                "drivers": ["vfio-pci"]
            }
        },
        {
            "resourceName": "intel_sriov_igbuio",
            "selectors": {
                "vendors": ["8086"],
                "devices": ["154c", "10ed"],
                "drivers": ["igb_uio"]
            }
        }
    ]
}

Net attach CRD:

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-igbuio
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/intel_sriov_igbuio
spec:
  config: '{
    "type": "sriov",
    "name": "sriov-igbuio"
}'

Sample PodSpec:

apiVersion: v1
kind: Pod
metadata:
  name: testpod-igbuio
  labels:
    env: test
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-igbuio
spec:
  containers:
  - name: appcntr1
    image: centos/tools
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        intel.com/intel_sriov_igbuio: '1'
      limits:
        intel.com/intel_sriov_igbuio: '1'
master# kubectl exec -it testpod-igbuio -- bash -c "ls -la /dev/uio*; env"

crw------- 1 root root 237, 2 Jun 25 12:33 /dev/uio2

HOSTNAME=testpod-igbuio
TERM=xterm
KUBERNETES_SERVICE_PORT=443
PCIDEVICE_INTEL_COM_INTEL_SRIOV_IGBUIO=0000:18:0a.2
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/
SHLVL=1
HOME=/root
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_SERVICE_PORT_HTTPS=443
container=docker
_=/usr/bin/env

We can see that the associated /dev/uio2 device file for the allocated VF is mounted in the Pod. So if this VF is whitelisted by the DPDK app and all other constraints are met (enough hugepages, socket-mem and NUMA), this should work the same way you mentioned being able to run it using vfio-pci. One thing to check: running a DPDK app with igb_uio may require a privileged Pod, whereas vfio-pci does not.
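
One of those constraints worth checking is NUMA locality; the allocated VF's NUMA node can be read from sysfs so that --socket-mem and the core list target the right socket (reusing the env var from the output above):

cat /sys/bus/pci/devices/${PCIDEVICE_INTEL_COM_INTEL_SRIOV_IGBUIO}/numa_node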

@pperiyasamy
Contributor Author

Yes @ahalim-intel , I can also see the igb_uio device inside the pod, but testpmd is not working on it (error shown below).

root@pod-dpdk-uio:/usr/src/dpdk-stable-17.11.3# env | grep PCIDEVICE_INTEL
PCIDEVICE_INTEL_COM_INTEL_SRIOV_DPDK_UIO_DEVICE=0000:08:0e.0
root@pod-dpdk-uio:/usr/src/dpdk-stable-17.11.3# testpmd -l 1-2 -w 0000:08:0e.0 --file-prefix i --socket-mem 256,256 -- -i --nb-cores=1 --coremask=0x4 --rxq=1 --txq=1
EAL: Detected 56 lcore(s)
EAL: Some devices want iova as va but pa will be used because.. EAL: few device bound to UIO
EAL: Probing VFIO support...
EAL:   cannot open VFIO container, error 2 (No such file or directory)
EAL: VFIO support could not be initialized
EAL: Cannot obtain physical addresses: No such file or directory. Only vfio will function.
EAL: PCI device 0000:08:0e.0 on NUMA socket 0
EAL:   probe driver: 8086:154c net_i40e_vf
EAL: Requested device 0000:08:0e.0 cannot be used
EAL: No probed ethernet devices
Interactive-mode selected
USER1: create a new mbuf pool <mbuf_pool_socket_0>: n=155456, size=2176, socket=0
USER1: create a new mbuf pool <mbuf_pool_socket_1>: n=155456, size=2176, socket=1
Done
testpmd> show port info 0
Invalid port 0
The valid ports array is [ ]

Here is the pod definition:

apiVersion: v1
kind: Pod
metadata:
  name: pod-dpdk-uio
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-net-uio
spec:
  containers:
  - name: appcntruio3
    image: repo-pmd:v3
    imagePullPolicy: Never
    securityContext:
     capabilities:
       add: ["IPC_LOCK"]
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        memory: 100Mi
        hugepages-1Gi: 10Gi
        cpu: '8'
      limits:
        hugepages-1Gi: 10Gi
        cpu: '8'
        memory: 100Mi
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
      readOnly: False
    - mountPath: /usr/src/dpdk
      name: dpdk
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: dpdk
    hostPath:
     path: /usr/src/dpdk
     type: Directory

Whereas testpmd works fine with the vfio-pci device, as shown below.

root@pod-dpdk:/usr/src/dpdk-stable-17.11.3# env | grep PCIDEVICE_INTEL
PCIDEVICE_INTEL_COM_INTEL_SRIOV_DPDK_VFIO_DEVICE=0000:08:06.0
root@pod-dpdk:/usr/src/dpdk-stable-17.11.3# testpmd -l 1-2 -w 0000:08:06.0 --file-prefix i --socket-mem 256,256 -- -i --nb-cores=1 --coremask=0x4 --rxq=1 --txq=1
EAL: Detected 56 lcore(s)
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:08:06.0 on NUMA socket 0
EAL:   probe driver: 8086:154c net_i40e_vf
EAL:   using IOMMU type 1 (Type 1)
Interactive-mode selected
USER1: create a new mbuf pool <mbuf_pool_socket_0>: n=155456, size=2176, socket=0
USER1: create a new mbuf pool <mbuf_pool_socket_1>: n=155456, size=2176, socket=1

Warning! Cannot handle an odd number of ports with the current port topology. Configuration must be changed to have an even number of ports, or relaunch application with --port-topology=chained

Configuring Port 0 (socket 0)
Port 0: 4E:6F:F2:06:4C:E0
Checking link statuses...
Done
testpmd> show port info 0

********************* Infos for port 0  *********************
MAC address: 4E:6F:F2:06:4C:E0
Driver name: net_i40e_vf
Connect to socket: 0
memory allocation on the socket: 0
Link status: up
Link speed: 10000 Mbps
Link duplex: full-duplex
MTU: 1500
Promiscuous mode: enabled
Allmulticast mode: disabled
Maximum number of MAC addresses: 64
Maximum number of MAC addresses of hash filtering: 0
VLAN offload:
  strip on
  filter on
  qinq(extend) off
Hash key size in bytes: 52
Redirection table size: 64
Supported flow types:
  ipv4-frag
  ipv4-tcp
  ipv4-udp
  ipv4-sctp
  ipv4-other
  ipv6-frag
  ipv6-tcp
  ipv6-udp
  ipv6-sctp
  ipv6-other
  l2_payload
Max possible RX queues: 4
Max possible number of RXDs per queue: 4096
Min possible number of RXDs per queue: 64
RXDs number alignment: 32
Max possible TX queues: 4
Max possible number of TXDs per queue: 4096
Min possible number of TXDs per queue: 64
TXDs number alignment: 32
testpmd>


@ahalimx86
Collaborator

@pperiyasamy
Please see the Pod specs and log below:

master# cat pod-igbuio.yaml
apiVersion: v1
kind: Pod
metadata:
  name: testpod-igbuio
  labels:
    env: test
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-igbuio
spec:
  containers:
  - name: appcntr1
    image: ubuntu-dpdk
    imagePullPolicy: Never
    securityContext:
      capabilities:
        add: ["SYS_ADMIN", "IPC_LOCK"]
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    - mountPath: /sys
      name: sysfs
    resources:
      requests:
        memory: 1Gi
        hugepages-2Mi: 4Gi
        intel.com/intel_sriov_igbuio: '1'
      limits:
        memory: 1Gi
        hugepages-2Mi: 4Gi
        intel.com/intel_sriov_igbuio: '1'
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: sysfs
    hostPath:
      path: /sys


root@testpod-igbuio:/# env
HOSTNAME=testpod-igbuio                                                                                                                                                                    
PCIDEVICE_INTEL_COM_INTEL_SRIOV_IGBUIO=0000:18:0a.2                                                                                                                                        
KUBERNETES_PORT_443_TCP_PROTO=tcp                                                                                                                                                          
KUBERNETES_PORT_443_TCP_ADDR=10.254.0.1                                                                                                                                                    
KUBERNETES_PORT=tcp://10.254.0.1:443                                                                                                                                                       
PWD=/                                                                                                                                                                                      
HOME=/root                                                                                                                                                                                                                                                                                                                               
KUBERNETES_SERVICE_PORT_HTTPS=443                                                                                                                                                          
KUBERNETES_PORT_443_TCP_PORT=443                                                                                                                                                                                                                                                                                                        
KUBERNETES_PORT_443_TCP=tcp://10.254.0.1:443                                                                                                                                               
TERM=xterm                                                                                                                                                                                 
SHLVL=1                                                                                                                                                                                    
KUBERNETES_SERVICE_PORT=443                                                                                                                                                                
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin                                                                                                                          
KUBERNETES_SERVICE_HOST=10.254.0.1                                                                                                                                                         
_=/usr/bin/env 

root@testpod-igbuio:/# testpmd -w 18:0a.2 --socket-mem=2048,2048 -- -i
EAL: Detected 56 lcore(s)                                             
EAL: Some devices want iova as va but pa will be used because.. EAL: few device bound to UIO
EAL: No free hugepages reported in hugepages-1048576kB                                      
EAL: Probing VFIO support...                                                                
EAL:   cannot open VFIO container, error 2 (No such file or directory)                      
EAL: VFIO support could not be initialized 
EAL: PCI device 0000:18:0a.2 on NUMA socket 0                                               
EAL:   probe driver: 8086:154c net_i40e_vf                                                  
Interactive-mode selected                                                                   
USER1: create a new mbuf pool <mbuf_pool_socket_0>: n=587456, size=2176, socket=0           
USER1: create a new mbuf pool <mbuf_pool_socket_1>: n=587456, size=2176, socket=1           

Warning! Cannot handle an odd number of ports with the current port topology. Configuration must be changed to have an even number of ports, or relaunch application with --port-topology=chained                                                                                                                                                                                     

Configuring Port 0 (socket 0)
Port 0: 92:81:19:3A:20:21    
Checking link statuses...    
Done                         
testpmd> quit

Shutting down port 0...
Stopping ports...      
Done                   
Closing ports...       
Done                   

Bye...

Shutting down port 0...
Stopping ports...
Done
Closing ports...
Port 0 is already closed
Done

Bye...

Note that for igb_uio devices, higher privileges are required, and the host /sys volume needs to be mounted with write permission in the container.

@pperiyasamy
Contributor Author

Yes @ahalim-intel, the above pod spec for the igb_uio driver works for the pod running on the VM, but I still saw the same issue with the pod running on BM. I will check it again and let you know.

@leyao-daily

Hi, in my host I bind vfio-pci to the vf:
{
    "resourceList": [
        {
            "resourceName": "intel_sriov",
            "selectors": {
                "vendors": ["8086"],
                "devices": ["37cd"],
                "drivers": ["vfio-pci"],
                "pfNames": ["eno2"]
            }
        }
    ]
}
And the node gets the sriov resources correctly. The network config is:

{
    "type": "sriov",
    "cniVersion": "0.3.1",
    "ipam": {
        "type": "host-local",
        "subnet": "10.56.206.0/24",
        "routes": [ { "dst": "0.0.0.0/0" } ],
        "gateway": "10.56.206.1"
    }
}

And when I create a VM pod, I find that my VF device was treated as a disk, and it reports:

2019-09-25T01:50:37.556993Z qemu-system-x86_64: -drive file=/dev/vfio/vfio,format=raw,if=none,id=drive-scsi0-0-0-1: Could not refresh total sector count: Illegal seek')

@zshi-redhat
Collaborator

Hi, in my host I bind vfio-pci to the vf:
{
"resourceList": [
{
"resourceName": "intel_sriov",
"selectors": {
"vendors": ["8086"],
"devices": ["37cd"],

Is this a correct device ID for a VF? (See the quick check after the quoted config below.)
BTW, I'm going to close this issue as the described problem of enabling VFs inside a VM has been addressed; would you mind opening a new issue?

"drivers": ["vfio-pci"],
"pfNames": ["eno2"]
}
}
]
}
And the node get the sriov resources correctly.
{ "type": "sriov", "cniVersion": "0.3.1", "ipam": { "type": "host-local",
"subnet": "10.56.206.0/24", "routes": [ { "dst": "0.0.0.0/0" } ], "gateway": "10.56.206.1"
} }'
And when i create a vm pod. i find that my vf device was considered as a disk and it report:

2019-09-25T01:50:37.556993Z qemu-system-x86_64: -drive file=/dev/vfio/vfio,format=raw,if=none,id=drive-scsi0-0-0-1: Could not refresh total sector count: Illegal seek')
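
A quick way to double-check the vendor/device IDs that the selectors should match (the PCI address is illustrative; point it at your VF):

lspci -nn -s 0000:18:0a.2    # IDs show up in brackets, e.g. [8086:154c]
cat /sys/bus/pci/devices/0000:18:0a.2/vendor /sys/bus/pci/devices/0000:18:0a.2/device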

@zshi-redhat
Collaborator

Closing this issue as supporting VF in VM has been addressed in PR
