FPGA SRIO-V is not supportted by the FPGA device plugin #372

Closed
xxinran opened this issue Apr 27, 2020 · 12 comments · Fixed by #478
Labels
fpga FPGA device plugin related issue

Comments

xxinran commented Apr 27, 2020

I am trying out the FPGA device plugin in my local environment with an SR-IOV-enabled A10 FPGA. It exposes 1 PF and 1 VF.
The FPGA device plugin pod is always in CrashLoopBackOff status.
The output of k logs intel-fpga-plugin-kgnz7 --namespace kube-system is the following:

FPGA device plugin (OPAE) started in af mode
Device scan failed: intel-fpga-dev.1: AFU without corresponding FME found
main.(*devicePlugin).scanFPGAs
/intel-device-plugins-for-kubernetes/cmd/fpga_plugin/fpga_plugin.go:296
main.(*devicePlugin).Scan
/intel-device-plugins-for-kubernetes/cmd/fpga_plugin/fpga_plugin.go:204
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*Manager).Run.func1
/intel-device-plugins-for-kubernetes/pkg/deviceplugin/manager.go:96
runtime.goexit
/usr/lib/golang/src/runtime/asm_amd64.s:1357

Thanks in advance.

Member

bart0sh commented Apr 27, 2020

Which model of the Arria 10 card do you use? Which kernel driver (DFL or OPAE) do you use? Please show the output of the ls -la /sys/class/fpga command here.

Author

xxinran commented Apr 27, 2020

Which model of the Arria 10 card do you use? Which kernel driver (DFL or OPAE) do you use? Please show the output of the ls -la /sys/class/fpga command here.

Here are my outputs:
$ lspci | grep 9c
04:00.0 Processing accelerators: Intel Corporation Device 09c4
04:00.1 Processing accelerators: Intel Corporation Device 09c5
$ ls -al /sys/class/fpga
total 0
drwxr-xr-x 2 root root 0 Apr 27 14:20 .
drwxr-xr-x 79 root root 0 Apr 23 22:13 ..
lrwxrwxrwx 1 root root 0 Apr 24 03:44 intel-fpga-dev.0 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.0/fpga/intel-fpga-dev.0
lrwxrwxrwx 1 root root 0 Apr 27 14:15 intel-fpga-dev.1 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.1/fpga/intel-fpga-dev.1

$ ls | grep opae
opae-intel-fpga-driver-1.3.0-2
opae-intel-fpga-driver-1.3.0-2.tar.gz
opae-sdk-1.3.0-2

Member

bart0sh commented Apr 27, 2020

@xxinran Looks good so far. The next two things I'd like to see are the outputs of these two commands:
ls -la /sys/class/fpga/intel-fpga-dev.0/ and lsmod | grep fpga. Judging by the plugin error, I suspect that the intel-fpga-fme module is not loaded, but let's see what is actually the case.

Author

xxinran commented Apr 27, 2020

@bart0sh Thanks for your quick reply. Here are my outputs:

$ ls -la /sys/class/fpga/intel-fpga-dev.0/
total 0
drwxr-xr-x 4 root root 0 Apr 27 15:15 .
drwxr-xr-x 3 root root 0 Apr 24 03:43 ..
lrwxrwxrwx 1 root root 0 Apr 24 03:49 device -> ../../../0000:04:00.0
drwxr-xr-x 11 root root 0 Apr 24 03:43 intel-fpga-fme.0
drwxr-xr-x 2 root root 0 Apr 24 03:49 power
lrwxrwxrwx 1 root root 0 Apr 24 03:43 subsystem -> ../../../../../../class/fpga
-rw-r--r-- 1 root root 4096 Apr 24 03:43 uevent

$ ls -la /sys/class/fpga/intel-fpga-dev.1/
total 0
drwxr-xr-x 4 root root 0 Apr 27 15:16 .
drwxr-xr-x 3 root root 0 Apr 27 14:15 ..
lrwxrwxrwx 1 root root 0 Apr 27 14:20 device -> ../../../0000:04:00.1
drwxr-xr-x 4 root root 0 Apr 27 14:15 intel-fpga-port.1
drwxr-xr-x 2 root root 0 Apr 27 15:16 power
lrwxrwxrwx 1 root root 0 Apr 27 14:15 subsystem -> ../../../../../../class/fpga
-rw-r--r-- 1 root root 4096 Apr 27 14:15 uevent

$ lsmod |grep fpga
intel_fpga_afu 36864 0
intel_fpga_fme 61440 0
intel_fpga_pci 32768 2 intel_fpga_fme,intel_fpga_afu
fpga_mgr_mod 16384 1 intel_fpga_fme

Member

bart0sh commented Apr 27, 2020

@xxinran Thanks for the info. This layout looks incorrect from the plugin's point of view. There should be an intel-fpga-fme.1 directory under /sys/class/fpga/intel-fpga-dev.1/; the plugin expects it to be present. Here is an example from my system:

> ls -la /sys/class/fpga/intel-fpga-dev.0/
total 0
drwxr-xr-x  5 root root    0 Apr  7 12:29 .
drwxr-xr-x  3 root root    0 Apr  7 12:29 ..
lrwxrwxrwx  1 root root    0 Apr  8 13:49 device -> ../../../0000:06:00.0
drwxr-xr-x 12 root root    0 Apr  7 12:29 intel-fpga-fme.0
drwxr-xr-x  4 root root    0 Apr  7 12:29 intel-fpga-port.0
drwxr-xr-x  2 root root    0 Apr  8 04:28 power
lrwxrwxrwx  1 root root    0 Apr  7 12:29 subsystem -> ../../../../../../class/fpga
-rw-r--r--  1 root root 4096 Apr  7 12:29 uevent
ed@r720-1:~> ls -la /sys/class/fpga/intel-fpga-dev.1/
total 0
drwxr-xr-x  5 root root    0 Apr  7 12:29 .
drwxr-xr-x  3 root root    0 Apr  7 12:29 ..
lrwxrwxrwx  1 root root    0 Apr  8 13:49 device -> ../../../0000:42:00.0
drwxr-xr-x 12 root root    0 Apr  7 12:29 intel-fpga-fme.1
drwxr-xr-x  4 root root    0 Apr  7 12:29 intel-fpga-port.1
drwxr-xr-x  2 root root    0 Apr  8 04:28 power
lrwxrwxrwx  1 root root    0 Apr  7 12:29 subsystem -> ../../../../../../class/fpga
-rw-r--r--  1 root root 4096 Apr  7 12:29 uevent

I'd suggest investigating why this happens by looking at the dmesg output or asking the OPAE developers. You can also try re-inserting the cards one by one to see whether each of them shows an intel-fpga-fme entry in the /sys/class/fpga/intel-fpga-dev.0/ directory.
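
The check suggested above can be scripted. The following is a minimal sketch, not part of the plugin: the function names first_match and summarize_fpga_devs are hypothetical, and the only assumption about the system is the /sys/class/fpga layout shown in the listings in this thread. It prints, for every intel-fpga-dev.* node, which FME and port entries exist, and flags nodes that have a port but no FME.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the plugin or OPAE): summarize the FME and
# port entries under each intel-fpga-dev.* node of the OPAE sysfs class.

# Print the basename of the first entry in directory $1 matching glob $2,
# or "-" when nothing matches.
first_match() {
    local path
    for path in "$1"/$2; do
        if [ -e "$path" ]; then
            basename "$path"
            return 0
        fi
    done
    echo "-"
}

# Walk ROOT (default: /sys/class/fpga) and print one line per device node.
summarize_fpga_devs() {
    local root="${1:-/sys/class/fpga}"
    local dev fme port note
    for dev in "$root"/intel-fpga-dev.*; do
        [ -d "$dev" ] || continue   # glob matched nothing, or not a directory
        fme=$(first_match "$dev" 'intel-fpga-fme.*')
        port=$(first_match "$dev" 'intel-fpga-port.*')
        note=""
        if [ "$fme" = "-" ] && [ "$port" != "-" ]; then
            note=" (port without FME)"
        fi
        echo "$(basename "$dev"): fme=$fme port=$port$note"
    done
}

# Prints nothing when the root directory does not exist.
summarize_fpga_devs
```

On the layout reported earlier in this thread, intel-fpga-dev.1 would be the node flagged as "port without FME", which is the condition the plugin error describes.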

Author

xxinran commented Apr 28, 2020

@bart0sh It seems that you have 2 physical FPGA cards, not one PF and one VF. Did you enable SR-IOV and virtualize the FPGA into VFs? If I disable SR-IOV, there is only one FME and one port in intel-fpga-dev.0, there is no intel-fpga-dev.1 directory, and the FPGA device plugin reports the device correctly.
But if I enable SR-IOV, /sys/class/fpga has 2 intel-fpga-dev.* entries, as I pasted above. In this case, the plugin shows this error: AFU without corresponding FME found.
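
To tell the two situations apart (two physical cards versus one PF plus one VF), the PCI function behind each class node can be inspected. This is a minimal sketch under stated assumptions: pci_role and report_roles are hypothetical names, and the sketch relies only on the generic Linux PCI sysfs attributes physfn (a symlink present on VFs) and sriov_numvfs (a file present on SR-IOV-capable PFs), plus the device symlink visible in the listings above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: classify the PCI function behind each FPGA class node
# using generic PCI sysfs attributes.

# Print the SR-IOV role of the PCI device directory given as $1.
pci_role() {
    local pcidir="$1"
    if [ -e "$pcidir/physfn" ] || [ -L "$pcidir/physfn" ]; then
        echo "VF"
    elif [ -r "$pcidir/sriov_numvfs" ] && [ "$(cat "$pcidir/sriov_numvfs")" -gt 0 ] 2>/dev/null; then
        echo "PF, SR-IOV enabled"
    else
        echo "not a VF, SR-IOV disabled or unsupported"
    fi
}

# Walk ROOT (default: /sys/class/fpga) and print the role of each node.
report_roles() {
    local root="${1:-/sys/class/fpga}"
    local dev
    for dev in "$root"/intel-fpga-dev.*; do
        [ -d "$dev/device" ] || continue   # skip unmatched glob or odd nodes
        echo "$(basename "$dev"): $(pci_role "$dev/device")"
    done
}

# Prints nothing when the root directory does not exist.
report_roles
```

With two separate cards, both nodes would be expected to report a non-VF role; in the SR-IOV setup described here, one node should report a PF and the other a VF.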

Member

kad commented Apr 28, 2020

@xxinran Can you please show the output of /opt/intel/fpga-sw/fpgatool list and /opt/intel/fpga-sw/fpgatool -d intel-fpga-port.1 fpgainfo on your system?

bart0sh changed the title from "Is FPGA SRIOV supportted in FPGA device plugin?" to "FPGA SRIO-V is not supportted by the FPGA device plugin" on Apr 28, 2020
Member

bart0sh commented Apr 28, 2020

@xxinran The plugin doesn't support SR-IOV setups at the moment. However, this is only because we haven't tested this configuration. We'll work on this issue and report here when it's fixed.

Author

xxinran commented Apr 29, 2020

@bart0sh Ah, got it. Thanks for your explanation :)

Member

kad commented May 5, 2020

@xxinran Can you still reply with the output of fpgatool on your system, so we can validate that at least part of the functionality works in such setups?

Author

xxinran commented May 6, 2020

Sure.

$ sudo /opt/intel/fpga-sw/fpgatool -d intel-fpga-port.1 fpgainfo

//****** PORT ******//
Name                             : intel-fpga-port.1
Device Node                      : /dev/intel-fpga-port.1
SysFS Path                       : /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.1/fpga/intel-fpga-dev.1/intel-fpga-port.1
PCIe s:b:d:f                     : 0000:04:00.1
Physical Function PCIe s:b:d:f   : 0000:04:00.0
Device Id                        : 0x8086:0x09c5
Device Class                     : 0x120000
Local CPUs                       : 0-7,16-23
NUMA                             : 0
FME Name                         : intel-fpga-fme.0
Port Id                          : 0
Interface UUID                   : 9926ab6d6c925a68aabca7d84c545738
Accelerator UUID                 : f7df405cbd7acf7222f144b0b93acd18
Kernet API Version               : 0
Port Regions                     : 2
Port Region (Index/Size/Offset)  : 0 / 262144 / 0
Port Region (Index/Size/Offset)  : 1 / 4096 / 262144
$ /opt/intel/fpga-sw/fpgatool list
Detected FPGA FMEs: 1
intel-fpga-fme.0
Detected FPGA Ports: 1
intel-fpga-port.1
$ ls  /sys/class/fpga
intel-fpga-dev.0  intel-fpga-dev.1
$ ls /sys/class/fpga/intel-fpga-dev.0
device  intel-fpga-fme.0  power  subsystem  uevent

$ ls  /sys/class/fpga/intel-fpga-dev.1
device  intel-fpga-port.1  power  subsystem  uevent

Member

kad commented May 6, 2020

Thanks @xxinran, so only the discovery phase of the plugin will require a fix.
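
The fpgatool output above shows that the port on the VF (0000:04:00.1) is managed by the FME on the PF (0000:04:00.0). As an illustration only, and not the plugin's actual implementation, the pairing could be sketched as below. The function name fme_for_dev is hypothetical; the sketch assumes the generic PCI sysfs physfn symlink on the VF and the fpga/intel-fpga-dev.N directory layout visible in the listings in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: find the FME that manages the class node given as $1.
# Prints the FME name and returns 0, or returns 1 when no FME is found.
fme_for_dev() {
    local dev="$1" path
    # Case 1: the FME lives in the same class node (no SR-IOV).
    for path in "$dev"/intel-fpga-fme.*; do
        if [ -d "$path" ]; then
            basename "$path"
            return 0
        fi
    done
    # Case 2: the node belongs to a VF, so follow the VF -> PF "physfn"
    # link and look for an FME under the PF's fpga/ directory.
    for path in "$dev"/device/physfn/fpga/intel-fpga-dev.*/intel-fpga-fme.*; do
        if [ -d "$path" ]; then
            basename "$path"
            return 0
        fi
    done
    return 1
}

# Example call against the node name from this thread.
fme_for_dev /sys/class/fpga/intel-fpga-dev.1 || echo "no FME found"
```

A discovery routine built this way would treat a port-only node as valid whenever the lookup through the physical function succeeds, instead of reporting an AFU without a corresponding FME.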

msivosuo added the fpga (FPGA device plugin related issue) label on Sep 15, 2020
bart0sh added a commit to bart0sh/intel-device-plugins-for-kubernetes that referenced this issue Oct 26, 2020
Reimplemented discovery of the FPGA devices using
APIs from pkg/fpga/intel_fpga_linux. The APIs are also
used in the fpga_tool utility.

The API is more advanced and supports SR-IOV among other
things.

Fixes: intel#372

Signed-off-by: Ed Bartosh <[email protected]>