Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPU: Add support for the new xe KMD #1670

Merged
merged 6 commits into from
Mar 13, 2024
Merged

Conversation

tkatila
Copy link
Contributor

@tkatila tkatila commented Feb 20, 2024

Separates i915 and xe KMDs via resource name: gpu.intel.com/i915 vs gpu.intel.com/xe. Supports both i915 and xe devices at the same time, but with RM only one type is allowed. Also made some changes to the documentation and NFD rules.

Fixes: #1688

Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few documentation comments/suggestions.

cmd/gpu_plugin/README.md Outdated Show resolved Hide resolved
cmd/gpu_plugin/README.md Show resolved Hide resolved
cmd/gpu_plugin/README.md Outdated Show resolved Hide resolved
cmd/gpu_plugin/gpu_plugin.go Outdated Show resolved Hide resolved
cmd/gpu_plugin/monitoring.md Outdated Show resolved Hide resolved
@tkatila tkatila changed the title GPU: Add support for the new xe drm driver GPU: Add support for the new xe KMD Feb 20, 2024
Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some doc suggestions

cmd/gpu_plugin/README.md Outdated Show resolved Hide resolved
cmd/gpu_plugin/README.md Outdated Show resolved Hide resolved
cmd/gpu_plugin/README.md Outdated Show resolved Hide resolved
cmd/gpu_plugin/README.md Outdated Show resolved Hide resolved
@eero-t
Copy link
Contributor

eero-t commented Feb 22, 2024

Docs look fine to me now. Somebody else can review the code part...

@eero-t
Copy link
Contributor

eero-t commented Feb 22, 2024

I wonder whether it would make sense also to list the driver package names required for L0 (compute+monitoring)?

  • Intel driver repos: https://dgpu-docs.intel.com/
    • For Ubuntu 22.04:
      • level-zero-dev + intel-level-zero-gpu
    • For RHEL/SLES:
      • level-zero-devel + intel-level-zero-gpu
  • ⁠Debian (Testing+) / Ubuntu (24.04+):
    • libze-dev + libze-intel-gpu1
  • Fedora (38+):
    • oneapi-level-zero-devel + intel-level-zero
  • Arch Linux:
    • level-zero-headers + intel-compute-runtime

(As those packages are missing from older distros and their package names are therefore less familiar.)

@tkatila
Copy link
Contributor Author

tkatila commented Feb 23, 2024

I wonder whether it would make sense also to list the driver package names required for L0 (compute+monitoring)?

In general yes, I just don't know if the GPU plugin is the most sensible place for it. At least I wouldn't put those to the main readme.

cmd/gpu_plugin/README.md Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

codecov-commenter commented Feb 26, 2024

Codecov Report

Attention: Patch coverage is 88.72180% with 15 lines in your changes are missing coverage. Please review.

Project coverage is 40.25%. Comparing base (7d00cf0) to head (e0c4ef1).

❗ Current head e0c4ef1 differs from pull request most recent head 4946b26. Consider uploading reports for the commit 4946b26 to get more accurate results

Files Patch % Lines
cmd/gpu_plugin/rm/gpu_plugin_resource_manager.go 33.33% 11 Missing and 1 partial ⚠️
cmd/gpu_plugin/device_props.go 92.59% 1 Missing and 1 partial ⚠️
cmd/gpu_plugin/gpu_plugin.go 98.48% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1670      +/-   ##
==========================================
+ Coverage   39.67%   40.25%   +0.57%     
==========================================
  Files          73       75       +2     
  Lines        6520     6573      +53     
==========================================
+ Hits         2587     2646      +59     
+ Misses       3782     3778       -4     
+ Partials      151      149       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tkatila tkatila force-pushed the xe-driver-support branch 2 times, most recently from c86860f to 20c99a4 Compare February 27, 2024 08:43
uniemimu
uniemimu previously approved these changes Mar 8, 2024
Copy link
Contributor

@mythi mythi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to update e2e/gpu tests to explicitly check xe or i915 but keep the former disabled (SKIP) for now?

cmd/gpu_plugin/README.md Show resolved Hide resolved
cmd/gpu_plugin/README.md Outdated Show resolved Hide resolved
@tkatila
Copy link
Contributor Author

tkatila commented Mar 12, 2024

would it make sense to update e2e/gpu tests to explicitly check xe or i915 but keep the former disabled (SKIP) for now?

I think so. We had an internal discussion that we could add a Flex card or similar to the e2e node and tie one GPU to i915 and another to xe.

I'll add the test.

Plugin can support both i915 and xe drivers dynamically. But
having both drivers on same node with RM is not possible.

Signed-off-by: Tuomas Katila <[email protected]>
cmd/gpu_plugin/README.md Show resolved Hide resolved
cmd/gpu_plugin/README.md Outdated Show resolved Hide resolved
Co-authored-by: Eero Tamminen <[email protected]>
Signed-off-by: Tuomas Katila <[email protected]>
mythi
mythi previously approved these changes Mar 13, 2024
Copy link
Contributor

@mythi mythi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Also align xelink-sidecar deployment with the new files in
the xpu manager project.

Signed-off-by: Tuomas Katila <[email protected]>
Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a quick look also at the code and that looks OK too.

PS. At some point memory amount checks need to move from using out-of-tree i915 sysfs file, to supporting upstream i915 / Xe KMD ioctl interface, but that does not need to be done in this PR: https://docs.kernel.org/gpu/driver-uapi.html

@mythi mythi merged commit ca301c0 into intel:main Mar 13, 2024
78 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GPU label not added to nodes on Talos because i915 is built into the kernel
5 participants