Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gpu levelzero sidecar #1803

Merged
merged 6 commits into from
Sep 19, 2024
Merged

Gpu levelzero sidecar #1803

merged 6 commits into from
Sep 19, 2024

Conversation

tkatila
Copy link
Contributor

@tkatila tkatila commented Aug 9, 2024

Add GPU Levelzero sidecar to allow fetching health data for the GPU devices from the Levelzero API.

As a bonus, this also adds support for detecting Intel GPUs within Windows Subsystem for Linux (WSL).

@tkatila tkatila force-pushed the gpu-levelzero-sidecar branch 4 times, most recently from fc65eff to d3afded Compare August 13, 2024 11:59
Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few doc comments.

PS. I would also suggest running following (untested) command on the code to fix PCI abbreviation use in error messages:
sed -i 's/ pci / PCI /' $(git grep -l ' pci ')

README.md Outdated Show resolved Hide resolved
cmd/gpu_levelzero/README.md Outdated Show resolved Hide resolved
cmd/internal/levelzero/README.md Show resolved Hide resolved
@tkatila
Copy link
Contributor Author

tkatila commented Sep 6, 2024

@eero-t & @uniemimu can you please review?

This probably requires some restructuring after the GPU CDI support has been merged (assuming it will happen first).

(Failed check is due to internal certificate issue, I do not suspect it to be a real problem.)

@eero-t
Copy link
Contributor

eero-t commented Sep 6, 2024

@eero-t & @uniemimu can you please review?

What are the differences to earlier version? (you did force push, so github does not show those.)

@tkatila
Copy link
Contributor Author

tkatila commented Sep 6, 2024

@eero-t & @uniemimu can you please review?

What are the differences to earlier version? (you did force push, so github does not show those.)

https://github.com/intel/intel-device-plugins-for-kubernetes/compare/82a387760e4c2a9a12cc366405e19153f67e7efd..76b1e5a04abc600ec1992a1adf0474d42ffad1b9

~mid third is my changes. Top third and bottom third are baseline changes. But to summarize:

  1. Some logging to C code
  2. Added temperature read from device and its handling in the plugin

cmd/gpu_plugin/gpu_plugin.go Outdated Show resolved Hide resolved
cmd/gpu_levelzero/zes.c Outdated Show resolved Hide resolved
cmd/gpu_levelzero/zes.c Outdated Show resolved Hide resolved
cmd/gpu_levelzero/zes.c Outdated Show resolved Hide resolved
cmd/gpu_levelzero/zes.c Outdated Show resolved Hide resolved
@tkatila tkatila force-pushed the gpu-levelzero-sidecar branch 2 times, most recently from 6a923f8 to 0e6ee98 Compare September 11, 2024 08:57
@tkatila
Copy link
Contributor Author

tkatila commented Sep 11, 2024

Rebased on top of the CDI changes. Also added an e2e test for the levelzero deployment.

Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. There were few minor C-code items which could be fixed before merge.

cmd/gpu_levelzero/zes.c Outdated Show resolved Hide resolved
cmd/gpu_levelzero/zes.c Outdated Show resolved Hide resolved
cmd/gpu_levelzero/zes.c Outdated Show resolved Hide resolved
@mythi
Copy link
Contributor

mythi commented Sep 12, 2024

should it be mentioned somewhere that these new features are not available for the operator users?

@tkatila
Copy link
Contributor Author

tkatila commented Sep 12, 2024

should it be mentioned somewhere that these new features are not available for the operator users?

Sure, I'll add a note. This PR is large enough as it is so I didn't want to touch the operator use case. Once this is merged, I'll tackle the operator support.

@tkatila
Copy link
Contributor Author

tkatila commented Sep 13, 2024

Rebased the content. Added C fixes, fixed compile flags for the build and decided to change the levelzero enabling to a boolean (instead of a unis socket path).

cmd/gpu_plugin/gpu_plugin.go Outdated Show resolved Hide resolved
cmd/gpu_plugin/gpu_plugin.go Outdated Show resolved Hide resolved
cmd/gpu_plugin/gpu_plugin.go Outdated Show resolved Hide resolved
Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, looks OK on quick review.

cmd/gpu_plugin/README.md Outdated Show resolved Hide resolved
@tkatila
Copy link
Contributor Author

tkatila commented Sep 17, 2024

Since I broke wsl in the previous commits, I verified it to work on the current HEAD.

Copy link
Contributor

@mythi mythi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

tkatila and others added 6 commits September 19, 2024 14:51
Signed-off-by: Tuomas Katila <[email protected]>
In addition to the levelzero's health data use, this adds support to
scan devices in WSL. Scanning happens by retrieving Intel device
indices from the Level-Zero API.

Signed-off-by: Tuomas Katila <[email protected]>
Co-authored-by: Eero Tamminen <[email protected]>
Signed-off-by: Tuomas Katila <[email protected]>
@tkatila
Copy link
Contributor Author

tkatila commented Sep 19, 2024

Squashed commits, no code changes.

@uniemimu uniemimu merged commit 9bd3975 into intel:main Sep 19, 2024
75 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants