refactor: unify the GPU device selection in ExaTrkX; local variable declarations in FRNN lib #2925
Conversation
📊: Physics performance monitoring for d380548 — physmon summary
…ix a typo in README
Looks good!
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@ Coverage Diff @@
##             main    #2925      +/-   ##
==========================================
+ Coverage   47.24%   48.86%   +1.62%
==========================================
  Files         508      493      -15
  Lines       30041    29058     -983
  Branches    14586    13798     -788
==========================================
+ Hits        14192    14200       +8
+ Misses       5375     4962     -413
+ Partials    10474     9896     -578
```

☔ View full report in Codecov by Sentry.
2 problems here:
@paulgessinger can you add @hrzhao76 to the bridge user list?
@andiwand Thought I did already, but done now anyway.
Invalidated by push of 618923b
@andiwand thanks for the message. I fixed it in commit 49a628a, then ran the unit test manually and it works. However, the CI now fails when downloading the Geant4 package; the URL seems to have changed. It also fails with the …
@hrzhao76 The CI job runs on a machine which does have a GPU available. The WARNING seems to indicate to me that it's failing to correctly select the GPU. I would caution against downgrading the WARNING, because it means that we would stop running GPU tests.
https://github.com/acts-project/acts/pull/2925/checks?check_run_id=22552545038
If we want to run on a GPU, there seems to be a problem with the selection. If CPU is wanted, the …
In principle I like these changes, so we should try to merge them. Could the CI issue with the GPU be related to the fact that the previous default device type was …
Sorry for the late reply. I'm looking into this CI bridge failure.
…to dev/exatrkx-gpu-device
Hello @benjaminhuth @paulgessinger @andiwand, thank you for pointing out the CI bridge failure.
Thanks a lot for the update! I only have some nitpick comments on the logging.
One idea: would it be worth factoring out and centralizing the device-selection logic into a separate function, maybe in .../include/Acts/Plugins/ExaTrkX/detail/deviceSelection.hpp or so?
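A minimal sketch of what such a centralized helper could look like. Everything here is hypothetical: `DeviceType`, `Device`, and `selectDevice` stand in for `torch::DeviceType`/`torch::Device` so the fallback logic can be shown without a libtorch dependency; the real helper would consult `torch::cuda::is_available()` and the CUDA device count.

```cpp
#include <stdexcept>

// Stand-ins for torch::DeviceType / torch::Device (hypothetical, for illustration).
enum class DeviceType { eCPU, eCUDA };

struct Device {
  DeviceType type;
  int index;  // only meaningful for eCUDA
};

// Select a device from a user hint, falling back to CPU when CUDA is
// unavailable, and rejecting out-of-range GPU indices instead of silently
// picking a different device.
inline Device selectDevice(DeviceType hint, int requestedIndex,
                           int cudaDeviceCount) {
  if (hint == DeviceType::eCPU || cudaDeviceCount == 0) {
    return {DeviceType::eCPU, 0};
  }
  if (requestedIndex < 0 || requestedIndex >= cudaDeviceCount) {
    throw std::invalid_argument("requested CUDA device index out of range");
  }
  return {DeviceType::eCUDA, requestedIndex};
}
```

The explicit range check is the kind of "protection" the PR description mentions: it surfaces a bad device hint immediately rather than letting model and input tensors land on different GPUs.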
Thanks @hrzhao76! There's also still a conflict in …
I've corrected the logging. Regarding the deviceSelection.hpp, perhaps we can consider this for the future? Currently, only TorchScript follows this method, while ONNX uses a different approach.
Thanks for pointing that out. I've resolved this conflict after pulling the new commits. I also fixed a bug to avoid calling CUDAGuard when the device is kCPU, which failed in the last round of the CI bridge...
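The pattern behind that fix can be sketched as follows. This is not the actual patch: `CudaGuard` is a stand-in for `c10::cuda::CUDAGuard` (which must not be constructed with a CPU device), so only the control flow is illustrated.

```cpp
#include <optional>

// Stand-in for torch::DeviceType (hypothetical, for illustration).
enum class DeviceType { eCPU, eCUDA };

// Stand-in for c10::cuda::CUDAGuard: an RAII object that pins the current
// CUDA device for the enclosing scope.
struct CudaGuard {
  explicit CudaGuard(int index) : activeIndex(index) {}
  int activeIndex;
};

// Only construct the guard when the selected device is actually a GPU.
// On the CPU path no guard is created, avoiding the runtime failure seen
// when a CUDAGuard was built for a kCPU device.
inline std::optional<CudaGuard> maybeGuard(DeviceType type, int index) {
  if (type == DeviceType::eCUDA) {
    return CudaGuard(index);
  }
  return std::nullopt;  // CPU: skip the guard entirely
}
```

In the real code the same effect is achieved by wrapping the `CUDAGuard` construction in an `if (device.is_cuda())`-style check before the inference call.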
Thanks for your work! I approve and hope the CI goes through...
…laration in FRNN lib (acts-project#2925) This PR moves `deviceHint` from the previous implementations to the Config constructor, and creates a `torch::Device` with some protections to ensure both the model and the input tensors are loaded onto a specific GPU. The base of the FRNN repo is changed to avoid declaring global variables in the CUDA code, which caused a segmentation fault at runtime when running with Triton Inference Server. Tagging ExaTrkX aaS people here: @xju2 @ytchoutw @asnaylor @yongbinfeng @y19y19
…it (acts-project#3353) acts-project#2925 broke the CPU-only build of the Exa.TrkX plugin; this PR fixes that. It changes the build from an implicit choice (depending on whether or not CUDA is found) to an explicit one (a CMake flag). It also adds a CI job for the CPU-only build.
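An explicit switch of that kind might look like the fragment below. This is a hypothetical sketch: the option and macro names (`ACTS_EXATRKX_ENABLE_CUDA`, `ACTS_EXATRKX_CPUONLY`, `ActsPluginExaTrkX`) are illustrative and may differ from what acts-project#3353 actually introduced.

```cmake
# Hypothetical sketch: an explicit flag instead of auto-detecting CUDA.
option(ACTS_EXATRKX_ENABLE_CUDA "Build the Exa.TrkX plugin with CUDA support" OFF)

if(ACTS_EXATRKX_ENABLE_CUDA)
  enable_language(CUDA)
  target_compile_definitions(ActsPluginExaTrkX PRIVATE ACTS_EXATRKX_CPUONLY=0)
else()
  # CPU-only build: no CUDA toolchain required
  target_compile_definitions(ActsPluginExaTrkX PRIVATE ACTS_EXATRKX_CPUONLY=1)
endif()
```

The benefit over auto-detection is reproducibility: the same source tree builds the same way regardless of whether the build machine happens to have CUDA installed.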
This PR moves `deviceHint` from the previous implementations to the Config constructor, and creates a `torch::Device` with some protections to ensure both the model and the input tensors are loaded onto a specific GPU. The base of the FRNN repo is changed to avoid declaring global variables in the CUDA code, which caused a segmentation fault at runtime when running with Triton Inference Server.

Tagging ExaTrkX aaS people here: @xju2 @ytchoutw @asnaylor @yongbinfeng @y19y19