
refactor: unify the GPU device selection ExaTrkX; local variables declaration in FRNN lib #2925

Merged · 13 commits · Jun 21, 2024

Conversation

hrzhao76
Contributor

@hrzhao76 hrzhao76 commented Feb 5, 2024

This PR moves `deviceHint` in previous implementations to the Config constructor and creates a `torch::Device` type with some protections to ensure that both the model and the input tensors are loaded onto a specific GPU. The base of the FRNN repo is changed to avoid declaring global variables in the CUDA code, which caused a segmentation fault at runtime when running with the Triton Inference Server.

Tagging ExaTrkX aaS people here
@xju2 @ytchoutw @asnaylor @yongbinfeng @y19y19
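The guarded device selection described above can be sketched roughly as follows. This is a plain-Python illustration only, with hypothetical names (`select_device`, `device_id`); the actual implementation is C++ returning a `torch::Device`:

```python
def select_device(device_id: int, num_available_gpus: int) -> str:
    """Pick a device string from a configured GPU index.

    device_id: GPU index from the Config (illustrative stand-in for the
    device option this PR introduces); a negative value means "use CPU".
    num_available_gpus: number of CUDA devices visible to the process.
    """
    if device_id < 0:
        # CPU was explicitly requested.
        return "cpu"
    if num_available_gpus == 0:
        # No CUDA device is visible at all: fall back to CPU.
        return "cpu"
    if device_id >= num_available_gpus:
        # The requested GPU index does not exist: fall back to the first GPU.
        return "cuda:0"
    return f"cuda:{device_id}"
```

Keeping the choice in the Config (rather than a per-call hint) means the model and every input tensor are placed on the same device for the lifetime of the stage.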

@github-actions github-actions bot added the Component - Examples, Component - Plugins, and Track Finding labels Feb 5, 2024
@AJPfleger AJPfleger changed the title refactor: unify the GPU device seletion ExaTrkX; local variables declaration in FRNN lib refactor: unify the GPU device selection ExaTrkX; local variables declaration in FRNN lib Feb 6, 2024
Contributor

@Corentin-Allaire Corentin-Allaire left a comment


Looks good!

@andiwand andiwand added this to the next milestone Mar 1, 2024

codecov bot commented Mar 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 48.86%. Comparing base (92aca1c) to head (3ca9f70).

Current head 3ca9f70 differs from pull request most recent head d380548

Please upload reports for the commit d380548 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2925      +/-   ##
==========================================
+ Coverage   47.24%   48.86%   +1.62%     
==========================================
  Files         508      493      -15     
  Lines       30041    29058     -983     
  Branches    14586    13798     -788     
==========================================
+ Hits        14192    14200       +8     
+ Misses       5375     4962     -413     
+ Partials    10474     9896     -578     


@andiwand andiwand removed the automerge label Mar 1, 2024
@andiwand
Contributor

andiwand commented Mar 1, 2024

Two problems here:

  • codecov is failing
  • CI bridge won't run

@paulgessinger can you add @hrzhao76 to the bridge user list?

@paulgessinger
Member

@andiwand Thought I did already, but done now anyway.

@andiwand
Contributor

andiwand commented Mar 1, 2024

CI Bridge / build_exatrkx is failing: https://github.com/acts-project/acts/pull/2925/checks?check_run_id=22177563102
@hrzhao76
Contributor Author

hrzhao76 commented Mar 4, 2024

> CI Bridge / build_exatrkx is failing https://github.com/acts-project/acts/pull/2925/checks?check_run_id=22177563102 @hrzhao76

@andiwand thanks for the message. I fixed it in commit 49a628a, then ran the unit tests manually and they pass.

However, the CI now fails when downloading the Geant4 package; it seems the URL changed.
https://gitlab.cern.ch/acts/ci-bridge/-/jobs/36678358#L340

It also fails with `ACTS_LOG_FAILURE_THRESHOLD=WARNING`. In this case, would you suggest changing the WARNING to INFO when no GPU is available?

https://gitlab.cern.ch/acts/ci-bridge/-/jobs/36678356#L362

@paulgessinger
Member

paulgessinger commented Mar 12, 2024

@hrzhao76 The CI job runs on a machine which does have a GPU available. The WARNING seems to indicate to me that it's failing to correctly select the GPU. I would caution against downgrading the WARNING, because it means that we would stop running GPU tests.

@andiwand
Contributor

https://github.com/acts-project/acts/pull/2925/checks?check_run_id=22552545038

09:27:40    MetricLearni   WARNING   GPU device 0 not available. Using CPU instead.
FPE masks:
- Fatras/include/ActsFatras/Kernel/detail/SimulationActor.hpp:197: FLTUND: 1
- Fatras/include/ActsFatras/Physics/ElectroMagnetic/BetheHeitler.hpp:65: FLTUND: 1
- Examples/Io/Root/src/RootTrackStatesWriter.cpp:590: FLTINV: 1
- Examples/Io/Root/src/RootTrackSummaryWriter.cpp:419: FLTINV: 1
- Core/src/Vertexing/AdaptiveMultiVertexFinder.cpp:480: FLTUND: 1
- Core/include/Acts/TrackFitting/detail/GsfComponentMerging.hpp:88: FLTUND: 1
- Core/include/Acts/TrackFitting/detail/GsfComponentMerging.hpp:198: FLTUND: 1
Traceback (most recent call last):
  File "/builds/acts/ci-bridge/src/Examples/Scripts/Python/exatrkx.py", line 75, in <module>
    addExaTrkX(
  File "/builds/acts/ci-bridge/build/python/acts/examples/reconstruction.py", line 1530, in addExaTrkX
    graphConstructor = acts.examples.TorchMetricLearning(**metricLearningConfig)
  File "/builds/acts/ci-bridge/build/python/acts/_adapter.py", line 74, in wrapped
    fn(self, *args, **kwargs)
  File "/builds/acts/ci-bridge/build/python/acts/_adapter.py", line 40, in wrapped
    fn(self, cfg, *args, **_kwargs)
acts.ActsPythonBindings.logging.ThresholdFailure: Previous debug message exceeds the ACTS_LOG_FAILURE_THRESHOLD=WARNING configuration, bailing out.

If we want to run on a GPU, there seems to be a problem with the selection. If CPU is intended, the ACTS_LOG_FAILURE_THRESHOLD kills the process.

@hrzhao76

@github-actions github-actions bot added Stale and removed Stale labels Apr 11, 2024
@benjaminhuth
Member

In principle I like these changes, so we should try to merge them. Could the CI issue with the GPU be related to the fact that the previous default device was -1, and now it defaults to 0?
This wouldn't make much sense to me, but it is the only clear difference I see...

@github-actions github-actions bot removed the Stale label Jun 21, 2024
@hrzhao76
Contributor Author

> @hrzhao76 are we moving forward with this? Otherwise I'll close this PR.

Sorry for the late reply. I'm looking into this ci bridge failure.

@hrzhao76
Contributor Author

Hello @benjaminhuth @paulgessinger @andiwand, thank you for pointing out the CI bridge failure.
The issue arises because one of the examples is tested with only the CPU, and CUDA_VISIBLE_DEVICES is empty. You can see this at line 1302 in test_examples.py.
The original device selection logic reports an ACTS_WARNING anyway, so it fails in such a CPU-only test even on a GPU-equipped machine. I have improved the logic, and it should work correctly now. Let's wait for the CI bridge results.
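The distinction being drawn here can be sketched in plain Python. The function, its name, and the log levels are illustrative, not the actual ACTS C++ API; the point is that an intentionally CPU-only environment (empty CUDA_VISIBLE_DEVICES) should log INFO, while a genuinely failed GPU request should log WARNING:

```python
import os

def choose_device(requested_gpu, cuda_available, log):
    """requested_gpu: configured GPU index, or None for an explicit CPU run.
    cuda_available: whether a usable CUDA runtime was detected.
    log: callable taking (level, message).
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    cpu_only_env = visible is not None and visible.strip() == ""
    if requested_gpu is None or cpu_only_env:
        # CPU was requested, or GPUs were hidden on purpose: log INFO,
        # not WARNING, so ACTS_LOG_FAILURE_THRESHOLD=WARNING is not tripped.
        log("INFO", "Running on CPU")
        return "cpu"
    if not cuda_available:
        # A GPU was requested but none is usable: this deserves a WARNING.
        log("WARNING",
            f"GPU device {requested_gpu} not available. Using CPU instead.")
        return "cpu"
    return f"cuda:{requested_gpu}"
```

With a rule like this, the CPU-only CI example runs cleanly under the WARNING threshold while real GPU-selection failures still surface.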

Member

@benjaminhuth benjaminhuth left a comment


Thanks a lot for the update! I have only some nitpick comments on the logging.

One idea: would it be worth factoring out and centralizing the device selection logic into a separate function, maybe in .../include/Acts/Plugins/ExaTrkX/detail/deviceSelection.hpp or so?

Review comment on Plugins/ExaTrkX/src/TorchEdgeClassifier.cpp (outdated, resolved)
@paulgessinger
Member

Thanks @hrzhao76! There's also still a conflict in thirdparty/FRNN/CMakeLists.txt; could you resolve that?

@github-actions github-actions bot added the Infrastructure Changes to build tools, continuous integration, ... label Jun 21, 2024
@hrzhao76
Contributor Author

> Thanks a lot for the update! I have only some nitpick comments on the logging.
>
> One idea: would it be worth factoring out and centralizing the device selection logic into a separate function, maybe in .../include/Acts/Plugins/ExaTrkX/detail/deviceSelection.hpp or so?

I've corrected the logging. Regarding deviceSelection.hpp, perhaps we can consider this in the future? Currently, only TorchScript follows this method, while ONNX uses a different approach.

@hrzhao76
Contributor Author

> Thanks @hrzhao76! There's also still a conflict in thirdparty/FRNN/CMakeLists.txt; could you resolve that?

Thanks for pointing that out. I've resolved the conflict after pulling the new commits. I also fixed a bug to avoid calling `CUDAGuard` when the device is `kCPU`, which failed in the last round of the CI bridge...
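The kCPU bug fix mentioned here can be illustrated with a small Python sketch. The names (`cuda_guard`, `device_scope`) are hypothetical stand-ins; the real code is C++, where constructing a `c10::cuda::CUDAGuard` requires a CUDA device:

```python
from contextlib import contextmanager, nullcontext

@contextmanager
def cuda_guard(index):
    # Illustrative stand-in for c10::cuda::CUDAGuard, which must not be
    # constructed for a CPU device.
    if index is None:
        raise RuntimeError("CUDAGuard constructed without a CUDA device")
    yield

def device_scope(device: str):
    """Enter the guard only for CUDA devices; use a no-op scope for CPU."""
    if device.startswith("cuda:"):
        return cuda_guard(int(device.split(":", 1)[1]))
    return nullcontext()
```

Branching before the guard, rather than inside it, is what keeps the CPU-only path from ever touching the CUDA runtime.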

Member

@benjaminhuth benjaminhuth left a comment


Thanks for your work! I approve and hope the CI goes through...

@kodiakhq kodiakhq bot merged commit cf9d872 into acts-project:main Jun 21, 2024
52 checks passed
@acts-project-service acts-project-service added the Fails Athena tests This PR causes a failure in the Athena tests label Jun 21, 2024
timadye pushed a commit to andiwand/acts that referenced this pull request Jun 27, 2024
…laration in FRNN lib (acts-project#2925)

This PR moves `deviceHint` in previous implementations to the Config constructor and creates a `torch::Device` type with some protections to ensure that both the model and the input tensors are loaded onto a specific GPU. The base of the FRNN repo is changed to avoid declaring global variables in the CUDA code, which caused a segmentation fault at runtime when running with the Triton Inference Server.

Tagging ExaTrkX aaS people here
@xju2 @ytchoutw @asnaylor @yongbinfeng @y19y19
kodiakhq bot pushed a commit that referenced this pull request Jul 8, 2024
…it (#3353)

#2925 broke the CPU-only build of the Exa.TrkX plugin; this PR fixes that. It changes this from an implicit choice (depending on whether or not we find CUDA) to an explicit choice (a CMake flag).
Also adds a CI job for the CPU-only build.
benjaminhuth pushed a commit to benjaminhuth/acts that referenced this pull request Jul 11, 2024
…laration in FRNN lib (acts-project#2925)

This PR moves `deviceHint` in previous implementations to the Config constructor and creates a `torch::Device` type with some protections to ensure that both the model and the input tensors are loaded onto a specific GPU. The base of the FRNN repo is changed to avoid declaring global variables in the CUDA code, which caused a segmentation fault at runtime when running with the Triton Inference Server.

Tagging ExaTrkX aaS people here
@xju2 @ytchoutw @asnaylor @yongbinfeng @y19y19
benjaminhuth added a commit to benjaminhuth/acts that referenced this pull request Jul 11, 2024
…it (acts-project#3353)

acts-project#2925 broke the CPU-only build of the Exa.TrkX plugin; this PR fixes that. It changes this from an implicit choice (depending on whether or not we find CUDA) to an explicit choice (a CMake flag).
Also adds a CI job for the CPU-only build.
@paulgessinger paulgessinger modified the milestones: next, v36.0.0 Jul 18, 2024
Labels
- Component - Examples: Affects the Examples module
- Component - Plugins: Affects one or more Plugins
- Fails Athena tests: This PR causes a failure in the Athena tests
- Infrastructure: Changes to build tools, continuous integration, ...
- Track Finding