
refactor: unify the GPU device selection ExaTrkX; local variables declaration in FRNN lib #2925

Merged · 13 commits · Jun 21, 2024

Conversation

hrzhao76
Contributor

@hrzhao76 hrzhao76 commented Feb 5, 2024

This PR moves `deviceHint` in previous implementations to the Config constructor and creates a `torch::Device` type with some protections to ensure that both the model and the input tensors are loaded onto a specific GPU. The base of the FRNN repo is changed to avoid declaring global variables in the CUDA code, which caused a segmentation fault at runtime when running with the Triton Inference Server.

Tagging ExaTrkX aaS people here
@xju2 @ytchoutw @asnaylor @yongbinfeng @y19y19
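The guarded device selection described above can be sketched roughly as follows. This is a plain-Python illustration only, with hypothetical names (`select_device`, `device_id`); the actual implementation is C++ returning a `torch::Device`:

```python
def select_device(device_id: int, num_available_gpus: int) -> str:
    """Pick a device string from a configured GPU index.

    device_id: GPU index from the Config (illustrative stand-in for the
    device option this PR introduces); a negative value means "use CPU".
    num_available_gpus: number of CUDA devices visible to the process.
    """
    if device_id < 0:
        # CPU was explicitly requested.
        return "cpu"
    if num_available_gpus == 0:
        # No CUDA device is visible at all: fall back to CPU.
        return "cpu"
    if device_id >= num_available_gpus:
        # The requested GPU index does not exist: fall back to the first GPU.
        return "cuda:0"
    return f"cuda:{device_id}"
```

Keeping the choice in the Config (rather than a per-call hint) means the model and every input tensor are placed on the same device for the lifetime of the stage.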

@github-actions github-actions bot added the Component - Examples, Component - Plugins, and Track Finding labels Feb 5, 2024
@AJPfleger AJPfleger changed the title refactor: unify the GPU device seletion ExaTrkX; local variables declaration in FRNN lib refactor: unify the GPU device selection ExaTrkX; local variables declaration in FRNN lib Feb 6, 2024
Contributor

@Corentin-Allaire Corentin-Allaire left a comment


Looks good!

@andiwand andiwand added this to the next milestone Mar 1, 2024

codecov bot commented Mar 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 48.86%. Comparing base (92aca1c) to head (3ca9f70).

Current head 3ca9f70 differs from pull request most recent head d380548

Please upload reports for the commit d380548 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2925      +/-   ##
==========================================
+ Coverage   47.24%   48.86%   +1.62%     
==========================================
  Files         508      493      -15     
  Lines       30041    29058     -983     
  Branches    14586    13798     -788     
==========================================
+ Hits        14192    14200       +8     
+ Misses       5375     4962     -413     
+ Partials    10474     9896     -578     


@andiwand andiwand removed the automerge label Mar 1, 2024
@andiwand
Contributor

andiwand commented Mar 1, 2024

Two problems here:

  • codecov is failing
  • CI bridge won't run

@paulgessinger can you add @hrzhao76 to the bridge user list?

@paulgessinger
Member

@andiwand Thought I did already, but done now anyway.

@andiwand
Contributor

andiwand commented Mar 1, 2024

CI Bridge / build_exatrkx is failing: https://github.com/acts-project/acts/pull/2925/checks?check_run_id=22177563102
@hrzhao76
Contributor Author

hrzhao76 commented Mar 4, 2024

> CI Bridge / build_exatrkx is failing https://github.com/acts-project/acts/pull/2925/checks?check_run_id=22177563102 @hrzhao76

@andiwand thanks for the message. I fixed it in commit 49a628a, then ran the unit tests manually and they pass.

However, the CI now fails when downloading the Geant4 package; it seems the URL changed.
https://gitlab.cern.ch/acts/ci-bridge/-/jobs/36678358#L340

It also fails with `ACTS_LOG_FAILURE_THRESHOLD=WARNING`. In this case, would you suggest changing the WARNING to INFO when no GPU is available?

https://gitlab.cern.ch/acts/ci-bridge/-/jobs/36678356#L362

@paulgessinger
Member

paulgessinger commented Mar 12, 2024

@hrzhao76 The CI job runs on a machine which does have a GPU available. The WARNING seems to indicate to me that it's failing to correctly select the GPU. I would caution against downgrading the WARNING, because it means that we would stop running GPU tests.

@andiwand
Contributor

https://github.com/acts-project/acts/pull/2925/checks?check_run_id=22552545038

09:27:40    MetricLearni   WARNING   GPU device 0 not available. Using CPU instead.
FPE masks:
- Fatras/include/ActsFatras/Kernel/detail/SimulationActor.hpp:197: FLTUND: 1
- Fatras/include/ActsFatras/Physics/ElectroMagnetic/BetheHeitler.hpp:65: FLTUND: 1
- Examples/Io/Root/src/RootTrackStatesWriter.cpp:590: FLTINV: 1
- Examples/Io/Root/src/RootTrackSummaryWriter.cpp:419: FLTINV: 1
- Core/src/Vertexing/AdaptiveMultiVertexFinder.cpp:480: FLTUND: 1
- Core/include/Acts/TrackFitting/detail/GsfComponentMerging.hpp:88: FLTUND: 1
- Core/include/Acts/TrackFitting/detail/GsfComponentMerging.hpp:198: FLTUND: 1
Traceback (most recent call last):
  File "/builds/acts/ci-bridge/src/Examples/Scripts/Python/exatrkx.py", line 75, in <module>
    addExaTrkX(
  File "/builds/acts/ci-bridge/build/python/acts/examples/reconstruction.py", line 1530, in addExaTrkX
    graphConstructor = acts.examples.TorchMetricLearning(**metricLearningConfig)
  File "/builds/acts/ci-bridge/build/python/acts/_adapter.py", line 74, in wrapped
    fn(self, *args, **kwargs)
  File "/builds/acts/ci-bridge/build/python/acts/_adapter.py", line 40, in wrapped
    fn(self, cfg, *args, **_kwargs)
acts.ActsPythonBindings.logging.ThresholdFailure: Previous debug message exceeds the ACTS_LOG_FAILURE_THRESHOLD=WARNING configuration, bailing out.

If we want to run on a GPU, there seems to be a problem with the selection. If CPU is intended, the ACTS_LOG_FAILURE_THRESHOLD kills the process.

@hrzhao76

@github-actions github-actions bot added Stale and removed Stale labels Apr 11, 2024
@benjaminhuth
Member

In principle I like these changes, so we should try to merge them. Could the CI issue with the GPU be related to the fact that the previous default device was -1, and now it defaults to 0?
This wouldn't make much sense to me, but it is the only clear difference I see...

@github-actions github-actions bot removed the Stale label Jun 21, 2024
@hrzhao76
Contributor Author

> @hrzhao76 are we moving forward with this? Otherwise I'll close this PR.

Sorry for the late reply. I'm looking into this ci bridge failure.

@hrzhao76
Contributor Author

Hello @benjaminhuth @paulgessinger @andiwand, thank you for pointing out the CI bridge failure.
The issue arises because one of the examples is tested with only the CPU, and CUDA_VISIBLE_DEVICES is empty. You can see this at line 1302 in test_examples.py.
The original device selection logic reports an ACTS_WARNING anyway, so it fails in such a CPU-only test even on a GPU-equipped machine. I have improved the logic, and it should work correctly now. Let's wait for the CI bridge results.
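The distinction being drawn here can be sketched in plain Python. The function, its name, and the log levels are illustrative, not the actual ACTS C++ API; the point is that an intentionally CPU-only environment (empty CUDA_VISIBLE_DEVICES) should log INFO, while a genuinely failed GPU request should log WARNING:

```python
import os

def choose_device(requested_gpu, cuda_available, log):
    """requested_gpu: configured GPU index, or None for an explicit CPU run.
    cuda_available: whether a usable CUDA runtime was detected.
    log: callable taking (level, message).
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    cpu_only_env = visible is not None and visible.strip() == ""
    if requested_gpu is None or cpu_only_env:
        # CPU was requested, or GPUs were hidden on purpose: log INFO,
        # not WARNING, so ACTS_LOG_FAILURE_THRESHOLD=WARNING is not tripped.
        log("INFO", "Running on CPU")
        return "cpu"
    if not cuda_available:
        # A GPU was requested but none is usable: this deserves a WARNING.
        log("WARNING",
            f"GPU device {requested_gpu} not available. Using CPU instead.")
        return "cpu"
    return f"cuda:{requested_gpu}"
```

With a rule like this, the CPU-only CI example runs cleanly under the WARNING threshold while real GPU-selection failures still surface.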

Member

@benjaminhuth benjaminhuth left a comment


Thanks a lot for the update! I have only some nitpick comments on the logging.

One idea: would it be worth factoring out and centralizing the device selection logic into a separate function, maybe in .../include/Acts/Plugins/ExaTrkX/detail/deviceSelection.hpp or so?

Review comment on Plugins/ExaTrkX/src/TorchEdgeClassifier.cpp (outdated, resolved)
@paulgessinger
Member

Thanks @hrzhao76! There's also still a conflict in thirdparty/FRNN/CMakeLists.txt; could you resolve that?

@github-actions github-actions bot added the Infrastructure Changes to build tools, continuous integration, ... label Jun 21, 2024
@hrzhao76
Contributor Author

> Thanks a lot for the update! I have only some nitpick comments on the logging.
>
> One idea: would it be worth factoring out and centralizing the device selection logic into a separate function, maybe in .../include/Acts/Plugins/ExaTrkX/detail/deviceSelection.hpp or so?

I've corrected the logging. Regarding deviceSelection.hpp, perhaps we can consider this in the future? Currently, only TorchScript follows this method, while ONNX uses a different approach.

@hrzhao76
Contributor Author

> Thanks @hrzhao76! There's also still a conflict in thirdparty/FRNN/CMakeLists.txt; could you resolve that?

Thanks for pointing that out. I've resolved the conflict after pulling the new commits. I also fixed a bug to avoid calling `CUDAGuard` when the device is `kCPU`, which failed in the last round of the CI bridge...
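The kCPU bug fix mentioned here can be illustrated with a small Python sketch. The names (`cuda_guard`, `device_scope`) are hypothetical stand-ins; the real code is C++, where constructing a `c10::cuda::CUDAGuard` requires a CUDA device:

```python
from contextlib import contextmanager, nullcontext

@contextmanager
def cuda_guard(index):
    # Illustrative stand-in for c10::cuda::CUDAGuard, which must not be
    # constructed for a CPU device.
    if index is None:
        raise RuntimeError("CUDAGuard constructed without a CUDA device")
    yield

def device_scope(device: str):
    """Enter the guard only for CUDA devices; use a no-op scope for CPU."""
    if device.startswith("cuda:"):
        return cuda_guard(int(device.split(":", 1)[1]))
    return nullcontext()
```

Branching before the guard, rather than inside it, is what keeps the CPU-only path from ever touching the CUDA runtime.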

Member

@benjaminhuth benjaminhuth left a comment


Thanks for your work! I approve and hope the CI goes through...

@kodiakhq kodiakhq bot merged commit cf9d872 into acts-project:main Jun 21, 2024
52 checks passed
@acts-project-service acts-project-service added the Fails Athena tests This PR causes a failure in the Athena tests label Jun 21, 2024
timadye pushed a commit to andiwand/acts that referenced this pull request Jun 27, 2024
…laration in FRNN lib (acts-project#2925)

This PR moves `deviceHint` in previous implementations to the Config constructor and creates a `torch::Device` type with some protections to ensure that both the model and the input tensors are loaded onto a specific GPU. The base of the FRNN repo is changed to avoid declaring global variables in the CUDA code, which caused a segmentation fault at runtime when running with the Triton Inference Server.

Tagging ExaTrkX aaS people here
@xju2 @ytchoutw @asnaylor @yongbinfeng @y19y19
kodiakhq bot pushed a commit that referenced this pull request Jul 8, 2024
…it (#3353)

#2925 broke the CPU-only build of the Exa.TrkX plugin; this PR fixes that. It changes this from an implicit choice (depending on whether or not we find CUDA) to an explicit choice (a CMake flag).
Also adds a CI job for the CPU-only build.
benjaminhuth pushed a commit to benjaminhuth/acts that referenced this pull request Jul 11, 2024
…laration in FRNN lib (acts-project#2925)

This PR moves `deviceHint` in previous implementations to the Config constructor and creates a `torch::Device` type with some protections to ensure that both the model and the input tensors are loaded onto a specific GPU. The base of the FRNN repo is changed to avoid declaring global variables in the CUDA code, which caused a segmentation fault at runtime when running with the Triton Inference Server.

Tagging ExaTrkX aaS people here
@xju2 @ytchoutw @asnaylor @yongbinfeng @y19y19
benjaminhuth added a commit to benjaminhuth/acts that referenced this pull request Jul 11, 2024
…it (acts-project#3353)

acts-project#2925 broke the CPU-only build of the Exa.TrkX plugin; this PR fixes that. It changes this from an implicit choice (depending on whether or not we find CUDA) to an explicit choice (a CMake flag).
Also adds a CI job for the CPU-only build.
@paulgessinger paulgessinger modified the milestones: next, v36.0.0 Jul 18, 2024
Labels
- Component - Examples: Affects the Examples module
- Component - Plugins: Affects one or more Plugins
- Fails Athena tests: This PR causes a failure in the Athena tests
- Infrastructure: Changes to build tools, continuous integration, ...
- Track Finding