Improve buffer allocation and copies for SiPixelDigisCUDA #36176
Conversation
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36176/26742
A new Pull Request was created by @czangela for master. It involves the following packages:
@jpata, @cmsbuild, @fwyzard, @makortel, @slava77 can you please review it and eventually sign? Thanks. cms-bot commands are listed here
// we don't copy local x and y coordinates and module index
enum class StorageLocationHost { ADC = 0, CLUS = 1, PDIGI = 3, RAWIDARR = 5, MAX = 7 };
what happens if N is odd? (the 32-bit word would not be aligned...)
I would feel more comfortable reversing the whole storage, so we keep clus, adc, id, yy, xx close to each other, and one transfer to host of the first part (which is also much better aligned).
Is the 128-byte alignment not important for each column then?
it should be
Looks like the simpler fix would be to round maxFedWords up to the next multiple of 64 in the constructor of SiPixelDigisCUDA:
- : m_store(cms::cuda::make_device_unique<uint16_t[]>(
- maxFedWords * int(SiPixelDigisCUDASOAView::StorageLocation::kMAX), stream)) {
- auto get16 = [&](SiPixelDigisCUDASOAView::StorageLocation s) { return m_store.get() + int(s) * maxFedWords; };
+ : m_store(cms::cuda::make_device_unique<uint16_t[]>(
+ ((maxFedWords + 63) / 64) * 64 * int(SiPixelDigisCUDASOAView::StorageLocation::kMAX), stream)) {
+ auto get16 = [&](SiPixelDigisCUDASOAView::StorageLocation s) { return m_store.get() + int(s) * ((maxFedWords + 63) / 64) * 64; };
This would guarantee that all "columns" are aligned to 128 bytes (assuming the whole buffer is) and that any 32-bit integer does not fall on a 16-bit address.
cms::cuda::device::unique_ptr<uint16_t[]> moduleInd_d;  // module id of each pixel
cms::cuda::device::unique_ptr<int32_t[]> clus_d;        // cluster id of each pixel
cms::cuda::device::unique_ptr<DeviceConstView> view_d;  // "me" pointer
cms::cuda::device::unique_ptr<uint16_t[]> m_store;
Just to be on the safe side I would allocate uint32_t (and then use a uint16_t pointer anyhow).
In reality, as long as we use the CachingAllocator it is absolutely irrelevant (alignment is guaranteed), so OK, keep it as such (just reverse the components as suggested below).
@cmsbuild , please test.
enable gpu
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36176/26746
@cmsbuild , please test
btw:
// separate product?
cms::cuda::device::unique_ptr<uint32_t[]> pdigi_d;     // packed digi (row, col, adc) of each pixel
cms::cuda::device::unique_ptr<uint32_t[]> rawIdArr_d;  // DetId of each pixel
cms::cuda::host::unique_ptr<SiPixelDigisCUDASOAView> m_view;  // "me" pointer
The m_view was in device memory before. Was it deemed better to access the pinned host memory directly instead of transferring it? Was passing the View by value considered?
Indeed there is no need to allocate it on the heap (even less in pinned memory).
Suggested change:
SiPixelDigisCUDASOAView m_view;
The single elements are still passed to the kernel one by one. In a further cleanup one could consider passing the view to the kernel by value.
In PixelRecHitGPUKernel::makeHitsAsync() it is still passed by pointer:
cmssw/RecoLocalTracker/SiPixelRecHits/plugins/PixelRecHitGPUKernel.cu
Lines 51 to 52 in f60a7ff
gpuPixelRecHits::getHits<<<blocks, threadsPerBlock, 0, stream>>>(
    cpeParams, bs_d.data(), digis_d.view(), digis_d.nDigis(), clusters_d.view(), hits_d.view());
uint16_t *xx_;  // local coordinates of each pixel
uint16_t *yy_;
uint16_t *adc_;  // ADC of each pixel
uint16_t *moduleInd_;  // module id of each pixel
int32_t *clus_;  // cluster id of each pixel
uint32_t *pdigi_;
uint32_t *rawIdArr_;
Public access to non-const pointers combined with
SiPixelDigisCUDASOAView const *view() const { return m_view.get(); }
opens the door for mutable access via const functions, because it is easy to make a non-const copy of SiPixelDigisCUDASOAView (now also in host code, because of cms::cuda::host::unique_ptr<SiPixelDigisCUDASOAView> m_view; // "me" pointer).
I think we already decided that the best way to format and code SoAs and Views is deferred to a further discussion.
Data products are required to be const-thread-safe, and we shouldn't deviate from that even temporarily.
This view is used to modify buffers through it, like TrackingRecHit2DSOAView, so the members should stay non-const. Would making them private and accessing them through member functions resolve the conflict?
I'm starting to see @VinInn's point now. The TrackingRecHit2DSOAView appears to suffer from the same problem (it is easy to "leak" mutable access from const methods by just making a copy of the View; I had forgotten we already had such cases). Adding member functions (either const or non-const) that return pointers-to-non-const doesn't help.
Overhauling "everything" is clearly out of scope of this PR. Still, the earlier SiPixelDigisCUDA did not leak mutable access. Maybe, to be practical and given the precedent of TrackingRecHit2DSOAView, "gaining mutable access by copying the View" could be considered infrequent enough to accept temporarily, while working towards getting e.g. #35951 in as a next step for a better SoA. @fwyzard, what do you think?
@cmsbuild please test
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-21d468/21070/summary.html
GPU Comparison Summary
Comparison Summary
Looks good to me.
I left some trivial suggestions, but I don't mind if they are not implemented.
+heterogeneous
+reconstruction
Just in case it's useful, I picked one case, which can be selected with
Ability to more easily sign off GPU PRs based on the comparisons is becoming a dream.
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)
+1
PR description:
This is a technical PR that merges the memory buffers of SiPixelDigisCUDA into a single SoA buffer. Output collections are now also copied to the CPU in a single call. This decreases the memory allocation for each thread and the number of certain API calls, and improves throughput (measurements carried out in the pixeltrack-standalone framework).

Other details:
CMSSW_12_1_0_pre5
cmg-gpu1080
export CUDA_VISIBLE_DEVICES=4

API calls diff (for 9 events):
(measurement done with nvprof, output formatted with awk, sort, diff)
diff view

Memory allocation per event thread:
(measurement done with nvidia-smi while running cmsRun)
These measurements show that the decrease in per-event memory allocation scales with the number of threads and is approximately 39%.

Throughput
(in the pixeltrack-standalone framework, applying the same changes)
Before average: 1148.5825 events/s
After average: 1171.05 events/s
1171.05 / 1148.5825 - 1 = 1.956% ≈ 2% speedup
PR validation:
The reconstruction remains the same from the physics point of view.
Validation in pixeltrack-standalone framework:
if this PR is a backport please specify the original PR and why you need to backport that PR:
kindly ping @tonydp03 @felicepantaleo @VinInn