
Improve buffer allocation and copies for SiPixelDigisCUDA #36176

Merged (13 commits, merged into master on Dec 9, 2021)

Conversation

@czangela (Contributor) commented Nov 19, 2021:

PR description:

This is a technical PR that merges the memory buffers of SiPixelDigisCUDA into a single SoA buffer.
The output collections are now also copied to the CPU in a single call.

This decreases the per-thread memory allocation and the number of certain CUDA API calls, and it improves throughput (measurements were carried out in the pixeltrack-standalone framework).
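For illustration, here is a minimal standalone sketch of the idea (hypothetical names and sizes; the PR's actual code uses cms::cuda::make_device_unique and SiPixelDigisCUDASOAView): one device allocation holds all the uint16_t-sized column slots, the columns that go back to the host sit next to each other, and a single cudaMemcpyAsync transfers them in one call.

```cpp
// Minimal sketch (hypothetical names): one SoA buffer instead of one allocation per
// column. The slot offsets 0/1/3/5 mirror the StorageLocationHost enum quoted in the
// review below (adc: 1 uint16_t slot; clus, pdigi, rawIdArr: 2 uint16_t slots each).
#include <cuda_runtime.h>
#include <cstdint>

struct DigiSoASketch {
  uint16_t* adc;       // ADC of each pixel
  int32_t* clus;       // cluster id of each pixel
  uint32_t* pdigi;     // packed digi of each pixel
  uint32_t* rawIdArr;  // DetId of each pixel
};

int main() {
  const std::size_t n = 48 * 1024;  // assumed maximum number of digis
  const std::size_t slots = 7;      // total column width, in uint16_t units

  uint16_t* store = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&store), n * slots * sizeof(uint16_t));  // one allocation

  DigiSoASketch view;
  view.adc = store;
  view.clus = reinterpret_cast<int32_t*>(store + 1 * n);  // note: misaligned if n is odd (see review below)
  view.pdigi = reinterpret_cast<uint32_t*>(store + 3 * n);
  view.rawIdArr = reinterpret_cast<uint32_t*>(store + 5 * n);

  // ... kernels fill the columns through `view` ...

  // The output columns leave the device in a single copy instead of one copy per column.
  uint16_t* host = nullptr;
  cudaMallocHost(reinterpret_cast<void**>(&host), n * slots * sizeof(uint16_t));
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaMemcpyAsync(host, store, n * slots * sizeof(uint16_t), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  cudaFreeHost(host);
  cudaFree(store);
  cudaStreamDestroy(stream);
  return 0;
}
```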

Other details:

  • release: CMSSW_12_1_0_pre5
  • machine: cmg-gpu1080
  • export CUDA_VISIBLE_DEVICES=4

API call counts for 9 events (measured with nvprof; output formatted with awk, sort, and diff):

| API name | baseline | digiscudasoa_patch_2021_11_17 | diff | per event diff | per reco diff |
| --- | --- | --- | --- | --- | --- |
| cudaEventCreateWithFlags | 218 | 206 | -12 | | |
| cudaEventDestroy | 218 | 206 | -12 | | |
| cudaEventQuery | 1112 | 1032 | -80 | | |
| cudaEventRecord | 1080 | 990 | -90 | | |
| cudaFree | 164 | 160 | -4 | | -4 |
| cudaFreeHost | 299 | 291 | -8 | | -8 |
| cudaGetDevice | 2654 | 2470 | -184 | | |
| cudaHostAlloc | 298 | 290 | -8 | | -8 |
| cudaMalloc | 148 | 144 | -4 | | -4 |
| cudaMemcpyAsync | 490 | 454 | -36 | -4 | |
| cudaSetDevice | 1382 | 1315 | -67 | | |
| cudaStreamWaitEvent | 3 | 6 | +3 | | |

diff view:

```diff
-218 cudaEventCreateWithFlags
+206 cudaEventCreateWithFlags
-218 cudaEventDestroy
+206 cudaEventDestroy
-1112 cudaEventQuery
+1032 cudaEventQuery
-1080 cudaEventRecord
+990 cudaEventRecord
-164 cudaFree
+160 cudaFree
-299 cudaFreeHost
+291 cudaFreeHost
-2654 cudaGetDevice
+2470 cudaGetDevice
-298 cudaHostAlloc
+290 cudaHostAlloc
-148 cudaMalloc
+144 cudaMalloc
-490 cudaMemcpyAsync
+454 cudaMemcpyAsync
-1382 cudaSetDevice
+1315 cudaSetDevice
-3 cudaStreamWaitEvent
+6 cudaStreamWaitEvent
```

Memory allocation per thread/stream:
(measured with nvidia-smi while running cmsRun)

| threads/streams | baseline | digiscudasoa_patch_2021_11_17 | difference |
| --- | --- | --- | --- |
| 1 thread/stream | 611 MiB | 597 MiB | -14 MiB |
| 8 threads/streams | 867 MiB | 751 MiB | -116 MiB |
| difference (7 extra threads) | 256 MiB | 154 MiB | - |
| per thread | 36 MiB | 22 MiB | -14 MiB |

These measurements show that the decrease in memory allocation scales with the number of threads: about 14 MiB out of 36 MiB per extra thread/stream, i.e. approximately 39%.

Throughput
(measured in the pixeltrack-standalone framework with the same changes applied)

```
# before
$ for N in 1 2 3 4; do taskset -c 0-15,32-47 ./cudadev --numberOfThreads 16 --maxEvents 10000; done
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.657476e+00 seconds, throughput 1155.07 events/s.
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.695556e+00 seconds, throughput 1150.01 events/s.
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.730811e+00 seconds, throughput 1145.37 events/s.
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.742207e+00 seconds, throughput 1143.88 events/s.
```

```
# after
$ for N in 1 2 3 4; do taskset -c 0-15,32-47 ./cudadev --numberOfThreads 16 --maxEvents 10000; done
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.489208e+00 seconds, throughput 1177.97 events/s.
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.525742e+00 seconds, throughput 1172.92 events/s.
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.563654e+00 seconds, throughput 1167.73 events/s.
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.579456e+00 seconds, throughput 1165.58 events/s.
```

Before average: 1148.58 events/s
After average: 1171.05 events/s
1171.05 / 1148.58 - 1 = 1.96% ≈ 2% speedup

PR validation:

The reconstruction remains the same from the physics point of view.

Validation in the pixeltrack-standalone framework:

```
$ ./cudadev --histogram --validation --numberOfThreads 16 --maxEvents 10000
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
CountValidator: all 10000 events passed validation
 Average relative track difference 0.000857969 (all within tolerance)
 Average absolute vertex difference 0.0011 (all within tolerance)
Processed 10000 events in 7.489605e+01 seconds, throughput 133.518 events/s.
```


kindly ping @tonydp03 @felicepantaleo @VinInn

@cmsbuild:

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36176/26742

@cmsbuild:

A new Pull Request was created by @czangela for master.

It involves the following packages:

  • CUDADataFormats/SiPixelCluster (heterogeneous, reconstruction)
  • CUDADataFormats/SiPixelDigi (heterogeneous, reconstruction)
  • EventFilter/SiPixelRawToDigi (reconstruction)
  • RecoLocalTracker/SiPixelClusterizer (reconstruction)
  • RecoLocalTracker/SiPixelRecHits (reconstruction)

@jpata, @cmsbuild, @fwyzard, @makortel, @slava77 can you please review it and eventually sign? Thanks.
@mtosi, @OzAmram, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @Martin-Grunewald, @missirol, @dkotlins, @ferencek, @gpetruc, @mmusich, @threus, @tvami this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

Review thread on:

```cpp
// we don't copy local x and y coordinates and module index
enum class StorageLocationHost { ADC = 0, CLUS = 1, PDIGI = 3, RAWIDARR = 5, MAX = 7 };
/*
========================================================================================
```
Comment:

What happens if N is odd? (The 32-bit words would not be aligned...)
I would feel more comfortable reversing the whole storage, so that we keep
clus, adc, id, yy, xx close to each other,
and do one transfer to host of the first part (which is also much better aligned).
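For concreteness, a small standalone illustration (not part of the PR) of the alignment concern: with N uint16_t elements per column, a 32-bit column placed at slot offset k starts 2·k·N bytes into the buffer, which is not a multiple of 4 when both k and N are odd.

```cpp
// Illustration of the odd-N alignment problem for the 32-bit columns.
#include <cstdio>

int main() {
  const int counts[] = {9, 10};   // N: elements per column (odd vs. even)
  const int slots[] = {1, 3, 5};  // CLUS, PDIGI, RAWIDARR slot offsets in uint16_t units
  for (int N : counts) {
    for (int slot : slots) {
      long byteOffset = 2L * slot * N;  // start of the column, in bytes from the buffer start
      std::printf("N=%2d slot=%d byte offset=%3ld 4-byte aligned: %s\n",
                  N, slot, byteOffset, byteOffset % 4 == 0 ? "yes" : "no");
    }
  }
  return 0;
}
```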

Comment:

Is the 128-byte alignment not important for each column then?

Comment:

It should be.

@fwyzard commented Nov 26, 2021:

It looks like the simpler fix would be to round maxFedWords up to the next multiple of 64 in the constructor of SiPixelDigisCUDA:

```diff
-    : m_store(cms::cuda::make_device_unique<uint16_t[]>(
-          maxFedWords * int(SiPixelDigisCUDASOAView::StorageLocation::kMAX), stream)) {
-  auto get16 = [&](SiPixelDigisCUDASOAView::StorageLocation s) { return m_store.get() + int(s) * maxFedWords; };
+    : m_store(cms::cuda::make_device_unique<uint16_t[]>(
+          ((maxFedWords + 63) / 64) * 64 * int(SiPixelDigisCUDASOAView::StorageLocation::kMAX), stream)) {
+  auto get16 = [&](SiPixelDigisCUDASOAView::StorageLocation s) { return m_store.get() + int(s) * ((maxFedWords + 63) / 64) * 64; };
```

This would guarantee that all "columns" are aligned to 128 bytes (assuming the whole buffer is) and that any 32-bit integer does not fall on a 16-bit address.
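The rounding in the suggestion above can be factored into a small helper; here is a sketch (the helper name is hypothetical) padding each column to a multiple of 64 uint16_t elements, i.e. 128 bytes:

```cpp
#include <cstddef>

// Round n up to the next multiple of 64; applied to maxFedWords this keeps every column
// 128-byte aligned (given an aligned buffer) and keeps 32-bit columns off odd 16-bit slots.
constexpr std::size_t roundUpTo64(std::size_t n) { return (n + 63) / 64 * 64; }

static_assert(roundUpTo64(0) == 0, "zero stays zero");
static_assert(roundUpTo64(1) == 64, "small counts pad up to one full block");
static_assert(roundUpTo64(64) == 64, "exact multiples are unchanged");
static_assert(roundUpTo64(65) == 128, "one element over pads to the next block");
```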

Review thread on:

```cpp
cms::cuda::device::unique_ptr<uint16_t[]> moduleInd_d;  // module id of each pixel
cms::cuda::device::unique_ptr<int32_t[]> clus_d;        // cluster id of each pixel
cms::cuda::device::unique_ptr<DeviceConstView> view_d;  // "me" pointer
cms::cuda::device::unique_ptr<uint16_t[]> m_store;
```
Comment:

Just to be on the safe side I would allocate uint32_t (and then use a uint16_t pointer anyhow).
In reality, as long as we use the caching allocator it is absolutely irrelevant (alignment is guaranteed).
So OK, keep it as such (just reverse the components as suggested below).

@VinInn commented Nov 19, 2021:

@cmsbuild , please test.

@VinInn commented Nov 19, 2021:

enable gpu

@cmsbuild:

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36176/26746

@cmsbuild:

Pull request #36176 was updated. @jpata, @cmsbuild, @fwyzard, @makortel, @slava77 can you please check and sign again.

@VinInn commented Nov 19, 2021:

@cmsbuild , please test

@VinInn commented Nov 19, 2021:

By the way: are we protected against ndigi == 0?

Review thread on:

```cpp
// separate product?
cms::cuda::device::unique_ptr<uint32_t[]> pdigi_d;     // packed digi (row, col, adc) of each pixel
cms::cuda::device::unique_ptr<uint32_t[]> rawIdArr_d;  // DetId of each pixel
cms::cuda::host::unique_ptr<SiPixelDigisCUDASOAView> m_view;  // "me" pointer
```
Comment:

The m_view was in device memory before. Was it deemed better to access the pinned host memory directly instead of transferring it? Was passing the View by value considered?

@VinInn commented Nov 20, 2021:

Indeed, there is no need to allocate it on the heap (even less in pinned memory).

Suggested change:

```cpp
SiPixelDigisCUDASOAView m_view;
```

The single elements are still passed to the kernel one by one; in a further cleanup one could consider passing the view to the kernel by value.

Comment:

In PixelRecHitGPUKernel::makeHitsAsync() it is still passed by pointer

```cpp
gpuPixelRecHits::getHits<<<blocks, threadsPerBlock, 0, stream>>>(
    cpeParams, bs_d.data(), digis_d.view(), digis_d.nDigis(), clusters_d.view(), hits_d.view());
```
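For reference, a standalone sketch (hypothetical view and kernels, not the PR's code) of the difference between passing an SoA view to a kernel by pointer and by value; since the view only holds a few pointers and a size, passing it by value removes the need to keep the view object itself in device or pinned host memory.

```cpp
#include <cstdint>

struct DigiViewSketch {  // a view is just a handful of pointers plus a size
  uint16_t* adc;
  int32_t* clus;
  uint32_t n;
};

// By pointer: the view object itself must live in memory the device can read.
__global__ void fillByPointer(DigiViewSketch const* view) {
  auto i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < view->n)
    view->clus[i] = -1;
}

// By value: the view travels in the kernel argument buffer, no extra allocation needed.
// launch: fillByValue<<<blocks, threadsPerBlock, 0, stream>>>(view);
__global__ void fillByValue(DigiViewSketch view) {
  auto i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < view.n)
    view.clus[i] = -1;
}
```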

Comment on lines 39 to 45:

```cpp
uint16_t *xx_;         // local coordinates of each pixel
uint16_t *yy_;
uint16_t *adc_;        // ADC of each pixel
uint16_t *moduleInd_;  // module id of each pixel
int32_t *clus_;        // cluster id of each pixel
uint32_t *pdigi_;
uint32_t *rawIdArr_;
```
Comment:

Public access to non-const pointers, combined with

```cpp
SiPixelDigisCUDASOAView const *view() const { return m_view.get(); }
```

opens the door to mutable access via const functions, because it is easy to make a non-const copy of SiPixelDigisCUDASOAView (now also in host code, because of

```cpp
cms::cuda::host::unique_ptr<SiPixelDigisCUDASOAView> m_view;  // "me" pointer
```

).
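A minimal illustration (hypothetical types, not the actual classes) of the concern: copying the view out of a const accessor silently drops the constness and allows writes through a const data product.

```cpp
#include <cstdint>

struct ViewSketch {
  uint16_t* adc_;  // public pointer-to-non-const member
};

class ProductSketch {
public:
  ViewSketch const* view() const { return &view_; }  // const accessor
private:
  ViewSketch view_;
};

void leaksMutableAccess(ProductSketch const& product) {
  ViewSketch copy = *product.view();  // a plain copy of the view is not const
  copy.adc_[0] = 42;                  // writes through a const data product
}
```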

Comment:

I think we already decided that the question of how best to format and code SoAs and Views is deferred to a further discussion.

Comment:

Data products are required to be const-thread-safe, and we shouldn't deviate from that even temporarily.

Comment from the PR author (@czangela):

This view is used to modify the buffers through it, like TrackingRecHit2DSOAView, so the members should stay non-const. Would making them private and accessing them through member functions resolve the conflict?

Comment:

I'm starting to see @VinInn's point now. The TrackingRecHit2DSOAView appears to suffer from the same problem (it is easy to "leak" mutable access from const methods by just making a copy of the View; I had forgotten we already had such cases). Adding member functions (either const or non-const) that return pointers-to-non-const does not help.

Overhauling "everything" is clearly out of scope of this PR. Still, the earlier SiPixelDigisCUDA did not leak mutable access. Maybe, to be practical and given the precedent of TrackingRecHit2DSOAView, "gaining mutable access by copying the View" could be considered infrequent enough to accept temporarily, and work towards e.g. #35951 as a next step for a better SoA. @fwyzard, what do you think?

@cmsbuild commented Dec 7, 2021:

Pull request #36176 was updated. @jpata, @cmsbuild, @fwyzard, @makortel, @slava77 can you please check and sign again.

@slava77 commented Dec 7, 2021:

@cmsbuild please test

@cmsbuild commented Dec 7, 2021:

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-21d468/21070/summary.html
COMMIT: e832c49
CMSSW: CMSSW_12_3_X_2021-12-07-1100/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/36176/21070/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 5 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19802
  • DQMHistoTests: Total failures: 1438
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 18364
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 42
  • DQMHistoTests: Total histograms compared: 3250608
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3250580
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 41 files compared)
  • Checked 177 log files, 37 edm output root files, 42 DQM output files
  • TriggerResults: no differences found

@fwyzard left a review comment:

Looks good to me.

I left some trivial suggestions, but I don't mind if they are not implemented.

@fwyzard commented Dec 8, 2021:

+heterogeneous

@slava77 commented Dec 9, 2021:

+reconstruction

for #36176 e832c49

  • code changes are in line with the PR description and the follow-up review
  • jenkins tests pass and comparisons with the baseline show small differences in the GPU tracking workflow 11634.506, with a frequency somewhat similar to what was seen recently in "Test regression in GPU WF" #36390

Just in case it's useful, I picked one case, which can be selected with Scan("p.pt():p.eta():p.phi():p.normalizedChi2()", "abs(p.pt()-0.9960846)<0.1&&abs(p.phi()-0.2702972)<0.1", "", 1, 3) after SetAlias("p", "recoTracks_pixelTracks__RECO.obj"):

```
*  Row   * Instance *    p.pt() *   p.eta() *   p.phi() * p.normalizedChi2()
baseline:
*        3 *       76 * 0.9960846 * 2.1197478 * 0.2702972 * 0.7512380
this PR:
*        3 *       74 * 0.9966789 * 2.1197481 * 0.2703375 * 0.6810361
```
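For completeness, the selection quoted above corresponds roughly to the following ROOT session on the Events tree of the output file (the file handling and tree name are assumed here, not given in the comment):

```cpp
// Rough sketch of the ROOT commands implied above (run interactively on the output file).
Events->SetAlias("p", "recoTracks_pixelTracks__RECO.obj");
Events->Scan("p.pt():p.eta():p.phi():p.normalizedChi2()",
             "abs(p.pt()-0.9960846)<0.1 && abs(p.phi()-0.2702972)<0.1", "", 1, 3);
```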

The ability to sign off GPU PRs more easily based on the comparisons is becoming a dream.

@cmsbuild commented Dec 9, 2021:

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@fwyzard commented Dec 9, 2021:

As a cross-check, the speed-up is visible in the usual CMSSW workflow (pixel quadruplets on 2018 data) as well:

[attached: throughput plot and zoomed view]

@perrotta commented Dec 9, 2021:

+1

  • Improved performance
  • Extensively reviewed and finally approved by the relevant gurus
