
Improve buffer allocation and copies for SiPixelDigisCUDA #36176

Merged (13 commits, merged into master on Dec 9, 2021)

Conversation

@czangela (Contributor) commented Nov 19, 2021:

PR description:

This is a technical PR that merges the memory buffers of SiPixelDigisCUDA into a single SoA buffer.
The output collections are now also copied to the CPU in a single call.

This decreases the per-thread memory allocation and the number of certain CUDA API calls, and it improves throughput (measurements were carried out in the pixeltrack-standalone framework).
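For illustration, here is a minimal standalone sketch of the idea (hypothetical names and sizes; the PR's actual code uses cms::cuda::make_device_unique and SiPixelDigisCUDASOAView): one device allocation holds all the uint16_t-sized column slots, the columns that go back to the host sit next to each other, and a single cudaMemcpyAsync transfers them in one call.

```cpp
// Minimal sketch (hypothetical names): one SoA buffer instead of one allocation per
// column. The slot offsets 0/1/3/5 mirror the StorageLocationHost enum quoted in the
// review below (adc: 1 uint16_t slot; clus, pdigi, rawIdArr: 2 uint16_t slots each).
#include <cuda_runtime.h>
#include <cstdint>

struct DigiSoASketch {
  uint16_t* adc;       // ADC of each pixel
  int32_t* clus;       // cluster id of each pixel
  uint32_t* pdigi;     // packed digi of each pixel
  uint32_t* rawIdArr;  // DetId of each pixel
};

int main() {
  const std::size_t n = 48 * 1024;  // assumed maximum number of digis
  const std::size_t slots = 7;      // total column width, in uint16_t units

  uint16_t* store = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&store), n * slots * sizeof(uint16_t));  // one allocation

  DigiSoASketch view;
  view.adc = store;
  view.clus = reinterpret_cast<int32_t*>(store + 1 * n);  // note: misaligned if n is odd (see review below)
  view.pdigi = reinterpret_cast<uint32_t*>(store + 3 * n);
  view.rawIdArr = reinterpret_cast<uint32_t*>(store + 5 * n);

  // ... kernels fill the columns through `view` ...

  // The output columns leave the device in a single copy instead of one copy per column.
  uint16_t* host = nullptr;
  cudaMallocHost(reinterpret_cast<void**>(&host), n * slots * sizeof(uint16_t));
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  cudaMemcpyAsync(host, store, n * slots * sizeof(uint16_t), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  cudaFreeHost(host);
  cudaFree(store);
  cudaStreamDestroy(stream);
  return 0;
}
```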

Other details:

  • release: CMSSW_12_1_0_pre5
  • machine: cmg-gpu1080
  • export CUDA_VISIBLE_DEVICES=4

API call counts for 9 events (measured with nvprof; output formatted with awk, sort, and diff):

| API name | baseline | digiscudasoa_patch_2021_11_17 | diff | per event diff | per reco diff |
| --- | --- | --- | --- | --- | --- |
| cudaEventCreateWithFlags | 218 | 206 | -12 | | |
| cudaEventDestroy | 218 | 206 | -12 | | |
| cudaEventQuery | 1112 | 1032 | -80 | | |
| cudaEventRecord | 1080 | 990 | -90 | | |
| cudaFree | 164 | 160 | -4 | | -4 |
| cudaFreeHost | 299 | 291 | -8 | | -8 |
| cudaGetDevice | 2654 | 2470 | -184 | | |
| cudaHostAlloc | 298 | 290 | -8 | | -8 |
| cudaMalloc | 148 | 144 | -4 | | -4 |
| cudaMemcpyAsync | 490 | 454 | -36 | -4 | |
| cudaSetDevice | 1382 | 1315 | -67 | | |
| cudaStreamWaitEvent | 3 | 6 | +3 | | |

diff view:

```diff
-218 cudaEventCreateWithFlags
+206 cudaEventCreateWithFlags
-218 cudaEventDestroy
+206 cudaEventDestroy
-1112 cudaEventQuery
+1032 cudaEventQuery
-1080 cudaEventRecord
+990 cudaEventRecord
-164 cudaFree
+160 cudaFree
-299 cudaFreeHost
+291 cudaFreeHost
-2654 cudaGetDevice
+2470 cudaGetDevice
-298 cudaHostAlloc
+290 cudaHostAlloc
-148 cudaMalloc
+144 cudaMalloc
-490 cudaMemcpyAsync
+454 cudaMemcpyAsync
-1382 cudaSetDevice
+1315 cudaSetDevice
-3 cudaStreamWaitEvent
+6 cudaStreamWaitEvent
```

Memory allocation per thread/stream:
(measured with nvidia-smi while running cmsRun)

| threads/streams | baseline | digiscudasoa_patch_2021_11_17 | difference |
| --- | --- | --- | --- |
| 1 thread/stream | 611 MiB | 597 MiB | -14 MiB |
| 8 threads/streams | 867 MiB | 751 MiB | -116 MiB |
| difference (7 extra threads) | 256 MiB | 154 MiB | - |
| per thread | 36 MiB | 22 MiB | -14 MiB |

These measurements show that the decrease in memory allocation scales with the number of threads: about 14 MiB out of 36 MiB per extra thread/stream, i.e. approximately 39%.

Throughput
(measured in the pixeltrack-standalone framework with the same changes applied)

```
# before
$ for N in 1 2 3 4; do taskset -c 0-15,32-47 ./cudadev --numberOfThreads 16 --maxEvents 10000; done
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.657476e+00 seconds, throughput 1155.07 events/s.
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.695556e+00 seconds, throughput 1150.01 events/s.
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.730811e+00 seconds, throughput 1145.37 events/s.
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.742207e+00 seconds, throughput 1143.88 events/s.
```

```
# after
$ for N in 1 2 3 4; do taskset -c 0-15,32-47 ./cudadev --numberOfThreads 16 --maxEvents 10000; done
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.489208e+00 seconds, throughput 1177.97 events/s.
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.525742e+00 seconds, throughput 1172.92 events/s.
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.563654e+00 seconds, throughput 1167.73 events/s.
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
Processed 10000 events in 8.579456e+00 seconds, throughput 1165.58 events/s.
```

Before average: 1148.58 events/s
After average: 1171.05 events/s
1171.05 / 1148.58 - 1 = 1.96% ≈ 2% speedup

PR validation:

The reconstruction remains the same from the physics point of view.

Validation in the pixeltrack-standalone framework:

```
$ ./cudadev --histogram --validation --numberOfThreads 16 --maxEvents 10000
Found 1 devices
Processing 10000 events, of which 16 concurrently, with 16 threads.
CountValidator: all 10000 events passed validation
 Average relative track difference 0.000857969 (all within tolerance)
 Average absolute vertex difference 0.0011 (all within tolerance)
Processed 10000 events in 7.489605e+01 seconds, throughput 133.518 events/s.
```


kindly ping @tonydp03 @felicepantaleo @VinInn

@cmsbuild:

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36176/26742

@cmsbuild:

A new Pull Request was created by @czangela for master.

It involves the following packages:

  • CUDADataFormats/SiPixelCluster (heterogeneous, reconstruction)
  • CUDADataFormats/SiPixelDigi (heterogeneous, reconstruction)
  • EventFilter/SiPixelRawToDigi (reconstruction)
  • RecoLocalTracker/SiPixelClusterizer (reconstruction)
  • RecoLocalTracker/SiPixelRecHits (reconstruction)

@jpata, @cmsbuild, @fwyzard, @makortel, @slava77 can you please review it and eventually sign? Thanks.
@mtosi, @OzAmram, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @Martin-Grunewald, @missirol, @dkotlins, @ferencek, @gpetruc, @mmusich, @threus, @tvami this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

Review thread on:

```cpp
// we don't copy local x and y coordinates and module index
enum class StorageLocationHost { ADC = 0, CLUS = 1, PDIGI = 3, RAWIDARR = 5, MAX = 7 };
/*
========================================================================================
```
Comment:

What happens if N is odd? (The 32-bit words would not be aligned...)
I would feel more comfortable reversing the whole storage, so that we keep
clus, adc, id, yy, xx close to each other,
and do one transfer to host of the first part (which is also much better aligned).
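For concreteness, a small standalone illustration (not part of the PR) of the alignment concern: with N uint16_t elements per column, a 32-bit column placed at slot offset k starts 2·k·N bytes into the buffer, which is not a multiple of 4 when both k and N are odd.

```cpp
// Illustration of the odd-N alignment problem for the 32-bit columns.
#include <cstdio>

int main() {
  const int counts[] = {9, 10};   // N: elements per column (odd vs. even)
  const int slots[] = {1, 3, 5};  // CLUS, PDIGI, RAWIDARR slot offsets in uint16_t units
  for (int N : counts) {
    for (int slot : slots) {
      long byteOffset = 2L * slot * N;  // start of the column, in bytes from the buffer start
      std::printf("N=%2d slot=%d byte offset=%3ld 4-byte aligned: %s\n",
                  N, slot, byteOffset, byteOffset % 4 == 0 ? "yes" : "no");
    }
  }
  return 0;
}
```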

Comment:

Is the 128-byte alignment not important for each column then?

Comment:

It should be.

@fwyzard commented Nov 26, 2021:

It looks like the simpler fix would be to round maxFedWords up to the next multiple of 64 in the constructor of SiPixelDigisCUDA:

```diff
-    : m_store(cms::cuda::make_device_unique<uint16_t[]>(
-          maxFedWords * int(SiPixelDigisCUDASOAView::StorageLocation::kMAX), stream)) {
-  auto get16 = [&](SiPixelDigisCUDASOAView::StorageLocation s) { return m_store.get() + int(s) * maxFedWords; };
+    : m_store(cms::cuda::make_device_unique<uint16_t[]>(
+          ((maxFedWords + 63) / 64) * 64 * int(SiPixelDigisCUDASOAView::StorageLocation::kMAX), stream)) {
+  auto get16 = [&](SiPixelDigisCUDASOAView::StorageLocation s) { return m_store.get() + int(s) * ((maxFedWords + 63) / 64) * 64; };
```

This would guarantee that all "columns" are aligned to 128 bytes (assuming the whole buffer is) and that any 32-bit integer does not fall on a 16-bit address.
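The rounding in the suggestion above can be factored into a small helper; here is a sketch (the helper name is hypothetical) padding each column to a multiple of 64 uint16_t elements, i.e. 128 bytes:

```cpp
#include <cstddef>

// Round n up to the next multiple of 64; applied to maxFedWords this keeps every column
// 128-byte aligned (given an aligned buffer) and keeps 32-bit columns off odd 16-bit slots.
constexpr std::size_t roundUpTo64(std::size_t n) { return (n + 63) / 64 * 64; }

static_assert(roundUpTo64(0) == 0, "zero stays zero");
static_assert(roundUpTo64(1) == 64, "small counts pad up to one full block");
static_assert(roundUpTo64(64) == 64, "exact multiples are unchanged");
static_assert(roundUpTo64(65) == 128, "one element over pads to the next block");
```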

Review thread on:

```cpp
cms::cuda::device::unique_ptr<uint16_t[]> moduleInd_d;  // module id of each pixel
cms::cuda::device::unique_ptr<int32_t[]> clus_d;        // cluster id of each pixel
cms::cuda::device::unique_ptr<DeviceConstView> view_d;  // "me" pointer
cms::cuda::device::unique_ptr<uint16_t[]> m_store;
```
Comment:

Just to be on the safe side I would allocate uint32_t (and then use a uint16_t pointer anyhow).
In reality, as long as we use the caching allocator it is absolutely irrelevant (alignment is guaranteed).
So OK, keep it as such (just reverse the components as suggested below).

@VinInn commented Nov 19, 2021:

@cmsbuild , please test.

@VinInn commented Nov 19, 2021:

enable gpu

@cmsbuild:

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36176/26746

@cmsbuild:

Pull request #36176 was updated. @jpata, @cmsbuild, @fwyzard, @makortel, @slava77 can you please check and sign again.

@VinInn commented Nov 19, 2021:

@cmsbuild , please test

@VinInn commented Nov 19, 2021:

By the way: are we protected against ndigi == 0?

Review thread on:

```cpp
// separate product?
cms::cuda::device::unique_ptr<uint32_t[]> pdigi_d;     // packed digi (row, col, adc) of each pixel
cms::cuda::device::unique_ptr<uint32_t[]> rawIdArr_d;  // DetId of each pixel
cms::cuda::host::unique_ptr<SiPixelDigisCUDASOAView> m_view;  // "me" pointer
```
Comment:

The m_view was in device memory before. Was it deemed better to access the pinned host memory directly instead of transferring it? Was passing the View by value considered?

@VinInn commented Nov 20, 2021:

Indeed, there is no need to allocate it on the heap (even less in pinned memory).

Suggested change:

```cpp
SiPixelDigisCUDASOAView m_view;
```

The single elements are still passed to the kernel one by one; in a further cleanup one could consider passing the view to the kernel by value.

Comment:

In PixelRecHitGPUKernel::makeHitsAsync() it is still passed by pointer

```cpp
gpuPixelRecHits::getHits<<<blocks, threadsPerBlock, 0, stream>>>(
    cpeParams, bs_d.data(), digis_d.view(), digis_d.nDigis(), clusters_d.view(), hits_d.view());
```
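For reference, a standalone sketch (hypothetical view and kernels, not the PR's code) of the difference between passing an SoA view to a kernel by pointer and by value; since the view only holds a few pointers and a size, passing it by value removes the need to keep the view object itself in device or pinned host memory.

```cpp
#include <cstdint>

struct DigiViewSketch {  // a view is just a handful of pointers plus a size
  uint16_t* adc;
  int32_t* clus;
  uint32_t n;
};

// By pointer: the view object itself must live in memory the device can read.
__global__ void fillByPointer(DigiViewSketch const* view) {
  auto i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < view->n)
    view->clus[i] = -1;
}

// By value: the view travels in the kernel argument buffer, no extra allocation needed.
// launch: fillByValue<<<blocks, threadsPerBlock, 0, stream>>>(view);
__global__ void fillByValue(DigiViewSketch view) {
  auto i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < view.n)
    view.clus[i] = -1;
}
```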

Comment on lines 39 to 45:

```cpp
uint16_t *xx_;         // local coordinates of each pixel
uint16_t *yy_;
uint16_t *adc_;        // ADC of each pixel
uint16_t *moduleInd_;  // module id of each pixel
int32_t *clus_;        // cluster id of each pixel
uint32_t *pdigi_;
uint32_t *rawIdArr_;
```
Comment:

Public access to non-const pointers, combined with

```cpp
SiPixelDigisCUDASOAView const *view() const { return m_view.get(); }
```

opens the door to mutable access via const functions, because it is easy to make a non-const copy of SiPixelDigisCUDASOAView (now also in host code, because of

```cpp
cms::cuda::host::unique_ptr<SiPixelDigisCUDASOAView> m_view;  // "me" pointer
```

).
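A minimal illustration (hypothetical types, not the actual classes) of the concern: copying the view out of a const accessor silently drops the constness and allows writes through a const data product.

```cpp
#include <cstdint>

struct ViewSketch {
  uint16_t* adc_;  // public pointer-to-non-const member
};

class ProductSketch {
public:
  ViewSketch const* view() const { return &view_; }  // const accessor
private:
  ViewSketch view_;
};

void leaksMutableAccess(ProductSketch const& product) {
  ViewSketch copy = *product.view();  // a plain copy of the view is not const
  copy.adc_[0] = 42;                  // writes through a const data product
}
```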

Comment:

I think we already decided that the question of how best to format and code SoAs and Views is deferred to a further discussion.

Comment:

Data products are required to be const-thread-safe, and we shouldn't deviate from that even temporarily.

Comment from the PR author (@czangela):

This view is used to modify the buffers through it, like TrackingRecHit2DSOAView, so the members should stay non-const. Would making them private and accessing them through member functions resolve the conflict?

Comment:

I'm starting to see @VinInn's point now. The TrackingRecHit2DSOAView appears to suffer from the same problem (it is easy to "leak" mutable access from const methods by just making a copy of the View; I had forgotten we already had such cases). Adding member functions (either const or non-const) that return pointers-to-non-const does not help.

Overhauling "everything" is clearly out of scope of this PR. Still, the earlier SiPixelDigisCUDA did not leak mutable access. Maybe, to be practical and given the precedent of TrackingRecHit2DSOAView, "gaining mutable access by copying the View" could be considered infrequent enough to accept temporarily, and work towards e.g. #35951 as a next step for a better SoA. @fwyzard, what do you think?

@cmsbuild commented Dec 7, 2021:

Pull request #36176 was updated. @jpata, @cmsbuild, @fwyzard, @makortel, @slava77 can you please check and sign again.

@slava77 commented Dec 7, 2021:

@cmsbuild please test

@cmsbuild commented Dec 7, 2021:

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-21d468/21070/summary.html
COMMIT: e832c49
CMSSW: CMSSW_12_3_X_2021-12-07-1100/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/36176/21070/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 5 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19802
  • DQMHistoTests: Total failures: 1438
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 18364
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 42
  • DQMHistoTests: Total histograms compared: 3250608
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3250580
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 41 files compared)
  • Checked 177 log files, 37 edm output root files, 42 DQM output files
  • TriggerResults: no differences found

@fwyzard left a review comment:

Looks good to me.

I left some trivial suggestions, but I don't mind if they are not implemented.

@fwyzard commented Dec 8, 2021:

+heterogeneous

@slava77 commented Dec 9, 2021:

+reconstruction

for #36176 e832c49

  • code changes are in line with the PR description and the follow-up review
  • jenkins tests pass and comparisons with the baseline show small differences in the GPU tracking workflow 11634.506, with a frequency somewhat similar to what was seen recently in "Test regression in GPU WF" #36390

Just in case it's useful, I picked one case, which can be selected with Scan("p.pt():p.eta():p.phi():p.normalizedChi2()", "abs(p.pt()-0.9960846)<0.1&&abs(p.phi()-0.2702972)<0.1", "", 1, 3) after SetAlias("p", "recoTracks_pixelTracks__RECO.obj"):

```
*  Row   * Instance *    p.pt() *   p.eta() *   p.phi() * p.normalizedChi2()
baseline:
*        3 *       76 * 0.9960846 * 2.1197478 * 0.2702972 * 0.7512380
this PR:
*        3 *       74 * 0.9966789 * 2.1197481 * 0.2703375 * 0.6810361
```
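For completeness, the selection quoted above corresponds roughly to the following ROOT session on the Events tree of the output file (the file handling and tree name are assumed here, not given in the comment):

```cpp
// Rough sketch of the ROOT commands implied above (run interactively on the output file).
Events->SetAlias("p", "recoTracks_pixelTracks__RECO.obj");
Events->Scan("p.pt():p.eta():p.phi():p.normalizedChi2()",
             "abs(p.pt()-0.9960846)<0.1 && abs(p.phi()-0.2702972)<0.1", "", 1, 3);
```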

The ability to sign off GPU PRs more easily based on the comparisons is becoming a dream.

@cmsbuild commented Dec 9, 2021:

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@fwyzard commented Dec 9, 2021:

As a cross-check, the speed-up is visible in the usual CMSSW workflow (pixel quadruplets on 2018 data) as well:

[attached: throughput plot and zoomed view]

@perrotta commented Dec 9, 2021:

+1

  • Improved performance
  • Extensively reviewed and finally approved by the relevant gurus
