Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix collect time metric in CoalesceBatches #729

Merged
merged 4 commits into from
Sep 11, 2020

Conversation

tgravescs
Copy link
Collaborator

fixes #658

GpuCoalesceBatches has a collectTime metric that measures the time a task spent waiting to collect all the input batches. This metric can be incorrect when a task becomes blocked by the GPU semaphore trying to access the GPU. GpuCoalesceBatches only starts the collectTime metric in the next() call, but the GPU semphore may try to be acquired by an upstream node when GpuCoalesceBatches's input iterator's hasNext() call is invoked. Any time spent in the first invocation of the input iterator hasNext() method is not tracked, leading to large discrepancies in time as reported in #622.

Moved the start of this into hasNext which was most straight forward.
Note also needed to move the totalTime metric.

attached is screenshot of the fix where GpuCoalesceBatches properly shows collect batch time and total time very close to the GpuShuffledHashJoin build time. Without the fix the times are very different.

After Fix:
Screen Shot 2020-09-10 at 5 23 25 PM

Before Fix:
Screen Shot 2020-09-10 at 5 26 04 PM

@tgravescs
Copy link
Collaborator Author

build

@tgravescs tgravescs added the bug Something isn't working label Sep 10, 2020
@tgravescs tgravescs added this to the Aug 31 - Sep 11 milestone Sep 10, 2020
@tgravescs tgravescs self-assigned this Sep 10, 2020
@tgravescs tgravescs merged commit be6bb90 into NVIDIA:branch-0.2 Sep 11, 2020
@tgravescs tgravescs deleted the coalesceCollectTime branch September 11, 2020 13:28
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Fix collect time in CoalesceBatches

* Also move totalTime

* switch to use if !isdefined

Signed-off-by: Thomas Graves <[email protected]>

* remove extra newline

Signed-off-by: Thomas Graves <[email protected]>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Fix collect time in CoalesceBatches

* Also move totalTime

* switch to use if !isdefined

Signed-off-by: Thomas Graves <[email protected]>

* remove extra newline

Signed-off-by: Thomas Graves <[email protected]>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#729)

Signed-off-by: spark-rapids automation <[email protected]>

Signed-off-by: spark-rapids automation <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] GpuCoalesceBatches collectTime metric can be underreported
2 participants