Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve distinct by using cuco::static_map::retrieve_all #10916

Merged

Conversation

PointKernel
Copy link
Member

@PointKernel PointKernel commented May 20, 2022

Closes #10909

This PR was intended to fix a bug in the distinct implementation where the stream parameter was not passed when invoking static_map::contains. During the work, @ttnghia Pointed out that the contains + thrust::copy_if logic can be simplified by using static_map::retrieve_all. Finally, the PR fetches a newer version of cuco to utilize retrieve_all and fixes a bug in unit tests where results should be sorted before comparison.

@PointKernel PointKernel added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels May 20, 2022
@PointKernel PointKernel requested a review from a team as a code owner May 20, 2022 20:18
@PointKernel PointKernel self-assigned this May 20, 2022
@PointKernel PointKernel requested a review from cwharris May 20, 2022 20:18
@PointKernel PointKernel requested a review from vuule May 20, 2022 20:18
@github-actions github-actions bot added the CMake CMake build issue label May 22, 2022
@jrhemstad
Copy link
Contributor

rerun tests

@PointKernel PointKernel requested a review from a team as a code owner May 24, 2022 02:19
@PointKernel PointKernel requested a review from ttnghia May 24, 2022 02:19
@PointKernel PointKernel changed the title Fix a bug in distinct: pass stream to map::contains Improve distinct by using cuco::static_map::retrieve_all May 24, 2022
@PointKernel
Copy link
Member Author

Seeing performance regression when using retrieve_all:

  • static_map::contains + thrust::copy_if:
Type NumRows Samples CPU Time Noise GPU Time Noise
I32 10000 4640x 113.619 us 6.63% 108.028 us 4.10%
I32 100000 3552x 146.936 us 5.29% 141.284 us 3.37%
I32 1000000 1168x 437.271 us 1.85% 431.588 us 1.27%
I32 10000000 142x 3.534 ms 0.44% 3.529 ms 0.41%
  • static_map::retrieve_all:
Type NumRows Samples CPU Time Noise GPU Time Noise
I32 10000 1424x 463.748 us 6.15% 457.890 us 6.01%
I32 100000 1408x 500.350 us 6.76% 494.447 us 6.64%
I32 1000000 3152x 760.985 us 26.55% 755.105 us 26.54%
I32 10000000 624x 3.321 ms 0.96% 3.315 ms 0.94%

Will look into profiling tomorrow.

@ttnghia
Copy link
Contributor

ttnghia commented May 24, 2022

Seeing performance regression when using retrieve_all:

Did you get these numbers by pulling the same GIT_TAG for cuco?

@codecov
Copy link

codecov bot commented May 24, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@9e4fc74). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #10916   +/-   ##
===============================================
  Coverage                ?   86.32%           
===============================================
  Files                   ?      144           
  Lines                   ?    22668           
  Branches                ?        0           
===============================================
  Hits                    ?    19569           
  Misses                  ?     3099           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e4fc74...ef695ee. Read the comment docs.

@PointKernel PointKernel removed the bug Something isn't working label May 25, 2022
@jrhemstad
Copy link
Contributor

The perf issue sounds similar to what @cheinger did in #10511

He found explicitly materializing the gather mask improved perf because putting the device-side contains call into the gather increased register pressure too much.

How is the retrieve_all implemented?

@PointKernel
Copy link
Member Author

PointKernel commented May 26, 2022

The perf issue sounds similar to what @cheinger did in #10511

Agreed.

How is the retrieve_all implemented?

retrieve_all is using thrust::copy_if + transform iterator to generate the output since it requires a device functor to transform the content of a hash map slot to a thrust::tuple. Is it worth trying to write our own kernel in cuco instead of using thrust::copy_if? I don't see another way to get rid of this register usage issue otherwise.

@jrhemstad
Copy link
Contributor

Is it worth trying to write our own kernel in cuco instead of using thrust::copy_if

Doubtful. I don't see why there would be an increased register usage problem in this situation. The problem from #10511 was that we were using a contains function as the predicate for a copy_if which is a complex function that uses a lot of registers.

For retrieve_all it should be a incredibly simple predicate function of just checking if the slot is empty. So I think something else is going on.

@PointKernel
Copy link
Member Author

Updates: the performance regression comes from a bug in the current static_map::retrieve_all. Created NVIDIA/cuCollections#169 to fix it.

@PointKernel PointKernel added the improvement Improvement / enhancement to an existing function label May 27, 2022
@PointKernel
Copy link
Member Author

Merging this PR since the performance regression issue will be resolved after merging #10983

@PointKernel
Copy link
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 6eca42f into rapidsai:branch-22.08 May 27, 2022
@PointKernel PointKernel deleted the distinct-missing-parameter branch November 16, 2022 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] cudf::distinct implementation missed stream parameter when calling to key_map.contains
5 participants