-
Notifications
You must be signed in to change notification settings - Fork 914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Address performance regression in semi/anti joins from switching to cuco #9973
Comments
I am not entirely sure this is the same regression, but I am seeing one after (#9666) was merged. I see less invocations of this kernel, but they are ~10x slower than before the PR was merged. In one of the TPCDS @ 3TB query q87 (and possibly others) I see a loss of 4 seconds (before: 14263 ms, after: 18194 ms)
Does this seem to be the same issue as in this ticket? I also tried with an early version of the mixed condition PR and this kernel seems to be back to normal. That said, I don't see us invoking the mixed condition kernels. Perhaps there are general performance improvements here: #9917 that help. |
I realize now that this also means the change (#9666) wasn't merged into that branch. So I don't think the mixed join change will address. I would still like to confirm that the place I saw the regression is expected of this issue. |
The slowdowns that I observed were more in the 1-3.5x range than 10x, so I'm surprised that you're seeing something so significant, but a regression is expected because of this issue. I'm especially surprised because I observed larger slowdowns for smaller data sizes, whereas for larger data sizes the relative slowdown was much smaller (although in absolute terms a 4 second slowdown is of course possible). I wonder (but have no specific evidence to suggest this) whether #9912 may also have led to some slowdown in the specific algorithms used here that is exacerbating the issue. The mixed condition PR is independent and will not change the perf here. |
This issue has been labeled |
Is your feature request related to a problem? Please describe.
cuco::static_map
is currently slower than cudf's internal unordered map. As part of a general move towards relying on cuCollections for data structures, however, we have made the replacement in #9666.Describe the solution you'd like
cuco::static_multimap
was optimized as part of its incorporation into cudf in #8934, but because of the current structure of cuCollections these optimizations have not been ported tocuco::static_map
. While addressing NVIDIA/cuCollections#110 we should profile and make sure that the resulting optimizations are sufficient to close the performance gap.Additional context
The performance loss was deemed acceptable in order to progress faster towards #9695, which will rely on features that
cuco::static_map
offers thatcudf::concurrent_unordered_map
does not.The text was updated successfully, but these errors were encountered: