support CUDA async memory resource in JNI #9201
Conversation
It would be nice to have at least a smoke test of the new allocator type in RmmTest: one that sets up the allocator, then allocates and frees memory to exercise it. Bonus points if it also sets up the allocator with a small limit and verifies it gets an OOM when it tries to allocate just beyond that size.
Added a smoke test, which will be skipped if CUDA < 11.2.
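For reference, here is a minimal sketch of what such a smoke test exercises, written against rmm's C++ API rather than the actual Java code in RmmTest; the use of `limiting_resource_adaptor` to enforce the small limit is an assumption about how a limit could be checked, not necessarily how the Java binding does it:

```cpp
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/limiting_resource_adaptor.hpp>

#include <cassert>
#include <new>  // std::bad_alloc

int main() {
  // Requires CUDA 11.2+ with a driver that supports cudaMallocAsync;
  // a real test would skip itself on older versions.
  rmm::mr::cuda_async_memory_resource async_mr;

  // Smoke test: allocate and free through the resource to exercise it.
  void* p = async_mr.allocate(1024);
  async_mr.deallocate(p, 1024);

  // OOM check: cap allocations at 1 MiB and request just beyond that.
  rmm::mr::limiting_resource_adaptor<rmm::mr::cuda_async_memory_resource>
      limited{&async_mr, 1 << 20};
  bool threw = false;
  try {
    void* q = limited.allocate((1 << 20) + 1);
    limited.deallocate(q, (1 << 20) + 1);
  } catch (std::bad_alloc const&) {  // rmm::bad_alloc derives from std::bad_alloc
    threw = true;
  }
  assert(threw);
  return 0;
}
```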
Note that we now support Java in CI, so Java PRs should not skip CI.
rerun tests
rerun tests
fyi, you could also experiment with using the arena allocator on top of the async allocator.
Codecov Report

```
@@            Coverage Diff             @@
##           branch-21.10   #9201   +/- ##
===============================================
  Coverage            ?    10.82%
===============================================
  Files               ?       115
  Lines               ?     19166
  Branches            ?         0
===============================================
  Hits                ?      2074
  Misses              ?     17092
  Partials            ?         0
===============================================
```

Continue to review the full report at Codecov.
@gpucibot merge
@jrhemstad Yeah, that's something we can try if it turns out small allocations are too expensive with async.
@jrhemstad filed rapidsai/rmm#868; we need to fix that before we start using the async allocator. He thought it was a quick fix and that it could be included in 21.10. FYI @sameerz
It seems that would circumvent the fragmentation-solving behavior we want from the async allocator. If the arena only allocates large chunks from the async allocator, won't we still have fragmentation within the arena blocks? The async allocator can't resolve it, since it is unaware of the sub-utilization of the allocations it sees.
The per-thread arenas are just caches for small allocations. If CUDA async proves to be slow for small allocations, we can use the arena allocator to speed them up, since a typical job has tons of small allocations. The number of free blocks is now capped in each per-thread arena, so in theory it shouldn't cause too much additional fragmentation. If/when we decide to try this, we can probably tweak the algorithm further.
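To illustrate the layering being discussed, here is a sketch at the rmm C++ level of an arena resource drawing from the async resource. It relies on `arena_memory_resource`'s default arena sizing and is not code from this PR:

```cpp
#include <rmm/mr/device/arena_memory_resource.hpp>
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

int main() {
  // The async resource serves the large, fragmentation-prone allocations...
  rmm::mr::cuda_async_memory_resource async_mr;

  // ...while arenas carved out of it act as caches for small allocations,
  // so the cost of cudaMallocAsync is paid mostly on large requests.
  rmm::mr::arena_memory_resource<rmm::mr::cuda_async_memory_resource> arena_mr{
      &async_mr};

  // Route all RMM device allocations on this device through the arena.
  rmm::mr::set_current_device_resource(&arena_mr);

  // Note: as discussed above, the async resource only sees the arena's coarse
  // chunks, not the small sub-allocations within them.
  return 0;
}
```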
CUDA 11.2 introduced a stream-ordered memory allocator that can potentially resolve memory fragmentation issues. See https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/
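For context, the new allocation mode ultimately maps onto the CUDA 11.2 runtime calls below. This is a generic usage sketch of cudaMallocAsync/cudaFreeAsync, not the JNI code added in this PR:

```cpp
#include <cuda_runtime_api.h>

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // The allocation is ordered with respect to other work on `stream`; memory
  // comes from the device's default memory pool and can be reused across
  // frees without the synchronization cost of cudaMalloc/cudaFree.
  void* ptr = nullptr;
  cudaMallocAsync(&ptr, 1 << 20, stream);

  // ... enqueue kernels that use `ptr` on `stream` ...

  // The free is also stream-ordered: memory returns to the pool once all
  // prior work on `stream` has completed.
  cudaFreeAsync(ptr, stream);

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  return 0;
}
```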