Alias allocations with same size elements despite different dtypes #665
Conversation
Assuming it is safe to do this, I am a bit surprised NVRTC doesn't find this opportunity by itself.
!build
Are you saying my claim in #221 about "reusing registers are useless" is wrong? This is 🤯.
I think so! I was planning to try this for smem tensors, then noticed it has an effect for registers too. Maybe I am missing some reason this is unsafe, which would explain why it is not done automatically.
See #2934 (comment). PR #665 allowed us to re-use allocations that have different dtypes. We already check that aliased tensors do not have vectorized accesses larger than those of the original tensors. However, when the dtypes differ we `reinterpret_cast` the buffer to a different `Array` type. Previously we did not specify any alignment in that type's template args, so an alignment of 1 was assumed. Since the actual addresses are all still aligned, this did not cause misaligned accesses at runtime. This PR sets the alignment template arg to the vectorized access width of the alias tensor, so that the compiler can hypothetically perform optimizations knowing the address is aligned.
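To make the alignment point concrete, here is a minimal sketch assuming an `Array` type modeled loosely on nvFuser's runtime helper; the exact definition, the `align_size` default, and the function name are assumptions for illustration:

```cpp
// Sketch only: an aligned array type modeled on nvFuser's runtime
// helper. The real definition lives in nvFuser's runtime headers.
template <typename scalar_t, int size, int align_size = 1>
struct alignas(sizeof(scalar_t) * align_size) Array {
  scalar_t array[size];
};

// In a real kernel this would be device code; plain C++ suffices to
// show the casts.
void alias_example(float* buf) {
  // Suppose buf points into an aliased allocation that is in fact
  // 16-byte aligned.

  // Before this PR: no alignment template arg, so align_size defaults
  // to 1 and the compiler may only assume single-element (4-byte)
  // alignment through this pointer.
  auto* before = reinterpret_cast<Array<float, 4>*>(buf);

  // After this PR: the vectorized access width (4) is passed as
  // align_size, so the compiler knows the address is 16-byte aligned
  // and can, for example, emit a single 128-bit load.
  auto* after = reinterpret_cast<Array<float, 4, 4>*>(buf);

  (void)before;
  (void)after;
}
```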
Currently the `reuseMemoryAllocations` lowering pass reuses `TensorView`s if they have compatible shapes and vectorization, and the same dtype. This PR replaces the same-dtype condition with a check that the size of each element is the same (see the sketch below).

It is somewhat rare in our test suite that we find opportunities to re-use one buffer for another of a different dtype, but it does happen. For example, the case in `FusionWelfordShmoo_CUDA` corresponding to `testWelford(DataType::ComplexFloat, 1, 320, 256);` changes with this PR: we now re-use two register `TensorView`s whose dtype changes from `int64_t` to `std::complex<float>`.
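A hedged sketch of the relaxed condition; the struct and helper name below are illustrative assumptions rather than nvFuser's actual code, but the element-size comparison is the substance of the change:

```cpp
#include <complex>
#include <cstddef>
#include <cstdint>

// Illustrative stand-in for the per-tensor info the pass compares; the
// real pass also checks shape and vectorization compatibility.
struct AllocInfo {
  std::size_t element_size;  // sizeof one element of the tensor's dtype
};

// Before: the dtypes had to match exactly.
// After: only the element sizes must match (hypothetical helper name).
bool sameElementSize(const AllocInfo& candidate, const AllocInfo& original) {
  return candidate.element_size == original.element_size;
}

// The Welford example above is exactly such a pair: int64_t and
// std::complex<float> are both 8 bytes, so their buffers can alias.
static_assert(sizeof(std::int64_t) == sizeof(std::complex<float>),
              "same element size despite different dtypes");
```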
Re-using shared memory in these cases is likely always beneficial, but the effect of explicitly re-using registers is hard to analyze since the compiler pipeline is so complicated. It is likely rarely harmful, so this behavior is on by default. However, it can be disabled using `DisableOption::ReuseMismatchedTypeRegisters` or `NVFUSER_DISABLE=reuse_mismatched_type_registers`.
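For illustration, a self-contained sketch of the opt-out pattern; the `getenv`-based parsing below is an assumption for demonstration (nvFuser's real options parser handles comma-separated lists and validation):

```cpp
#include <cstdlib>
#include <cstring>
#include <iostream>

// Sketch only: check whether the user disabled register reuse across
// mismatched dtypes via the environment. Substring matching is a
// simplification of real NVFUSER_DISABLE parsing.
static bool reuseMismatchedTypeRegistersDisabled() {
  const char* v = std::getenv("NVFUSER_DISABLE");
  return v != nullptr &&
      std::strstr(v, "reuse_mismatched_type_registers") != nullptr;
}

int main() {
  std::cout << "register reuse across dtypes: "
            << (reuseMismatchedTypeRegistersDisabled() ? "disabled"
                                                       : "enabled")
            << "\n";
  return 0;
}
```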