Specify different-dtype alias TV alignment #3084
Conversation
!build
@@ -2964,10 +2964,13 @@ class CudaKernelGenerator : private kir::ConstIrVisitor {
    } else {
      indent() << "// Alias Allocation (changing dtype) - "
               << alloc->memoryType() << "\n";
      auto va = kernel_->summary().vectorized_accesses;
      auto it = va.find(tv);
We could instead use the original TV, whose vectorized accesses are guaranteed to be at least as large as this 🤷♂️
That may not work, since the actual alignment size is sizeof(buffer_dtype) * alias_alignment. If we used the alignment of the original tensor, the computed alignment would be larger than what the buffer actually guarantees whenever the original tensor's data type is smaller.
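A minimal sketch of the arithmetic behind this concern, with made-up dtypes and vectorization widths (none of these numbers come from the PR):

```cpp
#include <cstddef>

// Original buffer: float with a vectorized access width of 4 -> the buffer is
// only guaranteed to be sizeof(float) * 4 = 16-byte aligned.
constexpr std::size_t guaranteed_align = sizeof(float) * 4;   // 16

// Alias: double with a vectorized access width of 2 -> the lowered value,
// sizeof(buffer_dtype) * alias_alignment, is 16 and matches the guarantee.
constexpr std::size_t alias_align = sizeof(double) * 2;       // 16

// Reusing the original tensor's width (4) with the alias dtype would claim
// sizeof(double) * 4 = 32 bytes, more than the buffer actually provides.
constexpr std::size_t overstated_align = sizeof(double) * 4;  // 32

static_assert(overstated_align > guaranteed_align,
              "reusing the original tensor's width overstates the guarantee");

int main() {}
```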
This is currently only used by "inner aliasing", in which the dtype sizes, parallelization, and shapes must all match. I'll leave it as is, though, for the reason you stated: in the future we might allow mismatched shapes and dtype sizes subject to some other constraint.
OK, I see. I think what's more important here is that we should make CudaCodeGen as trivial as possible and minimize its coupling with the lowering passes. It's difficult to keep track of all the assumptions made in the earlier phases of code translation. For this particular case, since lowering passes in the alignment size, that value should just be used as is. If some other valid value should be used instead, that decision should be made by lowering and recorded in the kernel summary.
LGTM
See #2934 (comment)
PR #665 allowed us to re-use allocations that have different dtypes. We already check that our aliased tensors do not have vectorized accesses larger than those of the original tensors. However, when the dtypes differ we reinterpret_cast the buffer to a different Array type. Previously we did not specify any alignment in that type's template args, so it assumed an alignment of 1. Since the actual addresses are all still aligned, this did not cause misaligned accesses at runtime. This PR sets the alignment template arg to the vectorized access width of the alias tensor, so that the compiler could hypothetically do some optimizations knowing the address is aligned.