SYCL has a 2 KiB kernel parameter size limit. CUDA typically allows 4 KiB, although it may depend on the device. Complex RHS expressions can exceed this limit pretty easily, particularly on SYCL, and this is in fact the case for one of the GENE kernels. The workaround I added copies the RHS expression gtensor_span object to device memory and passes only the pointer as a kernel parameter. This works, but it can slow things down, particularly for small kernels. While we want to avoid small kernels in general, it's still not great.
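Roughly, the workaround looks like this (a minimal sketch with hypothetical names, not the actual gtensor code; it assumes the expression object is trivially copyable and device-callable):

```cpp
#include <cuda_runtime.h>

// Instead of passing the expression object by value as a kernel argument
// (which counts against the parameter size limit), copy it to device
// memory up front and pass only an 8-byte pointer.
template <typename Expr>
__global__ void assign_via_ptr(const Expr* e, double* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = (*e)(i); // evaluate the RHS expression element by element
  }
}

template <typename Expr>
void launch_via_ptr(const Expr& e, double* d_out, int n)
{
  Expr* d_e;
  cudaMalloc(&d_e, sizeof(Expr));
  cudaMemcpy(d_e, &e, sizeof(Expr), cudaMemcpyHostToDevice); // extra copy per launch
  assign_via_ptr<<<(n + 255) / 256, 256>>>(d_e, d_out, n);
  cudaDeviceSynchronize();
  cudaFree(d_e);
}
```

The extra cudaMemcpy per launch is exactly the overhead that hurts small kernels.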
One way to improve this, suggested by @uphoffc, is to optimize gview and gtensor_storage. At least for the case of a gview around a gtensor, all that needs to be stored is a pointer to the underlying data and the strides: the offset can be folded into the pointer, and multiple layers of strides are not needed (the underlying strides just amount to column-major order and can be assumed). I am not sure how this will work for more complex cases, however, such as composing swapaxes with a slice view.
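For the simple gview-of-gtensor case, the flattened representation might look something like this (hypothetical types, not gtensor's actual classes):

```cpp
// One pointer plus one set of composed strides replaces the nested view
// layers; the view offset is already applied to the base pointer.
template <typename T, int N>
struct flat_view
{
  T* data;        // base pointer with the view offset folded in
  int strides[N]; // effective strides after composing all view ops

  T& operator()(const int (&idx)[N]) const
  {
    int offset = 0;
    for (int d = 0; d < N; d++) {
      offset += idx[d] * strides[d];
    }
    return data[offset];
  }
};
```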
We can also apply the copy-to-device-memory, pass-a-pointer hack only when necessary, using constexpr if (for the SYCL backend, which already requires C++17) or with fancier TMP hackery. Note that nvcc didn't support C++17 until CUDA 11, so we could consider bumping the gtensor requirement at the cost of losing support for CUDA < 11.
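A minimal sketch of the constexpr if dispatch, assuming hypothetical launch helpers and a backend-specific limit constant:

```cpp
#include <cstddef>

// Hypothetical helpers: the normal by-value path and the pointer fallback
// from the sketch above.
template <typename Expr> void launch_by_value(const Expr& e);
template <typename Expr> void launch_via_ptr(const Expr& e);

constexpr std::size_t kParamLimit = 2048; // SYCL's ~2k argument size limit

template <typename Expr>
void launch(const Expr& e)
{
  if constexpr (sizeof(Expr) <= kParamLimit) {
    launch_by_value(e); // fast path: pass the expression as a kernel argument
  } else {
    launch_via_ptr(e);  // fallback: copy to device, pass only a pointer
  }
}
```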
There also seems to be redundancy in the gtensor_span object, which extends from gstrided and contains shape and strides, plus the storage object's size and capacity. It's not clear to me whether this is worth optimizing, though.
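For illustration, the redundancy is roughly this (approximate layout, not the real gtensor definitions):

```cpp
#include <cstddef>

template <typename T, int N>
struct span_layout
{
  // inherited from the gstrided-like base:
  int shape[N];
  int strides[N];
  // from the contained storage object:
  T* data;
  std::size_t size;     // redundant: equals the product of shape
  std::size_t capacity; // meaningless for a non-owning span
};
```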
I am also unsure of the effect of these optimizations - the low-hanging fruit here is simply constexpr if to apply the workaround only when the parameters are really too big. Most small kernels will have a small RHS and won't require the workaround; kernels with an RHS that exceeds the limit will likely be long running, so the extra memcpy should have less impact.

I think the first step here is to apply the constexpr if for the SYCL backend, which has the smallest param limit and already requires C++17; then we can explore more complicated changes over time. The second step is to analyze the GENE kernels to see what view combinations they use, to get a sense of how much effect the gview size optimization would have, e.g. whether special-casing gview of gtensor to avoid storing strides twice would make a significant difference here.
I think for now we still want to support CUDA 10, so global use of C++17 is not an option yet. Using it in a SYCL-specific ifdef'ed block is fine, though. constexpr if can be emulated in C++14, but it's ugly; we haven't been hitting the param limit for CUDA, so I don't think we need to worry about it there for now.
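For reference, the C++14 emulation would be tag dispatch along these lines (hypothetical helper names, same idea as the constexpr if sketch above):

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical helpers: by-value fast path and copy-to-device fallback.
template <typename Expr> void launch_by_value(const Expr& e);
template <typename Expr> void launch_via_ptr(const Expr& e);

template <typename Expr>
void launch_impl(const Expr& e, std::true_type /*fits in params*/)
{
  launch_by_value(e); // pass the expression by value as a kernel argument
}

template <typename Expr>
void launch_impl(const Expr& e, std::false_type /*too big*/)
{
  launch_via_ptr(e); // copy to device memory and pass only a pointer
}

template <typename Expr>
void launch(const Expr& e)
{
  constexpr std::size_t kParamLimit = 2048;
  // Only the selected overload is instantiated for a given Expr, much
  // like constexpr if -- just more boilerplate.
  launch_impl(e,
              std::integral_constant<bool, (sizeof(Expr) <= kParamLimit)>{});
}
```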