
cost of param limit size workaround, and optimizing view size #164

Open
bd4 opened this issue Mar 3, 2022 · 1 comment

bd4 commented Mar 3, 2022

SYCL has a 2k kernel parameter size limit. CUDA typically has 4k, although it may depend on the device. Complex RHS expressions can exceed this limit fairly easily, particularly on SYCL, and this is in fact the case for one of the GENE kernels. The workaround I added copies the RHS expression's gtensor_span object to device memory and passes only a pointer as the kernel parameter. This works, but it can slow things down, particularly for small kernels. While we want to avoid small kernels in general, it's still not great.

One way to improve this, suggested by @uphoffc, is to optimize gview and gtensor_storage. At least for the case of a gview around a gtensor, all that needs to be stored is a pointer to the underlying data plus the strides: the offset can be folded into the pointer, and multiple layers of strides are not needed (the underlying strides are just column-major and can be assumed). I am not sure how this would work for more complex cases, however, such as composing swapaxes with a slice view.

We could also apply the copy-to-device-and-pass-pointer workaround only when necessary, using `if constexpr` (for the SYCL backend, which already requires C++17) or with fancier TMP hackery. Note that nvcc didn't support C++17 until CUDA 11, so we could consider bumping the gtensor requirement at the cost of losing support for CUDA < 11.

There also seems to be redundancy in the gtensor_span object, which extends gstrided and contains shape and strides, plus the storage object's size and capacity. It's not clear to me whether this is worth optimizing, though.

I am also unsure of the effect of these optimizations. The low-hanging fruit is simply `if constexpr` to apply the workaround only when the parameters are really too big: most small kernels will have a small RHS and won't require the workaround, and kernels with an RHS that exceeds the limit will likely be long-running, so the extra memcpy should have less impact.

I think the first step is to apply the `if constexpr` for the SYCL backend, which has the smallest param limit and already requires C++17; more complicated changes can be explored over time. The second step is to analyze the GENE kernels to see which view combinations they use, to get a sense of how much gview size optimization would buy, e.g. whether special-casing gview of gtensor to avoid double-storing strides would make a significant difference.

@bd4 bd4 self-assigned this Mar 3, 2022

bd4 commented Mar 3, 2022

I think for now we still want to support CUDA 10, so global use of C++17 is not an option yet. Using it in a SYCL-ifdef'ed block is fine, though. `if constexpr` can be emulated in C++14, but it's ugly; we haven't been hitting the param limit for CUDA, so I think we don't need to worry about it there for now.
