Add Adreno GPU target and topi supporting textures with dynamically allocated textures #11161
Conversation
ddfa320 to fb29643
@csullivan Could you please take a look?
Looks great @elvin-n! I've reviewed everything except the schedules, which I will do in a follow-up pass.
Note: as this is a squash, I would suggest use of Co-authored-by in the commit to reflect the co-authorship.
src/runtime/opencl/opencl_common.h
Outdated
@@ -345,6 +345,7 @@ struct BufferDescriptor {
 * e.g. image2d[height=O, width=IHW]
 */
kImage2DWeight,
kTexture2DNHWC,
Note: We can now support arbitrary layouts with transform_layout, which I will suggest we move to. It will require some rework on the TIR lowering. I don't suggest this block these schedules from being upstreamed now, but we should circle back on this soon.
Should we add any AR/TODO into the code?
I like that idea. Something like,
TODO(tvm-team): Uncouple use of storage scope and data layout by using the transform_layout schedule primitive to express the desired texture layout. This will require supporting Nd indices in BufferLoad and BufferStore in CodegenOpenCL, and ensuring Nd allocations for texture are correctly routed to the AllocateTexture packed function in the OpenCL DeviceAPI.
Done
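For reference, a minimal sketch of the transform_layout primitive mentioned above (not part of this PR; the shapes and the exact TE API usage are assumptions based on the layout-transformation work), expressing an NCHW to NCHW4c packing on the schedule rather than through a dedicated storage scope:

```python
import tvm
from tvm import te

# Illustrative shapes only: N=1, C=8, H=W=16.
A = te.placeholder((1, 8, 16, 16), name="A")
B = te.compute(A.shape, lambda n, c, h, w: A[n, c, h, w] + 1.0, name="B")
s = te.create_schedule(B.op)

# Pack the channel dimension of B by a factor of 4 (NCHW -> NCHW4c).
s[B].transform_layout(lambda n, c, h, w: [n, c // 4, h, w, c % 4])

print(tvm.lower(s, [A, B], simple_mode=True))
```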
elif data_layout == "NHWC4c":
    ic = data.shape[3] * data.shape[4]
else:
    # TODO(amalyshe) add proper error raising
Address the TODOs
Done
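A possible shape of the addressed TODO (hedged: the helper name, the preceding NCHW4c branch, and the exact error type are assumptions for illustration, not the PR's final code):

```python
def infer_input_channels(data, data_layout):
    """Hypothetical helper: derive the packed input-channel count from the layout."""
    if data_layout == "NCHW4c":
        return data.shape[1] * data.shape[4]
    if data_layout == "NHWC4c":
        return data.shape[3] * data.shape[4]
    # Fail loudly on layouts the Adreno schedules do not handle.
    raise RuntimeError("Unsupported data layout for Adreno conv2d: %s" % data_layout)
```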
# specific language governing permissions and limitations
# under the License.
# pylint: disable=invalid-name,unused-variable,unused-argument,no-member
"""Conv2D alter op and legalize functions for x86"""
Not x86
done
python/tvm/topi/adreno/utils.py
Outdated
from ..utils import get_const_tuple


def getDiv(value, start):
snake_case to match the rest of the file
done
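For illustration, a snake_case sketch of what such a helper might look like (the semantics, the largest divisor of `value` not exceeding `start`, are an assumption inferred from the name, not taken from the PR):

```python
def get_div(value, start):
    """Hypothetical sketch: return the largest divisor of `value` that is <= `start`."""
    for candidate in range(start, 0, -1):
        if value % candidate == 0:
            return candidate
    return 1


assert get_div(112, 8) == 8
assert get_div(14, 4) == 2
```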
python/tvm/topi/adreno/utils.py
Outdated
----------
out: tuple of the (chunks, block, tail)
"""
tail = trip_count % 4
Use `block` throughout.
done
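A minimal sketch of the chunking described here (hedged: only `tail = trip_count % 4` is visible in the diff; the adjustment of `tail` when the extent divides evenly and the interpretation of `tail` as the number of valid elements in the last chunk are assumptions):

```python
def split_to_chunks(trip_count, block=4):
    """Hypothetical sketch: split an extent into (chunks, block, tail)."""
    tail = trip_count % block
    chunks = trip_count // block
    if tail == 0:
        tail = block
    else:
        chunks += 1
    return chunks, block, tail


assert split_to_chunks(7) == (2, 4, 3)  # last chunk holds 3 valid channels
assert split_to_chunks(8) == (2, 4, 4)  # evenly divisible, no padding needed
```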
python/tvm/topi/adreno/utils.py
Outdated
in_channel_tail: int
  Tail in the latest chunk diffing original number of channels vs blocked one
  If in_channel_tail != in_channel_block:
  original_channels = in_channel_chunks * in_channel_block - in_channel_tail
nit: consider referring to this as padding_tail so that it's clear this isn't the remainder of a floordiv. Anything to make this a little more clear upfront would help; it took me a bit to understand given the current naming convention. Same comment for the filter API below.
tried to do my best
python/tvm/topi/adreno/utils.py
Outdated
def _reorder_data_nchw(*indices):
    condition = []
    condition.append(indices[1] == in_channel_chunks - 1)
    condition.append(indices[4] >= in_channel_tail)
    condition = tvm.tir.all(*condition)
    return tvm.tir.if_then_else(
        condition,
        pad_value,
        Input[indices[0], indices[1] * in_channel_block + indices[4], indices[2], indices[3]],
    )


def _reorder_data_nhwc(*indices):
    condition = []
    condition.append(indices[3] == in_channel_chunks - 1)
    condition.append(indices[4] >= in_channel_tail)
    condition = tvm.tir.all(*condition)
    return tvm.tir.if_then_else(
        condition,
        pad_value,
        Input[indices[0], indices[1], indices[2], indices[3] * in_channel_block + indices[4]],
    )
Note: Explicit buffer layout padding as part of transform_layout is on the roadmap and will appear in an RFC soon. Putting a note here that explicit layout transformations like this should be unnecessary in the future.
added comment and reference to rfc
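For context, a self-contained sketch of how a reorder function like the one above is typically driven by `te.compute` to build the padded NCHW4c tensor (the shapes and packing parameters here are illustrative, not taken verbatim from the PR):

```python
import tvm
from tvm import te

# Illustrative packing parameters: 7 input channels packed into 2 chunks of 4.
in_channel_block = 4
in_channel_chunks = 2
in_channel_tail = 3  # valid channels in the last chunk
pad_value = tvm.tir.const(0.0, "float32")

Input = te.placeholder((1, 7, 8, 8), name="data", dtype="float32")


def _reorder_data_nchw(*indices):
    # Pad positions in the last chunk that fall beyond the original channel count.
    condition = tvm.tir.all(
        indices[1] == in_channel_chunks - 1,
        indices[4] >= in_channel_tail,
    )
    return tvm.tir.if_then_else(
        condition,
        pad_value,
        Input[indices[0], indices[1] * in_channel_block + indices[4], indices[2], indices[3]],
    )


packed = te.compute(
    (1, in_channel_chunks, 8, 8, in_channel_block),
    _reorder_data_nchw,
    name="data_pack",
)
```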
    in_height, in_width, kernel_h, kernel_w, dilation_h, dilation_w, padding, stride_h, stride_w
):
    """
    Expands spatial dimensions to be dividable by factor 4. This will allow us to do extrimely
Typos
-    Expands spatial dimensions to be dividable by factor 4. This will allow us to do extrimely
+    Expands spatial dimensions to be dividable by factor 4. This will allow us
Could you please point out where the typos are?
python/tvm/topi/adreno/utils.py
Outdated
Height of the feature map

in_width: int
    Width of the featrue map
-    Width of the featrue map
+    Width of the feature map
done
python/tvm/topi/adreno/utils.py
Outdated
# certain limitation of the Qualcomm devices. Subject to be determined for certain device
# individually, but until we have access to remote device during compilation, we have to
# define it uniformly for all target devices
limit = 16384
Let us use the Target attributes for this, and specifically use the attribute preprocessor as is done for cuda here. Add image extent to the attribute list for the device api and use it when calling DetectDeviceFlag to query the size limits of the opencl image on the remote device.
I added a new texture_spatial_limit attribute to the opencl target and added it to DeviceAttrKind and runtime_ctypes in Python, but I am not sure that was required, since I don't know how and when to use DetectDeviceFlag. I also have access to texture_spatial_limit in the Python part through tvm.target.Target.current().attrs["texture_spatial_limit"]. I would consider this "addressed", but I need to understand whether my solution is applicable and whether we need the parts related to DeviceAttrKind.
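A small sketch of the access path described here (hedged: it simply reads the default that this PR registers for the opencl target kind; no remote query is involved):

```python
import tvm

target = tvm.target.Target("opencl")
with target:
    # The opencl target kind registers texture_spatial_limit with a default of 16384.
    limit = int(tvm.target.Target.current().attrs["texture_spatial_limit"])
print(limit)  # 16384 unless overridden on the target string
```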
Done
pad_data, kernel = s[conv].op.input_tensors

s[pad_data].compute_inline()
Do you mean to inline the padding here? Your comment above implies that you intend to do otherwise.
It is inlined into the next stage, the cache read for textures:
AT = s.cache_read(pad_data, "global.texture", [conv])
bind_data_copy(s[AT])
If I do not add s[pad_data].compute_inline(), the schedule would not be complete and would complain about missing bindings.
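A minimal sketch of the pattern described in this thread (hedged: not the exact PR schedule; `s` and `conv` are assumed to be an existing TE schedule and its conv2d op, and bind_data_copy is the Adreno helper from python/tvm/topi/adreno/utils.py):

```python
def cache_pad_into_texture(s, conv):
    """Inline padding so it fuses into the copy-to-texture stage instead of a standalone kernel."""
    pad_data, kernel = s[conv].op.input_tensors
    # Inline the padding stage; it is computed inside the texture cache-read stage.
    s[pad_data].compute_inline()
    # Cache the (inlined) padded data into texture memory for the conv consumer.
    AT = s.cache_read(pad_data, "global.texture", [conv])
    # bind_data_copy(s[AT])  # Adreno helper that binds the copy stage to GPU threads
    return AT
```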
from tvm.contrib import graph_runtime


def get_reference(mod, params1, input_shape, inputs):
Common utility shared in other test files, consider adding to the utils subdir.
moved shared functions into utils/adreno_utils.py
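For illustration, a sketch of what such a shared CPU-reference helper might look like after the move (hedged: the name, signature, and use of graph_executor are assumptions; the test file itself imports the older graph_runtime API):

```python
import tvm
from tvm import relay
from tvm.contrib import graph_executor


def get_reference(mod, params, inputs):
    """Hypothetical helper: build for llvm and run on the CPU to obtain reference outputs."""
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="llvm", params=params)
    m = graph_executor.GraphModule(lib["default"](tvm.cpu(0)))
    for name, value in inputs.items():
        m.set_input(name, value)
    m.run()
    return [m.get_output(i).numpy() for i in range(m.get_num_outputs())]
```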
# build module run with opencl and cpu, compare results
def build_run_compare(
Common utility
moved shared functions into utils/adreno_utils.py
@tvm.testing.requires_opencl
def test_conv2d_yolov3_v2_nchw_3c():
Do these tests pass on a local OpenCL device (e.g. with an NVIDIA GPU)? If not, it would be good to skip the tests that require a remote device when the RPC tracker env vars are not set.
I have not verified with an NVIDIA GPU, but they pass successfully on Intel integrated graphics with OpenCL enabled in the platform and in TVM. I need to verify whether the tests run in CI, but I cannot do this due to issues with the GPU build in CI.
@csullivan I looked into the CI test results and got the impression that all OpenCL tests are disabled. It seems we need to enable them in CI, but in a separate PR.
That's accurate, and I agree we can consider enabling them in CI in a separate PR. If you see that these tests pass when running locally and without an RPC tracker, that is sufficient.
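A sketch of the kind of guard discussed here (hedged: the environment-variable names follow the usual TVM RPC tracker convention, and the test name is hypothetical; the exact mechanism used in the PR may differ):

```python
import os

import pytest
import tvm.testing


def tracker_configured():
    """Return True if the RPC tracker env vars used by the remote Adreno tests are set."""
    return "TVM_TRACKER_HOST" in os.environ and "TVM_TRACKER_PORT" in os.environ


@tvm.testing.requires_opencl
@pytest.mark.skipif(
    not tracker_configured(), reason="RPC tracker environment variables are not set"
)
def test_conv2d_remote_only():
    ...  # body would run on the remote Adreno device via RPC
```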
@@ -324,6 +324,7 @@ TVM_REGISTER_TARGET_KIND("opencl", kDLOpenCL)
    .add_attr_option<Bool>("system-lib")
    .add_attr_option<Integer>("max_num_threads", Integer(256))
    .add_attr_option<Integer>("thread_warp_size", Integer(1))
    .add_attr_option<Integer>("texture_spatial_limit", Integer(16384))
Thanks for adding this. An improvement would be to query the remote device using a call to the device api GetAttr using the target attr preprocessor.
I still do not fully understand the usage model. For now I left only the definition of texture_spatial_limit in the opencl target and the access from Python, because adding kTextureSpatialLimit to DeviceAttrKind caused a compilation failure for CUDA, and as I do not fully understand the usage model, I don't know how to fix this properly. I am not sure whether I need to extend CUDA as well for this constant or just ignore it, and if I ignore it, in which place kTextureSpatialLimit should be used.
LGTM with a few final nits
…llocated textures (apache#11161)

* Add Adreno GPU target and topi supporting textures
  - There are 5 compute/schedules: conv2d for NCHW/NHWC, depthwise_conv2d for NCHW/NHWC, average pooling
  - Fix of dynamically allocated textures caching
  - Add texture-nhwc scope
  - Fix issue with codegen of vars having non acceptable symbols
  Co-authored-by: Chris Sullivan <[email protected]>
  Co-authored-by: Egor Churaev <[email protected]>
* Address comments
* Add vectorization into some adreno pool flow
  Co-authored-by: Li <[email protected]>
* Fix adreno tests for running on the opencl host platform
* remove unnecessary kDriverVersion in DeviceAttrKind
* Move utils adreno functinos to separate shared file
* fix black hits

Co-authored-by: Chris Sullivan <[email protected]>
Co-authored-by: Egor Churaev <[email protected]>
Co-authored-by: Li <[email protected]>