
Add Adreno GPU target and topi supporting textures with dynamically allocated textures #11161

Merged · 9 commits merged into apache:main on May 13, 2022

Conversation

Contributor

@elvin-n elvin-n commented Apr 28, 2022

  • There are 5 compute/schedules: conv2d for NCHW/NHWC, depthwise_conv2d
    for NCHW/NHWC, average/max pooling
  • Fix caching of dynamically allocated textures
  • Add texture-nhwc scope
  • Fix issue with codegen of vars having unacceptable symbols

@elvin-n elvin-n force-pushed the scout/adreno branch 2 times, most recently from ddfa320 to fb29643 on May 2, 2022 16:24
Contributor Author

elvin-n commented May 2, 2022

@csullivan Could you please take a look?

@csullivan csullivan self-requested a review May 2, 2022 20:36
Contributor

@csullivan csullivan left a comment


Looks great @elvin-n! I've reviewed everything except the schedules, which I will review in a follow-up pass.

Note: as this is a squash, I would suggest using Co-authored-by in the commit message to reflect the co-authorship.

@@ -345,6 +345,7 @@ struct BufferDescriptor {
* e.g. image2d[height=O, width=IHW]
*/
kImage2DWeight,
kTexture2DNHWC,
Contributor

@csullivan csullivan May 2, 2022


Note: We can now support arbitrary layouts with transform_layout, which I will suggest we move to. It will require some rework of the TIR lowering. I don't suggest this block these schedules from being upstreamed now, but we should circle back on this soon.

Contributor Author


Should we add any AR/TODO into the code?

Contributor


I like that idea. Something like,

TODO(tvm-team): Uncouple use of storage scope and data layout by using the transform_layout schedule primitive to express the desired texture layout. This will require supporting Nd indices in BufferLoad and BufferStore in CodegenOpenCL, and ensuring Nd allocations for texture are correctly routed to the AllocateTexture packed function in the OpenCL DeviceAPI.

Contributor Author


Done

elif data_layout == "NHWC4c":
ic = data.shape[3] * data.shape[4]
else:
# TODO(amalyshe) add proper error raising
Contributor


Address the TODOs

Contributor Author


Done
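
For reference, a minimal sketch of how the TODO above could be resolved; the exact exception type and message in the final commit may differ:

    elif data_layout == "NHWC4c":
        ic = data.shape[3] * data.shape[4]
    else:
        # hedged sketch: the merged code may raise a different error type/message
        raise RuntimeError("Unsupported data layout: " + str(data_layout))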

# specific language governing permissions and limitations
# under the License.
# pylint: disable=invalid-name,unused-variable,unused-argument,no-member
"""Conv2D alter op and legalize functions for x86"""
Contributor


Not x86

Contributor Author


done

from ..utils import get_const_tuple


def getDiv(value, start):
Contributor


snake_case to match the rest of the file

Contributor Author


done
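
For illustration, a sketch of the renamed helper, under the assumption that getDiv returns the largest divisor of value that does not exceed start (the actual body in the PR may differ):

    def get_div(value, start):
        """Return the largest divisor of value that does not exceed start."""
        div = 1
        for d in range(start, 0, -1):
            if value % d == 0:
                div = d
                break
        return div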

----------
out: tuple of the (chunks, block, tail)
"""
tail = trip_count % 4
Contributor


Use block throughout

Contributor Author


done
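
A sketch of the splitting logic around this line, assuming tail counts the valid elements in the last chunk (the helper name split_to_chunks is illustrative):

    def split_to_chunks(trip_count, block=4):
        """Split trip_count into (chunks, block, tail).

        tail is the number of valid elements in the last chunk; it equals
        block when trip_count divides evenly.
        """
        tail = trip_count % block
        chunks = trip_count // block
        if tail == 0:
            tail = block
        else:
            chunks += 1
        return chunks, block, tail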

in_channel_tail: int
Tail in the latest chunk diffing original number of channels vs blocked one
If in_channel_tail != in_channel_block:
original_channels = in_channel_chunks * in_channel_block - in_channel_tail
Contributor


nit: consider referring to this as padding_tail so that it's clear this isn't the remainder of a floordiv. Anything to make this a little clearer upfront would help; it took me a bit to understand given the current naming convention. Same comment for the filter API below.

Contributor Author


tried to do my best

Comment on lines 110 to 133
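# In the last channel chunk (index in_channel_chunks - 1), lanes with
# index >= in_channel_tail are padding: they yield pad_value instead of
# reading past the original channel extent.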
def _reorder_data_nchw(*indices):
condition = []
condition.append(indices[1] == in_channel_chunks - 1)
condition.append(indices[4] >= in_channel_tail)
condition = tvm.tir.all(*condition)
return tvm.tir.if_then_else(
condition,
pad_value,
Input[indices[0], indices[1] * in_channel_block + indices[4], indices[2], indices[3]],
)

def _reorder_data_nhwc(*indices):
condition = []
condition.append(indices[3] == in_channel_chunks - 1)
condition.append(indices[4] >= in_channel_tail)
condition = tvm.tir.all(*condition)
return tvm.tir.if_then_else(
condition,
pad_value,
Input[indices[0], indices[1], indices[2], indices[3] * in_channel_block + indices[4]],
)
Contributor


Note: Explicit buffer layout padding as part of transform_layout is on the roadmap and will appear in an RFC soon. Noting here that explicit layout transformations like this should become unnecessary in the future.

Contributor Author


added comment and reference to rfc

in_height, in_width, kernel_h, kernel_w, dilation_h, dilation_w, padding, stride_h, stride_w
):
"""
Expands spatial dimensions to be dividable by factor 4. This will allow us to do extrimely
Contributor


Typos

Suggested change
Expands spatial dimensions to be dividable by factor 4. This will allow us to do extrimely
Expands spatial dimensions to be dividable by factor 4. This will allow us

Contributor Author


Could you please point out where the typos are?

Height of the feature map

in_width: int
Width of the featrue map
Contributor


Suggested change
Width of the featrue map
Width of the feature map

Contributor Author


done

# certain limitation of the Qualcomm devices. Subject to be determined for certain device
# individually, but until we have access to remote device during compilation, we have to
# define it uniformly for all target devices
limit = 16384
Contributor


Let us use the Target attributes for this, and specifically use the attribute preprocessor as is done for cuda here. Add image extent to the attribute list for the device api and use it when calling DetectDeviceFlag to query the size limits of the opencl image on the remote device.

Contributor Author


I added a new texture_spatial_limit attribute to the opencl target and added it to DeviceAttrKind and runtime_ctypes in Python, but I am not sure that was required, since I don't know how and when to use DetectDeviceFlag. I already have access to texture_spatial_limit in the Python part through tvm.target.Target.current().attrs["texture_spatial_limit"].
I would consider this "addressed", but I need to understand whether my solution is applicable and whether we need the parts related to DeviceAttrKind.
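
A minimal sketch of the Python access path described above, assuming an opencl target is in scope; 16384 is the registered default, not a value detected from a remote device:

    import tvm

    with tvm.target.Target("opencl"):
        limit = tvm.target.Target.current().attrs["texture_spatial_limit"]
        print(int(limit))  # 16384 by default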

- There are 5 compute/schedules: conv2d for NCHW/NHWC, depthwise_conv2d
  for NCHW/NHWC, average pooling
- Fix of dynamically allocated textures caching
- Add texture-nhwc scope
- Fix issue with codegen of vars having non acceptable symbols

Co-authored-by: Chris Sullivan <[email protected]>
Co-authored-by: Egor Churaev <[email protected]>
Contributor Author

elvin-n commented May 4, 2022

Note: as this is a squash, I would suggest using Co-authored-by in the commit message to reflect the co-authorship.

Done


pad_data, kernel = s[conv].op.input_tensors

s[pad_data].compute_inline()
Contributor


Are you meaning to inline padding here? Your comment above implies that you intend to do otherwise.

Contributor Author


It is inlined into the next stage, the cache read for textures:

    AT = s.cache_read(pad_data, "global.texture", [conv])
    bind_data_copy(s[AT])

If I do not add s[pad_data].compute_inline(), the schedule is not complete and complains about missing bindings.
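
Condensing the two fragments above, the pattern under discussion looks roughly like this (bind_data_copy is the helper introduced in this PR):

    pad_data, kernel = s[conv].op.input_tensors
    s[pad_data].compute_inline()  # fold padding into the cache-read stage below
    AT = s.cache_read(pad_data, "global.texture", [conv])  # stage input into texture memory
    bind_data_copy(s[AT])  # bind thread axes so the copy stage is fully scheduled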

from tvm.contrib import graph_runtime


def get_reference(mod, params1, input_shape, inputs):
Contributor


Common utility shared in other test files, consider adding to the utils subdir.

Contributor Author


moved shared functions into utils/adreno_utils.py



# build module run with opencl and cpu, compare results
def build_run_compare(
Contributor


Common utility

Contributor Author


moved shared functions into utils/adreno_utils.py



@tvm.testing.requires_opencl
def test_conv2d_yolov3_v2_nchw_3c():
Contributor


Do these tests pass on a local OpenCL device (e.g. with an NVIDIA GPU)? If they require a remote device, it would be good to skip the tests that depend on the RPC tracker env vars when those vars are not set.

Contributor Author


I have not verified with an NVIDIA GPU, but they pass successfully on Intel integrated graphics with OpenCL enabled in the platform and in TVM. I need to verify whether the tests run in CI, but cannot do this yet due to issues with the GPU build in CI.

Contributor Author


@csullivan I looked into the CI test results and got the impression that all OpenCL tests are disabled. It seems we need to enable them in CI, but in a separate PR.

Contributor


That's accurate, and I agree we can consider enabling them in CI in a separate PR. If you see that these tests pass when running locally and without an RPC tracker, that is sufficient.
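
A hedged sketch of the suggested skip condition; the TVM_TRACKER_HOST/TVM_TRACKER_PORT names follow common TVM RPC conventions and are assumptions here, not taken from this PR:

    import os
    import pytest
    import tvm.testing

    @tvm.testing.requires_opencl
    @pytest.mark.skipif(
        "TVM_TRACKER_HOST" not in os.environ or "TVM_TRACKER_PORT" not in os.environ,
        reason="requires an RPC tracker pointing at a remote Adreno device",
    )
    def test_conv2d_remote_only():
        ...  # hypothetical test body running through the RPC tracker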

@@ -324,6 +324,7 @@ TVM_REGISTER_TARGET_KIND("opencl", kDLOpenCL)
.add_attr_option<Bool>("system-lib")
.add_attr_option<Integer>("max_num_threads", Integer(256))
.add_attr_option<Integer>("thread_warp_size", Integer(1))
.add_attr_option<Integer>("texture_spatial_limit", Integer(16384))
Contributor


Thanks for adding this. An improvement would be to query the remote device using a call to the device api GetAttr using the target attr preprocessor.

Contributor Author


I still do not fully understand the usage model. For now I left only the definition of texture_spatial_limit in the opencl target and the access in Python, because adding kTextureSpatialLimit to DeviceAttrKind caused a failure during compilation for cuda, and since I do not fully understand the usage model, I don't know how to fix this properly. It is unclear whether I need to extend cuda for this constant as well, or just ignore it, and if I ignore it, where kTextureSpatialLimit should be used.

Contributor

@csullivan csullivan left a comment


LGTM with a few final nits

@csullivan csullivan merged commit c2d1905 into apache:main May 13, 2022
Contributor

Many thanks for the great work @elvin-n, @echuraev, @lhez. This is merged.

mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request May 16, 2022
…llocated textures (apache#11161)

* Add Adreno GPU target and topi supporting textures

- There are 5 compute/schedules: conv2d for NCHW/NHWC, depthwise_conv2d
  for NCHW/NHWC, average pooling
- Fix of dynamically allocated textures caching
- Add texture-nhwc scope
- Fix issue with codegen of vars having non acceptable symbols

Co-authored-by: Chris Sullivan <[email protected]>
Co-authored-by: Egor Churaev <[email protected]>

* Address comments

* Add vectorization into some adreno pool flow

Co-authored-by: Li <[email protected]>

* Fix adreno tests for running on the opencl host platform

* remove unnecessary kDriverVersion in DeviceAttrKind

* Move utils adreno functinos to separate shared file

* fix black hits

Co-authored-by: Chris Sullivan <[email protected]>
Co-authored-by: Egor Churaev <[email protected]>
Co-authored-by: Li <[email protected]>
shtinsa pushed a commit to Deelvin/tvm that referenced this pull request May 17, 2022
shingjan pushed a commit to shingjan/tvm that referenced this pull request May 17, 2022