[AutoScheduler] Remove `max_registers_per_block` in `HardwareParams` #7040
cc @jcf94
@@ -64,8 +64,7 @@ HardwareParams HardwareParamsNode::GetDefaultHardwareParams(const Target& target
 device_api->GetAttr(ctx, tvm::runtime::DeviceAttrKind::kMaxSharedMemoryPerBlock, &ret);
 int max_shared_memory_per_block = ret;

-device_api->GetAttr(ctx, tvm::runtime::DeviceAttrKind::kMaxRegistersPerBlock, &ret);
-int max_registers_per_block = ret;
+int max_local_memory_per_block = INT32_MAX;
Add a comment as the PR description.
src/auto_scheduler/search_task.cc
Outdated
 max_threads_per_block, max_vthread_extent, warp_size);
 } else if (target->kind->device_type == kDLMetal) {
 // Reference: https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf
 // This setting looks working for Metal GPUs later than A10
 int max_shared_memory_per_block = 32 * 1024;
-int max_registers_per_block = 4 * 1024;
+int max_local_memory_per_block = INT32_MAX;
Ditto.
OK, I'll update my PR #7038 after we let this in first.
f9e7def to 26cd727
@comaniac Comments are addressed.
LGTM
Thanks @merrymercy @comaniac
…pache#7040) * [AutoScheduler] Fix hardware params * address comments
Previously, we used `hardware_params->max_registers_per_block`, obtained from the CUDA device query, as the value of `max_local_memory_per_block` in `VerifyGPUCode`. This is wrong: the two are not the same thing.

Luckily, for NVIDIA GPUs, this bug does not affect performance, because `kMaxRegistersPerBlock` returns a very large value, so the check in `VerifyGPUCode` rejects almost nothing.

We have to rename `hardware_params->max_registers_per_block` to the correct name, `hardware_params->max_local_memory_per_block`, so it is more meaningful for other backends.

A better way is to set it to `INT32_MAX` to simply skip this check, because there is no hard limit in the CUDA runtime for this value. Setting it to `INT32_MAX` can enlarge the search space while keeping most of the measured schedules valid.
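
To illustrate why `INT32_MAX` effectively disables the check, here is a minimal sketch. It is not the actual `VerifyGPUCode` implementation: `HardwareLimits` and `PassesLocalMemoryCheck` are hypothetical names, and the check is assumed to be a plain "usage must not exceed the limit" comparison.

```cpp
#include <cstdint>

// Hypothetical stand-in for the one HardwareParams field discussed here.
struct HardwareLimits {
  int max_local_memory_per_block;
};

// Hypothetical simplification of a per-block local memory check:
// a candidate schedule passes only if its usage fits within the limit.
bool PassesLocalMemoryCheck(int64_t local_memory_bytes, const HardwareLimits& limits) {
  return local_memory_bytes <= limits.max_local_memory_per_block;
}

// With a finite limit, large schedules are filtered out before measurement.
// With INT32_MAX, any usage representable as a 32-bit int passes, so the
// check never prunes the search space.
bool RejectedByFiniteLimit(int64_t usage) {
  return !PassesLocalMemoryCheck(usage, HardwareLimits{4 * 1024});
}

bool AcceptedWhenSkipped(int64_t usage) {
  return PassesLocalMemoryCheck(usage, HardwareLimits{INT32_MAX});
}
```

Under these assumptions, a schedule using 64 KiB of local memory is rejected by a 4 KiB limit but accepted once the limit is `INT32_MAX`, which is the sense in which the setting enlarges the search space.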