[OpenCL] Fix OpenCL get_valid_counts errors due to intrinsic atomic_add #5857

Merged (3 commits) on Jun 30, 2020
31 changes: 31 additions & 0 deletions src/target/source/codegen_opencl.cc
@@ -70,6 +70,11 @@ std::string CodeGenOpenCL::Finish() {
"#endif\n\n";
}

// Enable atomic_add used by get_valid_counts. Only needed for OpenCL < 1.1.
if (enable_atomics_) {
decl_stream << "#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable\n"
"#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable\n\n";
}
return CodeGenC::Finish();
}
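
A quick way to see where the new pragma lands (a sketch, not part of the PR): build any OpenCL kernel and print the generated source. Finish() prepends the atomics pragmas only when the kernel actually calls atomic_add, so a plain elementwise kernel shows the preamble without them. This assumes a TVM build with OpenCL support; names are illustrative.

import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
s = te.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=64)
s[B].bind(bx, te.thread_axis("blockIdx.x"))
s[B].bind(tx, te.thread_axis("threadIdx.x"))
mod = tvm.build(s, [A, B], target="opencl", name="add_one")
# Print the generated OpenCL source, including any extension pragmas
# emitted by CodeGenOpenCL::Finish().
print(mod.imported_modules[0].get_source())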

@@ -224,6 +229,32 @@ std::string CodeGenOpenCL::CastFromTo(std::string value, DataType from, DataType target) {
return os.str();
}

void CodeGenOpenCL::VisitExpr_(const CallNode* op, std::ostream& os) {
if (op->op.same_as(builtin::address_of())) {
// Overload tvm_address_of to add storage scope (e.g. __global).
const LoadNode* load = op->args[0].as<LoadNode>();
CHECK(op->args.size() == 1 && load);
os << "((";
auto it = alloc_storage_scope_.find(load->buffer_var.get());
if (it != alloc_storage_scope_.end()) {
PrintStorageScope(it->second, os);
}
this->PrintType(load->dtype.element_of(), os);
os << " *)" << this->GetVarID(load->buffer_var.get()) << " + ";
this->PrintExpr(load->index, os);
os << ')';
} else if (op->op.same_as(builtin_call_extern_)) {
auto func = Downcast<StringImm>(op->args[0]);
// Enable atomics extension if used.
if (func->value == "atomic_add") {
enable_atomics_ = true;
}
CodeGenC::VisitExpr_(op, os);
} else {
CodeGenC::VisitExpr_(op, os);
}
}

void CodeGenOpenCL::VisitExpr_(const BroadcastNode* op, std::ostream& os) { // NOLINT(*)
std::string v = PrintExpr(op->value);
os << "((";
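Why the address_of overload above matters: in OpenCL C 1.x a pointer type with no address-space qualifier defaults to __private, while the atomics extension declares atomic_add as atomic_add(volatile __global int *p, int val). Before this change the pointer argument was printed as, roughly, ((int *)buf + i), which the OpenCL compiler rejects for a __global buffer; printing the storage scope yields ((__global int *)buf + i) and the call compiles (buffer and index names here are illustrative).
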
3 changes: 3 additions & 0 deletions src/target/source/codegen_opencl.h
@@ -54,13 +54,16 @@ class CodeGenOpenCL final : public CodeGenC {
std::string CastFromTo(std::string value, DataType from, DataType target); // NOLINT(*)

// overload visitor
void VisitExpr_(const CallNode* op, std::ostream& os) final; // NOLINT(*)
void VisitExpr_(const BroadcastNode* op, std::ostream& os) final; // NOLINT(*)
void VisitExpr_(const FloatImmNode* op, std::ostream& os) final; // NOLINT(*)

private:
// whether enable fp16 and fp64 extension
bool enable_fp16_{false};
bool enable_fp64_{false};
// Whether to enable atomics extension.
bool enable_atomics_{false};
};

} // namespace codegen
4 changes: 2 additions & 2 deletions tests/python/relay/test_op_level5.py
@@ -270,8 +270,8 @@ def verify_get_valid_counts(dshape, score_threshold, id_index, score_index):
intrp = relay.create_executor("debug", ctx=ctx, target=target)
out = intrp.evaluate(func)(np_data)
tvm.testing.assert_allclose(out[0].asnumpy(), np_out1, rtol=1e-3, atol=1e-04)
-        # get_valid_count for cuda doesn't do data rearrangement
-        if target == 'cuda':
+        # get_valid_count for cuda, opencl doesn't do data rearrangement
+        if target in ['cuda', 'opencl']:
return

Contributor:
Returning here looks wrong to me. The test in the link below doesn't work for OpenCL either, because we don't do data rearrangement in the GPU NMS implementation.
https://discuss.tvm.ai/t/nms-compile-fails-for-cuda-target-but-works-fine-for-llvm-target/7045/2

Probably we should fix non_max_suppression for GPU first?


trevor-m (Contributor, Author) commented on Jun 25, 2020:
OpenCL uses the same implementation as CUDA. The CUDA implementation of get_valid_counts was changed so that it no longer rearranges its output, because NMS rearranges the data later anyway; this still gives the correct final output for NMS. See #5339.

That NMS failure looks like a separate issue, where the CUDA implementation wasn't fully updated to match the changes made to the CPU implementation in #4312.


Contributor:
Thanks for your explanation. Actually, I was able to build NMS successfully if I reverted the change in #4312.

tvm.testing.assert_allclose(out[1].asnumpy(), np_out2, rtol=1e-3, atol=1e-04)
tvm.testing.assert_allclose(out[2].asnumpy(), np_out3, rtol=1e-3, atol=1e-04)
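
For an end-to-end view of the path this test now covers, a hedged sketch along the lines of verify_get_valid_counts (shapes and thresholds are illustrative, not taken from the test): on "opencl", as on "cuda", only the valid-count output is compared, since the GPU kernels leave the box data unrearranged.

import numpy as np
import tvm
from tvm import relay

dshape = (1, 2500, 6)  # (batch, num_anchors, box attributes)
data = relay.var("data", shape=dshape, dtype="float32")
out = relay.vision.get_valid_counts(data, score_threshold=0.5,
                                    id_index=0, score_index=1)
func = relay.Function([data], out.astuple())

ctx = tvm.context("opencl", 0)
if ctx.exist:
    np_data = np.random.uniform(low=-1, high=1, size=dshape).astype("float32")
    intrp = relay.create_executor("debug", ctx=ctx, target="opencl")
    res = intrp.evaluate(func)(np_data)
    print(res[0].asnumpy())  # valid_count: comparable across all targets
    # res[1] (data) and res[2] (indices) are not rearranged on GPU targets.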
7 changes: 7 additions & 0 deletions topi/python/topi/cuda/nms.py
@@ -34,10 +34,17 @@ def cuda_atomic_add_rule(op):
return tvm.tir.call_pure_extern("int32", "atomicAdd", op.args[0], op.args[1])
raise RuntimeError("only support int32, float32 and float64")

def opencl_atomic_add_rule(op):
if op.dtype == "int32":
return tvm.tir.call_pure_extern("int32", "atomic_add", op.args[0], op.args[1])
raise RuntimeError("only support int32")

tvm.target.intrin.register_intrin_rule(
"cuda", "atomic_add", cuda_atomic_add_rule, override=True)

tvm.target.intrin.register_intrin_rule(
"opencl", "atomic_add", opencl_atomic_add_rule, override=True)

tvm.ir.register_op_attr("tir.atomic_add", "TCallEffectKind", tvm.tir.CallEffectKind.Opaque)

def atomic_add(x, y):
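
The two register_intrin_rule calls above are the whole dispatch mechanism: during codegen, a call to the generic tir.atomic_add intrinsic is rewritten by whichever rule matches the active target (atomicAdd for CUDA, atomic_add for OpenCL, where it also sets enable_atomics_). A hedged illustration of the call shape before rewriting; the variable name is made up, and in get_valid_counts the handle argument actually comes from tir.address_of on a buffer element:

import tvm
import topi.cuda.nms  # noqa: F401  side-effect import: registers the rules above

one = tvm.tir.const(1, "int32")
ptr = tvm.tir.Var("valid_count_ptr", "handle")  # illustrative handle argument
call = tvm.tir.call_intrin("int32", "tir.atomic_add", ptr, one)
# Lowering for target "cuda" rewrites this call into atomicAdd(...); for
# "opencl", into atomic_add(...), which also sets enable_atomics_.
print(call)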