Add support for multiple OpenCL platforms #1345

Merged: 10 commits merged into apache:master from opencl-multi-platforms on Jul 9, 2018

Conversation

@kazum commented Jun 27, 2018

This PR allows using both SDAccel and other OpenCL platforms.

When the target device is 'sdaccel', the runtime tries to use the Xilinx toolchain and to find FPGA devices. If it fails to find an SDAccel environment, it falls back to other OpenCL platforms so that unit tests can exercise the sdaccel backend.

When the target device is 'opencl', the runtime tries to find GPU and CPU devices. The behavior is the same as before.
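
A minimal sketch of this selection logic, assuming the Init(device_types, platform_name) signature that appears later in this PR's diff; TryInitPlatform is a hypothetical helper standing in for the real device-enumeration code:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns true if an OpenCL platform matching
// `platform_name` (empty = any platform) exposes a device of `device_type`.
bool TryInitPlatform(const std::string& device_type,
                     const std::string& platform_name) {
  // The real code enumerates OpenCL platforms/devices here.
  return false;
}

void Init(const std::vector<std::string>& device_types,
          const std::string& platform_name = "") {
  // 'sdaccel' target: something like Init({"accelerator"}, "Xilinx"),
  // with the caller falling back to other platforms when the Xilinx
  // environment is missing.
  // 'opencl' target: Init({"gpu", "cpu"}), the same behavior as before.
  for (const std::string& type : device_types) {
    if (TryInitPlatform(type, platform_name)) return;  // first match wins
  }
}
```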

@kazum force-pushed the opencl-multi-platforms branch 3 times, most recently from 66bee46 to e406f4c (June 27, 2018 19:21)
@kazum force-pushed the opencl-multi-platforms branch from e406f4c to a6bc7b9 (June 27, 2018 19:33)
@kazum commented Jun 27, 2018

@tqchen @tmoreau89 @comaniac Can you review this PR? Thanks.

@comaniac left a comment

The mechanism of using the CPU/GPU to test the HLS C kernel when the SDAccel environment isn't available seems reasonable to me. Only one very minor comment, listed below.

On the other hand, although I also reviewed the runtime part and it looks good to me as well, I don't have a good sense of the best way to organize those devices. Please defer to other reviewers on that part.

@@ -139,7 +139,7 @@ class MemoryAccessVerifier final : protected IRVisitor {

/// Check if a given DLDeviceType/TVMDeviceExtType value denotes GPU device.
static bool IsGPUDevice(int dev_type) {
Contributor:

Should we give this function a better name, since it now checks not only GPU but also FPGA devices?

Contributor Author:

Thanks for your comment. I've introduced IsFPGADevice() instead of modifying IsGPUDevice(). Now we can check that a memory access is not a host access with !(IsGPUDevice() || IsFPGADevice()).
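
A sketch of the resulting split, assuming the DLPack/TVM device-type enums of the time (the values below are illustrative placeholders for the real headers):

```cpp
// Illustrative stand-ins for the dlpack/TVM device-type enums.
enum DeviceType {
  kDLGPU = 2, kDLOpenCL = 4, kDLSDAccel = 6, kDLVulkan = 7, kDLMetal = 8
};

// GPU-like devices (this list is illustrative, not exhaustive).
static bool IsGPUDevice(int dev_type) {
  return dev_type == kDLGPU || dev_type == kDLOpenCL ||
         dev_type == kDLVulkan || dev_type == kDLMetal;
}

// FPGA devices, introduced by this PR.
static bool IsFPGADevice(int dev_type) {
  return dev_type == kDLSDAccel;
}

// The memory-access verifier can then treat an access as a host access
// when !(IsGPUDevice(dev_type) || IsFPGADevice(dev_type)).
```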

  });

  TVM_REGISTER_GLOBAL("module.loadfile_awsxclbin")
  .set_body([](TVMArgs args, TVMRetValue* rv) {
-   *rv = OpenCLModuleLoadFile(args[0], args[1]);
+   *rv = OpenCLModuleLoadFile(args[0], args[1], {"accelerator"}, "Xilinx");
Contributor:

Should we be using "Xilinx" rather than something more specific, like aws_f1 or a specific FPGA device number? I suspect that there might be more than a single Xilinx backend as we start to include more FPGA support.

Contributor Author:

"Xilinx" is the name of the OpenCL platform here. It's common between AWS F1 and other Xilinx FPGAs which support SDAccel.

@tmoreau89:

One question: will this flow support all three of (1) hardware emulation, (2) software emulation, and (3) actual hardware execution on the FPGA? In that case, will the runtime always check for an FPGA device, or only in case (3)? I can imagine launching a regular CPU instance with an FPGA development AMI, instead of an F1 instance, just to test OpenCL synthesis in software/hardware emulation.

Another question is how the different FPGA hardware backends should be handled by the runtime. Right now, are we aiming to support only VU9P-based F1 instances, or are you also planning on supporting other development boards?

Overall, the PR looks acceptable, although I'm not an expert on the TVM runtime. Maybe @tqchen can give a final word on the changes.

@kazum commented Jun 29, 2018

> One question: will this flow support all three of (1) hardware emulation, (2) software emulation, and (3) actual hardware execution on the FPGA? In that case, will the runtime always check for an FPGA device, or only in case (3)? I can imagine launching a regular CPU instance with an FPGA development AMI, instead of an F1 instance, just to test OpenCL synthesis in software/hardware emulation.

This flow supports all three cases. In cases (1) and (2), the Xilinx OpenCL platform tries to read emconfig.json and emulates the devices described in that file. We can use, e.g., non-F1 instances for development that way. If emconfig.json doesn't exist or the Xilinx toolchain is not set up, TVM falls back to other OpenCL platforms.
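
A hedged sketch of how a runtime might detect this setup. XCL_EMULATION_MODE and emconfig.json are standard parts of the Xilinx SDAccel flow; the helper name and the lookup order here are illustrative:

```cpp
#include <cstdlib>
#include <fstream>

// Illustrative helper: true when the SDAccel software/hardware emulation
// environment appears to be configured.
bool SDAccelEmulationConfigured() {
  // Set by the user for emulation runs in the SDAccel flow.
  const char* mode = std::getenv("XCL_EMULATION_MODE");
  if (mode == nullptr) return false;
  // emconfig.json (generated by emconfigutil) describes the emulated
  // devices; the real flow may search other directories as well.
  std::ifstream emconfig("emconfig.json");
  return emconfig.good();
}
```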

> Another question is how the different FPGA hardware backends should be handled by the runtime. Right now, are we aiming to support only VU9P-based F1 instances, or are you also planning on supporting other development boards?

The current implementation should work for all Xilinx FPGAs that support SDAccel. I haven't tested FPGA environments other than AWS F1 yet, though.

@tmoreau89:

@kazum thank you for your answers. One more question: in the case of different SDAccel FPGA targets (i.e., different part numbers), what variable would we need to change to set the device number?

@kazum commented Jun 30, 2018

@tmoreau89 Currently, the runtime always uses the first device, and this is hard-coded in:
https://github.com/dmlc/tvm/blob/a06f520/src/runtime/opencl/opencl_module.cc#L106
Supporting multiple FPGA devices is future work, and I'd be glad to send a PR for it. :)

@tmoreau89:

@kazum great. I think it's good to keep in mind that, by using the same SDAccel flow, we may be targeting multiple FPGA platforms.
I don't have anything to add. @tqchen, if the changes to the runtime look good, can you approve this PR?

  workspace_ = cl::OpenCLWorkspace::Global();
- workspace_->Init();
+ workspace_->Init(device_types, platform_name);
Member:

The current choice of platform is mutually exclusive: once we load an sdaccel file, it is no longer possible to load an OpenCL GPU module, because of the single global singleton used here.

@tqchen commented Jul 3, 2018

Sorry for the delayed review on this; see my comments.

The current way of supporting SDAccel is not desirable, because when both SDAccel and a GPU are present, the global singleton can only be initialized to one platform.

If we indeed want to support sdaccel as a device separate from opencl -- which makes sense, because OpenCL for hardware is quite different -- then we might want to create a separate singleton for SDAccel while reusing the current OpenCL runtime code.

That being said, we may not need the fake-compile feature at the moment, for the same reason: sdaccel is very different from OpenCL.

It seems we just need a way to tell whether sdaccel is available.

@tqchen added the labels "status: review in progress" and "status: need update" (Jul 3, 2018)
@kazum commented Jul 3, 2018

@tqchen Thanks for your comments. I'm okay with separating the SDAccel implementation from the OpenCL code, since the current singleton architecture cannot support multiple FPGA devices either way.

I'm thinking of adding a base class for OpenCL devices and re-implementing the GPU and SDAccel support on top of it. I don't think that requires very large code changes. Is that fine with you?

I'm also okay with dropping the fake-compile feature for now. I'll try to find a better way to test SDAccel code in CI without an SDAccel environment.

@tqchen commented Jul 3, 2018

@kazum It sounds good.

About multiple-FPGA support: I think it is possible to support multiple FPGAs in the current runtime, as long as all of those devices are added; you can then use sdaccel(0) and sdaccel(1) to indicate the devices.

@kazum commented Jul 3, 2018

@tqchen Ah, I see. I'll give it a try, thanks!

@kazum force-pushed the opencl-multi-platforms branch from a2c1051 to 7296f8e (July 5, 2018 19:42)
@kazum commented Jul 5, 2018

@tqchen I've pushed commits to separate the SDAccel runtime and remove the fake-compile feature. Could you take a look? Regarding multiple-FPGA support, I plan to send separate patches after we finish this PR.

The CI system currently fails to create the test binaries, but it looks like it is referencing an old version of libtvm.so. The same tests pass successfully in my environment.

@kazum commented Jul 5, 2018

> The CI system currently fails to create the test binaries, but it looks like it is referencing an old version of libtvm.so.

Sorry, that was caused by a bug of mine, and I've fixed the problem.

@@ -17,5 +17,14 @@ Module OpenCLModuleCreate(
    return codegen::DeviceSourceModuleCreate(data, fmt, fmap, "opencl");
  }

+ Module SDAccelModuleCreate(
Member:

Make SDAccel a subfolder of runtime/opencl, and add an optional configuration flag in CMake (USE_SDACCEL). This is mainly to minimize the runtime binary size.

- const std::shared_ptr<OpenCLWorkspace>& OpenCLWorkspace::Global() {
-   static std::shared_ptr<OpenCLWorkspace> inst = std::make_shared<OpenCLWorkspace>();
+ template <OpenCLPlatform T>
+ const std::shared_ptr<OpenCLWorkspace<T>>& OpenCLWorkspace<T>::Global() {
Member:

The implementation of a template function should go in the header file.

  }

  TVM_REGISTER_GLOBAL("device_api.opencl")
  .set_body([](TVMArgs args, TVMRetValue* rv) {
-   DeviceAPI* ptr = OpenCLWorkspace::Global().get();
+   DeviceAPI* ptr = OpenCLWorkspace<OpenCLPlatform::kGPU>::Global().get();
Member:

move sdaccel into a separate file

@tqchen commented Jul 8, 2018

@kazum I have made some follow-up comments. One general feeling I have about the current change is that templates have been abused a bit.

Actually, we could logically divide things as follows.

Common base, no templates, shared logic:

  OpenCLWorkspace
  OpenCLModuleNode
  OpenCLThreadEntry

Platform-specific subclasses:

  SDAccelWorkspace : OpenCLWorkspace
  SDAccelModuleNode : OpenCLModuleNode

The key problem we might face is that in some cases we need to get back to the global singleton reference. This can likely be resolved by adding a virtual function (GetGlobalWorkspace) and subclassing, as sketched below.
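
A condensed sketch of that shape (DeviceAPI inheritance, all OpenCL state, and the module classes are elided; only the subclassing and the virtual-singleton hook are shown):

```cpp
#include <memory>

class OpenCLWorkspace {
 public:
  virtual ~OpenCLWorkspace() = default;
  // Shared code (e.g. OpenCLModuleNode) calls this to recover the correct
  // global singleton without knowing the concrete platform.
  virtual const std::shared_ptr<OpenCLWorkspace>& GetGlobalWorkspace() {
    return Global();
  }
  static const std::shared_ptr<OpenCLWorkspace>& Global() {
    static auto inst = std::make_shared<OpenCLWorkspace>();
    return inst;
  }
  // ... common OpenCL logic: contexts, queues, memory management ...
};

class SDAccelWorkspace : public OpenCLWorkspace {
 public:
  const std::shared_ptr<OpenCLWorkspace>& GetGlobalWorkspace() override {
    return Global();
  }
  static const std::shared_ptr<OpenCLWorkspace>& Global() {
    static std::shared_ptr<OpenCLWorkspace> inst =
        std::make_shared<SDAccelWorkspace>();
    return inst;
  }
};
```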

My second comment is that we want to optionally build sdaccel as a module, without building both whenever opencl is enabled. This is to minimize the TVM runtime size, which can be important for mobile devices.

@kazum commented Jul 8, 2018

@tqchen Thanks for your comments! I've addressed them.

@tqchen left a comment

some final comments

const char* s = data_.c_str();
size_t len = data_.length();
cl_int err;
cl_program program = clCreateProgramWithSource(workspace_->context, 1, &s, &len, &err);
Member:

It would be helpful to support both CreateProgramWithSource and CreateProgramWithBinary by checking the format; this way, CreateProgram itself does not have to be a diverged function.
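
A hedged sketch of that unified dispatch (the format strings here are illustrative):

```cpp
#include <CL/cl.h>
#include <string>

// Choose the OpenCL entry point from the module format string so that
// CreateProgram itself stays a single, non-diverged function.
cl_program CreateProgram(cl_context ctx, cl_device_id dev,
                         const std::string& fmt, const std::string& data) {
  cl_int err = CL_SUCCESS;
  if (fmt == "cl") {
    // OpenCL C source: hand the text to the driver compiler.
    const char* s = data.c_str();
    size_t len = data.length();
    return clCreateProgramWithSource(ctx, 1, &s, &len, &err);
  }
  // Otherwise treat the payload as a precompiled binary
  // (e.g. an xclbin for SDAccel targets).
  const unsigned char* buf =
      reinterpret_cast<const unsigned char*>(data.data());
  size_t len = data.length();
  cl_int binary_status = CL_SUCCESS;
  return clCreateProgramWithBinary(ctx, 1, &dev, &len, &buf,
                                   &binary_status, &err);
}
```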

std::string source)
: data_(data), fmt_(fmt), fmap_(fmap), source_(source) {}
// destructor
~OpenCLModuleNode() {
Member:

Consider keeping only the declarations in the header and moving the implementations to the .cc file.

@kazum force-pushed the opencl-multi-platforms branch from e224cdc to a1e91c3 (July 8, 2018 22:48)
@kazum commented Jul 8, 2018

@tqchen Addressed, thanks!

}
}

std::shared_ptr<cl::OpenCLWorkspace> OpenCLModuleNode::GetGlobalWorkspace() {
Member:

Change the return type to

const std::shared_ptr<OpenCLWorkspace>&

so that we don't have to copy the shared_ptr every time we refer to it.

@kazum commented Jul 9, 2018

@tqchen Addressed, thanks.

@tqchen left a comment

one last change and it can be merged

- void Init();
+ void Init(const std::vector<std::string>& device_types, const std::string& platform_name = "");
+ virtual void Init() {
+   Init({"gpu", "cpu"});
Member:

Only initialize the GPU, to be consistent with the existing setting.

Contributor Author:

This basically initializes only the GPU, and falls back to the CPU only when no GPU is available. The behavior is consistent with the original one.

Member:

The main problem is that the current opencl schedules assume a GPU, so they may not work when falling back to the CPU.

Contributor Author:

Okay, I'm fine with removing the fallback feature in this PR.

@tqchen merged commit fa2e428 into apache:master on Jul 9, 2018
@tqchen added the "status: accepted" label and removed the "status: need update" and "status: review in progress" labels (Jul 9, 2018)
@tqchen commented Jul 9, 2018

Thanks @kazum, this is merged!

tqchen pushed a commit to tqchen/tvm that referenced this pull request Aug 4, 2018
sergei-mironov pushed a commit to sergei-mironov/tvm that referenced this pull request Aug 8, 2018
@kazum deleted the opencl-multi-platforms branch on August 23, 2018 02:28