
[GPU] Graph serialization for GPU #13801

Merged: 26 commits, Nov 14, 2022
Conversation

@e-ddykim (Contributor) commented Nov 2, 2022

Details:

  • This PR adds model caching for the GPU plugin as a preview feature.
    • It reduces first-inference latency by skipping the graph optimization passes.
    • The main components to be serialized are primitive_inst and primitive_impl.
    • primitive, program, and program_node are not serialized.
  • To enable it, set the environment variable OV_GPU_CACHE_MODEL to 1.
    • If it is not set, the current kernel caching feature is activated.
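For instance, from a shell (only the variable name comes from this PR; the rest is illustrative):

```shell
# Enable the GPU graph-serialization cache (preview feature from this PR).
export OV_GPU_CACHE_MODEL=1
echo "OV_GPU_CACHE_MODEL=$OV_GPU_CACHE_MODEL"   # prints: OV_GPU_CACHE_MODEL=1
# Leaving the variable unset keeps the existing kernel caching behavior.
```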

Tickets:

  • 57672

@e-ddykim force-pushed the gpu-serial_poc branch 8 times, most recently from 37daf88 to e72ffa4 on November 7, 2022 11:51
@e-ddykim added this to the 2022.3 milestone Nov 7, 2022
@e-ddykim added the "category: GPU" (OpenVINO GPU plugin) label Nov 7, 2022
@e-ddykim marked this pull request as ready for review November 7, 2022 14:43
@e-ddykim requested review from a team as code owners November 7, 2022 14:43
, _impl(nullptr)
, _outputs({memory::ptr()})
, _output_changed(false)
, _mem_allocated(false) {}
Contributor:
Can this flag be recognized by can_be_optimized?

@e-ddykim (Author) replied Nov 9, 2022:

In most cases, can_be_optimized() is true when _mem_allocated is false. But in the case of the "implicit concat" for onednn, can_be_optimized() is false while _mem_allocated is also false.

@yeonbok yeonbok merged commit f488e6c into openvinotoolkit:master Nov 14, 2022
@vladimir-paramuzov (Contributor) left a comment:

@e-ddykim @yeonbok Guys, you merged this PR too quickly, I haven't finished my review yet.
I'm submitting the comments I have at the moment, so please address them in a separate PR.

@@ -901,4 +941,218 @@ std::string primitive_inst::get_implementation_name() const {
return "undef";
}

void primitive_inst::save(cldnn::BinaryOutputBuffer& ob) const {
if (type() == cldnn::data::type_id() ||
(type() == cldnn::mutable_data::type_id() && _impl == nullptr)) {
Contributor:
Why not override the save method for data/mutable_data instead of having a branch in the common impl?

@e-ddykim (Author):

https://github.com/openvinotoolkit/openvino/pull/13986/files#r1027065103
I overrode the save and load methods for data/mutable_data and removed the branch. Thank you.
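The shape of that refactor can be sketched as follows (simplified stand-in types, not the real cldnn classes):

```cpp
#include <string>
#include <vector>

// The data-specific serialization moves from an if-branch inside the common
// save() into a virtual override on the data primitive type.
struct BinaryOutputBuffer {
    std::vector<std::string> fields;
};

struct primitive_inst {
    virtual ~primitive_inst() = default;
    virtual void save(BinaryOutputBuffer& ob) const {
        ob.fields.push_back("common-state"); // state every primitive writes
    }
};

struct data_inst : primitive_inst {
    void save(BinaryOutputBuffer& ob) const override {
        primitive_inst::save(ob);           // reuse the common part
        ob.fields.push_back("data-memory"); // then the data-only payload
    }
};
```

Callers keep invoking save() through the base pointer, so no type check is needed at the call site.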

if (!_mem_allocated) {
for (size_t dep_idx = 0; dep_idx < _deps.size(); ++dep_idx) {
for (size_t m_idx = 0; m_idx < _deps[dep_idx]->_deps.size(); ++m_idx) {
if (get_network().get_engine().is_the_same_buffer(*_outputs[0], *_deps[dep_idx]->_deps[m_idx]->_outputs[0])) {
Contributor:
That logic looks weird and unsafe. If I understand correctly, it assumes that mutable_data is used only as a WA (workaround) for multiple outputs.


@@ -181,6 +181,11 @@ void CompileModelCacheTestBase::run() {
}
Contributor:

Such a big patch with 0 unit tests is unacceptable.

@e-ddykim (Author):

I did not add unit tests that do not use the IE interfaces (export and import), because serialization needs to be saved to a file and then loaded again to see if it works. And the number of newly enabled functional tests related to serialization is more than 200.

@vladimir-paramuzov (Contributor) replied Nov 15, 2022:

I checked those functional tests and can't say that we can be sure everything works fine based on them alone. E.g., if I change CompiledModel::Export to return at the very beginning, then the tests still pass.
Basically, the tests check that 1. the properties are supported and 2. the blob file is created; there is no guarantee that the blob file is actually used or contains the expected content.

> because serialization needs to be saved to a file

Why? As I can see, the objects work with ostream/istream, so you can probably use some stream type that operates in memory.
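The suggestion can be sketched with std::stringstream; save_blob/load_blob here are hypothetical stand-ins for the real export/import entry points:

```cpp
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical stand-ins for export/import routines that take
// std::ostream / std::istream.
void save_blob(std::ostream& os, const std::string& blob) { os << blob; }

std::string load_blob(std::istream& is) {
    return std::string(std::istreambuf_iterator<char>(is), {});
}

// A unit test can round-trip entirely in memory: the stringstream's
// independent get/put positions make it a drop-in replacement for a file.
std::string roundtrip(const std::string& blob) {
    std::stringstream buf;   // in-memory stream instead of a file on disk
    save_blob(buf, blob);
    return load_blob(buf);
}
```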

@e-ddykim (Author):

The caching functional tests run the same case three times:

  1. run without cache at L200
  2. run to create the cache at L219 with i = 0
  3. run with the cache at L219 with i = 1

Then it compares the results 1 vs. 2 and 1 vs. 3 at L223:

compare(originalOutputs, get_plugin_outputs());

But I agree with you that we can't be sure everything works fine based on them alone. I'll add unit tests working in memory as you suggested. Thank you.

@e-ddykim (Author):

I added 38 unit tests for serialization. These tests have export_import in the test case name. Thank you.

}

if (idx == _deps.size())
std::cout << "[get_index_in_deps]: not found" << std::endl;
Contributor:

Why? It should be either removed or changed to exception.
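The exception variant can be sketched like this (the signature is simplified to plain values; the real code walks _deps):

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// A lookup miss here indicates a programming error, so throw instead of
// printing to std::cout and silently continuing.
std::size_t get_index_in_deps(const std::vector<int>& deps, int target) {
    for (std::size_t idx = 0; idx < deps.size(); ++idx) {
        if (deps[idx] == target)
            return idx;
    }
    throw std::runtime_error("[get_index_in_deps]: not found");
}
```

The caller either gets a valid index or an exception; there is no "not found" sentinel to forget to check.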


@@ -762,7 +762,9 @@ void program::cleanup() {
}
}
}
_kernels_cache->reset();

if (_engine.configuration().kernels_cache_path.empty())
Contributor:

Why?


@@ -425,6 +426,9 @@ void primitive_inst::set_arguments() {
}

void primitive_inst::build_deps() {
if (_node == nullptr)
return;
Contributor:

Shouldn't exception be thrown here?

@e-ddykim (Author):

https://github.com/openvinotoolkit/openvino/pull/13986/files#r1024718320
Updated to throw an exception when _node is null. Thank you.

int num_data_nodes;
ib >> num_data_nodes;

_memory_pool->clear_pool_for_network(net_id);
Contributor:

Why is it needed? A new mem pool object is created above, so I think the clear is redundant.


, _internal(false)
, _is_primary_stream(false)
, _reset_arguments(true) {
net_id += 1;
Contributor:

I believe net_id is always 1 here which is unexpected.

@e-ddykim (Author):

https://github.com/openvinotoolkit/openvino/pull/13986/files#r1025253714
I added a new function, get_new_net_id(), to emit a unique id, and applied it to the network ctors. Thank you.
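A minimal sketch of such an id generator (the atomic counter is illustrative; the actual implementation is in the linked PR):

```cpp
#include <atomic>
#include <cstdint>

// A plain `net_id += 1` on a fresh member (as in the original code) always
// yields the same value for every network; a shared atomic counter hands out
// process-wide unique ids instead, and is safe across threads.
static std::atomic<std::uint32_t> net_id_counter{0};

std::uint32_t get_new_net_id() {
    return ++net_id_counter;  // first caller gets 1, then 2, 3, ...
}
```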

std::string type;
std::string _primitive_id;
ib >> type >> _primitive_id;
std::shared_ptr<cldnn::primitive_inst> new_primitive_inst = cldnn::get_type_id(type)->create_instance(*this);
Contributor:

Why do we need to separate data nodes and other node types here?

@e-ddykim (Author):

During deserialization, output memory is allocated whenever a primitive_inst is restored. A primitive_inst that does not allocate memory on its own (_mem_allocated is false) uses the memory address of another primitive_inst. For non-data types there is no problem as long as they are restored in the order of _exec_order. But since data types are not in _exec_order, their memory addresses may not be known unless they are allocated in advance.
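The ordering constraint can be modeled with toy types (Node and the two-phase loop are illustrative, not the real cldnn deserializer):

```cpp
#include <map>
#include <memory>
#include <string>

// Nodes that do not allocate their own memory alias another node's buffer,
// so data nodes (which are absent from the execution order) must have their
// buffers allocated up front, before the remaining nodes are restored.
struct Node {
    bool is_data = false;
    bool mem_allocated = true;
    std::string alias_of;         // non-empty when aliasing another node
    std::shared_ptr<int> buffer;  // stand-in for cldnn memory
};

void restore(std::map<std::string, Node>& nodes) {
    // Phase 1: allocate every data node in advance.
    for (auto& kv : nodes) {
        if (kv.second.is_data)
            kv.second.buffer = std::make_shared<int>(0);
    }
    // Phase 2: restore the rest; aliasing nodes reuse an existing buffer.
    for (auto& kv : nodes) {
        Node& n = kv.second;
        if (n.is_data)
            continue;
        n.buffer = n.mem_allocated ? std::make_shared<int>(0)
                                   : nodes.at(n.alias_of).buffer;
    }
}
```

Collapsing the two phases into one loop would break whenever an aliasing node is visited before the data node it points at.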

@e-ddykim (Author):

> @e-ddykim @yeonbok Guys, you merged this PR too quickly, I haven't finished my review yet. I submit comments that I have at the moment, so please address them in separate PR.

@vladimir-paramuzov Thank you for taking your valuable time to review. I will submit a new PR to address your review comments.

@vladimir-paramuzov (Contributor):

@e-ddykim one more issue in cmake output:

Checking patch include/oneapi/dnnl/dnnl.hpp...
error: while searching for:
    struct desc {
        dnnl_convolution_desc_t data;

        /// Constructs a descriptor for a convolution forward propagation
        /// primitive with bias.
        ///

error: patch failed: include/oneapi/dnnl/dnnl.hpp:4686
error: include/oneapi/dnnl/dnnl.hpp: patch does not apply

@e-ddykim (Author):

> @e-ddykim one more issue in cmake output: (cmake patch error quoted above)

https://github.com/openvinotoolkit/openvino/pull/13986/files#r1021667835
That error message occurs when trying to patch code that has already been patched. I updated it so that the message is not displayed, to prevent confusion.

@e-ddykim deleted the gpu-serial_poc branch February 8, 2024 05:29
Labels
category: GPU OpenVINO GPU plugin

4 participants