Initial PR for fixing issues on existing TCs for gpufunctest (openvinotoolkit#71)

* Fix typo to set begin/end mask to kernel param correctly for strided_slice

Signed-off-by: Andrew Park <[email protected]>

* Fix exception by trying to reuse input with empty deps and _exec_deps when creating reshape_inst

Signed-off-by: Andrew Park <[email protected]>

* Fix strided_slice to support static shape

Signed-off-by: Andrew Park <[email protected]>

* Fix input layout representation for byxf and nv12 on parameter

Signed-off-by: Andrew Park <[email protected]>

* Fix broadcast size check logic

Signed-off-by: Andrew Park <[email protected]>

* Fix layout initialization to convert tensor(cldnn format) to PartialShape(IE format) for weight reorder

Signed-off-by: Andrew Park <[email protected]>

* Fix kernel data conversion to convert PartialShape to ordered dims with output format

Signed-off-by: Andrew Park <[email protected]>

* Fix to not check whether the new layout is identical to an empty tensor

Signed-off-by: Andrew Park <[email protected]>

* Enable reorder/reshape + gemm cases where op is not FC

Signed-off-by: Andrew Park <[email protected]>

* Update AsyncInferRequest for InferRequestLegacy compatibility

Signed-off-by: Andrew Park <[email protected]>

* Apply PR#11073: update eltwise calc_output_layout function and replace PartialShape::broadcast_merge_into with tensor::max

Signed-off-by: Andrew Park <[email protected]>

* Fix reduce calc_output_layout to apply reduce_axes correctly

Signed-off-by: Andrew Park <[email protected]>

* Fix scatter_update calc_output_layout to get the number of dims from dependencies correctly

Signed-off-by: Andrew Park <[email protected]>

* Fix gather_nd calc_output_layout to calculate the final output tensor

Signed-off-by: Andrew Park <[email protected]>

* Fix calc_body_input_layout to adjust cropped input shape correctly and update operator== for layout comparison

Signed-off-by: Andrew Park <[email protected]>

* Fix PartialShape representation when input rank is 2 or 3 for reshape preprocessing of gemm

Signed-off-by: Andrew Park <[email protected]>

* Add condition to check whether input layout is dynamic or not in gather calc_output_layout

Signed-off-by: Andrew Park <[email protected]>

* Fix ScatterUpdate issue

* Align scatter update axis format with IE

Signed-off-by: Andrew Park <[email protected]>

* Revert "Enable cases to reorder/reshape + gemm  where op is not FC"

This reverts commit 16a60b5.

Signed-off-by: Andrew Park <[email protected]>

* Revert "Update AsyncInferRequest for InferRequestLegacy compatibility"

This reverts commit f57a7e4.

Signed-off-by: Andrew Park <[email protected]>

* Update scatter_update to propagate axis with integer type instead of scatter_update_axis

Signed-off-by: Andrew Park <[email protected]>

* Revert "Fix not to check the new layout is identical empty tensor"

This reverts commit 1215c70.

Signed-off-by: Andrew Park <[email protected]>

Co-authored-by: Ahn, Paul Y <[email protected]>

Final PR for fixing issues on existing TCs for gpufunctest (openvinotoolkit#72)

* Initial integration into InferRequest for RemoteBlob and DynamicBatch

Signed-off-by: Andrew Park <[email protected]>

* Enable DynamicBatch-related logic

Signed-off-by: Andrew Park <[email protected]>

* Fix PartialShape comparison related issues on TensorIterator/LSTMSequenceTest

Signed-off-by: Andrew Park <[email protected]>

* Fix feature representation for slope layout when shape has a dimension = 1

Signed-off-by: Andrew Park <[email protected]>

* Revert "Fix feature representation for slope layout when shape has a dimension = 1"

This reverts commit 1169fbb.

* Revert "Fix PartialShape comparison related issues on TensorIterator/LSTMSequenceTest"

This reverts commit 664175c.

Signed-off-by: Andrew Park <[email protected]>
andrew-k-park authored and yeonbok committed Aug 8, 2022
1 parent d6088e9 commit 855d005
Showing 14 changed files with 495 additions and 102 deletions.
@@ -0,0 +1,36 @@
// Copyright (C) 2018-2022 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#pragma once

#include <string>
#include <map>
#include <cpp_interfaces/impl/ie_infer_async_request_thread_safe_default.hpp>
#include "intel_gpu/plugin/infer_request_legacy.hpp"

namespace ov {
namespace runtime {
namespace intel_gpu {

class AsyncInferRequestLegacy : public InferenceEngine::AsyncInferRequestThreadSafeDefault {
public:
using Parent = InferenceEngine::AsyncInferRequestThreadSafeDefault;
AsyncInferRequestLegacy(const InferRequestLegacy::Ptr &inferRequest,
const InferenceEngine::ITaskExecutor::Ptr& taskExecutor,
const InferenceEngine::ITaskExecutor::Ptr& waitExecutor,
const InferenceEngine::ITaskExecutor::Ptr& callbackExecutor);

~AsyncInferRequestLegacy();

void Infer_ThreadUnsafe() override;
void StartAsync_ThreadUnsafe() override;

private:
InferRequestLegacy::Ptr _inferRequest;
InferenceEngine::ITaskExecutor::Ptr _waitExecutor;
};

} // namespace intel_gpu
} // namespace runtime
} // namespace ov
@@ -78,7 +78,6 @@ class InferRequest : public InferenceEngine::IInferRequestInternal {
bool m_useProfiling = false;
bool m_useStreams = false;
bool m_useExternalQueue = false;
bool is_allocated = false;
std::shared_ptr<Graph> m_graph;

// dynamic batch stuff
@@ -92,19 +91,19 @@ class InferRequest : public InferenceEngine::IInferRequestInternal {

InferenceEngine::Blob::Ptr create_host_blob(const InferenceEngine::TensorDesc& desc,
std::shared_ptr<InferenceEngine::IAllocator> alloc = nullptr);
InferenceEngine::Blob::Ptr create_device_blob(const InferenceEngine::TensorDesc& desc, const cldnn::layout& layout);
InferenceEngine::Blob::Ptr create_device_blob(const InferenceEngine::TensorDesc& desc);

void copy_output_data(cldnn::memory::ptr outputMemory, InferenceEngine::Blob::Ptr bptr, buf_info* bi = nullptr);
void copy_input_data(std::shared_ptr<cldnn::network> network, const cldnn::primitive_id &inputName,
const cldnn::layout& inputLayout, const InferenceEngine::Blob &inputBlob,
buf_info* bi = nullptr);


InferenceEngine::Blob::Ptr create_shared_device_blob(const InferenceEngine::TensorDesc& desc, const cldnn::layout& layout, void* usm_host_mem);
void allocate_inputs();
void allocate_outputs();
void allocate_inputs_dynamic();
void allocate_outputs_dynamic();

void set_input(const std::string& name, const InferenceEngine::Blob::Ptr& data);
void set_output(const std::string& name, const InferenceEngine::Blob::Ptr& data);
InferenceEngine::Blob::Ptr reinterpret_device_blob(InferenceEngine::Blob::Ptr data, const InferenceEngine::TensorDesc& new_desc);

std::map<cldnn::primitive_id, cldnn::network_output> internal_outputs;
@@ -80,8 +80,8 @@ class InferRequestLegacy : public InferenceEngine::IInferRequestInternal {
std::shared_ptr<Graph> m_graph;

// dynamic batch stuff
std::map<std::string, std::vector<buf_info>> batchInputs;
std::map<std::string, std::vector<buf_info>> batchOutputs;
std::map<std::string, std::vector<buf_info_legacy>> batchInputs;
std::map<std::string, std::vector<buf_info_legacy>> batchOutputs;
InferenceEngine::IStreamsExecutor* streamExecutor = nullptr;

void prepare_input(const cldnn::primitive_id &inputName, InferenceEngine::Blob::Ptr &inputBlob,
@@ -92,10 +92,10 @@ class InferRequestLegacy : public InferenceEngine::IInferRequestInternal {
std::shared_ptr<InferenceEngine::IAllocator> alloc = nullptr);
InferenceEngine::Blob::Ptr create_device_blob(const InferenceEngine::TensorDesc& desc, const cldnn::layout& layout);

void copy_output_data(cldnn::memory::ptr outputMemory, InferenceEngine::Blob::Ptr bptr, buf_info* bi = nullptr);
void copy_output_data(cldnn::memory::ptr outputMemory, InferenceEngine::Blob::Ptr bptr, buf_info_legacy* bi = nullptr);
void copy_input_data(std::shared_ptr<cldnn::network> network, const cldnn::primitive_id &inputName,
const cldnn::layout& inputLayout, const InferenceEngine::Blob &inputBlob,
buf_info* bi = nullptr);
buf_info_legacy* bi = nullptr);

InferenceEngine::Blob::Ptr create_shared_device_blob(const InferenceEngine::TensorDesc& desc, const cldnn::layout& layout, void* usm_host_mem);
void allocate_inputs();
44 changes: 30 additions & 14 deletions src/plugins/intel_gpu/src/graph/eltwise.cpp
@@ -31,23 +31,39 @@ layout eltwise_inst::calc_output_layout(eltwise_node const& node, kernel_impl_pa
auto desc = impl_param.typed_desc<eltwise>();
auto output_type = desc->output_data_type ? *desc->output_data_type : input_node_layout.data_type;

ov::PartialShape out_pshape;
auto format = input_node_layout.format;
for (size_t i = 0; i < desc->input_size(); i++) {
if (i == primary_input_idx)
continue;
auto get_output_layout = [&](){
auto format = input_node_layout.format;
if (input_node_layout.is_static()) {
auto size = input_node_layout.get_tensor();
for (size_t i = 0; i < node.inputs_count(); i++) {
if (i == primary_input_idx)
continue;

auto l = impl_param.get_non_padded_input_layout(i);
if (!ov::PartialShape::broadcast_merge_into(out_pshape, l.size, ov::op::AutoBroadcastSpec(ov::op::AutoBroadcastType::NUMPY))) {
IE_THROW() << "incorrect input shapes\n";
auto l = node.input(i).get_non_padded_output_layout();
size = tensor::max(size, l.get_tensor());
if (l.format == format::b_fs_zyx_fsv16) // use optimized 5D
format = format::b_fs_zyx_fsv16;
else if (l.format == format::bs_fs_zyx_bsv16_fsv16)
format = format::bs_fs_zyx_bsv16_fsv16;
}
return layout(output_type, format, size);
} else {
ov::PartialShape out_pshape;
for (size_t i = 0; i < node.inputs_count(); i++) {
auto l = node.input(i).get_non_padded_output_layout();
if (!ov::PartialShape::broadcast_merge_into(out_pshape, l.size, ov::op::AutoBroadcastSpec(ov::op::AutoBroadcastType::NUMPY))) {
IE_THROW() << "incorrect input shapes\n";
}
if (l.format == format::b_fs_zyx_fsv16) // use optimized 5D
format = format::b_fs_zyx_fsv16;
else if (l.format == format::bs_fs_zyx_bsv16_fsv16)
format = format::bs_fs_zyx_bsv16_fsv16;
}
return layout(output_type, format, out_pshape);
}
};

if (l.format == format::b_fs_zyx_fsv16) // use optimized 5D
format = format::b_fs_zyx_fsv16;
else if (l.format == format::bs_fs_zyx_bsv16_fsv16)
format = format::bs_fs_zyx_bsv16_fsv16;
}
auto output_layout = layout(output_type, format, out_pshape);
auto output_layout = get_output_layout();

auto mode = desc->mode;
// list of operations supported for integer types
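The static branch above implements the commit note about replacing PartialShape::broadcast_merge_into with tensor::max: for static shapes the output size is built as a per-dimension maximum over the inputs. Under NUMPY-style broadcasting that max is equivalent to the merge only when each pair of dimensions is equal or contains a 1, so a sanity check is still worth keeping. A minimal standalone sketch of that merge, using plain integer shapes instead of cldnn types (merge_broadcast_dims is a hypothetical helper, not part of the plugin):

// Standalone sketch (not cldnn code): per-dimension shape merge for static
// eltwise shapes. Taking the element-wise max of two dims matches NUMPY
// broadcasting only when the dims are equal or one of them is 1.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <vector>

std::vector<int64_t> merge_broadcast_dims(std::vector<int64_t> a, std::vector<int64_t> b) {
    // Align ranks by prepending 1s, as NUMPY broadcasting does.
    while (a.size() < b.size()) a.insert(a.begin(), 1);
    while (b.size() < a.size()) b.insert(b.begin(), 1);
    std::vector<int64_t> out(a.size());
    for (size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i] && a[i] != 1 && b[i] != 1)
            throw std::runtime_error("incorrect input shapes");
        out[i] = std::max(a[i], b[i]);  // same result as the broadcast merge for valid inputs
    }
    return out;
}

int main() {
    auto out = merge_broadcast_dims({1, 16, 1, 8}, {4, 16, 32, 8});
    for (auto d : out) std::cout << d << ' ';  // prints: 4 16 32 8
    std::cout << '\n';
}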
59 changes: 57 additions & 2 deletions src/plugins/intel_gpu/src/graph/impls/ocl/gemm.cpp
@@ -11,6 +11,8 @@
#include "gemm/gemm_kernel_base.h"
#include "intel_gpu/runtime/error_handler.hpp"

#include "matmul_shape_inference.hpp"

namespace cldnn {
namespace ocl {

@@ -29,8 +31,61 @@ struct gemm_impl : typed_primitive_impl_ocl<gemm> {
auto gemm_optional_params =
get_default_optional_params<kernel_selector::gemm_optional_params>(arg.get_program());

for (size_t i = 1; i < arg.inputs_count(); i++) {
gemm_params.inputs.push_back(convert_data_tensor(impl_param->input_layouts[i]));
auto gemmSpecificPartialShape = [](ov::PartialShape& pshape) {
switch (pshape.rank().get_length()) {
case 2: { // batch, feature representation (rank == 2)
pshape.insert(pshape.begin(), 1ul);
pshape.insert(pshape.begin(), 1ul);
break;
}
case 3 : { // feature representation (rank == 3)
pshape.insert(pshape.begin(), 1, 1ul);
break;
}
}
};
auto output_layout = arg.get_output_layout();
auto output_pshape = output_layout.size;
auto output_rank = output_pshape.rank().get_length();
std::vector<ov::PartialShape> input_shapes;
for (size_t i = 0; i < arg.inputs_count(); i++) {
auto input_layout = arg.input(i).get_output_layout();
auto input_pshape = input_layout.get_partial_shape();
auto input_rank = input_pshape.rank().get_length();
if (input_rank != output_rank || input_rank < 4) {
if (input_rank == 1) {
bool transpose = false;
if (i == 0) {
transpose = arg.get_primitive()->transpose_input0;
input_pshape.insert(input_pshape.begin(), 1);
} else {
transpose = arg.get_primitive()->transpose_input1;
input_pshape.insert(input_pshape.end(), 1);
}
if (transpose) {
std::swap(input_pshape[0], input_pshape[1]);
}
}
if (input_rank < output_rank)
input_pshape.insert(input_pshape.begin(), output_rank - input_rank, 1ul);

gemmSpecificPartialShape(input_pshape);
}
input_layout.size = input_pshape;
input_shapes.push_back(input_pshape);
if (i == 0)
gemm_params.inputs[0] = convert_data_tensor(input_layout);
else
gemm_params.inputs.push_back(convert_data_tensor(input_layout));
}
if (output_rank < 4) {
ov::op::v0::MatMul op;
op.set_transpose_a(arg.get_primitive()->transpose_input0);
op.set_transpose_b(arg.get_primitive()->transpose_input1);
std::vector<ov::PartialShape> output_shapes = {ov::PartialShape()};
shape_infer(&op, input_shapes, output_shapes);
output_layout.size = output_shapes[0];
gemm_params.outputs[0] = convert_data_tensor(output_layout);
}

gemm_params.alpha = desc->alpha;
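The new code above aligns MatMul input ranks before building kernel parameters: a rank-1 input gains a unit dimension (prepended for input 0, appended for input 1, then swapped if that input is transposed), lower-rank inputs are padded with leading 1s up to the output rank, and rank-2/3 shapes are lifted to the 4-D form the GEMM kernels expect; when the output rank is below 4, the output shape is re-inferred with ov::op::v0::MatMul shape inference. A standalone sketch of the input alignment only, assuming plain integer shapes (align_gemm_input is a hypothetical helper, not the plugin API):

// Sketch of GEMM input-rank alignment, mirroring the diff above with plain
// integer shapes instead of ov::PartialShape.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

using Shape = std::vector<int64_t>;

Shape align_gemm_input(Shape in, size_t input_idx, size_t output_rank, bool transposed) {
    if (in.size() == 1) {
        // A 1-D input is treated as a row (input 0) or column (input 1) vector.
        if (input_idx == 0) in.insert(in.begin(), 1);
        else                in.push_back(1);
        if (transposed) std::swap(in[0], in[1]);
    }
    // Pad with leading 1s up to the output rank.
    while (in.size() < output_rank) in.insert(in.begin(), 1);
    // Lift rank-2/3 shapes to the 4-D form the kernels use.
    while (in.size() < 4) in.insert(in.begin(), 1);
    return in;
}

int main() {
    Shape s = align_gemm_input({64}, /*input_idx=*/1, /*output_rank=*/2, /*transposed=*/false);
    for (auto d : s) std::cout << d << ' ';  // prints: 1 1 64 1
    std::cout << '\n';
}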
48 changes: 41 additions & 7 deletions src/plugins/intel_gpu/src/graph/impls/ocl/strided_slice.cpp
@@ -32,13 +32,47 @@ struct strided_slice_impl : typed_primitive_impl_ocl<strided_slice> {
auto op_params = get_default_optional_params<kernel_selector::strided_slice_optional_params>(arg.get_program());
const size_t dims_num = params.inputs[0].Dimentions();

// Getting data from constant inputs. There are 3 args: Begin, End, Stride
for (size_t i = 1; i < arg.get_dependencies().size(); ++i) {
auto& input = arg.get_dependency(i).as<data>();
auto mem = input.get_attached_memory_ptr();
std::vector<int32_t> sizes = read_vector<int32_t>(mem, arg.get_program().get_stream());
pad_vector_to_size(sizes, dims_num, i != 1); // for "begin" completion used 0 value, for other - 1
params.striding_params.push_back(sizes);
if (!arg.const_mem.empty()) {
// Getting data from constant inputs. There are 3 args: Begin, End, Stride
for (size_t i = 0; i < arg.const_mem.size(); ++i) {
auto mem = arg.const_mem[i];
std::vector<int32_t> sizes;
if (mem->get_layout().data_type == cldnn::data_types::i64) {
mem_lock<int64_t, mem_lock_type::read> lock{mem, arg.get_program().get_stream()};
int64_t* data = lock.data();
std::vector<int64_t> sizes_i64 = std::vector<int64_t>(data, data + mem->get_layout().count());
sizes.resize(sizes_i64.size());
for (size_t j = 0; j < sizes.size(); j++)
sizes[j] = static_cast<int32_t>(sizes_i64[j]);
} else {
mem_lock<int32_t, mem_lock_type::read> lock{mem, arg.get_program().get_stream()};
int32_t* data = lock.data();
sizes = std::vector<int32_t>(data, data + mem->get_layout().count());
}
pad_vector_to_size(sizes, dims_num, i != 1); // for "begin" completion used 0 value, for other - 1
params.striding_params.push_back(sizes);
}
} else {
// Getting data from constant inputs. There are 3 args: Begin, End, Stride
for (size_t i = 1; i < arg.get_dependencies().size(); ++i) {
auto& input = arg.get_dependency(i).as<data>();
auto mem = input.get_attached_memory_ptr();
std::vector<int32_t> sizes;
if (input.get_output_layout().data_type == cldnn::data_types::i64) {
mem_lock<int64_t> lock{mem, arg.get_program().get_stream()};
int64_t* data = lock.data();
std::vector<int64_t> sizes_i64 = std::vector<int64_t>(data, data + input.get_output_layout().count());
sizes.resize(sizes_i64.size());
for (size_t j = 0; j < sizes.size(); j++)
sizes[j] = static_cast<int32_t>(sizes_i64[j]);
} else {
mem_lock<int32_t> lock{mem, arg.get_program().get_stream()};
int32_t* data = lock.data();
sizes = std::vector<int32_t>(data, data + input.get_output_layout().count());
}
pad_vector_to_size(sizes, dims_num, i != 1); // for "begin" completion used 0 value, for other - 1
params.striding_params.push_back(sizes);
}
}

auto begin_mask_ = prim->begin_mask;
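Both branches above do the same work for each Begin/End/Stride input: lock the constant memory, down-convert i64 element types to the i32 vectors the kernel selector expects, and pad each vector up to the tensor rank (0 fills "begin", 1 fills the others). A standalone sketch of the conversion and padding, with the locked buffer reduced to a raw pointer and count (hypothetical helpers, not cldnn's mem_lock API):

// Sketch of reading a strided_slice constant input and padding it to rank.
// Assumption: the raw buffer and element count come from the locked constant
// memory; only the i64 -> i32 conversion and padding are shown here.
#include <cstdint>
#include <iostream>
#include <vector>

std::vector<int32_t> to_i32(const int64_t* data, size_t count) {
    std::vector<int32_t> out(count);
    for (size_t i = 0; i < count; ++i)
        out[i] = static_cast<int32_t>(data[i]);
    return out;
}

void pad_to_rank(std::vector<int32_t>& v, size_t rank, int32_t fill) {
    v.resize(rank, fill);  // 0 for "begin", 1 for "end"/"stride"
}

int main() {
    const int64_t raw_begin[] = {0, 2};
    auto begin = to_i32(raw_begin, 2);
    pad_to_rank(begin, 4, /*fill=*/0);
    for (auto v : begin) std::cout << v << ' ';  // prints: 0 2 0 0
    std::cout << '\n';
}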
2 changes: 1 addition & 1 deletion src/plugins/intel_gpu/src/graph/kernel_selector_helper.cpp
@@ -720,7 +720,7 @@ kernel_selector::dev_type get_device_type(cldnn::device_type type) {

kernel_selector::data_tensor convert_data_tensor(const layout& l, uint32_t split, const tensor view_offset) {
const auto& pad = l.data_padding;
const auto& vals = l.get_dims();
const auto& vals = l.get_tensor().sizes(l.format);
const auto& add_offsets = view_offset.sizes(l.format);
const auto& lower_pad = pad.lower_size().sizes(l.format);
const auto& upper_pad = pad.upper_size().sizes(l.format);
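The one-line change above is the fix the commit message describes as converting a PartialShape to ordered dims with the output format: the dimension values are now taken per the layout's format order (via get_tensor().sizes(l.format)), matching the padding and offset vectors computed on the surrounding lines. A standalone sketch of that reordering idea, assuming a format described by a simple order string (the real cldnn formats carry their own traits):

// Sketch: reorder logical [b, f, y, x] dims into a format's channel order.
// Assumption: the format is modelled as an order string such as "bfyx" or
// "byxf"; cldnn encodes this in format traits instead.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

std::vector<int64_t> ordered_dims(const std::map<char, int64_t>& logical, const std::string& order) {
    std::vector<int64_t> out;
    for (char c : order)
        out.push_back(logical.at(c));
    return out;
}

int main() {
    std::map<char, int64_t> dims = {{'b', 1}, {'f', 3}, {'y', 224}, {'x', 224}};
    for (auto d : ordered_dims(dims, "byxf")) std::cout << d << ' ';  // prints: 1 224 224 3
    std::cout << '\n';
}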
13 changes: 9 additions & 4 deletions src/plugins/intel_gpu/src/graph/reshape.cpp
@@ -127,10 +127,15 @@ reshape_inst::typed_primitive_inst(network& network, reshape_node const& node) :

// if reshape operated in-place, postpone creation of the output until network run,
// then create new memory object as the reinterpreted output of the previous primitive
if (!node.can_be_optimized())
_output = allocate_output();
else
reuse_input();
if (_node.get_output_layout().is_static()) {
if (!node.can_be_optimized())
_output = allocate_output();
else
reuse_input();
} else {
if (_exec_deps.size() > 0 && input_memory_ptr())
reuse_input();
}
}

static std::vector<int64_t> read_vector(cldnn::memory::ptr mem, cldnn::stream& stream) {
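The new branching above keeps the old static-shape behaviour (allocate the output, or reuse the input when the node can be optimized) and, for dynamic shapes, defers allocation until the network runs, reusing the input only when execution dependencies and an input memory already exist — the exception fix named in the commit message. A compact sketch of that decision, with assumed flag names rather than the primitive_inst API:

// Sketch of the reshape output-allocation decision (assumed names).
#include <iostream>

enum class Action { AllocateOutput, ReuseInput, Defer };

Action choose_reshape_output(bool layout_is_static, bool can_be_optimized,
                             bool has_exec_deps, bool has_input_memory) {
    if (layout_is_static)
        return can_be_optimized ? Action::ReuseInput : Action::AllocateOutput;
    // Dynamic shape: postpone allocation until the network runs; reuse the
    // input only when execution deps and an input memory already exist.
    return (has_exec_deps && has_input_memory) ? Action::ReuseInput : Action::Defer;
}

int main() {
    bool reused = choose_reshape_output(false, true, true, true) == Action::ReuseInput;
    std::cout << (reused ? "reuse input" : "defer/allocate") << '\n';
}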
62 changes: 62 additions & 0 deletions src/plugins/intel_gpu/src/plugin/async_infer_request_legacy.cpp
@@ -0,0 +1,62 @@
// Copyright (C) 2018-2022 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#include "intel_gpu/plugin/async_infer_request_legacy.hpp"
#include "intel_gpu/plugin/itt.hpp"
#include <memory>

namespace ov {
namespace runtime {
namespace intel_gpu {

AsyncInferRequestLegacy::AsyncInferRequestLegacy(const InferRequestLegacy::Ptr &inferRequest,
const InferenceEngine::ITaskExecutor::Ptr& taskExecutor,
const InferenceEngine::ITaskExecutor::Ptr& waitExecutor,
const InferenceEngine::ITaskExecutor::Ptr& callbackExecutor)
: AsyncInferRequestThreadSafeDefault(inferRequest, taskExecutor, callbackExecutor), _inferRequest(inferRequest), _waitExecutor(waitExecutor) {
_pipeline = {};

if (!_inferRequest->use_external_queue()) {
_pipeline.push_back({taskExecutor,
[this] {
OV_ITT_SCOPED_TASK(itt::domains::intel_gpu_plugin, "AsyncInferRequestLegacy::PreprocessingAndStartPipeline");
_inferRequest->setup_stream_graph();
_inferRequest->preprocess();
_inferRequest->enqueue();
_inferRequest->wait();
} });
} else {
_pipeline.push_back({ _waitExecutor,
[this] {
OV_ITT_SCOPED_TASK(itt::domains::intel_gpu_plugin, "AsyncInferRequestLegacy::WaitPipeline");
_inferRequest->wait_notify();
} });
}
}

void AsyncInferRequestLegacy::Infer_ThreadUnsafe() {
if (_inferRequest->use_external_queue()) {
_inferRequest->setup_stream_graph();
_inferRequest->preprocess_notify();
_inferRequest->enqueue_notify();
}
Parent::Infer_ThreadUnsafe();
}

void AsyncInferRequestLegacy::StartAsync_ThreadUnsafe() {
if (_inferRequest->use_external_queue()) {
_inferRequest->setup_stream_graph();
_inferRequest->preprocess_notify();
_inferRequest->enqueue_notify();
}
Parent::StartAsync_ThreadUnsafe();
}

AsyncInferRequestLegacy::~AsyncInferRequestLegacy() {
StopAndWait();
}

} // namespace intel_gpu
} // namespace runtime
} // namespace ov
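The constructor above builds a one-stage pipeline: without an external queue, a single task on the task executor runs setup_stream_graph, preprocess, enqueue, and wait; with an external queue, only a wait stage remains on the wait executor, because enqueue happens synchronously in Infer_ThreadUnsafe/StartAsync_ThreadUnsafe before the pipeline runs. A minimal sketch of such an executor/stage pipeline in plain C++, with executors reduced to immediate invocation (the real class dispatches to InferenceEngine ITaskExecutor instances):

// Sketch: a pipeline of {executor, task} stages like the one assembled in
// the AsyncInferRequestLegacy constructor above.
#include <functional>
#include <iostream>
#include <utility>
#include <vector>

using Task = std::function<void()>;
using Executor = std::function<void(const Task&)>;

int main() {
    Executor immediate = [](const Task& t) { t(); };  // stand-in for a task executor

    bool use_external_queue = false;
    std::vector<std::pair<Executor, Task>> pipeline;

    if (!use_external_queue) {
        pipeline.push_back({immediate, [] {
            std::cout << "preprocess -> enqueue -> wait\n";
        }});
    } else {
        pipeline.push_back({immediate, [] {
            std::cout << "wait (enqueue already done synchronously)\n";
        }});
    }

    for (auto& stage : pipeline)
        stage.first(stage.second);
}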
9 changes: 8 additions & 1 deletion src/plugins/intel_gpu/src/plugin/compiled_model.cpp
@@ -8,6 +8,7 @@
#include "intel_gpu/plugin/infer_request.hpp"
#include "intel_gpu/plugin/compiled_model.hpp"
#include "intel_gpu/plugin/async_infer_request.hpp"
#include "intel_gpu/plugin/async_infer_request_legacy.hpp"
#include "openvino/runtime/intel_gpu/properties.hpp"
#include "intel_gpu/plugin/infer_request_legacy.hpp"

@@ -121,8 +122,14 @@ IInferRequestInternal::Ptr CompiledModel::CreateInferRequest() {
if (this->_plugin && _plugin->IsNewAPI()) {
internalRequest = CreateInferRequestImpl(_parameters, _results);
}
if (!internalRequest)
if (!internalRequest) {
internalRequest = CreateInferRequestImpl(_networkInputs, _networkOutputs);
internalRequest->setPointerToExecutableNetworkInternal(shared_from_this());
return std::make_shared<AsyncInferRequestLegacy>(std::static_pointer_cast<InferRequestLegacy>(internalRequest),
m_taskExecutor,
m_waitExecutor,
_callbackExecutor);
}
internalRequest->setPointerToExecutableNetworkInternal(shared_from_this());
return std::make_shared<AsyncInferRequest>(std::static_pointer_cast<InferRequest>(internalRequest),
m_taskExecutor,
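The control flow above first tries the new-API request implementation; if none is produced, it falls back to the legacy request and wraps it in AsyncInferRequestLegacy, otherwise the new request is wrapped in AsyncInferRequest. A minimal sketch of that selection with placeholder types (not the plugin's classes):

// Sketch of the infer-request selection in CreateInferRequest, with the
// request types reduced to a labelled struct for illustration.
#include <iostream>
#include <memory>
#include <string>

struct AsyncRequest { std::string kind; };

std::shared_ptr<AsyncRequest> create_infer_request(bool is_new_api, bool new_impl_available) {
    bool use_new = is_new_api && new_impl_available;
    if (!use_new) {
        // Fall back to the legacy request and its legacy async wrapper.
        return std::make_shared<AsyncRequest>(AsyncRequest{"AsyncInferRequestLegacy"});
    }
    return std::make_shared<AsyncRequest>(AsyncRequest{"AsyncInferRequest"});
}

int main() {
    std::cout << create_infer_request(false, false)->kind << '\n';  // AsyncInferRequestLegacy
}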