【NPU】Cherry-pick ascendrc ops code by 0325 to develop (#32197)

* merge 31065 * Fix typo of selected_npus (#31230) * merge 31249 * [NPU] Support npu op pow and pow grad (#31247) * [NPU] Support npu op: (1) pow (2) pow_grad * Support fp16 * Fix pow npu fp16 test (#31256) * support list of list attribute for NPU (#31299) * support list of list attribute for NPU * fix compile problem * fix reference * [NPU] Support npu op: (1) slice (2) slice_grad (#31275) * fix reading flags from env (#31329) * merge 31347 * [NPU] Support npu op layer_norm and layer_norm_grad (#31310) * init commit, add layer_norm npu kernel * fix typo * add unittest * add unittest * fix bug * fix bug * refine ut * [NPU] add npu kernel for equal op (#31393) * add npu kernel for equal op * refine code * add more ut * update year * [NPU] Support npu kernel for shape op (#31427) * add shape npu * fix * fix * fix endif (#31431) * Fix pow, use fillD instead of broadcast (#31433) * Fix pow, refine code (#31440) * fix cmake of cryptopp to avoid downloading every time (#31451) * [NPU] squeeze and unsqueeze op for ascend (#31452) Co-authored-by: root <[email protected]> * Support npu kernel for gather op (#31458) * add gather npu op * code review done * update python new line * precommit * fix review * del commit * 【NPU】add scale op for npu (#31499) * add scale npu * fix * fix * Support TensorFormVector, TensorToVector of bool type (#31518) * support TensorFormVector, TensorToVector of bool type * add ut * fix compile problem * 【NPU】support npu kernel for fill_constant op (#31521) * add fill_constant npu * add fill_constant npu * fix * cherry-pick 31422, solve conflict * 【NPU】Support npu kernel for matmul op (#31544) * add matmulv2_npu * add matmul * add matmul * [NPU] Support npu op elementwise_mul and elementwise_mul_grad (#31571) * [NPU] Support npu op elementwise_max (#31574) * 【NPU】add relu op for npu (#31515) * add relu npu * fixed * fix * 【NPU】Suppert npu kernel for reshape2 op (#31524) * add reshape2 npu * add reshpe2 * [NPU] Support npu kernel for gather op fix bug (#31541) * add gather npu op * code review done * update python new line * precommit * fix review * del commit * update gather_grad * fix bug * fix bug * [NPU] Support npu kernel for amp_check_finite_and_unscale_npu op (#31457) * Support npu kernel for amp_check_finite_and_unscale_npu op * support EnforceNotMet exception * fix exception bug * modify python unittest * precommit * update c++ unittest * fix review * fix review * [NPU] accuracy op (#31492) * accuracy op * fix license * fix * add test and fix bug * [NPU] add Assign OP (#31561) * add assign op * add test assign npu test * dele if def Co-authored-by: oyjxer <[email protected]> * [NPU] fix npu op elementwise_mul_grad (#31592) * 【NPU】Support npu op gelu and gelu_grad (#31530) * Support npu op gelu and gelu_grad * Support npu op gelu and gelu_grad * [NPU] fix assgin cmake (#31595) * fix gather_grad bug (#31607) * [NPU] add range op (#31560) * add range op * fix codestyle; call GetSize directly Co-authored-by: oyjxer <[email protected]> * 【NPU】Support npu op elementwise_div and elementwise_div_grad (#31573) * Support npu op elementwise_div and elementwise_div_grad * Support npu op elementwise_div and elementwise_div_grad * Support npu op elementwise_div and elementwise_div_grad * [NPU] Support npu op log, log_grad, sqrt, sqrt_grad, square, tanh and tanh_grad (#31600) * [NPU] Support npu op logicalnot_op (#31534) * [NPU] Support npu op elementwise_min (#31575) * [NPU] Support npu op elementwise_pow (#31576) * [NPU] Support npu op table_lookup_v2 and table_lookup_v2_grad (#31399) * [npu] support npu kernel `table_lookup_v2` * clean up * +python test * +cmake * clean up * remove int8 kernel + python unitest for fp16 * clean up * [NPU] support npu kernel for `less_than` (#31327) * [npu] support npu kernel for `less than` * remove int* kernel * cleanup * [NPU] Support npu kernel scatter op (#31624) * Support npu kernel scatter op * Add more test * [NPU] fix allocator min chunk size (#31632) * [NPU] Support NPU kernel cast op (#31635) Co-authored-by: frankwhzhang <[email protected]> * [NPU] add npu kernel for sgd (#31639) * 【NPU】Support NPU kernel for reduce_sum op v2 (#31620) * add reduce_sum * fix broadcastd * fix test * fix * add unsqueeze in reduce_sum * add template * add unittest for keep_dim * test reduce_all Co-authored-by: frankwhzhang <[email protected]> * [NPU] add npu kernel for adam (#31644) * add npu kernel for adam * refine code * disable test * modify atol * 【NPU】Support npu kernel for mul op (#31584) * add mul * add test mul * [NPU] add npu kernel for softmax_with_cross_entropy (#31656) * init * fix bugs * [NPU] add npu kernel for mean Op (#31562) * update mean op * update mean op * give a better test activation Co-authored-by: oyjxer <[email protected]> * Revert "[NPU] add npu kernel for mean Op (#31562)" (#31665) This reverts commit 468ac69. * 【NPU】Add TensorCopy to NPU kernel for reduce_sum op (#31667) * update unittest * add TensorCopy in npu grad kernel * [NPU] Support npu op `expand` (#31405) * [npu] support npu kernel for `expand` * [NPU] fix shape of dx in mul_grad (#31675) * fix shape of dx * refine code * [NPU] add Increment op (#31563) * add increment * fix * update test increment op inplace * update increment op * increment b = 2 Co-authored-by: oyjxer <[email protected]> * [NPU] add NPU add topk (#31596) * add topk op * add cmake * update topk npu op * refactor func * fix test not go npu TopKD bug * NPUPlace(4) to NPUPlace(0) * update comment Co-authored-by: oyjxer <[email protected]> * [NPU] Support NPU kernel sum op (#31671) * [NPU] npu support `transpose` (#31486) * cherry-pick 31564, solve conflict * [NPU] Fix bug: Fix calculation errors of pow grad npu kernel (#31699) * [NPU] Support testing grad of NPU ops in OpTest (#31697) * [NPU] Support NPU kernel of stack op (#31711) * [NPU] Remove redundant ctest of top_k_op_npu_test (#31718) * [NPU] fix reshape npu op kernel (#31726) * rename npu op file * fix reshape * [NPU] change transpose to transpose2 (#31734) * change transpose to transpose2 * fix bug * [NPU] Support mean npu kernel (#31729) * [NPU] fix some bugs of npu op (#31739) * fix softmax * fix mean * fix lookup_table_v2 * 【NPU】Fix npu kernel elementwise_div_grad (#31753) * [NPU] fix the grad kernel diff bug of gather op (#31757) * fix gather grad kernel diff * fix gather grad kernel diff * fix gather review bug * 【NPU】Fix reshape test & add grad test (#31776) * fix * fix * [NPU] support fp16 for npu accuracy op (#31797) * [NPU] support list of tensor input (#31801) * support list of tensor as npu input * add comment * fix typo * fix typo * [NPU] add npu kernel for concat op (#31695) * add npu kernel for concat op * add npu kernel for concat op * refine code * update * refine concat_grad * [NPU] Support npu kernel for op elementwise_floordiv (#31822) * [NPU] fix bug of lookup_table_v2_grad (#31834) * [NPU] support default stream (#31510) * [NPU] support mixed precision input for npu layer norm (#31847) * support mixed precision input for npu layer norm * fix layer_norm npu kernel Co-authored-by: zhiqiu <[email protected]> * 【NPU】Support npu kernel for update_loss_scaling op (#31830) * add update_loss_scaling_npu NPU kernel * change TensorFromVec to Memset * fix compile problem (#31850) * [NPU] support npu for conditional_block op (#31854) * 【NPU】Add int dtype kernel for reshape2 op (#31864) * fix * fix * [NPU] fix some op bugs (#31855) * fix some op bugs * fix some bugs * follow comments * fix log level * add ut * [NPU] support fp16 of input for api pow (#31871) * [NPU] add npu kernel for truncated_gaussian_random op (#31654) * init * add todo * add npu kernel for truncated_gaussian_random * add sync * fix concat_grad * fix typo * fix compile * fix compile * fix compile * fix compile * fix compile * fix compile * fix code style * fix code style * fix code * Fix op test (#32231) * fix conditional block (#32243) * fix style code Co-authored-by: xiayanming <[email protected]> Co-authored-by: Leo Chen <[email protected]> Co-authored-by: liym27 <[email protected]> Co-authored-by: Reventon_L <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: oyjxer <[email protected]> Co-authored-by: yinhaofeng <[email protected]> Co-authored-by: OleNet <[email protected]> Co-authored-by: Meiyim <[email protected]> Co-authored-by: oyxuan-11 <[email protected]> Co-authored-by: pangyoki <[email protected]>
PaddlePaddle · Apr 15, 2021 · e6bc358 · e6bc358
1 parent 69d8027
commit e6bc358
Show file tree

Hide file tree

Showing 138 changed files with 13,659 additions and 314 deletions.
diff --git a/cmake/external/gloo.cmake b/cmake/external/gloo.cmake
@@ -32,7 +32,7 @@ cache_third_party(extern_gloo
     TAG           ${GLOO_TAG}
     DIR           GLOO_SOURCE_DIR)
 
-if(WITH_ASCEND)
+  if(WITH_ASCEND OR WITH_ASCEND_CL)
   ExternalProject_Add(
       extern_gloo
       ${EXTERNAL_PROJECT_LOG_ARGS}

diff --git a/cmake/external/protobuf.cmake b/cmake/external/protobuf.cmake
@@ -242,7 +242,7 @@ endif()
     )
 ENDFUNCTION()
 
-if(WITH_ASCEND)
+if(WITH_ASCEND OR WITH_ASCEND_CL)
     SET(PROTOBUF_VERSION 3.8.0)
 else()
     SET(PROTOBUF_VERSION 3.1.0)

diff --git a/cmake/external/threadpool.cmake b/cmake/external/threadpool.cmake
@@ -16,7 +16,7 @@ INCLUDE(ExternalProject)
 
 SET(THREADPOOL_PREFIX_DIR ${THIRD_PARTY_PATH}/threadpool)
 SET(THREADPOOL_SOURCE_DIR ${THIRD_PARTY_PATH}/threadpool/src/extern_threadpool)
-if(WITH_ASCEND)
+if(WITH_ASCEND OR WITH_ASCEND_CL)
     SET(THREADPOOL_REPOSITORY https://gitee.com/tianjianhe/ThreadPool.git)
 else()
     SET(THREADPOOL_REPOSITORY ${GIT_URL}/progschj/ThreadPool.git)

diff --git a/cmake/external/warpctc.cmake b/cmake/external/warpctc.cmake
@@ -43,7 +43,7 @@ cache_third_party(extern_warpctc
     TAG          ${WARPCTC_TAG}
     DIR          WARPCTC_SOURCE_DIR)
 
-if(WITH_ASCEND)
+if(WITH_ASCEND OR WITH_ASCEND_CL)
     ExternalProject_Add(
         extern_warpctc
         ${EXTERNAL_PROJECT_LOG_ARGS}

diff --git a/paddle/fluid/framework/tensor_util.h b/paddle/fluid/framework/tensor_util.h
@@ -135,6 +135,7 @@ void TensorFromArray(const T* src, const size_t& array_size,
   }
 #endif
 }
+
 template <typename T>
 void TensorFromVector(const std::vector<T>& src,
                       const platform::DeviceContext& ctx, Tensor* dst) {
@@ -167,6 +168,49 @@ void TensorFromVector(const std::vector<T>& src,
 #endif
 }
 
+// The fully specialized function should be inline to avoid
+// multi-definition.
+template <>
+inline void TensorFromVector(const std::vector<bool>& src,
+                             const platform::DeviceContext& ctx, Tensor* dst) {
+  // vector<bool> has no data() member, use array instead.
+  // See details:
+  // https://stackoverflow.com/questions/46115669/why-does-stdvectorbool-have-no-data/46115714
+  bool* array = new bool[src.size()];
+  for (unsigned int i = 0; i < src.size(); i++) {
+    array[i] = static_cast<bool>(src[i]);
+  }
+
+  auto dst_place = ctx.GetPlace();
+  auto src_ptr = static_cast<const void*>(array);
+  platform::CPUPlace src_place;
+  dst->Resize({static_cast<int64_t>(src.size())});
+  auto dst_ptr = static_cast<void*>(dst->mutable_data<bool>(dst_place));
+  auto size = src.size() * sizeof(bool);
+
+  if (platform::is_cpu_place(dst_place)) {
+    memory::Copy(BOOST_GET_CONST(platform::CPUPlace, dst_place), dst_ptr,
+                 src_place, src_ptr, size);
+  }
+#ifdef PADDLE_WITH_CUDA
+  else if (platform::is_gpu_place(dst_place)) {  // NOLINT
+    memory::Copy(
+        BOOST_GET_CONST(platform::CUDAPlace, dst_place), dst_ptr, src_place,
+        src_ptr, size,
+        reinterpret_cast<const platform::CUDADeviceContext&>(ctx).stream());
+  }
+#endif
+#ifdef PADDLE_WITH_ASCEND_CL
+  else if (platform::is_npu_place(dst_place)) {  // NOLINT
+    memory::Copy(
+        BOOST_GET_CONST(platform::NPUPlace, dst_place), dst_ptr, src_place,
+        src_ptr, size,
+        reinterpret_cast<const platform::NPUDeviceContext&>(ctx).stream());
+  }
+#endif
+  delete[] array;
+}
+
 template <typename T>
 void TensorFromVector(const std::vector<T>& src, Tensor* dst) {
   platform::CPUPlace dst_place = platform::CPUPlace();
@@ -179,6 +223,23 @@ void TensorFromVector(const std::vector<T>& src, Tensor* dst) {
   memory::Copy(dst_place, dst_ptr, src_place, src_ptr, size);
 }
 
+template <>
+inline void TensorFromVector(const std::vector<bool>& src, Tensor* dst) {
+  bool* array = new bool[src.size()];
+  for (unsigned int i = 0; i < src.size(); i++) {
+    array[i] = static_cast<bool>(src[i]);
+  }
+  platform::CPUPlace dst_place = platform::CPUPlace();
+  auto src_ptr = static_cast<const void*>(array);
+  platform::CPUPlace src_place;
+  dst->Resize({static_cast<int64_t>(src.size())});
+  auto dst_ptr = static_cast<void*>(dst->mutable_data<bool>(dst_place));
+  auto size = src.size() * sizeof(bool);
+
+  memory::Copy(dst_place, dst_ptr, src_place, src_ptr, size);
+  delete[] array;
+}
+
 template <typename T>
 void TensorToVector(const Tensor& src, const platform::DeviceContext& ctx,
                     std::vector<T>* dst) {
@@ -212,6 +273,46 @@ void TensorToVector(const Tensor& src, const platform::DeviceContext& ctx,
 #endif
 }
 
+template <>
+inline void TensorToVector(const Tensor& src,
+                           const platform::DeviceContext& ctx,
+                           std::vector<bool>* dst) {
+  auto src_ptr = static_cast<const void*>(src.data<bool>());
+  auto size = src.numel() * sizeof(bool);
+
+  bool* array = new bool[src.numel()];
+
+  platform::CPUPlace dst_place;
+  dst->resize(src.numel());
+  auto dst_ptr = static_cast<void*>(array);
+
+  if (platform::is_cpu_place(src.place())) {
+    memory::Copy(dst_place, dst_ptr,
+                 BOOST_GET_CONST(platform::CPUPlace, src.place()), src_ptr,
+                 size);
+  }
+#ifdef PADDLE_WITH_CUDA
+  else if (platform::is_gpu_place(src.place())) {  // NOLINT
+    memory::Copy(
+        dst_place, dst_ptr, BOOST_GET_CONST(platform::CUDAPlace, src.place()),
+        src_ptr, size,
+        reinterpret_cast<const platform::CUDADeviceContext&>(ctx).stream());
+  }
+#endif
+#ifdef PADDLE_WITH_ASCEND_CL
+  else if (platform::is_npu_place(src.place())) {  // NOLINT
+    memory::Copy(
+        dst_place, dst_ptr, BOOST_GET_CONST(platform::NPUPlace, src.place()),
+        src_ptr, size,
+        reinterpret_cast<const platform::NPUDeviceContext&>(ctx).stream());
+  }
+#endif
+  for (unsigned int i = 0; i < src.numel(); i++) {
+    (*dst)[i] = static_cast<bool>(array[i]);
+  }
+  delete[] array;
+}
+
 template <typename T>
 void TensorToVector(const Tensor& src, std::vector<T>* dst) {
   auto src_ptr = static_cast<const void*>(src.data<T>());
@@ -231,6 +332,32 @@ void TensorToVector(const Tensor& src, std::vector<T>* dst) {
                BOOST_GET_CONST(platform::CPUPlace, src.place()), src_ptr, size);
 }
 
+template <>
+inline void TensorToVector(const Tensor& src, std::vector<bool>* dst) {
+  auto src_ptr = static_cast<const void*>(src.data<bool>());
+  auto size = src.numel() * sizeof(bool);
+
+  bool* array = new bool[src.numel()];
+
+  platform::CPUPlace dst_place;
+  dst->resize(src.numel());
+  auto dst_ptr = static_cast<void*>(array);
+
+  PADDLE_ENFORCE_EQ(
+      platform::is_cpu_place(src.place()), true,
+      platform::errors::InvalidArgument(
+          "The input tensor should be CPU device, but actually it is in %s.",
+          src.place()));
+
+  memory::Copy(dst_place, dst_ptr,
+               BOOST_GET_CONST(platform::CPUPlace, src.place()), src_ptr, size);
+
+  for (unsigned int i = 0; i < src.numel(); i++) {
+    (*dst)[i] = static_cast<bool>(array[i]);
+  }
+  delete[] array;
+}
+
 std::ostream& operator<<(std::ostream& os, const Tensor& t);
 }  // namespace framework
 }  // namespace paddle
diff --git a/paddle/fluid/framework/tensor_util_test.cc b/paddle/fluid/framework/tensor_util_test.cc
@@ -242,6 +242,61 @@ TEST(TensorToVector, Tensor) {
 #endif
 }
 
+TEST(TensorToVector, Tensor_bool) {
+  {
+    paddle::framework::Tensor src;
+    bool* src_ptr =
+        src.mutable_data<bool>({3, 3}, paddle::platform::CPUPlace());
+    for (int i = 0; i < 3 * 3; ++i) {
+      src_ptr[i] = static_cast<bool>(i % 2);
+    }
+
+    paddle::platform::CPUPlace place;
+    std::vector<bool> dst;
+    paddle::framework::TensorToVector<bool>(src, &dst);
+
+    for (int i = 0; i < 3 * 3; ++i) {
+      EXPECT_EQ(src_ptr[i], dst[i]);
+    }
+  }
+#ifdef PADDLE_WITH_CUDA
+  {
+    std::vector<bool> src_vec = {
+        false, true, false, true, false, true, false, true, false,
+    };
+    paddle::framework::Tensor gpu_tensor;
+    paddle::platform::CUDAPlace place;
+    paddle::platform::CUDADeviceContext gpu_ctx(place);
+    paddle::framework::TensorFromVector<bool>(src_vec, gpu_ctx, &gpu_tensor);
+
+    std::vector<bool> dst;
+    paddle::framework::TensorToVector<bool>(gpu_tensor, gpu_ctx, &dst);
+
+    for (int i = 0; i < 3 * 3; ++i) {
+      EXPECT_EQ(src_vec[i], dst[i]);
+    }
+  }
+#endif
+#ifdef PADDLE_WITH_ASCEND_CL
+  {
+    std::vector<bool> src_vec = {
+        false, true, false, true, false, true, false, true, false,
+    };
+    paddle::framework::Tensor npu_tensor;
+    paddle::platform::NPUPlace place(0);
+    paddle::platform::NPUDeviceContext npu_ctx(place);
+    paddle::framework::TensorFromVector<bool>(src_vec, npu_ctx, &npu_tensor);
+
+    std::vector<bool> dst;
+    paddle::framework::TensorToVector<bool>(npu_tensor, npu_ctx, &dst);
+
+    for (int i = 0; i < 3 * 3; ++i) {
+      EXPECT_EQ(src_vec[i], dst[i]);
+    }
+  }
+#endif
+}
+
 TEST(TensorFromDLPack, Tensor) {
   {
     std::vector<int> src_vec = {1, 2, 3, 4, 5, 6, 7, 8, 9};

diff --git a/paddle/fluid/framework/type_defs.h b/paddle/fluid/framework/type_defs.h
@@ -45,6 +45,17 @@ using Attribute = boost::variant<
 
 using AttributeMap = std::unordered_map<std::string, Attribute>;
 
+#ifdef PADDLE_WITH_ASCEND_CL
+using NPUAttribute =
+    boost::variant<boost::blank, int, float, std::string, std::vector<int>,
+                   std::vector<float>, std::vector<std::string>, bool,
+                   std::vector<bool>, BlockDesc*, int64_t,
+                   std::vector<BlockDesc*>, std::vector<int64_t>,
+                   std::vector<double>, std::vector<std::vector<int64_t>>>;
+
+using NPUAttributeMap = std::unordered_map<std::string, NPUAttribute>;
+#endif
+
 using OpCreator = std::function<OperatorBase*(
     const std::string& /*type*/, const VariableNameMap& /*inputs*/,
     const VariableNameMap& /*outputs*/, const AttributeMap& /*attrs*/)>;

diff --git a/paddle/fluid/memory/memcpy.cc b/paddle/fluid/memory/memcpy.cc
@@ -206,8 +206,16 @@ void Copy<platform::NPUPlace, platform::CPUPlace>(platform::NPUPlace dst_place,
   if (UNLIKELY(num == 0)) return;
 
   platform::SetNPUDeviceId(dst_place.device);
+
+  // NOTE(ascendrc): NPU memcpy async from host to device is a "real" async,
+  // which is different from CUDA. In Paddle, when async is called, "sync"
+  // is run actually, which means Paddle doesn't fully supported async.
+  // TODO(ascendrc): Support NPU memcpy async for better performance.
+  stream = nullptr;
+
   VLOG(4) << "memory::Copy " << num << " Bytes from " << src_place << " to "
           << dst_place << " by thream(" << stream << ")";
+
   if (stream) {
     platform::RecordEvent record_event("NpuMemcpyAsync:CPU->NPU");
     platform::NPUMemcpyAsync(dst, src, num, ACL_MEMCPY_HOST_TO_DEVICE, stream);
@@ -226,8 +234,16 @@ void Copy<platform::CPUPlace, platform::NPUPlace>(platform::CPUPlace dst_place,
   if (UNLIKELY(num == 0)) return;
 
   platform::SetNPUDeviceId(src_place.device);
+
+  // NOTE(ascendrc): NPU memcpy async from device to host is a "real" async,
+  // which is different from CUDA. In Paddle, when async is called, "sync"
+  // is run actually, which means Paddle doesn't fully supported async.
+  // TODO(ascendrc): Support NPU memcpy async for better performance.
+  stream = nullptr;
+
   VLOG(4) << "memory::Copy " << num << " Bytes from " << src_place << " to "
           << dst_place << " by thream(" << stream << ")";
+
   if (stream) {
     platform::RecordEvent record_event("NpuMemcpyAsync:NPU->CPU");
     platform::NPUMemcpyAsync(dst, src, num, ACL_MEMCPY_DEVICE_TO_HOST, stream);

diff --git a/paddle/fluid/operators/CMakeLists.txt b/paddle/fluid/operators/CMakeLists.txt
@@ -124,6 +124,7 @@ if (WITH_ASCEND)
 endif()
 
 if (WITH_ASCEND_CL)
+  cc_test(assign_op_npu_test SRCS assign_op_npu_test.cc DEPS assign_op)
   cc_library(npu_op_runner SRCS npu_op_runner.cc DEPS operator npu_info)
   set(COMMON_OP_DEPS ${COMMON_OP_DEPS} npu_op_runner)
 endif()
@@ -141,8 +142,8 @@ set(OPERATOR_DEPS ${OPERATOR_DEPS} ${COMMON_OP_DEPS})
 set(GLOB_OPERATOR_DEPS ${OPERATOR_DEPS} CACHE INTERNAL "Global Op dependencies")
 
 cc_test(test_common_infer_shape_functions SRCS test_common_infer_shape_functions.cc DEPS common_infer_shape_functions ${COMMON_OP_DEPS} activation_op elementwise_add_op softmax_op softmax)
-cc_test(assign_op_test SRCS assign_op_test.cc DEPS assign_op)
 cc_test(gather_test SRCS gather_test.cc DEPS tensor)
+cc_test(assign_op_test SRCS assign_op_test.cc DEPS assign_op)
 cc_test(scatter_test SRCS scatter_test.cc DEPS tensor math_function)
 cc_test(beam_search_decode_op_test SRCS beam_search_decode_op_test.cc DEPS lod_tensor)
 cc_test(strided_memcpy_test SRCS strided_memcpy_test.cc DEPS tensor memory)
@@ -163,10 +164,19 @@ if (WITH_PYTHON)
   cc_library(py_func_op SRCS py_func_op.cc DEPS op_registry python pybind)
 endif()
 
+if (WITH_ASCEND_CL)
+  cc_test(range_op_npu_test SRCS range_op_npu_test.cc DEPS op_registry range_op scope device_context enforce executor)
+  cc_test(lookup_table_v2_op_npu_test SRCS lookup_table_v2_op_npu_test.cc DEPS op_registry lookup_table_v2_op scope device_context enforce executor compare_op)
+endif()
+
 set(GLOB_OP_LIB ${OP_LIBRARY} CACHE INTERNAL "Global OP library")
 add_subdirectory(benchmark)
 
 cc_test(op_debug_string_test SRCS op_debug_string_test.cc DEPS elementwise_add_op)
+if (WITH_ASCEND_CL)
+    cc_test(transpose_op_npu_test SRCS transpose_op_npu_test.cc DEPS op_registry transpose_op scope device_context enforce executor)
+endif()
+
 
 if(WITH_MKLDNN)
 include(mkldnn/inplace_op_tests.cmake)
@@ -180,3 +190,7 @@ if(WITH_UNITY_BUILD)
     # The specified link dependency needs to be displayed here.
     target_link_libraries(paddle_operators_unity ${OP_HEADER_DEPS} ${COMMON_OP_DEPS})
 endif()
+
+if(WITH_ASCEND_CL)
+cc_test(gelu_op_npu_test SRCS gelu_op_npu_test.cc DEPS op_registry gelu_op scope device_context enforce executor)
+endif()