Memset/memcpy/omp profiling tester (apache#8380)
* Memory set/copy speed assertions

* Memory set/copy speed assertions

* ..

* ..

* ..

* ..

* bounce some cache

* lint

* Timing output for test_factorization_module when Verbose enabled (apache#8363)

* Timing output for test_factorization_module when Verbose enabled

* Trigger build

* Trigger build

* Trigger build

* Misc fixes for sparse distributed training (apache#8345)

* remove mshadow::range in init_op.h

* add unit test

* remove pass by ptr, add unit test for pull empty weights

* fix range in key partition

* remove wrong comment

* remove change for partition

* remove unused var

* add int64 to arange. add checkpointing example

* Fix the Readme (apache#8369)

* Allow test to converge (apache#8351)

* Allow test to converge

* Trigger build

* Trigger build

* Trigger build

* Update cudnn_algoreg-inl.h (apache#7988)

* [Perl] emulate Python zip() for Perl (apache#8192)

* [Perl] emulate Python zip() for Perl

* [Perl] retool zip() uses away from the callback form

* add profile option for frontend profiling to image script (apache#8171)

* add profile option for frontend profiling to image script

* Update image_classification.py

* Update image_classification.py

* Fix Typo (classification) (apache#8376)

Fix a typo in the example readme.

* Use omp_get_max_threads() when OMP_NUM_THREADS environment variable is set (apache#8379)

* CPU optimization for ActivationOp (apache#8296)

* CPU optimization for ActivationOp

Significant improvement on CPU (several orders of magnitude in some cases, especially on the backward pass).
Very slight improvement on GPU.

OLD MSHADOW APPROACH
--------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 18.948 ms, avg: 0.037896 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.658 ms, avg: 0.003316 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 57.973 ms, avg: 0.115946 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 4.748 ms, avg: 0.009496 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 703.446 ms, avg: 1.40689 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 56.255 ms, avg: 0.11251 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 2107.77 ms, avg: 4.21554 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 168.483 ms, avg: 0.336966 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 24122.2 ms, avg: 48.2443 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1908.7 ms, avg: 3.8174 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.637 ms, avg: 0.003274 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.665 ms, avg: 0.00333 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.562 ms, avg: 0.003124 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.661 ms, avg: 0.003322 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.635 ms, avg: 0.00327 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.702 ms, avg: 0.003404 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.83 ms, avg: 0.00366 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.041 ms, avg: 0.004082 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.08 ms, avg: 0.00416 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.688 ms, avg: 0.005376 ms X 500 passes

NEW MXNET_OP APPROACH
---------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 80.748 ms, avg: 0.161496 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.176 ms, avg: 0.002352 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 7.881 ms, avg: 0.015762 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 2.181 ms, avg: 0.004362 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 111.48 ms, avg: 0.22296 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 5.408 ms, avg: 0.010816 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 333.439 ms, avg: 0.666878 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 21.331 ms, avg: 0.042662 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 3429.19 ms, avg: 6.85837 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 286.324 ms, avg: 0.572648 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.618 ms, avg: 0.003236 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.671 ms, avg: 0.003342 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.629 ms, avg: 0.003258 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.728 ms, avg: 0.003456 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.753 ms, avg: 0.003506 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.756 ms, avg: 0.003512 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.704 ms, avg: 0.003408 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.791 ms, avg: 0.003582 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.032 ms, avg: 0.004064 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.143 ms, avg: 0.004286 ms X 500 passes

* lint

* Trigger build

* Trigger build

* Negative begin and end support for csr slice (apache#8241)

* negative index support for sparse slice

* fix lint

* getitem(int) for csr ndarray, support a[-1]

* remove unnecessary argument

* unittest and doc update

* Preparing for 0.12.0.rc0: Final changes before RC (apache#8301)

* Final changes before RC

* Updates to NEWS.md

* Updates

* Enable smoothing in softmax operator (apache#8125)

* v0.12 regression: Fix registration of children for Block (apache#8277)

* Fix Block not registering children

If the attribute was already set to something different than Block (e.g. None),
it was not being registered.

* fix if / elif for block children registration

* trigger test

* Add fix from apache#8152

* Add tests from apache#8152

* Revert "[CMAKE] Fix windows cmake build" (apache#8311)

* Revert "Added my code signing key (apache#8293)"

This reverts commit 22ab185.

* Revert "[CMAKE] Fix windows cmake build (apache#8227)"

This reverts commit 1c1c788.

* Fixed broken links: https was pointing to http for mxnet.io (apache#8300)

* Update rnn.md (apache#8320)

* fluent methods for missed ops (apache#8329)

* update ps lite (apache#8327)

* Fix unused type warning (apache#8316)

* Trigger build

* Trigger build

* Misc fixes for sparse distributed training (apache#8345)

* remove mshadow::range in init_op.h

* add unit test

* remove pass by ptr, add unit test for pull empty weights

* fix range in key partition

* remove wrong comment

* remove change for partition

* remove unused var

* add int64 to arange. add checkpointing example

* Fix the Readme (apache#8369)

* Allow test to converge (apache#8351)

* Allow test to converge

* Trigger build

* Trigger build

* Trigger build

* Update cudnn_algoreg-inl.h (apache#7988)

* [Perl] emulate Python zip() for Perl (apache#8192)

* [Perl] emulate Python zip() for Perl

* [Perl] retool zip() uses away from the callback form

* add profile option for frontend profiling to image script (apache#8171)

* add profile option for frontend profiling to image script

* Update image_classification.py

* Update image_classification.py

* Fix Typo (classification) (apache#8376)

Fix a typo in the example readme.

* do gtest test

* add assert and do higher runs as performance test only (when performance test flag set)

* Trigger build

* lint

* Trigger build

* Sparse operator performance improvement (apache#8412)

* sparse rsprsp perf improvements

* Clean up

* dtype default to source_array.dtype for sparse ndarrays (apache#8403)

* derive default dtype/ctx from input for sparse ndarrays

* add gpu tests

* fix lint. add doc

* remove default_ctx code

* bug fix when passing dtype to array()

* update doc

* remove extra line

* also check ctx

* fix using default mean pixels (apache#8352)

* fix gluon.data.RecordFileDataset (apache#8353)

* upgrade MKL (apache#8378)

* Lint fix (apache#8402)

* Trigger build
cjolivier01 authored Oct 28, 2017
1 parent afac234 commit 076703c
Showing 2 changed files with 164 additions and 0 deletions.
19 changes: 19 additions & 0 deletions tests/cpp/include/test_perf.h
@@ -60,6 +60,25 @@ inline uint64_t getMicroTickCount() {
#endif
}

/*! \brief current timestamp: billionths of a second (nanoseconds) */
inline uint64_t getNannoTickCount() {
#ifndef _WIN32
struct timeval tv;
gettimeofday(&tv, NULL);
return (uint64_t(tv.tv_sec) * 1000000 + tv.tv_usec) * 1000;
#else
LARGE_INTEGER CurrentTime;
LARGE_INTEGER Frequency;

QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&CurrentTime);

CurrentTime.QuadPart *= 1000000000;
CurrentTime.QuadPart /= Frequency.QuadPart;
return CurrentTime.QuadPart;
#endif
}

/*! \brief millisecond tick count */
inline uint64_t getTickCount() {
return getMicroTickCount() / 1000;
145 changes: 145 additions & 0 deletions tests/cpp/misc/memory_test.cc
@@ -0,0 +1,145 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

/*!
* \file memory_test.cc
* \brief Perf/profile run of ActivationOp
* \author Chris Olivier
*/

#include <gtest/gtest.h>
#include <dmlc/omp.h>
#include <mxnet/tensor_blob.h>
#include "../include/test_util.h"
#include "../include/test_perf.h"

using namespace mxnet;

#ifdef _OPENMP
template<typename Container>
static typename Container::value_type average(const Container& cont) {
typename Container::value_type avg = 0;
const size_t sz = cont.size();
for (auto iter = cont.begin(), e_iter = cont.end(); iter != e_iter; ++iter) {
avg += *iter / sz;  // divide incrementally to avoid overflowing the accumulator
}
return avg;
}

static std::string pretty_num(uint64_t val) {
std::string res, s = std::to_string(val);
size_t ctr = 0;
for (int i = static_cast<int>(s.size()) - 1; i >= 0; --i, ++ctr) {
if (ctr && (ctr % 3) == 0) {
res += ",";
}
res.push_back(s[i]);
}
std::reverse(res.begin(), res.end());
return res;
}

static int GetOMPThreadCount() {
return omp_get_max_threads() >> 1;
}

/*!
* \brief Generic bidirectional sanity test
*/
TEST(MEMORY_TEST, MemsetAndMemcopyPerformance) {
const size_t GB = 1000000000;  // upper bound on test size; memset should never be slower by then
uint64_t base = 100000;
std::list<uint64_t> memset_times, omp_set_times, memcpy_times, omp_copy_times;
size_t pass = 0;
do {
memset_times.clear();
omp_set_times.clear();
memcpy_times.clear();
omp_copy_times.clear();

const size_t test_size = 2 * base;
std::cout << "====================================" << std::endl
<< "Data size: " << pretty_num(test_size) << std::endl << std::flush;

std::unique_ptr<uint8_t[]> buffer_1(new uint8_t[test_size]), buffer_2(new uint8_t[test_size]);
uint8_t *src = buffer_1.get(), *dest = buffer_2.get();

for (size_t x = 0; x < 5; ++x) {
// Init memory with different values
memset(src, 3, test_size);
memset(dest, 255, test_size); // wipe out some/all of src cache

// memset
uint64_t start = test::perf::getNannoTickCount();
memset(src, 123, test_size);
const uint64_t memset_time = test::perf::getNannoTickCount() - start;

start = test::perf::getNannoTickCount();
#pragma omp parallel for num_threads(GetOMPThreadCount())
for (int64_t i = 0; i < static_cast<int64_t>(test_size); ++i) {
src[i] = 42;
}
const uint64_t omp_set_time = test::perf::getNannoTickCount() - start;

start = test::perf::getNannoTickCount();
memcpy(dest, src, test_size);
const uint64_t memcpy_time = test::perf::getNannoTickCount() - start;

// bounce the cache and dirty logic
memset(src, 6, test_size);
memset(dest, 200, test_size);

start = test::perf::getNannoTickCount();
#pragma omp parallel for num_threads(GetOMPThreadCount())
for (int64_t i = 0; i < static_cast<int64_t>(test_size); ++i) {
dest[i] = src[i];
}
const uint64_t omp_copy_time = test::perf::getNannoTickCount() - start;

memset_times.push_back(memset_time);
omp_set_times.push_back(omp_set_time);
memcpy_times.push_back(memcpy_time);
omp_copy_times.push_back(omp_copy_time);

std::cout << "memset time: " << pretty_num(memset_time) << " ns" << std::endl
<< "omp set time: " << pretty_num(omp_set_time) << " ns" << std::endl
<< std::endl;
std::cout << "memcpy time: " << pretty_num(memcpy_time) << " ns" << std::endl
<< "omp copy time: " << pretty_num(omp_copy_time) << " ns" << std::endl
<< std::endl;
}
std::cout << "------------------------------------" << std::endl;
if (average(memset_times) > average(omp_set_times)) {
std::cout << "<< MEMSET SLOWER FOR " << pretty_num(test_size) << " items >>" << std::endl;
}
if (average(memcpy_times) > average(omp_copy_times)) {
std::cout << "<< MEMCPY SLOWER FOR " << pretty_num(test_size) << " items >>" << std::endl;
}
if (!pass) {
GTEST_ASSERT_LE(average(memset_times), average(omp_set_times));
GTEST_ASSERT_LE(average(memcpy_times), average(omp_copy_times));
}
base *= 10;
++pass;
} while (test::performance_run
         && base <= GB
         && (average(memset_times) < average(omp_set_times)
             || average(memcpy_times) < average(omp_copy_times)));
}
#endif // _OPENMP
