Memset/memcpy/omp profiling tester (apache#8380)
* Memory set/copy speed assertions

* Memory set/copy speed assertions

* ..

* ..

* ..

* ..

* bounce some cache

* lint

* Timing output for test_factorization_module when Verbose enabled (apache#8363)

* Timing output for test_factorization_module when Verbose enabled

* Trigger build

* Trigger build

* Trigger build

* Misc fixes for sparse distributed training (apache#8345)

* remove mshadow::range in init_op.h

* add unit test

* remove pass by ptr, add unit test for pull empty weights

* fix range in key partition

* remove wrong comment

* remove change for partition

* remove unused var

* add int64 to arange. add checkpointing example

* Fix the Readme (apache#8369)

* Allow test to converge (apache#8351)

* Allow test to converge

* Trigger build

* Trigger build

* Trigger build

* Update cudnn_algoreg-inl.h (apache#7988)

* [Perl] emulate Python zip() for Perl (apache#8192)

* [Perl] emulate Python zip() for Perl

* [Perl] retool zip() uses away from the callback form

* add profile option for frontend profiling to image script (apache#8171)

* add profile option for frontend profiling to image script

* Update image_classification.py

* Update image_classification.py

* Fix Typo (classification) (apache#8376)

Fix a typo in the example readme.

* Use omp_get_max_threads() when OMP_NUM_THREADS environment variable is set (apache#8379)

* CPU optimization for ActivationOp (apache#8296)

* CPU optimization for ActivationOp

Significant improvement on CPU (several orders of magnitude in some cases, especially on the backward pass).
Very slight improvement on GPU.

OLD MSHADOW APPROACH
--------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 18.948 ms, avg: 0.037896 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.658 ms, avg: 0.003316 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 57.973 ms, avg: 0.115946 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 4.748 ms, avg: 0.009496 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 703.446 ms, avg: 1.40689 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 56.255 ms, avg: 0.11251 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 2107.77 ms, avg: 4.21554 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 168.483 ms, avg: 0.336966 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 24122.2 ms, avg: 48.2443 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1908.7 ms, avg: 3.8174 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.637 ms, avg: 0.003274 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.665 ms, avg: 0.00333 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.562 ms, avg: 0.003124 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.661 ms, avg: 0.003322 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.635 ms, avg: 0.00327 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.702 ms, avg: 0.003404 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.83 ms, avg: 0.00366 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.041 ms, avg: 0.004082 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.08 ms, avg: 0.00416 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.688 ms, avg: 0.005376 ms X 500 passes

NEW MXNET_OP APPROACH
---------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 80.748 ms, avg: 0.161496 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.176 ms, avg: 0.002352 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 7.881 ms, avg: 0.015762 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 2.181 ms, avg: 0.004362 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 111.48 ms, avg: 0.22296 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 5.408 ms, avg: 0.010816 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 333.439 ms, avg: 0.666878 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 21.331 ms, avg: 0.042662 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 3429.19 ms, avg: 6.85837 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 286.324 ms, avg: 0.572648 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.618 ms, avg: 0.003236 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.671 ms, avg: 0.003342 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.629 ms, avg: 0.003258 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.728 ms, avg: 0.003456 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.753 ms, avg: 0.003506 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.756 ms, avg: 0.003512 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.704 ms, avg: 0.003408 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.791 ms, avg: 0.003582 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.032 ms, avg: 0.004064 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.143 ms, avg: 0.004286 ms X 500 passes

* lint

* Trigger build

* Trigger build

* Negative begin and end support for csr slice (apache#8241)

* negative index support for sparse slice

* fix lint

* getitem(int) for csr ndarray, support a[-1]

* remove unnecessary argument

* unittest and doc update

* Preparing for 0.12.0.rc0: Final changes before RC (apache#8301)

* Final changes before RC

* Updates to NEWS.md

* Updates

* Enable smoothing in softmax operator (apache#8125)

* v0.12 regression: Fix registration of children for Block (apache#8277)

* Fix Block not registering children

If the attribute was already set to something different than Block (e.g. None),
it was not being registered.

* fix if / elif for block children registration

* trigger test

* Add fix from apache#8152

* Add tests from apache#8152

* Revert "[CMAKE] Fix windows cmake build" (apache#8311)

* Revert "Added my code signing key (apache#8293)"

This reverts commit 22ab185.

* Revert "[CMAKE] Fix windows cmake build (apache#8227)"

This reverts commit 1c1c788.

* Fixed broken links: https was pointing to http for mxnet.io (apache#8300)

* Update rnn.md (apache#8320)

* fluent methods for missed ops (apache#8329)

* update ps lite (apache#8327)

* Fix unused type warning (apache#8316)

* Trigger build

* Trigger build

* Misc fixes for sparse distributed training (apache#8345)

* remove mshadow::range in init_op.h

* add unit test

* remove pass by ptr, add unit test for pull empty weights

* fix range in key partition

* remove wrong comment

* remove change for partition

* remove unused var

* add int64 to arange. add checkpointing example

* Fix the Readme (apache#8369)

* Allow test to converge (apache#8351)

* Allow test to converge

* Trigger build

* Trigger build

* Trigger build

* Update cudnn_algoreg-inl.h (apache#7988)

* [Perl] emulate Python zip() for Perl (apache#8192)

* [Perl] emulate Python zip() for Perl

* [Perl] retool zip() uses away from the callback form

* add profile option for frontend profiling to image script (apache#8171)

* add profile option for frontend profiling to image script

* Update image_classification.py

* Update image_classification.py

* Fix Typo (classification) (apache#8376)

Fix a typo in the example readme.

* do gtest test

* add assert and do higher runs as performance test only (when performance test flag set)

* Trigger build

* lint

* Trigger build

* Sparse operator performance improvement (apache#8412)

* sparse rsprsp perf improvements

* Clean up

* dtype default to source_array.dtype for sparse ndarrays (apache#8403)

* derive default dtype/ctx from input for sparse ndarrays

* add gpu tests

* fix lint. add doc

* remove default_ctx code

* bug fix when passing dtype to array()

* update doc

* remove extra line

* also check ctx

* fix using default mean pixels (apache#8352)

* fix gluon.data.RecordFileDataset (apache#8353)

* upgrade MKL (apache#8378)

* Lint fix (apache#8402)

* Trigger build
cjolivier01 authored Oct 28, 2017
1 parent afac234 commit 076703c
Showing 2 changed files with 164 additions and 0 deletions.
19 changes: 19 additions & 0 deletions tests/cpp/include/test_perf.h
@@ -60,6 +60,25 @@ inline uint64_t getMicroTickCount() {
#endif
}

/*! \brief current timestamp: billionths of a second (nanoseconds) */
inline uint64_t getNannoTickCount() {
#ifndef _WIN32
struct timeval tv;
gettimeofday(&tv, NULL);
return (uint64_t(tv.tv_sec) * 1000000 + tv.tv_usec) * 1000;
#else
LARGE_INTEGER CurrentTime;
LARGE_INTEGER Frequency;

QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&CurrentTime);

CurrentTime.QuadPart *= 1000000000;
CurrentTime.QuadPart /= Frequency.QuadPart;
return CurrentTime.QuadPart;
#endif
}

/*! \brief millisecond tick count */
inline uint64_t getTickCount() {
return getMicroTickCount() / 1000;
145 changes: 145 additions & 0 deletions tests/cpp/misc/memory_test.cc
@@ -0,0 +1,145 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

/*!
* \file memory_test.cc
* \brief Perf/profile run of ActivationOp
* \author Chris Olivier
*/

#include <gtest/gtest.h>
#include <dmlc/omp.h>
#include <mxnet/tensor_blob.h>
#include "../include/test_util.h"
#include "../include/test_perf.h"

using namespace mxnet;

#ifdef _OPENMP
template<typename Container>
static typename Container::value_type average(const Container& cont) {
typename Container::value_type avg = 0;
const size_t sz = cont.size();
for (auto iter = cont.begin(), e_iter = cont.end(); iter != e_iter; ++iter) {
avg += *iter / sz;  // divide incrementally to avoid overflowing the accumulator
}
return avg;
}

static std::string pretty_num(uint64_t val) {
std::string res, s = std::to_string(val);
size_t ctr = 0;
for (int i = static_cast<int>(s.size()) - 1; i >= 0; --i, ++ctr) {
if (ctr && (ctr % 3) == 0) {
res += ",";
}
res.push_back(s[i]);
}
std::reverse(res.begin(), res.end());
return res;
}

static int GetOMPThreadCount() {
return omp_get_max_threads() >> 1;
}

/*!
* \brief Generic bidirectional sanity test
*/
TEST(MEMORY_TEST, MemsetAndMemcopyPerformance) {
const size_t GB = 1000000000;  // upper bound on test size; memset should never be slower by then
uint64_t base = 100000;
std::list<uint64_t> memset_times, omp_set_times, memcpy_times, omp_copy_times;
size_t pass = 0;
do {
memset_times.clear();
omp_set_times.clear();
memcpy_times.clear();
omp_copy_times.clear();

const size_t test_size = 2 * base;
std::cout << "====================================" << std::endl
<< "Data size: " << pretty_num(test_size) << std::endl << std::flush;

std::unique_ptr<uint8_t[]> buffer_1(new uint8_t[test_size]), buffer_2(new uint8_t[test_size]);
uint8_t *src = buffer_1.get(), *dest = buffer_2.get();

for (size_t x = 0; x < 5; ++x) {
// Init memory with different values
memset(src, 3, test_size);
memset(dest, 255, test_size); // wipe out some/all of src cache

// memset
uint64_t start = test::perf::getNannoTickCount();
memset(src, 123, test_size);
const uint64_t memset_time = test::perf::getNannoTickCount() - start;

start = test::perf::getNannoTickCount();
#pragma omp parallel for num_threads(GetOMPThreadCount())
for (int64_t i = 0; i < static_cast<int64_t>(test_size); ++i) {
src[i] = 42;
}
const uint64_t omp_set_time = test::perf::getNannoTickCount() - start;

start = test::perf::getNannoTickCount();
memcpy(dest, src, test_size);
const uint64_t memcpy_time = test::perf::getNannoTickCount() - start;

// bounce the cache and dirty logic
memset(src, 6, test_size);
memset(dest, 200, test_size);

start = test::perf::getNannoTickCount();
#pragma omp parallel for num_threads(GetOMPThreadCount())
for (int64_t i = 0; i < static_cast<int64_t>(test_size); ++i) {
dest[i] = src[i];
}
const uint64_t omp_copy_time = test::perf::getNannoTickCount() - start;

memset_times.push_back(memset_time);
omp_set_times.push_back(omp_set_time);
memcpy_times.push_back(memcpy_time);
omp_copy_times.push_back(omp_copy_time);

std::cout << "memset time: " << pretty_num(memset_time) << " ns" << std::endl
<< "omp set time: " << pretty_num(omp_set_time) << " ns" << std::endl
<< std::endl;
std::cout << "memcpy time: " << pretty_num(memcpy_time) << " ns" << std::endl
<< "omp copy time: " << pretty_num(omp_copy_time) << " ns" << std::endl
<< std::endl;
}
std::cout << "------------------------------------" << std::endl;
if (average(memset_times) > average(omp_set_times)) {
std::cout << "<< MEMSET SLOWER FOR " << pretty_num(test_size) << " items >>" << std::endl;
}
if (average(memcpy_times) > average(omp_copy_times)) {
std::cout << "<< MEMCPY SLOWER FOR " << pretty_num(test_size) << " items >>" << std::endl;
}
if (!pass) {
GTEST_ASSERT_LE(average(memset_times), average(omp_set_times));
GTEST_ASSERT_LE(average(memcpy_times), average(omp_copy_times));
}
base *= 10;
++pass;
} while (test::performance_run
         && base <= GB
         && (average(memset_times) < average(omp_set_times)
             || average(memcpy_times) < average(omp_copy_times)));
}
#endif // _OPENMP
