Array allocation using Hugepages #582

akaszynski · 2024-05-13T20:38:45Z

First off, thanks for this great library! I've been moving several Cython extensions over to C++ and nanobind compiles quickly and lets me stick with C++.

I ran into a perplexing issue and ended up resolving it thanks to #164 and @wjakob's reply to this discussion. I wanted to post my findings here just in case others would find it useful.

tl;dr

~~Arrays over your L2 cache size will benefit greatly from using numpy to allocate them instead of new. You can do this in nanobind with:~~

NumPy requests huge pages for large allocations on Linux whenever possible, which greatly improves performance when working with large arrays (I saw this particularly once the array exceeded my L2 cache size). You can capture this benefit in one of two ways. Either use a NumPy-allocated array:

auto ret = nb::cast<nb::ndarray<nb::numpy, double>>(nb::module_::import_("numpy").attr("ndarray")(shape));

Or, alternatively, call madvise with MADV_HUGEPAGE yourself after allocating the array:

  double *data = new double[total];

#ifdef __linux__
  // Use madvise with MADV_HUGEPAGE to ask the kernel to back large arrays with huge pages on Linux
  const size_t hugepage_threshold = 1u << 22u; // 4MB threshold
  const size_t page_size = 4096u;
    
  if (total * sizeof(double) >= hugepage_threshold) {
    uintptr_t data_addr = reinterpret_cast<uintptr_t>(data);
    size_t offset = page_size - (data_addr % page_size);
    size_t length = total * sizeof(double) - offset;
        
    // Intentionally not checking for errors, following NumPy's approach
    madvise(reinterpret_cast<void*>(data_addr + offset), length, MADV_HUGEPAGE);
  }
#endif
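
To make the alignment arithmetic in the snippet above concrete, here is a small Python sketch of the same offset and length computation. The function name `madvise_range` and the example address are hypothetical and only for illustration; the 4096-byte page size and 4 MiB threshold are taken from the snippet.

```python
PAGE_SIZE = 4096               # base page size assumed by the snippet above
HUGEPAGE_THRESHOLD = 1 << 22   # 4 MiB, same threshold as above


def madvise_range(addr, nbytes):
    """Return the (start, length) that would be passed to madvise, or None below the threshold."""
    if nbytes < HUGEPAGE_THRESHOLD:
        return None
    # Distance to the next page boundary. An already aligned address
    # yields a full page (4096), not 0, so the first page is skipped.
    offset = PAGE_SIZE - (addr % PAGE_SIZE)
    return addr + offset, nbytes - offset


# Hypothetical address 16 bytes past a page boundary, for an 8 MiB array.
start, length = madvise_range(0x7F0000000010, 8 * 1024**2)
print(hex(start), length)
```

The start address always lands on a page boundary, and the length shrinks by the same number of bytes, so the advised range never runs past the end of the allocation.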

Details

Timings

First, the timings. I'm comparing four approaches:

C++ allocation, C++ fill: allocate using new double[], populate using std::fill, and return an nb::ndarray
NB Numpy, C++ fill: allocate from C++ using nb::module_::import_("numpy"), populate using std::fill
Numpy allocation C++ fill: allocate using np.empty, populate using std::fill
Numpy ones: allocate and fill using np.ones

Array size (MB):            7.6
C++ allocation, C++ fill:   0.00294
NB Numpy, C++ fill:         0.00313
Numpy allocation C++ fill:  0.00299
Numpy ones:                 0.00299

Array size (MB):            30.5
C++ allocation, C++ fill:   0.00730
NB Numpy, C++ fill:         0.00736
Numpy allocation C++ fill:  0.00756
Numpy ones:                 0.00728

Array size (MB):            32.0
C++ allocation, C++ fill:   0.00805
NB Numpy, C++ fill:         0.00808
Numpy allocation C++ fill:  0.00799
Numpy ones:                 0.00798

Array size (MB):            33.6
C++ allocation, C++ fill:   0.06601
NB Numpy, C++ fill:         0.02817
Numpy allocation C++ fill:  0.02809
Numpy ones:                 0.03011

Array size (MB):            68.7
C++ allocation, C++ fill:   0.07914
NB Numpy, C++ fill:         0.03305
Numpy allocation C++ fill:  0.03296
Numpy ones:                 0.03479

Code

from timeit import timeit
import numpy as np
from my_ext import testing

for k, n in [(1000, 200), (2000, 100), (2047, 100), (2100, 100), (3000, 50)]:
    shape = (k, k)

    def nb_allocate():
        array = testing.empty(*shape)
        testing.fill(array, 1.0)
        return array

    def nb_np_allocate():
        array = testing.empty_np(list(shape))
        testing.fill(array, 1.0)
        return array

    def np_allocate():
        array = np.empty(shape)
        testing.fill(array, 1.0)
        return array

    def np_ones():
        array = np.ones(shape)
        return array

    # warmup
    array = nb_allocate()
    timeit(nb_allocate, number=n)

    # Note: totals are divided by 10 rather than by n, so these are not per-call times
    print('Array size (MB):            %.1f' % (array.nbytes / 1024**2))
    print('C++ allocation, C++ fill:   %.5f' % (timeit(nb_allocate, number=n)/10))
    print('NB Numpy, C++ fill:         %.5f' % (timeit(nb_np_allocate, number=n)/10))
    print('Numpy allocation C++ fill:  %.5f' % (timeit(np_allocate, number=n)/10))
    print('Numpy ones:                 %.5f' % (timeit(np_ones, number=n)/10))
    print()
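
The script divides each total by a fixed 10 while n varies between 50 and 200, so numbers are only directly comparable between rows that share the same n. If per-call times are wanted, one option is to divide by n instead. A minimal sketch, where `per_call` and the stand-in workload `make_list` are hypothetical names and not part of the extension:

```python
from timeit import timeit


def per_call(fn, number):
    """Average wall-clock seconds per call of fn over `number` runs."""
    return timeit(fn, number=number) / number


def make_list():
    # Stand-in workload (hypothetical); swap in nb_allocate, np_allocate, etc.
    return [0.0] * 100_000


for n in (200, 100, 50):
    print('n=%3d  %.6f s/call' % (n, per_call(make_list, n)))
```

With this normalization the reported value no longer depends on how many repetitions were chosen for each array size.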

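As you can see, filling an array allocated with new double[total] becomes more than twice as slow as filling a NumPy-allocated one (0.066 vs. 0.028 at 33.6 MB) once the array exceeds roughly 32 MB, which matches the total L2 cache size on my machine: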

$ lscpu
...
Caches (sum of all):     
  L1d:                   896 KiB (24 instances)
  L1i:                   1.3 MiB (24 instances)
  L2:                    32 MiB (12 instances)
  L3:                    36 MiB (1 instance)
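
As a quick check of where the benchmark sizes fall relative to that figure, the arithmetic can be sketched in Python. The 32 MiB value is the lscpu total summed over all 12 L2 instances, and 8 bytes per element assumes double; the name `size_mib` is only for illustration.

```python
L2_TOTAL_BYTES = 32 * 1024**2   # "L2: 32 MiB (12 instances)" from lscpu
ITEMSIZE = 8                    # sizeof(double)


def size_mib(k):
    """Size in MiB of a k-by-k array of doubles."""
    return k * k * ITEMSIZE / 1024**2


for k in (1000, 2000, 2047, 2100, 3000):
    nbytes = k * k * ITEMSIZE
    print('k=%4d  %5.1f MiB  exceeds 32 MiB: %s'
          % (k, size_mib(k), nbytes > L2_TOTAL_BYTES))
```

The k = 2047 case sits just under 32 MiB and k = 2100 just over it, which is exactly where the timings above diverge.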

Implementation

Here's the C++ implementation:

#include <algorithm>  // std::fill

#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>

namespace nb = nanobind;
using namespace nb::literals;


template <typename T, size_t N>
using ContigArray = nb::ndarray<nb::numpy, T, nb::ndim<N>, nb::c_contig>;

// Allocate using new
nb::ndarray<nb::numpy, double, nb::c_contig> EmptyArr(size_t nrow, size_t ncol) {
  size_t total = nrow * ncol;  // size_t avoids int overflow for large arrays
  double *data = new double[total];
  size_t shape[2] = {nrow, ncol};

  // Create and return the ndarray with the given shape and ownership capsule
  nb::capsule owner(data, [](void *p) noexcept { delete[] (double *)p; });
  return nb::ndarray<nb::numpy, double, nb::c_contig>(data, 2, shape, owner);
}

// Allocate using numpy
nb::ndarray<nb::numpy, double> EmptyNPArr(nb::list &shape) {
  auto ret = nb::cast<nb::ndarray<nb::numpy, double>>(nb::module_::import_("numpy").attr("ndarray")(shape));

  return ret;
}

void Fill(ContigArray<double, 2> array, double value) {
  size_t total = array.shape(0) * array.shape(1);
  auto data = array.data();
  std::fill(data, data + total, value);
}

NB_MODULE(testing, m) {
  m.def("empty", &EmptyArr);
  m.def("empty_np", EmptyNPArr);
  m.def("fill", &Fill);
};

This was built using scikit-build-core with -O3.

Additional Findings

There seems to be no apparent overhead from nb::module_::import_("numpy").
Various attempts at using malloc and aligned_alloc did not improve the performance, and sometimes hindered it.
-march=native did not help

wjakob · 2024-05-13T21:22:40Z

Maintainer

Thanks for the report! This looks like more of a puzzle than an answer since we don't have a good reason to explain the reported slowdown.

One quick thought: could it be the allocation operation itself? I saw at some point that numpy uses a memory cache that lets it avoid calls to malloc (see https://numpy.org/doc/stable/reference/c-api/data_memory.html), whereas C++ might have the cost of provisioning 4K pages from the OS, writing the page table, etc. What if you allocate in C++ but reuse a previously allocated memory buffer?

1 reply

akaszynski May 14, 2024
Author

Thanks for the tip! Turns out this is due to the use of Hugepages on Linux. Disabling it using:

np._core.multiarray._set_madvise_hugepage(False)

Results in near identical performance across the benchmark for large arrays:

Array size (MB):            68.7
C++ allocation, C++ fill:   0.00742
NB Numpy, C++ fill:         0.00740
Numpy allocation C++ fill:  0.00744
Numpy ones:                 0.00760
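
Whether the MADV_HUGEPAGE hint can have any effect also depends on the kernel's transparent hugepage setting. On Linux the active mode is shown in brackets in /sys/kernel/mm/transparent_hugepage/enabled (for example "always [madvise] never"); as far as I know the hint matters mainly in madvise mode and is ignored under never. A minimal sketch for checking it, with hypothetical helper names:

```python
from pathlib import Path

THP_ENABLED = Path('/sys/kernel/mm/transparent_hugepage/enabled')


def parse_thp_mode(text):
    """Extract the bracketed mode from e.g. 'always [madvise] never'."""
    for token in text.split():
        if token.startswith('[') and token.endswith(']'):
            return token[1:-1]
    return None


def current_thp_mode():
    """Return the active THP mode, or None if the file is unavailable (e.g. non-Linux)."""
    try:
        return parse_thp_mode(THP_ENABLED.read_text())
    except OSError:
        return None


print(current_thp_mode())
```

Checking this first helps when the benchmark shows no difference between the two allocation paths on another machine.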

Therefore, if you want to use Hugepages the way Numpy does, you can follow their implementation in:
https://github.com/numpy/numpy/blob/1d3201e061ff26ef5defa90b45ce7f5a32f2219f/numpy/_core/src/multiarray/alloc.c#L111-L122

I've implemented it in a minimal working example (MWE) that uses a 4MB threshold and aligns the address to a page boundary.

#ifdef __linux__
#include <sys/mman.h>
#endif

#include <algorithm>  // std::fill
#include <cstdint>    // uintptr_t

#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>

namespace nb = nanobind;
using namespace nb::literals;

template <typename T, size_t N>
using ContigArray = nb::ndarray<nb::numpy, T, nb::ndim<N>, nb::c_contig>;

// Allocate using new
nb::ndarray<nb::numpy, double, nb::c_contig> EmptyArr(size_t nrow,
                                                      size_t ncol) {
  size_t total = nrow * ncol;  // size_t avoids int overflow for large arrays
  double *data = new double[total];

#ifdef __linux__
  // Use madvise with MADV_HUGEPAGE to ask the kernel to back large arrays with huge pages on Linux
  const size_t hugepage_threshold = 1u << 22u; // 4MB threshold
  const size_t page_size = 4096u;
    
  if (total * sizeof(double) >= hugepage_threshold) {
    uintptr_t data_addr = reinterpret_cast<uintptr_t>(data);
    size_t offset = page_size - (data_addr % page_size);
    size_t length = total * sizeof(double) - offset;
        
    // Intentionally not checking for errors, following NumPy's approach
    madvise(reinterpret_cast<void*>(data_addr + offset), length, MADV_HUGEPAGE);
  }
#endif

  size_t shape[2] = {nrow, ncol};

  // Create and return the ndarray with the given shape and ownership capsule
  nb::capsule owner(data, [](void *p) noexcept { delete[] (double *)p; });
  return nb::ndarray<nb::numpy, double, nb::c_contig>(data, 2, shape, owner);
}

// Allocate using numpy
nb::ndarray<nb::numpy, double> EmptyNPArr(nb::list &shape) {
  auto ret = nb::cast<nb::ndarray<nb::numpy, double>>(
      nb::module_::import_("numpy").attr("ndarray")(shape));

  return ret;
}

void Fill(ContigArray<double, 2> array, double value) {
  size_t total = array.shape(0) * array.shape(1);
  auto data = array.data();
  std::fill(data, data + total, value);
}

NB_MODULE(testing, m) {
  m.def("empty", &EmptyArr);
  m.def("empty_np", EmptyNPArr);
  m.def("fill", &Fill);
};

With this implementation we now see near-identical performance between Numpy and C++ when arrays exceed the L2 cache size:

Array size (MB):            68.7
C++ allocation, C++ fill:   0.00331
NB Numpy, C++ fill:         0.00351
Numpy allocation C++ fill:  0.00333
Numpy ones:                 0.00344
