Array allocation using Hugepages #582
akaszynski
started this conversation in
Show and tell
-
Thanks for the report! This looks like more of a puzzle than an answer, since we don't have a good explanation for the reported slowdown. One quick thought: could it be the allocation operation itself? I saw at some point that numpy uses a memory cache that lets it avoid calls to malloc (see https://numpy.org/doc/stable/reference/c-api/data_memory.html), whereas C++ might incur the cost of provisioning 4K pages from the OS, writing the page table, etc. What if you allocate in C++ but reuse a previously allocated memory buffer?
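One way to try the reuse experiment suggested above is to time two fills on the same buffer: the first pays for the page faults, the second writes to pages that are already mapped. This is only a rough sketch in plain C++ without nanobind; `time_fill`, `FillTimings` and `compare_first_touch_and_reuse` are made-up names for illustration, not anything from the original post.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <memory>

// Time a single std::fill over n doubles, in seconds.
double time_fill(double* data, std::size_t n, double value) {
    auto start = std::chrono::steady_clock::now();
    std::fill(data, data + n, value);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

struct FillTimings {
    double first_touch;  // fill on a freshly allocated buffer (includes page faults)
    double reused;       // fill on the same buffer again (pages already mapped)
    double checksum;     // values read back so the fills stay observable
};

// Hypothetical helper: compare a first-touch fill with a fill on a reused
// buffer. n must be at least 1.
FillTimings compare_first_touch_and_reuse(std::size_t n) {
    // new double[n] leaves the elements uninitialized, so the pages are
    // typically not touched until the first fill writes to them.
    std::unique_ptr<double[]> buf(new double[n]);
    FillTimings t{};
    t.first_touch = time_fill(buf.get(), n, 1.0);
    t.checksum = buf[n - 1];
    t.reused = time_fill(buf.get(), n, 2.0);
    t.checksum += buf[n - 1];
    return t;
}
```

If `first_touch` is much larger than `reused` for large `n`, that would point at page provisioning rather than the fill itself. The numbers will depend heavily on the machine and kernel settings.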
-
First off, thanks for this great library! I've been moving several Cython extensions over to C++ and nanobind compiles quickly and lets me stick with C++.
I ran into a perplexing issue and ended up resolving it thanks to #164 and @wjakob's reply to this discussion. I wanted to post my findings here just in case others would find it useful.
tl;dr
Arrays over your L2 cache size will benefit greatly from using numpy to allocate them instead of new. You can do this in nanobind by importing numpy with nb::module_::import_("numpy") and allocating the array through it.
NumPy uses hugepages whenever possible on Linux, which greatly improves performance when allocating large arrays (I saw this particularly when exceeding L2 cache size). You can capture this benefit in one of two ways: either use a NumPy-allocated array, or allocate the array yourself and call madvise on it.
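The code from the original post did not survive here, so the following is only a sketch of what the madvise route could look like on Linux, not the author's implementation. It assumes a 2 MiB hugepage size (typical on x86-64, but not guaranteed), and `alloc_hugepage_doubles` and `make_filled` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <sys/mman.h>  // madvise, MADV_HUGEPAGE (Linux)

// Assumed hugepage size; 2 MiB is common on x86-64 Linux but not guaranteed.
constexpr std::size_t kHugePageSize = 2 * 1024 * 1024;

// Hypothetical helper: allocate n doubles aligned to kHugePageSize and ask
// the kernel to back the range with transparent hugepages.
// Release the result with std::free.
double* alloc_hugepage_doubles(std::size_t n) {
    std::size_t bytes = n * sizeof(double);
    // Round up to a whole number of hugepages so the advised range is aligned.
    bytes = (bytes + kHugePageSize - 1) / kHugePageSize * kHugePageSize;
    void* ptr = nullptr;
    if (posix_memalign(&ptr, kHugePageSize, bytes) != 0) {
        throw std::bad_alloc();
    }
#ifdef MADV_HUGEPAGE
    // Advisory only: the return value is ignored because the kernel may
    // decline, for example when transparent hugepages are disabled.
    madvise(ptr, bytes, MADV_HUGEPAGE);
#endif
    return static_cast<double*>(ptr);
}

// Usage: allocate, fill, and hand ownership to the caller.
double* make_filled(std::size_t n, double value) {
    double* data = alloc_hugepage_doubles(n);
    std::fill(data, data + n, value);
    return data;
}
```

In a nanobind extension, a pointer like this would typically be handed to nb::ndarray together with an nb::capsule whose deleter calls std::free; that wiring is left out here to keep the sketch independent of nanobind.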
Details
Timings
First, the timings. I'm comparing four approaches:
- new double[], populating using std::fill and returning a nb::ndarray
- nb::module_::import_("numpy"), populating using std::fill
- np.empty, populating using std::fill
- np.ones
Code
As you can see, fill gets slower when the array is allocated using new double[total] once the array starts to exceed my L2 cache.
Implementation
Here's the C++ implementation:
This was built using scikit-build-core with -O3.
Additional Findings
- nb::module_::import_("numpy").
- malloc and aligned_alloc did not improve the performance, and sometimes hindered it.
- -march=native did not help.
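For anyone who wants to check the effect without nanobind or Python in the loop, here is a rough sketch of a size sweep that compares new[] against an aligned allocation with a hugepage hint. It is not the benchmark from this post: the function names are invented, the 2 MiB alignment is an assumption, and each timing includes allocation, fill, and free.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>
#include <sys/mman.h>  // madvise (Linux)

struct SweepRow {
    std::size_t n;           // number of doubles
    double new_seconds;      // allocate with new[], fill, free
    double advised_seconds;  // allocate aligned + madvise, fill, free
};

// Seconds taken by work().
template <class F>
double seconds(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

// Allocate with new[], fill with ones, return the last element (n >= 1).
double fill_with_new(std::size_t n) {
    double* data = new double[n];
    std::fill(data, data + n, 1.0);
    double last = data[n - 1];
    delete[] data;
    return last;
}

// Same work, but on an aligned buffer with a hugepage hint (n >= 1).
double fill_with_advice(std::size_t n) {
    constexpr std::size_t align = 2 * 1024 * 1024;  // assumed 2 MiB hugepage size
    std::size_t bytes = (n * sizeof(double) + align - 1) / align * align;
    void* ptr = nullptr;
    if (posix_memalign(&ptr, align, bytes) != 0) throw std::bad_alloc();
#ifdef MADV_HUGEPAGE
    madvise(ptr, bytes, MADV_HUGEPAGE);  // advisory; failure is ignored
#endif
    double* data = static_cast<double*>(ptr);
    std::fill(data, data + n, 1.0);
    double last = data[n - 1];
    std::free(ptr);
    return last;
}

// Hypothetical sweep over array sizes, doubling from first up to last.
std::vector<SweepRow> sweep(std::size_t first, std::size_t last) {
    if (first == 0) first = 1;   // doubling from zero would never advance
    std::vector<SweepRow> rows;
    volatile double sink = 0.0;  // keeps the fills from being optimized away
    for (std::size_t n = first; n <= last; n *= 2) {
        SweepRow row{n, 0.0, 0.0};
        row.new_seconds = seconds([&] { sink = fill_with_new(n); });
        row.advised_seconds = seconds([&] { sink = fill_with_advice(n); });
        rows.push_back(row);
    }
    return rows;
}
```

Calling something like `sweep(1 << 10, 1 << 26)` and printing the rows should show whether the two columns diverge once the arrays outgrow the L2 cache. Single runs are noisy, so repeating each size a few times and taking the minimum is probably a good idea.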