
ENH: Enable huge pages in all Linux builds #14216

Merged
merged 3 commits on Aug 20, 2019

Conversation

pentschev
Contributor

This PR modifies memory allocation so that huge pages for large arrays are enabled in all Linux builds (even if MADV_HUGEPAGE is not defined at build time). As discussed in #14177, this isn't currently the case, which means huge pages may never be available depending on the NumPy build, even if the runtime system supports them. Neither the conda-forge nor the pip builds of 1.17.0 appear to include support for huge pages.

The madvise() man page indicates that it returns EINVAL if the advice is invalid. At runtime, that means huge pages will be enabled if available; otherwise, for NumPy's use case, it's enough to simply ignore the error and fall back to the default behavior.

Fixes #14177.

cc @hmaarrfk @jakirkham @charris @seberg @mrocklin
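For illustration, the approach boils down to the following standalone C sketch (an approximation for this discussion, not the actual NumPy diff; alloc_hinted is a made-up helper name, while the 4 MiB threshold and page-alignment arithmetic mirror the PR):

    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #ifndef MADV_HUGEPAGE
    /* Hard-coded Linux value, so the hint can be issued even when the
     * build headers predate kernel 2.6.38. */
    #define MADV_HUGEPAGE 14
    #endif

    /* Hypothetical helper for this sketch, not NumPy's actual code. */
    static void *alloc_hinted(size_t size)
    {
        void *p = malloc(size);
        /* Only bother for allocations of at least 4 MiB, as in the PR. */
        if (p != NULL && size >= ((size_t)1 << 22)) {
            /* Advance to the next 4 KiB page boundary inside the block. */
            uintptr_t offset = 4096u - (uintptr_t)p % 4096u;
            /* Return value deliberately ignored: on kernels without huge
             * page support, madvise() fails with EINVAL and the default
             * (non-huge-page) behavior is kept. */
            madvise((void *)((uintptr_t)p + offset), size - offset,
                    MADV_HUGEPAGE);
        }
        return p;
    }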

// Use code 14 (MADV_HUGEPAGE) if it isn't defined. This gives
// a chance of enabling huge pages even if built with a Linux
// kernel < 2.6.38
madvise((void*)((npy_uintp)p + offset), length, 14);
Contributor

Why not just define MADV_HUGEPAGE and let it go through a more similar code path?

Contributor Author

Yes, thanks for the good idea, let me fix that.

Contributor Author

Done in the latest commits, let me know if that's ok with you.

Contributor

ok with me means nothing, just a suggestion ;)

Contributor Author

Well, if it's ok with you, it means that you probably don't disagree and don't have an idea to make it even cleaner. :)

@@ -76,14 +83,7 @@ _npy_alloc_cache(npy_uintp nelem, npy_uintp esz, npy_uint msz,
if (NPY_UNLIKELY(nelem * esz >= ((1u<<22u)))) {
npy_uintp offset = 4096u - (npy_uintp)p % (4096u);
npy_uintp length = nelem * esz - offset;
#ifdef MADV_HUGEPAGE
madvise((void*)((npy_uintp)p + offset), length, MADV_HUGEPAGE);
Contributor

I think you should add a comment here that we explicitly do not check the return code of madvise because we want to optimistically ask for HUGEPAGE even in the case that an old kernel is being used.

Contributor Author

Agreed. I added that comment in the latest commit.

@charris charris changed the title from "ENH, REL: Enable huge pages in all Linux builds" to "ENH: Enable huge pages in all Linux builds" on Aug 7, 2019
@hmaarrfk
Contributor

Is there a test we can run to see if this is actually working? I'm trying to build numpy on my machine and it is unclear if madvise is having an effect.

@pentschev
Contributor Author

There are only two ways I found I could test: adding a printf to the code, or comparing the time to the pip package. This is the reason I didn't add a test in this PR: I don't really know how to ensure it gets called.

@@ -25,10 +25,14 @@

#include <assert.h>

-#ifdef HAVE_SYS_MMAN_H
+#ifdef NPY_OS_LINUX
Member

Is NPY_OS_LINUX a true superset of HAVE_SYS_MMAN_H? Just curious.

Contributor Author

<sys/mman.h> dates back to Unix and I've confirmed that it exists at least since kernel 2.6.0 (possibly before), so I think it's a safe bet.

Member

Sorry, I mean that NPY_OS_SOLARIS or NPY_OS_DARWIN may currently satisfy HAVE_SYS_MMAN_H but are not covered by NPY_OS_LINUX.

Contributor Author

That's right, but we're only including <sys/mman.h> for MADV_HUGEPAGE, and, AFAIK, that's a feature only available on Linux.

@hmaarrfk
Contributor

Is there something we can do to test with hugeadm?

@pentschev
Contributor Author

I tried checking hugeadm --pool-list, but I didn't see anything changing before/after allocation. Other options don't seem to provide any information that would be useful for checking that. What I observed is that changes to AnonHugePages in /proc/meminfo match NumPy allocations with the changes in this PR, whereas the counter stays unchanged with the NumPy 1.17.0 pip package.
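A rough way to reproduce that observation (a sketch assuming the standard Linux /proc/meminfo layout; anon_huge_kb is a helper name invented for this example) is to sample AnonHugePages before and after a large, madvise-hinted allocation:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_HUGEPAGE
    #define MADV_HUGEPAGE 14  /* same fallback value as in the PR */
    #endif

    /* Read the AnonHugePages counter (in kB) from /proc/meminfo. */
    static long anon_huge_kb(void)
    {
        char line[256];
        long kb = -1;
        FILE *f = fopen("/proc/meminfo", "r");
        if (f == NULL)
            return -1;
        while (fgets(line, sizeof line, f) != NULL) {
            if (sscanf(line, "AnonHugePages: %ld kB", &kb) == 1)
                break;
        }
        fclose(f);
        return kb;
    }

    int main(void)
    {
        size_t n = (size_t)256 * 1024 * 1024;   /* 256 MiB */
        long before = anon_huge_kb();
        char *p = malloc(n);
        if (p == NULL)
            return 1;
        /* Hint the page-aligned interior of the block, as the PR does. */
        uintptr_t offset = 4096u - (uintptr_t)p % 4096u;
        madvise((void *)((uintptr_t)p + offset), n - offset, MADV_HUGEPAGE);
        memset(p, 1, n);                         /* fault the pages in */
        printf("AnonHugePages: %ld kB -> %ld kB\n", before, anon_huge_kb());
        free(p);
        return 0;
    }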

@hmaarrfk
Contributor

The increased usage of AnonHugePages is what I understood would happen too. Apparently, after much use, the address space can get fragmented, so there are warnings that one should use pools for better allocation. I guess I'll have to look into those later.

Thanks for working through this.

@pentschev
Contributor Author

I think fragmentation will happen anyway for sufficiently large memory objects. I'm not sure I follow why that is relevant for a test, though. The only thing I was pointing out was that an alternative (probably not a great one) is to check that the array actually gets flagged as a huge page, which we could do by reading /proc/meminfo.

@charris
Member

charris commented Aug 16, 2019

@pentschev Do you feel happy with this? I'm a bit wary about making the backport, but if this is unlikely to break anything -- and I'm not sure how we find out without a release -- I'll do so.

@pentschev
Contributor Author

It worked well in my tests @charris, but that's really one setup among many that could exist. I also think it's unlikely to break anything; in the event that it does, I think it will be caught very easily, since all memory allocation will then be broken.

If this could be in the next release, I would be very happy about that! :)

@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Aug 17, 2019
@charris
Member

charris commented Aug 17, 2019

I removed the backport label, as I'd be a bit more comfortable having the rc testing period in 1.18. If no one complains, I'll put this in soon.

@pentschev
Contributor Author

I would certainly prefer to have this backported at least to 1.17. We have some use cases where this gives us a major performance boost that we otherwise won't see for the next 5-6 months, until 1.18 gets released.

@charris charris added the 09 - Backport-Candidate PRs tagged should be backported label Aug 20, 2019
@charris charris added this to the 1.17.1 release milestone Aug 20, 2019
@charris
Member

charris commented Aug 20, 2019

I'm always willing to gamble and relearn why it is a bad idea. But this looks pretty safe, so let's give it a try in 1.17.

@charris charris merged commit 950dd4e into numpy:master Aug 20, 2019
/*
 * Use code 14 (MADV_HUGEPAGE) if it isn't defined. This gives a chance of
 * enabling huge pages even if built with a Linux kernel < 2.6.38
 */
Member

Just to clarify, this means that even if compiled on a machine running an old kernel, it will enable huge pages when run on a newer kernel?

Contributor Author

That's correct. An attempt to enable it will be made, and if it errors out, the huge page hint is simply ignored.

@charris
Member

charris commented Aug 20, 2019

@pentschev Have you tested this running on old kernels?

@charris
Member

charris commented Aug 20, 2019

Checking... CentOS 5.11, used for building wheels, had kernel-2.6.18, so I guess we will know if this works soon enough. @pentschev There should be a nightly wheel built sometime in the next 24 hours that has this in it, so we can test it. The wheels can be found at https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com/

@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Aug 20, 2019
@charris charris removed this from the 1.17.1 release milestone Aug 20, 2019
@pentschev
Contributor Author

> @pentschev Have you tested this running on old kernels?

I haven't tested that, but according to kernel documentation, simply ignoring the error should still give us the old behavior.

> Checking... CentOS 5.11, used for building wheels, had kernel-2.6.18, so I guess we will know if this works soon enough.

That's great, please let me know how it goes. Should anything go wrong I can work on a fix for that.

> There should be a nightly wheel built sometime in the next 24 hours that has this in it, so we can test it. The wheels can be found at https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com/

Just checked, it's still not available, but I'll watch for it and try it out once it's up. Thanks for the link @charris!

@charris
Member

charris commented Aug 20, 2019

I went ahead and triggered a wheels build for master. How long it takes varies; you can check it at https://travis-ci.org/MacPython/numpy-wheels.

@charris
Member

charris commented Aug 21, 2019

@pentschev New wheels are up.

@pentschev
Contributor Author

@charris I see NumPy 1.18.0.dev0 wheels; were we supposed to have a 1.17.1 as well?

Either way, it works well on newer kernels at least; I'm using 4.15.0-58 (Ubuntu 16.04) for the results below:

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.17.0'

In [3]: %%time
   ...: a = np.ones(int(1024**3), dtype='u1')
   ...:
   ...:
CPU times: user 131 ms, sys: 384 ms, total: 514 ms
Wall time: 512 ms

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.18.0.dev0+950dd4e'

In [3]: %%time
   ...: a = np.ones(int(1024**3), dtype='u1')
   ...:
   ...:
CPU times: user 24 ms, sys: 201 ms, total: 225 ms
Wall time: 223 ms

The second one is much faster, as expected. :)

@pentschev
Contributor Author

I used the Python 3.7 package for the results in my previous comment.

@charris
Member

charris commented Aug 21, 2019

I only triggered the master build, so dev0 is correct. Looks like it is working as advertised, so I'll go ahead with the backport. Thanks for testing.

@charris
Member

charris commented Aug 21, 2019

I went ahead and backported to 1.16.x as well, can't hurt.

@Plantain

> The increased usage of AnonHugePages is what I understood would happen too. Apparently, after much use, the address space can get fragmented, so there are warnings that one should use pools for better allocation. I guess I'll have to look into those later.

> I'm always willing to gamble and relearn why it is a bad idea. But this looks pretty safe, so let's give it a try in 1.17.

Was there ever any follow-up on the fragmentation? Because after a few days of digging, it turns out this patch absolutely tanks performance on our workloads.
We do quite heavy computation with NumPy (manipulating ~1 GB arrays). When we're running one process, this patchset makes no difference; when I run 30 in parallel (on a machine with 128 cores and 400 GB RAM), the address space gets fragmented faster than the kernel can defragment it, and performance drops to ~5% of single-threaded performance, barely faster than processing serially.

Toggling huge pages off and back on clears the issue entirely, and the processes run in parallel with almost linear scaling.
Similarly, setting /sys/kernel/mm/transparent_hugepage/defrag to "defer" restores the linear scaling.
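For anyone debugging the same symptom, the currently active THP defrag policy can be read back from sysfs; a minimal sketch (assuming the standard sysfs path; changing the policy to "defer" requires root and is typically done via a shell redirect or an init script):

    #include <stdio.h>

    int main(void)
    {
        /* The kernel marks the active mode in brackets, e.g.
         * "always defer defer+madvise [madvise] never". */
        char buf[256];
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/defrag", "r");
        if (f == NULL) {
            perror("transparent_hugepage/defrag");
            return 1;
        }
        if (fgets(buf, sizeof buf, f) != NULL)
            printf("THP defrag policy: %s", buf);
        fclose(f);
        return 0;
    }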

@charris
Member

charris commented Oct 23, 2020

@Plantain My understanding was that there were early problems with fragmentation that were fixed in later Linux kernels. However, it is certainly possible that your setup exposes new problems. What kernel version do you have?

@Plantain

Plantain commented Oct 23, 2020 via email

@charris
Member

charris commented Oct 23, 2020

That is pretty recent... Googling around, there seems to be a fair amount of ongoing work on the fragmentation problem, as well as various ways to disable huge pages that may be distro-dependent.

@seberg
Member

seberg commented Oct 23, 2020

@Plantain it seemed mainly an issue on old kernels, so we only rolled it back on those (but only in 1.19, IIRC). We also have an environment variable to disable madvise hugepages: https://numpy.org/doc/1.19/reference/global_state.html?highlight=global

That said, I was personally never quite sure what the tradeoffs are here.
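As a sketch of what such an opt-out typically looks like on the C side (an assumed shape of the logic for illustration, not NumPy's exact implementation; see the linked documentation for the authoritative behavior):

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical check mirroring an NUMPY_MADVISE_HUGEPAGE-style switch:
     * unset -> keep the default, "0" -> disable the hint, else enable it. */
    static int hugepage_hint_enabled(void)
    {
        const char *env = getenv("NUMPY_MADVISE_HUGEPAGE");
        if (env == NULL)
            return 1;   /* default: issue the madvise hint */
        return strcmp(env, "0") != 0;
    }

Since the variable is presumably read when NumPy initializes, it needs to be set in the process environment before NumPy is imported.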

@pentschev
Contributor Author

I think this will depend a lot on OS configuration, and we'll probably not find a one-size-fits-all solution, unfortunately. Having variables such as NUMPY_MADVISE_HUGEPAGE still seems to me like the right approach from the NumPy side, so if fragmentation can be a problem, maybe it would be good to document that.

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Jan 11, 2023
…r spilling (#12399)

A `bytearray` is zero-initialized using `calloc`, which we don't need.  Additionally, `numpy.empty` both skips the zero-initialization and uses hugepages when available <numpy/numpy#14216>.

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - https://github.com/jakirkham

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - https://github.com/jakirkham

URL: #12399