Add new patcher memory hooks #1495
Test FAILed. |
Looks like some work will be needed to handle the disable dlopen case. Just need to ensure -ldl is still there. |
@gpaulsen Does Mark have a github account? |
Test FAILed. |
Test FAILed. |
On Mar 25, 2016, at 4:21 PM, rhc54 [email protected] wrote:
It would really be good if we didn't have to do that (always ensure that -ldl is there). You can't build statically if you have to -ldl. |
Is there an alternate way to look up a symbol without using dlsym? |
Not that I'm aware of. Put differently: is it ok to say that static builds can't use leave_pinned? |
Not sure. lanl has always used static builds to limit the overhead on the filesystem. |
Maybe static components should be different than disable dlopen? |
How about we add a configure option like --enable-mca-static-components? Just changes all the build modes to static.... If there isn't already an option like that that isn't --disable-dlopen. |
Do you static builds, or disable-dlopen builds? Remember:
If you're building |
Ideally we want .so's for the projects but .a's for the components. The only way to get that now is --disable-dlopen. |
Well. I could make a list of all static components but that is a pain. |
I'm confused -- how do you get .a's for the components?
We only install |
No, I don't want .a's installed. I want all components to be part of their respective .so. That is not possible unless --disable-dlopen is used. |
Sounds like
My whole point here is: we really, really shouldn't require |
I ran an experiment using &munmap instead of dlsym(RTLD_NEXT, "munmap") and on -static binaries it worked. The opal_patch_symbol() function could take a function pointer instead of a string as input and the caller could decide which to pass in. Is that something that would fit with the Open MPI build-time options? I think a -static build is the only situation where using &munmap instead of dlsym() would work though. |
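A minimal sketch of the "name or function pointer" idea from the comment above. Everything here is hypothetical: `patch_target_t`, `resolve_target()`, and `demo_lookup()` are illustration names, not the `opal_patch_symbol()` interface. The resolver is passed in by the caller so that a static build can hand over `&munmap` directly and never reference `dlsym()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Generic function-pointer type, so code addresses never pass through void*. */
typedef void (*generic_fn_t)(void);

/* Hypothetical name resolver supplied by the caller (e.g. a dlsym() wrapper). */
typedef generic_fn_t (*lookup_fn_t)(const char *name);

/* Hypothetical patch request: either a symbol name or a direct address. */
typedef struct {
    const char  *name;   /* used only when addr is NULL */
    generic_fn_t addr;   /* used when non-NULL, e.g. (generic_fn_t) &munmap */
} patch_target_t;

/* Return the address to patch, or NULL if it cannot be resolved. */
static generic_fn_t resolve_target(const patch_target_t *t, lookup_fn_t lookup)
{
    assert(t != NULL);
    if (t->addr != NULL) {
        return t->addr;           /* static build: no symbol lookup needed */
    }
    if (t->name != NULL && lookup != NULL) {
        return lookup(t->name);   /* dynamic build: defer to the resolver */
    }
    return NULL;
}

/* Stand-in resolver for the demo; a real one might wrap dlsym(RTLD_NEXT, name). */
static generic_fn_t demo_lookup(const char *name)
{
    return strcmp(name, "munmap") == 0 ? (generic_fn_t) &munmap : NULL;
}
```

With this shape, a caller built with `-static` would fill in `addr` and pass a NULL resolver, while a dynamic build would fill in `name` and pass its own lookup function. |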
@hjelmn Looks like there's a legit segv in the Mellanox jenkins. Is this due to some kind of interaction with UCX? |
@markalle I refactored the code a little, but the linux patcher component is not working on Power8 with RHEL 6.7. I am working on the fix now. |
:bot:retest: |
We had a call today with me, @hjelmn, @gpaulsen, and @markalle to discuss.
tl;dr: Everything looks good so far. We need a few more things that @hjelmn thinks he can finish tomorrow.
More detail: Here are the items left:
|
I downloaded and built at 36abc43. I don't have a UCX setup readily available but I ran openib. For me the 'overwrite' patcher worked, but 'linux' failed. I ran one rank on each of two hosts, with When I run a test that involves a lot of malloc/free of the send/recv buffers the 'overwrite' patcher gave data corruption. I haven't thoroughly reviewed the code updates on 'overwrite' for putting the bits in a struct and allowing save/restore, but it seems right. |
Yeah, one-sided is only through BTLs at the moment. Leave pinned is used for large user buffers in that path or large buffers in the ob1 send path. |
Ok, got this working with ucx and the overwrite component. That will be sufficient for working with UCX. I will delete the linux component and re-add it once I figure out why it isn't quite working. |
Ok, that Mellanox failure is interesting. I can reproduce it on my systems. Using gdb I was able to determine what is happening. During pthread_create the heap is extended, and as part of the process mmap is called on the heap with PROT_NONE (which triggers an invalidate). The protection is then removed by a call to mprotect. I think the right answer is not to hook mmap, since cache entries within the heap are still valid even if they are protected with PROT_NONE; they just can't be used for communication. @markalle Thoughts? |
@hppritcha I think this is converging on a working solution. I hope to move this into master on Sunday so it starts going through MTT in preparation of the PR to v2.x. |
Test FAILed. |
Between mmap(,,PROT_NONE,,,) and madvise(,,MADV_DONTNEED) I guess I had a lot of unnecessary interception. I can't get any testcase to prove it's needed. Both were added out of excessive caution of the unknown rather than a known issue. And unnecessary clearing of cached pins, especially in cases that don't happen much, doesn't really hurt anything generally. I didn't follow the particulars of why the mmap(,,PROT_NONE,,,) interception failed, and I'm a little concerned since in general we should expect the interception to happen from "practically anywhere". But having said that I don't have any objection to removing it. I think it was indeed just another unnecessary interception. |
The intercept failed in a threaded test. One thread was communicating out of a heap pointer during a call to pthread_create. pthread_create calls mmap (PROT_NONE) on the heap, then calls mprotect(). Not sure why it does that, but it caused an abort because of the ongoing communication. |
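For readers unfamiliar with the mmap(PROT_NONE)-then-mprotect() sequence described above, here is a minimal sketch of the general reserve-then-commit pattern on Linux. It is not glibc's code and does not reproduce what pthread_create does internally; `reserve_region()`, `commit_region()`, and `reserve_commit_demo()` are hypothetical helper names.

```c
#define _GNU_SOURCE   /* for MAP_ANONYMOUS under strict compiler modes */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve `len` bytes of address space with no access rights at all. */
static void *reserve_region(size_t len)
{
    assert(len > 0);
    void *p = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Make the first `len` bytes of a reserved region readable and writable. */
static int commit_region(void *base, size_t len)
{
    return mprotect(base, len, PROT_READ | PROT_WRITE);
}

/* Reserve two pages, commit the first, touch it, then release everything.
 * Returns 0 on success and -1 on any failure. */
static int reserve_commit_demo(void)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    char *base = reserve_region(2 * page);
    if (base == NULL) {
        return -1;
    }
    if (commit_region(base, page) != 0) {
        munmap(base, 2 * page);
        return -1;
    }
    memset(base, 0xab, page);   /* would fault before commit_region() */
    int ok = ((unsigned char) base[page - 1] == 0xab);
    return (munmap(base, 2 * page) == 0 && ok) ? 0 : -1;
}
```

Between the two calls the pages exist in the address space but cannot be read or written, which matches the window discussed in this thread. |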
BTW, the mremap fix seems to have fixed the linux patcher. I am leaving the overwrite patcher as the default because it stacks with UCX.... Or not. Was running with UCX. Weird that it works when ucx is in use. |
BTW, the way I get the linux patcher component to work with UCX is to patch dlsym to look for UCX's attempt to get the original symbol. Requires that our hooks are first but it appears to work every time. |
But is there any reason the thread couldn't call munmap() on some other piece of memory at the same location where you saw the mmap(,,PROT_NONE,,,)? It sounds like you're saying it's a bad spot to handle a memory event from, and happily the PROT_NONE was an event we can safely ignore. But in general, if legitimate memory events could happen at this location and our processing can break because it's too early in thread creation, that's still a concern.

I don't follow what happens very far down in opal_mem_hooks_release_hook(), but if there's much at all going on there I'd be tempted to whittle it into an atomic FIFO that gets processed later. I did a little experimenting with the opal struct for that with multiple threads and was pretty impressed. E.g. I used

    typedef struct item_s {
        opal_list_item_t super;
        void *addr;
        size_t len;
    } item_t;
    opal_lifo_t free_list;
    opal_fifo_t memevent_list;

Several threads generate fake memory events and add them to the fifo:

    item = (item_t*) opal_lifo_pop_atomic(&free_list);
    item->addr and item->len = fake memory event;
    opal_fifo_push_atomic(&memevent_list, &item->super);

while another thread reads off the events, e.g.

    item = (item_t*) opal_fifo_pop_atomic(&memevent_list);
    opal_lifo_push_atomic(&free_list, &item->super);

I guess I shouldn't be surprised the calls did what they claim. But I always feel like I need to test things like that. The data structure did well, and I'm hoping ultimately there's not much more than cmpxchg and maybe short sections of lock-holding going on in their implementation. In general I'd want as little as possible going on upon receiving a memory event. |
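The "record now, process later" idea above can be sketched without the opal containers. The block below is an illustration only, using C11 atomics instead of `opal_fifo_t`/`opal_lifo_t`; `mem_event_t`, `mem_event_defer()`, `mem_event_drain()`, and `log_event()` are hypothetical names. Producers push with a single compare-and-swap loop; one consumer detaches the whole list with an atomic exchange and reverses it to recover arrival order.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical record of one memory-release event (caller-owned storage). */
typedef struct mem_event {
    struct mem_event *next;
    void  *addr;
    size_t len;
} mem_event_t;

/* Head of a lock-free list of pending events, newest first. */
static _Atomic(mem_event_t *) pending_events;

/* Interception path: one CAS loop, no locks, no allocation. */
static void mem_event_defer(mem_event_t *ev)
{
    assert(ev != NULL);
    mem_event_t *head = atomic_load_explicit(&pending_events, memory_order_relaxed);
    do {
        ev->next = head;   /* head is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(&pending_events, &head, ev,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Safe context, single consumer: detach everything, then handle oldest first.
 * Returns the number of events handled. */
static size_t mem_event_drain(void (*handle)(void *addr, size_t len))
{
    mem_event_t *list = atomic_exchange_explicit(&pending_events,
                                                 (mem_event_t *) NULL,
                                                 memory_order_acquire);
    mem_event_t *fifo = NULL;
    while (list != NULL) {              /* reverse to restore arrival order */
        mem_event_t *next = list->next;
        list->next = fifo;
        fifo = list;
        list = next;
    }
    size_t handled = 0;
    while (fifo != NULL) {
        mem_event_t *next = fifo->next; /* read before the handler runs */
        handle(fifo->addr, fifo->len);
        fifo = next;
        handled++;
    }
    return handled;
}

/* Demo handler: logs event lengths in the order they are processed. */
static size_t order_log[8];
static size_t order_n;
static void log_event(void *addr, size_t len)
{
    (void) addr;
    if (order_n < 8) {
        order_log[order_n++] = len;
    }
}
```

Because the consumer takes the whole list in one exchange rather than popping nodes individually, this particular shape avoids the ABA hazard of a lock-free pop; whether it suits the real hook path is a separate question. |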
This commit adds support for runtime binary patching. The support is broken down into two parts: util/opal_patcher.[ch] which contains the functionality for runtime patching of symbols, and mca/memory/patcher which patches the various symbols needed to provide support for memory hooks. This work is preliminary and is based off work donated by IBM. The patcher code is disabled if dlopen is disabled. Signed-off-by: Nathan Hjelm <[email protected]>
This commit makes it possible to set relative priorities for components. Before the addition of the patcher component there was only one component that would run on any system, but that is no longer the case. When determining which component to open, each component's query function is called and the one that returns the highest priority is opened. The default priority of the patcher component is set slightly higher than the old ptmalloc2/ummunotify component.

This commit fixes a long-standing break in the abstraction of the memory components. ompi_mpi_init.c was referencing the linux malloc hook initialize function to ensure the hooks are initialized for libmpi.so. The abstraction break has been fixed by adding a memory base function that calls the open memory component's malloc hook init function if it has one. The code is not yet complete but is intended to support ptmalloc in 2.0.0. In that case the base function will always call the ptmalloc hook init if it exists. Signed-off-by: Nathan Hjelm <[email protected]>
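The selection rule in this commit message (query every component, open the one with the highest priority) can be sketched as follows. This is a simplified illustration, not the MCA data structures: `component_t`, `select_component()`, and the priority values 50 and 40 are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical component descriptor; not the real MCA structure. */
typedef struct {
    const char *name;
    /* Returns 0 and sets *priority if usable here; nonzero otherwise. */
    int (*query)(int *priority);
} component_t;

/* Query every component and return the usable one with the highest
 * priority, or NULL if none can run. Ties keep the earliest entry. */
static const component_t *select_component(const component_t *c, size_t n)
{
    const component_t *best = NULL;
    int best_priority = 0;
    for (size_t i = 0; i < n; i++) {
        int priority = 0;
        assert(c[i].query != NULL);
        if (c[i].query(&priority) != 0) {
            continue;   /* component declined to run on this system */
        }
        if (best == NULL || priority > best_priority) {
            best = &c[i];
            best_priority = priority;
        }
    }
    return best;
}

/* Demo query functions with made-up priorities. */
static int query_patcher(int *p)     { *p = 50; return 0; }
static int query_legacy(int *p)      { *p = 40; return 0; }
static int query_unavailable(int *p) { (void) p; return -1; }
```

Given the three demo components, the one whose query reports 50 is selected regardless of its position in the array.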
Signed-off-by: Nathan Hjelm <[email protected]>
The --enable-static gives us what we want: statically linked components. Signed-off-by: Nathan Hjelm <[email protected]>
This commit adds a framework to abstract runtime code patching. Components in the new framework can provide functions for either patching a named function or a function pointer. The latter functionality is not being used but may provide a way to allow memory hooks when dlopen functionality is disabled.

This commit adds two different flavors of code patching. The first is provided by the overwrite component. This component overwrites the first several instructions of the target function with code to jump to the provided hook function. The hook is expected to provide the full functionality of the hooked function. The second is provided by the linux patcher component, which is based on the memory hooks in ucx. It only works on Linux and operates by overwriting function pointers in the symbol table. In this case the hook is free to call the original function using the function pointer returned by dlsym. Both components restore the original functions when the patcher framework closes.

Changes had to be made to support Power/PowerPC with the Linux dynamic loader patcher. Some of the changes:
- Move code necessary for powerpc/power support to the patcher base. The code is needed by both the overwrite and linux components.
- Move patch structure down to base and move the patch list to mca_patcher_base_module_t. The structure has been modified to include a function pointer to the function that will unapply the patch. This allows the mixing of multiple different types of patches in the patch_list.
- Update linux patching code to keep track of the matching between GOT entry and original (unpatched) address. This allows us to completely clean up the patch on finalize.

All patchers keep track of the changes they made so that they can be reversed when the patcher framework is closed. At this time there are bugs in the Linux dynamic loader patcher, so its priority is lower than the overwrite patcher. Signed-off-by: Nathan Hjelm <[email protected]>
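A small sketch of the bookkeeping this commit message describes: each patch records what it changed plus a function pointer that undoes it, and closing the framework walks the list and restores everything. This is an illustration under simplifying assumptions, not the mca_patcher code. A plain function-pointer variable (`op_slot`) stands in for a GOT entry, and `patch_t`, `patch_apply()`, and `patch_close_all()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*op_fn_t)(int);

/* Stand-in for a GOT slot: callers always go through this pointer. */
static int real_op(int x) { return x + 1; }
static int hook_op(int x) { return x * 100; }
static op_fn_t op_slot = real_op;

/* Hypothetical patch record: what was changed and how to undo it. */
typedef struct patch {
    struct patch *next;
    op_fn_t *slot;                         /* location that was overwritten */
    op_fn_t  original;                     /* value to put back */
    void   (*restore)(struct patch *self); /* per-patch undo function */
} patch_t;

static patch_t *patch_list;   /* most recent patch first */

/* Undo function for slot-style patches. */
static void restore_slot(patch_t *p)
{
    *p->slot = p->original;
}

/* Redirect *slot to hook and remember how to reverse it. */
static void patch_apply(patch_t *p, op_fn_t *slot, op_fn_t hook)
{
    assert(p != NULL && slot != NULL && hook != NULL);
    p->slot = slot;
    p->original = *slot;
    p->restore = restore_slot;
    p->next = patch_list;
    patch_list = p;
    *slot = hook;
}

/* Framework close: reverse every patch, newest first. */
static void patch_close_all(void)
{
    while (patch_list != NULL) {
        patch_t *p = patch_list;
        patch_list = p->next;
        p->restore(p);
    }
}
```

Storing the undo function in the record is what would let different patch types (slot rewrites, instruction overwrites) share one list, as the second bullet above suggests.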
This commit removes the ptmalloc2 memory hooks. This is necessary in order to support lazy registration of memory hooks, a feature that is not supported by the ptmalloc hooks but is supported by the new patcher hooks. Signed-off-by: Nathan Hjelm <[email protected]>
This commit fixes bugs that can cause crashes and memory corruption when the mremap hook is called. The problem occurs because of the ellipsis (...) in the mremap intercept function. The ellipsis covers the optional new_addr argument on Linux. This commit removes the ellipsis and adds an explicit 5th argument.

This commit also adds a hook for shmdt. The code only works on Linux at the moment as it needs to read /proc/self/maps to determine the size of the shared memory segment.

Additionally, this commit removes the mmap hook. There is no apparent benefit for detecting mmap(..., PROT_NONE, ...) and it seems to cause problems when threads are in use. Signed-off-by: Nathan Hjelm <[email protected]>
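A minimal sketch of an mremap interceptor with an explicit fifth parameter, as the commit message describes. It is Linux-only and is not the code from this PR: `intercept_mremap()`, `mremap_events`, and `mremap_demo()` are hypothetical, and the counter merely stands in for the call into the release hook.

```c
#define _GNU_SOURCE   /* mremap() and MREMAP_* are GNU extensions */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical counter standing in for the real release-hook call. */
static size_t mremap_events;

/* Interceptor with an explicit fifth parameter. new_address only matters
 * when MREMAP_FIXED is set; otherwise it is forwarded and has no effect. */
static void *intercept_mremap(void *start, size_t old_len, size_t new_len,
                              int flags, void *new_address)
{
    assert(start != NULL);
    mremap_events++;   /* a real hook would invalidate cached registrations */
    return mremap(start, old_len, new_len, flags, new_address);
}

/* Grow a one-page mapping to two pages through the interceptor.
 * Returns 0 on success and -1 on any failure. */
static int mremap_demo(void)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    p[0] = 42;
    char *q = intercept_mremap(p, page, 2 * page, MREMAP_MAYMOVE, NULL);
    if (q == MAP_FAILED) {
        munmap(p, page);
        return -1;
    }
    int ok = (q[0] == 42);   /* contents survive the move or resize */
    q[2 * page - 1] = 1;     /* the new tail is writable */
    return (munmap(q, 2 * page) == 0 && ok) ? 0 : -1;
}
```

Declaring all five parameters means the hook always has a well-defined value to forward, whether or not the caller set MREMAP_FIXED.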
Because of the removal of the linux memory component it is no longer necessary to initialize the memory component in opal_init(). This commit moves the initialization to the creation of the first rcache component. Signed-off-by: Nathan Hjelm <[email protected]>
Ok, squished this down quite a bit. Will merge after Jenkins. |
Mmm, kind of. I'm satisfied that being in a thread whose heap is PROT_NONE is a special state from which we can't function normally. But I'm not positive that by not intercepting mmap(PROT_NONE) we guarantee we'll never intercept a memory event from a time when a thread is in this extra special bad state. If I thought the state was detectable and the interception routines could do something other than segv, I'd still push for that since I like to be extra robust in case unexpected things happen. But I do tend to agree that I don't actually expect an munmap() to happen while the heap is PROT_NONE. |
I was looking at the "from_alloc" field, and asked Yossi/etc what MXM's requirements were. They indicated false positives would be okay but not false negatives. So we need to set it any time it's possible we're inside malloc/free/etc. So I think intercept_munmap() needs to use true. |
@hjelmn Thoughts on opal_mem_hooks_release_hook (,,true) in intercept_munmap()? I think "true" is required since it could be in malloc/free/etc at the time. And generally I'd expect most munmap()s happen from inside a free(). |
@markalle Yeah, a malloc implementation could munmap a pointer. Will update. |
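A sketch of the agreed change, using stand-ins rather than the real symbols: `release_hook()` below records its arguments in place of `opal_mem_hooks_release_hook()`, and `original_munmap()` is a dummy for the unpatched call. The three-argument shape with a trailing `from_alloc` flag follows the "(,,true)" usage quoted above and is otherwise an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Records the most recent release notification (demo only). */
static struct {
    void  *addr;
    size_t len;
    bool   from_alloc;
    int    calls;
} last_release;

/* Stand-in for opal_mem_hooks_release_hook(). */
static void release_hook(void *addr, size_t len, bool from_alloc)
{
    assert(addr != NULL);
    last_release.addr = addr;
    last_release.len = len;
    last_release.from_alloc = from_alloc;
    last_release.calls++;
}

/* Dummy for the unpatched munmap; the demo never unmaps anything. */
static int original_munmap(void *addr, size_t len)
{
    (void) addr;
    (void) len;
    return 0;
}

/* munmap may be reached from inside free(), so always report
 * from_alloc = true: a false positive is acceptable, a false negative
 * is not. */
static int intercept_munmap(void *addr, size_t len)
{
    release_hook(addr, len, true);
    return original_munmap(addr, len);
}
```

The flag is hard-coded rather than detected because, per the comments above, the interceptor cannot cheaply tell whether it was entered from the allocator. |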
Also -- just to link everything together: here's the corresponding 2.x PR: open-mpi/ompi-release#1079 |
This PR will add support for binary patching on ppc, x86, x86_64, and ia64. The work is not yet complete.