RFC: explicit shared memory #4497
Comments
It's also worth noting that UCX / UCX-Py can communicate over shared memory through several transports, including CMA, knem, xpmem, SysV, and mmap. @pentschev and I did a bit of experimenting last summer: openucx/ucx#5322
Yeah, there are a few of these shared array libraries that have floated around. For example, Sturla Molden built one, SharedArray (as used here), and a few others that are escaping me atm.

joblib itself uses a ramdisk for shared memory, though this is only on Linux. We discussed how this could be extended to other OSes in issue ( joblib/joblib#705 ). There's an old gist with some of this work.

More generally, memory mapping may make sense, as the mapped memory actually lives in shared memory and so would be accessible from other processes. Of course we would want to go back through and check how OSes support this.

As of Python 3.8, multiprocessing has its own shared_memory module. IIRC this evolved out of the posix_ipc and sysv_ipc libraries (maybe from here?). On older Python versions, one might use …
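For anyone following along, here is a minimal sketch of the Python 3.8 multiprocessing.shared_memory API mentioned above (the sizes and contents are just for illustration): one process creates a named block backing a NumPy array, and another process attaches to it by name without copying.

```python
# Minimal sketch of the Python >= 3.8 multiprocessing.shared_memory API:
# one process creates a named block backing a NumPy array, another attaches
# to it by name with no copy of the data.
import numpy as np
from multiprocessing import shared_memory

data = np.arange(1_000_000, dtype="float64")

# "Producer": create the block and copy the data in once
shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
src = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
src[:] = data[:]

# "Consumer" (normally a different process): attach by name, zero-copy view
shm2 = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(data.shape, dtype=data.dtype, buffer=shm2.buf)
print(view[:5])

# Every process closes its handle; exactly one process unlinks the block
shm2.close()
shm.close()
shm.unlink()
```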
I think this already works with … The implementation behind …
A more general tool to wrap pandas and pandas-like objects was developed prior to the release of Python 3.8 (and the shared memory constructs in the Python standard library), but it was not suitable for inclusion in the core because it was not general purpose enough, even if it was useful.
cc @maartenbreddels (who may be interested in this or have thoughts here)
To be clear: this isn't really about the implementation (there are lots of ways to do it), but the concept. Is this something that would really be useful, and should someone (one of us?) spend some time making it happen? (I think all of the tools, including py>3.8's, end up reserving shm with essentially the same OS call.)
This has been a recurring need in my projects where the entirety of the data needs to be accessible to all workers yet duplicating the data for each worker would exceed available system memory. I have primarily used NumPy arrays and pandas DataFrames backed by shared memory thus far -- it would be cool to use dask as part of this. My situation might only qualify as a single data point but I am not the only weirdo with this sort of need.
+1 on this as well.
Do I understand correctly that multiprocessing.shared_memory survives indefinitely after the death of all processes that reference it? e.g. it's like writing to /dev/shm. Any idea on how to prevent memory leaks when a worker crashes or is SIGKILL'ed for any reason?
Yes, I believe that can be a problem @crusaderky. If it's explicitly opt-in, then we can require the user to manually delete blocks should this happen. Note that on macOS, shm is still a thing, but I don't think you can access it via the file system.
Or you use a SharedMemoryManager. All OSes have some form of shared memory. The multiprocessing shared memory support is built on POSIX shared memory primitives (though there is an older SYS V set as well). These also exist in FreeBSD, which macOS derives from. Of course this also exists on Linux. Though Windows does not have these POSIX primitives, it does have other primitives that this module uses. So it should be possible to use it on all major platforms.
@jakirkham's suggestion to use a SharedMemoryManager is a great one because its primary purpose is to help ensure that shared memory gets freed and released. By using a manager to create shared memory (via …) … To @crusaderky's original question: independent of the use of a manager, shared memory created through …
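To make the lifetime discussion concrete, here is a hedged sketch (not anyone's actual code) of how a SharedMemoryManager can be used so that blocks are released when the manager shuts down, even if a consumer process dies without cleaning up its own references.

```python
# Hedged sketch: blocks created through a SharedMemoryManager are unlinked
# when the manager shuts down, even if a worker process died without
# releasing its own references.
import numpy as np
from multiprocessing.managers import SharedMemoryManager

with SharedMemoryManager() as smm:
    shm = smm.SharedMemory(size=1024 * 8)                  # managed block
    arr = np.ndarray((1024,), dtype="float64", buffer=shm.buf)
    arr[:] = 42.0
    # ... hand shm.name to other processes here ...
# leaving the block shuts the manager down and frees every block it created
```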
Maybe this changed in 3.8, but IIRC the child processes would only see shared memory created in the parent process, and only the memory created before they were forked/cloned. Maybe this changed, because there are POSIX APIs that should allow for sharing new memory. In vaex, I mainly use memory mapping, and if I pickle a 'dataset' (the data backing the dataframe), I just pickle the path and offsets (and other metadata), so the child processes share the same physical memory. However, for data created on the fly, it would be great if this could go over shared memory.
For … The new … Right, memory-mapped files were the other idea mentioned above. These also use shared memory where data from the file is read and written, so they effectively behave like named shared memory, except that a file is also involved.
Implementing something like this on top of e.g. the nanny process, so as to make it responsible for holding results, would not only provide the efficiency gains discussed here, but also make the cluster much more resistant to malfunctioning of the worker processes (such as, for example, #4345). Workers could then easily restart without the need to destroy (or even transport) the data they hold.
@Zaharid, that's an interesting idea that I hadn't considered. You could ensure that only real results are kept, while accumulated memory leaks are cleared. From the point of view of my original suggestion, though, it would involve having to communicate with the nanny, instead of the simple deserialisation in the worker that I have proposed. To everyone: this is clearly an interesting topic and we have many ideas. How do we make progress here?
Yeah, I guess we need to determine what the use cases are. Thus far we have brought up:
1. Loading a large dataset once and sharing it among all workers on a node (the original proposal).
2. Using shared memory for communication between workers on the same node.
3. Keeping results outside the worker process (e.g. in the nanny) so they survive worker failures.
Anything else? At least for 1 and 3, mmapped files sound attractive. For 1 this is useful because the data we are working with needs to be loaded somehow, and mmapped files take advantage of shared memory already. For 3, mmapped files provide the option to not only share data between a nanny and a worker to protect against worker failures, but, with a file, they can potentially protect against nanny failures as well. Technically a mmapped file should work for 2, but having a file involved in communication is probably unnecessary. Maybe it could make sense if we consider mmapping in the context of spilling, in which case having that data in a mmapped file makes it easy to share with workers on the same node.
@jakirkham, would you mind sketching out how you think a workflow would go with memmapped files? For data that doesn't happen to come from an npy file, it would mean first writing the data to disk; or do you mean …
I mean, you could adapt my little bit of code, but instead of copying the original array to shm, you copy it to the mmap, right? That is, unless it happens to be in a file already. So the mmap would not normally be file-backed, and any process with the handle could read it without further copies.
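As a rough illustration of the mmap variant being discussed (the path is made up; on Linux a path under /dev/shm keeps the file in RAM rather than on disk), the writer copies the array into a memory-mapped file once and any other process maps the same file without a further copy:

```python
# Rough sketch of the file-backed mmap variant; the path is hypothetical and,
# on Linux, putting it under /dev/shm keeps the file in RAM (tmpfs).
import numpy as np

path = "/dev/shm/example-shared-array"     # hypothetical location

# Writer: create the memory-mapped file and copy the array in once
data = np.arange(1_000_000, dtype="float64")
mm = np.memmap(path, dtype=data.dtype, mode="w+", shape=data.shape)
mm[:] = data[:]
mm.flush()

# Reader (another process): map the same file read-only; the pages are shared
# by the OS, so there is no second copy of the data in memory
view = np.memmap(path, dtype="float64", mode="r", shape=(1_000_000,))
print(view[:5])
```

As with the shm version, the shape and dtype still have to travel alongside the name somehow, which is exactly the metadata-passing role SharedArray plays in the gist.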
Ah sorry, yeah, that makes sense. Agreed
It might be worth mentioning that there is also the Arrow Plasma store (http://arrow.apache.org/docs/python/plasma.html), which is a somewhat higher-level interface, in that it, for example, reference-counts the pointers to allocated memory regions.
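For comparison, a tiny sketch of the Plasma workflow (the plasma module shipped with older pyarrow releases and has since been removed; it assumes a plasma_store server is already running on the given socket):

```python
# Tiny sketch of the Plasma object store from older pyarrow releases.
# Assumes a store is already running, e.g.:
#   plasma_store -m 1000000000 -s /tmp/plasma
import numpy as np
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/plasma")     # socket path is an example
object_id = client.put(np.arange(10))      # stored in shared memory
print(client.get(object_id))               # read back; buffers live in shm
```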
I updated the gist at https://gist.github.com/martindurant/5f517ec55a5bff9c32637e8ebc57ef7c with a naive mmap implementation along the same very opt-in model. It performs the same as shm for reading, but takes longer to write, because there is a real file somewhere. Presumably that write could be to a file in RAM somehow.
Yeah, on Linux one can use … That said, I wonder whether just targeting a solid state drive is sufficient for most use cases.
Just a note about a few issues I found which directly or loosely connect to this: …
@florian-jetter-by: I think the latter only, and indeed it's that kind of conversation that I was thinking about. My example just does a little, but potentially useful, part of that, without a need to change any comm/protocol.
With the increasing availability of large machines, it seems to be the case that more workloads are being run as many processes on a single node. In a workflow where a single large array would be passed to workers, currently this might be done by passing the array from the client (bad), using scatter (OK), or loading data in the workers (good, but not efficient if we want one big array).

A large memory and transfer cost might be saved by putting the array into POSIX shared memory and referencing it from the workers. If the array is hosted in shm, there is no copy or de/ser cost (but there is an OS-call cost to attach to the shm). It could be appropriate for ML workflows where every task wants to make use of the whole of a large dataset (as opposed to chunking the dataset as dask.array operations do). sklearn with joblib is an example where we explicitly recommend scattering large data.
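For context, the "OK" option above is roughly the following (a minimal sketch assuming an existing cluster); note that every worker process still ends up holding its own copy of the array:

```python
# Minimal sketch of the scatter route; each worker still gets its own copy.
import numpy as np
from dask.distributed import Client

client = Client()                                  # or connect to a scheduler
big = np.random.random((10_000, 1_000))

future = client.scatter(big, broadcast=True)       # one copy per worker
result = client.submit(lambda a: a.sum(), future).result()
print(result)
```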
As a really simple example, see my gist, in which the user has to explicitly wrap a numpy array in the client, and then dask workers no longer need to have their own copies. Note that SharedArray is just a simple way to pass the array metadata as well as its buffer; it works for py37 and probably earlier.

To be clear: there is no suggestion of adding anything to the existing distributed serialisation code, because it's really hard to guess when a user might want to use such a thing. It should be explicitly opt-in.
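The gist itself isn't reproduced here, but a hedged sketch of the opt-in idea might look like the following (the class and names are invented for illustration, not the actual gist code): the wrapper pickles down to just a shared-memory name, so shipping it to a worker is nearly free, and unpickling attaches to the same physical memory.

```python
# Hedged sketch of the opt-in wrapper idea, not the actual gist code: the
# wrapper pickles down to just a name, and unpickling in the worker attaches
# to the same POSIX shared memory instead of receiving a copy over the wire.
import numpy as np
import SharedArray as sa


class ShmWrapped:
    """Copy an array into named shared memory; pickle by name only."""

    def __init__(self, arr, name):
        self.name = name
        self.arr = sa.create(f"shm://{name}", arr.shape, dtype=arr.dtype)
        self.arr[:] = arr[:]                   # single copy into shm

    def __reduce__(self):
        # what actually travels to the worker: just the attach call + the name
        return (sa.attach, (f"shm://{self.name}",))


# client side (illustrative):
#   wrapped = ShmWrapped(my_big_array, "my_big_array")
#   future = client.submit(some_func, wrapped)
# each worker unpickles `wrapped` as a zero-copy numpy view on the shm block;
# the block must eventually be removed with sa.delete("my_big_array")
```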
Further, …
cc @crusaderky @quasiben @jsignell