Thoughts on additional spilling layers #4629

Open
jakirkham opened this issue Mar 24, 2021 · 5 comments
Labels
discussion Discussing a topic with no specific actions yet memory

Comments

@jakirkham
Member

jakirkham commented Mar 24, 2021

Have had a few thoughts about additional spilling layers we could benefit from:

  • Defragmentation (perform a memcpy into larger and larger buffers to clean up and consolidate small buffers)
  • Shared memory (also has the benefit of being accessible from other workers on the same node)
  • mmap files (in-between shared memory and a file on disk; can close when memory usage is high; also like memory when using solid state)

Would be straightforward to add these to Zict without any new dependencies
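As a rough illustration of the defragmentation idea, here is a minimal sketch that copies several small buffers into one contiguous allocation and hands back views into it. The function name `consolidate` is hypothetical and is not part of Zict or Distributed; it only assumes the inputs are non-empty, contiguous bytes-like objects.

```python
def consolidate(buffers):
    """Copy small bytes-like buffers into one contiguous bytearray.

    Hypothetical helper for illustration only (not a Zict/Distributed API).
    Returns the backing bytearray and one memoryview per input buffer.
    """
    # Normalize every input to a flat view of unsigned bytes.
    sources = [memoryview(b).cast("B") for b in buffers]
    backing = bytearray(sum(src.nbytes for src in sources))
    target = memoryview(backing)

    views = []
    offset = 0
    for src in sources:
        end = offset + src.nbytes
        target[offset:end] = src           # the memcpy step
        views.append(target[offset:end])   # zero-copy view into the big buffer
        offset = end
    return backing, views
```

Once the copies are done, the original small buffers can be dropped and callers keep working through the returned views.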

@jakirkham
Member Author

Admittedly not everyone will want all of these (or may want to add other things). So making it easier for users to plug in and enable/disable what they care about is part of the story. Seems like this came up elsewhere ( https://twitter.com/alimanfoo/status/1373972844959428617 )
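To make the "plug in what you care about" idea concrete, here is a small sketch of a two-layer mapping where each layer is just a `MutableMapping`, so layers can be swapped or nested. The class `TwoLayer` and its eviction rule (oldest inserted key first) are assumptions made up for this example; this is not Zict's actual API.

```python
from collections.abc import MutableMapping


class TwoLayer(MutableMapping):
    """Hypothetical spilling layer: keep at most `n` items in `fast`,
    moving the oldest inserted items to `slow` when over the limit."""

    def __init__(self, fast, slow, n):
        self.fast, self.slow, self.n = fast, slow, n

    def __setitem__(self, key, value):
        # Drop any stale copy, then insert so the key becomes the newest.
        self.slow.pop(key, None)
        self.fast.pop(key, None)
        self.fast[key] = value
        while len(self.fast) > self.n:
            oldest = next(iter(self.fast))  # dicts iterate in insertion order
            self.slow[oldest] = self.fast.pop(oldest)

    def __getitem__(self, key):
        if key in self.fast:
            return self.fast[key]
        return self.slow[key]

    def __delitem__(self, key):
        if key in self.fast:
            del self.fast[key]
        else:
            del self.slow[key]

    def __iter__(self):
        yield from self.fast
        yield from self.slow

    def __len__(self):
        return len(self.fast) + len(self.slow)
```

Because `slow` only needs to behave like a mapping, it could itself be another `TwoLayer`, e.g. `TwoLayer({}, TwoLayer({}, {}, 100), 10)`, which is one way to enable or disable individual layers.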

@crusaderky
Collaborator

I'm familiar with mmap files but I can't quite see how it can be beneficial in our use case. Unless you're talking about allocating the pickle5 numpy buffers directly on a mmap'ed memory surface? Conceptually it makes some sense but it would mean writing a lot of custom low-level code that goes deep into the CPython internals, and I'm not convinced it would be actually beneficial compared to the current high-level, pure-python spilling mechanics.

@crusaderky
Collaborator

Re. defragmentation: did I understand correctly that we're not talking about fragmentation of pure-python objects here, but only about small numpy buffers? If that's the case, I wonder if it's even an issue in real life?

@crusaderky
Collaborator

Re. shared memory: I think it would be much better to start playing with PEP-554 subinterpreters as soon as a semi-working GIL decoupling appears in the dev branches.

@jakirkham
Member Author

jakirkham commented Aug 25, 2021

Yeah defragmentation may be less of an issue now that we use larger buffers to receive into. If we are able to address the issue you identified ( #5107 ) and avoid unnecessary copies somehow ( #5208 ), this probably addresses the main issues.

As of Python 3.8, the shared_memory module is included in Python. So I think we can just use that. IDK when we plan to drop Python 3.7, but can't imagine that would be too far away given Python 3.10 is due in ~1 month. That said, mmap might be able to solve this and other problems.
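For reference, a minimal sketch of what using the standard library `multiprocessing.shared_memory` module (Python 3.8+) could look like. The helper names `put_shared` and `get_shared` are hypothetical; the payload length is passed explicitly because some platforms round the segment size up to a page boundary.

```python
from multiprocessing import shared_memory


def put_shared(payload: bytes) -> shared_memory.SharedMemory:
    """Copy a non-empty payload into a new shared memory segment.

    Hypothetical helper; the caller is responsible for close() and unlink().
    """
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    return shm


def get_shared(name: str, nbytes: int) -> bytes:
    """Attach to an existing segment by name and copy out `nbytes` bytes."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return bytes(shm.buf[:nbytes])
    finally:
        shm.close()  # detach only; the creator still owns the segment
```

Another worker on the same node would only need the segment name and the length, which are cheap to send compared with the data itself.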

The value of mmap is twofold. First, it lets the OS kernel (as opposed to Dask) handle the movement of data to/from memory/disk. Second, mmapped memory can be shared between multiple processes at the OS level in a number of ways (here's a good blogpost on this). This would also work for sharing memory on other OSes. The benefit here is the OS can make better decisions about when to move memory to disk (without Dask getting involved). Also, workers on the same node can share data between each other quickly with minimal communication.
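A minimal sketch of the file-backed mmap approach using only the standard library. The helper names `write_mapped` and `read_mapped` are hypothetical, and the payload is assumed to be non-empty (an empty file cannot be mapped). Paging the data between memory and disk is left to the OS.

```python
import mmap
import os
import tempfile


def write_mapped(path: str, payload: bytes) -> None:
    """Write a non-empty payload to `path` through a memory mapping."""
    with open(path, "wb") as f:
        f.truncate(len(payload))  # size the file before mapping it
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:  # length 0 maps the whole file
            mm[:] = payload
            mm.flush()  # ask the OS to write dirty pages back to the file


def read_mapped(path: str) -> bytes:
    """Map `path` read-only (possibly from another process) and copy it out."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[:]
```

For example, one worker could call `write_mapped(os.path.join(tempfile.mkdtemp(), "spill.bin"), data)` and another process on the same node could call `read_mapped` on the same path.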

Subinterpreters are certainly interesting. However I think that is still a ways off ( ericsnowcurrently/multi-core-python#76 ) (though I do follow that work with interest).

The things proposed here are relatively simple, but have the potential to give us significant improvements for that effort by letting low-level operations in the OS take over spilling to a large extent.


2 participants