
Trigger spilling on MemoryError #3612

Open
jakirkham opened this issue Mar 20, 2020 · 9 comments

@jakirkham (Member)

It would be useful to catch any MemoryErrors thrown by workers and use these as an opportunity to spill and then retry whatever task failed.
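
Roughly, the pattern I have in mind looks like this (just a sketch: `spill_to_disk` is a hypothetical hook for evicting keys from the worker's in-memory data store, not an existing API):

```python
def run_task_with_spill_retry(func, args, kwargs, spill_to_disk, max_retries=1):
    """Run a task; on MemoryError, spill worker data to disk and retry."""
    for attempt in range(max_retries + 1):
        try:
            return func(*args, **kwargs)
        except MemoryError:
            if attempt == max_retries:
                raise  # give up and surface the MemoryError as a task failure
            spill_to_disk()  # hypothetical hook: evict keys until enough memory is freed
```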

@jakirkham (Member, Author)

@mrocklin, do you have any thoughts on this idea? Does it seem reasonable? If so, where in the code should I be looking to implement something like this?

@mrocklin (Member)

Maybe? I don't know. Have you seen cases where Python functions raise MemoryError when they're low on memory? In my anecdotal experience they tend to just ask the OS for more memory, and the OS spills out to virtual memory / disk.
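
(For what it's worth, the clearest case where plain Python/numpy code does raise MemoryError is a single allocation request that the OS refuses outright; whether you get MemoryError or instead end up swapping or OOM-killed depends on the OS and its overcommit settings.)

```python
import numpy as np

try:
    # ~800 TB request; most systems refuse the allocation outright and raise
    # MemoryError (numpy's allocation error is a MemoryError subclass).
    x = np.ones(10**14, dtype="float64")
except MemoryError as err:
    print("allocation refused:", err)
```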

@mrocklin (Member)

I would expect whatever change this requires to land somewhere in worker.py, but I suspect you would have to dig around for a while to find the right place.

@jakirkham (Member, Author)

Yes, this is something we are seeing regularly.

I guess I'm looking for where the worker runs the task and how to trigger spilling on the worker.

@mrocklin (Member)

mrocklin commented Apr 5, 2020

@jakirkham did you come up with anything here?

I think that the next steps here are to:

  1. Motivate this issue by showing some common-case problems that would be solved by the intervention (presumably with some numpy or pandas workload)
  2. Propose an intervention that we might make and where it would take place in the code.

@jakirkham (Member, Author)

I think the motivation is already here based on user feedback (#2602, #2297), and this should solve those issues (or at least let them proceed to the next blocker ;)

The intervention is proposed.

What's less clear is where this goes. If this is something you are able to advise on, that would be helpful. 🙂

@jakirkham (Member, Author)

cc @madsbk (for awareness)

@crusaderky (Collaborator)

crusaderky commented Oct 15, 2021

> Maybe? I don't know. Have you seen cases where Python functions raise MemoryError when they're low on memory? In my anecdotal experience they tend to just ask the OS for more memory, and the OS spills out to virtual memory / disk.

I second the request for real-life experience with MemoryError. My understanding is that you will get MemoryError only if

  • you are requesting an unreasonable amount all at once (e.g. a "fat finger" error where you add an extra 0 to the size of a numpy array), OR
  • you have neither pause nor terminate thresholds, OR
  • a single task is enough to send you straight from below the pause threshold to above the full amount of memory available on the worker, including the swap file.

In all other cases, you should either end up in the paused state or cross the terminate threshold and get killed by the nanny.
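
For reference, the pause and terminate behaviour above is governed by the `distributed.worker.memory.*` settings; a minimal sketch with what I believe are the default fractions (each can be disabled by setting it to `False`):

```python
import dask

# Thresholds are fractions of the worker's memory limit
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling least-recently-used keys to disk
    "distributed.worker.memory.spill": 0.70,      # spill based on total process memory
    "distributed.worker.memory.pause": 0.80,      # pause executing new tasks
    "distributed.worker.memory.terminate": 0.95,  # the nanny restarts the worker
})
```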

@jakirkham (Member, Author)

Honestly, we've been struggling with this issue on and off for a few years now. There are some general examples with Dask already linked in my comment above. Currently we are working on adding this functionality to RAPIDS. It would be great to discuss upstreaming a more general solution, but if that's not of interest we can continue with a RAPIDS-specific solution until that changes.
