Worker memory not being freed when tasks complete #2757
Another thing that might be worth doing is to also add all values placed into the worker's `data` dictionary to a `weakref.WeakValueDictionary`:

```python
def __init__(self, ...):
    ...
    self.weak_data = weakref.WeakValueDictionary()

def put_key_in_memory(self, key, value):
    self.data[key] = value
    self.weak_data[key] = value
```

Then, after seeing things flush through, you could check on the references to the items in the `weak_data` dictionary. This could even become a test fairly easily if we were to implement a custom mapping.
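A minimal sketch of what such a custom mapping could look like (the class name and the idea of swapping it in for the worker's data dict are assumptions for illustration, not existing worker code):

```python
import weakref
from collections.abc import MutableMapping


class WeakTrackingData(MutableMapping):
    """Dict-like store that mirrors values into a WeakValueDictionary so a
    test can assert that values are actually freed once tasks complete."""

    def __init__(self):
        self.data = {}
        self.weak_data = weakref.WeakValueDictionary()

    def __setitem__(self, key, value):
        self.data[key] = value
        try:
            # The weak reference does not keep the value alive.
            self.weak_data[key] = value
        except TypeError:
            # Plain bytes/int/str cannot be weak-referenced; skip those.
            pass

    def __getitem__(self, key):
        return self.data[key]

    def __delitem__(self, key):
        del self.data[key]

    def __iter__(self):
        return iter(self.data)

    def __len__(self):
        return len(self.data)
```

A test could then run a small graph against a worker using this mapping and assert that `weak_data` is empty after the results have been released.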
Regardless, thanks for investigating here. Issues like this have come up over the years without much resolution.
Surprisingly(?) this exacerbates the memory leak. My computation doesn't even complete because the workers are killed for exceeding the memory limit. I guess if we are holding on to a reference somewhere, then we wouldn't expect it to be released from `weak_data` anyway.
Right, but you would be able to look at the objects that have stayed around in `weak_data`:

```python
import gc

if worker.weak_data:
    obj = list(worker.weak_data.values())[0]
    print(gc.get_referrers(obj))
```
Will give that a shot. Another (failed) attempt: I made a custom object,

```python
import string
import sys


class Foo:
    def __init__(self, x, n=N):  # N is the payload size from the reproduction script
        self.thing = string.ascii_letters[x % 52].encode() * n

    def __getitem__(self, index):
        return self.thing[index]

    def __sizeof__(self):
        return sys.getsizeof(self.thing) + sys.getsizeof(object())
```

and ran the computation with it.
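For context, a hypothetical driver along these lines (the exact script isn't quoted in the thread; `client` and `N` are assumed to come from the original reproduction) would exercise `Foo` the same way the raw bytestrings were exercised:

```python
# Hypothetical usage sketch: build Foo objects on the workers instead of raw
# bytestrings, then reduce to a small result returned to the client.
futures = client.map(Foo, range(1000))
lengths = client.map(lambda obj: len(obj.thing), futures)
total = client.submit(sum, lengths).result()
```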
You should probably also be aware of dask/dask#3530

Yes, that's a possible culprit. Interestingly, I'm not seeing the memory lingering when running my original script on a Linux system.
I've run into issues previously with memory not being freed until the garbage collector was run explicitly. I guess the below would work:

```python
def collect():
    import gc
    gc.collect()

client.run(collect)
```

Edit: I should mention that this wasn't with dask, but the symptoms sound familiar...
@dhirschfeld I have tried manually garbage collecting after reading the other issue Matt linked above and didn't see an improvement. Appreciate the suggestion though. This is using pandas read_csv for all the IO but I'm fairly confident I see the same behavior w/ other methods.
My guess here is that Dask isn't tracking any of the leaked data, and that we're in a situation where the next thing to do is to use normal Python methods to detect memory leaks (like the `gc` module). For example, you could compare how much data each worker thinks it is holding:

```python
def f(dask_worker):
    return len(dask_worker.data)

client.run(f)
```

with how many large objects are actually alive in each worker process:

```python
def f():
    return len([obj for obj in gc.get_objects() if isinstance(obj, pd.DataFrame)])

client.run(f)
```

…
Any progress on this issue? Thanks.
@jsanjay63 this GitHub issue should reflect the current state of things.
What is the workaround for this in the real world? Do people not use clusters with long-running workers? Or, are people okay with the worker eventually dying and the task getting retried?
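One mitigation people reach for in long-running clusters is letting each worker restart itself periodically so that any slowly leaked memory is handed back to the OS. A sketch of doing that through Dask's configuration (these keys mirror the `dask-worker` `--lifetime`, `--lifetime-stagger` and `--lifetime-restart` options; the exact values are illustrative and worth verifying against your distributed version):

```python
import dask
from dask.distributed import Client, LocalCluster

# Restart each worker roughly every hour, staggered so they don't all
# restart at once, and bring each one back up automatically.
dask.config.set({
    "distributed.worker.lifetime.duration": "1 hour",
    "distributed.worker.lifetime.stagger": "5 minutes",
    "distributed.worker.lifetime.restart": True,
})

client = Client(LocalCluster(n_workers=4, threads_per_worker=1))
```

This amounts to accepting the "worker eventually dying and the task getting retried" trade-off in a controlled way.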
Small sample example to reproduce this issue: https://stackoverflow.com/questions/64046973/dask-memory-leakage-issue-with-json-and-requests
Same problem here. At the moment I am doing …
You might be interested in testing out PR #4221. Reporting success or failures would be welcome. Note, this PR is still in flux and subject to change.
Done, but unfortunately no success. Thanks for the suggestion @quasiben.
This problem is indeed a big one, preventing me from using Dask in production, where I have a very long-running task and 200 GB of memory get used up in no time. I already tried the suggested PR without success.

Everything works fine except for the fact that once the tasks in the innermost layer complete, the memory is not released. In this situation of nested processes I cannot even restart the client in the inner layers, because this would end up affecting the whole computation. So for me there is really no solution here. Any help would be much appreciated.
I'm also experiencing some kind of memory leak, though it might not be related. I'm using Dask distributed only as a job scheduler, not even passing any substantial data: the input is just a filename and there is no return value, and the job itself calls only plain pandas and numpy. This way I'm processing 4000 files (almost equally sized) on a 40-core machine in about 45 minutes.

With Dask distributed the memory usage continuously increases until the work is done. At that point it's consuming 40 GB and the memory is not freed.

The strange thing is that I don't experience the memory leak when using plain multiprocessing (see `main_mp` below). So it seems that Dask is not directly involved, yet it makes the difference somehow.

I'm running Debian buster, Python 3.7 and the latest libraries (dask==2020.12.0, numpy==1.19.5, pandas==1.2.0). (Python 3.8 seems to make no difference.)

See the code:
```python
import time

import fsspec
import pandas as pd
from dask.distributed import Client, LocalCluster, wait


def compute_profile_from_file(filepath):
    df = pd.read_csv(filepath, compression="gzip", sep=";",
                     names=("eid", "t", "src", "spd"))
    # I observe memory leak even if just reading the data.
    # As I add more processing steps more memory is leaked.
    ...
    df.to_parquet(...)


def main_dask():
    fs = fsspec.filesystem("file")
    filepaths = fs.glob("/data/*.csv.gz")
    client = Client(LocalCluster(n_workers=40, threads_per_worker=1))
    results = set()
    for filepath in filepaths:
        if len(results) == 40:
            # Keep at most 40 tasks in flight.
            _, results = wait(results, return_when='FIRST_COMPLETED')
        job = client.submit(compute_profile_from_file, filepath)
        results.add(job)
    client.gather(results)
    del results
    time.sleep(24 * 3600)  # keep the process alive to observe memory


def main_mp():
    fs = fsspec.filesystem("file")
    filepaths = fs.glob("/data/*.csv.gz")
    import multiprocessing as mp
    mp.set_start_method('spawn')
    pool = mp.Pool(40)
    pool.map(compute_profile_from_file, filepaths)
    time.sleep(24 * 3600)


if __name__ == "__main__":
    #main_dask()
    #main_mp()
```
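A small diagnostic that can help in a case like this (hedged sketch; it assumes `psutil` is installed and `client` is the `Client` from `main_dask`) is to ask every worker for its resident memory directly, to see whether the 40 GB actually sits inside the worker processes:

```python
import os

import psutil


def worker_rss_mib():
    # Resident set size of the current worker process, in MiB.
    return psutil.Process(os.getpid()).memory_info().rss / 2**20


# Runs the function inside every worker and returns {worker_address: MiB}.
print(client.run(worker_rss_mib))
```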
Any update here? We are also trying to use Dask in production, but this is causing some major issues for us.
This issue, and probably a few others scattered across dask/distributed, should have the current state of things.

FWIW, I'm not able to reproduce my original issue now, at least not on a different machine. Previous attempts were on macOS, but on Linux under WSL2 I see the memory come back down once the tasks complete. Previously the … I also learned about …
Is there a simple, easy, effective way to kill all Dask processes/workers hogging up memory, whether via the command line or directly in Python? I thought that this was done automatically upon completion of a Python command-line call or on interrupting the execution, but I guess not. I executed the same set of commands several times in VS Code while debugging and wasn't aware that memory wasn't being freed up on every iteration. Now 73% of the RAM is blocked and I have no idea how to free it. Can someone please help?

Please help?
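A hedged sketch of cleaning up from Python, assuming the `Client`/`LocalCluster` objects created during the debugging session are still reachable (names here are illustrative):

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4)
client = Client(cluster)

# ... the computation being debugged ...

client.shutdown()  # closes the scheduler and all of its workers
cluster.close()    # tears down any remaining LocalCluster processes
```

If the session that created them is already gone, killing any leftover `dask-worker`/`dask-scheduler` processes from the operating system's process manager releases their memory as well.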
Just to chime in that I'm having a similar issue in long-running jobs (a very big job made of many small tasks). Worker logs indicate something like "memory full but no data to spill to disc". It's very hard to diagnose because it only really has an impact after many hours: a 7-hour job completed, but a 14-hour one got stuck as all the workers' memory filled, so they stopped accepting or processing tasks, and since there's nothing to spill to disc they couldn't free their memory either.

Interestingly, I first had the problem appear much sooner when I was using Numba to accelerate my algorithm. Then I switched to Cython instead and it seemed to cure the problem, but when I ran a much bigger job the problem still appeared :( Whatever it is must still be happening, just much more slowly.
Just happened to notice this issue, which is still open and has a number of up-votes. I wonder if this is an instance of memory not being released back to the OS, which now has its own section in the docs? Additionally, Nannies now set the `MALLOC_TRIM_THRESHOLD_` environment variable by default. I'm wondering if we should close this now? cc @crusaderky xref #6681 #6780 #4874
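For reference, the docs section on memory not being released back to the OS describes manually trimming the allocator; a sketch of that approach (Linux/glibc only; `client` is assumed to be a connected `Client`):

```python
import ctypes


def trim_memory() -> int:
    # Ask glibc to hand unused heap pages back to the operating system.
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)


# Run inside every worker; a return value of 1 means memory was released.
client.run(trim_memory)
```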
I am running jobs on a fairly fragile SBC cluster using Dask SSH and have been playing around with memory tricks a little, for example:

```python
futures = notebook_client.map(function, list)

# grab the results returned into a df_list
df_list = notebook_client.gather(futures)

# force-clear memory for the completed futures on the SBC
[x.cancel() for x in futures]
```

Hopefully, I never send a job that exceeds memory and know that limit; I force/check clearing of worker memory after the worker reports back as part of the submission, and sequentially restart the clients if they accumulate memory I don't understand. Are there any fork bombs out there we don't know about, or any more ideas I missed? I'm sure experts can find the leaks. So far I haven't broken anything in quite a while - all the best, cheers.
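A sketch of the "restart if memory accumulates" part of that routine (hedged; `notebook_client`, `futures` and `df_list` are the names from the snippet above):

```python
import gc

# Drop the client-side and worker-side references to finished results...
for f in futures:
    f.cancel()
del futures, df_list
gc.collect()

# ...and, if worker memory still keeps creeping up, restart all workers,
# which clears their memory at the cost of losing any cached results.
notebook_client.restart()
```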
I'm still investigating, but in the meantime I wanted to get this issue started.

I'm noticing that after executing a task graph with large inputs and a small output, my worker memory stays high. In the example below we generate a number of large bytestrings on the workers and reduce them down to a single small value. So the final result returned to the client is small, a single Python int. The only large objects should be the initially generated bytestrings.

The console output below is …
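The example script and its console output did not survive the copy above; a hedged reconstruction of the kind of graph being described (sizes and helper names are illustrative, not the author's exact code):

```python
import os

from dask.distributed import Client

client = Client(n_workers=2, threads_per_worker=1)


def make_payload(i, n=50_000_000):
    # One ~50 MB bytestring per task: the only large objects in the graph.
    return os.urandom(n)


def payload_len(b):
    return len(b)


data = client.map(make_payload, range(20))     # large inputs held on the workers
lengths = client.map(payload_len, data)        # small per-task results
total = client.submit(sum, lengths).result()   # a single Python int on the client
print(total)

del data, lengths  # releasing the futures should let the workers drop the bytes
```

After the `del`, one would expect worker memory to drop back toward its baseline, which is what the report says does not happen.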
In an effort to test whether the scheduler or worker is holding a reference to the data, I submit a bunch of tiny `inc` tasks to one of the workers. I notice that the memory on that worker does settle down. That's at least consistent with the worker or scheduler holding a reference to the data, but there could be many other causes. I'm still debugging.

The number of `inc` tasks, 2731, seems to be significant. With 2730 `inc` tasks, I don't see any memory reduction on that worker.
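The probe itself isn't shown in the copied text; a hedged sketch of submitting that many tiny `inc` tasks to a single worker (the `workers=` and `pure=False` keyword arguments to `Client.submit` are real, but the surrounding names follow the description above rather than the author's exact code):

```python
def inc(x):
    return x + 1


# Pick one worker address from the scheduler's view of the cluster.
worker_addr = list(client.scheduler_info()["workers"])[0]

# Submit 2731 tiny tasks pinned to that worker and wait for them all.
futures = [
    client.submit(inc, i, pure=False, workers=[worker_addr])
    for i in range(2731)
]
client.gather(futures)
```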