Investigate memory usage #73
There have been a few reports of cellfinder requiring far too much RAM (100GB+). This seems to be new and should be investigated.

Comments
More details: https://forum.image.sc/t/ram-overload-crashing-with-brokenpipeerror-errno-32-broken-pipe/74959? The new rc (0.3.1rc1) in combination with setting …
The issue occurs on Ubuntu 22.04, but not on macOS (with M1).
Windows also seems to be OK. This may be due to the use of multiprocessing.
https://github.com/pythonprofilers/memory_profiler might be helpful for debugging this.
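For instance, a minimal sketch of using memory_profiler here (`my_script.py` and `run_detection` are placeholder names, not actual cellfinder code):

```python
# Line-by-line memory profiling with memory_profiler; stats are printed
# when the decorated function returns.
from memory_profiler import profile

@profile
def run_detection():
    # Call the cellfinder detection entry point here; this stand-in just
    # allocates some memory so the profiler has something to show.
    big = [bytearray(10_000_000) for _ in range(10)]
    return len(big)

if __name__ == "__main__":
    run_detection()

# For a memory-vs-time plot of the whole process:
#   mprof run my_script.py
#   mprof plot
```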
My lab recently started working with whole-brain lightsheet data, and we observed the same issue on three separate Windows machines with clean cellfinder installs. I did some digging around and I think I have figured out what is causing the increased memory use. tl;dr: cellfinder-core 0.3.1 is loading every plane into memory in the background.

In detect.main#131, we create producer threads. By using apply_async, which immediately returns an async result object, we create a queue containing n_planes async objects. The for loop exits, and we start our consumer task (mp_3d_filter.process), which runs in the parent thread. In the background, our pool of producers begins chewing through the tasks in the queue. Each time a plane is processed by the producer threads, a dask array is converted to an np array, requiring it to be loaded into memory. The producer threads are able to produce planes faster than the consumer can consume them, so np arrays build up in memory (a sketch of this pattern follows below).

If we tell cellfinder to run detection on only a small number of planes (e.g., 50), we can observe all planes being loaded into memory in the background. Eventually the producer threads finish all the tasks in the queue and go idle (they persist until the consumer task is done, because the consumer task is initiated in the context of the pool). From here we see memory gradually decline as results from the queue are consumed. Similarly, if we use only a single producer thread, memory use remains constant, as the consumer immediately eats every plane as soon as the producer finishes processing it.

Interestingly, the estimated ETA with a single producer is substantially lower (6 vs 11 hours). We noticed that with many producers (~22), they spend most of their time idle. This is likely due to the parent thread being unable to do scheduling while busy with the consumer task. With only a single producer there may be less multiprocessing overhead, so the parent thread can spend more time running the consumer task.

On the subject of runtime: is the consumer task single-threaded because it consumes ball_z_size planes? Would it be possible to multithread the 3D filter by giving each child process ball_z_size planes, processing them, and then doing the 3D filter in the child process?
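For illustration, a minimal sketch of the unthrottled submission pattern described above (`process_plane` and `filter_3d` are placeholder stand-ins, not the actual cellfinder functions):

```python
import multiprocessing
import time

def process_plane(plane_id):
    # Stand-in for the 2D filter: in cellfinder this is where a dask
    # plane becomes a NumPy array, i.e. gets loaded into memory.
    return bytearray(10_000_000)

def filter_3d(plane):
    # Stand-in for the slow, single-threaded 3D filter consumer.
    time.sleep(0.1)

if __name__ == "__main__":
    n_planes = 1000
    with multiprocessing.Pool(8) as pool:
        # Every task is queued up front: nothing throttles the producers,
        # so finished planes accumulate in memory while the consumer
        # below works through them one at a time.
        async_results = [
            pool.apply_async(process_plane, (i,)) for i in range(n_planes)
        ]
        for res in async_results:
            filter_3d(res.get())
```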
The behavior we are seeing is actually more similar to #52.
Thanks for the clear report! I haven't checked carefully myself, but the 2D filter running faster than the 3D filter and filling up memory seems like a very likely cause. I think the solution is to limit the maximum number of planes loaded into memory, by forcing the 2D jobs to wait for the 3D filter to consume the planes (one way to force that waiting is sketched below). I had a quick play around yesterday afternoon and got in a bit of a tangle. I think it's going to require a few of us in the Brainglobe team to sit down and work out the best way to solve this, so there probably won't be a quick fix (sorry 😢).

Re. the consumer task, I think it might be possible to multithread it, but potentially at the cost of increasing memory consumption. We can have a think about that too, though.
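For what it's worth, a minimal sketch of one way to bound the number of in-flight planes, using a bounded queue of async results; this is an assumption about how it could be done, not necessarily the approach taken in cellfinder (`process_plane` and `filter_3d` are placeholders):

```python
import multiprocessing
import queue
import threading

MAX_IN_FLIGHT = 16  # cap on planes held in memory at once (assumed value)

def process_plane(i):
    return bytearray(10_000_000)  # stand-in for a loaded, 2D-filtered plane

def filter_3d(plane):
    pass  # stand-in for the slow 3D filter

if __name__ == "__main__":
    n_planes = 1000
    # Bounded queue of AsyncResults: put() blocks once MAX_IN_FLIGHT
    # planes are pending, so the 2D jobs wait for the 3D filter.
    pending = queue.Queue(maxsize=MAX_IN_FLIGHT)

    with multiprocessing.Pool(8) as pool:

        def submit():
            for i in range(n_planes):
                pending.put(pool.apply_async(process_plane, (i,)))  # blocks when full
            pending.put(None)  # sentinel: no more planes

        threading.Thread(target=submit, daemon=True).start()

        while (res := pending.get()) is not None:
            filter_3d(res.get())  # consuming frees a queue slot
```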
I started prototyping a potential solution today, based on my understanding of how the current code operates (please correct any misunderstandings). The big change is to cache a constant number of planes (max_planes) and pass ball_z_size planes to the 3D filter. We use the same worker pool for both tasks, so we need to cache n_processes * ball_z_size planes. The current solution does end up duplicating some effort, since each caching step needs to reload some of the planes already processed to avoid boundary issues. This is probably fixable, but I wanted to focus on getting something working first.

This prototype addresses the memory use issue by ensuring a constant number of planes occupy memory at one time (configurable through how many processes we assign to the task). It also addresses the issue of the parent thread being unable to perform scheduling for children by offloading the 3D filter to child tasks. Lastly, it should provide a dramatic speed improvement by allowing multiple threads to perform the 3D filtering step in parallel.

I'm only showing the changes to detect.main because the only other change is to volume_filter.py, converting it from reading from a queue to reading from a list. This code seems to correctly spawn threads and process planes, and memory use seems reasonable. I haven't been able to verify the output, because for some reason Python isn't able to pickle something while trying to run results.get(). Posting here to get feedback on the solution and to see if anyone has insight into the pickling problem.

```python
# Excerpt from detect.main; names like signal_array, ball_z_size,
# mp_tile_processor, mp_3d_filter and callback come from its scope.
mp_ctx = multiprocessing.get_context("spawn")
n_processes = 4  # this will depend on the memory capacity of the system
max_planes = n_processes * ball_z_size

with mp_ctx.Pool(n_processes) as worker_pool:
    planes = [None] * max_planes
    results = [None] * max_planes
    to_return = []
    counter = 0
    while counter < len(signal_array):
        # Set counter back ball_z_size planes so adjacent chunks overlap,
        # avoiding boundary issues in the 3D filter.
        counter = max(0, counter - ball_z_size)
        # In case len(signal_array) % max_planes != 0, avoid trying to
        # access invalid indices.
        end = min(max_planes, len(signal_array) - counter)
        for i in range(end):
            planes[i] = worker_pool.apply_async(
                mp_tile_processor.get_tile_mask, args=(signal_array[counter],)
            )
            counter += 1
        # Blocks until all results are ready; slice to end so stale
        # entries from a previous pass are not .get() again.
        planes = [plane.get() for plane in planes[:end]]
        for i in range(max_planes - ball_z_size):
            # Note the trailing comma: args must be a tuple containing the
            # list of planes, not the list unpacked as separate arguments.
            results[i] = worker_pool.apply_async(
                mp_3d_filter.process,
                args=(planes[i : i + ball_z_size],),
                callback=callback,
            )
        for result in results:
            # Blocks until all results are ready; will need to do
            # something to combine piecewise.
            if result is not None:
                to_return.append(result.get())
```

EDIT: include pickling error

```
File "C:\Users<user>\anaconda3\envs\iblenv\lib\site-packages\cellfinder_core\detect\detect.py", line 298, in main
```
Thanks for the reply and the code, and sorry for the slow reply from our end. You might have seen elsewhere in the repo that we decided to embark on a re-write of the cell detection code from Cython to pure Python, accelerated using Numba. One of the motivations for this is that we're hoping to port the multiprocessing implementation to dask, which will allow much more flexible control of how the multiprocessing is done (including limiting memory use), while pushing the maintenance burden of the multiprocessing code upstream to dask. We've just finished the Cython-to-Python re-write, so I'd hazard a guess at ~two months for us to finish converting the multiprocessing to dask and putting out a new release.
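For a rough idea of what that could look like, a sketch under assumptions (not the actual plan; `filter_plane` is a placeholder for the 2D filter):

```python
import dask.array as da
import numpy as np

def filter_plane(block):
    # Placeholder for the per-plane 2D filter.
    return block * 2

if __name__ == "__main__":
    # Pretend volume: 100 planes of 512x512, chunked one plane at a time.
    volume = da.zeros((100, 512, 512), chunks=(1, 512, 512), dtype=np.uint16)
    filtered = volume.map_blocks(filter_plane)
    # The scheduler decides how many chunks live in memory at once;
    # num_workers bounds parallelism (and hence peak memory).
    result = filtered.compute(scheduler="threads", num_workers=4)
```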
@dstansby I apologize for the delay, but I finally got around to testing #139. It does appear to solve the memory issue, but it also seems to have introduced a new bug. Unless I implemented your changes incorrectly, only one thread is active at a time using the updated code. Below you can see the 20+ threads cellfinder spawns; only one child thread (plus the parent thread) is using any CPU. The active thread hops around each time a new plane is loaded, but there is never more than one thread active. As I noted previously, there isn't much point processing planes faster than the parent thread can consume them (which is currently single-threaded), so this behavior might be OK. A simpler way to achieve it would be to just set n_processes=1.
Thanks for the testing! I managed to reproduce the same issue locally. To fix that I've updated #139 so:

This should mean that:

I can't reproduce your original case, which I think was 3. above (2D filtering running faster than 3D filtering), so could you give the current code on #139 another go to see if it fixes your original issue?
@dstansby This does appear to fix memory use without breaking threading. Once the n_procs + ball_z_size planes are loaded, only about 1-2 threads are needed to keep the queue full. I don't have the documentation handy, but is n_processes configurable for the detection step? Since there is up to 2.2 GB of overhead per thread on my machine (a 24-core machine), not needing to keep many threads active could further help with memory management.
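On configurability: cellfinder-core's top-level API exposes an n_free_cpus argument, which indirectly caps the number of detection workers. A hedged sketch (the arrays are placeholders, and the signature should be checked against the installed version before relying on this):

```python
import numpy as np
from cellfinder_core.main import main as cellfinder_run

# Placeholder data; in practice these would be full signal/background volumes.
signal = np.random.randint(0, 100, (30, 64, 64), dtype=np.uint16)
background = np.random.randint(0, 100, (30, 64, 64), dtype=np.uint16)

# Leaving more CPUs free lowers the worker count, and with it the
# per-worker memory overhead mentioned above.
cells = cellfinder_run(
    signal,
    background,
    voxel_sizes=(5, 2, 2),
    n_free_cpus=20,  # e.g. keep 20 of 24 cores free -> ~4 workers
)
```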