Investigate memory usage #73

adamltyson · 2022-11-30T10:50:16Z

There have been a few reports of cellfinder requiring far too much RAM (100GB+). This seems to be new and should be investigated.

adamltyson · 2022-12-13T11:24:25Z

More details - https://forum.image.sc/t/ram-overload-crashing-with-brokenpipeerror-errno-32-broken-pipe/74959?

The new rc (0.3.1rc1) in combination with setting --n-free-cpus to a low number helps the issue, but doesn't fix it. cellfinder-core is still using more memory than necessary.

adamltyson · 2022-12-15T15:31:32Z

The issue occurs on Ubuntu 22.04, but not on macOS (with M1).

adamltyson · 2022-12-19T16:44:37Z

Windows also seems to be ok. May be due to using multiprocessing spawn mode which I think is default on Windows and macOS, but not Linux. Will allowing fork on Linux work ok?

dstansby · 2023-01-30T17:19:01Z

https://github.com/pythonprofilers/memory_profiler might be helpful for debugging this

n-garc · 2023-02-17T19:47:20Z

My lab recently started working with whole brain lightsheet data, and we observed the same issue on three separate windows machines with clean cellfinder installs. I did some digging around and I think I have figured out what is causing the increased memory use.

tl;dr cellfinder-core 0.3.1 is loading every plane into memory in the background.

In detect.main#131, we create producer threads. By using async_apply, which immediately returns an async object, we create a queue containing n_planes async objects. The for loop exits, and we start our consumer task (mp_3d_filter.process), which runs in the parent thread. In the background, our pool of producers begin chewing through the tasks in the queue. Each time a plane is processed by the producer threads, a dask array is converted to an np array, requiring it to be loaded into memory. The producer threads are able to produce planes faster than the consumer can consume them. This means that np arrays are building up in memory.

If we tell cellfinder to run detection on only a small number of planes (e.g., 50), we can observe all planes being loaded into memory in the background. Eventually the producer threads finish all the tasks in the queue and go idle (they will persist until consumer task is done, because the consumer task is initiated in the context of the pool). From here we see memory gradually decline as results from the queue are consumed.

Similarly, if we use only a single producer thread, memory use remains constant, as the consumer immediately eats every plane as soon as the producer finishes processing it. Interestingly, the estimated ETA with a single producer is substantially lower (6 vs 11 hours). We noticed that with many producers (~22), they spend most of their time idle. This is likely due to the parent thread being unable to do scheduling while busy with the consumer task. With only a single producer there may be less multiprocessing overhead, so the parent thread can spend more time running the consumer task,

On the subject of runtime -- is the consumer task single threaded because it consumes ball_z_size planes? Would it be possible to multithread the 3D filter by giving each child process ball_z_size planes, processing them, and then doing the 3D filter in the child process?

n-garc · 2023-02-17T22:37:31Z

The behavior we are seeing is actually more similar to: #52

dstansby · 2023-02-21T09:37:30Z

Thanks for the clear report! I haven't checked carefully myself, but the 2D filter running faster than the 3D filter and filling up memory seems like a very likely cause. I think the solution there is to limit the maximum number of planes loaded into memory, by forcing the 2D jobs to wait for the 3D filter to consume the planes.

I had a quick play around yesterday afternoon, and got in a bit of a tangle. I think it's going to require a few of us in the Brainglobe team to sit down and work out the best way to solve this, so there probably won't be a quick fix (sorry 😢 ).

re. the consumer task, I think it might be possible to multithread it, but potentially at the cost of increasing memory consumption. We can have a think about that too though.

n-garc · 2023-02-23T00:22:32Z

I started prototyping a potential solution today, based on my understanding of the current code operates (please correct any misunderstandings). The big change is to cache a constant number of planes (max_planes), and pass ball_z_size planes to the 3D filter. We use the same worker pool for both tasks, so we need to cache n_processes * ball_z_size planes. The current solution does end up duplicating some effort, since each caching step needs to reload some of the planes already processed to avoid boundary issues. This is probably fixable, but I wanted to focus on getting something working first.

This prototype addresses the memory use issue by ensuring a constant number of planes occupy memory at one time (configurable through how many processes we assign to the task). It also addresses the issue of the parent thread being unable to perform scheduling for children by offloading 3D filter to child tasks. Lastly, it should provide a dramatic speed improvement by allowing multiple threads to perform 3D filtering step in a parallel manner.

I'm only showing the changes to detect.main b/c the only other change is to volume_filter.py to convert from reading from a queue to a list. This code seems to correctly spawn threads and process planes, and memory use seems reasonable. I haven't been able to verify the output because for some reason python isn't able to pickle something while trying to run results.get(). Posting here to get feedback on the solution and see if anyone has insight into the pickling problem.

    mp_ctx = multiprocessing.get_context("spawn")
    n_processes = 4 #this will depend on memory capacity of system
    max_planes = n_processes * ball_z_size
    with mp_ctx.Pool(n_processes) as worker_pool:
        planes = [None] * max_planes
        results = [None] * max_planes
        to_return = []

        counter = 0
        while counter < len(signal_array):
            counter = max(0, counter-ball_z_size) #sets counter back ball_z_size planes 
            #in case where len(signal_array)%max_planes != 0, avoids trying to access invalid indices
            end = min(max_planes, len(signal_array)-counter) 
            for i in range(0, end):
                planes[i] = worker_pool.apply_async(mp_tile_processor.get_tile_mask, args=(signal_array[counter],))
                counter += 1

            planes = [plane.get() for plane in planes] #blocks until all results are ready
            for i in range(0, max_planes-ball_z_size):
                results[i] = worker_pool.apply_async(mp_3d_filter.process, args=(planes[i:i+ball_z_size]), callback=callback)
            
            for result in results:
                #blocks until all results are ready, will need to do something to combine piecewise 
                to_return.append(result.get())

EDIT: include pickling error

File "C:\Users<user>\anaconda3\envs\iblenv\lib\site-packages\cellfinder_core\detect\detect.py", line 298, in main
to_return.append(result.get())
File "C:\Users<user>\anaconda3\envs\iblenv\lib\multiprocessing\pool.py", line 771, in get
raise self._value
File "C:\Users<user>\anaconda3\envs\iblenv\lib\multiprocessing\pool.py", line 537, in _handle_tasks
put(task)
File "C:\Users<user>\anaconda3\envs\iblenv\lib\multiprocessing\connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "C:\Users<user>\anaconda3\envs\iblenv\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "stringsource", line 2, in View.MemoryView._memoryviewslice.__r

dstansby · 2023-03-28T10:22:44Z

Thanks for the reply and the code, and sorry for the slow reply from our end. You might have seen elsewhere in the repo that we decided to embark on a re-write of the cell detection code from Cython to pure Python, accellerated using Numba. One of the motivations for this is we're hoping to port the multiprocessing implementation to dask, which will allow much more flexible control of how the multiprocessing is done (including limiting memory use) while pushing the maintenance burden of the multiprocessing code upstream to dask.

We've just finished doing the Cython > Python re-write, so I'd hazard a guess at ~two months for us to finish converting the multiprocessing to dask, and putting out a new release.

dstansby · 2023-04-05T14:51:40Z

@n-garc I think I (some what spontaneously!) came up with a simple solution to the issue at #139 - would you be able to test the code in that PR out and see if it fixes your memory issues?

n-garc · 2023-04-06T00:18:51Z

@n-garc I think I (some what spontaneously!) came up with a simple solution to the issue at #139 - would you be able to test the code in that PR out and see if it fixes your memory issues?

I will try it out and see if that fixes it.

n-garc · 2023-04-19T19:45:23Z

@dstansby I apologize for the delay, but I finally got around to testing #139. It does appear to solve the memory issue, but it also seems to have introduced a new bug. Unless I implemented your changes incorrectly, only one thread is active at a time using the updated code. Below you can see the 20+ threads cellfinder spawns, and only 1 child thread (+ the parent thread) is using any CPU. The active thread hops around each time a new plane is loaded, but there is never more than one thread active. As I noted previously, there isn't much point processing planes faster than the parent thread can consume them (which is currently single-threaded), so this behavior might be ok. A simpler way to achieve it would be to just set n_processes=1.

dstansby · 2023-04-21T10:14:59Z

Thanks for the testing, I managed to reproduce the same issue locally. To fix that I've updated #139 so:

At the start, n_procs + ball_z_size planes are unlocked for 2D filtering
When the 3D filtering starts, n_procs planes ahead of where the 3D filter is at are unlocked for 2D filtering

This should mean that

At any one time, a maximum of n_procs + ball_z_size planes are loaded into memory
If 3D filtering is faster than 2D filtering, the full parallelism of n_procs is utilised to do the 2D filtering
If 2D filtering is faster than 3D filtering, the number of planes in memory is still limited

I can't reproduce your original case, which I think was 3. above (2D filtering faster than 3D filtering), so could you give the current code on #139 another go to see if it fixes your original issue?

n-garc · 2023-04-21T16:38:33Z

@dstansby This does appear to fix memory use without breaking threading. Once the n_procs + ball_z_size planes are loaded, only about 1-2 threads are needed to keep the queue full. I don't have the documentation handy, but is n_processes configurable for the detection step? It could further help with memory management to not need to keep many threads active (I have a 24 core machine), since there is up to a 2.2GB overhead for each thread (on my machine).

adamltyson · 2023-04-24T10:00:52Z

@n-garc n_processes isn't directly configurable, but n_free_cpus is, so the number of CPU cores used is based on the number your system has, minus a configurable number to spare. cellfinder also should make sure the RAM of the machine isn't exceeded (using imlib), but this relies on the rest of the detection working properly.

Add a progress bar for detection

adamltyson added the bug Something isn't working label Nov 30, 2022

adamltyson transferred this issue from brainglobe/cellfinder Dec 13, 2022

adamltyson self-assigned this Dec 15, 2022

adamltyson assigned adamltyson and unassigned adamltyson Jan 10, 2023

adamltyson assigned dstansby Feb 20, 2023

This was referenced Feb 28, 2023

Revert cell detection multiprocessing to old version #81

Closed

Revert multiprocessing changes #84

Closed

alessandrofelder mentioned this issue Mar 1, 2023

Benchmark 2D filter #88

Closed

dstansby added this to the 0.4.0 milestone Mar 30, 2023

dstansby mentioned this issue Apr 5, 2023

Limit number of 2D filtering outputs stored in memory #139

Merged

adamltyson mentioned this issue Apr 21, 2023

Only check unique neighbour IDs #146

Merged

dstansby added this to Core development Apr 24, 2023

github-project-automation bot moved this to Todo in Core development Apr 24, 2023

dstansby moved this from Todo to In Progress in Core development Apr 24, 2023

alessandrofelder closed this as completed in #139 Apr 28, 2023

github-project-automation bot moved this from In Progress to Done in Core development Apr 28, 2023

willGraham01 pushed a commit that referenced this issue Aug 24, 2023

Merge pull request #73 from dstansby/progress-bars

2462a46

Add a progress bar for detection

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate memory usage #73

Investigate memory usage #73

adamltyson commented Nov 30, 2022

adamltyson commented Dec 13, 2022

adamltyson commented Dec 15, 2022

adamltyson commented Dec 19, 2022

dstansby commented Jan 30, 2023

n-garc commented Feb 17, 2023

n-garc commented Feb 17, 2023

dstansby commented Feb 21, 2023

n-garc commented Feb 23, 2023 •

edited

Loading

dstansby commented Mar 28, 2023

dstansby commented Apr 5, 2023

n-garc commented Apr 6, 2023

n-garc commented Apr 19, 2023

dstansby commented Apr 21, 2023

n-garc commented Apr 21, 2023

adamltyson commented Apr 24, 2023

Investigate memory usage #73

Investigate memory usage #73

Comments

adamltyson commented Nov 30, 2022

adamltyson commented Dec 13, 2022

adamltyson commented Dec 15, 2022

adamltyson commented Dec 19, 2022

dstansby commented Jan 30, 2023

n-garc commented Feb 17, 2023

n-garc commented Feb 17, 2023

dstansby commented Feb 21, 2023

n-garc commented Feb 23, 2023 • edited Loading

dstansby commented Mar 28, 2023

dstansby commented Apr 5, 2023

n-garc commented Apr 6, 2023

n-garc commented Apr 19, 2023

dstansby commented Apr 21, 2023

n-garc commented Apr 21, 2023

adamltyson commented Apr 24, 2023

n-garc commented Feb 23, 2023 •

edited

Loading