
Fix "unable to open shared memory" while using MPFuture #517

Merged · 2 commits merged into master from fix-mpfuture-shmem on Nov 1, 2022

Conversation

@borzunov (Member) commented on Nov 1, 2022

[Screenshot (photo_2022-11-01 06 03 12): the "unable to open shared memory" error]

Currently, one may sometimes get the "unable to open shared memory" error (see the screenshot) while using `hivemind.MPFuture`. Interestingly, the smaller `HIVEMIND_SHM_BUFFER_SIZE` is, the more often the error occurs (e.g., in Petals, it occurs right after starting the server if `HIVEMIND_SHM_BUFFER_SIZE=2`).

It turns out this happens when the origin process garbage-collects all `MPFuture` instances that use the same shared-memory buffer: the underlying buffer is then freed, and target processes can no longer reconnect to it when unpickling their `MPFuture` instances.

This PR fixes the issue by serializing `_shared_state_code` with `ForkingPickler` in `__getstate__` instead of passing the shared object directly (see the diff below).
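
The failure mode can be illustrated in isolation with the standard library's multiprocessing.shared_memory (an analogy only, not hivemind or torch code): once the creator of a shared segment frees and unlinks it, anyone who only holds its name can no longer attach.

# Analogy of the bug using multiprocessing.shared_memory (not hivemind code; POSIX assumed)
from multiprocessing import shared_memory

# The "origin" creates a shared buffer and hands out only its name
buf = shared_memory.SharedMemory(create=True, size=16)
name = buf.name

# If the origin drops its last reference and unlinks the segment
# (analogous to garbage-collecting the last MPFuture that uses the buffer)...
buf.close()
buf.unlink()

# ...a "target" that only knows the name can no longer reconnect:
try:
    shared_memory.SharedMemory(name=name)
except FileNotFoundError as err:
    print("reconnect failed:", err)  # same class of failure as "unable to open shared memory"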

@borzunov added the bug (Something isn't working) label on Nov 1, 2022
@@ -303,7 +303,7 @@ def __del__(self):
     def __getstate__(self):
         return dict(
             _sender_pipe=self._sender_pipe,
-            _shared_state_code=self._shared_state_code,
+            _shared_state_code=ForkingPickler.dumps(self._shared_state_code).tobytes(),
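
For orientation, the receiving side presumably undoes this with ForkingPickler.loads in __setstate__; a hypothetical sketch (the exact hivemind code is not shown in this diff and may differ):

# Hypothetical counterpart (not shown in this diff; the real MPFuture.__setstate__ may differ)
def __setstate__(self, state):
    self._sender_pipe = state["_sender_pipe"]
    # Re-attach to the shared state through the pre-pickled handle
    self._shared_state_code = ForkingPickler.loads(state["_shared_state_code"])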
@borzunov (Member, Author) commented on Nov 1, 2022


Despite the name, ForkingPickler works correctly with all multiprocessing start methods (fork, spawn, and forkserver). This can be tested using this short script:

import multiprocessing as mp
import torch
import time
from multiprocessing.reduction import ForkingPickler

torch.multiprocessing.set_sharing_strategy("file_system")

def foo(q):
    # Create a tensor in shared memory, send its ForkingPickler serialization
    # to the parent, then modify the tensor in place in the child process
    t = torch.zeros(3).share_memory_()
    q.put(ForkingPickler.dumps(t).tobytes())
    t[1] = 777

if __name__ == '__main__':
    mp.set_start_method('spawn')  # works the same way with 'fork' and 'forkserver'
    q = mp.Queue()
    p = mp.Process(target=foo, args=(q,))
    p.start()
    time.sleep(1)
    # Deserializing in the parent reattaches to the tensor shared by the child
    print(ForkingPickler.loads(q.get()))
    p.join()
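
For reference, the same mechanism can be seen within a single process: the ForkingPickler round trip transfers a handle to the shared segment rather than a copy of the data. The sketch below assumes torch's standard reducers are registered on ForkingPickler (importing torch.multiprocessing does this); it is not hivemind code.

import torch
import torch.multiprocessing  # registers torch's tensor/storage reducers on ForkingPickler
from multiprocessing.reduction import ForkingPickler

torch.multiprocessing.set_sharing_strategy("file_system")  # same strategy as the script above

t = torch.zeros(3).share_memory_()
restored = ForkingPickler.loads(ForkingPickler.dumps(t).tobytes())

restored[1] = 777
print(t)  # tensor([0., 777., 0.]) -- both tensors map the same shared segment

Because the unpickled tensor maps the same segment as the original, a target process that unpickles an MPFuture can reconnect to the origin's shared state rather than receive a stale copy.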

codecov bot commented on Nov 1, 2022

Codecov Report

Merging #517 (f7cd74a) into master (3e817a5) will decrease coverage by 0.06%.
The diff coverage is 60.00%.

@@            Coverage Diff             @@
##           master     #517      +/-   ##
==========================================
- Coverage   75.97%   75.91%   -0.07%     
==========================================
  Files          81       81              
  Lines        7943     7946       +3     
==========================================
- Hits         6035     6032       -3     
- Misses       1908     1914       +6     
Impacted Files                               Coverage Δ
hivemind/utils/mpfuture.py                   89.28% <60.00%> (-0.76%) ⬇️
hivemind/moe/server/connection_handler.py    46.87% <0.00%> (-1.05%) ⬇️
hivemind/dht/protocol.py                     92.23% <0.00%> (-0.92%) ⬇️
hivemind/moe/server/runtime.py               69.16% <0.00%> (-0.84%) ⬇️
hivemind/moe/server/server.py                43.71% <0.00%> (-0.55%) ⬇️
hivemind/dht/routing.py                      94.11% <0.00%> (+0.58%) ⬆️

@justheuristic (Member) left a comment


Great job! Merge when ready.

@borzunov merged commit 94c985d into master on Nov 1, 2022
@borzunov deleted the fix-mpfuture-shmem branch on November 1, 2022 at 15:22
mryab pushed a commit that referenced this pull request on Nov 29, 2022

(cherry picked from commit 94c985d)
Labels: bug (Something isn't working)
2 participants