Fix "unable to open shared memory" while using MPFuture #517
Conversation
d2c7433 to ea6e3fc
@@ -303,7 +303,7 @@ def __del__(self):
     def __getstate__(self):
         return dict(
             _sender_pipe=self._sender_pipe,
-            _shared_state_code=self._shared_state_code,
+            _shared_state_code=ForkingPickler.dumps(self._shared_state_code).tobytes(),
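For context, here is a hedged sketch of how both halves of this round trip could fit together. It is an illustration only, with a stand-in class name (`_FutureStub`), not the actual MPFuture implementation; the field names simply mirror the diff above.

```python
# Hypothetical sketch, NOT taken from hivemind's code: how the receiving side
# could undo the serialization shown in the diff. _FutureStub stands in for MPFuture.
from multiprocessing.reduction import ForkingPickler


class _FutureStub:
    def __getstate__(self):
        return dict(
            _sender_pipe=self._sender_pipe,
            # pre-serialize the shared state so the shared-memory handle
            # survives pickling under any start method
            _shared_state_code=ForkingPickler.dumps(self._shared_state_code).tobytes(),
        )

    def __setstate__(self, state):
        self._sender_pipe = state["_sender_pipe"]
        # re-attach to the shared storage by unpickling the raw bytes
        self._shared_state_code = ForkingPickler.loads(state["_shared_state_code"])
```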
Despite the name, it works correctly with all multiprocessing start methods (`fork`, `spawn`, and `forkserver`). This may be tested using this short script:
import multiprocessing as mp
import torch
import time
from multiprocessing.reduction import ForkingPickler

torch.multiprocessing.set_sharing_strategy("file_system")

def foo(q):
    # serialize a shared-memory tensor with ForkingPickler and send the raw bytes
    t = torch.zeros(3).share_memory_()
    q.put(ForkingPickler.dumps(t).tobytes())
    t[1] = 777  # mutate after sending; the parent sees this only if the storage is truly shared

if __name__ == '__main__':
    mp.set_start_method('spawn')
    q = mp.Queue()
    p = mp.Process(target=foo, args=(q,))
    p.start()
    time.sleep(1)
    # prints tensor([0., 777., 0.]) if the unpickled tensor shares storage with the child
    print(ForkingPickler.loads(q.get()))
    p.join()
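For what it's worth, the reason this round trip works appears to be that `torch.multiprocessing` registers PyTorch's tensor and storage reducers with `ForkingPickler`, so `ForkingPickler.dumps` serializes a handle to the shared-memory storage rather than a copy of the data; any process that can open that storage can then unpickle the tensor, regardless of the start method.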
Codecov Report
@@            Coverage Diff             @@
##           master     #517      +/-   ##
==========================================
- Coverage   75.97%   75.91%   -0.07%
==========================================
  Files          81       81
  Lines        7943     7946       +3
==========================================
- Hits         6035     6032       -3
- Misses       1908     1914       +6
Great job! Merge when ready.
Currently, one may sometimes get the "unable to open shared memory" error (see the screenshot) while using `hivemind.MPFuture`. Interestingly, the smaller `HIVEMIND_SHM_BUFFER_SIZE` is, the more often the error occurs (e.g., in Petals, it occurs right after starting the server if `HIVEMIND_SHM_BUFFER_SIZE=2`). It turns out this happens when the origin process garbage-collects all MPFuture instances that use the same shmem buffer: the underlying buffer is freed, and target processes can no longer reconnect to it when unpickling their MPFuture instances. This PR fixes this important issue. (cherry picked from commit 94c985d)
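To make the failure mode concrete, here is a minimal, self-contained sketch of the same mechanism using the standard library's `multiprocessing.shared_memory` module. It is illustrative only; MPFuture uses its own shmem buffer, and the names below are made up.

```python
# Illustration of the mechanism described above (not hivemind code): once the
# "origin" side frees its shared-memory block, a "target" that only knows the
# block's name can no longer attach to it.
from multiprocessing import shared_memory

origin = shared_memory.SharedMemory(create=True, size=1024)
name = origin.name  # this name is what effectively gets pickled and sent around

# the origin frees the buffer (akin to garbage-collecting every MPFuture that used it)
origin.close()
origin.unlink()

try:
    shared_memory.SharedMemory(name=name)  # a target process tries to reconnect
except FileNotFoundError as err:
    print("cannot reattach to the freed buffer:", err)
```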