
Fix "unable to open shared memory" while using MPFuture #517

Merged · 2 commits merged into master from fix-mpfuture-shmem on Nov 1, 2022

Conversation

@borzunov (Member) commented on Nov 1, 2022

[Screenshot (photo_2022-11-01 06 03 12): the "unable to open shared memory" error]

Currently, one may sometimes get the "unable to open shared memory" error (see the screenshot) while using `hivemind.MPFuture`. Interestingly, the smaller `HIVEMIND_SHM_BUFFER_SIZE` is, the more often the error occurs (e.g., in Petals, it occurs right after starting the server if `HIVEMIND_SHM_BUFFER_SIZE=2`).

It turns out this happens when the origin process garbage-collects all `MPFuture` instances that use the same shared-memory buffer: the underlying buffer is then freed, and target processes can no longer reconnect to it when unpickling their `MPFuture` instances.

This PR fixes the issue by serializing `_shared_state_code` with `ForkingPickler` in `__getstate__` instead of passing the shared object directly (see the diff below).
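
The failure mode can be illustrated in isolation with the standard library's multiprocessing.shared_memory (an analogy only, not hivemind or torch code): once the creator of a shared segment frees and unlinks it, anyone who only holds its name can no longer attach.

# Analogy of the bug using multiprocessing.shared_memory (not hivemind code; POSIX assumed)
from multiprocessing import shared_memory

# The "origin" creates a shared buffer and hands out only its name
buf = shared_memory.SharedMemory(create=True, size=16)
name = buf.name

# If the origin drops its last reference and unlinks the segment
# (analogous to garbage-collecting the last MPFuture that uses the buffer)...
buf.close()
buf.unlink()

# ...a "target" that only knows the name can no longer reconnect:
try:
    shared_memory.SharedMemory(name=name)
except FileNotFoundError as err:
    print("reconnect failed:", err)  # same class of failure as "unable to open shared memory"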

@borzunov added the bug (Something isn't working) label on Nov 1, 2022
@@ -303,7 +303,7 @@ def __del__(self):
     def __getstate__(self):
         return dict(
             _sender_pipe=self._sender_pipe,
-            _shared_state_code=self._shared_state_code,
+            _shared_state_code=ForkingPickler.dumps(self._shared_state_code).tobytes(),
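
For orientation, the receiving side presumably undoes this with ForkingPickler.loads in __setstate__; a hypothetical sketch (the exact hivemind code is not shown in this diff and may differ):

# Hypothetical counterpart (not shown in this diff; the real MPFuture.__setstate__ may differ)
def __setstate__(self, state):
    self._sender_pipe = state["_sender_pipe"]
    # Re-attach to the shared state through the pre-pickled handle
    self._shared_state_code = ForkingPickler.loads(state["_shared_state_code"])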
@borzunov (Member, Author) commented on Nov 1, 2022


Despite the name, ForkingPickler works correctly with all multiprocessing start methods (fork, spawn, and forkserver). This can be tested using this short script:

import multiprocessing as mp
import torch
import time
from multiprocessing.reduction import ForkingPickler

torch.multiprocessing.set_sharing_strategy("file_system")

def foo(q):
    # Create a tensor in shared memory, send its ForkingPickler serialization
    # to the parent, then modify the tensor in place in the child process
    t = torch.zeros(3).share_memory_()
    q.put(ForkingPickler.dumps(t).tobytes())
    t[1] = 777

if __name__ == '__main__':
    mp.set_start_method('spawn')  # works the same way with 'fork' and 'forkserver'
    q = mp.Queue()
    p = mp.Process(target=foo, args=(q,))
    p.start()
    time.sleep(1)
    # Deserializing in the parent reattaches to the tensor shared by the child
    print(ForkingPickler.loads(q.get()))
    p.join()
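
For reference, the same mechanism can be seen within a single process: the ForkingPickler round trip transfers a handle to the shared segment rather than a copy of the data. The sketch below assumes torch's standard reducers are registered on ForkingPickler (importing torch.multiprocessing does this); it is not hivemind code.

import torch
import torch.multiprocessing  # registers torch's tensor/storage reducers on ForkingPickler
from multiprocessing.reduction import ForkingPickler

torch.multiprocessing.set_sharing_strategy("file_system")  # same strategy as the script above

t = torch.zeros(3).share_memory_()
restored = ForkingPickler.loads(ForkingPickler.dumps(t).tobytes())

restored[1] = 777
print(t)  # tensor([0., 777., 0.]) -- both tensors map the same shared segment

Because the unpickled tensor maps the same segment as the original, a target process that unpickles an MPFuture can reconnect to the origin's shared state rather than receive a stale copy.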

codecov bot commented on Nov 1, 2022

Codecov Report

Merging #517 (f7cd74a) into master (3e817a5) will decrease coverage by 0.06%.
The diff coverage is 60.00%.

@@            Coverage Diff             @@
##           master     #517      +/-   ##
==========================================
- Coverage   75.97%   75.91%   -0.07%     
==========================================
  Files          81       81              
  Lines        7943     7946       +3     
==========================================
- Hits         6035     6032       -3     
- Misses       1908     1914       +6     
Impacted Files                               Coverage Δ
hivemind/utils/mpfuture.py                   89.28% <60.00%> (-0.76%) ⬇️
hivemind/moe/server/connection_handler.py    46.87% <0.00%> (-1.05%) ⬇️
hivemind/dht/protocol.py                     92.23% <0.00%> (-0.92%) ⬇️
hivemind/moe/server/runtime.py               69.16% <0.00%> (-0.84%) ⬇️
hivemind/moe/server/server.py                43.71% <0.00%> (-0.55%) ⬇️
hivemind/dht/routing.py                      94.11% <0.00%> (+0.58%) ⬆️

@justheuristic (Member) left a comment


Great job! Merge when ready.

@borzunov merged commit 94c985d into master on Nov 1, 2022
@borzunov deleted the fix-mpfuture-shmem branch on November 1, 2022 at 15:22
mryab pushed a commit that referenced this pull request on Nov 29, 2022

(cherry picked from commit 94c985d)
Labels: bug (Something isn't working)
2 participants