Pex fails when used in conjunction with Pytorch #527

Closed
y0ast opened this issue Jul 20, 2018 · 4 comments

y0ast commented Jul 20, 2018

The following code:

import torch
import torch.utils.data

class Test(torch.utils.data.Dataset):
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return torch.randn(5, 5)


if __name__ == '__main__':
    n_workers = 4
    data = Test()
    train_loader = torch.utils.data.DataLoader(data, batch_size=2, num_workers=n_workers)

    for item in train_loader:
        print(item)
        break

Runs fine with a Python interpreter that has PyTorch and NumPy installed:

$ python test.py

(0 ,.,.) =
 -0.4520 -1.2118  0.3244  0.7276 -0.8220
  1.9171  0.9921  0.5206  0.2886 -0.3736
  1.3648  1.0917  0.1983  0.0402 -1.2338
  0.3155 -0.6425  0.3849 -0.4376  1.2371
  0.5352 -0.3482  1.4171  0.4569 -0.3687

(1 ,.,.) =
  0.1180 -1.0528  0.6809  0.2318 -2.2028
  0.1399  0.5150 -0.5507 -0.7639 -0.8540
  1.9070  0.8260  0.8660 -0.1029 -0.1823
 -2.3825 -2.0044  1.1457 -0.9469 -0.4882
 -0.3794  0.6706 -0.8183 -0.4942 -1.7503
[torch.FloatTensor of size 2x5x5]

But fails with pex:

$ pex torch numpy -- test.py
Traceback (most recent call last):
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.bootstrap/_pex/pex.py", line 367, in execute
    self._wrap_coverage(self._wrap_profiling, self._execute)
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.bootstrap/_pex/pex.py", line 293, in _wrap_coverage
    runner(*args)
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.bootstrap/_pex/pex.py", line 325, in _wrap_profiling
    runner(*args)
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.bootstrap/_pex/pex.py", line 413, in _execute
    return self.execute_interpreter()
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.bootstrap/_pex/pex.py", line 423, in execute_interpreter
    self.execute_content(name, content)
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.bootstrap/_pex/pex.py", line 453, in execute_content
    exec_function(ast, globals())
  File "<exec_function>", line 4, in exec_function
  File "test_dl.py", line 18, in <module>
    for item in train_loader:
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.deps/torch-0.4.0-cp27-none-macosx_10_6_x86_64.whl/torch/utils/data/dataloader.py", line 286, in __next__
    return self._process_next_batch(batch)
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.deps/torch-0.4.0-cp27-none-macosx_10_6_x86_64.whl/torch/utils/data/dataloader.py", line 307, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.deps/torch-0.4.0-cp27-none-macosx_10_6_x86_64.whl/torch/utils/data/dataloader.py", line 57, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.deps/torch-0.4.0-cp27-none-macosx_10_6_x86_64.whl/torch/utils/data/dataloader.py", line 113, in default_collate
    storage = batch[0].storage()._new_shared(numel)
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.deps/torch-0.4.0-cp27-none-macosx_10_6_x86_64.whl/torch/storage.py", line 114, in _new_shared
    return cls._new_using_filename(size)
RuntimeError: error executing torch_shm_manager at "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.deps/torch-0.4.0-cp27-none-macosx_10_6_x86_64.whl/torch/lib/torch_shm_manager" at /Users/soumith/code/builder/wheel/pytorch-src/torch/lib/libshm/core.cpp:125

Relevant info from the PyTorch maintainers on how their shared-memory strategies work is here: https://pytorch.org/docs/stable/multiprocessing.html#file-system-file-system


y0ast commented Jul 20, 2018

Additional info: the above code was run on Mac, where it breaks. On Linux, PyTorch uses a file_descriptor based sharing strategy by default, which works fine.

However, forcing the sharing strategy:

torch.multiprocessing.set_sharing_strategy('file_system')

makes it fail on Linux as well.
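
For reference, this is roughly how I forced the strategy in the repro above (a minimal sketch; I'm assuming the call has to happen in the main module before the DataLoader is created):

import torch
import torch.multiprocessing
import torch.utils.data

# Force the file_system strategy; on Linux the default is file_descriptor.
# With this in place the repro also fails on Linux when run from a pex.
torch.multiprocessing.set_sharing_strategy('file_system')

class Test(torch.utils.data.Dataset):
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return torch.randn(5, 5)

if __name__ == '__main__':
    data = Test()
    train_loader = torch.utils.data.DataLoader(data, batch_size=2, num_workers=4)
    for item in train_loader:
        print(item)
        break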

So the conclusion is that something about PyTorch's file_system shared-memory strategy is incompatible with pex.

Adding --not-zip-safe doesn't fix the problem.


jsirois commented Jul 20, 2018

OK, the root cause is that torch_shm_manager is not executable in the pex environment. When I build with --not-zip-safe and chmod +x /home/jsirois/.pex/install/torch-0.4.0-cp36-cp36m-manylinux1_x86_64.whl.40da15562268d7f03a1d135a85e2cc9ac9a46e8d/torch-0.4.0-cp36-cp36m-manylinux1_x86_64.whl/torch/lib/torch_shm_manager after running once (so that the torch.pex and its internal wheels are exploded to the filesystem), things work. I ran a quick experiment using Python's ZipFile.extractall and reproduced the permissions not being preserved (using the unzip binary instead, torch_shm_manager is executable after extraction). Here is the upstream bug: https://bugs.python.org/issue15795
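
Here's a sketch of that quick experiment (file names are hypothetical; it just packs an executable script into a zip and extracts it with the stdlib):

import os
import stat
import zipfile

# Create an executable script and zip it up.
with open('tool.sh', 'w') as f:
    f.write('#!/bin/sh\necho hello\n')
os.chmod('tool.sh', 0o755)
with zipfile.ZipFile('demo.zip', 'w') as zf:
    zf.write('tool.sh')  # the 0o755 mode is recorded in the entry's external_attr

with zipfile.ZipFile('demo.zip') as zf:
    zf.extractall('out')  # stdlib extraction does not restore the mode bits
    print(oct(zf.getinfo('tool.sh').external_attr >> 16))  # mode as stored in the archive
print(oct(stat.S_IMODE(os.stat(os.path.join('out', 'tool.sh')).st_mode)))  # execute bit is gone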

I think we can work around this by subclassing ZipFile :/
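
Something along these lines (just a sketch of the idea, names illustrative; not necessarily what will land in pex):

import os
import zipfile

class PermPreservingZipFile(zipfile.ZipFile):
    """A ZipFile that restores Unix permission bits after extraction.

    Works around https://bugs.python.org/issue15795 by chmod'ing each
    extracted entry to the mode recorded in its external_attr field.
    """

    def extract(self, member, path=None, pwd=None):
        if not isinstance(member, zipfile.ZipInfo):
            member = self.getinfo(member)
        extracted_path = super(PermPreservingZipFile, self).extract(member, path, pwd)
        mode = member.external_attr >> 16
        if mode:  # only chmod when the archive actually recorded a mode
            os.chmod(extracted_path, mode)
        return extracted_path

    def extractall(self, path=None, members=None, pwd=None):
        for member in (members if members is not None else self.infolist()):
            self.extract(member, path, pwd)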

And WOW - torch is a ~400MB wheel. That's getting to be the size of a docker image.

jsirois added the bug label Jul 20, 2018
jsirois self-assigned this Jul 20, 2018
jsirois added a commit to jsirois/pex that referenced this issue Jul 21, 2018
Previously, a zipped pex would lose permission bits when extracted to the
filesystem for `--not-zip-safe` pexes or `PEX_FORCE_LOCAL` runs. This
was due to an underlying bug in the `zipfile` stdlib tracked here:
  https://bugs.python.org/issue15795

Work around the bug in `zipfile.ZipFile` by extending it and running a
chmod'ing cleanup whenever `extract` or `extractall` is called.

Fixes pex-tool#527

jsirois commented Jul 21, 2018

To be clear @y0ast, once #528 is in and released in pex 1.4.5, you'll need to pass --not-zip-safe for torch; it will never work from a zipped-up context as it's written.
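
For example, mirroring the invocation above (assuming pex 1.4.5+ with the fix):

$ pex torch numpy --not-zip-safe -- test.py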


y0ast commented Jul 23, 2018

Thanks @jsirois, nice find!

Thanks for the quick fix too :).

The reason torch is so big is that it ships a bunch of precompiled CUDA kernels inside. We've had some discussions with them in the past about making it smaller, but it seems cutting it down further is either a lot of work or means cutting features.

jsirois added a commit that referenced this issue Jul 23, 2018
Previously, a zipped pex would lose permission bits when extracted to the
filesystem for `--not-zip-safe` pexes or `PEX_FORCE_LOCAL` runs. This
was due to an underlying bug in the `zipfile` stdlib tracked here:
  https://bugs.python.org/issue15795

Work around the bug in `zipfile.ZipFile` by extending it and running a
chmod'ing cleanup whenever `extract` or `extractall` is called.

Fixes #527
jsirois mentioned this issue Jul 27, 2018