Pex fails when used in conjunction with Pytorch #527

Closed
y0ast opened this issue Jul 20, 2018 · 4 comments

y0ast commented Jul 20, 2018

The following code:

import torch
import torch.utils.data

class Test(torch.utils.data.Dataset):
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return torch.randn(5, 5)


if __name__ == '__main__':
    n_workers = 4
    data = Test()
    train_loader = torch.utils.data.DataLoader(data, batch_size=2, num_workers=n_workers)

    for item in train_loader:
        print(item)
        break

Runs fine with a Python interpreter that has PyTorch and NumPy installed:

$ python test.py

(0 ,.,.) =
 -0.4520 -1.2118  0.3244  0.7276 -0.8220
  1.9171  0.9921  0.5206  0.2886 -0.3736
  1.3648  1.0917  0.1983  0.0402 -1.2338
  0.3155 -0.6425  0.3849 -0.4376  1.2371
  0.5352 -0.3482  1.4171  0.4569 -0.3687

(1 ,.,.) =
  0.1180 -1.0528  0.6809  0.2318 -2.2028
  0.1399  0.5150 -0.5507 -0.7639 -0.8540
  1.9070  0.8260  0.8660 -0.1029 -0.1823
 -2.3825 -2.0044  1.1457 -0.9469 -0.4882
 -0.3794  0.6706 -0.8183 -0.4942 -1.7503
[torch.FloatTensor of size 2x5x5]

But fails with pex:

$ pex torch numpy -- test.py
Traceback (most recent call last):
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.bootstrap/_pex/pex.py", line 367, in execute
    self._wrap_coverage(self._wrap_profiling, self._execute)
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.bootstrap/_pex/pex.py", line 293, in _wrap_coverage
    runner(*args)
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.bootstrap/_pex/pex.py", line 325, in _wrap_profiling
    runner(*args)
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.bootstrap/_pex/pex.py", line 413, in _execute
    return self.execute_interpreter()
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.bootstrap/_pex/pex.py", line 423, in execute_interpreter
    self.execute_content(name, content)
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.bootstrap/_pex/pex.py", line 453, in execute_content
    exec_function(ast, globals())
  File "<exec_function>", line 4, in exec_function
  File "test_dl.py", line 18, in <module>
    for item in train_loader:
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.deps/torch-0.4.0-cp27-none-macosx_10_6_x86_64.whl/torch/utils/data/dataloader.py", line 286, in __next__
    return self._process_next_batch(batch)
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.deps/torch-0.4.0-cp27-none-macosx_10_6_x86_64.whl/torch/utils/data/dataloader.py", line 307, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.deps/torch-0.4.0-cp27-none-macosx_10_6_x86_64.whl/torch/utils/data/dataloader.py", line 57, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.deps/torch-0.4.0-cp27-none-macosx_10_6_x86_64.whl/torch/utils/data/dataloader.py", line 113, in default_collate
    storage = batch[0].storage()._new_shared(numel)
  File "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.deps/torch-0.4.0-cp27-none-macosx_10_6_x86_64.whl/torch/storage.py", line 114, in _new_shared
    return cls._new_using_filename(size)
RuntimeError: error executing torch_shm_manager at "/private/var/folders/b5/hgpy98310bg_g3hfqngl759m0000gn/T/tmpxYzeas/.deps/torch-0.4.0-cp27-none-macosx_10_6_x86_64.whl/torch/lib/torch_shm_manager" at /Users/soumith/code/builder/wheel/pytorch-src/torch/lib/libshm/core.cpp:125

Relevant info from the PyTorch maintainers on how their shared-memory strategies work is here: https://pytorch.org/docs/stable/multiprocessing.html#file-system-file-system


y0ast commented Jul 20, 2018

Additional info: the above code was run on Mac, where it breaks. On Linux, PyTorch uses a file_descriptor based sharing strategy by default, which works fine.

However, forcing the sharing strategy:

torch.multiprocessing.set_sharing_strategy('file_system')

makes it fail on Linux as well.
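
For reference, this is roughly how I forced the strategy in the repro above (a minimal sketch; I'm assuming the call has to happen in the main module before the DataLoader is created):

import torch
import torch.multiprocessing
import torch.utils.data

# Force the file_system strategy; on Linux the default is file_descriptor.
# With this in place the repro also fails on Linux when run from a pex.
torch.multiprocessing.set_sharing_strategy('file_system')

class Test(torch.utils.data.Dataset):
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return torch.randn(5, 5)

if __name__ == '__main__':
    data = Test()
    train_loader = torch.utils.data.DataLoader(data, batch_size=2, num_workers=4)
    for item in train_loader:
        print(item)
        break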

So the conclusion is that something about PyTorch's file_system shared-memory strategy is incompatible with pex.

Adding --not-zip-safe doesn't fix the problem.


jsirois commented Jul 20, 2018

OK, the root cause is that torch_shm_manager is not executable in the pex environment. When I build with --not-zip-safe and chmod +x /home/jsirois/.pex/install/torch-0.4.0-cp36-cp36m-manylinux1_x86_64.whl.40da15562268d7f03a1d135a85e2cc9ac9a46e8d/torch-0.4.0-cp36-cp36m-manylinux1_x86_64.whl/torch/lib/torch_shm_manager after running once (so that the torch.pex and its internal wheels are exploded to the filesystem), things work. I ran a quick experiment using Python's ZipFile.extractall and reproduced the permissions not being preserved (using the unzip binary instead, torch_shm_manager is executable after extraction). Here is the upstream bug: https://bugs.python.org/issue15795
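
Here's a sketch of that quick experiment (file names are hypothetical; it just packs an executable script into a zip and extracts it with the stdlib):

import os
import stat
import zipfile

# Create an executable script and zip it up.
with open('tool.sh', 'w') as f:
    f.write('#!/bin/sh\necho hello\n')
os.chmod('tool.sh', 0o755)
with zipfile.ZipFile('demo.zip', 'w') as zf:
    zf.write('tool.sh')  # the 0o755 mode is recorded in the entry's external_attr

with zipfile.ZipFile('demo.zip') as zf:
    zf.extractall('out')  # stdlib extraction does not restore the mode bits
    print(oct(zf.getinfo('tool.sh').external_attr >> 16))  # mode as stored in the archive
print(oct(stat.S_IMODE(os.stat(os.path.join('out', 'tool.sh')).st_mode)))  # execute bit is gone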

I think we can work around this by subclassing ZipFile :/
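
Something along these lines (just a sketch of the idea, names illustrative; not necessarily what will land in pex):

import os
import zipfile

class PermPreservingZipFile(zipfile.ZipFile):
    """A ZipFile that restores Unix permission bits after extraction.

    Works around https://bugs.python.org/issue15795 by chmod'ing each
    extracted entry to the mode recorded in its external_attr field.
    """

    def extract(self, member, path=None, pwd=None):
        if not isinstance(member, zipfile.ZipInfo):
            member = self.getinfo(member)
        extracted_path = super(PermPreservingZipFile, self).extract(member, path, pwd)
        mode = member.external_attr >> 16
        if mode:  # only chmod when the archive actually recorded a mode
            os.chmod(extracted_path, mode)
        return extracted_path

    def extractall(self, path=None, members=None, pwd=None):
        for member in (members if members is not None else self.infolist()):
            self.extract(member, path, pwd)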

And WOW - torch is a ~400MB wheel. That's getting to be the size of a docker image.

jsirois added the bug label Jul 20, 2018
jsirois self-assigned this Jul 20, 2018
jsirois added a commit to jsirois/pex that referenced this issue Jul 21, 2018
Previously, a zipped pex would lose permission bits when extracted to the
filesystem for `--not-zip-safe` pexes or `PEX_FORCE_LOCAL` runs. This
was due to an underlying bug in the `zipfile` stdlib tracked here:
  https://bugs.python.org/issue15795

Work around the bug in `zipfile.ZipFile` by extending it and running a
chmod'ing cleanup whenever `extract` or `extractall` is called.

Fixes pex-tool#527

jsirois commented Jul 21, 2018

To be clear @y0ast, once #528 is in and released in pex 1.4.5, you'll need to pass --not-zip-safe for torch; it will never work from a zipped-up context as it's written.
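
For example, mirroring the invocation above (assuming pex 1.4.5+ with the fix):

$ pex torch numpy --not-zip-safe -- test.py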


y0ast commented Jul 23, 2018

Thanks @jsirois, nice find!

Thanks for the quick fix too :).

The reason torch is so big is that it ships a bunch of precompiled CUDA kernels inside. We've had some discussions with them in the past about making it smaller, but it seems cutting it down further is either a lot of work or means cutting features.

jsirois added a commit that referenced this issue Jul 23, 2018
Previously, a zipped pex would lose permission bits when extracted to the
filesystem for `--not-zip-safe` pexes or `PEX_FORCE_LOCAL` runs. This
was due to an underlying bug in the `zipfile` stdlib tracked here:
  https://bugs.python.org/issue15795

Work around the bug in `zipfile.ZipFile` by extending it and running a
chmod'ing cleanup whenever `extract` or `extractall` is called.

Fixes #527
jsirois mentioned this issue Jul 27, 2018