This repository has been archived by the owner on Jan 3, 2023. It is now read-only.

Updated Docker images #89

Closed · Kaixhin opened this issue Sep 13, 2015 · 17 comments

Comments

Kaixhin (Contributor) commented Sep 13, 2015

I've updated my Docker builds for version 1.0 - one for the cpu backend, and a new one for the gpu backend. The GPU images referenced in the 0.9 docs are still available, but with a note about deprecation.

I've tested the cpu version with neon examples/mnist_mlp.yaml and python examples/mnist_mlp.py, and it appears fine. However, the gpu image currently builds the cpu version because of #83. When adding code to check for GPU capabilities, please keep #19 in mind.

scttl (Contributor) commented Sep 14, 2015

Hi,

Thanks for updating these.

We've just pushed a fix for #83; can you try building the gpu backend image off of the latest master? Now that we try to infer whether an appropriate GPU is present as part of the basic build (falling back to CPU), I'm wondering if separate cpu/gpu Docker images are still needed.

Kaixhin (Contributor, Author) commented Sep 14, 2015

The CPU image just builds neon on Ubuntu Core 14.04, but the CUDA images also include the CUDA SDK (with v6.5, v7.0 and v7.5 available with kaixhin/cuda-neon:6.5, kaixhin/cuda-neon:7.0 and kaixhin/cuda-neon respectively). I've had an email from someone asking if CUDA 7.0 would still be supported in the transition to neon v1.0, so the versioning seems useful.

Included below is a stack trace from trying to run neon -b gpu examples/mnist_mlp.yaml; it occurs after the dataset downloads. The CPU backend appears to run by default, so I'm guessing the GPU backend isn't being built properly. Going by my nervanagpu Dockerfile, running pip install --upgrade six did fix one error I saw in an earlier stack trace (after the #83 fix), but installing PyCUDA with pip didn't make a difference.

Traceback (most recent call last):
  File "/usr/local/bin/neon", line 172, in <module>
    callbacks=callbacks)
  File "/usr/local/lib/python2.7/dist-packages/neon/models/model.py", line 120, in fit
    self._epoch_fit(dataset, callbacks)
  File "/usr/local/lib/python2.7/dist-packages/neon/models/model.py", line 142, in _epoch_fit
    x = self.fprop(x)
  File "/usr/local/lib/python2.7/dist-packages/neon/models/model.py", line 173, in fprop
    x = l.fprop(x, inference)
  File "/usr/local/lib/python2.7/dist-packages/neon/layers/layer.py", line 422, in fprop
    self.be.compound_dot(A=self.W, B=inputs, C=self.outputs)
  File "/usr/local/lib/python2.7/dist-packages/neon/backends/nervanagpu.py", line 1134, in compound_dot
    kernel = _get_gemm_kernel(self.cubin_path, clss, op, size)
  File "<string>", line 2, in _get_gemm_kernel
  File "/usr/local/lib/python2.7/dist-packages/pycuda/tools.py", line 430, in context_dependent_memoize
    result = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/neon/backends/nervanagpu.py", line 1711, in _get_gemm_kernel
    module = _get_module(path, clss, op, size)
  File "<string>", line 2, in _get_module
  File "/usr/local/lib/python2.7/dist-packages/pycuda/tools.py", line 430, in context_dependent_memoize
    result = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/neon/backends/nervanagpu.py", line 1703, in _get_module
    return drv.module_from_file(os.path.join(path, cubin))
pycuda._driver.RuntimeError: cuModuleLoad failed: file not found

seba-1511 (Contributor) commented:

I've had the same issue when updating to 1.0. Could it be that your cuda_path/lib64 is not in your LD_LIBRARY_PATH?

Kaixhin (Contributor, Author) commented Sep 14, 2015

Both PATH and LD_LIBRARY_PATH have been set up, and I've just confirmed this with a manual check.
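
For anyone wanting to reproduce the check, a quick sketch from Python works too (the /usr/local/cuda default here is an assumption; adjust it to wherever CUDA lives in your image):

import os

# Assumed default install prefix; the CUDA images may use a different location.
cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
lib_dir = os.path.join(cuda_home, "lib64")

ld_dirs = os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
path_dirs = os.environ.get("PATH", "").split(os.pathsep)

print("lib64 on LD_LIBRARY_PATH:", lib_dir in ld_dirs)
print("nvcc on PATH:", any(os.path.isfile(os.path.join(d, "nvcc")) for d in path_dirs))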

scttl (Contributor) commented Sep 14, 2015

We've uncovered an issue with the GPU build procedure on some machines that we're currently looking into. There's a good chance your kernels didn't get built (check neon/backends/kernels/cubin), which I think could lead to the error you are seeing.
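
A quick way to check this from Python (the install path here is taken from the traceback above) would be something like:

import glob
import os

# Install location taken from the traceback above; adjust for other setups.
kernel_dir = "/usr/local/lib/python2.7/dist-packages/neon/backends/kernels"
cubins = glob.glob(os.path.join(kernel_dir, "cubin", "*.cubin"))
print("compiled cubin kernels found:", len(cubins))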

Kaixhin (Contributor, Author) commented Sep 14, 2015

It is missing - the kernels folder only contains C_interface, cu and sass. There's a lot to look through, but if you start from the end you might find something in the Docker build logs that are produced.

scttl (Contributor) commented Sep 15, 2015

Does Docker Hub build on AWS? I suspect the reason you don't end up with kernels is that your build machine doesn't have a Maxwell-capable GPU.

Until we build in support for this (see #80), we'll need to do a better job of detecting this case and warning the user.
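
As a rough illustration of that kind of check (not the actual detection code in neon), querying the compute capability through PyCUDA would look something like the following; Maxwell corresponds to compute capability 5.x:

import pycuda.driver as drv

drv.init()
maxwell_found = False
for i in range(drv.Device.count()):
    dev = drv.Device(i)
    major, minor = dev.compute_capability()
    print("%s: compute capability %d.%d" % (dev.name(), major, minor))
    if major >= 5:  # Maxwell (sm_50) and newer
        maxwell_found = True

if not maxwell_found:
    print("warning: no Maxwell-capable GPU found; the GPU backend cannot run here")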

Kaixhin (Contributor, Author) commented Sep 15, 2015

Probably, if not something similar - which is why I changed nvidia-smi to nvcc in e519b81.

Some kind of flag or manual make option to force the install, with a warning, would work. However, if the cudanet backend is needed to support older GPUs (#75), I'll need to move back to the kaixhin/nervanagpu-neon and kaixhin/cudanet-neon images to support this properly.

pcallier (Contributor) commented:

FWIW, I see the same missing-kernels issue and error message on our Maxwell GPU with CUDA 7.0.

scttl (Contributor) commented Sep 16, 2015

We just pushed some changes that should address the kernel build issue that was introduced with the fix for #83.

We've also modified things to build the kernels even on machines that don't have a Maxwell GPU, which should help with building the GPU-based Docker images. You still won't be able to run the GPU backend without a Maxwell GPU, though. To remedy that we'll likely end up backporting the nervanagpu kernels; we don't plan on resurrecting the cudanet backend.

Kaixhin (Contributor, Author) commented Sep 16, 2015

The builds have been failing for a while now, and after some investigation it looks like this is the result of the Automated Build limits. Judging from the time between a build being created and the exception being thrown, the builds are hitting the 2-hour limit.

Any ideas? Perhaps something in my Dockerfile can be changed? That said, the builds were failing before I added pkg-config and libopencv-dev, so I don't think the visualisation functionality is adding much time.

scttl (Contributor) commented Sep 17, 2015

Wow, 2 hours is pretty crazy; any idea what part of the build dominates? I tried to log in and view the build details page for a recent run, but the logs panel was empty for me.

I just ran a clean GPU-based build of neon and it took about 7 minutes start to finish (make, i.e. a virtualenv-based install). One slowdown on a sysinstall-based install currently is that building the kernels depends on maxas, which in turn depends on the virtualenv, so Python packages end up getting installed twice: once in the virtualenv and once in the system. Fixing that should shave something like 5-10 minutes off of the build time.

Kaixhin (Contributor, Author) commented Sep 17, 2015

Docker Support sent me an email about it: "We are seeing your build is failing due to too much memory consumption and crashing again and again". It appears the build could be failing earlier, but the exception is only thrown after 2 hours. For reference, the Automated Build limits are:

  • 2 hours
  • 2 GB RAM
  • 1 CPU
  • 30 GB Disk Space

Their suggested solution is to break this into several Automated Builds, but I'm not sure that would even solve the problem. There are basically two steps: installing a few Ubuntu packages and running the actual make. The first step has never been an issue for software with similar requirements.

In the latest failed build I removed the -j flag from make to lower memory consumption, but it still failed (there is a chance this has now become a timeout issue, but I doubt it). The CUDA base image is 1 GB, so I doubt disk space is the problem.

scttl added a commit that referenced this issue Sep 18, 2015
Speed ups should at least partially address #89
scttl (Contributor) commented Sep 18, 2015

We've just pushed an update to neon that removes the unnecessary virtualenv Python dependency install from a system-wide install. There's also a new make sysinstall_nodeps target, which could be even faster if you've already installed the Python dependencies elsewhere.

As for the memory consumption issue, what we think may be going on is that the neon/backends/make_kernels.py utility used to compile the cubins (and so forth) currently launches the nvcc subprocess calls for all kernels simultaneously. We'll probably need to add some sort of throttling to these launches on machines with limited resources like the Docker Hub build box.
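
For illustration, a minimal sketch of that kind of throttling (not the actual make_kernels.py code; the nvcc invocation shape is taken from the build logs quoted further down) might look like:

import subprocess
import time

def build_kernels(commands, max_concurrent=10):
    """Launch compile commands, keeping at most max_concurrent processes in flight."""
    running = []
    for cmd in commands:
        # Wait until a slot frees up before launching the next compile.
        while len(running) >= max_concurrent:
            running = [p for p in running if p.poll() is None]
            time.sleep(0.1)
        running.append(subprocess.Popen(cmd))
    # Wait for the remaining compiles to finish and collect their exit codes.
    return [p.wait() for p in running]

# Hypothetical usage, with commands shaped like those in the build logs:
# build_kernels([["nvcc", "-arch", "sm_50", "-cubin", "-o", out_path, src_path]],
#               max_concurrent=2)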

Kaixhin (Contributor, Author) commented Sep 18, 2015

At Docker Support's suggestion I created a virtual machine with the same resource limits as their machines, but my builds succeeded there. I contacted them and they looked into the issue (response below); it seems you've identified what's causing the memory consumption problem. Let me know once the throttling is in place so that I can try again.

We ran the build twice, and when maxas is installed, it still uses a bunch of parallelism, and it looks like that was the part that was causing problems initially.

We can't find anything more definitive, but when it is installing maxas, it looks like it is running a bunch of cudafe++ and regular c/c++ compilers.
Running it directly with the logs going to our screen, we can see
nvcc -arch sm_50 -cubin -o /root/neon/neon/backends/kernels/cubin/hconv_bprop_C32_N64.cubin /root/neon/neon/backends/kernels/cu/hconv_bprop_C32_N64.cu Killed
A bunch of lines like these.

It does use up all the swap on the system, of which we only have 512MB.
I know that most linux distros tend to create a larger amount of swap, so perhaps that is why you were able to build it locally.

scttl added a commit that referenced this issue Sep 18, 2015
Defaults to 25, can be adjusted via --max_concurrent.

Should fix the remainder of #89
scttl (Contributor) commented Sep 18, 2015

OK, the latest push throttles the default number of concurrent kernel build processes to 10 (upwards of ~50 were launching at the same time without this limit). Hopefully that will be sufficient for the Docker Hub environment, but if not you can tune it further from the top level via:

make sysinstall -e KERNEL_BUILDER_BUILD_OPTS="--kernels --max_concurrent X"

where X is replaced with however many processes you can run simultaneously.

Try playing around with that and let us know if you're still seeing issues.

Kaixhin (Contributor, Author) commented Sep 19, 2015

Great - thanks to that commit the automated builds are now succeeding on the Docker Hub! I've set up weekly builds for both the CPU and CUDA versions, so you can add them to the docs if you want.

FYI, the following error gets thrown for both the mnist_mlp.yaml and mnist_mlp.py examples with the GPU backend, but not with the CPU backend. It doesn't seem to stop the Python example from running, though, so I'm not considering it a major issue.

Exception pycuda._driver.LogicError: 'context::detach failed: invalid device context - cannot detach from invalid context' in <bound method NervanaGPU.__del__ of <neon.backends.nervanagpu.NervanaGPU object at 0x7f5b170abf50>> ignored

Kaixhin closed this as completed Sep 19, 2015