Failed to build xformers #748

Open
lurui22230 opened this issue May 13, 2023 · 12 comments
@lurui22230

❓ Questions and Help

My CUDA version is 12.1, but I installed PyTorch 2.0.0 + cu118 because there is no cu121 build. When I build xformers it raises this error: The detected CUDA version (12.1) mismatches the version that was used to compile PyTorch (11.8). Please make sure to use the same CUDA versions.
Do I have to change my CUDA to 11.8?

@danthe3rd
Contributor

Yes, exactly. You'll need to use nvcc version 11.8.
However, we provide pre-compiled binaries that you should be able to use if you are on Windows/Linux - for instance through pip - see here
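
For reference, a minimal sketch (not from this thread) of how to check which CUDA toolkit a PyTorch build was compiled against before deciding between the prebuilt wheels and a source build; exact wheel availability depends on the PyTorch/CUDA combination:

# CUDA toolkit version the installed PyTorch was compiled with
python -c "import torch; print(torch.version.cuda)"

# Locally installed CUDA toolkit (nvcc) version used for source builds
nvcc --version

# If a matching combination is supported, the prebuilt wheels avoid compiling anything
pip install -U xformers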

@avolkov1

I would like CUDA 12 support. Is CUDA 12 on the roadmap? It currently fails to build with:

TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 9.0+PTX"

The sm90 target results in errors. This builds fine, though:

TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6+PTX"

Thanks
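
For context, a sketch (assuming a pip source build of xformers, like the one shown later in this thread) of how those architecture lists are passed to the build:

# Fails once sm_90 is included
export TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 9.0+PTX"
pip install -v -U git+https://github.com/facebookresearch/xformers.git#egg=xformers

# Builds fine without sm_90
export TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6+PTX"
pip install -v -U git+https://github.com/facebookresearch/xformers.git#egg=xformers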

@danthe3rd
Contributor

Hi, what error do you have? xFormers should build fine for H100 (although only tested on CUDA 11.8, I believe).

@avolkov1

avolkov1 commented Jun 23, 2023

Here's my Dockerfile:

FROM nvcr.io/nvidia/pytorch:23.05-py3

# a bunch of other stuff omitted here for brevity

RUN export TORCH_CUDA_ARCH_LIST="8.0 8.6 9.0+PTX" && \
    pip install -v -U git+https://github.com/facebookresearch/xformers.git@6f0602f#egg=xformers

For information regarding the base container, refer to:
https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-05.html
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch

Building this, I get errors of this sort:

  ptxas info    : Used 128 registers
  ptxas info    : Compiling entry function '_Z55fmha_cutlassB_bf16_aligned_128x128_k128_seqaligned_sm80N23AttentionBackwardKernelIN7cutlass4arch4Sm80ENS0_10bfloat16_tELb1ELb0ELb1ELi128ELi128ELi128ELb1EE6ParamsE' for 'sm_90'
  ptxas info    : Function properties for _Z55fmha_cutlassB_bf16_aligned_128x128_k128_seqaligned_sm80N23AttentionBackwardKernelIN7cutlass4arch4Sm80ENS0_10bfloat16_tELb1ELb0ELb1ELi128ELi128ELi128ELb1EE6ParamsE
      16 bytes stack frame, 16 bytes spill stores, 20 bytes spill loads
  ptxas info    : Used 128 registers
  ninja: build stopped: subcommand failed.
  Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
      subprocess.run(
    File "/usr/lib/python3.10/subprocess.py", line 524, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-install-jof7clp4/xformers_067ae14e74554708b683adbd6e14b909/setup.py", line 388, in <module>
      setuptools.setup(
    File "/usr/local/lib/python3.10/dist-packages/setuptools/__init__.py", line 87, in setup
      return distutils.core.setup(**attrs)
    File "/usr/lib/python3.10/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/usr/lib/python3.10/distutils/dist.py", line 966, in run_commands
      self.run_command(cmd)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 1217, in run_command
      super().run_command(command)
    File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python3.10/dist-packages/wheel/bdist_wheel.py", line 343, in run
      self.run_command("build")
    File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 1217, in run_command
      super().run_command(command)
    File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/lib/python3.10/distutils/command/build.py", line 135, in run
      self.run_command(cmd_name)
    File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 1217, in run_command
      super().run_command(command)
    File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 84, in run
      _build_ext.run(self)
    File "/usr/local/lib/python3.10/dist-packages/Cython/Distutils/old_build_ext.py", line 186, in run
      _build_ext.build_ext.run(self)
    File "/usr/lib/python3.10/distutils/command/build_ext.py", line 340, in run
      self.build_extensions()
    File "/tmp/pip-install-jof7clp4/xformers_067ae14e74554708b683adbd6e14b909/setup.py", line 333, in build_extensions
      super().build_extensions()
    File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
      build_ext.build_extensions(self)
    File "/usr/local/lib/python3.10/dist-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
      _build_ext.build_ext.build_extensions(self)
    File "/usr/lib/python3.10/distutils/command/build_ext.py", line 449, in build_extensions
      self._build_extensions_serial()
    File "/usr/lib/python3.10/distutils/command/build_ext.py", line 474, in _build_extensions_serial
      self.build_extension(ext)
    File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 246, in build_extension
      _build_ext.build_extension(self, ext)
    File "/usr/lib/python3.10/distutils/command/build_ext.py", line 529, in build_extension
      objects = self.compiler.compile(sources,
    File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
      _write_ninja_file_and_compile_objects(
    File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1574, in _write_ninja_file_and_compile_objects
      _run_ninja_build(
    File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  Building wheel for xformers (setup.py): finished with status 'error'
  ERROR: Failed building wheel for xformers
  Running command /usr/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-jof7clp4/xformers_067ae14e74554708b683adbd6e14b909/setup.py'"'"'; __file__='"'"'/tmp/pip-install-jof7clp4/xformers_067ae14e74554708b683adbd6e14b909/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' clean --all
  Running setup.py clean for xformers
  No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
  running clean
  'build/lib.linux-x86_64-3.10' does not exist -- can't clean it
  'build/bdist.linux-x86_64' does not exist -- can't clean it
  'build/scripts-3.10' does not exist -- can't clean it
Failed to build xformers

I'm able to build and use the container with xformers when I do this:

RUN export TORCH_CUDA_ARCH_LIST="8.0 8.6+PTX" && \
    pip install -v -U git+https://github.com/facebookresearch/xformers.git@6f0602f#egg=xformers

xformers seems to function well with the above workaround.

I attached the full build log.

sm90_build.txt

@avolkov1

avolkov1 commented Jul 10, 2023

@danthe3rd So I posted the error. Will you have a chance to look at it? Using 8.6+PTX will be suboptimal when running on Hopper.

Thanks

@danthe3rd
Contributor

Can you post the full log, on pastebin for instance? I believe the actual error happens before the excerpt you posted.

@avolkov1

@danthe3rd

I posted the full log. Refer to this link:
https://github.com/facebookresearch/xformers/files/11850330/sm90_build.txt

That's the full log; nothing is omitted. I run a docker build and that's what happens. If I were to build it interactively, exactly the same thing would happen. To replicate, build the following Dockerfile:

FROM nvcr.io/nvidia/pytorch:23.05-py3

RUN export TORCH_CUDA_ARCH_LIST="8.0 8.6 9.0+PTX" && \
    pip install -v -U git+https://github.com/facebookresearch/xformers.git@6f0602f#egg=xformers

I also tested today with the latest release:

FROM nvcr.io/nvidia/pytorch:23.06-py3

RUN export TORCH_CUDA_ARCH_LIST="8.0 8.6 9.0+PTX" && \
    pip install -v -U git+https://github.com/facebookresearch/[email protected]#egg=xformers

@bottler
Contributor

bottler commented Jul 10, 2023

I wonder if the build could be running out of memory? Could you try setting MAX_JOBS=1 in the environment when you build?
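
To illustrate that suggestion, a hedged sketch reusing the pip command from earlier in the thread; MAX_JOBS caps the number of parallel compile jobs launched by torch's extension builder:

# Single compile job to keep peak build memory low
export MAX_JOBS=1
export TORCH_CUDA_ARCH_LIST="8.0 8.6 9.0+PTX"
pip install -v -U git+https://github.com/facebookresearch/xformers.git@6f0602f#egg=xformers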

@avolkov1

Please try to replicate my build issue. If you can't replicate it and you believe it's something on my end, I will gladly try whatever you suggest.

I provided the full log and instructions detailed enough to replicate it.

@danthe3rd
Contributor

Please try to replicate my build issue

We don't officially support Docker - we're happy to help and try to diagnose your issue, but we don't have the bandwidth to run these things ourselves at this point ...

I wonder if the build could be running out of memory? Could you try setting MAX_JOBS=1 in the environment when you build?

This is a possible issue: adding more architectures uses more memory at build time. I would recommend trying with TORCH_CUDA_ARCH_LIST=9.0 only, and MAX_JOBS=1, to see if the issue reproduces (the build will be much slower, though). If that's the case, you should increase the memory limit.

@avolkov1

avolkov1 commented Jul 11, 2023

We don't officially support Docker ...

Ok then. I also had the same issue without Docker. The system I'm building on has 140 GB of RAM and 85 GB of free disk space, so I highly doubt it's a memory issue.

I'll try using MAX_JOBS=1 and report back if it works, but otherwise it seems like no one has the bandwidth to look at it. As long as you're all aware of the issue. Eventually you'll have to address this once CUDA 11 is outdated.

@avolkov1

Alright, this did build successfully. I built the following container:

ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:23.06-py3
FROM ${FROM_IMAGE_NAME}

RUN export TORCH_CUDA_ARCH_LIST="9.0+PTX" MAX_JOBS=1 && \
    pip install -v -U \
      git+https://github.com/facebookresearch/[email protected]#egg=xformers

build_xformers_sm90.txt

I'll play around with the MAX_JOBS setting to speed up the build. Thank you!
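
A follow-up note (not from the thread): MAX_JOBS is read by torch.utils.cpp_extension and passed to ninja as the job count, so it can be raised gradually to trade peak build memory for speed, for example:

# Allow a few parallel compile jobs; drop back to 1 if the build fails again
export MAX_JOBS=4
export TORCH_CUDA_ARCH_LIST="9.0+PTX"
pip install -v -U git+https://github.com/facebookresearch/xformers.git#egg=xformers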
