Failed to build TensorFlow-2.11.0-foss-2022a-CUDA-11.7.0.eb #17892

Closed
xuagu37 opened this issue May 6, 2023 · 10 comments · Fixed by #18235

@xuagu37

xuagu37 commented May 6, 2023

Dear easybuild community,

I tried to build TensorFlow with:

eb TensorFlow-2.11.0-foss-2022a-CUDA-11.7.0.eb -r

It failed with the following error message:

# Execution platform: @local_execution_config_platform//:platform
ERROR: /tmp/xuan/TensorFlow/2.11.0/foss-2022a-CUDA-11.7.0/TensorFlow/bazel-root/9ed3032cf4142236baf321f6dff02d38/external/com_github_grpc_grpc/BUILD:2585:16: Compiling src/core/ext/upb-generated/src/proto/grpc/lb/v1/load_balancer.upb.c failed: undeclared inclusion(s) in rule '@com_github_grpc_grpc//:grpc_lb_upb':
this rule is missing dependency declarations for the following files included by 'src/core/ext/upb-generated/src/proto/grpc/lb/v1/load_balancer.upb.c':
  '/home/xuan/EasyBuild/software/GCCcore/11.3.0/lib/gcc/x86_64-pc-linux-gnu/11.3.0/include/stddef.h'
  '/home/xuan/EasyBuild/software/GCCcore/11.3.0/lib/gcc/x86_64-pc-linux-gnu/11.3.0/include/stdint.h'
  '/home/xuan/EasyBuild/software/GCCcore/11.3.0/lib/gcc/x86_64-pc-linux-gnu/11.3.0/include/stdarg.h'
  '/home/xuan/EasyBuild/software/GCCcore/11.3.0/lib/gcc/x86_64-pc-linux-gnu/11.3.0/include/stdbool.h'
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 36.307s, Critical Path: 2.73s
INFO: 5604 processes: 4969 internal, 635 local.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully

Any help will be really appreciated.

Best Regards

@boegel boegel transferred this issue from easybuilders/easybuild May 10, 2023
@boegel boegel added this to the 4.x milestone May 10, 2023
@boegel
Member

boegel commented May 10, 2023

@xuagu37 Which EasyBuild version are you using here?
If it's not EasyBuild v4.7.1 (or newer), try updating your EasyBuild installation first; you're probably missing some required changes in the TensorFlow easyblock (see easybuilders/easybuild-easyblocks#2854).

@Flamefire
Contributor

Flamefire commented May 12, 2023

IIRC this specific issue should be fixed by TensorFlow-2.8.4_resolve-gcc-symlinks.patch, assuming that loading your GCCcore/11.3.0 module goes through a symlink at some point on the way to /home/xuan/EasyBuild/software/GCCcore/11.3.0.

Check that this patch is in the EC you are using and that you are using the latest easyblock, as mentioned by @boegel above.

@xuagu37
Author

xuagu37 commented May 16, 2023

Thanks for the replies!

  1. I am using EB v4.7.1
  2. The tensorflow easyblock file has already been updated.
  3. I have TensorFlow-2.8.4_resolve-gcc-symlinks.patch in the EC.

Any other potential issues you can think of?
Thanks for your help!

@Flamefire
Contributor

First, I'd like to verify that this is the symlink issue. To do that, go to the machine where you tried to build TF and load the GCC/11.3.0 module. Then run gcc -E -xc -v /dev/null and look at the lines after "#include <...> search starts here:" for a path ending in lib/gcc/x86_64-pc-linux-gnu/11.3.0/include

  1. Show that path
  2. Show output of readlink -f <that-path>
  3. Show output of readlink -f /home/xuan/EasyBuild/software/GCCcore/11.3.0/lib/gcc/x86_64-pc-linux-gnu/11.3.0/include/stddef.h
  4. Anything else noteworthy related to those paths, e.g. hardlinks used?

@xuagu37
Author

xuagu37 commented May 16, 2023

  1. The output of "gcc -E -xc -v /dev/null":
Using built-in specs.
COLLECT_GCC=/home/xuan/EasyBuild/software/GCCcore/11.3.0/bin/gcc
OFFLOAD_TARGET_NAMES=nvptx-none
Target: x86_64-pc-linux-gnu
Configured with: ../configure --enable-languages=c,c++,fortran --without-cuda-driver --enable-offload-targets=nvptx-none --enable-lto --enable-checking=release --disable-multilib --enable-shared=yes --enable-static=yes --enable-threads=posix --enable-plugins --enable-gold --enable-ld=default --prefix=/home/xuan/EasyBuild/software/GCCcore/11.3.0 --with-local-prefix=/home/xuan/EasyBuild/software/GCCcore/11.3.0 --enable-bootstrap --with-isl=/tmp/xuan/GCCcore/11.3.0/system-system/gcc-11.3.0/stage2_stuff --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.3.0 (GCC) 
COLLECT_GCC_OPTIONS='-E' '-v' '-mtune=generic' '-march=x86-64'
 /proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../libexec/gcc/x86_64-pc-linux-gnu/11.3.0/cc1 -E -quiet -v -iprefix /proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../lib/gcc/x86_64-pc-linux-gnu/11.3.0/ /dev/null -mtune=generic -march=x86-64 -dumpbase null
ignoring nonexistent directory "/proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../lib/gcc/x86_64-pc-linux-gnu/11.3.0/include-fixed"
ignoring nonexistent directory "/proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../lib/gcc/x86_64-pc-linux-gnu/11.3.0/../../../../x86_64-pc-linux-gnu/include"
ignoring duplicate directory "/proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../lib/gcc/../../lib/gcc/x86_64-pc-linux-gnu/11.3.0/include"
ignoring nonexistent directory "/proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../lib/gcc/../../lib/gcc/x86_64-pc-linux-gnu/11.3.0/include-fixed"
ignoring nonexistent directory "/proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../lib/gcc/../../lib/gcc/x86_64-pc-linux-gnu/11.3.0/../../../../x86_64-pc-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /home/xuan/EasyBuild/software/binutils/2.38-GCCcore-11.3.0/include
 /home/xuan/EasyBuild/software/zlib/1.2.12-GCCcore-11.3.0/include
 /cm/shared/apps/slurm/current/include
 /proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../lib/gcc/x86_64-pc-linux-gnu/11.3.0/include
 /home/xuan/EasyBuild/software/GCCcore/11.3.0/include
 /usr/include
End of search list.
# 0 "/dev/null"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "/dev/null"
COMPILER_PATH=/proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../libexec/gcc/x86_64-pc-linux-gnu/11.3.0/:/proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../libexec/gcc/
LIBRARY_PATH=/proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../lib/gcc/x86_64-pc-linux-gnu/11.3.0/:/proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../lib/gcc/:/home/xuan/EasyBuild/software/binutils/2.38-GCCcore-11.3.0/lib/../lib64/:/home/xuan/EasyBuild/software/zlib/1.2.12-GCCcore-11.3.0/lib/../lib64/:/cm/shared/apps/slurm/current/lib64/../lib64/:/proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../lib/gcc/x86_64-pc-linux-gnu/11.3.0/../../../../lib64/:/lib/../lib64/:/usr/lib/../lib64/:/home/xuan/EasyBuild/software/binutils/2.38-GCCcore-11.3.0/lib/:/home/xuan/EasyBuild/software/zlib/1.2.12-GCCcore-11.3.0/lib/:/cm/shared/apps/slurm/current/lib64/slurm/:/cm/shared/apps/slurm/current/lib64/:/proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../lib/gcc/x86_64-pc-linux-gnu/11.3.0/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-E' '-v' '-mtune=generic' '-march=x86-64'

The line we are interested in is:

/proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../lib/gcc/x86_64-pc-linux-gnu/11.3.0/include
  2. The output of realpath /proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/bin/../lib/gcc/x86_64-pc-linux-gnu/11.3.0/include is:
/proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/lib/gcc/x86_64-pc-linux-gnu/11.3.0/include
  3. The output of realpath /home/xuan/EasyBuild/software/GCCcore/11.3.0/lib/gcc/x86_64-pc-linux-gnu/11.3.0/include/stddef.h is:
/proj/nsc_testing/xuan/EasyBuild/software/GCCcore/11.3.0/lib/gcc/x86_64-pc-linux-gnu/11.3.0/include/stddef.h
  4. I put my EasyBuild directory at /proj/nsc_testing/xuan/EasyBuild and then created a symlink at /home/xuan/EasyBuild pointing to it. I'm not sure if that matters?

I don't have "readlink", but I used "realpath" to generate the outputs for 2 and 3.
Thanks for your time invested!
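
The outputs above show the same header reachable under two different path strings. Here is a minimal sketch of why that can matter, assuming the "undeclared inclusion" check compares paths as text (an assumption for illustration, not taken from Bazel's source). The directory layout and the helper `is_under` are hypothetical.

```python
import os
import tempfile


def is_under(path, directory):
    """Textual containment check: no symlink resolution is performed."""
    return os.path.commonpath([path, directory]) == directory


def demo():
    """Return (check on the symlinked path, check on the resolved path)."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = os.path.realpath(tmp)  # the temp dir itself may be a symlink
        real = os.path.join(tmp, 'proj', 'EasyBuild')   # real location
        link = os.path.join(tmp, 'home', 'EasyBuild')   # symlink to it
        os.makedirs(os.path.join(real, 'include'))
        os.makedirs(os.path.dirname(link))
        os.symlink(real, link)
        open(os.path.join(real, 'include', 'stddef.h'), 'w').close()

        declared_dir = os.path.join(real, 'include')
        header_via_link = os.path.join(link, 'include', 'stddef.h')

        before = is_under(header_via_link, declared_dir)
        after = is_under(os.path.realpath(header_via_link), declared_dir)
        return before, after
```

Here `demo()` returns `(False, True)`: the header only counts as belonging to the declared directory once its path has been resolved, even though both strings name the same file.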

@Flamefire
Contributor

4. I put my EasyBuild directory at /proj/nsc_testing/xuan/EasyBuild and then created a symlink at /home/xuan/EasyBuild pointing to it. I'm not sure if that matters?

That is exactly why I asked: This configuration is what causes the issue (a bug in Bazel/TensorFlow) and the patch should fix that. I don't know why it doesn't for you. So I need a bit more information:

  • Can you attach the log from the failed build, please?
  • Can you also attach the .tf_configure.bazelrc created during the build in the build dir? (You might need to run with eb --disable-cleanup-builddir ...)
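
Once the .tf_configure.bazelrc is available, one setting worth inspecting is the compiler path it records. This is a minimal sketch, assuming the file stores environment settings as lines of the form `build --action_env NAME="value"`; the helper name `action_env_vars` is made up for illustration.

```python
import re

# Matches lines such as: build --action_env GCC_HOST_COMPILER_PATH="/path/to/gcc"
_ACTION_ENV = re.compile(r'^build\s+--action_env[ =](\w+)="?([^"\n]*)"?\s*$')


def action_env_vars(bazelrc_text):
    """Collect NAME -> value pairs from 'build --action_env' lines."""
    env = {}
    for line in bazelrc_text.splitlines():
        match = _ACTION_ENV.match(line.strip())
        if match:
            env[match.group(1)] = match.group(2)
    return env
```

For example, `action_env_vars(open('.tf_configure.bazelrc').read()).get('GCC_HOST_COMPILER_PATH')` gives the compiler path the build was configured with, which can then be compared against `os.path.realpath` of the gcc binary from the loaded module.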

@xuagu37
Author

xuagu37 commented May 16, 2023

Please see the attached.
tensorflow_build.log
tf_configure.bazelrc.txt

@Flamefire
Contributor

Hm, what I see is GCC_HOST_COMPILER_PATH=/proj/nsc_testing/xuan/.tmp/eb-5qeoh7qu/tmpfpm1vljs/rpath_wrappers/gcc_wrapper/gcc, i.e. the rpath wrappers are in use, which could be the issue. I guess the problem is that TF resolves the symlink, but only as far as the rpath wrapper and not the actual GCC, which then fails.

I have an idea for a solution: please modify the file easybuild/tools/toolchain/toolchain.py of your installed easybuild-framework package at https://github.com/easybuilders/easybuild-framework/blob/b1a528735d8e30769b8e512042d7d5cc10406574/easybuild/tools/toolchain/toolchain.py#L1026 and replace that line with 'orig_cmd': os.path.realpath(orig_cmd),
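
This is a minimal sketch of what the suggested change amounts to. Only the 'orig_cmd' key and the call to `os.path.realpath` come from the suggestion above; the function `wrapper_template_values` and the directory layout are hypothetical stand-ins, not the framework's actual code.

```python
import os
import tempfile


def wrapper_template_values(orig_cmd, resolve=True):
    """Hypothetical stand-in for the values substituted into a wrapper script."""
    return {
        # With resolve=True the wrapper calls the compiler by its real path.
        'orig_cmd': os.path.realpath(orig_cmd) if resolve else orig_cmd,
    }


def demo():
    """Return whether orig_cmd equals the real gcc path, without and with realpath."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = os.path.realpath(tmp)  # the temp dir itself may be a symlink
        real_bin = os.path.join(tmp, 'proj', 'bin')
        os.makedirs(real_bin)
        real_gcc = os.path.join(real_bin, 'gcc')
        open(real_gcc, 'w').close()  # placeholder file standing in for gcc

        link_root = os.path.join(tmp, 'home')
        os.symlink(os.path.join(tmp, 'proj'), link_root)
        gcc_via_link = os.path.join(link_root, 'bin', 'gcc')

        unresolved = wrapper_template_values(gcc_via_link, resolve=False)
        resolved = wrapper_template_values(gcc_via_link)
        return (unresolved['orig_cmd'] == real_gcc,
                resolved['orig_cmd'] == real_gcc)
```

`demo()` returns `(False, True)`. One design point to keep in mind: `os.path.realpath` also follows a symlink in the final path component, so a compiler command that is itself a symlink to another binary would be invoked under that binary's name.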

@boegel It might make sense to add that to the framework: I see no downsides, and it could even be slightly faster due to fewer metadata operations.

As an alternative, the patch TensorFlow-2.1.0_fix-cuda-build.patch also seems to work.

@xuagu37
Author

xuagu37 commented May 31, 2023

I decided to pause my EasyBuild project for now. Thanks for all the help! Feel free to close the issue.

@Flamefire
Contributor

@boegel However, the issue is valid and I have verified it: it happens with compiler wrappers such as ccache and EasyBuild's rpath compiler wrapper when the compiler is on a symlinked path.

Both presented approaches (modifying EB or adding the patch) solve the rpath-wrapper issue.

It might be better to re-add the patch to newer TF ECs, though, because that would also fix the ccache use case. The upstream patch that fixed the compiler-on-symlink issue doesn't fix the compiler-symlink-with-compiler-wrapper issue. I submitted our patch as a TF PR (again): tensorflow/tensorflow#60668

It should be enough to simply add that patch to all TF ECs >= 2.1 and do a test-build only up until the patch step.

branfosj pushed a commit that referenced this issue Jul 1, 2023
Add the TensorFlow-2.1.0_fix-cuda-build.patch to the TensorFlow-CUDA ECs
to fix failure when compilers are on symlinked paths and e.g. ccache or
rpath wrappers are used.

Fixes #17892
@boegel boegel modified the milestones: 4.x, 4.7.3 Jul 5, 2023