fix TensorFlow easyblock for new versions of Bazel & TensorFlow #2854

Merged

Flamefire
Contributor

@Flamefire Flamefire commented Jan 6, 2023

This resolves a major blocker for installing newer versions of TensorFlow (2.8+), at least the CUDA versions.

The main issue is a change in TensorFlow that results in Bazel no longer passing the environment variables we set via --action_env to all actions. Somewhere around Bazel 3.7 the meaning of that option changed from "pass to all actions" to "pass to target actions", so we additionally need --host_action_env for "host actions" and "exec actions".

Furthermore, they reduced the effect of --distinct_host_configuration=false and officially declared it a "no-op" in Bazel 6.0, which might explain the failures reported earlier regarding "exec_tools" vs "tools".

The solution is to use both --action_env and --host_action_env for recent Bazel versions; see bazelbuild/bazel#17062.
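
A minimal sketch of how such flags could be assembled per Bazel version. The helper names (`parse_version`, `bazel_env_opts`) and the 3.7.0 cutoff are assumptions for illustration only (the description just says "somewhere around Bazel 3.7"); the easyblock itself may structure this differently.

```python
def parse_version(version):
    """Turn a purely numeric dotted version such as '3.7.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split('.'))


def bazel_env_opts(env_vars, bazel_version, host_env_since='3.7.0'):
    """Build Bazel flags that forward the given environment variables to actions.

    host_env_since is an assumed cutoff for illustration, not a documented value.
    """
    opts = []
    use_host_env = parse_version(bazel_version) >= parse_version(host_env_since)
    for name, value in sorted(env_vars.items()):
        # Per the description above, this only reaches target actions in recent Bazel
        opts.append('--action_env=%s=%s' % (name, value))
        if use_host_env:
            # Host/exec actions (build tools) get the variable via a separate flag
            opts.append('--host_action_env=%s=%s' % (name, value))
    return opts


# Hypothetical variable and path, for demonstration only
print(bazel_env_opts({'CUDA_HOME': '/opt/cuda'}, '5.1.1'))
```

Emitting both flags for newer versions keeps older Bazel releases, which may not know --host_action_env, on the previous behaviour.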

As a drive-by fix I changed the LooseVersion import and made the list of reported failing tests unique, because during testing I was hit by "520 test failed: <500x the same test> <some others>". For further convenience the list is now also sorted.
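
The deduplication and sorting mentioned here can be sketched as follows. The function name and the test labels are hypothetical; this is not the code from the easyblock.

```python
def unique_sorted(failed_tests):
    """Return each failing test name once, in alphabetical order."""
    return sorted(set(failed_tests))


# Hypothetical test labels, with one test reported many times
reported = ['//tf:test_b'] * 3 + ['//tf:test_a', '//tf:test_c']
failed = unique_sorted(reported)
print('%d test(s) failed: %s' % (len(failed), ', '.join(failed)))
```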

Test report using 2.7.1: easybuilders/easybuild-easyconfigs#16795 (comment) and later for 2.8.4 in easybuilders/easybuild-easyconfigs#17058

@jfgrimm
Member

jfgrimm commented Jan 13, 2023

Test report by @jfgrimm

Overview of tested easyconfigs (in order)

Build succeeded for 0 out of 1 (1 easyconfigs in total)
gpu01.pri.viking.alces.network - Linux CentOS Linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz (skylake_avx512), 2 x NVIDIA Tesla V100-SXM2-32GB, 510.47.03, Python 3.6.8
See https://gist.github.com/490cfcc806d26fbc7ba30443ae22e81a for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.7.1-foss-2021b.eb
  • SUCCESS TensorFlow-2.7.1-foss-2021b-CUDA-11.4.1.eb

Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusi8016 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/ef8622b78f1534c4a9bff24464ddb35d for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.7.1-foss-2021b.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusml2 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/b60b7d22ae5c88b8f1082198c58070d6 for a full test report.

@boegel boegel added the update label Jan 18, 2023
@boegel boegel modified the milestones: 4.x, next release (4.7.1?) Jan 18, 2023
easybuild/easyblocks/t/tensorflow.py
@@ -446,7 +453,9 @@ def configure_step(self):
         # and will hang forever building the TensorFlow package.
         # So limit to something high but still reasonable while allowing ECs to overwrite it
         if self.cfg['maxparallel'] is None:
-            self.cfg['parallel'] = min(self.cfg['parallel'], 64)
+            # Seemingly Bazel around 3.x got better, so double the max there
+            bazel_max = 64 if get_bazel_version() < '3.0.0' else 128
Member

I would prefer if we bump the Bazel version requirement a bit here, for example to 4.0, unless you have a clear reference as to why this makes sense for Bazel 3.x too...
This is mainly to avoid introducing regressions, and having to re-test a wide range of TensorFlow easyconfigs with this updated PR (we use Bazel 3.x from TensorFlow-2.3.1-foss-2019b.eb onwards - that's 18 easyconfigs to test - with Bazel 4.x it's only TensorFlow-2.8.4-foss-2021b.eb currently)

Contributor Author

I was already cautious because this more or less arbitrary limitation was introduced when we used Bazel 0.x. Bazel 3.7 has quite a few changes that I'm using at other places so 3.x seemed like a good choice for a major version to make the switch.

I'm not even sure it really was Bazel causing the trouble. Furthermore, it shouldn't affect too much, and a few years have passed since then.
I actually "needed" that for TF 2.7 when creating TF 2.8 to check for differences in behavior. TF 2.7 uses Bazel 3.7 hence the choice for 3 as the major version.

I'll kick off a few tests on one of the systems that were the cause for this limitation (192 core PPC), 96 core x86 seems to work fine. If those pass it should be ok, shouldn't it?
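
For reference, the version-dependent cap under discussion can be sketched in isolation like this. `parse_version` and `capped_parallel` are hypothetical helpers, and the 3.0.0 cutoff is the one from the hunk above, not the 4.0 alternative suggested in review.

```python
def parse_version(version):
    """Turn a purely numeric dotted version such as '3.7.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split('.'))


def capped_parallel(parallel, bazel_version, maxparallel=None):
    """Apply a default cap on parallel jobs unless the easyconfig sets maxparallel."""
    if maxparallel is not None:
        # The easyconfig chose its own limit, so no default cap is applied here
        return parallel
    # Older Bazel keeps the historical limit of 64; newer versions get twice that
    bazel_max = 64 if parse_version(bazel_version) < (3, 0, 0) else 128
    return min(parallel, bazel_max)


# A 192-core node, as in the PPC system mentioned above
print(capped_parallel(192, '0.29.1'), capped_parallel(192, '3.7.2'))
```

Switching the cutoff to 4.0, as proposed in the review, would only require changing the tuple in the comparison.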

@@ -760,6 +771,14 @@ def build_step(self):
         self.bazel_opts = [
             '--output_user_root=%s' % self.output_user_root_dir,
         ]
+        if bazel_version >= '4.0.0':
+            self.bazel_opts.append('--local_startup_timeout_secs=600')  # 5min for bazel to start
Member

Can you clarify why this is needed?
What's taking Bazel 4.x+ so long to "start"?

Contributor Author

The option was introduced in Bazel 4.0. I observed failures on some of our nodes where the build failed (after installing all 20 or so extensions) due to a 2 min timeout while "connecting to Bazel".
As this could be caused by a slow filesystem taking long to read new files, I wanted to use the new flag.

I enhanced the comment and made it actually 5min, not 10 ;-)
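
A rough sketch of the resulting option handling, with the timeout expressed as 5 minutes (300 seconds). The function name, its parameters and the example path are assumptions for illustration; only the two Bazel flags come from the hunk above.

```python
def parse_version(version):
    """Turn a purely numeric dotted version such as '4.2.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split('.'))


def bazel_startup_opts(output_user_root, bazel_version, startup_timeout_secs=300):
    """Assemble Bazel startup options, raising the startup timeout where supported."""
    opts = ['--output_user_root=%s' % output_user_root]
    if parse_version(bazel_version) >= (4, 0, 0):
        # Flag exists since Bazel 4.0 (per the comment above); 300 s = 5 min
        opts.append('--local_startup_timeout_secs=%d' % startup_timeout_secs)
    return opts


# Hypothetical output directory
print(bazel_startup_opts('/tmp/bazel-root', '4.2.2'))
```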

@Flamefire
Contributor Author

Flamefire commented Jan 20, 2023

Test report by @Flamefire

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.3.1-foss-2019b-Python-3.7.4.eb
  • SUCCESS TensorFlow-2.3.1-fosscuda-2019b-Python-3.7.4.eb
  • SUCCESS TensorFlow-2.4.1-fosscuda-2019b-Python-3.7.4.eb
  • FAIL (build issue) TensorFlow-2.5.0-fosscuda-2019b-Python-3.7.4.eb (partial log available at https://gist.github.com/cb2daa676f957123b10489bcede356ba) - Flaky. Report for this follows

Build succeeded for 3 out of 4 (4 easyconfigs in total)
taurusml4 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/a9e7cafd552d7d30bd61037b56f4fdd7 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.5.0-fosscuda-2019b-Python-3.7.4.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusml4 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/3806bfd4262fb7ee7a57e113a3977ff9 for a full test report.

@boegel
Member

boegel commented Mar 17, 2023

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.3.1-foss-2019b-Python-3.7.4.eb
  • SUCCESS TensorFlow-2.3.1-foss-2020a-Python-3.8.2.eb
  • SUCCESS TensorFlow-2.3.1-fosscuda-2019b-Python-3.7.4.eb
  • SUCCESS TensorFlow-2.3.1-fosscuda-2020a-Python-3.8.2.eb
  • SUCCESS TensorFlow-2.4.1-foss-2020b.eb
  • SUCCESS TensorFlow-2.4.1-fosscuda-2019b-Python-3.7.4.eb
  • SUCCESS TensorFlow-2.4.1-fosscuda-2020b.eb
  • SUCCESS TensorFlow-2.4.4-foss-2021a.eb
  • SUCCESS TensorFlow-2.5.0-foss-2020b.eb
  • SUCCESS TensorFlow-2.5.0-fosscuda-2019b-Python-3.7.4.eb
  • SUCCESS TensorFlow-2.5.0-fosscuda-2020b.eb
  • SUCCESS TensorFlow-2.5.3-foss-2021a-CUDA-11.3.1.eb
  • SUCCESS TensorFlow-2.5.3-foss-2021a.eb
  • SUCCESS TensorFlow-2.6.0-foss-2021a-CUDA-11.3.1.eb
  • SUCCESS TensorFlow-2.6.0-foss-2021a.eb
  • SUCCESS TensorFlow-2.7.1-foss-2021b-CUDA-11.4.1.eb
  • SUCCESS TensorFlow-2.7.1-foss-2021b.eb
  • SUCCESS TensorFlow-2.8.4-foss-2021b.eb

Build succeeded for 18 out of 18 (18 easyconfigs in total)
node3139.skitty.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/d4931d9ca18511f1620b99d98542bfa6 for a full test report.

Member

@boegel boegel left a comment

lgtm

@boegel boegel merged commit 6c44a8e into easybuilders:develop Mar 17, 2023
@boegel
Member

boegel commented Mar 17, 2023

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.4.1-fosscuda-2019b-Python-3.7.4.eb
  • SUCCESS TensorFlow-2.4.1-fosscuda-2020b.eb
  • SUCCESS TensorFlow-2.5.0-fosscuda-2019b-Python-3.7.4.eb
  • SUCCESS TensorFlow-2.5.0-fosscuda-2020b.eb
  • SUCCESS TensorFlow-2.5.3-foss-2021a-CUDA-11.3.1.eb
  • SUCCESS TensorFlow-2.6.0-foss-2021a-CUDA-11.3.1.eb
  • SUCCESS TensorFlow-2.7.1-foss-2021b-CUDA-11.4.1.eb
  • SUCCESS Bazel-3.4.1-GCCcore-8.3.0.eb
  • SUCCESS SWIG-4.0.1-GCCcore-8.3.0.eb
  • SUCCESS protobuf-3.13.0-GCCcore-9.3.0.eb
  • SUCCESS Zip-3.0-GCCcore-9.3.0.eb
  • SUCCESS TensorFlow-2.3.1-fosscuda-2019b-Python-3.7.4.eb
  • SUCCESS Bazel-3.6.0-GCCcore-9.3.0.eb
  • SUCCESS cuDNN-8.0.4.30-CUDA-11.0.2.eb
  • SUCCESS NCCL-2.8.3-GCCcore-9.3.0-CUDA-11.0.2.eb
  • SUCCESS double-conversion-3.1.5-GCCcore-9.3.0.eb
  • SUCCESS flatbuffers-1.12.0-GCCcore-9.3.0.eb
  • SUCCESS HDF5-1.10.6-gompic-2020a.eb
  • SUCCESS poetry-1.0.9-GCCcore-9.3.0-Python-3.8.2.eb
  • SUCCESS pkgconfig-1.5.1-GCCcore-9.3.0-Python-3.8.2.eb
  • SUCCESS h5py-2.10.0-fosscuda-2020a-Python-3.8.2.eb
  • SUCCESS giflib-5.2.1-GCCcore-9.3.0.eb
  • SUCCESS JsonCpp-1.9.4-GCCcore-9.3.0.eb
  • SUCCESS LMDB-0.9.24-GCCcore-9.3.0.eb
  • SUCCESS nsync-1.24.0-GCCcore-9.3.0.eb
  • SUCCESS protobuf-python-3.13.0-fosscuda-2020a-Python-3.8.2.eb
  • SUCCESS snappy-1.1.8-GCCcore-9.3.0.eb
  • SUCCESS SWIG-4.0.1-GCCcore-9.3.0.eb
  • SUCCESS TensorFlow-2.3.1-fosscuda-2020a-Python-3.8.2.eb

Build succeeded for 29 out of 29 (9 easyconfigs in total)
node3309.joltik.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 525.85.12, Python 3.6.8
See https://gist.github.com/560e52743f5d5f059e62d4a55c8dc9cb for a full test report.
