-
Notifications
You must be signed in to change notification settings - Fork 283
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix TensorFlow easyblock for new versions of Bazel & TensorFlow #2854
fix TensorFlow easyblock for new versions of Bazel & TensorFlow #2854
Conversation
0b5354c
to
3dddcaa
Compare
Test report by @jfgrimm Overview of tested easyconfigs (in order)
Build succeeded for 0 out of 1 (1 easyconfigs in total) |
Test report by @Flamefire Overview of tested easyconfigs (in order)
Build succeeded for 2 out of 2 (2 easyconfigs in total) |
Test report by @Flamefire Overview of tested easyconfigs (in order)
Build succeeded for 1 out of 1 (1 easyconfigs in total) |
@@ -446,7 +453,9 @@ def configure_step(self): | |||
# and will hang forever building the TensorFlow package. | |||
# So limit to something high but still reasonable while allowing ECs to overwrite it | |||
if self.cfg['maxparallel'] is None: | |||
self.cfg['parallel'] = min(self.cfg['parallel'], 64) | |||
# Seemingly Bazel around 3.x got better, so double the max there | |||
bazel_max = 64 if get_bazel_version() < '3.0.0' else 128 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer if we bump the Bazel version requirement a bit here, for example to 4.0
, unless you have a clear reference as to why this makes sense for Bazel 3.x too...
This is mainly to avoid introducing regressions, and having to re-test a wide range of TensorFlow easyconfigs with this updated PR (we use Bazel 3.x from TensorFlow-2.3.1-foss-2019b.eb
onwards - that's 18 easyconfigs to test - with Bazel 4.x it's only TensorFlow-2.8.4-foss-2021b.eb
currently)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was already cautious because this more or less arbitrary limitation was introduced when we used Bazel 0.x. Bazel 3.7 has quite a few changes that I'm using at other places so 3.x seemed like a good choice for a major version to make the switch.
I'm not even sure it really was Bazel causing the trouble. Furthermore it shouldn't affect too much and it has been a few years that passed.
I actually "needed" that for TF 2.7 when creating TF 2.8 to check for differences in behavior. TF 2.7 uses Bazel 3.7 hence the choice for 3 as the major version.
I'll kick off a few tests on one of the systems that were the cause for this limitation (192 core PPC), 96 core x86 seems to work fine. If those pass it should be ok, shouldn't it?
easybuild/easyblocks/t/tensorflow.py
Outdated
@@ -760,6 +771,14 @@ def build_step(self): | |||
self.bazel_opts = [ | |||
'--output_user_root=%s' % self.output_user_root_dir, | |||
] | |||
if bazel_version >= '4.0.0': | |||
self.bazel_opts.append('--local_startup_timeout_secs=600') # 5min for bazel to start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify why this is needed?
What's taking Bazel 4.x+ so long to "start"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The option was introduced in Bazel 4.0. I was observing failures on some of our nodes where the build failed (after installing all 20 or so extensions) due to a timeout of 2min "connecting to Bazel"
As it could be a filesystem issue being slow to read new stuff I wanted to use the new flag.
I enhanced the comment and made it actually 5min, not 10 ;-)
Test report by @Flamefire Overview of tested easyconfigs (in order)
Build succeeded for 3 out of 4 (4 easyconfigs in total) |
Test report by @Flamefire Overview of tested easyconfigs (in order)
Build succeeded for 1 out of 1 (1 easyconfigs in total) |
Test report by @boegel Overview of tested easyconfigs (in order)
Build succeeded for 18 out of 18 (18 easyconfigs in total) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Test report by @boegel Overview of tested easyconfigs (in order)
Build succeeded for 29 out of 29 (9 easyconfigs in total) |
This resolves a major blocker for installing newer versions of TensorFlow (2.8+), at least the CUDA versions.
The main issue is a change in TensorFlow which leads to Bazel not passing environment variables we set via
--action_env
to all actions because somewhere around Bazel 3.7 they changed the meaning of that option from "pass to all actions" to "pass to target actions" and we need to use--host_action_env
for "host actions" and "exec actions".Furthermore they degraded the impact of
--distinct-host-configuration=false
and officially announced it as a "no-op" in Bazel 6.0 which might explain failures reported earlier regarding "exec_tools" vs "tools"Solution is that for recent Bazel versions
--action_env
and--host_action_env
is used, see bazelbuild/bazel#17062As a drive-by fix I change the
LooseVersion
import and made the failing tests reported unique as during testing I was hit by"520 test failed: <500x the same test> <some others>"
and for further convenience sorted themTest report using 2.7.1: easybuilders/easybuild-easyconfigs#16795 (comment) and later for 2.8.4 in easybuilders/easybuild-easyconfigs#17058