builds with remote caching sometimes fail on incomplete downloads #5047
Comments
Thanks for reporting. I'll take a look!
Still happening occasionally in Bazel 0.13.0. Failure from this morning: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/54372/pull-kubernetes-e2e-gce-device-plugin-gpu/31984/
Thanks @buchgr!
We're seeing this again with 0.15.0 :( Test runs a and b both hit an error like:
Per the logs these tests ran in the image
I've personally hit this 3x today; I think we will downgrade to 0.14.0 in the short term.
@BenTheElder sorry about that :(. I don't understand what's happening there yet. Can you try running with …? I will try to reproduce the error.
cc: @laurentlb. We might need a patch release once we understand what's going on there :(.
Hey thanks for the response!
Sure, I'll set up a test job with 0.15.0 and this flag and see if we can still repro. Am I correct that this flag is not available before 0.15? Also, FWIW, I think we're not in a huge rush to upgrade to any 0.15 features for Kubernetes, and we staged the rollout to just some smaller repo(s), not the main one, so we're fine for the moment after the rollback. We'll of course want to help hunt this down and fix it so we can continue to keep up with stable.
Hey Ben, so in 0.15.0 we made a change for all action outputs to be downloaded in parallel. There is no concurrency limit applied by default, but I figured the number of action outputs would probably be somewhat reasonable and thus there would be a natural limit. I found after the 0.15.0 release (after my vacation) that rules_go has actions with a very high number of outputs and that we would then run into open file descriptor limits and those kinds of things. I plan to add a ~200 connection limit in the 0.16.0 release, or 0.15.1 if there is one. I think with the … Correct, the flag was introduced in 0.15.0.
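For illustration, here is a minimal sketch of the kind of cap described above, assuming nothing about Bazel's actual implementation (the class and method names are hypothetical): a semaphore bounds how many downloads run at once, so actions with a very large number of outputs can no longer exhaust sockets or file descriptors.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: bound the number of downloads running at the same
// time, roughly what a ~200-connection limit would achieve.
class BoundedDownloader {
  private final Semaphore permits;
  private final ExecutorService executor = Executors.newCachedThreadPool();

  BoundedDownloader(int maxConcurrentDownloads) {
    this.permits = new Semaphore(maxConcurrentDownloads);
  }

  Future<Void> downloadAsync(Runnable download) {
    return executor.submit(() -> {
      permits.acquire();   // blocks once the limit is reached
      try {
        download.run();    // stand-in for the real HTTP fetch
      } finally {
        permits.release();
      }
      return null;
    });
  }
}
```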
Got it, thanks! I'll start testing this after our team standup (~nowish).
I've not been able to reproduce with
This sounds somewhat familiar with |
Hmmm perhaps not... spotted one failure now.
As far as I can tell this run should have had Bazel 0.15 with
@BenTheElder We have a rule that builds the Go standard library for the target platform and mode. It's used as an input for most other actions. It's a directory output, but I guess that's expressed as individual files over the wire. I'm planning to change the implementation of that rule to split the standard library into pieces that can be used as separate inputs (tools, archives, sources, headers). We'd also stop including packages that can't be imported from other Go code (e.g., vendored and internal packages in the standard library). That should improve performance quite a bit, both locally and with remote execution.
This change limits the number of open TCP connections by default to 100 for remote caching. We have had error reports where in some use cases Bazel would open so many TCP connections that it crashed / ran out of sockets. The max number of TCP connections can still be adjusted by specifying --remote_max_connections. See also #5047.
RELNOTES: In remote caching we limit the number of open TCP connections to 100 by default. The number can be adjusted by specifying the --remote_max_connections flag.
PiperOrigin-RevId: 202958838
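As a usage note, the cap can be overridden per workspace; the value below is only an example, and the flag name is the one from the commit message above.

```
# Example .bazelrc entry: raise the remote-cache connection cap from the
# default of 100 to 200.
build --remote_max_connections=200
```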
Ok, that took a while, but I believe I understand the error now. Effectively we asynchronously trigger concurrent downloads for all output files, which gives us a list of futures, and at the end we wait for all downloads to finish. If one download fails, we immediately trigger a routine to delete all output files (which fails), so that we can continue with local execution instead. However, when triggering this routine to delete all output files we don't wait for all downloads to have finished; they continue downloading in the background. That routine recursively deletes all files in a directory and then tries to delete the directory itself. Now that's racing with the async downloads that are still running in the background T_T. Async programming is hard. I'll send out a fix and make sure it gets cherry-picked into 0.16.0, as this is a regression.
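To make the race concrete, here is a hypothetical sketch of the buggy pattern just described (illustrative names only, not Bazel's code): the first failed future triggers a recursive delete while sibling downloads are still writing into the output tree.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch of the buggy pattern: on the first failure we start
// deleting outputs without waiting for the remaining downloads, so the
// recursive delete races with files that are still being written.
class RacyDownloadAndCleanup {
  static void downloadAll(List<Callable<Void>> downloads, Path outputDir,
                          ExecutorService executor) throws Exception {
    List<Future<Void>> futures = new ArrayList<>();
    for (Callable<Void> download : downloads) {
      futures.add(executor.submit(download));
    }
    try {
      for (Future<Void> future : futures) {
        future.get(); // the first failed download throws here...
      }
    } catch (ExecutionException e) {
      // ...and cleanup starts while other downloads keep writing outputs.
      deleteRecursively(outputDir); // may fail with "Directory not empty"
      throw e;
    }
  }

  static void deleteRecursively(Path dir) throws IOException {
    try (var paths = Files.walk(dir)) {
      // Delete children before parents; a racing writer can re-populate a
      // directory between the walk and the delete.
      paths.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
  }
}
```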
Thanks @buchgr!
For downloading output files / directories we trigger all downloads concurrently and asynchronously in the background and then wait for all downloads to finish. However, if a download failed we did not wait for the remaining downloads to finish but immediately started deleting partial downloads and continued with local execution of the action. That leads to two interesting bugs:
1) The cleanup procedure races with the downloads that are still in progress. As it tries to delete files and directories, new files and directories are created, which will often lead to "Directory not empty" errors as seen in bazelbuild#5047.
2) The cleanup procedure does not detect the race, succeeds, and subsequent local execution fails because not all files have been deleted.
This issue has been fixed in master (I hope :D). It will be cherry-picked into the next release candidate for 0.16.0. Would be great if you could give it a spin!
Will do! I'll monitor the RCs.
It will be in release candidate 3, which should be out later today.
For downloading output files / directories we trigger all downloads concurrently and asynchronously in the background and then wait for all downloads to finish. However, if a download failed we did not wait for the remaining downloads to finish but immediately started deleting partial downloads and continued with local execution of the action. That leads to two interesting bugs:
* The cleanup procedure races with the downloads that are still in progress. As it tries to delete files and directories, new files and directories are created, which will often lead to "Directory not empty" errors as seen in #5047.
* The cleanup procedure does not detect the race, succeeds, and subsequent local execution fails because not all files have been deleted.
The solution is to always wait for all downloads to complete before entering the cleanup routine. Ideally we would also cancel all outstanding downloads; however, that's not as straightforward as it seems: the j.u.c.Future API does not provide a way to cancel a computation and also wait for that computation to actually have terminated, so we'd need to introduce a separate mechanism to cancel downloads.
RELNOTES: None
PiperOrigin-RevId: 205980446
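A minimal sketch of the fix described above, with illustrative names rather than Bazel's actual classes (interrupt handling is elided for brevity): every download future is drained, success or failure, before the cleanup routine runs, and only the first error is rethrown.

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Illustrative fix: wait for every download to settle before cleanup, so no
// background download can still be writing while outputs are deleted.
class WaitThenCleanup {
  static void awaitAllThenMaybeCleanup(List<Future<Void>> downloads, Runnable cleanup)
      throws Exception {
    Exception firstError = null;
    for (Future<Void> download : downloads) {
      try {
        download.get(); // blocks until this download has finished or failed
      } catch (ExecutionException | InterruptedException e) {
        if (firstError == null) {
          firstError = e; // remember the first failure, keep draining the rest
        }
      }
    }
    if (firstError != null) {
      cleanup.run(); // safe now: no download is still writing output files
      throw firstError;
    }
  }
}
```

As the commit message notes, cancelling the outstanding futures would not be enough on its own, since Future.cancel() returns without waiting for the cancelled computation to actually stop running.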
Thank you! We'll start testing once the RC is out 😄
Baseline: 4f64b77

Cherry picks:
+ 4c9a0c8: reduce the size of bazel's embedded jdk
+ d3228b6: remote: limit number of open tcp connections by default. Fixes #5491
+ 8ff87c1: Fix autodetection of linker flags
+ c4622ac: Fix autodetection of -z linker flags
+ 1021965: blaze_util_posix.cc: fix order of #define
+ ab1f269: blaze_util_freebsd.cc: include path.h explicitly
+ 68e92b4: openjdk: update macOS openjdk image. Fixes #5532
+ f45c224: Set the start time of binary and JSON profiles to zero correctly.
+ bca1912: remote: fix race on download error. Fixes #5047
+ 3842bd3: jdk: use parallel old gc and disable compact strings

Incompatible changes:
- The $(ANDROID_CPU) Make variable is not available anymore. Use $(TARGET_CPU) after an Android configuration transition instead.
- The $(JAVA_TRANSLATIONS) Make variable is not supported anymore.
- Skylark structs (using struct()) may no longer have to_json and to_proto overridden.
- The mobile-install --skylark_incremental_res flag is no longer available; use the --skylark flag instead.

New features:
- android_local_test now takes advantage of Robolectric's binary resource processing, which allows for faster tests.
- Allow @ in package names.

Important changes:
- Option --glibc is removed; toolchain selection relies solely on the --cpu and --compiler options.
- Build support for enabling cross binary FDO optimization.
- The --distdir option is no longer experimental. This option allows specifying additional directories to look for files before trying to fetch them from the network. Files from any of the distdirs are only used if a checksum for the file is specified and both the filename and the checksum match.
- Java coverage now works with multiple jobs.
- Flip default value of --experimental_shortened_obj_file_path to true; Bazel now generates short object file paths by default.
- New rules for importing Android dependencies: `aar_import_external` and `aar_maven_import_external`. `aar_import_external` enables specifying external AAR dependencies using a list of HTTP URLs for the artifact. `aar_maven_import_external` enables specifying external AAR dependencies using the artifact coordinate and a list of server URLs.
- The BAZEL_JAVAC_OPTS environment variable allows arguments, e.g., "-J-Xmx2g", to be passed to the javac compiler during the bootstrap build. This is helpful if your system chooses too small of a max heap size for the Java compiler during the bootstrap build.
- --noexpand_configs_in_place is deprecated.
- A tool to parse the Bazel execution log.
- Support for LIPO has been fully removed.
- Remove support for --discard_actions_after_execution.
- Add --materialize_param_files flag to write parameter files even when actions are executed remotely.
- Windows default system bazelrc is read from the user's ProgramData if present.
- --[no]allow_undefined_configs no longer exists; passing undefined configs is an error.
- In remote caching we limit the number of open TCP connections to 100 by default. The number can be adjusted by specifying the --remote_max_connections flag.
Baseline: 4f64b77

Cherry picks:
+ 4c9a0c8: reduce the size of bazel's embedded jdk
+ d3228b6: remote: limit number of open tcp connections by default. Fixes #5491
+ 8ff87c1: Fix autodetection of linker flags
+ c4622ac: Fix autodetection of -z linker flags
+ 1021965: blaze_util_posix.cc: fix order of #define
+ ab1f269: blaze_util_freebsd.cc: include path.h explicitly
+ 68e92b4: openjdk: update macOS openjdk image. Fixes #5532
+ f45c224: Set the start time of binary and JSON profiles to zero correctly.
+ bca1912: remote: fix race on download error. Fixes #5047
+ 3842bd3: jdk: use parallel old gc and disable compact strings
+ 6bd0bdf: Add objc-fully-link to the list of actions that require the apple_env feature. This fixes apple_static_library functionality.
+ f330439: Add the action_names_test_files target to the OSS version of tools/buils_defs/cc/BUILD.
+ d215b64: Fix StackOverflowError on Windows. Fixes #5730
+ 366da4c: In java_rules_skylark depend on the javabase through //tools/jdk:current_java_runtime
+ 30c601d: Don't use @local_jdk for jni headers
+ c56699d: 'DumpPlatformClasspath' now dumps the current JDK's default platform classpath

This release is a patch release that contains fixes for several serious regressions that were found after the release of Bazel 0.16.0. In particular this release resolves the following issues:
- Bazel crashes with a StackOverflowError on Windows (see #5730)
- Bazel requires a locally installed JDK and does not fall back to the embedded JDK (see #5744)
- Bazel fails to build for Homebrew on macOS El Capitan (see #5777)
- A regression in apple_static_library (see #5683)

Please watch our blog for a more detailed release announcement.
I'm still seeing this issue using Bazel 0.17.1 and a medium-sized iOS application (using rules_apple 0.7.0 and rules_swift as of 7bbcf9584613169cda709b9c217f5ac29cc5a089). Unfortunately I can't share the app, but I'd be happy to help debug the issue. My config:
And I'm building with the command:
Where
Can you run with --verbose_failures and see if you get the same error message?
Same error message.
This time though, I noticed that buried in the compilation output there is this:
However, the output seems wrong since I can't tell if it retried, and if so how many times. I also ran it with the flag added. Maybe this is a different issue... should I open a new issue?
That's actually a different error, as the output does not contain
I'm seeing this one with Bazel 0.20.0. I think a failure in the remote cache:
triggered this again:
This was with
I am getting a ClosedChannelException error when trying to use a Bazel remote cache deployed on k8s. Was a separate issue opened for this?
Description of the problem / feature request:
Sometimes builds with HTTP remote caching enabled fail with
Failed to delete output files after incomplete download. Cannot continue with local execution.
See e.g. the logs here: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/62063/pull-kubernetes-bazel-test/38632/
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Unfortunately I've had some difficulty making this bug reproducible; you need to have remote caching enabled and have a download of a cache entry fail, and even then the build often still succeeds by falling back to local execution.
What operating system are you running Bazel on?
A Debian Jessie based Docker container.
What's the output of bazel info release?
release 0.11.0
Any other information, logs, or outputs that you want to share?
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/62063/pull-kubernetes-bazel-test/38632/
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/62723/pull-kubernetes-bazel-build/38747/