
builds with remote caching sometimes fails on incomplete downloads #5047

Closed
BenTheElder opened this issue Apr 18, 2018 · 36 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required)

Comments

@BenTheElder

Description of the problem / feature request:

Sometimes builds with http remote caching enabled fail with Failed to delete output files after incomplete download. Cannot continue with local execution.

See e.g. the logs here: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/62063/pull-kubernetes-bazel-test/38632/

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I've had some difficulty making this bug reproducible, unfortunately: you need to have remote caching enabled and a download of a cache entry has to fail, and even then the build often still succeeds and falls back to local building.

What operating system are you running Bazel on?

A debian jessie based docker container.

What's the output of bazel info release?

release 0.11.0

Any other information, logs, or outputs that you want to share?

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/62063/pull-kubernetes-bazel-test/38632/

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/62723/pull-kubernetes-bazel-build/38747/

@BenTheElder
Author

cc @buchgr @ixdy @rmmh
I've spent a fair bit of time trying to make this reproducible, unsuccessfully. Even digging through our logs, most runs are either successes or genuine failures, but this happens often enough to be a bad experience.

@buchgr buchgr self-assigned this Apr 19, 2018
@buchgr
Contributor

buchgr commented Apr 19, 2018

Thanks for reporting. I'll take a look!

@c-parsons c-parsons added P2 We'll consider working on this in future. (Assignee optional) category: remote execution / caching labels Apr 21, 2018
@ixdy
Contributor

ixdy commented May 15, 2018

Still happening occasionally in Bazel 0.13.0. Failure from this morning: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/54372/pull-kubernetes-e2e-gce-device-plugin-gpu/31984/

W0515 15:50:29.365] INFO: Analysed target //build/release-tars:release-tars (1997 packages loaded).
W0515 15:50:29.370] INFO: Found 1 target...
W0515 15:50:29.510] [0 / 15] [-----] Writing file build/release-tars/kubernetes-manifests.args
W0515 15:50:34.291] [49 / 2,338] [-----] Writing file vendor/github.com/google/certificate-transparency-go/x509/go_default_library._cgo_.o-2.params
W0515 15:50:37.467] ERROR: /bazel-scratch/.cache/bazel/_bazel_prow/48d5366022b4e3197674c8d6e2bee219/external/io_bazel_rules_go/BUILD.bazel:9:1: GoStdlib external/io_bazel_rules_go/linux_amd64_stripped/stdlib~/pkg failed: Failed to delete output files after incomplete download. Cannot continue with local execution.: /bazel-scratch/.cache/bazel/_bazel_prow/48d5366022b4e3197674c8d6e2bee219/execroot/__main__/bazel-out/k8-fastbuild/bin/external/io_bazel_rules_go/linux_amd64_stripped/stdlib~/pkg (Directory not empty)

@BenTheElder
Author

Thanks @buchgr!

lfpino pushed a commit that referenced this issue May 24, 2018
ulfjack pushed a commit that referenced this issue May 25, 2018
@BenTheElder
Author

We're seeing this again with 0.15.0 :(

Test runs a and b both hit an error like:

ERROR: /bazel-scratch/.cache/bazel/_bazel_root/27a95fe2fc68aa1195ddf1a4bbaaea62/external/io_bazel_rules_go/BUILD.bazel:9:1: GoStdlib external/io_bazel_rules_go/linux_amd64_race_stripped/stdlib~/pkg failed: Failed to delete output files after incomplete download. Cannot continue with local execution.: /bazel-scratch/.cache/bazel/_bazel_root/27a95fe2fc68aa1195ddf1a4bbaaea62/execroot/main/bazel-out/k8-fastbuild/bin/external/io_bazel_rules_go/linux_amd64_race_stripped/stdlib~/pkg/linux_amd64_race (Directory not empty)

Per the logs these tests ran in the image gcr.io/k8s-testimages/kubekins-e2e:v20180627-0ed738715-experimental which is definitely 0.15.0:

$ docker run --rm -it --entrypoint=/bin/sh gcr.io/k8s-testimages/kubekins-e2e:v20180627-0ed738715-experimental
# bazel version
Extracting Bazel installation...
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
Build label: 0.15.0
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Tue Jun 26 12:10:19 2018 (1530015019)
Build timestamp: 1530015019
Build timestamp as int: 1530015019

@BenTheElder
Author

BenTheElder commented Jun 29, 2018

I've personally hit this 3x today, I think we will downgrade to 0.14.0 in the short term..
Edit: I can't be completely certain yet, but AFAICT this regressed somewhere between 0.14.0 and 0.15.0

@buchgr
Contributor

buchgr commented Jun 29, 2018

@BenTheElder sorry about that :(. I don't understand what's happening there yet. Can you try running with --remote_max_connections=200? I think this will avoid the failures in the first place.

I will try to reproduce the error.

@buchgr buchgr added the P1 I'll work on this now. (Assignee required) label Jun 29, 2018
@buchgr buchgr reopened this Jun 29, 2018
@buchgr buchgr removed the P2 We'll consider working on this in future. (Assignee optional) label Jun 29, 2018
@buchgr
Contributor

buchgr commented Jun 29, 2018

cc: @laurentlb might need a patch release once we understand what's going on there :(.

@BenTheElder
Author

Hey thanks for the response!

Can you try running with --remote_max_connections=200. I think this will avoid the failures in the first place.

Sure, I'll set up a test job with 0.15.0 and this flag and see if we can still repro. Am I correct that this flag is not available before 0.15?

Also, FWIW, we're not in a huge rush to upgrade to any 0.15 features for Kubernetes, and we staged the rollout to just some smaller repo(s), not the main one, so we're fine for the moment after the rollback. We'll of course want to help hunt this down and fix it so we can keep up with stable, though.

@buchgr
Contributor

buchgr commented Jun 29, 2018

Hey Ben,

So in 0.15.0 we made a change for all action outputs to be downloaded in parallel. There is no concurrency limit applied by default, but I figured the number of action outputs would probably be somewhat reasonable and thus there would be a natural limit. After the 0.15.0 release (and after my vacation) I found that rules_go has actions with a very high number of outputs, and that we would then run into open file descriptor limits and that kind of thing. I plan to add a ~200 connection limit in the 0.16.0 release, or 0.15.1 if there is one.

I think with the --remote_max_connections=200 you will not encounter the error, but the error itself is still unexplained. It almost seems as if Bazel is interpreting a directory as a file.

Correct, the flag was introduced in 0.15.0.
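As a rough illustration (not Bazel's actual code; the class, pool, and fake fetch below are all hypothetical), a semaphore can bound how many downloads are in flight at once, which is the effect a connection cap like --remote_max_connections has on the HTTP cache client:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: cap concurrent cache downloads with a semaphore,
// similar in spirit to what --remote_max_connections does in Bazel.
public class BoundedDownloads {
  static final int MAX_IN_FLIGHT = 200; // stand-in for --remote_max_connections=200

  static List<String> downloadAll(List<String> blobs) throws Exception {
    ExecutorService pool = Executors.newCachedThreadPool();
    Semaphore permits = new Semaphore(MAX_IN_FLIGHT);
    List<Future<String>> futures = new ArrayList<>();
    for (String blob : blobs) {
      futures.add(pool.submit(() -> {
        permits.acquire(); // blocks once MAX_IN_FLIGHT downloads are in flight
        try {
          return "downloaded:" + blob; // stand-in for the real HTTP fetch
        } finally {
          permits.release();
        }
      }));
    }
    List<String> results = new ArrayList<>();
    for (Future<String> f : futures) {
      results.add(f.get()); // wait for every download to finish
    }
    pool.shutdown();
    return results;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(downloadAll(List.of("a", "b", "c")));
  }
}
```

Without the semaphore, every blob gets its own in-flight connection at once, which is how an action with thousands of outputs can exhaust file descriptors or sockets.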

@BenTheElder
Author

Got it, thanks! I'll start testing this after our team standup (~nowish).

@BenTheElder
Author

I've not been able to reproduce with --remote_max_connections=200 enabled for Bazel 0.15 based jobs; with the flag accidentally disabled in the initial setup I did see this again, though. I'm going to keep kicking off some more runs, but I'm pretty confident that this fixes things for us.

I think with the --remote_max_connections=200 you will not encounter the error, but the error itself is still unexplained. It almost seems as if Bazel is interpreting a directory as a file.

This sounds somewhat familiar in the context of rules_go; any thoughts, @jayconrod?

@BenTheElder
Author

Hmmm perhaps not... spotted one failure now.

W0701 01:48:08.092] WARNING: Error reading from the remote cache:
W0701 01:48:08.093] Exhaused retry attempts (0)
W0701 01:48:08.442] ERROR: /bazel-scratch/.cache/bazel/_bazel_root/e9f728bbd90b3fba632eb31b20e1dacd/external/io_bazel_rules_go/BUILD.bazel:8:1: GoStdlib external/io_bazel_rules_go/linux_amd64_stripped/stdlib~/pkg failed: Failed to delete output files after incomplete download. Cannot continue with local execution.: /bazel-scratch/.cache/bazel/_bazel_root/e9f728bbd90b3fba632eb31b20e1dacd/execroot/__main__/bazel-out/k8-fastbuild/bin/external/io_bazel_rules_go/linux_amd64_stripped/stdlib~/pkg/linux_amd64 (Directory not empty)

As far as I can tell this run should have had bazel 0.15 with --remote_max_connections=200 in the same bazelrc as the other caching flags.

pod-log.txt

@jayconrod
Contributor

@BenTheElder We have a rule that builds the Go standard library for the target platform and mode. It's used as an input for most other actions. It's a directory output, but I guess that's expressed as individual files over the wire.

I'm planning to change the implementation of that rule to split the standard library into pieces that can be used as separate inputs (tools, archives, sources, headers). We'd also stop including packages that can't be imported from other Go code (e.g., vendored and internal packages in the standard library). That should improve performance quite a bit, both locally and with remote execution.

bazel-io pushed a commit that referenced this issue Jul 2, 2018
This change limits the number of open TCP connections
by default to 100 for remote caching. We have had error
reports where in some use cases Bazel would open so many
TCP connections that it crashed or ran out of sockets. The
max. number of TCP connections can still be adjusted by
specifying --remote_max_connections.

See also #5047.

RELNOTES: In remote caching we limit the number of open
TCP connections to 100 by default. The number can be adjusted
by specifying the --remote_max_connections flag.
PiperOrigin-RevId: 202958838
@buchgr
Contributor

buchgr commented Jul 24, 2018

Ok that took a while, but I believe I understand the error now.

So effectively we asynchronously trigger concurrent downloads for all output files, which gives us a list of futures, and at the end we wait for all downloads to finish. If one download fails, we immediately trigger a routine to delete all output files (which fails), so that we can continue with local execution instead.

However, when triggering this routine to delete all output files, we don't wait for all downloads to have finished; they continue downloading in the background. The deletion routine recursively deletes all files in a directory and then tries to delete the directory itself. That races with the async downloads that are still running in the background T_T.

Async programming is hard. I'll send out a fix and make sure it gets cherry-picked into 0.16.0, as this is a regression.
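A minimal sketch of the fix just described (class and method names are hypothetical, not Bazel's actual implementation): after any download fails, drain every outstanding future before touching the output tree, so cleanup can never race with a download that is still writing files:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: wait for ALL downloads to settle before cleanup.
public class WaitBeforeCleanup {

  // Drains every future, even after the first failure; returns whether any failed.
  static boolean awaitAll(List<Future<Void>> downloads) {
    boolean failed = false;
    for (Future<Void> f : downloads) {
      try {
        f.get();
      } catch (Exception e) {
        failed = true; // remember the failure, but keep draining the rest
      }
    }
    return failed;
  }

  static String runDemo() throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(3);
    AtomicInteger finishedWriting = new AtomicInteger();
    Callable<Void> slowOk = () -> {   // stands in for a download still writing files
      Thread.sleep(200);
      finishedWriting.incrementAndGet();
      return null;
    };
    Callable<Void> fastFail = () -> { // stands in for the incomplete download
      throw new RuntimeException("incomplete download");
    };
    List<Future<Void>> futures =
        List.of(pool.submit(slowOk), pool.submit(fastFail), pool.submit(slowOk));
    boolean failed = awaitAll(futures);
    pool.shutdown();
    // Only now is it safe to recursively delete partial outputs: no download
    // is still writing, so no "Directory not empty" race.
    return failed + " " + finishedWriting.get();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(runDemo()); // prints "true 2"
  }
}
```

The buggy behavior corresponds to starting the recursive delete as soon as `fastFail` throws, while the two `slowOk` tasks are still creating files underneath the directory being removed.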

@BenTheElder
Author

Thanks @buchgr!

buchgr added a commit to buchgr/bazel that referenced this issue Jul 24, 2018
For downloading output files / directories we trigger all
downloads concurrently and asynchronously in the background
and after that wait for all downloads to finish. However, if
a download failed we did not wait for the remaining downloads
to finish but immediately started deleting partial downloads
and continued with local execution of the action.

That leads to two interesting bugs:
 1) The cleanup procedure races with the downloads that are still
    in progress. As it tries to delete files and directories, new
    files and directories are created and that will often
    lead to "Directory not empty" errors as seen in bazelbuild#5047.
 2) The clean up procedure does not detect the race, succeeds and
    subsequent local execution fails because not all files have
    been deleted.
natansil pushed a commit to natansil/bazel that referenced this issue Jul 24, 2018
@buchgr
Contributor

buchgr commented Jul 25, 2018

This issue has been fixed in master (I hope :D). It will be cherry-picked into the next release candidate for 0.16.0. Would be great if you could give it a spin!

@steeve
Contributor

steeve commented Jul 25, 2018 via email

@buchgr
Contributor

buchgr commented Jul 25, 2018

It will be release candidate 3; it should be out later today.

c-parsons pushed a commit that referenced this issue Jul 25, 2018
For downloading output files / directories we trigger all
downloads concurrently and asynchronously in the background
and after that wait for all downloads to finish. However, if
a download failed we did not wait for the remaining downloads
to finish but immediately started deleting partial downloads
and continued with local execution of the action.

That leads to two interesting bugs:
* The cleanup procedure races with the downloads that are still
in progress. As it tries to delete files and directories, new
files and directories are created and that will often
lead to "Directory not empty" errors as seen in #5047.
* The clean up procedure does not detect the race, succeeds and
subsequent local execution fails because not all files have
been deleted.

The solution is to always wait for all downloads to complete
before entering the cleanup routine. Ideally we would also
cancel all outstanding downloads, however, that's not as
straightforward as it seems. That is, the j.u.c.Future API does
not provide a way to cancel a computation and also wait for
that computation to have actually terminated. So we'd need
to introduce a separate mechanism to cancel downloads.

RELNOTES: None
PiperOrigin-RevId: 205980446
@BenTheElder
Author

Thank you! We'll start testing once the RC is out 😄

bazel-io pushed a commit that referenced this issue Jul 31, 2018
Baseline: 4f64b77

Cherry picks:
   + 4c9a0c8:
     reduce the size of bazel's embedded jdk
   + d3228b6:
     remote: limit number of open tcp connections by default. Fixes
     #5491
   + 8ff87c1:
     Fix autodetection of linker flags
   + c4622ac:
     Fix autodetection of -z linker flags
   + 1021965:
     blaze_util_posix.cc: fix order of #define
   + ab1f269:
     blaze_util_freebsd.cc: include path.h explicitly
   + 68e92b4:
     openjdk: update macOS openjdk image. Fixes #5532
   + f45c224:
     Set the start time of binary and JSON profiles to zero correctly.
   + bca1912:
     remote: fix race on download error. Fixes #5047
   + 3842bd3:
     jdk: use parallel old gc and disable compact strings

Incompatible changes:

  - The $(ANDROID_CPU) Make variable is not available anymore. Use
    $(TARGET_CPU) after an Android configuration transition instead.
  - The $(JAVA_TRANSLATIONS) Make variable is not supported anymore.
  - Skylark structs (using struct()) may no longer have to_json and
    to_proto overridden.
  - The mobile-install --skylark_incremental_res flag is no longer
    available, use the --skylark flag instead.

New features:

  - android_local_test now takes advantage of Robolectric's binary
    resource processing which allows for faster tests.
  - Allow @ in package names.

Important changes:

  - Option --glibc is removed, toolchain selection relies solely on
    --cpu and --compiler options.
  - Build support for enabling cross binary FDO optimization.
  - The --distdir option is no longer experimental. This
      option allows to specify additional directories to look for
      files before trying to fetch them from the network. Files from
      any of the distdirs are only used if a checksum for the file
      is specified and both, the filename and the checksum, match.
  - Java coverage works now with multiple jobs.
  - Flip default value of --experimental_shortened_obj_file_path to
    true, Bazel now generates short object file path by default.
  - New rules for importing Android dependencies:
    `aar_import_external` and `aar_maven_import_external`.
    `aar_import_external` enables specifying external AAR
    dependencies using a list of HTTP URLs for the artifact.
    `aar_maven_import_external` enables specifying external AAR
    dependencies using the artifact coordinate and a list of server
    URLs.
  - The BAZEL_JAVAC_OPTS environment variable allows arguments, e.g.,
    "-J-Xmx2g", may be passed to the javac compiler during bootstrap
    build. This is helpful if your system chooses too small of a max
    heap size for the Java compiler during the bootstrap build.
  - --noexpand_configs_in_place is deprecated.
  - A tool to parse the Bazel execution log.
  - Support for LIPO has been fully removed.
  - Remove support for --discard_actions_after_execution.
  - Add --materialize_param_files flag to write parameter files even
    when actions are executed remotely.
  - Windows default system bazelrc is read from the user's
    ProgramData if present.
  - --[no]allow_undefined_configs no longer exists, passing undefined
    configs is an error.
  - In remote caching we limit the number of open
    TCP connections to 100 by default. The number can be adjusted
    by specifying the --remote_max_connections flag.
werkt pushed a commit to werkt/bazel that referenced this issue Aug 2, 2018
werkt pushed a commit to werkt/bazel that referenced this issue Aug 2, 2018
buchgr pushed a commit to buchgr/bazel that referenced this issue Aug 7, 2018
c-parsons pushed a commit that referenced this issue Aug 8, 2018
c-parsons pushed a commit that referenced this issue Aug 8, 2018
buchgr added a commit to buchgr/bazel that referenced this issue Aug 9, 2018
buchgr added a commit to buchgr/bazel that referenced this issue Aug 9, 2018
bazel-io pushed a commit that referenced this issue Aug 13, 2018
Baseline: 4f64b77

Cherry picks:
   + 4c9a0c8:
     reduce the size of bazel's embedded jdk
   + d3228b6:
     remote: limit number of open tcp connections by default. Fixes
     #5491
   + 8ff87c1:
     Fix autodetection of linker flags
   + c4622ac:
     Fix autodetection of -z linker flags
   + 1021965:
     blaze_util_posix.cc: fix order of #define
   + ab1f269:
     blaze_util_freebsd.cc: include path.h explicitly
   + 68e92b4:
     openjdk: update macOS openjdk image. Fixes #5532
   + f45c224:
     Set the start time of binary and JSON profiles to zero correctly.
   + bca1912:
     remote: fix race on download error. Fixes #5047
   + 3842bd3:
     jdk: use parallel old gc and disable compact strings
   + 6bd0bdf:
     Add objc-fully-link to the list of actions that require the
     apple_env feature. This fixes apple_static_library functionality.
   + f330439:
     Add the action_names_test_files target to the OSS version of
     tools/buils_defs/cc/BUILD.
   + d215b64:
     Fix StackOverflowError on Windows. Fixes #5730
   + 366da4c:
     In java_rules_skylark depend on the javabase through
     //tools/jdk:current_java_runtime
   + 30c601d:
     Don't use @local_jdk for jni headers
   + c56699d:
     'DumpPlatformClasspath' now dumps the current JDK's default
     platform classpath

This release is a patch release that contains fixes for several serious
regressions that were found after the release of Bazel 0.16.0.

In particular this release resolves the following issues:

 - Bazel crashes with a StackOverflowError on Windows (See #5730)
 - Bazel requires a locally installed JDK and does not fall back
   to the embedded JDK (See #5744)
 - Bazel fails to build for Homebrew on macOS El Capitan (See #5777)
 - A regression in apple_static_library (See #5683)

Please watch our blog for a more detailed release announcement.
laurentlb pushed a commit that referenced this issue Sep 12, 2018
laurentlb pushed a commit that referenced this issue Sep 12, 2018
@ob
Contributor

ob commented Sep 18, 2018

I'm still seeing this issue using Bazel 0.17.1 and a medium-sized iOS application (using rules_apple 0.7.0 and rules_swift as of 7bbcf9584613169cda709b9c217f5ac29cc5a089). Unfortunately I can't share the app, but I'd be happy to help debug the issue.

My config:

build --spawn_strategy=standalone
build --genrule_strategy=standalone
build --experimental_objc_enable_module_maps
build --features swift.no_generated_module_map
build --symlink_prefix=build/
build --xcode_version 9.4.1
build --remote_http_cache=$REMOTE_CACHE_URL

And I'm building with the command:

bazel build --jobs 128 //:Learning

Where Learning is an ios_application. The error message is:

ERROR: /Users/obonilla/r/learning-ios_trunk/BUILD.bazel:317:1: SwiftCompile Learning.swiftmodule failed: Unexpected IO error.: Exhaused retry attempts (0)
Target //:Learning failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: /Users/obonilla/r/learning-ios_trunk/BUILD.bazel:1259:1 SwiftCompile Learning.swiftmodule failed: Unexpected IO error.: Exhaused retry attempts (0)

@buchgr
Contributor

buchgr commented Sep 19, 2018

Can you run with --verbose_failures and see if you get the same error message?

@ob
Contributor

ob commented Sep 19, 2018

Same error message.

ERROR: /Users/obonilla/r/learning-ios_trunk/LearningDataModel/BUILD.bazel:2:1: C++ compilation of rule '//LearningDataModel:LearningDataModel_ObjC' failed: Unexpected IO error.: Exhaused retry attempts (0)

This time though, I noticed that buried in the compilation output there is this:

WARNING: Error reading from the remote cache:
ClosedChannelException

However, the output seems wrong, since I can't tell if it retried, and if so how many times it retried. I also ran it with the flag --remote_local_fallback, which I expected would fall back to local building if there was a problem with the remote cache, but it didn't appear to do anything.

Maybe this is a different issue... should I open a new Issue?

@buchgr
Contributor

buchgr commented Sep 20, 2018

That's actually a different error, as the output does not contain Failed to delete output files after incomplete download. Cannot continue with local execution.. It's possible that local fallback is broken :(. Please open a new issue.

@ob
Contributor

ob commented Dec 8, 2018

I'm seeing this one with Bazel 0.20.0. I think a failure in the remote cache:

[538 / 5,501] CompileStrings scripts/bazel/deviceBuilds/appStore/Learning.resources/LIAuthLibraryBundle.bundle/nb.lproj/LIAuthLibrary.strings; 3s remote-cache ... (1426 actions, 1423 running)
WARNING: Error reading from the remote cache:
502 Bad Gateway
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.12.2</center>
</body>
</html>
....
WARNING: Error reading from the remote cache:
Exhausted retry attempts (0)

triggered this again:

ERROR:... XibCompile .../CodeBlockView.ibtool-outputs failed: Failed to delete output files after incomplete download. Cannot continue with local execution.: ...CodeBlockView.ibtool-outputs/CodeBlockView.nib (Directory not empty)
INFO: Elapsed time: 45.007s, Critical Path: 21.92s

This was with --jobs 2000, so it might be that same race?

@rijosimon

That's actually a different error, as the output does not contain Failed to delete output files after incomplete download. Cannot continue with local execution.. It's possible that local fallback is broken :. Please open a new issue.

I am getting a ClosedChannelException error when trying to use bazel remote cache deployed on k8s. Was a separate issue opened for this?
