Workspace download retries with read timeouts #8974

Closed
keith opened this issue Jul 24, 2019 · 3 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file. type: bug

Comments

keith (Member) commented Jul 24, 2019

It seems that currently, if Bazel fails to download something declared in the WORKSPACE with multiple URLs, it doesn't always fall back to the other URLs. For example, I hit this error:

Starting local Bazel server and connecting to it...
 - /root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/io_bazel_rules_go/go/private/sdk.bzl:93:5
 - /root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/io_bazel_rules_go/go/private/sdk.bzl:261:13
 - /tmp/share/code/WORKSPACE:83:1
ERROR: An error occurred during the fetch of repository 'go_sdk':
   Traceback (most recent call last):
	File "/root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/io_bazel_rules_go/go/private/sdk.bzl", line 78
		_remote_sdk(ctx, [url.format(filename) for url...], <2 more arguments>)
	File "/root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/io_bazel_rules_go/go/private/sdk.bzl", line 149, in _remote_sdk
		ctx.download(url = urls, sha256 = sha256, output ...")
java.io.IOException: Error downloading [https://dl.google.com/go/go1.12.7.linux-amd64.tar.gz] to /root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/go_sdk/go_sdk.tar.gz: Read timed out
ERROR: While resolving toolchains for target @com_github_bazelbuild_buildtools//buildifier:buildifier: invalid registered toolchain '@go_sdk//:go_android_amd64': no such package '@go_sdk//': Traceback (most recent call last):
	File "/root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/io_bazel_rules_go/go/private/sdk.bzl", line 78
		_remote_sdk(ctx, [url.format(filename) for url...], <2 more arguments>)
	File "/root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/io_bazel_rules_go/go/private/sdk.bzl", line 149, in _remote_sdk
		ctx.download(url = urls, sha256 = sha256, output ...")
java.io.IOException: Error downloading [https://dl.google.com/go/go1.12.7.linux-amd64.tar.gz] to /root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/go_sdk/go_sdk.tar.gz: Read timed out
ERROR: Analysis of target '//:buildifier.check' failed; build aborted: invalid registered toolchain '@go_sdk//:go_android_amd64': no such package '@go_sdk//': Traceback (most recent call last):
	File "/root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/io_bazel_rules_go/go/private/sdk.bzl", line 78
		_remote_sdk(ctx, [url.format(filename) for url...], <2 more arguments>)
	File "/root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/io_bazel_rules_go/go/private/sdk.bzl", line 149, in _remote_sdk
		ctx.download(url = urls, sha256 = sha256, output ...")
java.io.IOException: Error downloading [https://dl.google.com/go/go1.12.7.linux-amd64.tar.gz] to /root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/go_sdk/go_sdk.tar.gz: Read timed out
ERROR: Build failed. Not running target

When my WORKSPACE contained:

http_archive(
    name = "io_bazel_rules_go",
    sha256 = "8df59f11fb697743cbb3f26cfb8750395f30471e9eabde0d174c3aebc7a1cd39",
    urls = [
        "https://storage.googleapis.com/bazel-mirror/github.com/bazelbuild/rules_go/releases/download/0.19.1/rules_go-0.19.1.tar.gz",
        "https://github.com/bazelbuild/rules_go/releases/download/0.19.1/rules_go-0.19.1.tar.gz",
    ],
)

I would have expected this to fall back to the GitHub URL before failing.

What's the output of `bazel info release`?

release 0.28.0

@jin jin added team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file. untriaged labels Jul 24, 2019
@laurentlb laurentlb added P2 We'll consider working on this in future. (Assignee optional) type: bug and removed untriaged labels Jul 29, 2019
artem-zinnatullin (Contributor) commented:

I've submitted two fixes for this issue: #9008 and #9015.

I also think the HttpUrlConnection-based HttpDownloader, HttpConnector, and HttpConnectorMultiplexer should be deprecated in favor of more modern, non-blocking Netty-based code (HttpDownloadHandler), which would potentially need to be abstracted out so that both repository downloads and blob downloads could reuse it.

Would be happy to work on this if we achieve consensus 😊
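
A rough sketch of the kind of shared abstraction being suggested here; all names below are hypothetical and are not existing Bazel APIs:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.URL;
import java.util.List;

// Hypothetical interface that both repository fetches and remote-cache blob
// downloads could program against; the backing implementation could be the
// existing HttpUrlConnection code or a non-blocking Netty pipeline.
interface BlobDownloader {
  /** Downloads from the first reachable URL into {@code out}, falling back across URLs. */
  void download(List<URL> urls, OutputStream out) throws IOException;
}
```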

buchgr (Contributor) commented Jul 31, 2019

> Would be happy to work on this if we achieve consensus 😊

We've wanted to do this for a while; we just never got around to it. Feel free to send patches!

artem-zinnatullin (Contributor) commented:

Great!

Would it be OK to open a separate issue to outline the HttpUrlConnection → Netty migration details and track its progress?


Can you please consider #9015 and #9008 in the meantime, as they significantly improve the resiliency of repository downloads? 😊

bazel-io pushed a commit that referenced this issue Aug 5, 2019
As part of investigating #8974, I found that the recent change 982e0b8 broke retries of `SocketTimeoutException` in `HttpConnector`; the intention of that change was good, but the lack of tests resulted in broken logic.

IntelliJ highlights the problem: the branch with `instanceof SocketTimeoutException` would never have been executed:

![Screen Shot 2019-07-29 at 3 26 11 PM](https://user-images.githubusercontent.com/967132/62089179-d7bfa400-b21c-11e9-882a-1c1fe1fcb683.png)

---

This PR adds the missing tests and fixes the logic so that `SocketTimeoutException` is still presented as an `IOException` to upstream consumers while being handled properly internally in `HttpConnector`.

Closes #9008.

PiperOrigin-RevId: 261675244
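
The pitfall described in the commit message above can be illustrated with a minimal, hypothetical sketch (this is not the actual `HttpConnector` code): because `SocketTimeoutException` extends `IOException`, a branch that tests the broader type first makes the timeout-specific branch unreachable.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

class RetryPolicySketch {
  // BROKEN ordering: every SocketTimeoutException is also an IOException,
  // so the first branch always matches and the retry branch is dead code.
  static boolean shouldRetryBroken(Exception e) {
    if (e instanceof IOException) {
      return false;
    } else if (e instanceof SocketTimeoutException) {
      return true; // never reached
    }
    return false;
  }

  // FIXED ordering: test the most specific type first, so read timeouts can be
  // retried internally while other I/O errors still surface to callers.
  static boolean shouldRetryFixed(Exception e) {
    if (e instanceof SocketTimeoutException) {
      return true;
    }
    return false;
  }
}
```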
katre pushed a commit that referenced this issue Aug 6, 2019
Fixes #8974.

`HttpDownloader` never retried an `IOException` that occurred during `ByteStreams.copy(payload, out)`; `HttpConnector` would retry connection-phase errors but not payload errors.

This changeset adds fallback to the next URL(s), if present, for both payload read errors and connection errors, and adds the (completely) missing tests for `HttpDownloader`.

Note that this PR technically disables the questionable `HttpConnectorMultiplexer` optimization that attempts to connect to multiple URLs at the same time (by starting threads that race with each other) and picks the one that connects fastest. There is a way to keep that optimization while still falling back to the next URLs, but it would require every exception to carry the URL that caused it so that `HttpDownloader` could retry the download on the other URLs. I think making `HttpDownloader` more reliable, and actually using the provided `urls` as fallbacks, is much more important than that optimization.

Closes #9015.

PiperOrigin-RevId: 261702678
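
A minimal sketch of the fallback behavior this change describes, with assumed names and simplified logic (not the actual `HttpDownloader` implementation): each URL is tried in order, and a failure during either the connection phase or the payload copy moves on to the next URL.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class FallbackDownloaderSketch {
  // Hypothetical fallback loop: only the overall idea is taken from the
  // commit message above, none of these names exist in Bazel.
  void download(List<URL> urls, Path output) throws IOException {
    IOException lastError = null;
    for (URL url : urls) {
      try (InputStream in = url.openStream();           // connection-phase errors
           OutputStream out = Files.newOutputStream(output)) {
        in.transferTo(out);                             // payload read errors
        return;                                         // success: stop trying
      } catch (IOException e) {
        lastError = e;                                   // fall back to the next URL
      }
    }
    throw lastError != null ? lastError : new IOException("no URLs provided");
  }
}
```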
katre pushed further commits that referenced this issue on Aug 6, Aug 7, and Aug 12, 2019; their messages duplicate the two commit messages above.
@philwo philwo added the team-OSS Issues for the Bazel OSS team: installation, release process, Bazel packaging, website label Jun 15, 2020
@philwo philwo removed the team-OSS Issues for the Bazel OSS team: installation, release process, Bazel packaging, website label Nov 29, 2021