-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Properly cancel repository downloads #23837
Properly cancel repository downloads #23837
Conversation
When cancelling an asynchronous repository download task, an interrupt signal is sent to the download thread. This doesn't mean that the download stops immediately. Avoid restarting a download until the previous download has actually stopped, so that the new download is able to clean old data without crashing (on Windows). Fixes bazelbuild#21773
b3d15df
to
1b42adc
Compare
Thank you @fmeum. After rebasing (and fixing a double space in a comment for you), I ran the test on bb-storage again as in #21773 and can see the new "Cancelling download " profiling point appear, which is great. This version haven't been tested in production yet. Let me know if you would like that before merging, it may take a week or two. |
src/test/java/com/google/devtools/build/lib/bazel/repository/downloader/HttpDownloaderTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Wyverald Could you review this as well?
src/test/java/com/google/devtools/build/lib/bazel/repository/downloader/HttpDownloaderTest.java
Outdated
Show resolved
Hide resolved
That would probably mean missing the Bazel 8.0.0 and 7.4.0 releases. I'm not sure whether a confirmed race is better or worse than an unconfirmed fix. @meteorcloudy What do you think? |
The first commit has been verified in production, it's just the second commit from today that has not been tested on scale. |
The only meaningful difference is that I made the cleanup uninterruptible. Since it should only be interrupted when the user interrupts the build, it shouldn't make a big difference - it just requires the user to explicitly choose to kill the server in order to interrupt the cleanup. |
Hmm, I don't have a strong opinion, sounds like this could be a nice-to-have and could wait for 8.1? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix! I think this is fine to cherry-pick into 8.0.0rc2 or so, but probably not going to make it in 7.4.0.
...ava/com/google/devtools/build/lib/bazel/repository/starlark/StarlarkBaseExternalContext.java
Show resolved
Hide resolved
@@ -138,6 +140,8 @@ public Future<Path> startDownload( | |||
eventHandler, | |||
clientEnv, | |||
context); | |||
} finally { | |||
doneSignal.countDown(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could lead to a deadlock if the future is never scheduled before being canceled (that is, after submit()
is called and returns, but before the Callable gets called).
WorkerSkyKeyComputeState
works around this by using a ListeningExecutorService
and using addListener
on the returned Future (https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/skyframe/WorkerSkyKeyComputeState.java;drc=23e5ec157d582585bb9635b5c374ec21cc50b4ea;l=173). let's do the same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point, but I verified, by both code inspection and empirically, that the Future.addListener
also triggers its runnable too early. Therefore, I had to add a semaphore to get the ordering right, which my small scale experiment confirms work. (Well, I could not trigger the case where the task never starts, but the bug this is aimed to solve is gone.)
return executorService.submit( | ||
String context, | ||
CountDownLatch doneSignal) { | ||
Semaphore doneSignaller = new Semaphore(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of combining a semaphore and a latch, maybe we could use a single shared Phaser
instead? A download can register
at the start of the lambda and arriveAndDeregister
in the finally
block. Then close
can wait on arriveAndAwaitAdvance
. You can search for Phaser
in the codebase to find another very similar usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how to implement it using Phaser
. We need three phases: Init (0), started (1) and done (2). Calling awaitAdvance(1)
to wait for the done phase is not possible while in the init phase. @fmeum What is your idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may very well be missing something, but why do we need to track the init
phase at all with a Phaser
? If the task has never been scheduled by the executor, we also don't need to cancel it. If it has started, it will either have registered with the Phaser
(and will thus be awaited with awaitAndAdvance
) or not since it hasn't even reached the register
call (you can check the return value to avoid registrations happening after the advance).
Just to clarify: With the Phaser
approach, addListener
also shouldn't be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated with Phaser
, which is much better. Check the diff for the two last commits together. Tested on my small reproduction but not in production.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we could even use a single Phaser
per context, see https://github.com/fmeum/bazel/tree/23837-phaser. It doesn't give use the opportunity the log the time taken to wait for a particular download cancellation though.
@bazel-io fork 8.0.0 |
return executorService.submit( | ||
() -> { | ||
if (downloadPhaser.register() < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (downloadPhaser.register() < 0) { | |
if (downloadPhaser.register() != 0) { |
looks more obviously correct to me, although I am not 100% sure it even makes a difference.
} | ||
try (SilentCloseable c = | ||
Profiler.instance().profile("Cancelling download " + outputPath)) { | ||
downloadPhaser.awaitAdvance(downloadPhaser.arriveAndDeregister()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
downloadPhaser.awaitAdvance(downloadPhaser.arriveAndDeregister()); | |
downloadPhaser.arriveAndAwaitAdvance(); |
should suffice as we don't care about deregistering (there won't be another phase).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code was written to use Phaser termination as the signalling method. I've now changed to to use phase=0
as signalling method according to the suggestion. The JSON profiles look very nice in the small test case.
public void close() { | ||
// Call register() to ensure arriveAndDeregister() will terminate the | ||
// downloadPhaser if the download has not started yet. | ||
if (downloadPhaser.register() < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (downloadPhaser.register() < 0) { | |
if (downloadPhaser.register() != 0) { |
@Wyverald Now that we have a pretty clean fix, should we consider cherry-picking this to 7.4.0 as well? Looking at the comment in the 8.0.0 cherry-pick issue, this would not just result in loud failures on Windows, but also in silent rebuilds on other platforms. |
When cancelling an asynchronous repository download task, an interrupt signal is sent to the download thread. This doesn't mean that the download stops immediately. Avoid restarting a download until the previous download has actually stopped, so that the new download is able to clean old data without crashing (on Windows). Fixes bazelbuild#21773 Closes bazelbuild#23837. PiperOrigin-RevId: 686175953 Change-Id: I8d75f905b739d38b6cb430d5b5e84fda9a2d14e3
When cancelling an asynchronous repository download task, an interrupt signal is sent to the download thread. This doesn't mean that the download stops immediately. Avoid restarting a download until the previous download has actually stopped, so that the new download is able to clean old data without crashing (on Windows). Fixes #21773 Closes #23837. PiperOrigin-RevId: 686175953 Change-Id: I8d75f905b739d38b6cb430d5b5e84fda9a2d14e3 Commit 48338d2 Co-authored-by: Fredrik Medley <[email protected]>
When cancelling an asynchronous repository download task, an interrupt signal is sent to the download thread. This doesn't mean that the download stops immediately. Avoid restarting a download until the previous download has actually stopped, so that the new download is able to clean old data without crashing (on Windows). Fixes bazelbuild#21773 Closes bazelbuild#23837. PiperOrigin-RevId: 686175953 Change-Id: I8d75f905b739d38b6cb430d5b5e84fda9a2d14e3
When cancelling an asynchronous repository download task, an interrupt signal is sent to the download thread. This doesn't mean that the download stops immediately. Avoid restarting a download until the previous download has actually stopped, so that the new download is able to clean old data without crashing (on Windows). Fixes #21773 Closes #23837. PiperOrigin-RevId: 686175953 Change-Id: I8d75f905b739d38b6cb430d5b5e84fda9a2d14e3 Commit 48338d2 Co-authored-by: Fredrik Medley <[email protected]>
When cancelling an asynchronous repository download task, an interrupt signal is sent to the download thread. This doesn't mean that the download stops immediately. Avoid restarting a download until the previous download has actually stopped, so that the new download is able to clean old data without crashing (on Windows).
Fixes #21773