Unexpected Exception 'Must not call uploadBlobs after shutdown.' when closing BEP transports, this is a bug. #12575
Comments
CC @michaeledgar who seems to know a thing or two about BES artifact uploading
I've successfully reproduced this issue using the steps provided. I'm looking into a fix now. Thanks for reporting this!
Thanks @michaeledgar! Really appreciate you taking the time to look into this. Let me know if I can help in any way. If you have any thoughts on the ThreadPool shutdown stack trace at the bottom (whether the fix for this might address that or not) I'd be super appreciative. Happy to open a second issue if not - working furiously on finding a reliable repro case for that one as we speak.
I have a fix that eliminates the crash similar to the ideas you proposed: the
Ah that's great news, huge thank you for looking into this @michaeledgar! Hopefully the reference counting will address the other issue as well, since it seems like in that scenario If it doesn't, I'll open a separate issue - which we should be able to tackle with a little more reference counting. Thanks again for your help!
The fix is currently in review; the patch may be accessed early at https://bazel-review.googlesource.com/c/bazel/+/149490.
…t enabled. When both BES uploading and File BEP output are enabled, a single BuildEventArtifactUploader object is shared by two different BuildEventTransports. Both were calling #shutdown() which in turn called ByteStreamUploader#shutdown(). If shutdown is called while one transport is still uploading, the ByteStreamUploader will fail an assertion and crash. This change adds reference counting to the BuildEventArtifactUploader interface and ensures the reference counts are maintained correctly when sharing a BuildEventArtifactUploader across multiple independent BuildEventTransport threads. Fixes #12575. RELNOTES: None. TESTED=Made repro modifications to GrpcCacheClient.java described in #12575 and confirmed crash without this change. Implemented this fix, observed crash was no longer reproducible. Added logs to ByteStreamBuildEventArtifactUploader#deallocate() to verify deallocation happened after both BuildEventTransports had completed. PiperOrigin-RevId: 349589743
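The reference-counting idea in the commit message can be sketched roughly as follows. This is a hand-rolled illustration, not the code from the patch: RefCountedUploader, RefCountSketch and the string-based uploadBlobs are hypothetical names, and only retain(), release() and deallocate() echo terms used in this thread. Each transport holds one reference, and the underlying shutdown happens only when the last reference is released.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a shared artifact uploader (not Bazel's real class).
class RefCountedUploader {
  // The creator starts out holding one reference.
  private final AtomicInteger refs = new AtomicInteger(1);
  private volatile boolean deallocated = false;

  // Called by each additional owner before it starts using the uploader.
  RefCountedUploader retain() {
    if (refs.getAndIncrement() <= 0) {
      refs.getAndDecrement();
      throw new IllegalStateException("retain() after deallocation");
    }
    return this;
  }

  // Called by each owner when it is done. Returns true if this call freed the uploader.
  boolean release() {
    int remaining = refs.decrementAndGet();
    if (remaining < 0) {
      refs.incrementAndGet();
      throw new IllegalStateException("release() without a matching retain()");
    }
    if (remaining == 0) {
      deallocate();
      return true;
    }
    return false;
  }

  // Assumption: this is where the real shutdown of the underlying uploader would go.
  private void deallocate() {
    deallocated = true;
  }

  void uploadBlobs(String name) {
    if (deallocated) {
      throw new IllegalStateException("Must not call uploadBlobs after shutdown.");
    }
    // A real uploader would send the blob here.
  }

  boolean isDeallocated() {
    return deallocated;
  }
}

class RefCountSketch {
  // Replays the problematic ordering: one owner finishes while the other still uploads.
  static boolean[] run() {
    RefCountedUploader uploader = new RefCountedUploader(); // reference held by the first transport
    uploader.retain();                                       // reference held by the second transport
    boolean freedByFirst = uploader.release();               // first transport closes early
    uploader.uploadBlobs("out.bin");                         // second transport can still upload
    boolean freedBySecond = uploader.release();              // last owner closes
    return new boolean[] {freedByFirst, freedBySecond, uploader.isDeallocated()};
  }

  public static void main(String[] args) {
    boolean[] r = run();
    System.out.println(r[0] + " " + r[1] + " " + r[2]);
  }
}
```

With this shape, the order in which the two owners finish no longer matters: whichever one releases last triggers the deallocation.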
Description of the problem:
With a gRPC remote cache and BES backend enabled, Bazel intermittently fails to write the build event file with the following error:
I've traced this down to FileTransport.java and BuildEventServiceUploader.java both calling shutdown on the same ByteStreamBuildEventArtifactUploader. Normally this is fine, because they shut down around the same time, after all uploads are complete. However, if findMissingDigests takes a while, FileTransport's shutdown is called first, which shuts down the ByteStreamBuildEventArtifactUploader. Then, when findMissingDigests returns, an upload is attempted on an uploader that has already been shut down.
This happens in practice on builds with thousands of outputs that need to be uploaded. It can also be triggered artificially by adding an intermittent sleep into Bazel in GrpcCacheClient here before returning the missing digests.
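The ordering described above can be replayed deterministically in a small sketch. Everything here is hypothetical (SharedUploader, ShutdownRaceSketch, the slowDigestLookup flag); it models only the sequence of calls on a single thread, not Bazel's actual classes or threading, and it borrows the error text from the issue title.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for an uploader shared by two transports.
class SharedUploader {
  private final AtomicBoolean shutdown = new AtomicBoolean(false);

  void uploadBlobs(String name) {
    if (shutdown.get()) {
      throw new IllegalStateException("Must not call uploadBlobs after shutdown.");
    }
    // A real uploader would send the blob here.
  }

  void shutdown() {
    shutdown.set(true);
  }
}

class ShutdownRaceSketch {
  // Returns the failure message, or null if the upload succeeded.
  // slowDigestLookup models findMissingDigests returning after the other owner has shut down.
  static String run(boolean slowDigestLookup) {
    SharedUploader uploader = new SharedUploader();
    if (!slowDigestLookup) {
      uploader.uploadBlobs("out.bin"); // upload finishes before anyone shuts down
    }
    uploader.shutdown(); // first owner (file-transport-like) is done and shuts down
    if (slowDigestLookup) {
      try {
        uploader.uploadBlobs("out.bin"); // late upload hits an uploader that is already shut down
      } catch (IllegalStateException e) {
        return e.getMessage();
      }
    }
    uploader.shutdown(); // second owner shuts down as well
    return null;
  }

  public static void main(String[] args) {
    System.out.println("fast lookup: " + run(false));
    System.out.println("slow lookup: " + run(true));
  }
}
```

The fast path never fails because both shutdown calls land after the upload. The slow path fails because the first owner's shutdown is not aware that the second owner still has work pending.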
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Create the following BUILD file that generates 10 dummy outputs to be downloaded.
Add intermittent sleep to Bazel in GrpcCacheClient here before returning the missing digests.
Run a build with --remote_cache, --bes_backend, and --build_event_json_file set. Setting a random --remote_instance_name makes sure that the outputs will be freshly uploaded on each run. This fails >50% of the time.
What operating system are you running Bazel on?
Linux
What's the output of bazel info release?
release 3.7.1
I've reproduced on 3.1.0 and 3.7.1, but haven't tried outside of that range. It's easiest to reproduce with a custom Bazel version with the sleep.
Any other information, logs, or outputs that you want to share?
Some less than ideal fixes I've found for this include removing the shutdown call from FileTransport, or wrapping uploadLocalFiles in BuildEventArtifactUploader with uploader.retain() and uploader.release(). Don't love either of these.
Would love any advice from someone who is more familiar with this code on what a good fix would look like - would be happy to send a pull request. I've seen this error with 4 different companies we've been working with.
Not sure if this is related, but I'm hoping that the fix here might also fix another bug, which we see much more frequently and which doesn't seem to require setting build_event_json_file. I've been having a harder time reproducing that one reliably enough to file a detailed bug report. My hunch is that after commit d82341d the ThreadPool is shut down and some race condition is causing uploads to be added to the ThreadPool after it has been shut down.
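For reference, the general failure mode behind that hunch looks like this with a plain java.util.concurrent executor. This is only an illustration of what happens when work is submitted to a pool that has already been shut down; whether Bazel's pool is configured this way is an assumption, and LateSubmitSketch is a hypothetical name.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

class LateSubmitSketch {
  // Returns true if a task submitted after shutdown() is rejected.
  static boolean submitAfterShutdownIsRejected() {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    pool.shutdown(); // no new tasks are accepted from this point on
    try {
      pool.execute(() -> {}); // stands in for a late upload being queued
      return false;
    } catch (RejectedExecutionException e) {
      return true; // the default rejection policy throws for late submissions
    }
  }

  public static void main(String[] args) {
    System.out.println("rejected: " + submitAfterShutdownIsRejected());
  }
}
```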
Really appreciate any and all help - would be happy to send pull requests for fixes. Just looking for some guidance on what the right fix looks like.