-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document NOT_FOUND Execution error reporting #259
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -125,6 +125,14 @@ service Execution { | |
// operation completes, and then respond with the completed operation. The | ||
// server MAY choose to stream additional updates as execution progresses, | ||
// such as to provide an update as to the state of the execution. | ||
// | ||
// In addition to the cases describe for Execute, the WaitExecution method | ||
// may fail as follows: | ||
// | ||
// * `NOT_FOUND`: The operation no longer exists due to any of a transient | ||
// condition, an unknown operation name, or if the server implements the | ||
// Operations API DeleteOperation method and it was called for the current | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As far as I can parse the Operation API, having another stream call DeleteOperation should have no effect on the operation itself--the API says that it "indicates the client is no longer interested in the Operation" but that calling it "does not cancel the operation." On the other hand, calling CancelOperation should probably cancel the Action (best effort) in which case any calls to WaitExecution should return SUCCESS, the underlying ExecutionResponse should say the Action was cancelled. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. With the move to waitExecution for the description here, I'd say that calling DeleteOperation during the stream of Execute is not the intended context for a NOT_FOUND. Only a DeleteOperation call before a WaitExecution would apply here, which could move an operation's queryable state from valid and waitable to nonexistent. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That makes sense, although it feels like it just reduces to the "unknown operation name" at that point. I don't feel strongly about it, though. |
||
// execution. The client should call `Execute` to retry. | ||
rpc WaitExecution(WaitExecutionRequest) returns (stream google.longrunning.Operation) { | ||
option (google.api.http) = { post: "/v2/{name=operations/**}:waitExecution" body: "*" }; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean for an operation to be missing due to a "transient condition?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was in the spirit of "transient condition" of UNAVAILABLE above, in a way that could induce an operation to be considered nonexistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that Bazel considers NOT_FOUND to be retriable, so I think it's better if the server returns UNAVAILABLE or UNKNOWN in the event of an unexpected error causing a transient issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly enough, Buildbarn already had a WaitExecution implementation that returns NOT_FOUND in that case. Over the last two years I’ve been doing restarts of our scheduler on a weekly basis that causes it to lose data and thus return NOT_FOUND. So far it hasn’t caused builds to fail….?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bazel does consider NOT_FOUND retriable, both in the existing and experimental remote executors:
https://github.com/bazelbuild/bazel/blob/fed23d5a08cecbdcc4725adf77824a6c5bde1b4e/src/main/java/com/google/devtools/build/lib/remote/ExperimentalGrpcRemoteExecutor.java#L135-L136
https://github.com/bazelbuild/bazel/blob/fed23d5a08cecbdcc4725adf77824a6c5bde1b4e/src/main/java/com/google/devtools/build/lib/remote/ExperimentalGrpcRemoteExecutor.java#L160
https://github.com/bazelbuild/bazel/blob/fed23d5a08cecbdcc4725adf77824a6c5bde1b4e/src/main/java/com/google/devtools/build/lib/remote/ExperimentalGrpcRemoteExecutor.java#L202-L213
https://github.com/bazelbuild/bazel/blob/fed23d5a08cecbdcc4725adf77824a6c5bde1b4e/src/main/java/com/google/devtools/build/lib/remote/GrpcRemoteExecutor.java#L217-L221
Where it in any case does not disambiguate between execute and waitExecution producing the status.
(the description copy appears in both implementations)