Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document NOT_FOUND Execution error reporting #259

Merged
merged 1 commit into from
Jun 27, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions build/bazel/remote/execution/v2/remote_execution.proto
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,14 @@ service Execution {
// operation completes, and then respond with the completed operation. The
// server MAY choose to stream additional updates as execution progresses,
// such as to provide an update as to the state of the execution.
//
// In addition to the cases describe for Execute, the WaitExecution method
// may fail as follows:
//
// * `NOT_FOUND`: The operation no longer exists due to any of a transient
// condition, an unknown operation name, or if the server implements the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean for an operation to be missing due to a "transient condition?"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was in the spirit of "transient condition" of UNAVAILABLE above, in a way that could induce an operation to be considered nonexistent.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that Bazel considers NOT_FOUND to be retriable, so I think it's better if the server returns UNAVAILABLE or UNKNOWN in the event of an unexpected error causing a transient issue.

Copy link
Collaborator

@EdSchouten EdSchouten Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly enough, Buildbarn already had a WaitExecution implementation that returns NOT_FOUND in that case. Over the last two years I’ve been doing restarts of our scheduler on a weekly basis that causes it to lose data and thus return NOT_FOUND. So far it hasn’t caused builds to fail….?

Copy link
Collaborator Author

// Operations API DeleteOperation method and it was called for the current
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can parse the Operation API, having another stream call DeleteOperation should have no effect on the operation itself--the API says that it "indicates the client is no longer interested in the Operation" but that calling it "does not cancel the operation." On the other hand, calling CancelOperation should probably cancel the Action (best effort) in which case any calls to WaitExecution should return SUCCESS, the underlying ExecutionResponse should say the Action was cancelled.

Copy link
Collaborator Author

@werkt werkt Jun 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the move to waitExecution for the description here, I'd say that calling DeleteOperation during the stream of Execute is not the intended context for a NOT_FOUND. Only a DeleteOperation call before a WaitExecution would apply here, which could move an operation's queryable state from valid and waitable to nonexistent.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, although it feels like it just reduces to the "unknown operation name" at that point. I don't feel strongly about it, though.

// execution. The client should call `Execute` to retry.
rpc WaitExecution(WaitExecutionRequest) returns (stream google.longrunning.Operation) {
option (google.api.http) = { post: "/v2/{name=operations/**}:waitExecution" body: "*" };
}
Expand Down