Document NOT_FOUND Execution error reporting #259
@@ -99,6 +99,9 @@ service Execution {
// * `CANCELLED`: The operation was cancelled by the client. This status is
//   only possible if the server implements the Operations API CancelOperation
//   method, and it was called for the current execution.
// * `NOT_FOUND`: Due to a transient condition, an unknown operation name,
I don't think that this error condition also applies to Execute(). That call doesn't take an operation name, right? It's specific to WaitExecution().
Transiency could well occur during Execute, in the same timeframe that an Execute request might end, and a WaitExecution would be made. The errors that WaitExecution could see are also summarized here, and I made that explicit with the add below.
> Transiency could well occur during Execute, in the same timeframe that an Execute request might end, and a WaitExecution would be made.
This part I don't really understand. All I'm saying is, can Execute() fail with NOT_FOUND because an invalid operation name was specified? My understanding is that this cannot occur, because it is never called with an operation name.
My recommendation would be to leave the description of Execute() as is, but to replace this sentence that you added to WaitExecution:
This method will report errors consistent with `Execute`.
With something along the lines of:
In addition to the cases described for `Execute`, the `WaitExecution` method may fail as follows:
* `NOT_FOUND`: Description goes here.
It's a subtle point, but I think specifying this for WaitExecution alone will probably not spare clients from having to handle it as a response to either method. The sentiment is that if a WaitExecution would return a NOT_FOUND at some point due to operation disappearance, then a currently active Execute call should also return that.
Putting it into the description for WaitExecution will suffice, I'll update.
> The sentiment is that if a WaitExecution would return a NOT_FOUND at some point due to operation disappearance, then a currently active Execute call should also return that.
WaitExecution should basically do the following things:
1. Look up an operation by operation name.
2. Potentially attach/associate the current gRPC call to that operation, depending on what the implementation looks like.
3. Report events/mutations that occur against the operation.
I think you're stating above that NOT_FOUND should be returned if an operation were to disappear during (3). But I don't think that's desirable. There are gRPC status codes that are far more relevant to situations like these, and we should use those instead. (CANCELLED? ABORTED? Pick one.)
In my opinion NOT_FOUND should only be returned if step (1) fails. As step (1) is specific to WaitExecution and not Execute, there is thus no point in documenting NOT_FOUND as a valid status code for Execute.
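To make the proposal above concrete, here is a minimal sketch of the three WaitExecution steps, assuming a toy in-memory scheduler. `Scheduler`, `Operation`, the `lost` flag, and the local `Code` enum are hypothetical stand-ins, not part of the Remote Execution API or of any real server. ABORTED is picked arbitrarily for the mid-stream case, since the comment leaves that choice open.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class Code(Enum):
    # Local stand-in whose member names mirror gRPC status code names.
    OK = "OK"
    NOT_FOUND = "NOT_FOUND"
    ABORTED = "ABORTED"


@dataclass
class Operation:
    name: str
    events: List[str] = field(default_factory=list)
    # Hypothetical flag: the operation's state vanished while being streamed.
    lost: bool = False


class Scheduler:
    """Toy scheduler that keeps operations in a dict keyed by name."""

    def __init__(self) -> None:
        self.operations: Dict[str, Operation] = {}

    def wait_execution(self, name: str) -> Tuple[Code, List[str]]:
        # Step 1: look up the operation by name.
        # This is the only place NOT_FOUND is returned.
        op = self.operations.get(name)
        if op is None:
            return Code.NOT_FOUND, []
        # Step 2: attach the call to the operation (a no-op in this toy).
        # Step 3: report events. Losing the operation here is reported
        # with a different code (ABORTED is an assumption of this sketch).
        if op.lost:
            return Code.ABORTED, list(op.events)
        return Code.OK, list(op.events)
```

Under this reading, Execute never performs step 1, so it has no path that yields NOT_FOUND.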
Alright then. Moved to above WaitExecution in an updated commit already.
Force-pushed: 477dec4 to 869e49e
Force-pushed: 869e49e to 376a8a5
Force-pushed: 376a8a5 to 26ce6f1
// may fail as follows:
//
// * `NOT_FOUND`: The operation no longer exists due to any of a transient
//   condition, an unknown operation name, or if the server implements the
What does it mean for an operation to be missing due to a "transient condition?"
This was meant in the same spirit as the "transient condition" described for UNAVAILABLE above: one that could cause an operation to be considered nonexistent.
I don't think that Bazel considers NOT_FOUND to be retriable, so I think it's better if the server returns UNAVAILABLE or UNKNOWN in the event of an unexpected error causing a transient issue.
Interestingly enough, Buildbarn already had a WaitExecution implementation that returns NOT_FOUND in that case. Over the last two years I’ve been restarting our scheduler on a weekly basis, which causes it to lose data and thus return NOT_FOUND. So far it hasn’t caused builds to fail….?
Bazel does consider NOT_FOUND retriable, both in the existing and experimental remote executors:
https://github.com/bazelbuild/bazel/blob/fed23d5a08cecbdcc4725adf77824a6c5bde1b4e/src/main/java/com/google/devtools/build/lib/remote/ExperimentalGrpcRemoteExecutor.java#L135-L136
https://github.com/bazelbuild/bazel/blob/fed23d5a08cecbdcc4725adf77824a6c5bde1b4e/src/main/java/com/google/devtools/build/lib/remote/ExperimentalGrpcRemoteExecutor.java#L160
https://github.com/bazelbuild/bazel/blob/fed23d5a08cecbdcc4725adf77824a6c5bde1b4e/src/main/java/com/google/devtools/build/lib/remote/ExperimentalGrpcRemoteExecutor.java#L202-L213
https://github.com/bazelbuild/bazel/blob/fed23d5a08cecbdcc4725adf77824a6c5bde1b4e/src/main/java/com/google/devtools/build/lib/remote/GrpcRemoteExecutor.java#L217-L221
In any case, it does not distinguish between Execute and WaitExecution as the source of the status.
(the description copy appears in both implementations)
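For readers who want to picture client handling of this kind, the following is a rough sketch and not Bazel's actual code. It assumes that a NOT_FOUND means the operation was lost, so the client starts over with Execute, and that an UNAVAILABLE lets it resume with WaitExecution when an operation name is already known. `RpcError`, `run_action`, and the `execute` / `wait_execution` callables are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional


class RpcError(Exception):
    """Hypothetical error carrying a status code name and, if the server
    sent one before the failure, the operation name."""

    def __init__(self, code: str, operation_name: Optional[str] = None):
        super().__init__(code)
        self.code = code
        self.operation_name = operation_name


def run_action(
    execute: Callable[[], str],
    wait_execution: Callable[[str], str],
    max_attempts: int = 5,
) -> str:
    name: Optional[str] = None
    for _ in range(max_attempts):
        try:
            # Resume an existing operation if we know its name,
            # otherwise start a fresh execution.
            return wait_execution(name) if name else execute()
        except RpcError as err:
            if err.code == "NOT_FOUND":
                # Operation is gone on the server: forget it, call Execute.
                name = None
            elif err.code == "UNAVAILABLE":
                # Transient failure: keep or learn the operation name.
                name = err.operation_name or name
            else:
                # Anything else is treated as non-retriable in this sketch.
                raise
    raise RuntimeError("retries exhausted")
```

Note that the retry branch only inspects the status code, not which call produced it, which matches the observation that the two methods are not distinguished.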
//
// * `NOT_FOUND`: The operation no longer exists due to any of a transient
//   condition, an unknown operation name, or if the server implements the
//   Operations API DeleteOperation method and it was called for the current
As far as I can parse the Operations API, having another stream call DeleteOperation should have no effect on the operation itself: the API says that it "indicates the client is no longer interested in the Operation" but that calling it "does not cancel the operation." On the other hand, calling CancelOperation should probably cancel the Action (best effort), in which case any calls to WaitExecution should return SUCCESS and the underlying ExecutionResponse should say the Action was cancelled.
With the move to waitExecution for the description here, I'd say that calling DeleteOperation during the stream of Execute is not the intended context for a NOT_FOUND. Only a DeleteOperation call before a WaitExecution would apply here, which could move an operation's queryable state from valid and waitable to nonexistent.
That makes sense, although it feels like it just reduces to the "unknown operation name" at that point. I don't feel strongly about it, though.
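The "reduces to an unknown operation name" point can be illustrated with a small sketch. It assumes a server that drops its record of an operation when DeleteOperation is called; `OperationStore` and its methods are hypothetical and exist only for this example.

```python
from typing import Dict


class OperationStore:
    """Toy store mapping operation names to a state string."""

    def __init__(self) -> None:
        self._ops: Dict[str, str] = {}

    def add(self, name: str, state: str = "EXECUTING") -> None:
        self._ops[name] = state

    def delete_operation(self, name: str) -> None:
        # Assumption of this sketch: deleting only removes the record.
        # It does not cancel any underlying work.
        self._ops.pop(name, None)

    def wait_execution(self, name: str) -> str:
        # A deleted name and a never-issued name take the same path here,
        # so the caller cannot tell them apart.
        return "OK" if name in self._ops else "NOT_FOUND"
```

In this model a WaitExecution call made after DeleteOperation sees the same result as one made with a name the server never issued.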
ping on this? I don't have merge privs.