Document NOT_FOUND Execution error reporting #259

werkt · 2023-05-29T00:38:56Z

No description provided.

EdSchouten · 2023-05-29T07:22:53Z

build/bazel/remote/execution/v2/remote_execution.proto

@@ -99,6 +99,9 @@ service Execution {
  // * `CANCELLED`: The operation was cancelled by the client. This status is
  //   only possible if the server implements the Operations API CancelOperation
  //   method, and it was called for the current execution.
+  // * `NOT_FOUND`: Due to a transient condition, an unknown operation name,


I don't think that this error condition also applies to Execute(). That call doesn't take an operation name, right? It's specific to WaitExecution().

Transiency could well occur during Execute, in the same timeframe that an Execute request might end, and a WaitExecution would be made. The errors that WaitExecution can see are also summarized here, and I made that explicit with the addition below.

Transiency could well occur during Execute, in the same timeframe that an Execute request might end, and a WaitExecution would be made.

This part I don't really understand. All I'm saying is, can Execute() fail with NOT_FOUND because an invalid operation name was specified? My understanding is that this cannot occur, because it is never called with an operation name.

My recommendation would be to leave the description of Execute() as is, but to replace this sentence that you added to WaitExecution:

This method will report errors consistent with Execute.

With something along the lines of:

In addition to the cases described for Execute, the WaitExecution method may fail as follows:

NOT_FOUND: Description goes here.

It's a subtle point, but I think specifying this for WaitExecution alone will probably not spare clients from having to handle it as a response to either method. The sentiment is that if a WaitExecution would return a NOT_FOUND at some point due to operation disappearance, then a currently active Execute call should also return that.

Putting it into the description for WaitExecution will suffice, I'll update.

The sentiment is that if a WaitExecution would return a NOT_FOUND at some point due to operation disappearance, then a currently active Execute call should also return that.

WaitExecution should basically do the following things:

1. Look up an operation by operation name.

2. Potentially attach/associate the current gRPC call to that operation, depending on what the implementation looks like.

3. Report events/mutations that occur against the operation.

I think you're stating above that NOT_FOUND should be returned if an operation were to disappear during (3). But I don't think that's desirable. There are gRPC status codes that are far more relevant to situations like these, and we should use those instead. (CANCELLED? ABORTED? Pick one.)

In my opinion NOT_FOUND should only be returned if step (1) fails. As step (1) is specific to WaitExecution and not Execute, there is thus no point in documenting NOT_FOUND as a valid status code for Execute.
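
The three steps and the proposed status-code split can be sketched as follows. This is a minimal illustration, not any real server's implementation: `Scheduler`, `RpcError`, and the `LOST` marker are hypothetical names, and the choice of `ABORTED` for a mid-stream disappearance is an assumption taken from the candidates named above.

```python
from enum import Enum
from typing import Dict, Iterator, List


class Code(Enum):
    """Subset of gRPC status codes, with their standard numeric values."""
    OK = 0
    NOT_FOUND = 5
    ABORTED = 10


class RpcError(Exception):
    """Stand-in for a gRPC error; hypothetical, not a real gRPC class."""

    def __init__(self, code: Code, message: str) -> None:
        super().__init__(f"{code.name}: {message}")
        self.code = code


# Hypothetical marker event meaning "the operation vanished mid-stream".
LOST = "<lost>"


class Scheduler:
    """Hypothetical in-memory scheduler keyed by operation name."""

    def __init__(self) -> None:
        self.operations: Dict[str, List[str]] = {}

    def wait_execution(self, name: str) -> Iterator[str]:
        # Step 1: look up the operation. Only this step reports NOT_FOUND.
        if name not in self.operations:
            raise RpcError(Code.NOT_FOUND, f"unknown operation {name!r}")
        # Step 2: attach this call to the operation (here: grab its events).
        events = self.operations[name]
        # Step 3: report events; done lazily by the generator below.
        return self._report(name, events)

    def _report(self, name: str, events: List[str]) -> Iterator[str]:
        for event in events:
            if event == LOST:
                # Assumption: ABORTED is one of the candidates suggested above.
                raise RpcError(Code.ABORTED, f"operation {name!r} disappeared")
            yield event
```

Because Execute never performs step (1), nothing in this sketch would let it produce `NOT_FOUND`.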

Alright then. Moved to above WaitExecution in an updated commit already.

bergsieker · 2023-06-09T21:05:06Z

build/bazel/remote/execution/v2/remote_execution.proto

+  // may fail as follows:
+  //
+  // * `NOT_FOUND`: The operation no longer exists due to any of a transient
+  //   condition, an unknown operation name, or if the server implements the


What does it mean for an operation to be missing due to a "transient condition?"

This was meant in the same spirit as the "transient condition" wording for UNAVAILABLE above: a transient server-side condition that causes an operation to be considered nonexistent.

I don't think that Bazel considers NOT_FOUND to be retriable, so I think it's better if the server returns UNAVAILABLE or UNKNOWN in the event of an unexpected error causing a transient issue.

Interestingly enough, Buildbarn already had a WaitExecution implementation that returns NOT_FOUND in that case. For the last two years I’ve been restarting our scheduler on a weekly basis, which causes it to lose data and thus return NOT_FOUND. So far it hasn’t caused builds to fail...?

Bazel does consider NOT_FOUND retriable, both in the existing and experimental remote executors:

https://github.com/bazelbuild/bazel/blob/fed23d5a08cecbdcc4725adf77824a6c5bde1b4e/src/main/java/com/google/devtools/build/lib/remote/ExperimentalGrpcRemoteExecutor.java#L135-L136
https://github.com/bazelbuild/bazel/blob/fed23d5a08cecbdcc4725adf77824a6c5bde1b4e/src/main/java/com/google/devtools/build/lib/remote/ExperimentalGrpcRemoteExecutor.java#L160
https://github.com/bazelbuild/bazel/blob/fed23d5a08cecbdcc4725adf77824a6c5bde1b4e/src/main/java/com/google/devtools/build/lib/remote/ExperimentalGrpcRemoteExecutor.java#L202-L213
https://github.com/bazelbuild/bazel/blob/fed23d5a08cecbdcc4725adf77824a6c5bde1b4e/src/main/java/com/google/devtools/build/lib/remote/GrpcRemoteExecutor.java#L217-L221

In either case, it does not distinguish whether Execute or WaitExecution produced the status.

(the description copy appears in both implementations)
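
The client behavior under discussion can be sketched roughly as below. This is not Bazel's code: `execute_with_retry`, `RpcError`, and its `operation_name` attribute are hypothetical, and the policy shown (treat `NOT_FOUND` as "start over with Execute", treat `UNAVAILABLE` as "reattach if an operation name is known") is an assumption about one reasonable way to make `NOT_FOUND` retriable without caring which call produced it.

```python
from enum import Enum
from typing import Callable, Optional


class Code(Enum):
    """Subset of gRPC status codes, with their standard numeric values."""
    NOT_FOUND = 5
    UNAVAILABLE = 14


class RpcError(Exception):
    """Stand-in for a gRPC error; hypothetical, not a real gRPC class."""

    def __init__(self, code: Code, operation_name: Optional[str] = None) -> None:
        super().__init__(code.name)
        self.code = code
        # Name learned from the stream before it failed, if any.
        self.operation_name = operation_name


def execute_with_retry(
    execute: Callable[[], str],
    wait_execution: Callable[[str], str],
    max_attempts: int = 5,
) -> str:
    operation_name: Optional[str] = None
    for _ in range(max_attempts):
        try:
            if operation_name is None:
                return execute()
            return wait_execution(operation_name)
        except RpcError as e:
            if e.code is Code.NOT_FOUND:
                # The server no longer knows the operation, whichever call
                # reported it: forget the name and issue a fresh Execute.
                operation_name = None
            elif e.code is Code.UNAVAILABLE:
                # Transient failure: reattach if a name is known.
                operation_name = e.operation_name or operation_name
            else:
                raise
    raise RuntimeError("retries exhausted")
```

With this shape, a scheduler restart that drops its operations costs the client one extra Execute rather than a failed build, which is consistent with the Buildbarn experience described above.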

bergsieker · 2023-06-09T21:12:35Z

build/bazel/remote/execution/v2/remote_execution.proto

+  //
+  // * `NOT_FOUND`: The operation no longer exists due to any of a transient
+  //   condition, an unknown operation name, or if the server implements the
+  //   Operations API DeleteOperation method and it was called for the current


As far as I can parse the Operations API, having another stream call DeleteOperation should have no effect on the operation itself: the API says that it "indicates the client is no longer interested in the Operation" but that calling it "does not cancel the operation." On the other hand, calling CancelOperation should probably cancel the Action (best effort), in which case any calls to WaitExecution should return OK and the underlying ExecuteResponse should say the Action was cancelled.

Now that the description has moved to WaitExecution, I'd say that calling DeleteOperation during an Execute stream is not the intended context for a NOT_FOUND. Only a DeleteOperation call made before a WaitExecution would apply here, since it could move an operation's queryable state from valid and waitable to nonexistent.

That makes sense, although it feels like it just reduces to the "unknown operation name" at that point. I don't feel strongly about it, though.
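
A small sketch of why the two cases collapse: if DeleteOperation simply drops the server's record, a later WaitExecution cannot tell a deleted name from one that never existed. `OperationStore` and `NotFound` are hypothetical names, and the sketch deliberately models only the lookup record, not the underlying action (which, per the Operations API wording quoted above, DeleteOperation does not cancel).

```python
from typing import Dict


class NotFound(Exception):
    """Stand-in for a gRPC NOT_FOUND status; hypothetical."""


class OperationStore:
    """Hypothetical record of operations that clients may still wait on."""

    def __init__(self) -> None:
        self._states: Dict[str, str] = {}

    def create(self, name: str) -> None:
        self._states[name] = "QUEUED"

    def delete_operation(self, name: str) -> None:
        # Drops only the queryable record; the action itself is not modeled.
        self._states.pop(name, None)

    def wait_execution(self, name: str) -> str:
        # A deleted name and a never-issued name fail the same lookup.
        if name not in self._states:
            raise NotFound(f"operation {name!r} does not exist")
        return self._states[name]
```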

werkt · 2023-06-23T14:21:28Z

ping on this? I don't have merge privs.

werkt requested a review from bergsieker as a code owner May 29, 2023 00:38

EdSchouten requested changes May 29, 2023

werkt force-pushed the not-found-execution branch 2 times, most recently from 477dec4 to 869e49e Compare May 31, 2023 14:00

EdSchouten reviewed May 31, 2023

werkt force-pushed the not-found-execution branch from 869e49e to 376a8a5 Compare May 31, 2023 21:04

Document NOT_FOUND Execution error reporting

26ce6f1

werkt force-pushed the not-found-execution branch from 376a8a5 to 26ce6f1 Compare May 31, 2023 21:05

EdSchouten approved these changes Jun 1, 2023

bergsieker reviewed Jun 9, 2023

bergsieker approved these changes Jun 27, 2023

bergsieker merged commit 068363a into bazelbuild:main Jun 27, 2023

werkt deleted the not-found-execution branch June 28, 2023 00:51

werkt commented May 29, 2023

EdSchouten May 29, 2023

werkt May 29, 2023

EdSchouten May 29, 2023

werkt May 31, 2023

EdSchouten May 31, 2023

werkt May 31, 2023

bergsieker Jun 9, 2023

werkt Jun 10, 2023

bergsieker Jun 12, 2023

EdSchouten Jun 12, 2023

werkt Jun 15, 2023

bergsieker Jun 9, 2023

werkt Jun 10, 2023

bergsieker Jun 12, 2023

werkt commented Jun 23, 2023
