Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document NOT_FOUND Execution error reporting #259

Merged
merged 1 commit into from
Jun 27, 2023

Conversation

werkt
Copy link
Collaborator

@werkt werkt commented May 29, 2023

No description provided.

@werkt werkt requested a review from bergsieker as a code owner May 29, 2023 00:38
@@ -99,6 +99,9 @@ service Execution {
// * `CANCELLED`: The operation was cancelled by the client. This status is
// only possible if the server implements the Operations API CancelOperation
// method, and it was called for the current execution.
// * `NOT_FOUND`: Due to a transient condition, an unknown operation name,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this error condition also applies to Execute(). That call doesn't take an operation name, right? It's specific to WaitExecution().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Transiency could well occur during Execute, in the same timeframe that an Execute request might end, and a WaitExecution would be made. The errors that WaitExecution could see are also summarized here, and I made that explicit with the add below.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Transiency could well occur during Execute, in the same timeframe that an Execute request might end, and a WaitExecution would be made.

This part I don't really understand. All I'm saying is, can Execute() fail with NOT_FOUND because an invalid operation name was specified? My understand is that this cannot occur, because it is never called with an operation name.

My recommendation would be to leave the description of Execute() as is, but to replace this sentence that you added to WaitExecution:

This method will report errors consistent with Execute.

With something along the lines of:

In addition to the cases describe for Execute, the WaitExecution method may fail as follows:

  • NOT_FOUND: Description goes here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a subtle point, but I think specifying this with WaitExecution alone will probably not prevent Execute interpretation on clients from needing to handle it as a response to either method. The sentiment is that if a WaitExecution would return a NOT_FOUND at some point due to operation disappearance, then a currently active Execute call should also return that.

Putting it into the description for WaitExecution will suffice, I'll update.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sentiment is that if a WaitExecution would return a NOT_FOUND at some point due to operation disappearance, then a currently active Execute call should also return that.

WaitExecution should basically do the following things:

  1. Look up an operation by operation name.
  2. Potentially attach/associate the current gRPC call to that operation, depending on what the implementation looks like.
  3. Report events/mutations that occur against the operation.

I think you're stating above that NOT_FOUND should be returned if an operation were to disappear during (3). But I don't think that's desirable. There are gRPC status codes that are far more relevant to situations like these, and we should use those instead. (CANCELLED? ABORTED? Pick one.)

In my opinion NOT_FOUND should only be returned if step (1) fails. As step (1) is specific to WaitExecution and not Execute, there is thus no point in documenting NOT_FOUND as a valid status code for Execute.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright then. Moved to above WaitExecution in an updated commit already.

build/bazel/remote/execution/v2/remote_execution.proto Outdated Show resolved Hide resolved
build/bazel/remote/execution/v2/remote_execution.proto Outdated Show resolved Hide resolved
@werkt werkt force-pushed the not-found-execution branch 2 times, most recently from 477dec4 to 869e49e Compare May 31, 2023 14:00
@werkt werkt force-pushed the not-found-execution branch from 869e49e to 376a8a5 Compare May 31, 2023 21:04
@werkt werkt force-pushed the not-found-execution branch from 376a8a5 to 26ce6f1 Compare May 31, 2023 21:05
// may fail as follows:
//
// * `NOT_FOUND`: The operation no longer exists due to any of a transient
// condition, an unknown operation name, or if the server implements the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean for an operation to be missing due to a "transient condition?"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was in the spirit of "transient condition" of UNAVAILABLE above, in a way that could induce an operation to be considered nonexistent.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that Bazel considers NOT_FOUND to be retriable, so I think it's better if the server returns UNAVAILABLE or UNKNOWN in the event of an unexpected error causing a transient issue.

Copy link
Collaborator

@EdSchouten EdSchouten Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly enough, Buildbarn already had a WaitExecution implementation that returns NOT_FOUND in that case. Over the last two years I’ve been doing restarts of our scheduler on a weekly basis that causes it to lose data and thus return NOT_FOUND. So far it hasn’t caused builds to fail….?

Copy link
Collaborator Author

//
// * `NOT_FOUND`: The operation no longer exists due to any of a transient
// condition, an unknown operation name, or if the server implements the
// Operations API DeleteOperation method and it was called for the current
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can parse the Operation API, having another stream call DeleteOperation should have no effect on the operation itself--the API says that it "indicates the client is no longer interested in the Operation" but that calling it "does not cancel the operation." On the other hand, calling CancelOperation should probably cancel the Action (best effort) in which case any calls to WaitExecution should return SUCCESS, the underlying ExecutionResponse should say the Action was cancelled.

Copy link
Collaborator Author

@werkt werkt Jun 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the move to waitExecution for the description here, I'd say that calling DeleteOperation during the stream of Execute is not the intended context for a NOT_FOUND. Only a DeleteOperation call before a WaitExecution would apply here, which could move an operation's queryable state from valid and waitable to nonexistent.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, although it feels like it just reduces to the "unknown operation name" at that point. I don't feel strongly about it, though.

@werkt
Copy link
Collaborator Author

werkt commented Jun 23, 2023

ping on this? I don't have merge privs.

@bergsieker bergsieker merged commit 068363a into bazelbuild:main Jun 27, 2023
@werkt werkt deleted the not-found-execution branch June 28, 2023 00:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants