Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add error handling for host allocation failures #1149

Merged
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 21 additions & 3 deletions cuegui/cuegui/MenuActions.py
Original file line number Diff line number Diff line change
Expand Up @@ -1565,10 +1565,28 @@ def changeAllocation(self, rpcObjects=None):
self._caller, title, body, sorted(allocations.keys()), 0, False)
if choice:
allocation = allocations[str(allocationName)]
error_hosts = []
for host in hosts:
self.cuebotCall(host.setAllocation,
"Set Allocation on %s Failed" % host.data.name,
allocation)
# pylint: disable=broad-except
try:
self.cuebotCall(host.setAllocation,
"Set Allocation on %s Failed" % host.data.name,
allocation)
except Exception as e:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more specific exception type you can catch here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not. The gRPC call returns a Rendevous exception, hence I parse the exception details to check if it's an allocation error

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recently stumbled upon a similar grpc exception handling issue and this is a wide spread issue, most locations on the code are catching a general exception to handle grpc exceptions.

A better option is to catch grpc.RpcError and check the .details() to check the msg.

try:
  something_that_triggers_rendezvous()
except grpc.RpcError as rpc_error_call:  
  code = rpc_error_call.code() # (grpc.StatusCode.UNAVAILABLE, grpc.StatusCode.CANCELLED, ...)
  details = rpc_error_call.details()

See for exemple connectGrpcWithRetries on rqnetwork.py.

IMO this can be handled in another PR as it's a widespread implementation error.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bcipriano waiting for your confirmation to merge this

# Handle allocation modification errors separately
# pylint: disable=no-member
if (hasattr(e, "details") and
"EntityModificationError" in str(e.details())):
error_hosts.append(host.name())
else:
raise
if error_hosts:
error_msg = "{hosts} not moved.\nHosts with reserved cores " \
"cannot be moved between allocations."
QtWidgets.QMessageBox.warning(self._caller,
"Warning",
error_msg.format(hosts=", ".join(error_hosts)),
QtWidgets.QMessageBox.Ok)
self._update()

setRepair_info = ["Set Repair State", None, "configure"]
Expand Down