Simplify fault and restart code (1/3) #1568

Open
wants to merge 4 commits into base: main

@mkeeter mkeeter commented Nov 20, 2024

The conditions under which we restart a client IO task are somewhat baroque:

  • It may halt spontaneously (e.g. upon a socket error)
  • We may fault it explicitly (e.g. upon IO error). In this case, we want to skip all work associated with this client.
  • We may restart the IO task if initial negotiation fails
  • ...and when replacing a Downstairs
  • ...and when deactivated

This PR adds more specific types: enum ClientNegotiationFailed and enum ClientFaultReason. Mid-negotiation restarts must provide a ClientNegotiationFailed; faulting the IO task must provide a ClientFaultReason. Both of these types are then converted into a ClientStopReason for logging.
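
A minimal sketch of how these types could fit together, for readers who have not opened the diff. Only `ClientNegotiationFailed::Replaced`, `ClientFaultReason::FailedLiveRepair`, `ClientStopReason::Replaced`, and `ClientStopReason::NegotiationFailed(..)` appear in this PR's discussion. The `Fault` wrapper variant, the `IoError` variant, and the `describe_stop` helper are assumptions made for illustration, not the actual code.

```rust
/// Why a restart was requested while still negotiating.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ClientNegotiationFailed {
    /// Named in the review discussion
    Replaced,
}

/// Why a running client IO task was faulted.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ClientFaultReason {
    /// Named in the diff context
    FailedLiveRepair,
    /// Hypothetical variant, standing in for "upon IO error"
    IoError,
}

/// The combined reason that ends up in the logs.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ClientStopReason {
    Replaced,
    NegotiationFailed(ClientNegotiationFailed),
    /// Assumed name for the variant wrapping a fault reason
    Fault(ClientFaultReason),
}

impl From<ClientNegotiationFailed> for ClientStopReason {
    fn from(f: ClientNegotiationFailed) -> Self {
        ClientStopReason::NegotiationFailed(f)
    }
}

impl From<ClientFaultReason> for ClientStopReason {
    fn from(f: ClientFaultReason) -> Self {
        ClientStopReason::Fault(f)
    }
}

/// Hypothetical logging helper: accepts either specific type and
/// converts it into the general stop reason before formatting.
pub fn describe_stop(reason: impl Into<ClientStopReason>) -> String {
    format!("{:?}", reason.into())
}
```

Because each call site takes the narrower enum, a mid-negotiation restart cannot accidentally be given a fault reason (or vice versa); the conversion to `ClientStopReason` happens only at the logging boundary.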

In addition, faulting a client is now done through Downstairs::fault_client instead of calling both Downstairs::skip_all_jobs and DownstairsClient::fault. Having a single function makes this harder to mess up!
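
The single-entry-point idea can be sketched as follows. This is an illustration under stated assumptions, not the real implementation: the struct fields, the `new`, `jobs_skipped`, and `fault_reason` helpers, and the two-argument signature of `fault_client` are all hypothetical (the diff context suggests the real call takes additional arguments such as `up_state`).

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ClientFaultReason {
    FailedLiveRepair,
}

/// Stand-in for the per-client state (fields are hypothetical).
#[derive(Debug, Default)]
pub struct DownstairsClient {
    fault: Option<ClientFaultReason>,
}

impl DownstairsClient {
    /// Private: callers outside this module cannot fault a client
    /// without also skipping its jobs.
    fn fault(&mut self, reason: ClientFaultReason) {
        self.fault = Some(reason);
    }
}

/// Stand-in for the Downstairs (fields are hypothetical).
#[derive(Debug, Default)]
pub struct Downstairs {
    clients: Vec<DownstairsClient>,
    /// `skipped[i]` records that client `i`'s jobs were skipped
    skipped: Vec<bool>,
}

impl Downstairs {
    pub fn new(n: usize) -> Self {
        Downstairs {
            clients: (0..n).map(|_| DownstairsClient::default()).collect(),
            skipped: vec![false; n],
        }
    }

    fn skip_all_jobs(&mut self, i: usize) {
        self.skipped[i] = true;
    }

    /// The one public way to fault a client: always does both steps.
    pub fn fault_client(&mut self, i: usize, reason: ClientFaultReason) {
        self.skip_all_jobs(i);
        self.clients[i].fault(reason);
    }

    pub fn jobs_skipped(&self, i: usize) -> bool {
        self.skipped[i]
    }

    pub fn fault_reason(&self, i: usize) -> Option<ClientFaultReason> {
        self.clients[i].fault
    }
}
```

With the two lower-level steps kept private, a caller cannot fault a client and forget to skip its jobs, which is the failure mode the combined function is meant to rule out.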

@mkeeter mkeeter force-pushed the mkeeter/simplify-faults branch 2 times, most recently from 3d11ac2 to de39d6e Compare December 2, 2024 14:44
@mkeeter mkeeter changed the title Simplify fault and restart code Simplify fault and restart code (1/3) Dec 2, 2024
@mkeeter mkeeter force-pushed the mkeeter/simplify-faults branch from de39d6e to 26823a9 Compare December 3, 2024 15:55

@leftwo leftwo left a comment

I think this is good, one nit and one question/confirmation.

@@ -2202,25 +2180,53 @@ pub(crate) struct DownstairsStats {
 #[derive(Debug)]
 pub(crate) enum ClientStopReason {
     /// We are about to replace the client task
-    Replacing,
+    Replaced,
Contributor

I think it's still Replacing here, as that's what we are doing.

If we request to replace a downstairs during negotiation, that's ClientNegotiationFailed(Replacing)
If we request to replace a downstairs after activation, does that become ClientStopReason(Replaced)?

Contributor Author

I agree that ClientStopReason::Replacing makes more sense here.

> If we request to replace a downstairs during negotiation, that's ClientNegotiationFailed(Replacing)
> If we request to replace a downstairs after activation, does that become ClientStopReason(Replaced)?

Close, the two options are ClientStopReason::NegotiationFailed(ClientNegotiationFailed::Replaced) and ClientStopReason::Replaced.

i,
up_state,
ClientFaultReason::FailedLiveRepair,
);
Contributor

If we no longer need the Faulted check here, it means that when this DS went to Faulted, it must have gone through the fault_client() path, which would have triggered the skip_all_jobs() call.

I don't believe that was always true, but it (as far as I can tell) is true now.

Contributor Author

I think I agree with you – the DsState::Faulted check is now in the catchall _ => { .. } branch below. You don't want anything to change here, right?

Contributor

> I think I agree with you – the DsState::Faulted check is now in the catchall _ => { .. } branch below. You don't want anything to change here, right?

Correct, no changes, just making sure I'm understanding the impact of the change and my logic is correct.

3 participants