-
Notifications
You must be signed in to change notification settings - Fork 238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A17: Client-Side Health Checking #97
A17: Client-Side Health Checking #97
Conversation
by keepalives, at which point the client would disconnect. But if the | ||
problem is caused by a bug in the health-checking service, then it's | ||
possible that a server could still be responding but has failed to notify | ||
the client that it is unhealthy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also advertise the server's status via trailer metadata. So active RPCs that are issued while the service is unhealthy could attach the trailer and client could learn of the server's change in disposition that way, as a fallback (in case server failed to deliver update on main health-checking stream). This would not rely on polling, and HTTP/2 header compression should be able to encode the value efficiently if it is included in trailer for every RPC, so overhead would be very low. So if a channel were idle and became unhealthy (and server failed to deliver update), the client would not discover this until its first RPC on that channel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That could certainly work, but it seems like a lot of overhead in terms of both code complexity (there would be a lot of plumbing involved) and performance (the cost of encoding and decoding the metadata -- it turns out that from a performance perspective, HTTP/2 header compression is actually one our biggest expenses). I'm not sure it's really worthwhile for something that is a very unlikely failure mode to begin with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markdroth, I'm not sure what performance cost your concerned about since the header would only need to exist when there's a problem. The bigger questions in my mind are 1) how do we know the "scope" of the failure trailer, that is, which services/methods are impacted and 2) how do we know when the service improves?
It does also require issuing an RPC to a unhealthy backend before you learn it is unhealthy, but depending on who you talk to that's more minor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may be right that the performance isn't a big deal. But the code complexity still seems significant. I'm just not sure it's worth this much trouble to address something that's extremely unlikely to begin with.
A17-client-side-health-checking.md
Outdated
## Proposal | ||
|
||
The gRPC client will be configured to send health-checking RPCs to | ||
each backend that it is connected to. Whenever a backend responds as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify: are you saying backend as opposed to balancers? Do balancers no longer get health checked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Balancers have never been health-checked, and this design doesn't change that. (Although we do have some other plans in the works unrelated to this design that may add the ability to do that.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the phrasing here should be " The gRPC Client can be configured to send"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
A17-client-side-health-checking.md
Outdated
|
||
The current health-checking protocol is a unary API, where the client is | ||
expected to periodically poll the server. That was sufficient for use | ||
via UHC, where the health checks come from a small number of centralized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UHC is undefined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, fixed.
A17-client-side-health-checking.md
Outdated
because (a) it won't affect the overall state of the parent channel | ||
unless all subchannels are in the same state at the same time and (b) | ||
because the connectivity state API can drop states anyway, the application | ||
won't be able to tell that we didn't stop in `CONNECTING` in between anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe remove the second "anyway."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropped the first one instead.
If the `Watch()` call fails with status `UNIMPLEMENTED`, the client will | ||
act as if health checking is disabled. That is, it will not retry | ||
the health-checking call, but it will treat the channel as healthy | ||
(connectivity state `READY`). However, the client will record a [channel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be another point that will be interesting/hard for Java to figure out what to do, as creating trace events is very internal to the transport, including its synchronization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted. I don't think this requires any changes here, though.
A17-client-side-health-checking.md
Outdated
attempt will be subject to exponential backoff. When the next attempt | ||
starts, the subchannel will transition to state `CONNECTING`. | ||
|
||
When the client channel is shutting down or when the backend sends a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"client channel is shutting down" is a bit too early. I'd suggest "client subchannel is shutting down." When you call channel shutdown, existing RPCs continue. RPCs can be queued waiting for a connection to become available, and that shouldn't be interfered with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Fixed.
A17-client-side-health-checking.md
Outdated
whenever a backend transitions from healthy to unhealthy, it may still | ||
receive a small number of RPCs that were already in flight from the | ||
client before the client received the notification that the backend is | ||
unhealthy. This race condition lasts approximately the one-way network |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually a full RTT, since the client could have sent a new RPC ½ RTT after the server went unhealthy (just before receiving the notification) and that RPC will take ½ RTT to reach the server.
It sounds like you're trying to talk the delay on client-side, but before you say "[the server] may still receive a small number of RPCs..." which is from the server's perspective.
Honestly, I don't think this is about how much RTT is involved. To me it's a queuing problem and a there-isn't-actually-a-race-here. When a connection goes active a large number of RPCs can be issued all at once, because they were queued while we were waiting for the connection to become active. Avoiding that "burst" from going to a bad backend is "nice." Also, there is an unavoidable race when transitioning from healthy to unhealthy, but if a server is already unhealthy when a client connects there is no necessity for a race to occur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The window is shifted between the client and server, but the actual race condition is a one-way trip. But actually, I think you're right that it's a full RTT, for a different reason: it also includes any RPCs that had been sent by the client but not yet received by the server when the server becomes unhealthy.
Anyway, I've updated the text to address your comments. PTAL.
|
||
#### Call Credentials | ||
|
||
If there are any call credentials associated with the channel, the client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: this does not exist in Java.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted. I don't think the text needs to be changed, though; Java is just a case where there are no call creds associated with the channel.
A17-client-side-health-checking.md
Outdated
|
||
### Defaults | ||
|
||
Client-side health checking will be disabled by default; users |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/users/service owners/?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
A17-client-side-health-checking.md
Outdated
#### `pick_first` | ||
|
||
We do not plan to support health checking with `pick_first`, for two | ||
reasons. First, if you are only connecting to a single backend, you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't quite true. If there are 4 backends, with ping_first you are only sending to one backend, but there are 3 other back-up backends. You could just connect to the next in the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this reason.
A17-client-side-health-checking.md
Outdated
### LB Policies Can Disable Health Checking When Needed | ||
|
||
There are some cases where an LB policy may want to disable client-side | ||
health-checking. To allow this, we will provide a channel arg that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there also going to be a channel argument for clients to use when they want to disable/ignore client-side health checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Updated text to clarify.
- Add support for health checking in subchannel code using nanopb | ||
(in progress). | ||
|
||
### Java |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will also need plumbing to allow setting the tracing event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
|
||
### Client Behavior | ||
|
||
The health checking client code will be built into the subchannel, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative approach is that we provide a policy add-on that resides outside of core, and can wrap a policy into a health-check-enabled one. This would not force Java to deal with the protobuf dependency, and also keep the complexity of subchannel implementation down.
/cc @ejona86
A17-client-side-health-checking.md
Outdated
## Proposal | ||
|
||
The gRPC client will be configured to send health-checking RPCs to | ||
each backend that it is connected to. Whenever a backend responds as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the phrasing here should be " The gRPC Client can be configured to send"
A17-client-side-health-checking.md
Outdated
client will need to be configured with a service name to use. However, | ||
by default, it can use the empty string, which would mean that the | ||
health of all services on a given host/port would be controlled with a | ||
single switch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should define the convention of empty string more formally. Something like:
By convention, when the empty string is used as the service name, it is indicating the health of the server itself and not of specific services running at the server. Servers are free to override this behavior when appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
A17-client-side-health-checking.md
Outdated
possible that a server could still be responding but has failed to notify | ||
the client that it is unhealthy. | ||
|
||
We can consider changing UHC to use this new streaming API, but that's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This paragraph should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Done.
This serving status will be returned by the `Watch()` method when the | ||
requested service name is initially unknown by the server. Note that the | ||
existing `Check()` unary method fails with status `NOT_FOUND` in this case, | ||
and we are not proposing to change that behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not fail with a NOT_FOUND status ? It is good hygiene for server owners to initialize the healthchecking service before the server begins listening on the port, and we should not optimize for the case where the server is listening too early.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it's good hygiene for servers to initialize the healthchecking status for their services before the server starts listening. However, not all servers will have good hygiene, and clients need to react somehow when they encounter a server that has bad hygiene. If the RPC just fails with NOT_FOUND
in this case, then the client has to retry the call using exponential backoff, and if the server sets the service's health status to SERVING
while the client is in backoff, it will delay the client from considering the subchannel healthy. I think we provide better behavior for the application by avoiding this potential delay.
|
||
### Client Behavior | ||
|
||
The health checking client code will be built into the subchannel, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WE should also describe what will happen for open source users. Persumably, the functionality will be available but not enabled by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is addressed at the bottom of the document, but I would recommend moving that up to avoid confusion .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
A17-client-side-health-checking.md
Outdated
the health-checking call, but it will treat the channel as healthy | ||
(connectivity state `READY`). However, the client will record a [channel | ||
trace](https://github.com/grpc/proposal/blob/master/A3-channel-tracing.md) | ||
event indicating that this has happened. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also be logged as an ERROR at high priority since failing open silently may violate operator assumptions.,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
channels. The naive approach of treating an unhealthy channel as having | ||
failed and disconnecting would both result in expensive reconnection | ||
attempts and might not actually cause us to select a different backend | ||
after re-resolving. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add more details here. Intuitively, it seems like pick-first policy should skip the address if it is not healthy and pick the next healthy one. The issue of expensive reconnection attempts (subject to exponential backoff) also exists for RR policy, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this issue does not exist for RR, because RR never disconnects from a backend -- it will always be attempting to connect to all backend addresses in its list. So if a backend connection fails in RR, RR just keeps trying. From the subchannel's point of view, that does not cause any connection churn, regardless of whether the subchannel is actually disconnected or is connected but reporting unhealthy.
In contrast, in PF, if a backend connection fails, PF gives up on it and goes back to its list of addresses to try to find one that will work. That's why this is harder for PF -- it would actually cause us to try to switch to a different address, which would cause a lot of connection churn.
is unhealthy. The LB policy may therefore take some unnecessary but | ||
unharmful steps in response, such as asking the subchannel to connect | ||
(which will be a no-op if it's already connected) or asking the resolver | ||
to re-resolve (which will probably be throttled by the cooldown timer). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A possible solution here is to add another state called "connected" between "connecting" and "ready". The "connected" state tells LB policy to not re-resolve or ask the sub-channel to reconnect. Of course, this will increase the complexity of the state machine and also a re-resolution may anyway be necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These states are part of our public API in all languages; they are directly visible to applications. So to add a new state, we would need to either (a) add the new state to the public API, thus unnecessarily exposing internal implementation details to the application, or (b) use a different set of states when internally propagating the state than when exposing it to the application, which also adds complexity to our code. I think either of those options would be a fair amount of trouble, and that doesn't seem worthwhile, given that the extra work being done here is unharmful.
If this extra work does turn out to be harmful at some point in the future, I think we can find ways to avoid it without having to add a new connectivity state.
``` | ||
|
||
We will then add the following top-level field to the ServiceConfig proto: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a link to ServiceConfig proto?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added link to service config doc.
The doc shows the config in JSON form, not proto form. I've been meaning to update the docs to publish the proto form as well, but haven't gotten around to it yet. In the interim, I've added an example here that shows how to express this in JSON form.
attempt will be subject to exponential backoff. When the next attempt | ||
starts, the subchannel will transition to state `CONNECTING`. | ||
|
||
When the client subchannel is shutting down or when the backend sends a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the server never closes the stream, will it stop server from shutting down cleanly? Should the server close all health check streams when shutting down?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is an implementation detail. If the server is shutting down, it will send a GOAWAY, which will cause the client to cancel the stream. If the server has already proactively cancelled the stream, that won't break anything.
client will need to be configured with a service name to use. However, | ||
by default, it can use the empty string, which would mean that the | ||
health of all services on a given host/port would be controlled with a | ||
single switch. Semantically, the empty string is used to represent the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean the status of the empty string service is independent from the non-empty string services, and there are no implication of their relationships whatsoever?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. In terms of the implementation, the only way that the empty string is special is that at server startup, the empty string is reported as healthy by default, whereas all other services are reported as unknown by default.
Should tracing tools like census, channelz record the stats about the health checking RPCs? And can service owner provide service config for health checking? |
Since the call will be present on the subchannel but not on the parent channel, I think it would be recorded that way in channelz (i.e., the call counts would increase on the subchannel but not on the parent channel). The census question is a good one -- I don't think we had considered that. The implementation I had in mind for C-core would not record any of these stats in census, but that may be sub-optimal. Would the implementation in go be simpler if we didn't record these in census? @zhangkun83, how about in java? |
For Java I am not going to record the health check RPCs in Census, just like the requests to GRPCLB balancer. That would be extra work to have them recorded. |
It will make implementation a little simpler. I'll keep it consistent with C and Java then by not doing census for health checking RPCs. |
Okay, sounds like we'll all do the same thing for now, which is to not report the health check request to census. I suspect that this may be something that we'll want to revisit later, but at least we'll be consistent for now. @zhangkun83, as an aside, note that C-core does actually report the grpclb balancer calls to census. We don't do that by doing anything special in the grpclb LB policy; it happens as a consequence of the fact that we create a full-fledged channel to talk to the balancer, and that channel records data to census just like any other channel. |
cc @canguler |
CC @yang-g