Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic client endpoints #676

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

antlai-temporal
Copy link
Contributor

What was changed

Clients can now dynamically update certain options fields such as the target url, the TLS configuration, the origin, or the keep alive settings, without creating a new client.

Why?

This will allow, e.g., updating a TLS certificate without restarting workers, or losing state associated with the client, such as, eager workflow start tracking, or metrics...

Checklist

  1. Closes
    [Feature Request] Dynamic client for workers #477
  2. How was this tested:

Integration tests and unit tests

@antlai-temporal antlai-temporal requested a review from a team as a code owner January 25, 2024 05:00
/// set field to a tombstone specified in the field description.
#[derive(Clone, Debug, Default, derive_builder::Builder)]
#[non_exhaustive]
pub struct ClientOptionsUpdate {
Copy link
Member

@cretz cretz Jan 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the entire ClientOptions struct be reused instead? Not every field has to be optional. You can require the callers provide the same data they did for original client options if they don't want to change it. The lang side doesn't need help here.

(would also be neat if some of the runtime options like headers was set/updated via this common mechanism and didn't require reconnect, but that doesn't have to be now)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this probably makes decent sense too. I imagine the lang-side API will typically involve the user copying some existing options and changing them, in which case this all works out fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the only fields that I can change dynamically, because they are associated with the endpoint, and I think it makes it more clear what can be changed with a separate struct. Also, fields like target_url are not optional in ClientOptions but need to be optional in update, or remember the original value set, which is a pain with concurrent updates...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go with swapping out the whole options behind a mutex which I mentioned in another comment, then the other fields should be changeable too I think, which could work out nicely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things like retry_config are used by the Retry wrapper that I cannot change. client_name or client_version or identity may be used in metrics, we probably don't want to change them, and I cannot see a use case to change them dynamically... We could fail the update if they don't match.
I still think that is easier in the API to do deltas with only the changeable attributes, but internally we could use ClientOptions instead of a changeset, if we are ok with a mutex on options. We would still need some state for the updates that does not belong in options, such as the channel, the version id,... but we could probably lock them independently from options...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't they be changed? I agree if they can't be changed it makes sense to accept something that only allows changing what can be. But, I'm not sure I follow why we couldn't change those. They are accessed each time a request is made - seems like changing them would affect subsequent requests and work properly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for retry_config we cache it inside:
impl RetryClient {
/// Use the provided retry config with the provided client
pub fn new(client: SG, retry_config: RetryConfig) -> Self {
Self {
client,
retry_config: Arc::new(retry_config),
}
}
}
we could change that if needed, but not sure if it is that important to dynamically change retry_config...

The others we can, but the question is whether we should, if it is used in metrics, or server side, it may be confusing, and
I don't see much value of changing them dynamically...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have to actually expose those to end users, but I take your point. Especially since the RetryClient is a level higher. I'm fine with having just the diff struct.

Comment on lines +427 to +428
/// If the update fails, the client does not rollback to the previous configuration, and future
/// connections will fail.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is rough. Why can't we only apply the update after the new endpoint is connected and works? Can't you do an atomic swap of the endpoint where there is no downtime and it only contains a successfully connected endpoint?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I would call this try_update_options and only swap if the connection succeeds. Also I find leaving the old options to be scary since now things aren't in sync.

It'd be fine to put a mutex around the options as well, so that they can be updated too. This also fits better with the idea of using the ClientOptions directly rather than ClientOptionsUpdate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been looking into the atomic swap reusing the id for the endpoint, but the problem is that it s anything but atomic.
When you reuse the id the endpoint gets in a "preparing" queue, the old endpoint still active, when the new one gets connected the swap happens. The problem is that this is all asynchronous, and if there is a problem with the new endpoint there is no visible error, and the old one continuous forever. It is also impossible to know when this swap will happen, or validate that it happened without sending a carefully crafted request. The endpoints are also created in lazy mode, so they need to be part of channel to validate them.
I think the root problem is that this mechanism was designed for load balancing, not dynamic changes, and if you are adding more (similar) endpoints with different targets, you can be more relaxed about when exactly changes happened.
For this reason, I'm forcing the client to fail when there are problems, otherwise they get silently ignored...

Copy link
Member

@cretz cretz Jan 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that this is all asynchronous

Yeah I see that with the balance channel and makes sense. But can we wait to even send the insert until we have connected the endpoint? And can we send the remove after the insert? So logic would be:

endpoint = connect_new()
channel_tx.send(insert(endpoint))
channel_tx.send(remove(old_endpoint_seq))

Also curious, if you send an insert then a remove, is it guaranteed that the next call will be on the inserted? And in the case of what's there now w/ remove-then-insert, is it possible for multi-threaded use that there is a time where no endpoint is there?

I think the root problem is that this mechanism was designed for load balancing, not dynamic changes

I wonder if we can have our own "mutex_channel" or something, or if that's too much (EDIT: I see ::new is crate-private)

The endpoints are also created in lazy mode, so they need to be part of channel to validate them.

This is unfortunate but if we can't eagerly validate, ok. I now wonder if we should go higher than channel. What I mean is if we can swap our own channel for a completely new one instead of updating the channel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that you can easily swap the channel without creating a new client, and the problem of creating new clients is that there is a lot of shared state associated to them, like eager workflow start workers, or metrics,... that need to be moved across atomically.
The semantics of the control channel are really weak. Remove happens immediately, but inserts are queued in a "preparing" queue, so there is no guarantee that after sending insert ok you are getting the new options.
To avoid the gap you can reuse the id in insert, and then you don't need a remove, but there is no guarantee when, or if, the swap will happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that it is "lazy". It will create the connection with the first request, and we want to make sure that this connection is not created until we finish the upgrade.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should attempt a get-system-info call on the client before swapping it. IMO we want to make sure the connection is created and is a known config to be successful even way before we swap it out. We don't want to swap out lazy or a when given a bad client, the worker will just silently fail for a minute (our default retry iirc) then fatal out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the swap we can check the connection, but that doesn't help for lazy. Internally Tonic defers the connection until it gets the first request. And even if we do a get-system-info, there is a race with someone else doing the first call after dropping the lock, if they grabbed it before the update started, and getting the error...

Copy link
Member

@cretz cretz Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the swap we can check the connection, but that doesn't help for lazy

This is why I'm suggesting a higher level client swap. Don't swap things at the Tonic level, swap the connected Temporal client.

there is a race with someone else doing the first call after dropping the lock, if they grabbed it before the update started, and getting the error

I think we can accept/document this race. If you're really concerned, we can use a separate lock for the entire client update process to prevent concurrent updates and put a short timeout on the get-system-info call, but of course we should never hold a lock during regular client operations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably do this:

pub struct ConfiguredClient<C> {
    client_and_options: ArcSwap<ClientAndOptions>,
    headers: Arc<RwLock<HashMap<String, String>>>,
    /// Capabilities as read from the `get_system_info` RPC call made on client connection
    capabilities: Option<get_system_info_response::Capabilities>,
    workers: Arc<SlotManager>,
}

pub struct ClientAndOptions<C> {
    client: C,
    options: ClientOptions
}

If ArcSwap doesn't work out for whatever reason (it might be hard to use load() references in all cases) then you can do the Arc<RwLock<Arc<ClientAndOptions>>> approach

@@ -87,6 +91,9 @@ static LONG_POLL_METHOD_NAMES: [&str; 3] = [
const LONG_POLL_TIMEOUT: Duration = Duration::from_secs(70);
const OTHER_CALL_TIMEOUT: Duration = Duration::from_secs(30);

/// Buffer size for the channel that listens to change events
const DEFAULT_BUFFER_SIZE: usize = 1024;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this can probably be a lot smaller (though I doubt it matters a whole lot either way).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copied the constant used internally by tonic, frankly, no idea about the right size... But, yes, being a control channel, that looks large...

/// set field to a tombstone specified in the field description.
#[derive(Clone, Debug, Default, derive_builder::Builder)]
#[non_exhaustive]
pub struct ClientOptionsUpdate {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this probably makes decent sense too. I imagine the lang-side API will typically involve the user copying some existing options and changing them, in which case this all works out fine.


/// Metadata to enable the dynamic configuration of a client.
#[derive(Clone, Debug)]
pub struct DynamicUpdateInfo {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't need to be pub

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They may need to access the changeset to get the current options when there is no mutable ref to the client.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not accessible as it stands anyway. It's stored as a private field and there's no accessor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We need to decide whether to get rid of changeset and update ClientOptions directly (with mutex), or not, and then either add the accessor or make everything private.

Comment on lines +427 to +428
/// If the update fails, the client does not rollback to the previous configuration, and future
/// connections will fail.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I would call this try_update_options and only swap if the connection succeeds. Also I find leaving the old options to be scary since now things aren't in sync.

It'd be fine to put a mutex around the options as well, so that they can be updated too. This also fits better with the idea of using the ClientOptions directly rather than ClientOptionsUpdate

@antlai-temporal
Copy link
Contributor Author

Agreed. I would call this try_update_options and only swap if the connection succeeds. Also I find leaving the old options to be scary since now things aren't in sync.

This is what I tried first, but as I explained above, it is all asynchronous, with lazy endpoints, and there is no error callback, so it is difficult to make the update only when it works...
Leaving the old options is really ugly, agreed. I checked that we don't use the possibly outdated fields in any critical manner but there is no guarantee that something will eventually break. It is also unrealistic that we will always have a mutable ref to client unless we add a top level mutex, which would mess up backwards compatibility.
Adding a lock on ClientOptions does not completely solve the problem, you need to hold both locks (also for update field), to safely do the update, and locks need to be acquired once, and in the correct order, to avoid deadlocks. I thought that could be quite intrusive too...

If we can live with occasional inconsistencies, we could lock them independently... That would improve on what we have when we have no mutable ref...

It'd be fine to put a mutex around the options as well, so that they can be updated too. This also fits better with the idea of using
the ClientOptions directly rather than ClientOptionsUpdate
See discussion above...
The problem is that you need to hold both locks to keep it consistent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants