Sync deployments to task queue user data #6871
Conversation
string task_queue_name = 1;
temporal.api.enums.v1.TaskQueueType task_queue_type = 2;
google.protobuf.Timestamp first_poller_time = 3;
}

// TODO: comment what this is used as
this is used as a response when the deployment read API queries the deployment workflow, inquiring about its status. It should maybe be called QueryDescribeDeploymentResponse
thanks, I'll rename
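For context, a minimal Go sketch of how such a query response is typically wired up in a Temporal workflow; the struct fields and the query name "describe-deployment" are illustrative assumptions, not the PR's actual definitions:

package deployment

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// QueryDescribeDeploymentResponse stands in for the message being renamed in
// this thread; its fields here are placeholders for illustration only.
type QueryDescribeDeploymentResponse struct {
	TaskQueueName   string
	FirstPollerTime time.Time
}

// setupDescribeQuery registers a query handler so the deployment read API can
// ask the deployment workflow for its current status.
func setupDescribeQuery(ctx workflow.Context, state *QueryDescribeDeploymentResponse) error {
	return workflow.SetQueryHandler(ctx, "describe-deployment", func() (*QueryDescribeDeploymentResponse, error) {
		return state, nil
	})
}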
this looks great overall - thank you David for improving some of my work and adding the userdata stuff so quickly
had a couple of questions here and there but won't block
// lock so that only one poll does the update and the rest wait for it
c.deploymentLock.Lock()
defer c.deploymentLock.Unlock()
question - I remember us discussing trying to avoid mixing locks and atomics whenever we can. The approach you have taken works, but I just wanted to flag this to ensure we have thought of this case as well.
moreover - if we are ensuring our updates are idempotent and only one will get through eventually, should we be making our polls wait for the lock acquisition?
I agree about mixing locks and atomics, that's why there are no more atomics.
if one poll comes in and starts the workflow/update, and another 999 come in in the next half second, there should only be one outstanding call. otherwise we'll slam frontend+history.
actually we need even more backoff here on error (comment below), but that can wait.
oops, thought you had decided to keep c.deploymentRegistered an atomic. The git diff confused me, all good. I like this approach of using one and not both.
thanks for the reasoning, but then we should document somewhere that the poll latency (using versioning) might increase given we are waiting on userdata propagating and these calls to frontend/matching
only on the first few polls from a new deployment. it's a potential failure mode but it shouldn't affect latency in the steady state.
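To make the trade-off concrete, here is a minimal Go sketch of the lock-only pattern being described, with hypothetical names (deploymentClient, registerDeployment) standing in for the PR's actual types: the first poll performs the workflow/update call, and the other ~999 concurrent polls block on the mutex and then observe the flag instead of issuing their own calls.

package matching

import (
	"context"
	"sync"
)

// deploymentClient is a stand-in for the PR's real type; only the fields
// needed to illustrate the locking pattern are shown.
type deploymentClient struct {
	deploymentLock       sync.Mutex
	deploymentRegistered bool
}

// ensureRegistered lets exactly one poll perform the registration; concurrent
// polls wait on the lock and return once they see the flag, so frontend and
// history see a single outstanding call instead of many.
func (c *deploymentClient) ensureRegistered(ctx context.Context) error {
	// lock so that only one poll does the update and the rest wait for it
	c.deploymentLock.Lock()
	defer c.deploymentLock.Unlock()
	if c.deploymentRegistered {
		return nil // another poll already did the workflow/update call
	}
	if err := c.registerDeployment(ctx); err != nil {
		// error backoff is discussed separately in the thread below
		return err
	}
	c.deploymentRegistered = true
	return nil
}

// registerDeployment is a placeholder for the actual start-workflow/update
// call made against the deployment workflow.
func (c *deploymentClient) registerDeployment(ctx context.Context) error { return nil }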
@@ -31,47 +31,72 @@ import "google/protobuf/timestamp.proto";
import "temporal/api/deployment/v1/message.proto";
import "temporal/api/common/v1/message.proto";

// Data for each deployment+task queue pair. This is stored in each deployment (for each task
// queue), and synced to task queue user data (for each deployment).
message Data {
Should we call it something more specific such as PerTaskQueueData or TaskQueueDeploymentData?
well, deployment is in the package name so I was trying to keep names shorter. I guess deployment.PerTaskQueueData is fine, or maybe deployment.TaskQueueData? I don't want to repeat deployment
message TaskQueueInfo {
temporal.api.enums.v1.TaskQueueType task_queue_type = 1;
google.protobuf.Timestamp first_poller_time = 2;
Data data = 2;
Conceptually, data.last_became_current_time is filled based on the deployment's last_became_current_time. Once we have Rollout, the deployment's last_became_current_time will come from rollout records. For now, we should put the last_became_current_time inside DeploymentLocalState and update it together with is_current.
Given that, maybe Data should not be stored in here and only be calculated from DeploymentLocalState + TaskQueueInfo when we call SyncUserData. @dnr wdyt?
I'm trying to make some common structs so that we can add fields without repeating them in 30 different messages... but maybe last_became_current_time has to be somewhat special?
I think what you wrote makes sense, but then this TaskQueueInfo really just contains one field, first_poller_time? What else is going to go here?
I might want to keep it as a Data and just leave out lbct so it would pick up any other per-tq fields?
So add the last_became_current_time at top level but still use Data inside the state per TQ? I guess I don't have an objection, we can see if they'll converge more in the future and if so we can split the proto defs.
yes, that's what I was thinking. I could be wrong, it depends how it evolves.
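For concreteness, a rough Go-shaped sketch of the layout this thread converges on; these structs are illustrative stand-ins for the proto messages, and the field set is an assumption: last_became_current_time stays at the deployment level next to is_current, the per-task-queue Data keeps only fields like first_poller_time, and the synced user data is computed from both at SyncUserData time.

package deployment

import "time"

// Data is a stand-in for the per-deployment+task-queue proto message;
// last_became_current_time is deliberately omitted here.
type Data struct {
	FirstPollerTime time.Time
}

// DeploymentLocalState is a stand-in for the deployment workflow's state:
// is_current and last_became_current_time live (and are updated) together at
// this level, and may later be fed from rollout records.
type DeploymentLocalState struct {
	IsCurrent             bool
	LastBecameCurrentTime time.Time
	TaskQueues            map[string]Data // keyed by task queue name
}

// userDataForTaskQueue shows the derivation: what gets synced to task queue
// user data combines the per-TQ Data with the deployment-level timestamp.
func (s *DeploymentLocalState) userDataForTaskQueue(name string) (Data, time.Time) {
	return s.TaskQueues[name], s.LastBecameCurrentTime
}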
temporal.api.deployment.v1.Deployment deployment = 1;
string task_queue_name = 2;
temporal.api.enums.v1.TaskQueueType task_queue_type = 3;
when a deployment becomes current it'd usually need to update both the activity and workflow task queues. Do we want the ability to pack them in a single call, or is it not worth it?
hmm, good point. this is the sort of thing I'm worried about changing when we do SetCurrentDeployment... is making task_queue_type repeated enough? seems like it should be
generally, the Data could be different per type. maybe make it a map of int -> Data?
@ShahabT - remind me about the "activity task queue is placed in a different deployment" case? or are we not dealing with this in pre-release?
oh right, it has first poller time and that can be different. I'll try the map and see how it works
hmm, if we split up userdata by type later, then the map doesn't make sense. maybe it's actually not worth it...
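As a sketch of the two shapes discussed here (Go stand-ins for the proto request, with hypothetical field names): a repeated task_queue_type that shares one Data across types, versus a map keyed by task queue type so that, e.g., first_poller_time can differ between the workflow and activity task queues of the same name.

package deployment

import "time"

// Data is the per-task-queue payload from the earlier sketch.
type Data struct {
	FirstPollerTime time.Time
}

// Option A (assumption): repeated types sharing a single Data. Enough when the
// payload is identical for every type being synced in one call.
type syncRequestRepeatedTypes struct {
	TaskQueueName  string
	TaskQueueTypes []int32 // temporal.api.enums.v1.TaskQueueType values
	Data           Data
}

// Option B (assumption): per-type Data, keyed by task queue type, so the
// workflow and activity queues of the same name can carry different values.
type syncRequestPerTypeData struct {
	TaskQueueName string
	DataByType    map[int32]Data
}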
message GetTaskQueueUserDataResponse {
reserved 1;
// Versioned user data, set if the task queue has user data and the request's last_known_user_data_version is less
// than the version cached in the root partition.
temporal.server.api.persistence.v1.VersionedTaskQueueUserData user_data = 2;
}

message SyncDeploymentUserDataRequest {
maybe SyncTaskQueueDeploymentDataRequest? I think it's appropriate for the name to have TaskQueue in it.
it's the matching service... everything is about task queues already.
(I'm not trying to be annoying, the super-long names make things significantly harder to read)
Sure, if you like this one better. But every other API in the matching service has TaskQueue in its name :)
except UpdateWorkerBuildIdCompatibility and UpdateWorkerVersioningRules, which are the two most similar to this one 🤷♂️
namespaceEntry *namespace.Namespace,
pollMetadata *pollMetadata,
) error {
if !pollMetadata.workerVersionCapabilities.UseVersioning {
You can use worker_versioning.DeploymentFromCapabilities to get the deployment if v3 is used. Note that we don't want to register if v1-2 is used.
nice
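A hedged sketch of the suggested gate; worker_versioning.DeploymentFromCapabilities is the helper named in the comment above, but its exact signature isn't shown in this diff, so the types and the local stand-in deploymentFromCapabilities below are assumptions. The idea: only pollers carrying v3 deployment info trigger registration, while v1-2 pollers (UseVersioning set but no deployment) do not.

package matching

// workerVersionCapabilities is a stand-in for the WorkerVersionCapabilities
// proto; the deployment-related fields are assumptions for illustration.
type workerVersionCapabilities struct {
	UseVersioning        bool
	BuildID              string
	DeploymentSeriesName string // assumed to be set only by v3 (deployment-based) pollers
}

type deployment struct {
	SeriesName string
	BuildID    string
}

// deploymentFromCapabilities mimics what the suggested helper is used for
// here: return a deployment only when v3-style info is present, nil otherwise,
// so v1-2 pollers never reach the deployment registration path.
func deploymentFromCapabilities(caps *workerVersionCapabilities) *deployment {
	if caps == nil || !caps.UseVersioning || caps.DeploymentSeriesName == "" {
		return nil
	}
	return &deployment{SeriesName: caps.DeploymentSeriesName, BuildID: caps.BuildID}
}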
return nil
}
if !c.partitionMgr.engine.config.EnableDeployments(namespaceEntry.Name().String()) {
return nil
if v3 is used but this config is not set, should we reject the pollers? Do we reject v1-2 pollers if versioning config is not enabled?
we have the separate "workflow apis" setting... there's just one setting for v1 and v2 though.
I suppose the answer is yes? I'm not sure of all the implications though
Still not sure why we have separate configs for v1-2. we always enable/disable them together.
the idea was something like, if we find a bug, we can block further metadata updates without halting workflows
if err != nil {
return err
logger.Error("syncing task queue userdata", "error", err)
should we add the deployment and tq info to the log?
yeah, though it's possible to correlate with the Info log above by activity id (which is already in the logger).
actually, on second thought: the sdk workflow/activity logs are already tagged with workflow id, which has the deployment series+buildid at a glance. so I'll remove those from the explicit logs.
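A small Go SDK sketch of the conclusion here, with syncTaskQueueUserData as a hypothetical stand-in for the real call into matching: the activity logger is already tagged with the workflow and activity IDs, and the workflow ID embeds the deployment series and build ID, so no explicit deployment/task-queue tags are added to the log call.

package deployment

import (
	"context"

	"go.temporal.io/sdk/activity"
)

// syncUserData illustrates the logging choice only.
func syncUserData(ctx context.Context) error {
	// the SDK logger already carries WorkflowID (deployment series + build id
	// at a glance) and ActivityID tags
	logger := activity.GetLogger(ctx)
	if err := syncTaskQueueUserData(ctx); err != nil {
		logger.Error("syncing task queue userdata", "error", err)
		return err
	}
	return nil
}

// syncTaskQueueUserData is a placeholder for the call that pushes deployment
// data into task queue user data.
func syncTaskQueueUserData(ctx context.Context) error { return nil }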
What changed?
- Sync deployments to task queue user data.
- Various proto reorganization.

Why?
So the data is available in matching.

How did you test it?
not yet