Sync deployments to task queue user data #6871

Merged
merged 2 commits into temporalio:versioning-3 from v3ud on Nov 23, 2024

Conversation


@dnr dnr commented Nov 22, 2024

What changed?

  • Sync deployments to task queue user data.
  • Various protos reorganization.

Why?

So the data is available in matching.

How did you test it?

not yet

@dnr dnr requested a review from a team as a code owner November 22, 2024 03:05
string task_queue_name = 1;
temporal.api.enums.v1.TaskQueueType task_queue_type = 2;
google.protobuf.Timestamp first_poller_time = 3;
}

// TODO: comment what this is used as
Member

this is used as the response when the deployment read API queries the deployment workflow, enquiring about its status. It should maybe be called QueryDescribeDeploymentResponse

Member Author

thanks, I'll rename

@Shivs11 Shivs11 left a comment

this looks great overall - thank you David for improving some of my work and adding the userdata stuff so quickly

had a couple of questions here and there but won't block

// lock so that only one poll does the update and the rest wait for it
c.deploymentLock.Lock()
defer c.deploymentLock.Unlock()

Member

question - I remember us discussing trying to avoid mixing locks and atomics whenever we can. The approach you have taken works, but I just wanted to paste this to ensure we have thought of this case as well.

moreover - if we are ensuring our updates are idempotent and only one will get through eventually, should we be making our polls wait for the lock acquisition?

Member Author

I agree about mixing locks and atomics, that's why there are no more atomics.

if one poll comes in and starts the workflow/update, and another 999 come in in the next half second, there should only be one outstanding call. otherwise we'll slam frontend+history.

actually we need even more backoff here on error (comment below), but that can wait.

Member

oops, thought you had decided to keep c.deploymentRegistered an atomic. The git diff confused me, all good. I like this approach of using one and not both.

thanks for the reasoning, but then we should document somewhere that the poll latency (using versioning) might increase given we are waiting on userdata propagating and these calls to frontend/matching

Member Author

only on the first few polls from a new deployment. it's a potential failure mode but it shouldn't affect latency in the steady state.
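
A minimal sketch of the locking pattern discussed in this thread (hypothetical names, not the actual matching-service code): the first poll performs the registration call while concurrent polls wait on the mutex and return once it has succeeded.

package matching

import (
	"context"
	"sync"
)

// deploymentRegistrar is a stand-in for the per-task-queue state; the real
// code keeps this state on the physical task queue manager.
type deploymentRegistrar struct {
	lock       sync.Mutex
	registered bool
}

// ensureRegistered is called on every poll. Only the first caller issues the
// expensive workflow/update call against frontend+history; the other polls
// block on the lock and return as soon as the first call has succeeded.
func (r *deploymentRegistrar) ensureRegistered(ctx context.Context, register func(context.Context) error) error {
	r.lock.Lock()
	defer r.lock.Unlock()
	if r.registered {
		return nil
	}
	if err := register(ctx); err != nil {
		// TODO: back off here so repeated failing polls don't hammer frontend/history.
		return err
	}
	r.registered = true
	return nil
}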

@@ -31,47 +31,72 @@ import "google/protobuf/timestamp.proto";
import "temporal/api/deployment/v1/message.proto";
import "temporal/api/common/v1/message.proto";

// Data for each deployment+task queue pair. This is stored in each deployment (for each task
// queue), and synced to task queue user data (for each deployment).
message Data {
Collaborator

Should we call it something more specific such as PerTaskQueueData or TaskQueueDeploymentData?

Member Author

well, deployment is in the package name so I was trying to keep names shorter. I guess deployment.PerTaskQueueData is fine, or maybe deployment.TaskQueueData? I don't want to repeat deployment

message TaskQueueInfo {
temporal.api.enums.v1.TaskQueueType task_queue_type = 1;
google.protobuf.Timestamp first_poller_time = 2;
Data data = 2;
Collaborator

Conceptually, data.last_became_current_time is filled based on the deployment's last_became_current_time. Once we have Rollout, the deployment's last_became_current_time will come from rollout records. For now, we should put the last_became_current_time inside DeploymentLocalState and update it together with is_current.

Given that, maybe Data should not be stored in here and only be calculated from the DeploymentLocalState + TaskQueueInfo when we call SyncUserData. @dnr wdyt?

Member Author

I'm trying to make some common structs so that we can add fields without repeating them in 30 different messages... but maybe last_became_current_time has to be somewhat special?

I think what you wrote makes sense. but then this TaskQueueInfo really just contains one field, first_poller_time? what else is going to go here?

I might want to keep it as a data and just leave out lbct so it would pick up any other per-tq fields?

Collaborator

So add the last_became_current_time at top level but still use Data inside the state per TQ? I guess I don't have an objection, we can see if they'll converge more in the future and if so we can split the proto defs.

Member Author

yes, that's what I was thinking. I could be wrong, it depends how it evolves.
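
A hypothetical illustration (not code from this PR) of the compromise described above: per-task-queue Data stays in the deployment workflow state, while last_became_current_time lives once at the deployment level and is merged in when building the payload for SyncUserData. All names and fields here are assumptions.

package deployment

import "time"

// taskQueueData mirrors the per-task-queue Data kept in the deployment
// workflow state; field names are illustrative only.
type taskQueueData struct {
	firstPollerTime time.Time
	// other per-task-queue fields would be added here
}

// deploymentLocalState holds deployment-level fields such as
// lastBecameCurrentTime, which is updated together with isCurrent.
type deploymentLocalState struct {
	isCurrent             bool
	lastBecameCurrentTime time.Time
	taskQueues            map[string]*taskQueueData // keyed by task queue name
}

// syncedData is what would be sent to matching for one task queue.
type syncedData struct {
	firstPollerTime       time.Time
	lastBecameCurrentTime time.Time
}

// dataToSync merges the deployment-level timestamp into the per-task-queue
// data at sync time, instead of storing it redundantly per task queue.
func (s *deploymentLocalState) dataToSync(tqName string) syncedData {
	tq := s.taskQueues[tqName]
	return syncedData{
		firstPollerTime:       tq.firstPollerTime,
		lastBecameCurrentTime: s.lastBecameCurrentTime,
	}
}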

temporal.api.deployment.v1.Deployment deployment = 1;
string task_queue_name = 2;
temporal.api.enums.v1.TaskQueueType task_queue_type = 3;
Collaborator

when a deployment becomes current it'd usually need to update both activity and wf task queues. Do we want the ability to pack them in a single call, or is it not worth it?

Member Author

hmm, good point. this is the sort of thing I'm worried about changing when we do SetCurrentDeployment...

is making task_queue_type repeated enough? seems like it should be

Collaborator

generally, the Data could be different per type. maybe make it a map of int -> Data?

Member

@ShahabT - remind me about the "what about the case when the activity task-queue is placed in a different deployment" case? or are we not dealing with this in pre-release?

Member Author

oh right, it has first poller time and that can be different. I'll try the map and see how it works

Member Author

hmm, if we split up userdata by type later, then the map doesn't make sense. maybe it's actually not worth it...

message GetTaskQueueUserDataResponse {
reserved 1;
// Versioned user data, set if the task queue has user data and the request's last_known_user_data_version is less
// than the version cached in the root partition.
temporal.server.api.persistence.v1.VersionedTaskQueueUserData user_data = 2;
}

message SyncDeploymentUserDataRequest {
Collaborator

maybe SyncTaskQueueDeploymentDataRequest? I think it's appropriate for the name to have TaskQueue in it.

Member Author

it's the matching service... everything is about task queues already.

(I'm not trying to be annoying, the super-long names make things significantly harder to read)

Collaborator

Sure, if you like this one better. But every other API in the matching service has TaskQueue in its name :)

Member Author

except UpdateWorkerBuildIdCompatibility and UpdateWorkerVersioningRules, which are the two most similar to this one 🤷‍♂️

namespaceEntry *namespace.Namespace,
pollMetadata *pollMetadata,
) error {
if !pollMetadata.workerVersionCapabilities.UseVersioning {
Collaborator

You can use worker_versioning.DeploymentFromCapabilities to get the deployment if v3 is used. Note that we don't want to register if v1-2 is used.

Member Author

nice
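
A rough sketch of the gating the reviewer suggests. The helper name worker_versioning.DeploymentFromCapabilities comes from the comment above; the function below only approximates it, and its field and package names are assumptions rather than code from this PR.

package matching

import (
	commonpb "go.temporal.io/api/common/v1"
	deploymentpb "go.temporal.io/api/deployment/v1"
)

// deploymentFromPoll approximates what worker_versioning.DeploymentFromCapabilities
// is described as doing: return a deployment only for v3 pollers, and nothing
// for unversioned or v1-2 (build-id only) pollers.
func deploymentFromPoll(caps *commonpb.WorkerVersionCapabilities) *deploymentpb.Deployment {
	if caps == nil || !caps.GetUseVersioning() {
		return nil // unversioned poller: nothing to register
	}
	if caps.GetDeploymentSeriesName() == "" {
		return nil // v1-2 style versioning: do not register a deployment
	}
	// v3 poller: identified by deployment series + build id (assumed fields).
	return &deploymentpb.Deployment{
		SeriesName: caps.GetDeploymentSeriesName(),
		BuildId:    caps.GetBuildId(),
	}
}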

return nil
}
if !c.partitionMgr.engine.config.EnableDeployments(namespaceEntry.Name().String()) {
return nil
Collaborator

if v3 is used but this config is not set, should we reject the pollers? Do we reject v1-2 pollers if versioning config is not enabled?

Member Author

we have the separate "workflow apis" setting... there's just one setting for v1 and v2 though.

I suppose the answer is yes? I'm not sure of all the implications though

Collaborator

Still not sure why we have separate configs for v1-2. we always enable/disable them together.

Member Author

the idea was something like, if we find a bug, we can block further metadata updates without halting workflows

if err != nil {
return err
logger.Error("syncing task queue userdata", "error", err)
Collaborator

should we add the deployment and tq info to the log?

Member Author

yeah, though it's possible to correlate with the Info log above by activity id (which is already in the logger).

actually, on second thought: the sdk workflow/activity logs are already tagged with workflow id, which has the deployment series+buildid at a glance. so I'll remove those from the explicit logs.

@dnr dnr merged commit c5ad8bd into temporalio:versioning-3 Nov 23, 2024
34 of 45 checks passed
@dnr dnr deleted the v3ud branch November 23, 2024 07:23
dnr added a commit that referenced this pull request Nov 26, 2024