Sync deployments to task queue user data #6871

Merged
merged 2 commits into temporalio:versioning-3 from v3ud on Nov 23, 2024

Conversation


@dnr dnr commented Nov 22, 2024

What changed?

  • Sync deployments to task queue user data.
  • Various protos reorganization.

Why?

So the data is available in matching.

How did you test it?

not yet

@dnr dnr requested a review from a team as a code owner November 22, 2024 03:05
string task_queue_name = 1;
temporal.api.enums.v1.TaskQueueType task_queue_type = 2;
google.protobuf.Timestamp first_poller_time = 3;
}

// TODO: comment what this is used as
Member

this is used as the response when the deployment read API queries the deployment workflow, enquiring about its status. It should maybe be called QueryDescribeDeploymentResponse

Member Author

thanks, I'll rename

@Shivs11 Shivs11 left a comment

this looks great overall - thank you David for improving some of my work and adding the userdata stuff so quickly

had a couple of questions here and there but won't block

// lock so that only one poll does the update and the rest wait for it
c.deploymentLock.Lock()
defer c.deploymentLock.Unlock()

Member

question - I remember us discussing trying to avoid mixing locks and atomics whenever we can. The approach you have taken works, but I just wanted to paste this to ensure we have thought of this case as well.

moreover - if we are ensuring our updates are idempotent and only one will get through eventually, should we be making our polls wait for the lock acquisition?

Member Author

I agree about mixing locks and atomics, that's why there are no more atomics.

if one poll comes in and starts the workflow/update, and another 999 come in in the next half second, there should only be one outstanding call. otherwise we'll slam frontend+history.

actually we need even more backoff here on error (comment below), but that can wait.

Member

oops, thought you had decided to keep c.deploymentRegistered an atomic. The git diff confused me, all good. I like this approach of using one and not both.

thanks for the reasoning, but then we should document somewhere that the poll latency (using versioning) might increase given we are waiting on userdata propagating and these calls to frontend/matching

Member Author

only on the first few polls from a new deployment. it's a potential failure mode but it shouldn't affect latency in the steady state.
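
A minimal sketch of the locking pattern discussed in this thread (hypothetical names, not the actual matching-service code): the first poll performs the registration call while concurrent polls wait on the mutex and return once it has succeeded.

package matching

import (
	"context"
	"sync"
)

// deploymentRegistrar is a stand-in for the per-task-queue state; the real
// code keeps this state on the physical task queue manager.
type deploymentRegistrar struct {
	lock       sync.Mutex
	registered bool
}

// ensureRegistered is called on every poll. Only the first caller issues the
// expensive workflow/update call against frontend+history; the other polls
// block on the lock and return as soon as the first call has succeeded.
func (r *deploymentRegistrar) ensureRegistered(ctx context.Context, register func(context.Context) error) error {
	r.lock.Lock()
	defer r.lock.Unlock()
	if r.registered {
		return nil
	}
	if err := register(ctx); err != nil {
		// TODO: back off here so repeated failing polls don't hammer frontend/history.
		return err
	}
	r.registered = true
	return nil
}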

@@ -31,47 +31,72 @@ import "google/protobuf/timestamp.proto";
import "temporal/api/deployment/v1/message.proto";
import "temporal/api/common/v1/message.proto";

// Data for each deployment+task queue pair. This is stored in each deployment (for each task
// queue), and synced to task queue user data (for each deployment).
message Data {
Collaborator

Should we call it something more specific such as PerTaskQueueData or TaskQueueDeploymentData?

Member Author

well, deployment is in the package name so I was trying to keep names shorter. I guess deployment.PerTaskQueueData is fine, or maybe deployment.TaskQueueData? I don't want to repeat deployment

message TaskQueueInfo {
temporal.api.enums.v1.TaskQueueType task_queue_type = 1;
google.protobuf.Timestamp first_poller_time = 2;
Data data = 2;
Collaborator

Conceptually, data.last_became_current_time is filled based on the deployment's last_became_current_time. Once we have Rollout, the deployment's last_became_current_time will come from rollout records. For now, we should put the last_became_current_time inside DeploymentLocalState and update it together with is_current.

Given that, maybe Data should not be stored in here and only be calculated from the DeploymentLocalState + TaskQueueInfo when we call SyncUserData. @dnr wdyt?

Member Author

I'm trying to make some common structs so that we can add fields without repeating them in 30 different messages... but maybe last_became_current_time has to be somewhat special?

I think what you wrote makes sense. but then this TaskQueueInfo really just contains one field, first_poller_time? what else is going to go here?

I might want to keep it as a data and just leave out lbct so it would pick up any other per-tq fields?

Collaborator

So add the last_became_current_time at top level but still use Data inside the state per TQ? I guess I don't have an objection, we can see if they'll converge more in the future and if so we can split the proto defs.

Member Author

yes, that's what I was thinking. I could be wrong, it depends how it evolves.
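
A hypothetical illustration (not code from this PR) of the compromise described above: per-task-queue Data stays in the deployment workflow state, while last_became_current_time lives once at the deployment level and is merged in when building the payload for SyncUserData. All names and fields here are assumptions.

package deployment

import "time"

// taskQueueData mirrors the per-task-queue Data kept in the deployment
// workflow state; field names are illustrative only.
type taskQueueData struct {
	firstPollerTime time.Time
	// other per-task-queue fields would be added here
}

// deploymentLocalState holds deployment-level fields such as
// lastBecameCurrentTime, which is updated together with isCurrent.
type deploymentLocalState struct {
	isCurrent             bool
	lastBecameCurrentTime time.Time
	taskQueues            map[string]*taskQueueData // keyed by task queue name
}

// syncedData is what would be sent to matching for one task queue.
type syncedData struct {
	firstPollerTime       time.Time
	lastBecameCurrentTime time.Time
}

// dataToSync merges the deployment-level timestamp into the per-task-queue
// data at sync time, instead of storing it redundantly per task queue.
func (s *deploymentLocalState) dataToSync(tqName string) syncedData {
	tq := s.taskQueues[tqName]
	return syncedData{
		firstPollerTime:       tq.firstPollerTime,
		lastBecameCurrentTime: s.lastBecameCurrentTime,
	}
}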

temporal.api.deployment.v1.Deployment deployment = 1;
string task_queue_name = 2;
temporal.api.enums.v1.TaskQueueType task_queue_type = 3;
Collaborator

when a deployment becomes current it'd usually need to update both activity and wf task queues. Do we want the ability to pack them in a single call, or is it not worth it?

Member Author

hmm, good point. this is the sort of thing I'm worried about changing when we do SetCurrentDeployment...

is making task_queue_type repeated enough? seems like it should be

Collaborator

generally, the Data could be different per type. maybe make it a map of int -> Data?

Member

@ShahabT - remind me about the "what about the case when the activity task-queue is placed in a different deployment" case? or are we not dealing with this in pre-release?

Member Author

oh right, it has first poller time and that can be different. I'll try the map and see how it works

Member Author

hmm, if we split up userdata by type later, then the map doesn't make sense. maybe it's actually not worth it...

message GetTaskQueueUserDataResponse {
reserved 1;
// Versioned user data, set if the task queue has user data and the request's last_known_user_data_version is less
// than the version cached in the root partition.
temporal.server.api.persistence.v1.VersionedTaskQueueUserData user_data = 2;
}

message SyncDeploymentUserDataRequest {
Collaborator

maybe SyncTaskQueueDeploymentDataRequest? I think it's appropriate for the name to have TaskQueue in it.

Member Author

it's the matching service... everything is about task queues already.

(I'm not trying to be annoying, the super-long names make things significantly harder to read)

Collaborator

Sure, if you like this one better. But every other API in the matching service has TaskQueue in its name :)

Member Author

except UpdateWorkerBuildIdCompatibility and UpdateWorkerVersioningRules, which are the two most similar to this one 🤷‍♂️

namespaceEntry *namespace.Namespace,
pollMetadata *pollMetadata,
) error {
if !pollMetadata.workerVersionCapabilities.UseVersioning {
Collaborator

You can use worker_versioning.DeploymentFromCapabilities to get the deployment if v3 is used. Note that we don't want to register if v1-2 is used.

Member Author

nice
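
A rough sketch of the gating the reviewer suggests. The helper name worker_versioning.DeploymentFromCapabilities comes from the comment above; the function below only approximates it, and its field and package names are assumptions rather than code from this PR.

package matching

import (
	commonpb "go.temporal.io/api/common/v1"
	deploymentpb "go.temporal.io/api/deployment/v1"
)

// deploymentFromPoll approximates what worker_versioning.DeploymentFromCapabilities
// is described as doing: return a deployment only for v3 pollers, and nothing
// for unversioned or v1-2 (build-id only) pollers.
func deploymentFromPoll(caps *commonpb.WorkerVersionCapabilities) *deploymentpb.Deployment {
	if caps == nil || !caps.GetUseVersioning() {
		return nil // unversioned poller: nothing to register
	}
	if caps.GetDeploymentSeriesName() == "" {
		return nil // v1-2 style versioning: do not register a deployment
	}
	// v3 poller: identified by deployment series + build id (assumed fields).
	return &deploymentpb.Deployment{
		SeriesName: caps.GetDeploymentSeriesName(),
		BuildId:    caps.GetBuildId(),
	}
}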

return nil
}
if !c.partitionMgr.engine.config.EnableDeployments(namespaceEntry.Name().String()) {
return nil
Collaborator

if v3 is used but this config is not set, should we reject the pollers? Do we reject v1-2 pollers if versioning config is not enabled?

Member Author

we have the separate "workflow apis" setting... there's just one setting for v1 and v2 though.

I suppose the answer is yes? I'm not sure of all the implications though

Collaborator

Still not sure why we have separate configs for v1-2. we always enable/disable them together.

Member Author

the idea was something like, if we find a bug, we can block further metadata updates without halting workflows

if err != nil {
return err
logger.Error("syncing task queue userdata", "error", err)
Collaborator

should we add the deployment and tq info to the log?

Member Author

yeah, though it's possible to correlate with the Info log above by activity id (which is already in the logger).

actually, on second thought: the sdk workflow/activity logs are already tagged with workflow id, which has the deployment series+buildid at a glance. so I'll remove those from the explicit logs.

@dnr dnr merged commit c5ad8bd into temporalio:versioning-3 Nov 23, 2024
34 of 45 checks passed
@dnr dnr deleted the v3ud branch November 23, 2024 07:23
dnr added a commit that referenced this pull request Nov 26, 2024