Move admin and meta component into cluster controller role #1180
Conversation
Force-pushed bf4d604 to d6647b3
e2e tests are failing: https://github.com/restatedev/restate/actions/runs/7888143370/job/21525412929?pr=1180. Investigating what I broke.
The e2e tests were broken because of restatedev/e2e#265. They should be fixed now.
Still failing. Investigating what is going wrong with the e2e tests.
I suspect there is a race condition between the e2e test invoking a service and the schema information having propagated to the ingress.
Great work @tillrohrmann. The changes look good to me, but I'd love to get clarification on a couple of questions about potential data loss.
invocation_termination: InvocationTermination,
) -> Result<(), Error> {
    let invocation_termination =
        bincode::serde::encode_to_vec(invocation_termination, bincode::config::standard())
Is this a transitional step, or do you think we will continue with the nested serialization in the long run?
I thought of it as a transitional step to simplify things for the time being. I guess once we introduce the Admin grpc service we will specify the `InvocationTermination` request properly and could then reuse it here.
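For illustration, a minimal round-trip of the nested serialization being discussed, assuming a simplified stand-in for the real `InvocationTermination` type and bincode 2's serde integration (matching the snippet above); the resulting bytes would be nested inside the outer message sent to the worker:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical stand-in for restate's real InvocationTermination type.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct InvocationTermination {
    invocation_id: String,
    kill: bool,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let termination = InvocationTermination {
        invocation_id: "inv-123".to_owned(),
        kill: true,
    };

    // Inner bincode serialization; these bytes are then nested inside the
    // outer message that travels to the worker.
    let bytes =
        bincode::serde::encode_to_vec(&termination, bincode::config::standard())?;

    // The receiving side decodes the nested payload again.
    let (decoded, _len): (InvocationTermination, usize) =
        bincode::serde::decode_from_slice(&bytes, bincode::config::standard())?;
    assert_eq!(decoded, termination);
    Ok(())
}
```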
crates/node/src/roles/worker.rs
.map_err(WorkerRoleError::Worker),
);
component_set.spawn(
    Self::reload_schemas(
I assumed we would need to perform a blocking initial schema fetch before starting the worker. Do we risk partition processors failing or ignoring invocations because they can't see the schema on startup, or while it takes time to arrive? I'm worried about the risk of data loss if a PP drops an invocation because it thinks the service has been removed (vs. running with old Meta information).
The PP won't drop any invocations it does not know about. What can happen is that the ingress rejects new invocations for some time until it learns about the updated schema information. What can also happen, and this is probably not so nice, is that the PP will try to invoke some invocations for which it does not know the deployment information. This would lead to a couple of retries, which causes some noise in the logs. To prevent this from happening, fetching the schema before starting the worker is probably a good idea. This won't solve the problem in all cases, but in most.
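A minimal sketch of what a blocking initial fetch could look like, assuming hypothetical `fetch_schemas`/`start_worker` stand-ins (not the actual restate APIs); the worker start-up is simply delayed until the first fetch succeeds or the retry budget is exhausted:

```rust
use std::time::Duration;

#[derive(Debug)]
struct SchemaUpdateCommand; // hypothetical stand-in for the real update command

#[derive(Debug)]
struct FetchError;

// Hypothetical stand-in for the grpc call against the admin/schema service.
async fn fetch_schemas() -> Result<Vec<SchemaUpdateCommand>, FetchError> {
    Ok(vec![SchemaUpdateCommand])
}

// Block worker start-up on an initial schema fetch, retrying with a growing
// backoff so that a slow admin node delays the worker instead of letting
// partition processors run with empty schema information.
async fn initial_schema_fetch(max_attempts: usize) -> Option<Vec<SchemaUpdateCommand>> {
    for attempt in 1..=max_attempts {
        match fetch_schemas().await {
            Ok(updates) => return Some(updates),
            Err(err) => {
                eprintln!("initial schema fetch attempt {attempt} failed: {err:?}");
                tokio::time::sleep(Duration::from_millis(250 * attempt as u64)).await;
            }
        }
    }
    None
}

#[tokio::main]
async fn main() {
    if let Some(updates) = initial_schema_fetch(5).await {
        println!("applying {} schema updates before starting the worker", updates.len());
        // start_worker(updates).await; // hypothetical
    }
}
```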
This sounds like a fragile design assumption that might not stand the test of time. The good news is that it won't cause problems at the moment, but I think we can have a more robust approach with the following design:
- Schema is versioned.
- When the schema is updated on an admin node, the admin node informs all nodes with the "ingress" role about the update, either as a notification of the higher version number or as a full push of the new schema object. Ideally, we finish the ingress version notification before we respond to user-requested schema changes, to achieve monotonic reads.
- Ingress writes its last-seen schema version into all messages going through bifrost or in RPCs to workers. A worker processing an invocation message compares the schema version; if it's running with an older version, it pulls the new schema before moving forward. If the worker has the same schema as the incoming message (whether from another PP or ingress), it can more confidently reject invocations if we need to do that in the future (see the sketch after this list).
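A rough sketch, with hypothetical names, of the version check a partition processor could perform on every incoming message under this design; `SchemaVersion` is assumed to be a monotonically increasing counter maintained by the admin node:

```rust
type SchemaVersion = u64;

struct Envelope {
    // Last schema version the sender (ingress or another PP) had seen.
    sender_schema_version: SchemaVersion,
    // ... payload fields elided
}

struct Worker {
    local_schema_version: SchemaVersion,
}

impl Worker {
    async fn on_message(&mut self, msg: Envelope) {
        // If the sender has seen a newer schema, pull it before processing,
        // which gives us monotonic reads across ingress and workers.
        if msg.sender_schema_version > self.local_schema_version {
            self.pull_schemas_up_to(msg.sender_schema_version).await;
        }
        // With matching versions, rejecting an invocation for an unknown
        // service is meaningful rather than a symptom of stale metadata.
        self.process(msg);
    }

    async fn pull_schemas_up_to(&mut self, version: SchemaVersion) {
        // Hypothetical: fetch and apply updates from the admin node.
        self.local_schema_version = version;
    }

    fn process(&mut self, _msg: Envelope) { /* ... */ }
}
```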
Sending the schema information version with the invocation messages is a good idea to achieve monotonic reads :-)
crates/node/src/roles/worker.rs
    }
}
Err(err) => {
    debug!("Failed fetching schema information: {err}")
Should this be a debug in the single-node setup? Why do we expect failures to happen?
In the single-node setup this should indeed not happen. It is a bit of a preparatory step for allowing the cluster controller and the worker role to be run as two separate processes.
I'd rather make this scream loud if we are confident that it shouldn't happen :)
By allowing the admin role to be separated from the worker role, which is the next step I intend to implement, it becomes possible for the worker to be unable to reach the admin.
package dev.restate.schema;

service Schema {
Maybe this can be the seed for the Metadata service? I also think we should postfix all grpc services with something to make them distinct from potential message types, perhaps with `Svc` at the end or something.
Note: the postfix comment isn't related to this particular PR.
Yes, I thought so too concerning the Metadata service. Should we start right away calling it `MetadataSvc` instead? Given that these services are currently only used internally, it is probably also possible to rename them later.
Re postfix: +1, will change it.
This introduces a central system to manage long-running and background restate async tasks. The core of this proposal is to help us lean more towards spawning self-contained tasks that are addressable, trackable, and cancellable, rather than the current deep future poll trees. It also allows nice possibilities like:

- Limited structured concurrency (no auto-waiting for children though)
- Graceful cancellations/shutdown, reducing the risk of cancellation-unsafe drops
- Potentially less memory, as deeply nested future state machines become shallower
- A single place where we can auto-propagate tracing context, flag tasks by priority, schedule tasks on different runtimes, provide observability of what kinds of tasks are running, etc.
- Scoped tasks allow scoped cancellation (cooperatively cancel all tasks for a partition id, or a specific tenant in the future, or be specific and filter specific kinds of tasks)
- Limiting the concurrency of certain tasks by kind, partition, tenant, etc.
- Distributing tasks among multiple tokio runtimes based on the task kind
- Support for different abort policies based on the task kind

At the moment, I have only migrated a few small pieces of our code to this system to make the merging with #1180 easier. Once #1180 is merged, I'll move the rest of our services and bifrost to use it.
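A minimal sketch of the kind of API such a task center could expose, assuming tokio and tokio-util; all names are illustrative, not the actual implementation:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio_util::sync::CancellationToken;

// Illustrative task kinds; the real system would distinguish many more.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum TaskKind {
    PartitionProcessor,
    Background,
}

#[derive(Default, Clone)]
struct TaskCenter {
    inner: Arc<Mutex<HashMap<u64, (TaskKind, CancellationToken)>>>,
}

impl TaskCenter {
    // Spawn a self-contained task that is addressable (by id), trackable
    // (by kind), and cooperatively cancellable via its token.
    fn spawn<F, Fut>(&self, id: u64, kind: TaskKind, f: F)
    where
        F: FnOnce(CancellationToken) -> Fut,
        Fut: std::future::Future<Output = ()> + Send + 'static,
    {
        let token = CancellationToken::new();
        self.inner.lock().unwrap().insert(id, (kind, token.clone()));
        tokio::spawn(f(token));
    }

    // Scoped cancellation: cooperatively cancel all tasks of a given kind.
    fn cancel_kind(&self, kind: TaskKind) {
        for (_, (k, token)) in self.inner.lock().unwrap().iter() {
            if *k == kind {
                token.cancel();
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let tc = TaskCenter::default();
    tc.spawn(1, TaskKind::Background, |token| async move {
        // Graceful shutdown point instead of an abrupt future drop.
        token.cancelled().await;
        println!("task 1 shutting down gracefully");
    });
    tc.cancel_kind(TaskKind::Background);
    tokio::task::yield_now().await; // let the task observe cancellation
}
```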
The current CI run fails because of conflicting changes on …
Force-pushed c90de31 to ec89678
🚀
Force-pushed 2849b1c to d8cb5cf
As a first step to move the Meta service into the cluster controller role, this commit removes the restate_worker_api::Handle dependency from this component.
This commit decouples the meta and admin components from the worker and moves them over to the cluster controller role. The worker learns about schema information changes by periodically asking the schema service for updates. At the moment, every schema information fetch request sends the whole list of schema update commands over. Currently, it is not possible to query the storage via the Admin service; this will be fixed in a follow-up commit.
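A minimal sketch of the periodic fetch loop described above, with hypothetical fetch/apply stand-ins for the worker's actual schema plumbing:

```rust
use std::time::Duration;

// Hypothetical stand-ins for the worker's schema plumbing.
struct SchemaUpdateCommand;

async fn fetch_schema_updates() -> Result<Vec<SchemaUpdateCommand>, String> {
    Ok(Vec::new()) // in reality: grpc call to the admin/schema service
}

fn apply_schema_updates(updates: Vec<SchemaUpdateCommand>) {
    let _ = updates; // in reality: update the worker's in-memory schema registry
}

#[tokio::main]
async fn main() {
    let mut interval = tokio::time::interval(Duration::from_secs(5));
    loop {
        interval.tick().await;
        match fetch_schema_updates().await {
            // Note: today the whole list of schema update commands is sent
            // on every fetch; a versioned/incremental protocol could avoid this.
            Ok(updates) => apply_schema_updates(updates),
            Err(err) => eprintln!("failed fetching schema information: {err}"),
        }
    }
}
```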
The QueryStorage grpc method allows querying the storage of a worker. It returns the result as arrow-flight encoded RecordBatches. The admin service uses this method to forward queries it receives on /query and to create a response for them.
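For a sense of what consuming such an endpoint looks like, a sketch of an arrow-flight `do_get` client that decodes the stream into RecordBatches, assuming the arrow-flight, tonic, futures, and tokio crates; the endpoint and ticket contents are illustrative (the PR notes the grpc address is currently hard-coded to 127.0.0.1:5122):

```rust
use arrow_flight::decode::FlightRecordBatchStream;
use arrow_flight::error::FlightError;
use arrow_flight::flight_service_client::FlightServiceClient;
use arrow_flight::Ticket;
use futures::TryStreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Illustrative endpoint, matching the currently hard-coded address.
    let mut client = FlightServiceClient::connect("http://127.0.0.1:5122").await?;

    // The query travels as the ticket payload (illustrative encoding).
    let ticket = Ticket {
        ticket: "SELECT * FROM state LIMIT 10".into(),
    };
    let response = client.do_get(ticket).await?;

    // Decode the stream of FlightData into Arrow RecordBatches.
    let mut batches = FlightRecordBatchStream::new_from_flight_data(
        response.into_inner().map_err(FlightError::from),
    );
    while let Some(batch) = batches.try_next().await? {
        println!("received batch with {} rows", batch.num_rows());
    }
    Ok(())
}
```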
This commit renames the following grpc services:
- Worker -> WorkerSvc
- ClusterController -> ClusterControllerSvc
- NodeCtrl -> NodeCtrlSvc
This commit enables the admin service to push schema updates to the worker whenever schemas change.
Force-pushed d8cb5cf to 429b74a
This PR moves the `Admin` and `Meta` components from the worker role into the cluster controller role. For this to happen, we needed to do the following things:

- Expose the killing/cancelling of invocations so that the `Admin` service can use it, instead of `Meta` depending on `restate_worker_api::Handle`
- Add a QueryStorage grpc method to the worker and change `Admin` to use this API to query the storage. The query results are sent as arrow-flight encoded RecordBatches.

Currently, the grpc address is hard-coded to 127.0.0.1:5122, which prevents the system from being run as separate processes.