-
Notifications
You must be signed in to change notification settings - Fork 61
Add quality of service to launch plan/execution spec #68
Conversation
Now that we are adopting quality of service should we keep https://github.com/lyft/flyteidl/blob/42929ee56bc48fc887ed372b468929d797f39643/protos/flyteidl/core/workflow.proto#L132 |
Actually I think we want to keep queueing budget because admin will still need to set it after interpreting the specified quality of service. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's great you started this work and I feel we might need to do a bit of a design here to make sure we get it right, otherwise we will end up with an unusable bit that's only used for categorical queuing budgets' values :)
A few questions to answer:
- Why 3? why not more or less?
- Should it be an absolute number to the service? or relative to your other workflows in the project? or somewhere in between?
- How to resolve conflicts between that attribute and the queuing budget/interruptible that you can set on the WF? does one win over the other?
- If it's relative to the other WFs in the project, will/should people be able to define what each of them mean for themselves? like my Service Quality High means 10min queuing budget but yours mean 1hr... for example
- If the values are absolute, we need to plan around how to ensure we have enough capacity to guarantee these classes for everyone... we don't need to answer this particular one now but we need to define the right indicators to use and what oncall should do... etc.
Maybe start a doc for defining what these mean?
protos/flyteidl/admin/common.proto
Outdated
@@ -245,3 +245,18 @@ message AuthRole { | |||
} | |||
} | |||
|
|||
// This general categorization specifies how much queue time an execution can tolerate. | |||
// This can be extended in the future to make other execution-time decisions. | |||
enum QualityOfService { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we adopted this model instead:
enum QualityOfService { | |
message QualityOfService { | |
enum Values { | |
UNDEFINED = 0; | |
... | |
} | |
} |
@@ -237,6 +237,8 @@ message ExecutionSpec { | |||
// Optional: auth override to apply this execution. | |||
AuthRole auth_role = 16; | |||
|
|||
// Indicates the amount of queueing a launched workflow execution can tolerate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it only about queuing? does it also control interruptible bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update doc?
protos/flyteidl/admin/common.proto
Outdated
// Guarantees that this execution will begin as soon as requested and incur no queueing time. | ||
QUALITY_OF_SERVICE_HIGH = 1; | ||
|
||
// This execution may incur some queueing delay (e.g. 30 minutes) and is medium priority. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not indicate time here in comments. The range currently is huge.
@EngHabu @migueltol22 @anandswaminathan mind taking another look? updated to reflect design decisions/discussions |
@@ -237,6 +237,8 @@ message ExecutionSpec { | |||
// Optional: auth override to apply this execution. | |||
AuthRole auth_role = 16; | |||
|
|||
// Indicates the amount of queueing a launched workflow execution can tolerate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update doc?
|
||
message QualityOfServiceSpec { | ||
// Indicates how much queueing delay an execution can tolerate. | ||
uint32 queueing_budget_mins = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use this instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
originally that's what I had but people commented that anything beyond minutes was too much granularity and unnecessary
one option is we could expose it as queueing_budget_mins in the SDK and then store that as a duration in the proto field but that can still be manipulated by non-SDK users. what do you think @EngHabu ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will go with the second option, updated to use duration and we can expose it as a field in mins only in the sdk cc @migueltol22
@@ -91,6 +92,9 @@ message LaunchPlanSpec { | |||
Auth auth = 8 [deprecated=true]; | |||
|
|||
AuthRole auth_role = 9; | |||
|
|||
// Indicates the amount of queueing a launched workflow execution can tolerate. | |||
QualityOfService quality_of_service = 16; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this will be part of the launchplan should we also remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we'll want to resolve the queueing budget based on the quality of service on the admin side before creating the workflow CRD so we should still keep the queueing_budget field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for the catch! updated to use QualityOfService in the workflow metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Will leave final approval to @EngHabu
TL;DR
This PR introduces a QualityOfService designation for workflow executions. This is primarily used to designate toleration for queueing time but can potentially be leveraged for other server-side execution configuration.
Type
Are all requirements met?
Complete description
https://docs.google.com/document/d/1rYqr01x0XL6n1AdmEOUbTIGeS9Z2hJCcnG5NMy8aQ68/edit#
Tracking Issue
flyteorg/flyte#343
Follow-up issue
NA