Skip to content
This repository has been archived by the owner on Oct 23, 2023. It is now read-only.

Add quality of service to launch plan/execution spec #68

Merged
merged 12 commits into from
Jul 8, 2020

Conversation

katrogan
Copy link
Contributor

@katrogan katrogan commented Jun 8, 2020

TL;DR

This PR introduces a QualityOfService designation for workflow executions. This is primarily used to designate toleration for queueing time but can potentially be leveraged for other server-side execution configuration.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

https://docs.google.com/document/d/1rYqr01x0XL6n1AdmEOUbTIGeS9Z2hJCcnG5NMy8aQ68/edit#

Tracking Issue

flyteorg/flyte#343

Follow-up issue

NA

@migueltol22
Copy link
Contributor

@katrogan
Copy link
Contributor Author

katrogan commented Jun 8, 2020

Actually I think we want to keep queueing budget because admin will still need to set it after interpreting the specified quality of service.

migueltol22
migueltol22 previously approved these changes Jun 9, 2020
Copy link
Contributor

@EngHabu EngHabu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's great you started this work and I feel we might need to do a bit of a design here to make sure we get it right, otherwise we will end up with an unusable bit that's only used for categorical queuing budgets' values :)
A few questions to answer:

  • Why 3? why not more or less?
  • Should it be an absolute number to the service? or relative to your other workflows in the project? or somewhere in between?
  • How to resolve conflicts between that attribute and the queuing budget/interruptible that you can set on the WF? does one win over the other?
  • If it's relative to the other WFs in the project, will/should people be able to define what each of them mean for themselves? like my Service Quality High means 10min queuing budget but yours mean 1hr... for example
  • If the values are absolute, we need to plan around how to ensure we have enough capacity to guarantee these classes for everyone... we don't need to answer this particular one now but we need to define the right indicators to use and what oncall should do... etc.

Maybe start a doc for defining what these mean?

@@ -245,3 +245,18 @@ message AuthRole {
}
}

// This general categorization specifies how much queue time an execution can tolerate.
// This can be extended in the future to make other execution-time decisions.
enum QualityOfService {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we adopted this model instead:

Suggested change
enum QualityOfService {
message QualityOfService {
enum Values {
UNDEFINED = 0;
...
}
}

@@ -237,6 +237,8 @@ message ExecutionSpec {
// Optional: auth override to apply this execution.
AuthRole auth_role = 16;

// Indicates the amount of queueing a launched workflow execution can tolerate.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it only about queuing? does it also control interruptible bit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update doc?

// Guarantees that this execution will begin as soon as requested and incur no queueing time.
QUALITY_OF_SERVICE_HIGH = 1;

// This execution may incur some queueing delay (e.g. 30 minutes) and is medium priority.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not indicate time here in comments. The range currently is huge.

@katrogan
Copy link
Contributor Author

katrogan commented Jun 9, 2020

@EngHabu thanks for the suggestions. I started a doc here to capture a few different implementation proposals and open questions.

@katrogan
Copy link
Contributor Author

@EngHabu @migueltol22 @anandswaminathan mind taking another look? updated to reflect design decisions/discussions

@katrogan
Copy link
Contributor Author

katrogan commented Jul 1, 2020

@@ -237,6 +237,8 @@ message ExecutionSpec {
// Optional: auth override to apply this execution.
AuthRole auth_role = 16;

// Indicates the amount of queueing a launched workflow execution can tolerate.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update doc?


message QualityOfServiceSpec {
// Indicates how much queueing delay an execution can tolerate.
uint32 queueing_budget_mins = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use this instead?

Copy link
Contributor Author

@katrogan katrogan Jul 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

originally that's what I had but people commented that anything beyond minutes was too much granularity and unnecessary

one option is we could expose it as queueing_budget_mins in the SDK and then store that as a duration in the proto field but that can still be manipulated by non-SDK users. what do you think @EngHabu ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will go with the second option, updated to use duration and we can expose it as a field in mins only in the sdk cc @migueltol22

@@ -91,6 +92,9 @@ message LaunchPlanSpec {
Auth auth = 8 [deprecated=true];

AuthRole auth_role = 9;

// Indicates the amount of queueing a launched workflow execution can tolerate.
QualityOfService quality_of_service = 16;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this will be part of the launchplan should we also remove this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we'll want to resolve the queueing budget based on the quality of service on the admin side before creating the workflow CRD so we should still keep the queueing_budget field

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for the catch! updated to use QualityOfService in the workflow metadata

Copy link
Contributor

@migueltol22 migueltol22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Will leave final approval to @EngHabu

@katrogan katrogan requested a review from EngHabu July 8, 2020 17:41
@katrogan katrogan merged commit e9727af into master Jul 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants