- FrameworkController DeepDive
- Framework Interop
- Framework ExecutionType
- Predefined Container EnvironmentVariable
- Pod Failure Classification
- Predefined CompletionCode
- CompletionStatus
- RetryPolicy
- FrameworkAttemptCompletionPolicy
- Framework ScaleUp/ScaleDown
- Large Scale Framework
- Framework and Pod History
- Framework and Task State Machine
- Framework Consistency vs Availability
- Controller Extension
- Best Practice
See FrameworkController DeepDive.pptx
As Framework is actually a Kubernetes CRD, all CRD Clients can be used to interoperate with it, such as:
- kubectl
  ```shell
  kubectl create -f {Framework File Path}
  # Note this is not Foreground Deletion, see the DELETE Framework section
  kubectl delete framework {FrameworkName}
  kubectl get framework {FrameworkName}
  kubectl describe framework {FrameworkName}
  kubectl get frameworks
  kubectl describe frameworks
  ...
  ```
- Kubernetes Client Library
- Any HTTP Client
| API Kind | Operations |
|---|---|
| Framework | CREATE DELETE GET LIST WATCH WATCH_LIST PATCH (Start, Stop, Add TaskRole, Delete TaskRole, Add/Delete Task) |
| ConfigMap | All operations except for CREATE PUT PATCH |
| Pod | All operations except for CREATE PUT PATCH |
Request
POST /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks
Body: Framework
Type: application/json or application/yaml
Description
Create the specified Framework.
Any ExecutionType can be specified to create the Framework.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | Return current Framework. |
| Created(201) | Framework | Return current Framework. |
| Accepted(202) | Framework | Return current Framework. |
| Conflict(409) | Status | The specified Framework already exists. |
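For example, a creation request could look like the sketch below, assuming the API server is reachable through `kubectl proxy` on `localhost:8001`, the Framework lives in the `default` namespace, and `framework.yaml` is a local Framework spec file (the port and file name are just placeholders):

```shell
# Expose the Kubernetes API server locally (assumes kubectl is already configured for the cluster).
kubectl proxy --port=8001 &

# POST the Framework spec to the frameworks collection of the default namespace.
curl -X POST \
  -H "Content-Type: application/yaml" \
  --data-binary @framework.yaml \
  http://localhost:8001/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks
```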
Request
PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Body:
[
{
"op": "replace",
"path": "/spec/executionType",
"value": "Start"
}
]
Type: application/json-patch+json
Description
Start the specified Framework, whose current ExecutionType should be Create.
Before the Start, the Framework will not start to run or complete, but the Framework object is already created; see Framework PreStart Example.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
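The same JSON Patch can also be applied with kubectl; for example, a sketch for the `prestart` Framework from the PreStart example below, assuming it lives in the current namespace:

```shell
# Flip the Framework's ExecutionType from Create to Start via a JSON Patch.
kubectl patch framework prestart --type='json' \
  -p='[{"op": "replace", "path": "/spec/executionType", "value": "Start"}]'
```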
Request
PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Body:
[
{
"op": "replace",
"path": "/spec/executionType",
"value": "Stop"
}
]
Type: application/json-patch+json
Description
Stop the specified Framework, whose current ExecutionType should be Create or Start.
After the Stop, the Framework will start to complete, but the Framework object will not be deleted; see Framework PostStop Example.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
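For example, through the raw HTTP API, a sketch assuming `kubectl proxy` on `localhost:8001`, the `default` namespace, and the `poststop` Framework from the PostStop example below:

```shell
# Stop the Framework by patching its ExecutionType to Stop.
curl -X PATCH \
  -H "Content-Type: application/json-patch+json" \
  -d '[{"op": "replace", "path": "/spec/executionType", "value": "Stop"}]' \
  http://localhost:8001/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks/poststop
```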
Request
PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Body:
Follow the TaskRoleSpec to fill in the $(TaskRoleSpec) placeholder below.
[
{
"op": "add",
"path": "/spec/taskRoles/-",
"value": $(TaskRoleSpec)
}
]
Type: application/json-patch+json
Description
Append the specified TaskRole to the specified Framework.
See more in Framework ScaleUp/ScaleDown.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
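For example, the sketch below appends a hypothetical TaskRole `b` to the `rescalebasic` Framework from the Framework Rescale Basic Example later in this document; the TaskRoleSpec fields simply mirror the YAML examples shown there:

```shell
# Append TaskRole "b" to the end of spec.taskRoles.
kubectl patch framework rescalebasic --type='json' -p='[
  {"op": "add", "path": "/spec/taskRoles/-", "value": {
    "name": "b",
    "taskNumber": 2,
    "frameworkAttemptCompletionPolicy": {"minFailedTaskCount": 2, "minSucceededTaskCount": -1},
    "task": {
      "retryPolicy": {"fancyRetryPolicy": false, "maxRetryCount": 0},
      "podGracefulDeletionTimeoutSec": 600,
      "pod": {
        "spec": {
          "restartPolicy": "Never",
          "containers": [
            {"name": "ubuntu", "image": "ubuntu:trusty",
             "command": ["sh", "-c", "printenv && sleep infinity"]}
          ]
        }
      }
    }
  }}
]'
```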
Request
PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Body:
Use the current array index of the $(TaskRoleName) TaskRole to fill in the $(TaskRoleIndex) placeholder below.
[
{
"op": "test",
"path": "/spec/taskRoles/$(TaskRoleIndex)/name",
"value": "$(TaskRoleName)"
},
{
"op": "remove",
"path": "/spec/taskRoles/$(TaskRoleIndex)"
}
]
Type: application/json-patch+json
Description
Delete the specified TaskRole from the specified Framework.
See more in Framework ScaleUp/ScaleDown.
Notes:
- It is better to first delete all Tasks in the TaskRole, wait until they are all deleted, and only then delete the whole TaskRole. Otherwise, the Tasks in the TaskRole will be deleted according to the last observed PodGracefulDeletionTimeoutSec in the TaskRoleStatus, because the PodGracefulDeletionTimeoutSec in the TaskSpec has already been deleted.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
| UnprocessableEntity(422) | Status | The `test` operation failed: the specified $(TaskRoleIndex) no longer matches the current array index of the $(TaskRoleName) TaskRole. |
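For example, to remove the hypothetical TaskRole `b` added above (a sketch, assuming it currently sits at array index 1), the `test` operation guards against the index having shifted:

```shell
# Verify that index 1 still holds TaskRole "b", then remove it; the patch fails with 422 otherwise.
kubectl patch framework rescalebasic --type='json' -p='[
  {"op": "test",   "path": "/spec/taskRoles/1/name", "value": "b"},
  {"op": "remove", "path": "/spec/taskRoles/1"}
]'
```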
Request
PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Body:
Use the current array index of the $(TaskRoleName) TaskRole to fill in the $(TaskRoleIndex) placeholder below.
[
{
"op": "test",
"path": "/spec/taskRoles/$(TaskRoleIndex)/name",
"value": "$(TaskRoleName)"
},
{
"op": "replace",
"path": "/spec/taskRoles/$(TaskRoleIndex)/taskNumber",
"value": $(TaskNumber)
}
]
Generally, you may also need to adjust the TaskRole's FrameworkAttemptCompletionPolicy according to the new $(TaskNumber). It is safe to do both in a single patch, per the Framework ScaleUp/ScaleDown Strong Safety Guarantee:
[
{
"op": "test",
"path": "/spec/taskRoles/$(TaskRoleIndex)/name",
"value": "$(TaskRoleName)"
},
{
"op": "replace",
"path": "/spec/taskRoles/$(TaskRoleIndex)/taskNumber",
"value": $(TaskNumber)
},
{
"op": "replace",
"path": "/spec/taskRoles/$(TaskRoleIndex)/frameworkAttemptCompletionPolicy/minSucceededTaskCount",
"value": $(MinSucceededTaskCount)
},
{
"op": "replace",
"path": "/spec/taskRoles/$(TaskRoleIndex)/frameworkAttemptCompletionPolicy/minFailedTaskCount",
"value": $(MinFailedTaskCount)
}
]
Type: application/json-patch+json
Description
Update the TaskNumber (and the FrameworkAttemptCompletionPolicy) of the specified TaskRole in the specified Framework.
See more in Framework ScaleUp/ScaleDown.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
| UnprocessableEntity(422) | Status | The `test` operation failed: the specified $(TaskRoleIndex) no longer matches the current array index of the $(TaskRoleName) TaskRole. |
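For example, the sketch below scales the `a` TaskRole of the `rescalebasic` Framework (assumed to sit at array index 0) to 8 Tasks and adjusts its completion policy in the same patch:

```shell
# Guard the TaskRole position with a test op, then update taskNumber and minFailedTaskCount together.
kubectl patch framework rescalebasic --type='json' -p='[
  {"op": "test",    "path": "/spec/taskRoles/0/name", "value": "a"},
  {"op": "replace", "path": "/spec/taskRoles/0/taskNumber", "value": 8},
  {"op": "replace", "path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount", "value": 8}
]'
```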
Request
DELETE /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Body:
- application/json:
  {
    "propagationPolicy": "Foreground"
  }
- application/yaml:
  propagationPolicy: Foreground
Type: application/json or application/yaml
Description
Delete the specified Framework.
Notes:
- If you need to achieve all the Framework ConsistencyGuarantees or achieve higher Framework Availability by leveraging the PodGracefulDeletionTimeoutSec, you should always use and only use the Foreground Deletion in the provided body.
- However, kubectl delete does not support specifying the Foreground Deletion, at least as of Kubernetes v1.23.2, so you may have to use another Supported Client.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | The specified Framework is deleting. Return current Framework. |
| OK(200) | Status | The specified Framework is deleted. |
| NotFound(404) | Status | The specified Framework is not found. |
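For example, a Foreground Deletion through the raw HTTP API might look like the sketch below, assuming `kubectl proxy` on `localhost:8001` and the `default` namespace:

```shell
# Delete the Framework with Foreground propagation, so its managed ConfigMap and Pods are deleted first.
curl -X DELETE \
  -H "Content-Type: application/json" \
  -d '{"propagationPolicy": "Foreground"}' \
  http://localhost:8001/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks/poststop
```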
Request
GET /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Description
Get the specified Framework.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
Request
GET /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks
GET /apis/frameworkcontroller.microsoft.com/v1/frameworks
QueryParameters: Same as StatefulSet QueryParameters
Description
Get all Frameworks (in the specified FrameworkNamespace).
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | FrameworkList | Return all Frameworks (in the specified FrameworkNamespace). |
Request
GET /apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
QueryParameters: Same as StatefulSet QueryParameters
Description
Watch the change events of the specified Framework.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | WatchEvent | Streaming the change events of the specified Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
Request
GET /apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/{FrameworkNamespace}/frameworks
GET /apis/frameworkcontroller.microsoft.com/v1/watch/frameworks
QueryParameters: Same as StatefulSet QueryParameters
Description
Watch the change events of all Frameworks (in the specified FrameworkNamespace).
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | WatchEvent | Streaming the change events of all Frameworks (in the specified FrameworkNamespace). |
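For example, the events can be streamed with curl (a sketch, assuming `kubectl proxy` on `localhost:8001` and the `default` namespace):

```shell
# -N disables output buffering, so WatchEvents are printed as soon as they arrive.
curl -N http://localhost:8001/apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/default/frameworks
```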
Framework ExecutionType can be specified to control the execution of the Framework:
- You can just Create Framework with the Create ExecutionType, which does not also start it at the same time.
  - This is useful when you need to do some PreStart actions that depend on the Framework object, see Framework PreStart Example. Once these actions are done, you can safely Start Framework.
- You can just Stop Framework, which does not also delete it at the same time.
  - This is useful when you need to do some PostStop actions that depend on the Framework object, see Framework PostStop Example. Once these actions are done, you can safely Delete Framework.
In this example, you need to run a Framework which depends on a ServiceAccount, but the ServiceAccount also depends on the Framework object as its OwnerReference, so you cannot directly Create Framework with ExecutionType Start.
- Create Framework with the Create ExecutionType and a ServiceAccount reference as below, then the Framework will stay as AttemptCreationPending:
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
name: prestart
spec:
executionType: Create
retryPolicy:
fancyRetryPolicy: false
maxRetryCount: 0
taskRoles:
- name: a
taskNumber: 4
frameworkAttemptCompletionPolicy:
minFailedTaskCount: 4
minSucceededTaskCount: 1
task:
retryPolicy:
fancyRetryPolicy: false
maxRetryCount: 0
podGracefulDeletionTimeoutSec: 600
pod:
spec:
restartPolicy: Never
serviceAccountName: prestart
containers:
- name: ubuntu
image: ubuntu:trusty
command: ["sh", "-c", "printenv && sleep infinity"]
- Use the above creation response's metadata.uid to fill in the {{FrameworkUID}} placeholder below, and Create ServiceAccount with the above Framework reference as below:
apiVersion: v1
kind: ServiceAccount
metadata:
name: prestart
ownerReferences:
- apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
name: prestart
uid: {{FrameworkUID}}
controller: true
blockOwnerDeletion: true
- Start Framework, then the Framework will start to run successfully.
- Delete Framework, then both the Framework and above ServiceAccount will be deleted.
In this example, you need to stop a Framework whose final stopped Framework object needs to be pushed to/pulled by external systems, so you cannot directly Delete Framework.
- Create Framework as below:
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
name: poststop
spec:
executionType: Start
retryPolicy:
fancyRetryPolicy: false
maxRetryCount: 0
taskRoles:
- name: a
taskNumber: 4
frameworkAttemptCompletionPolicy:
minFailedTaskCount: 4
minSucceededTaskCount: 1
task:
retryPolicy:
fancyRetryPolicy: false
maxRetryCount: 0
podGracefulDeletionTimeoutSec: 600
pod:
spec:
restartPolicy: Never
containers:
- name: ubuntu
image: ubuntu:trusty
command: ["sh", "-c", "printenv && sleep infinity"]
- Stop Framework, then the Framework will be stopped, i.e. FrameworkCompleted.
- Get Framework, and archive it into a database first.
- Delete Framework, then the Framework will be deleted.
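The three steps above could be scripted roughly as follows (a sketch, assuming kubectl is configured for the right namespace; the archive file name is just illustrative):

```shell
# 1. Stop the Framework.
kubectl patch framework poststop --type='json' \
  -p='[{"op": "replace", "path": "/spec/executionType", "value": "Stop"}]'

# 2. After the Framework becomes FrameworkCompleted, get it and archive the final object.
kubectl get framework poststop -o yaml > poststop-final.yaml

# 3. Delete the Framework (see the DELETE Framework notes about Foreground Deletion).
kubectl delete framework poststop
```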
Predefined Container EnvironmentVariable
You can specify how to classify and summarize Pod failures by the PodFailureSpec.
You can also directly leverage the Default PodFailureSpec.
You can leverage the Predefined CompletionCode to instruct your RetryPolicy and to identify a certain predefined CompletionCode, regardless of which PodFailureSpec may be configured in different clusters.
CompletionStatus: It is generated from the Predefined CompletionCode or PodPattern matching. For a Pod, if no PodPattern is matched and a failed Container exists, the CompletionCode is the same as the ExitCode of the last failed Container.
TaskAttemptCompletionStatus: Besides the CompletionStatus, it also provides more detailed and structured diagnostic information about the completion of a TaskAttempt.
FrameworkAttemptCompletionStatus: Besides the CompletionStatus, it also provides more detailed and structured diagnostic information about the completion of a FrameworkAttempt.
Notes:
- Italic Conditions still need to be specified explicitly, as we have not supported the Framework Spec Defaulting yet.
- For the definition of each CompletionType, such as Transient Failed, see CompletionStatus.
| FrameworkType | Framework RetryPolicy | TaskRole | Task RetryPolicy | Description |
|---|---|---|---|---|
| DEFAULT | *FancyRetryPolicy = false*<br>*MaxRetryCount = 0* | TaskRole-A | *FancyRetryPolicy = false*<br>*MaxRetryCount = 0* | The default RetryPolicy:<br>Never Retry for any Failed or Succeeded. |
| | | TaskRole-B | *FancyRetryPolicy = false*<br>*MaxRetryCount = 0* | |
| Service | FancyRetryPolicy = false<br>MaxRetryCount = -2 | TaskRole-A | FancyRetryPolicy = false<br>MaxRetryCount = -2 | Always Retry for any Failed or Succeeded. |
| Blind Batch | FancyRetryPolicy = false<br>MaxRetryCount = -1 | TaskRole-A | FancyRetryPolicy = false<br>MaxRetryCount = -1 | Always Retry for any Failed.<br>Never Retry for Succeeded. |
| Batch with Task Fault Tolerance | FancyRetryPolicy = true<br>MaxRetryCount = 3 | TaskRole-A | FancyRetryPolicy = true<br>MaxRetryCount = 3 | Always Retry for Transient Failed.<br>Never Retry for Permanent Failed or Succeeded.<br>Retry up to 3 times for Unknown Failed. |
| Batch without Task Fault Tolerance | FancyRetryPolicy = true<br>MaxRetryCount = 3 | TaskRole-A | FancyRetryPolicy = false<br>MaxRetryCount = 0 | For the Framework RetryPolicy, same as "Batch with Task Fault Tolerance".<br>For the Task RetryPolicy, because the Task cannot tolerate any failed TaskAttempt, such as not being able to recover from a previous failed TaskAttempt, Never Retry the Task for any Failed or Succeeded. |
| Debug Mode | FancyRetryPolicy = true<br>MaxRetryCount = 0 | TaskRole-A | FancyRetryPolicy = true<br>MaxRetryCount = 0 | Always Retry for Transient Failed.<br>Never Retry for Permanent Failed, Unknown Failed or Succeeded.<br>This can help to capture the unexpected exit of the user application itself. |
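For instance, the "Batch with Task Fault Tolerance" row could be expressed as a Framework spec along the lines of the sketch below (the Framework name, TaskRole name and container command are placeholders; the completion policy shown is simply the default one, and the other fields follow the YAML examples in this document):

```shell
# Create a batch Framework that always retries Transient Failed and retries Unknown Failed up to 3 times.
kubectl create -f - <<'EOF'
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
  name: faulttolerantbatch
spec:
  executionType: Start
  retryPolicy:
    fancyRetryPolicy: true
    maxRetryCount: 3
  taskRoles:
  - name: worker
    taskNumber: 4
    frameworkAttemptCompletionPolicy:
      minFailedTaskCount: 1
      minSucceededTaskCount: -1
    task:
      retryPolicy:
        fancyRetryPolicy: true
        maxRetryCount: 3
      pod:
        spec:
          restartPolicy: Never
          containers:
          - name: ubuntu
            image: ubuntu:trusty
            command: ["sh", "-c", "echo batch work done"]
EOF
```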
Notes:
- Italic Conditions still need to be specified explicitly, as we have not supported the Framework Spec Defaulting yet.
| FrameworkType | TaskRole | FrameworkAttemptCompletionPolicy | Description |
|---|---|---|---|
| DEFAULT | TaskRole-A | *MinFailedTaskCount = 1*<br>*MinSucceededTaskCount = -1* | The default FrameworkAttemptCompletionPolicy:<br>Fail the FrameworkAttempt immediately if any Task failed.<br>Succeed the FrameworkAttempt only until all Tasks succeeded. |
| | TaskRole-B | *MinFailedTaskCount = 1*<br>*MinSucceededTaskCount = -1* | |
| Service | TaskRole-A | MinFailedTaskCount = 1<br>MinSucceededTaskCount = -1 | Actually, any FrameworkAttemptCompletionPolicy is fine, since a Service's Task will never complete, i.e. its Task's MaxRetryCount is -2, see RetryPolicy Example. |
| MapReduce | Map | MinFailedTaskCount = {Map.TaskNumber} * {mapreduce.map.failures.maxpercent} + 1<br>MinSucceededTaskCount = -1 | A few failed Tasks are acceptable, but we always want to wait for all Tasks to succeed:<br>Fail the FrameworkAttempt immediately if the failed Tasks exceed the limit.<br>Succeed the FrameworkAttempt only until all Tasks completed and the failed Tasks are within the limit. |
| | Reduce | MinFailedTaskCount = {Reduce.TaskNumber} * {mapreduce.reduce.failures.maxpercent} + 1<br>MinSucceededTaskCount = -1 | |
| Master Dominated: PyTorch Training with explicit master role, TensorFlow Training with explicit chief role | Master/Chief | MinFailedTaskCount = 1<br>MinSucceededTaskCount = 1 | The completion is fully determined by the single instance master of the user application:<br>Fail the FrameworkAttempt immediately if the master failed.<br>Succeed the FrameworkAttempt immediately if the master succeeded. |
| | Worker | MinFailedTaskCount = -1<br>MinSucceededTaskCount = -1 | |
| All Workers Dominated: PyTorch Training without explicit master role, TensorFlow Training without explicit chief role | ParameterServer/None | MinFailedTaskCount = 1<br>MinSucceededTaskCount = -1 | Fail the FrameworkAttempt immediately if any Task failed.<br>Succeed the FrameworkAttempt only until all workers succeeded. |
| | Worker | MinFailedTaskCount = 1<br>MinSucceededTaskCount = {Worker.TaskNumber} | |
| Any Worker Dominated: PyTorch Elastic Training | Worker | MinFailedTaskCount = {Worker.TaskNumber}<br>MinSucceededTaskCount = 1 | Fail the FrameworkAttempt only until all workers failed.<br>Succeed the FrameworkAttempt immediately if any worker succeeded. |
Framework ScaleUp/ScaleDown (Rescale) refers to taking any of the below actions for an existing Framework on the fly:
- Add/Delete TaskRole without touching other TaskRoles.
- Add/Delete Task without touching other Tasks.
Before you start to Rescale a Framework, make sure the application executed by the Framework can tolerate Rescale, such as:
- Your application should be able to rebalance its workload after Rescale:
  - A Service application may need to rebalance its serving traffic by a load balancer, and/or re-replicate its state.
  - A Batch application may need to rebalance its work items by a work queue, and/or adjust its Task membership, like PyTorch Elastic Training.
- A Batch application had better not rerun after ScaleUp:
  - A rerun may happen if the application has already completed, but the completion event has not yet been observed by FrameworkController. So, during this period, if you ScaleUp the Framework, the application may rerun.
  - To mitigate it, the ScaleUp Task can immediately complete itself by leveraging an empty work queue or an existing checkpoint from a previous run.
- A Batch application had better not succeed too early after ScaleDown:
  - A too-early success may happen if all Tasks succeeded except one Task still running, but you ScaleDown the running Task.
  - To resolve it, make sure it is safe to ScaleDown the running Task, such as by leveraging the Master Dominated or Any Worker Dominated FrameworkType in FrameworkAttemptCompletionPolicy. For the Master Dominated FrameworkType, a master TaskRole is needed and the master TaskRole should not be ScaledDown. For the Any Worker Dominated FrameworkType, an exit barrier may be needed to ensure that any worker succeeding means the whole application has already succeeded, like PyTorch Elastic Training.
This example will demonstrate the basic usage of Framework Rescale, as well as its Strong Safety Guarantee.
- Create Framework as below, and wait until all Tasks are AttemptRunning:
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
name: rescalebasic
spec:
executionType: Start
retryPolicy:
fancyRetryPolicy: false
maxRetryCount: 0
taskRoles:
- name: a
taskNumber: 4
frameworkAttemptCompletionPolicy:
minFailedTaskCount: 4
minSucceededTaskCount: 1
task:
retryPolicy:
fancyRetryPolicy: false
maxRetryCount: 0
podGracefulDeletionTimeoutSec: 600
pod:
spec:
restartPolicy: Never
containers:
- name: ubuntu
image: ubuntu:trusty
command: ["sh", "-c", "printenv && sleep infinity"]
- Delete Pods `rescalebasic-a-2` and `rescalebasic-a-3`, and wait until their Tasks are Completed (Failed).
- ScaleDown Framework: Decrease the taskNumber and minFailedTaskCount from 4 to 2 by the below patch:
[
{
"op": "test",
"path": "/spec/taskRoles/0/name",
"value": "a"
},
{
"op": "replace",
"path": "/spec/taskRoles/0/taskNumber",
"value": 2
},
{
"op": "replace",
"path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount",
"value": 2
}
]
- The Completed Tasks `rescalebasic-a-2` and `rescalebasic-a-3` are immediately deleted, and the Framework stays AttemptRunning.
  - This demonstrates SafetyGuarantee1, as otherwise the old Failed Tasks `rescalebasic-a-2` and `rescalebasic-a-3` may be wrongly counted toward the new FrameworkAttemptCompletionPolicy.minFailedTaskCount (2) and trigger the completion.
- ScaleUp Framework: Increase the taskNumber and minFailedTaskCount from 2 to 4 by the below patch, and wait until all Tasks are AttemptRunning:
[
{
"op": "test",
"path": "/spec/taskRoles/0/name",
"value": "a"
},
{
"op": "replace",
"path": "/spec/taskRoles/0/taskNumber",
"value": 4
},
{
"op": "replace",
"path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount",
"value": 4
}
]
- Delete Pod `rescalebasic-a-2`, and wait until its Task is Completed (Failed).
- Redo Step 3 (the ScaleDown above), and wait until the `rescalebasic-a-2` and `rescalebasic-a-3` Tasks are DeletionPending, but before `rescalebasic-a-3` is deleted, immediately ScaleUp Framework: Increase the taskNumber and minFailedTaskCount from 2 to 3 by the below patch:
[
{
"op": "test",
"path": "/spec/taskRoles/0/name",
"value": "a"
},
{
"op": "replace",
"path": "/spec/taskRoles/0/taskNumber",
"value": 3
},
{
"op": "replace",
"path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount",
"value": 3
}
]
- The Completed Task `rescalebasic-a-2` is immediately replaced by a new Task instance, the AttemptRunning Task `rescalebasic-a-3` is eventually deleted, and the Framework stays AttemptRunning.
  - This demonstrates SafetyGuarantee2, as otherwise the previous ScaleDown Task `rescalebasic-a-2` may be wrongly reused in a later ScaleUp.
- ScaleUp Framework: Increase the taskNumber and minFailedTaskCount from 3 to 4 by the below patch, and wait until all Tasks are AttemptRunning:
[
{
"op": "test",
"path": "/spec/taskRoles/0/name",
"value": "a"
},
{
"op": "replace",
"path": "/spec/taskRoles/0/taskNumber",
"value": 4
},
{
"op": "replace",
"path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount",
"value": 4
}
]
- Delete Pod `rescalebasic-a-2`, and wait until its Task is Completed (Failed).
- Redo Step 3 (the ScaleDown above), and wait until the `rescalebasic-a-2` and `rescalebasic-a-3` Tasks are DeletionPending, but before `rescalebasic-a-3` is deleted, immediately ScaleUp Framework: Increase the taskNumber and minFailedTaskCount from 2 to 5 by the below patch:
[
{
"op": "test",
"path": "/spec/taskRoles/0/name",
"value": "a"
},
{
"op": "replace",
"path": "/spec/taskRoles/0/taskNumber",
"value": 5
},
{
"op": "replace",
"path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount",
"value": 5
}
]
- The Completed Task `rescalebasic-a-2` is immediately replaced by a new Task instance, the AttemptRunning Task `rescalebasic-a-3` is eventually replaced by a new Task instance, the new Task `rescalebasic-a-4` is immediately added, and the Framework stays AttemptRunning.
  - This demonstrates SafetyGuarantee2, as otherwise the previous ScaleDown Tasks `rescalebasic-a-2` and `rescalebasic-a-3` may be wrongly reused in a later ScaleUp.
PyTorch Elastic Training On FrameworkController
ScaleUp Pipeline:
- As soon as FrameworkController observes a ScaleUp request for a Framework that is not Completing/Completed, it will immediately mark the ScaleUp Task as AttemptCreationPending and then persist (expose) the Task, before taking any other action for the Framework, such as starting to create its TaskAttempt or evaluating the Task's impact on the Framework, such as the FrameworkAttemptCompletionPolicy.
- As soon as the AttemptCreationPending Task is persisted (exposed), the Task will impact its Framework in the future, such as:
- The Task will be considered in FrameworkAttemptCompletionPolicy.
- The Task will be deleted in later ScaleDown.
- Only after that will FrameworkController start to create the TaskAttempt of the AttemptCreationPending Task.
ScaleDown Pipeline:
- As soon as FrameworkController observes a ScaleDown request for a Framework that is not Completing/Completed, it will immediately mark the ScaleDown Task as DeletionPending and then persist (expose) the Task, before taking any other action for the Framework, such as starting to delete the Task or evaluating the Task's impact on the Framework, such as the FrameworkAttemptCompletionPolicy.
- As soon as the DeletionPending Task is persisted (exposed), the Task will no longer impact its Framework in the future, except for the Task's own graceful deletion period, such as:
- The Task will never be considered in FrameworkAttemptCompletionPolicy.
- The Task will never be reused in later ScaleUp.
- Only after that will FrameworkController start to delete the DeletionPending Task.
- After the Framework is Completing/Completed, it may still contain DeletionPending Tasks, but these Tasks must be Completed.
Notes:
- The user also needs to explicitly ignore the DeletionPending Tasks if they do not care about the Task's graceful deletion period, as FrameworkBarrier does.
Besides these general Framework ConsistencyGuarantees, FrameworkController also provides the below Strong Safety Guarantees for Framework Rescale:
- SafetyGuarantee1:
  - The user can always safely Rescale a Framework and update its other Spec within a single Framework update, as if the Rescale were already done, such as also updating the FrameworkAttemptCompletionPolicy according to the new scale, because:
    - The ScaleUp Task will be immediately considered, and the ScaleDown Task will be immediately ignored, for the updated Framework.
- SafetyGuarantee2:
  - The user can always safely Rescale before any previous Rescale has totally finished (including the final deletion of a ScaleDown Task), because:
    - For a ScaleUp immediately followed by a ScaleDown: If the user observed a new non-DeletionPending Task (caused by the ScaleUp), the later ScaleDown will delete it (i.e. the ScaleUp is committed); otherwise, the later ScaleDown may not delete it, but the previous ScaleUp must not have impacted it, such as having started to create its TaskAttempt (i.e. the ScaleUp is rolled back).
    - For a ScaleDown immediately followed by a ScaleUp: If the user observed a new DeletionPending Task (caused by the ScaleDown), the later ScaleUp will not reuse it (i.e. the ScaleDown is committed); otherwise, the later ScaleUp may reuse it, but the previous ScaleDown must not have impacted it, such as having started to delete it (i.e. the ScaleDown is rolled back).

See the Framework Rescale Basic Example for a demonstration of these Strong Safety Guarantees.
To safely run a large scale Framework, i.e. one whose total Task number is greater than 300, you just need to enable the LargeFrameworkCompression. However, you may also need to decompress the Framework by yourself.
By leveraging the LogObjectSnapshot, external systems, such as Fluentd and ElasticSearch, can collect and process Framework, Task and Pod history snapshots, even if the object was retried or deleted, such as for persistence, metrics conversion, visualization, alerting, acting, analysis, etc.
For a specific Task identified by {FrameworkName}-{TaskRoleName}-{TaskIndex}:
- ConsistencyGuarantee1: At most one instance of the Task is running at any point in time.
- ConsistencyGuarantee2: No instance of the Task is running if it is deleted (or does not exist), TaskAttemptCompleted, TaskCompleted or the whole Framework is deleted (or does not exist).
For a specific Framework identified by {FrameworkName}:
- ConsistencyGuarantee3: At most one instance of the Framework is running at any point in time.
- ConsistencyGuarantee4: No instance of the Framework is running if it is FrameworkAttemptCompleted, FrameworkCompleted or the whole Framework is deleted (or does not exist).
The default behavior is to achieve all the ConsistencyGuarantees, as long as you do not explicitly violate the guidelines below:
- Achieve ConsistencyGuarantee1:
  Do not force delete the managed Pod:
  - Do not set the PodGracefulDeletionTimeoutSec to be not nil.
    For example, the default PodGracefulDeletionTimeoutSec is acceptable.
  - Do not delete the managed Pod with 0 GracePeriodSeconds.
    For example, the default Pod deletion is acceptable.
  - Do not delete the Node which runs the managed Pod.
    For example, draining the Node before deleting it is acceptable.

  The Task running instance can be universally located by its TaskAttemptInstanceUID or PodUID.

  To avoid the Pod being stuck in deletion forever, such as if its Node is down forever, leverage the same approach as Delete StatefulSet Pod only after the Pod termination has been confirmed, manually or by your Cloud Controller Manager.
- Achieve ConsistencyGuarantee2, ConsistencyGuarantee3 and ConsistencyGuarantee4:
  - Achieve ConsistencyGuarantee1.
  - Must delete the managed ConfigMap with Foreground PropagationPolicy.
    For example, the default ConfigMap deletion is acceptable.
  - Must delete the Framework with Foreground PropagationPolicy.
    For example, the default Framework deletion may not be acceptable, since the default PropagationPolicy for the Framework object may be Background.
  - Do not change the OwnerReferences of the managed ConfigMap and Pods.

  The Framework running instance can be universally located by its FrameworkAttemptInstanceUID or ConfigMapUID.
According to the CAP theorem, in the presence of a network partition, you cannot achieve both consistency and availability at the same time in any distributed system. So you have to make a trade-off between the Framework Consistency and the Framework Availability.
You can tune the trade-off, such as to achieve higher Framework Availability by sacrificing the Framework Consistency:
- Decrease Pod TolerationSeconds for TaintBasedEvictions
- Increase Node Eviction Rate
- Set a small PodGracefulDeletionTimeoutSec
- Violate other guidelines mentioned in How to achieve ConsistencyGuarantees, such as manually force deleting a problematic Pod.
See more in:
- Usage
- Example: TensorFlow ParameterServer Training Example, etc.