- FrameworkController DeepDive
- Framework Interop
- Framework ExecutionType
- Predefined Container EnvironmentVariable
- Pod Failure Classification
- Predefined CompletionCode
- CompletionStatus
- RetryPolicy
- FrameworkAttemptCompletionPolicy
- Framework ScaleUp/ScaleDown
- Large Scale Framework
- Framework and Pod History
- Framework and Task State Machine
- Framework Consistency vs Availability
- Controller Extension
- Best Practice
See FrameworkController DeepDive.pptx
As Framework is actually a Kubernetes CRD, all CRD Clients can be used to interoperate with it, such as:
- kubectl
  ```shell
  kubectl create -f {Framework File Path}
  # Note this is not Foreground Deletion, see the DELETE Framework section
  kubectl delete framework {FrameworkName}
  kubectl get framework {FrameworkName}
  kubectl describe framework {FrameworkName}
  kubectl get frameworks
  kubectl describe frameworks
  ...
  ```
- Kubernetes Client Library
- Any HTTP Client
| API Kind | Operations |
|---|---|
| Framework | CREATE DELETE GET LIST WATCH WATCH_LIST PATCH (Start, Stop, Add TaskRole, Delete TaskRole, Add/Delete Task) |
| ConfigMap | All operations except for CREATE PUT PATCH |
| Pod | All operations except for CREATE PUT PATCH |
Request
POST /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks
Body: Framework
Type: application/json or application/yaml
Description
Create the specified Framework.
Any ExecutionType can be specified to create the Framework.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | Return current Framework. |
| Created(201) | Framework | Return current Framework. |
| Accepted(202) | Framework | Return current Framework. |
| Conflict(409) | Status | The specified Framework already exists. |
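For example, a creation request could look like the sketch below, assuming the API server is reachable through `kubectl proxy` on `localhost:8001`, the Framework lives in the `default` namespace, and `framework.yaml` is a local Framework spec file (the port and file name are just placeholders):

```shell
# Expose the Kubernetes API server locally (assumes kubectl is already configured for the cluster).
kubectl proxy --port=8001 &

# POST the Framework spec to the frameworks collection of the default namespace.
curl -X POST \
  -H "Content-Type: application/yaml" \
  --data-binary @framework.yaml \
  http://localhost:8001/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks
```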
Request
PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Body:
[
{
"op": "replace",
"path": "/spec/executionType",
"value": "Start"
}
]
Type: application/json-patch+json
Description
Start the specified Framework, whose current ExecutionType should be Create.
Before the Start, the Framework will not start to run or complete, but the Framework object is already created; see Framework PreStart Example.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
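The same JSON Patch can also be applied with kubectl; for example, a sketch for the `prestart` Framework from the PreStart example below, assuming it lives in the current namespace:

```shell
# Flip the Framework's ExecutionType from Create to Start via a JSON Patch.
kubectl patch framework prestart --type='json' \
  -p='[{"op": "replace", "path": "/spec/executionType", "value": "Start"}]'
```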
Request
PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Body:
[
{
"op": "replace",
"path": "/spec/executionType",
"value": "Stop"
}
]
Type: application/json-patch+json
Description
Stop the specified Framework, whose current ExecutionType should be Create or Start.
After the Stop, the Framework will start to complete, but the Framework object will not be deleted; see Framework PostStop Example.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
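For example, through the raw HTTP API, a sketch assuming `kubectl proxy` on `localhost:8001`, the `default` namespace, and the `poststop` Framework from the PostStop example below:

```shell
# Stop the Framework by patching its ExecutionType to Stop.
curl -X PATCH \
  -H "Content-Type: application/json-patch+json" \
  -d '[{"op": "replace", "path": "/spec/executionType", "value": "Stop"}]' \
  http://localhost:8001/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks/poststop
```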
Request
PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Body:
Follow the TaskRoleSpec to fill in the $(TaskRoleSpec) placeholder below.
[
{
"op": "add",
"path": "/spec/taskRoles/-",
"value": $(TaskRoleSpec)
}
]
Type: application/json-patch+json
Description
Append the specified TaskRole to the specified Framework.
See more in Framework ScaleUp/ScaleDown.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
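For example, the sketch below appends a hypothetical TaskRole `b` to the `rescalebasic` Framework from the Framework Rescale Basic Example later in this document; the TaskRoleSpec fields simply mirror the YAML examples shown there:

```shell
# Append TaskRole "b" to the end of spec.taskRoles.
kubectl patch framework rescalebasic --type='json' -p='[
  {"op": "add", "path": "/spec/taskRoles/-", "value": {
    "name": "b",
    "taskNumber": 2,
    "frameworkAttemptCompletionPolicy": {"minFailedTaskCount": 2, "minSucceededTaskCount": -1},
    "task": {
      "retryPolicy": {"fancyRetryPolicy": false, "maxRetryCount": 0},
      "podGracefulDeletionTimeoutSec": 600,
      "pod": {
        "spec": {
          "restartPolicy": "Never",
          "containers": [
            {"name": "ubuntu", "image": "ubuntu:trusty",
             "command": ["sh", "-c", "printenv && sleep infinity"]}
          ]
        }
      }
    }
  }}
]'
```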
Request
PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Body:
Use the current array index of the $(TaskRoleName) TaskRole to fill in the $(TaskRoleIndex) placeholder below.
[
{
"op": "test",
"path": "/spec/taskRoles/$(TaskRoleIndex)/name",
"value": "$(TaskRoleName)"
},
{
"op": "remove",
"path": "/spec/taskRoles/$(TaskRoleIndex)"
}
]
Type: application/json-patch+json
Description
Delete the specified TaskRole from the specified Framework.
See more in Framework ScaleUp/ScaleDown.
Notes:
- It is better to first delete all Tasks in the TaskRole, wait until they are all deleted, and only then delete the whole TaskRole. Otherwise, the Tasks in the TaskRole will be deleted according to the last observed PodGracefulDeletionTimeoutSec in the TaskRoleStatus, because the PodGracefulDeletionTimeoutSec in the TaskSpec has already been deleted.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
| UnprocessableEntity(422) | Status | The `test` operation failed: the specified $(TaskRoleIndex) no longer matches the current array index of the $(TaskRoleName) TaskRole. |
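For example, to remove the hypothetical TaskRole `b` added above (a sketch, assuming it currently sits at array index 1), the `test` operation guards against the index having shifted:

```shell
# Verify that index 1 still holds TaskRole "b", then remove it; the patch fails with 422 otherwise.
kubectl patch framework rescalebasic --type='json' -p='[
  {"op": "test",   "path": "/spec/taskRoles/1/name", "value": "b"},
  {"op": "remove", "path": "/spec/taskRoles/1"}
]'
```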
Request
PATCH /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Body:
Use the current array index of the $(TaskRoleName) TaskRole to fill in the $(TaskRoleIndex) placeholder below.
[
{
"op": "test",
"path": "/spec/taskRoles/$(TaskRoleIndex)/name",
"value": "$(TaskRoleName)"
},
{
"op": "replace",
"path": "/spec/taskRoles/$(TaskRoleIndex)/taskNumber",
"value": $(TaskNumber)
}
]
Generally, you may also need to adjust the TaskRole's FrameworkAttemptCompletionPolicy according to the new $(TaskNumber). It is safe to do both in a single patch, per the Framework ScaleUp/ScaleDown Strong Safety Guarantee:
[
{
"op": "test",
"path": "/spec/taskRoles/$(TaskRoleIndex)/name",
"value": "$(TaskRoleName)"
},
{
"op": "replace",
"path": "/spec/taskRoles/$(TaskRoleIndex)/taskNumber",
"value": $(TaskNumber)
},
{
"op": "replace",
"path": "/spec/taskRoles/$(TaskRoleIndex)/frameworkAttemptCompletionPolicy/minSucceededTaskCount",
"value": $(MinSucceededTaskCount)
},
{
"op": "replace",
"path": "/spec/taskRoles/$(TaskRoleIndex)/frameworkAttemptCompletionPolicy/minFailedTaskCount",
"value": $(MinFailedTaskCount)
}
]
Type: application/json-patch+json
Description
Update the TaskNumber (and the FrameworkAttemptCompletionPolicy) of the specified TaskRole in the specified Framework.
See more in Framework ScaleUp/ScaleDown.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
| UnprocessableEntity(422) | Status | The `test` operation failed: the specified $(TaskRoleIndex) no longer matches the current array index of the $(TaskRoleName) TaskRole. |
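For example, the sketch below scales the `a` TaskRole of the `rescalebasic` Framework (assumed to sit at array index 0) to 8 Tasks and adjusts its completion policy in the same patch:

```shell
# Guard the TaskRole position with a test op, then update taskNumber and minFailedTaskCount together.
kubectl patch framework rescalebasic --type='json' -p='[
  {"op": "test",    "path": "/spec/taskRoles/0/name", "value": "a"},
  {"op": "replace", "path": "/spec/taskRoles/0/taskNumber", "value": 8},
  {"op": "replace", "path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount", "value": 8}
]'
```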
Request
DELETE /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Body:
- application/json:
  {
    "propagationPolicy": "Foreground"
  }
- application/yaml:
  propagationPolicy: Foreground
Type: application/json or application/yaml
Description
Delete the specified Framework.
Notes:
- If you need to achieve all the Framework ConsistencyGuarantees or achieve higher Framework Availability by leveraging the PodGracefulDeletionTimeoutSec, you should always use and only use the Foreground Deletion in the provided body.
- However, kubectl delete does not support specifying the Foreground Deletion, at least as of Kubernetes v1.23.2, so you may have to use another Supported Client.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | The specified Framework is deleting. Return current Framework. |
| OK(200) | Status | The specified Framework is deleted. |
| NotFound(404) | Status | The specified Framework is not found. |
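For example, a Foreground Deletion through the raw HTTP API might look like the sketch below, assuming `kubectl proxy` on `localhost:8001` and the `default` namespace:

```shell
# Delete the Framework with Foreground propagation, so its managed ConfigMap and Pods are deleted first.
curl -X DELETE \
  -H "Content-Type: application/json" \
  -d '{"propagationPolicy": "Foreground"}' \
  http://localhost:8001/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks/poststop
```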
Request
GET /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
Description
Get the specified Framework.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | Framework | Return current Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
Request
GET /apis/frameworkcontroller.microsoft.com/v1/namespaces/{FrameworkNamespace}/frameworks
GET /apis/frameworkcontroller.microsoft.com/v1/frameworks
QueryParameters: Same as StatefulSet QueryParameters
Description
Get all Frameworks (in the specified FrameworkNamespace).
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | FrameworkList | Return all Frameworks (in the specified FrameworkNamespace). |
Request
GET /apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/{FrameworkNamespace}/frameworks/{FrameworkName}
QueryParameters: Same as StatefulSet QueryParameters
Description
Watch the change events of the specified Framework.
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | WatchEvent | Streaming the change events of the specified Framework. |
| NotFound(404) | Status | The specified Framework is not found. |
Request
GET /apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/{FrameworkNamespace}/frameworks
GET /apis/frameworkcontroller.microsoft.com/v1/watch/frameworks
QueryParameters: Same as StatefulSet QueryParameters
Description
Watch the change events of all Frameworks (in the specified FrameworkNamespace).
Response
| Code | Body | Description |
|---|---|---|
| OK(200) | WatchEvent | Streaming the change events of all Frameworks (in the specified FrameworkNamespace). |
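For example, the events can be streamed with curl (a sketch, assuming `kubectl proxy` on `localhost:8001` and the `default` namespace):

```shell
# -N disables output buffering, so WatchEvents are printed as soon as they arrive.
curl -N http://localhost:8001/apis/frameworkcontroller.microsoft.com/v1/watch/namespaces/default/frameworks
```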
Framework ExecutionType can be specified to control the execution of the Framework:
- You can just Create Framework with the Create ExecutionType, which does not also start it at the same time.
  - This is useful when you need to do some PreStart actions that depend on the Framework object, see Framework PreStart Example. Once these actions are done, you can safely Start Framework.
- You can just Stop Framework, which does not also delete it at the same time.
  - This is useful when you need to do some PostStop actions that depend on the Framework object, see Framework PostStop Example. Once these actions are done, you can safely Delete Framework.
In this example, you need to run a Framework which depends on a ServiceAccount, but the ServiceAccount also depends on the Framework object as its OwnerReference, so you cannot directly Create Framework with ExecutionType Start.
- Create Framework with the Create ExecutionType and a ServiceAccount reference as below, then the Framework will stay as AttemptCreationPending:
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
name: prestart
spec:
executionType: Create
retryPolicy:
fancyRetryPolicy: false
maxRetryCount: 0
taskRoles:
- name: a
taskNumber: 4
frameworkAttemptCompletionPolicy:
minFailedTaskCount: 4
minSucceededTaskCount: 1
task:
retryPolicy:
fancyRetryPolicy: false
maxRetryCount: 0
podGracefulDeletionTimeoutSec: 600
pod:
spec:
restartPolicy: Never
serviceAccountName: prestart
containers:
- name: ubuntu
image: ubuntu:trusty
command: ["sh", "-c", "printenv && sleep infinity"]
- Use the above creation response's metadata.uid to fill in the {{FrameworkUID}} placeholder below, and Create ServiceAccount with the above Framework reference as below:
apiVersion: v1
kind: ServiceAccount
metadata:
name: prestart
ownerReferences:
- apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
name: prestart
uid: {{FrameworkUID}}
controller: true
blockOwnerDeletion: true
- Start Framework, then the Framework will start to run successfully.
- Delete Framework, then both the Framework and above ServiceAccount will be deleted.
In this example, you need to stop a Framework whose final stopped Framework object needs to be pushed to/pulled by external systems, so you cannot directly Delete Framework.
- Create Framework as below:
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
name: poststop
spec:
executionType: Start
retryPolicy:
fancyRetryPolicy: false
maxRetryCount: 0
taskRoles:
- name: a
taskNumber: 4
frameworkAttemptCompletionPolicy:
minFailedTaskCount: 4
minSucceededTaskCount: 1
task:
retryPolicy:
fancyRetryPolicy: false
maxRetryCount: 0
podGracefulDeletionTimeoutSec: 600
pod:
spec:
restartPolicy: Never
containers:
- name: ubuntu
image: ubuntu:trusty
command: ["sh", "-c", "printenv && sleep infinity"]
- Stop Framework, then the Framework will be stopped, i.e. FrameworkCompleted.
- Get Framework, and archive it into a database first.
- Delete Framework, then the Framework will be deleted.
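The three steps above could be scripted roughly as follows (a sketch, assuming kubectl is configured for the right namespace; the archive file name is just illustrative):

```shell
# 1. Stop the Framework.
kubectl patch framework poststop --type='json' \
  -p='[{"op": "replace", "path": "/spec/executionType", "value": "Stop"}]'

# 2. After the Framework becomes FrameworkCompleted, get it and archive the final object.
kubectl get framework poststop -o yaml > poststop-final.yaml

# 3. Delete the Framework (see the DELETE Framework notes about Foreground Deletion).
kubectl delete framework poststop
```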
Predefined Container EnvironmentVariable
You can specify how to classify and summarize Pod failures by the PodFailureSpec.
You can also directly leverage the Default PodFailureSpec.
You can leverage the Predefined CompletionCode to instruct your RetryPolicy and to identify a certain predefined CompletionCode, regardless of which PodFailureSpec may be configured in different clusters.
CompletionStatus: It is generated from the Predefined CompletionCode or PodPattern matching. For a Pod, if no PodPattern is matched and a failed Container exists, the CompletionCode is the same as the ExitCode of the last failed Container.
TaskAttemptCompletionStatus: Besides the CompletionStatus, it also provides more detailed and structured diagnostic information about the completion of a TaskAttempt.
FrameworkAttemptCompletionStatus: Besides the CompletionStatus, it also provides more detailed and structured diagnostic information about the completion of a FrameworkAttempt.
Notes:
- Italic Conditions still need to be specified explicitly, as we have not supported the Framework Spec Defaulting yet.
- For the definition of each CompletionType, such as Transient Failed, see CompletionStatus.
| FrameworkType | Framework RetryPolicy | TaskRole | Task RetryPolicy | Description |
|---|---|---|---|---|
| DEFAULT | *FancyRetryPolicy = false*<br>*MaxRetryCount = 0* | TaskRole-A | *FancyRetryPolicy = false*<br>*MaxRetryCount = 0* | The default RetryPolicy:<br>Never Retry for any Failed or Succeeded. |
| | | TaskRole-B | *FancyRetryPolicy = false*<br>*MaxRetryCount = 0* | |
| Service | FancyRetryPolicy = false<br>MaxRetryCount = -2 | TaskRole-A | FancyRetryPolicy = false<br>MaxRetryCount = -2 | Always Retry for any Failed or Succeeded. |
| Blind Batch | FancyRetryPolicy = false<br>MaxRetryCount = -1 | TaskRole-A | FancyRetryPolicy = false<br>MaxRetryCount = -1 | Always Retry for any Failed.<br>Never Retry for Succeeded. |
| Batch with Task Fault Tolerance | FancyRetryPolicy = true<br>MaxRetryCount = 3 | TaskRole-A | FancyRetryPolicy = true<br>MaxRetryCount = 3 | Always Retry for Transient Failed.<br>Never Retry for Permanent Failed or Succeeded.<br>Retry up to 3 times for Unknown Failed. |
| Batch without Task Fault Tolerance | FancyRetryPolicy = true<br>MaxRetryCount = 3 | TaskRole-A | FancyRetryPolicy = false<br>MaxRetryCount = 0 | For the Framework RetryPolicy, same as "Batch with Task Fault Tolerance".<br>For the Task RetryPolicy, because the Task cannot tolerate any failed TaskAttempt, such as not being able to recover from a previous failed TaskAttempt, Never Retry the Task for any Failed or Succeeded. |
| Debug Mode | FancyRetryPolicy = true<br>MaxRetryCount = 0 | TaskRole-A | FancyRetryPolicy = true<br>MaxRetryCount = 0 | Always Retry for Transient Failed.<br>Never Retry for Permanent Failed, Unknown Failed or Succeeded.<br>This can help to capture the unexpected exit of the user application itself. |
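For instance, the "Batch with Task Fault Tolerance" row could be expressed as a Framework spec along the lines of the sketch below (the Framework name, TaskRole name and container command are placeholders; the completion policy shown is simply the default one, and the other fields follow the YAML examples in this document):

```shell
# Create a batch Framework that always retries Transient Failed and retries Unknown Failed up to 3 times.
kubectl create -f - <<'EOF'
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
  name: faulttolerantbatch
spec:
  executionType: Start
  retryPolicy:
    fancyRetryPolicy: true
    maxRetryCount: 3
  taskRoles:
  - name: worker
    taskNumber: 4
    frameworkAttemptCompletionPolicy:
      minFailedTaskCount: 1
      minSucceededTaskCount: -1
    task:
      retryPolicy:
        fancyRetryPolicy: true
        maxRetryCount: 3
      pod:
        spec:
          restartPolicy: Never
          containers:
          - name: ubuntu
            image: ubuntu:trusty
            command: ["sh", "-c", "echo batch work done"]
EOF
```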
Notes:
- Italic Conditions still need to be specified explicitly, as we have not supported the Framework Spec Defaulting yet.
| FrameworkType | TaskRole | FrameworkAttemptCompletionPolicy | Description |
|---|---|---|---|
| DEFAULT | TaskRole-A | *MinFailedTaskCount = 1*<br>*MinSucceededTaskCount = -1* | The default FrameworkAttemptCompletionPolicy:<br>Fail the FrameworkAttempt immediately if any Task failed.<br>Succeed the FrameworkAttempt only until all Tasks succeeded. |
| | TaskRole-B | *MinFailedTaskCount = 1*<br>*MinSucceededTaskCount = -1* | |
| Service | TaskRole-A | MinFailedTaskCount = 1<br>MinSucceededTaskCount = -1 | Actually, any FrameworkAttemptCompletionPolicy is fine, since a Service's Task will never complete, i.e. its Task's MaxRetryCount is -2, see RetryPolicy Example. |
| MapReduce | Map | MinFailedTaskCount = {Map.TaskNumber} * {mapreduce.map.failures.maxpercent} + 1<br>MinSucceededTaskCount = -1 | A few failed Tasks are acceptable, but we always want to wait for all Tasks to succeed:<br>Fail the FrameworkAttempt immediately if the failed Tasks exceed the limit.<br>Succeed the FrameworkAttempt only until all Tasks completed and the failed Tasks are within the limit. |
| | Reduce | MinFailedTaskCount = {Reduce.TaskNumber} * {mapreduce.reduce.failures.maxpercent} + 1<br>MinSucceededTaskCount = -1 | |
| Master Dominated: PyTorch Training with explicit master role, TensorFlow Training with explicit chief role | Master/Chief | MinFailedTaskCount = 1<br>MinSucceededTaskCount = 1 | The completion is fully determined by the single instance master of the user application:<br>Fail the FrameworkAttempt immediately if the master failed.<br>Succeed the FrameworkAttempt immediately if the master succeeded. |
| | Worker | MinFailedTaskCount = -1<br>MinSucceededTaskCount = -1 | |
| All Workers Dominated: PyTorch Training without explicit master role, TensorFlow Training without explicit chief role | ParameterServer/None | MinFailedTaskCount = 1<br>MinSucceededTaskCount = -1 | Fail the FrameworkAttempt immediately if any Task failed.<br>Succeed the FrameworkAttempt only until all workers succeeded. |
| | Worker | MinFailedTaskCount = 1<br>MinSucceededTaskCount = {Worker.TaskNumber} | |
| Any Worker Dominated: PyTorch Elastic Training | Worker | MinFailedTaskCount = {Worker.TaskNumber}<br>MinSucceededTaskCount = 1 | Fail the FrameworkAttempt only until all workers failed.<br>Succeed the FrameworkAttempt immediately if any worker succeeded. |
Framework ScaleUp/ScaleDown (Rescale) refers to taking any of the below actions for an existing Framework on the fly:
- Add/Delete TaskRole without touching other TaskRoles.
- Add/Delete Task without touching other Tasks.
Before you start to Rescale a Framework, make sure the application executed by the Framework can tolerate Rescale, such as:
- Your application should be able to rebalance its workload after Rescale:
  - A Service application may need to rebalance its serving traffic by a load balancer, and/or re-replicate its state.
  - A Batch application may need to rebalance its work items by a work queue, and/or adjust its Task membership, like PyTorch Elastic Training.
- A Batch application had better not rerun after ScaleUp:
  - A rerun may happen if the application has already completed, but the completion event has not yet been observed by FrameworkController. So, during this period, if you ScaleUp the Framework, the application may rerun.
  - To mitigate it, the ScaleUp Task can immediately complete itself by leveraging an empty work queue or an existing checkpoint from a previous run.
- A Batch application had better not succeed too early after ScaleDown:
  - A too-early success may happen if all Tasks succeeded except one Task still running, but you ScaleDown the running Task.
  - To resolve it, make sure it is safe to ScaleDown the running Task, such as by leveraging the Master Dominated or Any Worker Dominated FrameworkType in FrameworkAttemptCompletionPolicy. For the Master Dominated FrameworkType, a master TaskRole is needed and the master TaskRole should not be ScaledDown. For the Any Worker Dominated FrameworkType, an exit barrier may be needed to ensure that any worker succeeding means the whole application has already succeeded, like PyTorch Elastic Training.
This example will demonstrate the basic usage of Framework Rescale, as well as its Strong Safety Guarantee.
- Create Framework as below, and wait until all Tasks are AttemptRunning:
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
name: rescalebasic
spec:
executionType: Start
retryPolicy:
fancyRetryPolicy: false
maxRetryCount: 0
taskRoles:
- name: a
taskNumber: 4
frameworkAttemptCompletionPolicy:
minFailedTaskCount: 4
minSucceededTaskCount: 1
task:
retryPolicy:
fancyRetryPolicy: false
maxRetryCount: 0
podGracefulDeletionTimeoutSec: 600
pod:
spec:
restartPolicy: Never
containers:
- name: ubuntu
image: ubuntu:trusty
command: ["sh", "-c", "printenv && sleep infinity"]
- Delete Pods `rescalebasic-a-2` and `rescalebasic-a-3`, and wait until their Tasks are Completed (Failed).
- ScaleDown Framework: Decrease the taskNumber and minFailedTaskCount from 4 to 2 by the below patch:
[
{
"op": "test",
"path": "/spec/taskRoles/0/name",
"value": "a"
},
{
"op": "replace",
"path": "/spec/taskRoles/0/taskNumber",
"value": 2
},
{
"op": "replace",
"path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount",
"value": 2
}
]
- The Completed Tasks `rescalebasic-a-2` and `rescalebasic-a-3` are immediately deleted, and the Framework stays AttemptRunning.
  - This demonstrates SafetyGuarantee1, as otherwise the old Failed Tasks `rescalebasic-a-2` and `rescalebasic-a-3` may be wrongly counted toward the new FrameworkAttemptCompletionPolicy.minFailedTaskCount (2) and trigger the completion.
- ScaleUp Framework: Increase the taskNumber and minFailedTaskCount from 2 to 4 by the below patch, and wait until all Tasks are AttemptRunning:
[
{
"op": "test",
"path": "/spec/taskRoles/0/name",
"value": "a"
},
{
"op": "replace",
"path": "/spec/taskRoles/0/taskNumber",
"value": 4
},
{
"op": "replace",
"path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount",
"value": 4
}
]
- Delete Pod `rescalebasic-a-2`, and wait until its Task is Completed (Failed).
- Redo Step 3 (the ScaleDown above), and wait until the `rescalebasic-a-2` and `rescalebasic-a-3` Tasks are DeletionPending, but before `rescalebasic-a-3` is deleted, immediately ScaleUp Framework: Increase the taskNumber and minFailedTaskCount from 2 to 3 by the below patch:
[
{
"op": "test",
"path": "/spec/taskRoles/0/name",
"value": "a"
},
{
"op": "replace",
"path": "/spec/taskRoles/0/taskNumber",
"value": 3
},
{
"op": "replace",
"path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount",
"value": 3
}
]
- The Completed Task `rescalebasic-a-2` is immediately replaced by a new Task instance, the AttemptRunning Task `rescalebasic-a-3` is eventually deleted, and the Framework stays AttemptRunning.
  - This demonstrates SafetyGuarantee2, as otherwise the previous ScaleDown Task `rescalebasic-a-2` may be wrongly reused in a later ScaleUp.
- ScaleUp Framework: Increase the taskNumber and minFailedTaskCount from 3 to 4 by the below patch, and wait until all Tasks are AttemptRunning:
[
{
"op": "test",
"path": "/spec/taskRoles/0/name",
"value": "a"
},
{
"op": "replace",
"path": "/spec/taskRoles/0/taskNumber",
"value": 4
},
{
"op": "replace",
"path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount",
"value": 4
}
]
- Delete Pod `rescalebasic-a-2`, and wait until its Task is Completed (Failed).
- Redo Step 3 (the ScaleDown above), and wait until the `rescalebasic-a-2` and `rescalebasic-a-3` Tasks are DeletionPending, but before `rescalebasic-a-3` is deleted, immediately ScaleUp Framework: Increase the taskNumber and minFailedTaskCount from 2 to 5 by the below patch:
[
{
"op": "test",
"path": "/spec/taskRoles/0/name",
"value": "a"
},
{
"op": "replace",
"path": "/spec/taskRoles/0/taskNumber",
"value": 5
},
{
"op": "replace",
"path": "/spec/taskRoles/0/frameworkAttemptCompletionPolicy/minFailedTaskCount",
"value": 5
}
]
- The Completed Task `rescalebasic-a-2` is immediately replaced by a new Task instance, the AttemptRunning Task `rescalebasic-a-3` is eventually replaced by a new Task instance, the new Task `rescalebasic-a-4` is immediately added, and the Framework stays AttemptRunning.
  - This demonstrates SafetyGuarantee2, as otherwise the previous ScaleDown Tasks `rescalebasic-a-2` and `rescalebasic-a-3` may be wrongly reused in a later ScaleUp.
PyTorch Elastic Training On FrameworkController
ScaleUp Pipeline:
- As soon as FrameworkController observes a ScaleUp request for a Framework that is not Completing/Completed, it will immediately mark the ScaleUp Task as AttemptCreationPending and then persist (expose) the Task, before taking any other action for the Framework, such as starting to create its TaskAttempt or evaluating the Task's impact on the Framework, such as the FrameworkAttemptCompletionPolicy.
- As soon as the AttemptCreationPending Task is persisted (exposed), the Task will impact its Framework in the future, such as:
- The Task will be considered in FrameworkAttemptCompletionPolicy.
- The Task will be deleted in later ScaleDown.
- Only after that will FrameworkController start to create the TaskAttempt of the AttemptCreationPending Task.
ScaleDown Pipeline:
- As soon as FrameworkController observes a ScaleDown request for a Framework that is not Completing/Completed, it will immediately mark the ScaleDown Task as DeletionPending and then persist (expose) the Task, before taking any other action for the Framework, such as starting to delete the Task or evaluating the Task's impact on the Framework, such as the FrameworkAttemptCompletionPolicy.
- As soon as the DeletionPending Task is persisted (exposed), the Task will no longer impact its Framework in the future, except for the Task's own graceful deletion period, such as:
- The Task will never be considered in FrameworkAttemptCompletionPolicy.
- The Task will never be reused in later ScaleUp.
- Only after that will FrameworkController start to delete the DeletionPending Task.
- After the Framework is Completing/Completed, it may still contain DeletionPending Tasks, but these Tasks must be Completed.
Notes:
- The user also needs to explicitly ignore the DeletionPending Tasks if they do not care about the Task's graceful deletion period, as FrameworkBarrier does.
Besides these general Framework ConsistencyGuarantees, FrameworkController also provides the below Strong Safety Guarantees for Framework Rescale:
- SafetyGuarantee1:
  - The user can always safely Rescale a Framework and update its other Spec within a single Framework update, as if the Rescale were already done, such as also updating the FrameworkAttemptCompletionPolicy according to the new scale, because:
    - The ScaleUp Task will be immediately considered, and the ScaleDown Task will be immediately ignored, for the updated Framework.
- SafetyGuarantee2:
  - The user can always safely Rescale before any previous Rescale has totally finished (including the final deletion of a ScaleDown Task), because:
    - For a ScaleUp immediately followed by a ScaleDown: If the user observed a new non-DeletionPending Task (caused by the ScaleUp), the later ScaleDown will delete it (i.e. the ScaleUp is committed); otherwise, the later ScaleDown may not delete it, but the previous ScaleUp must not have impacted it, such as having started to create its TaskAttempt (i.e. the ScaleUp is rolled back).
    - For a ScaleDown immediately followed by a ScaleUp: If the user observed a new DeletionPending Task (caused by the ScaleDown), the later ScaleUp will not reuse it (i.e. the ScaleDown is committed); otherwise, the later ScaleUp may reuse it, but the previous ScaleDown must not have impacted it, such as having started to delete it (i.e. the ScaleDown is rolled back).

See the Framework Rescale Basic Example for a demonstration of these Strong Safety Guarantees.
To safely run a large scale Framework, i.e. one whose total Task number is greater than 300, you just need to enable the LargeFrameworkCompression. However, you may also need to decompress the Framework by yourself.
By leveraging the LogObjectSnapshot, external systems, such as Fluentd and ElasticSearch, can collect and process Framework, Task and Pod history snapshots, even if the object was retried or deleted, such as for persistence, metrics conversion, visualization, alerting, acting, analysis, etc.
For a specific Task identified by {FrameworkName}-{TaskRoleName}-{TaskIndex}:
- ConsistencyGuarantee1: At most one instance of the Task is running at any point in time.
- ConsistencyGuarantee2: No instance of the Task is running if it is deleted (or does not exist), TaskAttemptCompleted, TaskCompleted or the whole Framework is deleted (or does not exist).
For a specific Framework identified by {FrameworkName}:
- ConsistencyGuarantee3: At most one instance of the Framework is running at any point in time.
- ConsistencyGuarantee4: No instance of the Framework is running if it is FrameworkAttemptCompleted, FrameworkCompleted or the whole Framework is deleted (or does not exist).
The default behavior is to achieve all the ConsistencyGuarantees, as long as you do not explicitly violate the guidelines below:
- Achieve ConsistencyGuarantee1:
  Do not force delete the managed Pod:
  - Do not set the PodGracefulDeletionTimeoutSec to be not nil.
    For example, the default PodGracefulDeletionTimeoutSec is acceptable.
  - Do not delete the managed Pod with 0 GracePeriodSeconds.
    For example, the default Pod deletion is acceptable.
  - Do not delete the Node which runs the managed Pod.
    For example, draining the Node before deleting it is acceptable.

  The Task running instance can be universally located by its TaskAttemptInstanceUID or PodUID.

  To avoid the Pod being stuck in deletion forever, such as if its Node is down forever, leverage the same approach as Delete StatefulSet Pod only after the Pod termination has been confirmed, manually or by your Cloud Controller Manager.
- Achieve ConsistencyGuarantee2, ConsistencyGuarantee3 and ConsistencyGuarantee4:
  - Achieve ConsistencyGuarantee1.
  - Must delete the managed ConfigMap with Foreground PropagationPolicy.
    For example, the default ConfigMap deletion is acceptable.
  - Must delete the Framework with Foreground PropagationPolicy.
    For example, the default Framework deletion may not be acceptable, since the default PropagationPolicy for the Framework object may be Background.
  - Do not change the OwnerReferences of the managed ConfigMap and Pods.

  The Framework running instance can be universally located by its FrameworkAttemptInstanceUID or ConfigMapUID.
According to the CAP theorem, in the presence of a network partition, you cannot achieve both consistency and availability at the same time in any distributed system. So you have to make a trade-off between the Framework Consistency and the Framework Availability.
You can tune the trade-off, such as to achieve higher Framework Availability by sacrificing the Framework Consistency:
- Decrease Pod TolerationSeconds for TaintBasedEvictions
- Increase Node Eviction Rate
- Set a small PodGracefulDeletionTimeoutSec
- Violate other guidelines mentioned in How to achieve ConsistencyGuarantees, such as manually force deleting a problematic Pod.
See more in:
- Usage
- Example: TensorFlow ParameterServer Training Example, etc.