
extend prerequisite field in job protocol #5145

Open
hzy46 opened this issue Dec 3, 2020 · 6 comments

hzy46 (Contributor) commented Dec 3, 2020

Motivation

The OpenPAI protocol lets users specify prerequisites (e.g. dockerimage, data, and script) and then reference them in task roles. The current version has some limitations:

  • The current solution only supports parameter (e.g. uri) definitions. This is enough for the most frequently used dockerimage type, because Docker acts as its runtime executor, but it is too limited for other types. For example, in the job config below, commands have to be injected into every task role to make the data ready.
  • It is not well organized (object-oriented). The wget command is an action on the data, but the two cannot be placed together.
    • It is hard to reuse: if the data is referenced by more than one task role, the wget commands must be injected everywhere.
    • It is hard to use: the user (or a marketplace plugin) must modify more than one place to enable a dataset.
  • A task role can only reference one data (or script, output) prerequisite:
prerequisites:
  - name: covid_data
    type: data
    uri:
      - https://x.x.x/yyy.zip # data uri
  - name: default_image
    type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
taskRoles:
  taskrole:
    dockerImage: default_image
    data: covid_data
    commands:
      - mkdir -p /data/covid19/data/
      - cd /data/covid19/data/
      - 'wget <% $data.uri[0] %>'
      - export DATA_DIR=/data/covid19/data/

Goal

  • Propose protocol updates and a runtime plugin to make prerequisites well organized and object-oriented. Besides defining parameters, a prerequisite also supports real functions (callbacks on specific events).
  • Make reuse of data, scripts, and other prerequisites easy and flexible.
  • Better support dataset management (via the marketplace).
  • Enable advanced features (e.g. cluster datasets, data-location-aware scheduling) in the future.
  • Stay backward compatible (this version should still support previous configs).

Proposal

  1. Support callbacks in prerequisites.
  2. Allow a task role to reference a list of prerequisites.
  3. Provide a runtime plugin implementation.

Examples

  • Defining actions with data:
    • Different data sources require different pre-commands, e.g. wget, NFS mount, or Azure Blob download.
prerequisites:
  - name: covid_data
    type: data
    callbacks:
      - event: containerStart
        commands:
          - mkdir -p /data/covid19/data/
          - cd /data/covid19/data/
          - 'wget https://x.x.x/yyy.zip'
          - export DATA_DIR=/data/covid19/data/

taskRoles:
  taskrole:
    dockerImage: default_image
    prerequisites: 
      - covid_data
    commands:
      - ls $DATA_DIR
  • Setting up environment/script prerequisites (see the sketch after this list):
    • Some should run before the script starts, e.g. installing pip packages or the OpenPAI SDK.
    • Some should run after the script completes / succeeds / fails, e.g. log uploading, reports, or alerts.
    • Enhanced debuggability, such as starting a Jupyter server (or SSH) for 30 minutes after the user's command fails.
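
A sketch of how such conditional post-run callbacks could be selected, assuming hypothetical refinements of containerExit (containerSucceeded, containerFailed) keyed off the user command's exit code; these names are illustrations, not part of the spec below:

def exit_events(user_command_exit_code: int) -> list:
    # containerExit always fires; the success/failure variants are
    # hypothetical refinements keyed off the exit code.
    events = ["containerExit"]
    events.append("containerSucceeded" if user_command_exit_code == 0
                  else "containerFailed")
    return events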

Full Spec:

prerequisites:
  - name: string # required, a unique name to locate the prerequisite (from local config or marketplace)
    type: "dockerimage | script | data | output" # for survey purposes (except dockerimage); not used by the backend
    plugin: string # optional, the executor that handles this prerequisite; defaults to com.microsoft.pai.runtimeplugin.cmd, or docker for dockerimage
    require: [] # optional, other prerequisites on which the current one depends
    callbacks: # optional, commands to run on events
      - event: "containerStart | containerExit"
        commands: # commands translated by the plugin
          - string # shell commands for com.microsoft.pai.runtimeplugin.cmd
          - string # TODO: other commands (e.g. python) for other plugins
    failurePolicy: "ignore | fail" # optional, same default as the runtime plugin
    # plugin-specific properties
    uri: string | array # optional, kept for backward compatibility (it was required previously)
    key1: value1 # referenced via <% this.parameters.key1 %>
    key2: value2 # TODO: inheritable from required ones

taskRoles:
  taskrole:
    prerequisites: # optional; required items (require) will be automatically parsed and inserted
      - prerequisite-1 # on containerStart, will execute in order
      - prerequisite-2 # on containerExit, will execute in reverse order

Each prerequisite will be handled roughly like:

for prerequisite in prerequisites:
  plugin(**prerequisite)
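
Expanded into a runnable sketch, assuming the prerequisites have already been parsed from YAML into dicts and modeling the cmd plugin as a plain shell invocation (function names and error handling here are illustrative, not the actual runtime code):

import subprocess

def run_commands(commands, failure_policy="fail"):
    for command in commands:
        result = subprocess.run(command, shell=True)
        if result.returncode != 0 and failure_policy == "fail":
            raise RuntimeError(f"prerequisite command failed: {command}")

def handle_event(prerequisites, event):
    # containerStart callbacks run in declaration order; containerExit
    # callbacks run in reverse order, so teardown mirrors setup.
    ordered = prerequisites if event == "containerStart" else list(reversed(prerequisites))
    for prerequisite in ordered:
        for callback in prerequisite.get("callbacks", []):
            if callback["event"] == event:
                run_commands(callback["commands"],
                             prerequisite.get("failurePolicy", "fail"))
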
hzy46 (Contributor, Author) commented Dec 8, 2020

Updates on this issue:

  1. Will sync with @mydmdm to determine the detailed schema. This will be a P1 item for the v1.5.0 release.
  2. In the OpenPAI runtime, handle prerequisites as follows (sketched below): (1) use the existing mechanism to inject commands into preCommands and postCommands; (2) don't expose an explicit plugin definition in the user's job protocol; (3) make sure parameters and secrets work in prerequisites.
  3. We can add retry and failure policies; this can be left to future work.
  4. Sync with @Binyang2014 about support for cluster data.
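
A minimal sketch of point (1) in item 2, folding prerequisite callbacks into the existing preCommands/postCommands mechanism (the function name is illustrative):

def to_pre_post_commands(prerequisites):
    # containerStart callbacks become preCommands in declaration order;
    # containerExit callbacks become postCommands in reverse order.
    pre, post = [], []
    for p in prerequisites:
        for cb in p.get("callbacks", []):
            if cb["event"] == "containerStart":
                pre += cb["commands"]
    for p in reversed(prerequisites):
        for cb in p.get("callbacks", []):
            if cb["event"] == "containerExit":
                post += cb["commands"]
    return pre, post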

Binyang2014 (Contributor) commented Dec 8, 2020

Questions about this:

  1. Can all runtime plugins be merged into prerequisites? If so, we could deprecate the runtime extras field and make prerequisites the official way.
  2. Maybe we can treat a prerequisite as a high-level widget that can use any implementation to achieve its goal, such as a runtime plugin, command injection, or communication with other k8s/PAI services. There is also another proposal for service plugins ([Proposal] Service Plugin - for marketplace / other services #4254); is it related?

hzy46 (Contributor, Author) commented Dec 8, 2020

Questions about this:

  1. Can all runtime plugins be merged into prerequisites? If so, we could deprecate the runtime extras field and make prerequisites the official way.
  2. Maybe we can treat a prerequisite as a high-level widget that can use any implementation to achieve its goal, such as a runtime plugin, command injection, or communication with other k8s/PAI services. There is also another proposal for service plugins ([Proposal] Service Plugin - for marketplace / other services #4254); is it related?

I think they serve different scenarios. A prerequisite is a requirement of a job: without it, the job usually fails, and it should be sharable among users. A runtime plugin extends the job protocol's functionality: it can be nice-to-have (not necessary) and can be personal config (not sharable). There is some overlap, though; maybe we can move some officially supported runtime plugins into prerequisites.

mydmdm (Contributor) commented Jan 5, 2021

I have updated the full spec in the main body; here are some examples, including:

  • P0: execute essential commands
  • P0: configure storage
  • P0: configure data based on storage
  • P1: functional plugin support (e.g. ssh)
prerequisites:
  - name: install-pai-copy
    type: script # indicates the purpose; not used by the backend (except dockerimage), only for statistical analysis
    plugin: com.microsoft.pai.runtimeplugin.cmd # default plugin if not specified
    callbacks:
      - event: containerStart
        commands:
          - xxx # commands to set up Node.js
          - npm install -g @swordfaith/pai_copy
    failurePolicy: "ignore | fail"
  - name: covid_data
    type: data
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: containerStart
        commands:
          - mkdir -p /data/covid19/data/
          - cd /data/covid19/data/
          - 'wget https://x.x.x/yyy.zip'
          - export DATA_DIR=/data/covid19/data/
  - name: nfs-storage-1
    type: storage # indicates the purpose; not used by the backend, only for statistical analysis
    plugin: com.microsoft.pai.rest.storage # handled by REST server
    config: nfsconfig # special arguments for storage plugin only
    mountPoint: /mnt/nfs-storage-1
  - name: mnist-data
    type: data
    plugin: com.microsoft.pai.runtimeplugin.cmd
    require:
      - nfs-storage-1 # also inherits parameters like mountPoint
    callbacks:
      - event: containerStart
        commands:
          - export MNIST_DIR=<% this.mountPoint %>/mnist
  - name: output-dir
    type: output
    plugin: com.microsoft.pai.runtimeplugin.cmd
    require:
      - nfs-storage-1
    callbacks:
      - event: containerStart
        commands:
          - export OUTPUT_DIR=/tmp/output
      - event: containerExit
        commands:
          - 'if [ -z ${OUTPUT_DIR+x} ]; then'
          - echo "OUTPUT_DIR environment variable not found"
          - else
          - pai_copy upload paiuploadtest //
          - fi
  - name: enable-ssh
    type: script
    plugin: com.microsoft.pai.runtimeplugin.ssh
    jobssh: true
    publicKeys: # optional, if not specified, only public keys in user.extensions.sshKeys will be added
      - ... # public keys 

taskRoles:
  taskrole:
    dockerImage: default_image
    prerequisites:
      - mnist-data # required items will be automatically parsed and added by the backend
      - output-dir
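
The comment above says required items are parsed automatically. A minimal sketch of how the backend might flatten a task role's prerequisite list, assuming a catalog dict that maps prerequisite names to their parsed definitions (the function name is an illustration, not the actual REST-server code):

def expand_prerequisites(names, catalog, seen=None):
    # Each "require" entry is pulled in recursively ahead of its dependent,
    # and every prerequisite is included at most once (cycles are cut).
    if seen is None:
        seen = set()
    ordered = []
    for name in names:
        if name in seen:
            continue
        seen.add(name)
        definition = catalog[name]
        ordered += expand_prerequisites(definition.get("require", []), catalog, seen)
        ordered.append(definition)
    return ordered

With the example above, expanding ["mnist-data", "output-dir"] yields nfs-storage-1 first, then mnist-data, then output-dir, with nfs-storage-1 included only once.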

hzy46 (Contributor, Author) commented Jan 29, 2021

(TBD) Test Cases for v1.5.0 release

  1. Test cmd prerequisites
protocolVersion: 2
name: pre1
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo 111
          - echo 222
      - event: taskSucceeds
        commands:
          - echo 333
          - echo 444
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - justecho
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true

expected runtime.log:

[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] [INFO] Starting to exec precommands
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] 111
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] 222
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] [package_cache] Skip installation of group ssh.
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] [INFO] start ssh service
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] * Restarting OpenBSD Secure Shell server sshd
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] ...done.
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] [INFO] Precommands finished
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] [INFO] USER COMMAND START
[Mon Feb  1 02:54:58 UTC 2021] [openpai-runtime] [INFO] USER COMMAND END
[Mon Feb  1 02:54:58 UTC 2021] [openpai-runtime] 333
[Mon Feb  1 02:54:58 UTC 2021] [openpai-runtime] 444

  2. Test multiple prerequisites
protocolVersion: 2
name: pre2
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho_first
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo 111
      - event: taskSucceeds
        commands:
          - echo 222
  - type: script
    name: justecho_later
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo aaa
      - event: taskSucceeds
        commands:
          - echo bbb
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - justecho_first
      - justecho_later
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true

expected runtime.log:

[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] [INFO] Starting to exec precommands
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] 111
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] aaa
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] [package_cache] Skip installation of group ssh.
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] [INFO] start ssh service
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] * Restarting OpenBSD Secure Shell server sshd
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] ...done.
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] [INFO] Precommands finished
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] [INFO] USER COMMAND START
[Mon Feb  1 02:55:27 UTC 2021] [openpai-runtime] [INFO] USER COMMAND END
[Mon Feb  1 02:55:27 UTC 2021] [openpai-runtime] bbb
[Mon Feb  1 02:55:27 UTC 2021] [openpai-runtime] 222
  3. Test wrong config 1
    An error is expected:
protocolVersion: 2
name: pre3
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo 111
          - echo 222
      - event: taskSucceeds
        commands:
          - echo 333
          - echo 444
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - justecho_wrong
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 0
      cpu: 1
      memoryMB: 9672
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
  4. Test backward compatibility

This job should work:

protocolVersion: 2
name: covid-chestxray-dataset_88170423
description: >
  COVID-19 chest X-ray image data collection


  It is to build a public open dataset of chest X-ray and CT images of patients
  which are positive or suspected of COVID-19 or other viral and bacterial
  pneumonias
  ([MERS](https://en.wikipedia.org/wiki/Middle_East_respiratory_syndrome),
  [SARS](https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome), and
  [ARDS](https://en.wikipedia.org/wiki/Acute_respiratory_distress_syndrome)).
contributor: OpenPAI
type: job
jobRetryCount: 0
prerequisites:
  - name: covid-chestxray-dataset
    type: data
    uri:
      - 'https://github.com/ieee8023/covid-chestxray-dataset.git'
  - name: default_image
    type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.4.0-gpu'
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: default_image
    data: covid-chestxray-dataset
    resourcePerInstance:
      cpu: 3
      memoryMB: 29065
      gpu: 1
    commands:
      - 'git clone <% $data.uri[0] %>'
defaults:
  virtualCluster: default
  5. Test data prerequisite
protocolVersion: 2
name: pre1_f7a15a5c
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: install-git
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - apt update
          - apt install -y git
  - type: data
    name: covid-19-data
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - mkdir -p /dataset/covid-19
          - >-
            git clone https://github.com/ieee8023/covid-chestxray-dataset.git
            /dataset/covid-19
  - type: dockerimage
    uri: 'ubuntu:18.04'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - install-git
      - covid-19-data
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - ls -la /dataset/covid-19
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true

Expected: the data is listed successfully.

  6. Test parameters and secrets
protocolVersion: 2
name: pre_secret_parameters
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo <% $parameters.x %>
          - echo <% $secrets.y %>
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
parameters:
  x: '111'
taskRoles:
  taskrole:
    prerequisites: [justecho]
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
secrets:
  'y': '222'
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true

Expected: 111 and 222 in runtime.log.
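
For reference, a sketch of the <% ... %> substitution this test exercises, assuming a simple regex pass over each command (the real runtime's template engine may differ):

import re

TEMPLATE = re.compile(r"<%\s*\$(parameters|secrets)\.(\w+)\s*%>")

def render(command, parameters, secrets):
    # Replace <% $parameters.k %> / <% $secrets.k %> with the bound values.
    scopes = {"parameters": parameters, "secrets": secrets}
    return TEMPLATE.sub(lambda m: str(scopes[m.group(1)][m.group(2)]), command)

# render("echo <% $parameters.x %>", {"x": "111"}, {"y": "222"}) == "echo 111"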

hzy46 closed this as completed Jan 29, 2021
hzy46 reopened this Jan 29, 2021

hzy46 (Contributor, Author) commented Feb 4, 2021

After discussion, the interaction between prerequisites and marketplace could be:

In a task role, prerequisites from the marketplace are referenced directly; there is no need to include them in the job-level prerequisites.

Use marketplace://data/xxx and marketplace://script/xxx to indicate data and script:

taskRoles:
  taskrole:
    prerequisites: ["marketplace://data/mnist"]

In the job protocol, a prerequisite can use require to indicate required items, which can come from the job protocol itself or from the marketplace.

prerequisites:
  - type: script
    name: copy_data
    require: ["marketplace://script/pai_copy"]
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: containerStarts
        commands:
          - pai_copy data

The REST server reads all items in each task role's prerequisites and converts them to their real definitions by calling the marketplace's API. These can be treated as job add-ons and saved in the database.
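
A minimal sketch of that resolution step, assuming a hypothetical fetch function wrapping the marketplace API:

MARKETPLACE_PREFIX = "marketplace://"

def resolve_prerequisite(ref, local_catalog, fetch_from_marketplace):
    # e.g. "marketplace://data/mnist" -> type "data", name "mnist";
    # anything else is looked up among the job protocol's own prerequisites.
    if ref.startswith(MARKETPLACE_PREFIX):
        item_type, name = ref[len(MARKETPLACE_PREFIX):].split("/", 1)
        return fetch_from_marketplace(item_type, name)
    return local_catalog[ref]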
