extend prerequisite field in job protocol #5145
Comments
Update of this issue:
Question about this:
I think they are used for different scenarios.
I have updated the full spec in the main body and here are some examples, including
prerequisites:
  - name: install-pai-copy
    type: script # indicates the purpose; not used by the backend, only for statistical analysis (except dockerimage)
    plugin: com.microsoft.pai.runtimeplugin.cmd # default plugin if not specified
    callbacks:
      - event: containerStart
        commands:
          - xxx # commands to set up nodejs
          - npm install -g @swordfaith/pai_copy
    failurePolicy: ignore/fail
  - name: covid_data
    type: data
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: containerStart
        commands:
          - mkdir -p /data/covid19/data/
          - cd /data/covid19/data/
          - 'wget https://x.x.x/yyy.zip'
          - export DATA_DIR=/data/covid19/data/
  - name: nfs-storage-1
    type: storage # indicates the purpose; not used by the backend, only for statistical analysis
    plugin: com.microsoft.pai.rest.storage # handled by REST server
    config: nfsconfig # special arguments for storage plugin only
    mountPoint: /mnt/nfs-storage-1
  - name: mnist-data
    type: data
    plugin: com.microsoft.pai.runtimeplugin.cmd
    require:
      - nfs-storage-1 # also inherits parameters like mountPoint
    callbacks:
      - event: containerStart
        commands:
          - export MNIST_DIR=<% this.mountPoint %>/mnist
  - name: output-dir
    type: output
    plugin: com.microsoft.pai.runtimeplugin.cmd
    require:
      - nfs-storage-1
    callbacks:
      - event: containerStart
        commands:
          - export OUTPUT_DIR=/tmp/output
      - event: containerExit
        commands:
          - 'if [ -z ${OUTPUT_DIR+x} ]; then'
          - echo "Not found OUTPUT_DIR environ"
          - else
          - pai_copy upload paiuploadtest //
          - fi
  - name: enable-ssh
    type: script
    plugin: com.microsoft.pai.runtimeplugin.ssh
    jobssh: true
    publicKeys: # optional; if not specified, only public keys in user.extensions.sshKeys will be added
      - ... # public keys
taskRoles:
  taskrole:
    dockerImage: default_image
    prerequisites:
      - mnist-data # required prerequisites will be automatically parsed and added in backend
      - output-dir
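The require handling in this example (required prerequisites added automatically by the backend, parameters such as mountPoint inherited) can be sketched as follows. This is only an illustration under stated assumptions, not the backend's actual code: the function name resolve_prerequisites, the RESERVED key set, and the rule that every non-reserved key is inherited are all hypothetical.

```python
from typing import Dict, List

# Assumption: these keys describe the prerequisite itself and are never inherited.
RESERVED = {"name", "type", "plugin", "require", "callbacks"}


def resolve_prerequisites(defined: List[dict], requested: List[str]) -> List[dict]:
    """Expand `require` chains so that dependencies come before their dependents."""
    by_name: Dict[str, dict] = {p["name"]: p for p in defined}
    resolved: Dict[str, dict] = {}  # insertion order doubles as dependency order
    visiting = set()

    def visit(name: str) -> dict:
        if name in resolved:
            return resolved[name]
        if name not in by_name:
            raise KeyError(f"undefined prerequisite: {name}")
        if name in visiting:
            raise ValueError(f"circular require involving: {name}")
        visiting.add(name)
        inherited: dict = {}
        for dep in by_name[name].get("require", []):
            parent = visit(dep)
            # Inherit the parent's parameters (e.g. mountPoint).
            inherited.update({k: v for k, v in parent.items() if k not in RESERVED})
        visiting.discard(name)
        # The prerequisite's own keys override inherited ones.
        resolved[name] = {**inherited, **by_name[name]}
        return resolved[name]

    for name in requested:
        visit(name)
    return list(resolved.values())


defined = [
    {"name": "nfs-storage-1", "type": "storage", "config": "nfsconfig",
     "mountPoint": "/mnt/nfs-storage-1"},
    {"name": "mnist-data", "type": "data", "require": ["nfs-storage-1"]},
    {"name": "output-dir", "type": "output", "require": ["nfs-storage-1"]},
]
resolved = resolve_prerequisites(defined, ["mnist-data", "output-dir"])
```

With the taskrole above asking only for mnist-data and output-dir, the sketch places nfs-storage-1 first and copies its mountPoint into both dependents.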
(TBD) Test Cases for v1.5.0 release
protocolVersion: 2
name: pre1
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo 111
          - echo 222
      - event: taskSucceeds
        commands:
          - echo 333
          - echo 444
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - justecho
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
expected
protocolVersion: 2
name: pre2
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho_first
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo 111
      - event: taskSucceeds
        commands:
          - echo 222
  - type: script
    name: justecho_later
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo aaa
      - event: taskSucceeds
        commands:
          - echo bbb
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - justecho_first
      - justecho_later
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
expected
protocolVersion: 2
name: pre3
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo 111
          - echo 222
      - event: taskSucceeds
        commands:
          - echo 333
          - echo 444
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - justecho_wrong
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 0
      cpu: 1
      memoryMB: 9672
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
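In pre3 the taskrole references justecho_wrong, a name that no job-level prerequisite defines. A check for that situation could look like the sketch below. The function name undefined_prerequisites is hypothetical, and this is not OpenPAI's actual validation code; it only compares the names listed under each taskrole's prerequisites with the names defined at job level (the dockerImage field is not checked).

```python
def undefined_prerequisites(job: dict) -> dict:
    """Map each taskrole name to the prerequisite names it references but the job never defines."""
    defined = {p["name"] for p in job.get("prerequisites", [])}
    missing = {}
    for role_name, role in job.get("taskRoles", {}).items():
        unknown = [n for n in role.get("prerequisites", []) if n not in defined]
        if unknown:
            missing[role_name] = unknown
    return missing


# Trimmed-down version of the pre3 job above.
pre3 = {
    "prerequisites": [
        {"type": "script", "name": "justecho"},
        {"type": "dockerimage", "name": "docker_image_0"},
    ],
    "taskRoles": {
        "taskrole": {
            "prerequisites": ["justecho_wrong"],
            "dockerImage": "docker_image_0",
        }
    },
}
problems = undefined_prerequisites(pre3)
```

An empty result means every reference resolves; for pre3 the result names the offending taskrole and reference.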
This job should work:
protocolVersion: 2
name: covid-chestxray-dataset_88170423
description: >
  COVID-19 chest X-ray image data collection
  It is to build a public open dataset of chest X-ray and CT images of patients
  which are positive or suspected of COVID-19 or other viral and bacterial
  pneumonias
  ([MERS](https://en.wikipedia.org/wiki/Middle_East_respiratory_syndrome),
  [SARS](https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome), and
  [ARDS](https://en.wikipedia.org/wiki/Acute_respiratory_distress_syndrome).).
contributor: OpenPAI
type: job
jobRetryCount: 0
prerequisites:
  - name: covid-chestxray-dataset
    type: data
    uri:
      - 'https://github.com/ieee8023/covid-chestxray-dataset.git'
  - name: default_image
    type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.4.0-gpu'
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: default_image
    data: covid-chestxray-dataset
    resourcePerInstance:
      cpu: 3
      memoryMB: 29065
      gpu: 1
    commands:
      - 'git clone <% $data.uri[0] %>'
defaults:
  virtualCluster: default
protocolVersion: 2
name: pre1_f7a15a5c
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: install-git
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - apt update
          - apt install -y git
  - type: data
    name: covid-19-data
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - mkdir -p /dataset/covid-19
          - >-
            git clone https://github.com/ieee8023/covid-chestxray-dataset.git
            /dataset/covid-19
  - type: dockerimage
    uri: 'ubuntu:18.04'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - install-git
      - covid-19-data
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - ls -la /dataset/covid-19
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
expect: the data is successfully listed
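This case can only list the data if install-git's commands run before covid-19-data's git clone. The sketch below shows one way the commands for an event could be collected, assuming callbacks run in the order the prerequisites are listed in the taskrole. That ordering rule is an assumption (the issue does not state it explicitly), and commands_for_event is a hypothetical helper, not part of the OpenPAI runtime.

```python
def commands_for_event(job: dict, role_name: str, event: str) -> list:
    """Collect callback commands for `event`, following the taskrole's prerequisite order."""
    by_name = {p["name"]: p for p in job["prerequisites"]}
    commands = []
    for name in job["taskRoles"][role_name].get("prerequisites", []):
        for callback in by_name[name].get("callbacks", []):
            if callback["event"] == event:
                commands.extend(callback["commands"])
    return commands


# Trimmed-down version of the job above; the folded YAML scalar becomes one line.
job = {
    "prerequisites": [
        {"name": "install-git", "callbacks": [
            {"event": "taskStarts",
             "commands": ["apt update", "apt install -y git"]}]},
        {"name": "covid-19-data", "callbacks": [
            {"event": "taskStarts",
             "commands": [
                 "mkdir -p /dataset/covid-19",
                 "git clone https://github.com/ieee8023/covid-chestxray-dataset.git /dataset/covid-19",
             ]}]},
        {"name": "docker_image_0", "type": "dockerimage"},
    ],
    "taskRoles": {"taskrole": {"prerequisites": ["install-git", "covid-19-data"]}},
}
start_commands = commands_for_event(job, "taskrole", "taskStarts")
```

Under that assumption, swapping the two names in the taskrole would put git clone ahead of the git installation.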
protocolVersion: 2
name: pre_secret_parameters
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo <% $parameters.x %>
          - echo <% $secrets.y %>
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
parameters:
  x: '111'
taskRoles:
  taskrole:
    prerequisites: [justecho]
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
secrets:
  'y': '222'
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
expect 111 and 222 in runtime.log
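The substitution this case checks can be illustrated with a small regex-based sketch. It handles only the two token forms used above, $parameters.NAME and $secrets.NAME, and is not OpenPAI's actual template engine; the render helper and TOKEN pattern are hypothetical.

```python
import re

# Matches tokens such as "<% $parameters.x %>" or "<% $secrets.y %>".
TOKEN = re.compile(r"<%\s*\$(parameters|secrets)\.(\w+)\s*%>")


def render(command: str, context: dict) -> str:
    """Replace each token with the matching value from `context`."""
    def substitute(match: "re.Match") -> str:
        scope, key = match.group(1), match.group(2)
        return str(context[scope][key])  # raises KeyError for an unknown name
    return TOKEN.sub(substitute, command)


context = {"parameters": {"x": "111"}, "secrets": {"y": "222"}}
rendered = [
    render(command, context)
    for command in ["echo <% $parameters.x %>", "echo <% $secrets.y %>"]
]
```

The two rendered commands are echo 111 and echo 222, which matches the expectation that both values show up in runtime.log.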
After discussion, the interaction between prerequisites and marketplace could be:
In taskrole, the prerequisites referenced from marketplace are defined directly. There is no need to include them in the job-level prerequisites. Use
In job protocol, prerequisites can use
The REST server reads all prerequisites listed in each taskrole's prerequisites and converts them to the real definitions by calling the marketplace's API. The result can be treated as job add-ons and saved in the database.
Motivation
The OpenPAI protocol lets users specify prerequisites (e.g. dockerimage, data, and script) and then reference them in taskrole. There are some limitations in the current version.
Only parameter (uri) definition is supported. This is enough for the most frequently used type, dockerimage, because Docker acts as the corresponding runtime executor. However, it is too limited for other types. For example, commands have to be injected into every taskrole to make the data ready in the job config below.
wget is an action on the data, but the two cannot be placed together.
wget commands must be injected everywhere.
Goal
Let prerequisites be well organized and object-oriented. Besides defining parameters, they also support real functions (callbacks on specific events).
prerequisites
Proposal
prerequisites
prerequisites
Examples
Full Spec:
Each of prerequisites will be handled in a way like