This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

[User Story] Dataset: integrate data prerequisite into marketplace and job submission page #5345

Open
hzy46 opened this issue Mar 4, 2021 · 1 comment

hzy46 commented Mar 4, 2021

Motivation

#5145 has extended the prerequisite field, but users can only use and share prerequisites in the job YAML. We can add UI support for prerequisites, especially data prerequisites. This issue explains how users create and use a data prerequisite in the cluster. With this feature, cluster users can easily share datasets with each other, and it may benefit future features such as dataset caching and optimization.

Explanation

How do users create a dataset in the cluster?

Dataset item that doesn't need PVC storage

The user should create a dataset item in the marketplace. A dataset item holds a prerequisite spec plus other metadata (e.g. title, usage) in the marketplace.

If the dataset is just downloaded from the Internet, it should have the following spec:

name: mnist
type: data
plugin: com.microsoft.pai.runtimeplugin.cmd
callbacks:
- event: taskStarts
  commands:
    - wget "<.....>" -O /dataset/mnist/<...>

Dataset item that needs PVC storage

If the dataset is already saved in a PVC, it should have the following spec:

name: imagenet
type: data
plugin: com.microsoft.pai.runtimeplugin.cmd
requireStorages:
- name: confignfs
  mountPath: /mnt/confignfs
callbacks:
- event: taskStarts
  commands:
    - mkdir -p /dataset
    - ln -s "/mnt/confignfs/users/mine/presaved-imagenet" "/dataset/imagenet"

Here we define a new field, requireStorages. It uses the same spec as the current storage implementation. If this prerequisite is included in a job, the storages listed here should be merged with the job's other PVC storages.
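A minimal sketch of the merge step, assuming each storage entry is a dict with `name` and `mountPath` as in the spec above. The function name `merge_storages` and its conflict rule (one mountPath may not be claimed by two different storages; exact duplicates are kept once) are hypothetical choices for illustration, not the actual rest-server behavior.

```python
from typing import Dict, List


def merge_storages(job_storages: List[Dict[str, str]],
                   required: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge requireStorages from prerequisites into the job's storage list.

    Hypothetical policy: an exact duplicate (same name and mountPath) is kept
    once, while one mountPath claimed by two different storages is an error.
    """
    merged: List[Dict[str, str]] = []
    owner_by_path: Dict[str, str] = {}  # mountPath -> storage name
    for storage in job_storages + required:
        name, path = storage["name"], storage["mountPath"]
        owner = owner_by_path.get(path)
        if owner is None:
            owner_by_path[path] = name
            merged.append({"name": name, "mountPath": path})
        elif owner != name:
            raise ValueError(f"mountPath {path} is claimed by {owner} and {name}")
        # Otherwise this is an exact duplicate; the first copy is kept.
    return merged
```

Rejecting path collisions early lets rest-server report a clear error at submission time instead of leaving the conflict to surface when the containers are created.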

How do users use a dataset in the cluster?

On marketplace pages

On marketplace pages, users can click "Use" to create an empty job with the corresponding dataset.

On job submission page

On the job submission page, users can select a dataset from a field in the task role section.

How is a marketplace prerequisite represented in the job YAML?

A dataset prerequisite from the marketplace will be expressed as marketplace://prerequisites/itemId/<item-id>.

One example is as follows:

taskRoles:
  taskrole:
    prerequisites: ["marketplace://prerequisites/itemId/1"]
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - echo 1
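As a rough sketch of how a consumer such as rest-server might pick these references out of taskRoles, the code below matches the proposed marketplace://prerequisites/itemId/<item-id> form. Treating the item id as numeric is an assumption drawn from the single example (itemId/1); `parse_marketplace_ref` and `collect_item_ids` are hypothetical names.

```python
import re
from typing import Dict, Optional, Set

# Reference format proposed above; numeric item ids are an assumption
# based on the single example (itemId/1).
MARKETPLACE_REF = re.compile(r"marketplace://prerequisites/itemId/(\d+)")


def parse_marketplace_ref(ref: str) -> Optional[int]:
    """Return the item id for a marketplace reference, else None."""
    match = MARKETPLACE_REF.fullmatch(ref)
    return int(match.group(1)) if match else None


def collect_item_ids(task_roles: Dict[str, dict]) -> Set[int]:
    """Gather the marketplace item ids referenced by all task roles."""
    ids: Set[int] = set()
    for role in task_roles.values():
        for ref in role.get("prerequisites", []):
            item_id = parse_marketplace_ref(ref)
            if item_id is not None:
                ids.add(item_id)
    return ids
```

Collecting ids across all task roles first would let the server fetch each marketplace item once, even if several task roles share it.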

The webportal should provide the user with a link to the corresponding marketplace item.

After submission, rest-server will parse these marketplace references and pass the resolved prerequisites to the database controller and the runtime. Rest-server must also handle requireStorages and merge it carefully with the job's other storage spec.

The following errors can happen in rest-server:

  • The user does not have permission to access the storages listed in requireStorages.
    • Should we hide these datasets from users who cannot access them? This is hard to implement now, so it may be left to future work.
  • The corresponding prerequisite cannot be found.
  • The call to the marketplace API fails.
  • Downloading the prerequisite item fails.
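One way to keep these failure cases distinguishable is a small exception hierarchy, sketched below. The class names, the injected `fetch` callable standing in for the marketplace API, and the `user_storages` permission set are all hypothetical; the real marketplace API and permission model are not specified in this issue. API and download failures are folded into one error here for brevity.

```python
from typing import Callable, Set


class PrerequisiteError(Exception):
    """Base class for the failure cases listed above."""


class StoragePermissionError(PrerequisiteError):
    """The user cannot access a storage in requireStorages."""


class PrerequisiteNotFound(PrerequisiteError):
    """The referenced marketplace item does not exist."""


class MarketplaceUnavailable(PrerequisiteError):
    """The marketplace call (or item download) failed."""


def resolve_item(item_id: int,
                 fetch: Callable[[int], dict],
                 user_storages: Set[str]) -> dict:
    """Fetch one marketplace prerequisite and check storage permissions.

    `fetch` is a stand-in for the marketplace API (hypothetical): it returns
    the item spec, or raises KeyError when the item does not exist.
    """
    try:
        spec = fetch(item_id)
    except KeyError:
        raise PrerequisiteNotFound(f"prerequisite item {item_id} not found") from None
    except (ConnectionError, TimeoutError) as err:
        raise MarketplaceUnavailable(f"marketplace call failed: {err}") from err
    for storage in spec.get("requireStorages", []):
        if storage["name"] not in user_storages:
            raise StoragePermissionError(
                f"no permission for storage {storage['name']}")
    return spec
```

Because every case derives from `PrerequisiteError`, the submission handler can catch one base class and still report which specific check failed.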

Other features

We can also allow http(s):// URLs in addition to marketplace://. This is convenient for users and easy to implement.

taskRoles:
  taskrole:
    prerequisites: ["https://raw.githubusercontent.com/microsoft/pai/master/contrib/xxxx.yml"]
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - echo 1
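With both forms allowed, each entry in prerequisites needs to be routed by its scheme. A minimal sketch using the standard library's `urllib.parse.urlparse`; the three category labels are hypothetical, and content fetched from an http(s) URL would presumably need the same validation as a marketplace item.

```python
from urllib.parse import urlparse


def reference_kind(ref: str) -> str:
    """Classify a prerequisites entry by its URL scheme.

    Returns "marketplace", "url" for http(s) links, or "name" for a plain
    prerequisite name such as docker_image_0 defined in the job itself.
    """
    scheme = urlparse(ref).scheme  # urlparse lowercases the scheme
    if scheme == "marketplace":
        return "marketplace"
    if scheme in ("http", "https"):
        return "url"
    return "name"
```

Plain names such as docker_image_0 contain an underscore, which is not a valid scheme character, so they parse with an empty scheme and fall through to "name".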

Implementation

  • marketplace: provide backend and UI
  • add field requireStorages
  • webportal: UI change and validation
  • rest-server: validation and prerequisite parsing
  • database-controller
  • runtime: run prerequisites
hzy46 mentioned this issue Mar 8, 2021

hzy46 commented Mar 19, 2021

Main Design Ideas

  • A prerequisite is mainly made up of name, plugin, plugin_params, type, and require.
  • To extend its usage, we introduce template_variables, which can be referenced in plugin_params. However, when users require a prerequisite, they must specify all of its template_variables. This simplification ensures that we never require a prerequisite with unfulfilled template variables.
  • Put marketplace prerequisites into the extras field so that the other parts of the job config stay cluster-agnostic.
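The "all template_variables must be specified" rule can be enforced at render time. The sketch below assumes `template_variables` is a list of `{"name": ...}` entries as in the examples, and uses a plain regex substitution for the `{{ name }}` placeholders; the issue does not say which template engine is intended, so this is a stand-in, and `render` is a hypothetical name.

```python
import re
from typing import Dict, List

# Matches placeholders such as {{ dataPath }}; spaces inside the braces are optional.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(text: str, template_variables: List[Dict[str, str]],
           values: Dict[str, object]) -> str:
    """Substitute placeholders, refusing to run with any variable unfulfilled."""
    declared = [var["name"] for var in template_variables]
    missing = [name for name in declared if name not in values]
    if missing:
        raise ValueError(f"unfulfilled template_variables: {missing}")
    # A placeholder that was never declared is treated as a spec error.
    undeclared = set(PLACEHOLDER.findall(text)) - set(declared)
    if undeclared:
        raise ValueError(f"undeclared placeholders: {sorted(undeclared)}")
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), text)
```

For instance, rendering "sleep {{ min }}m" with min set to 30 yields "sleep 30m", while omitting min raises an error before any command is produced.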

Examples

Set up an MNIST dataset

# in marketplace
- name: install_wget
  plugin: cmd
  plugin_params:
    callbacks:
      - event: taskStarts
        commands:
          - "apt update"
          - "apt install -y wget"
 
# in marketplace
- name: mnist
  require:
    - name: marketplace://name/install_wget
  plugin: cmd
  plugin_params:
    callbacks:
      - event: taskStarts
        commands:
          - mkdir -p {{ dataPath }}
          - wget http://1.2.3.4/mnist.zip -O {{ dataPath }}/mnist.zip
          - cd {{ dataPath }}
          - unzip mnist.zip
  template_variables:
    - name: dataPath
 
 
# in job
prerequisites:
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - mnist
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  reference_prerequisites:
    - name: mnist
      require:
        - name: marketplace://name/mnist
          template_variables:
            dataPath: /dataset/mnist
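In this example mnist requires install_wget, so the runtime would have to run install_wget's callbacks first. A minimal sketch of ordering a require chain with a depth-first traversal that also detects cycles; the `catalog` dict (marketplace name to item) and the function names are hypothetical, and a missing item simply raises KeyError here.

```python
from typing import Dict, List

NAME_PREFIX = "marketplace://name/"


def strip_prefix(ref: str) -> str:
    """Turn marketplace://name/<name> into <name>; other strings pass through."""
    return ref[len(NAME_PREFIX):] if ref.startswith(NAME_PREFIX) else ref


def expand_requires(root: str, catalog: Dict[str, dict]) -> List[str]:
    """Return prerequisite names in execution order: dependencies first."""
    order: List[str] = []
    state: Dict[str, str] = {}  # name -> "visiting" or "done"

    def visit(name: str) -> None:
        if state.get(name) == "done":
            return  # already scheduled, e.g. a shared dependency
        if state.get(name) == "visiting":
            raise ValueError(f"circular require involving {name}")
        state[name] = "visiting"
        for dep in catalog[name].get("require", []):
            visit(strip_prefix(dep["name"]))
        state[name] = "done"
        order.append(name)

    visit(root)
    return order
```

A post-order traversal like this is one standard way to produce a topological order; the "visiting" state is what turns a circular require into an explicit error instead of infinite recursion.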

Set up an ImageNet dataset

# set up an ImageNet dataset
 
#  in marketplace
- name: confignfs_pvc
  plugin: pvc_storage
  plugin_params:
    name: confignfs
    mountPath: "{{ mountPath }}"
  template_variables:
    - name: mountPath
 
# in marketplace
- name: imagenet
  require: # if the required prerequisite has template_variables, all the template_variables MUST be fulfilled.
    - name: marketplace://name/confignfs_pvc
      template_variables:
        mountPath: /mnt/confignfs_pvc
  plugin: cmd
  plugin_params:
    callbacks:
    - event: taskStarts
      commands:
      - mkdir -p {{ dataPath }}
      - cp -r /mnt/confignfs_pvc/imagenet/* {{ dataPath }}
  template_variables:
    - name: dataPath
 
# in marketplace
- name: imagenet_only_validation
  require: # if the required prerequisite has template_variables, all the template_variables MUST be fulfilled.
    - name: marketplace://name/confignfs_pvc
      template_variables:
        mountPath: /mnt/confignfs_pvc
  plugin: cmd
  plugin_params:
    callbacks:
    - event: taskStarts
      commands:
      - mkdir -p {{ dataPath }}
      - cp -r /mnt/confignfs_pvc/imagenet/validation/* {{ dataPath }}
  template_variables:
    - name: dataPath
 
 
# in job
prerequisites:
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - imagenet
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  reference_prerequisites:
  - name: imagenet
    require:
    - name: marketplace://name/imagenet
      template_variables:
        dataPath: /dataset/imagenet
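Both imagenet and imagenet_only_validation require confignfs_pvc with the same mountPath, so a job that pulled in both would reference that item twice. A minimal sketch of one possible policy, which is an assumption rather than part of the design: identical requirements collapse to one, while the same item required with different template_variables is rejected.

```python
from typing import Dict, List


def dedupe_requires(requires: List[dict]) -> List[dict]:
    """Keep one copy of each required item; reject conflicting template_variables.

    Hypothetical policy: requiring the same item twice with identical
    template_variables is harmless, but different values (for example two
    mountPaths for confignfs_pvc) are reported as a conflict.
    """
    seen: Dict[str, Dict[str, object]] = {}
    result: List[dict] = []
    for req in requires:
        name = req["name"]
        variables = req.get("template_variables", {})
        if name not in seen:
            seen[name] = variables
            result.append(req)
        elif seen[name] != variables:
            raise ValueError(f"{name} required with conflicting template_variables")
    return result
```

Since the design requires every template variable to be filled in, comparing the variable dicts is enough to tell whether two requirements would render to the same thing.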

Set up a debug hook

# set up a debug hook
 
#  in marketplace
- name: debug_hook
  plugin: cmd
  plugin_params:
    callbacks:
      - event: taskFails
        commands:
          - echo "will sleep for {{ min }} minutes for debugging..."
          - sleep {{ min }}m
  template_variables:
      - name: min
 
# in job
prerequisites:
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - debug_hook
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  reference_prerequisites:
  - name: debug_hook
    require:
    - name: marketplace://name/debug_hook
      template_variables:
        min: 30
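Unlike the dataset examples, debug_hook registers a taskFails callback, so its commands should run only when the task fails. A minimal sketch of selecting the commands for one event from already-rendered prerequisites, assuming the plugin_params.callbacks layout used in these examples; how the runtime actually dispatches events is not described in this issue, and `commands_for_event` is a hypothetical name.

```python
from typing import List


def commands_for_event(prerequisites: List[dict], event: str) -> List[str]:
    """Collect, in prerequisite order, the commands registered for one event."""
    commands: List[str] = []
    for prereq in prerequisites:
        callbacks = prereq.get("plugin_params", {}).get("callbacks", [])
        for callback in callbacks:
            if callback["event"] == event:
                commands.extend(callback["commands"])
    return commands
```

Keeping prerequisite order matters for events like taskStarts, where install_wget must run before the mnist download.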

yiyione mentioned this issue Apr 26, 2021