Are you looking for a tool to manage your training runs locally, on Slurm/Open Grid Engine clusters, SSH servers or Google Cloud Platform VMs? mle-scheduler
provides a lightweight API to launch and monitor job queues. It smoothly orchestrates simultaneous runs for different configurations and/or random seeds. It is meant to reduce boilerplate and to make job resource specification intuitive. It comes with two core pillars:
- MLEJob: Launches and monitors a single job on a resource (Slurm, Open Grid Engine, GCP, SSH, etc.).
- MLEQueue: Launches and monitors a queue of jobs with different training configurations and/or seeds.
For a quickstart, check out the notebook blog or the example scripts.
Example scripts are provided for Local, Slurm, Grid Engine, SSH, and GCP setups.
A PyPI installation is available via:
pip install mle-scheduler
If you want to get the most recent commit, please install directly from the repository:
pip install git+https://github.com/mle-infrastructure/mle-scheduler.git@main
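After installation, a quick sanity check (a minimal sketch; it only confirms that the two core classes are importable):

from mle_scheduler import MLEJob, MLEQueue

# If this runs without an ImportError, the package is installed correctly
print(MLEJob.__name__, MLEQueue.__name__)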
from mle_scheduler import MLEJob

# Launch a single local run, equivalent to:
# python train.py -config base_config_1.yaml -exp_dir logs_single -seed_id 1
job = MLEJob(
    resource_to_run="local",               # Run on the local machine
    job_filename="train.py",               # Script to execute
    config_filename="base_config_1.yaml",  # Config file passed via -config
    experiment_dir="logs_single",          # Results directory passed via -exp_dir
    seed_id=1,                             # Random seed passed via -seed_id
)

_ = job.run()
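mle-scheduler launches your script as a subprocess and hands over the configuration file, experiment directory and seed as command-line flags (see the commented command above). Below is an illustrative sketch of what a compatible train.py could look like; it is not part of mle-scheduler, and the exact flag names (-seed_id here vs. -seed in the queue example below) are an assumption taken from the commented commands:

# train.py -- illustrative sketch, not part of mle-scheduler itself
import argparse


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-config", type=str, help="Path to the YAML config file")
    parser.add_argument("-exp_dir", type=str, help="Directory to store logs/results in")
    parser.add_argument("-seed_id", type=int, default=0, help="Random seed for this run")
    args = parser.parse_args()

    # ... load args.config, seed your RNGs with args.seed_id,
    # train your model and write results into args.exp_dir ...
    print(f"Running {args.config} with seed {args.seed_id} -> {args.exp_dir}")


if __name__ == "__main__":
    main()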
from mle_scheduler import MLEQueue

# Launch a local queue of 2 configs x 2 seeds, equivalent to running:
# python train.py -config base_config_1.yaml -seed 0 -exp_dir logs_queue/<date>_base_config_1
# python train.py -config base_config_1.yaml -seed 1 -exp_dir logs_queue/<date>_base_config_1
# python train.py -config base_config_2.yaml -seed 0 -exp_dir logs_queue/<date>_base_config_2
# python train.py -config base_config_2.yaml -seed 1 -exp_dir logs_queue/<date>_base_config_2
queue = MLEQueue(
    resource_to_run="local",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_queue",
)
queue.run()
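Each configuration writes into its own timestamped sub-directory under logs_queue (see the commented commands above). A small sketch, assuming that directory layout, for collecting the per-configuration result directories after the queue has finished:

from pathlib import Path

# Collect the experiment directories created by the queue, e.g.
# logs_queue/<date>_base_config_1 and logs_queue/<date>_base_config_2
exp_dirs = sorted(Path("logs_queue").glob("*_base_config_*"))
for exp_dir in exp_dirs:
    print("Results stored in:", exp_dir)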
# Launch a queue of jobs on a Slurm cluster
# Each job requests 5 CPU cores & 1 V100S GPU & loads CUDA 10.0
job_args = {
    "partition": "<SLURM_PARTITION>",       # Partition to schedule jobs on
    "env_name": "mle-toolbox",              # Env to activate at job start-up
    "use_conda_venv": True,                 # Whether to use anaconda venv
    "num_logical_cores": 5,                 # Number of requested CPU cores per job
    "num_gpus": 1,                          # Number of requested GPUs per job
    "gpu_type": "V100S",                    # GPU model requested for each job
    "modules_to_load": "nvidia/cuda/10.0"   # Modules to load at start-up
}

queue = MLEQueue(
    resource_to_run="slurm-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    experiment_dir="logs_slurm",
    random_seeds=[0, 1],
)
queue.run()
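The same job_args dictionary can also be paired with a single job instead of a queue. A sketch, assuming MLEJob accepts the job_arguments keyword in the same way MLEQueue does (the experiment directory name here is purely illustrative):

from mle_scheduler import MLEJob

# Single Slurm run with the resource request defined in job_args above
# (assumes MLEJob mirrors MLEQueue's job_arguments interface)
job = MLEJob(
    resource_to_run="slurm-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filename="base_config_1.yaml",
    experiment_dir="logs_slurm_single",
    seed_id=0,
)
_ = job.run()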
# Launch a queue of jobs on an Open Grid Engine cluster
# Each job requests 5 CPU cores & 1 V100S GPU w. CUDA 10.0 loaded
job_args = {
    "queue": "<GRID_ENGINE_QUEUE>",   # Queue to schedule jobs on
    "env_name": "mle-toolbox",        # Env to activate at job start-up
    "use_conda_venv": True,           # Whether to use anaconda venv
    "num_logical_cores": 5,           # Number of requested CPU cores per job
    "num_gpus": 1,                    # Number of requested GPUs per job
    "gpu_type": "V100S",              # GPU model requested for each job
    "gpu_prefix": "cuda"              # Resource prefix: #$ -l {gpu_prefix}="{num_gpus}"
}

queue = MLEQueue(
    resource_to_run="sge-cluster",
    job_filename="train.py",
    job_arguments=job_args,
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    experiment_dir="logs_grid_engine",
    random_seeds=[0, 1],
)
queue.run()
# Launch a queue of jobs on a remote machine via SSH
ssh_settings = {
    "user_name": "<SSH_USER_NAME>",   # SSH server user name
    "pkey_path": "<PKEY_PATH>",       # Private key path (e.g. ~/.ssh/id_rsa)
    "main_server": "<SSH_SERVER>",    # SSH server address
    "jump_server": '',                # Jump host address
    "ssh_port": 22,                   # SSH port
    "remote_dir": "mle-code-dir",     # Dir to sync code to on server
    "start_up_copy_dir": True,        # Whether to copy code to server
    "clean_up_remote_dir": True       # Whether to delete remote_dir on exit
}

job_args = {
    "env_name": "mle-toolbox",        # Env to activate at job start-up
    "use_conda_venv": True            # Whether to use anaconda venv
}

queue = MLEQueue(
    resource_to_run="ssh-node",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_ssh_queue",
    job_arguments=job_args,
    ssh_settings=ssh_settings,
)
queue.run()
# Launch a queue of jobs on Google Cloud Platform VMs
cloud_settings = {
    "project_name": "<GCP_PROJECT_NAME>",  # Name of your GCP project
    "bucket_name": "<GCS_BUCKET_NAME>",    # Name of your GCS bucket
    "remote_dir": "<GCS_CODE_DIR_NAME>",   # Name of code dir in bucket
    "start_up_copy_dir": True,             # Whether to copy code to bucket
    "clean_up_remote_dir": True            # Whether to delete remote_dir on exit
}

job_args = {
    "num_gpus": 0,            # Number of requested GPUs per job
    "gpu_type": None,         # GPU requested, e.g. "nvidia-tesla-v100"
    "num_logical_cores": 1,   # Number of requested CPU cores per job
}

queue = MLEQueue(
    resource_to_run="gcp-cloud",
    job_filename="train.py",
    config_filenames=["base_config_1.yaml",
                      "base_config_2.yaml"],
    random_seeds=[0, 1],
    experiment_dir="logs_gcp_queue",
    job_arguments=job_args,
    cloud_settings=cloud_settings,
)
queue.run()
If you use mle-scheduler in your research, please cite it as follows:
@software{mle_infrastructure2021github,
  author = {Robert Tjarko Lange},
  title = {{MLE-Infrastructure}: A Set of Lightweight Tools for Distributed Machine Learning Experimentation},
  url = {http://github.com/mle-infrastructure},
  year = {2021},
}
You can run the test suite via python -m pytest -vv tests/. If you find a bug or are missing your favourite feature, feel free to create an issue and/or start contributing.