ML Scheduler is a lightweight machine learning experiment scheduler that automates resource management (e.g., GPUs and models) and batch-runs experiments with just a few lines of Python code.
- Install ml-scheduler:

```bash
pip install ml-scheduler
```

or install from the GitHub repository:

```bash
git clone https://github.com/huyiwen/ml_scheduler
cd ml_scheduler
pip install -e .
```
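To verify the installation, a minimal sanity check (nothing beyond the module import is assumed):

```python
import ml_scheduler  # should import without errors after installation

print(ml_scheduler.__file__)  # shows where the package was installed
```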
- Create a Python script (e.g. `run.py`):

```python
import ml_scheduler

# Resource pools: CUDA devices 0 and 2, and a disk pool on /one-fs.
cuda = ml_scheduler.pools.CUDAPool([0, 2], 90)
disk = ml_scheduler.pools.DiskPool('/one-fs')


@ml_scheduler.exp_func
async def mmlu(exp: ml_scheduler.Exp, model, checkpoint):
    source_dir = f"/another-fs/model/{model}/checkpoint-{checkpoint}"
    target_dir = f"/one-fs/model/{model}-{checkpoint}"

    # resources will be cleaned up after exiting the function
    disk_resource = await exp.get(
        disk.copy_folder,
        source_dir,
        target_dir,
        cleanup_target=True,
    )
    cuda_resource = await exp.get(cuda.allocate, 1)

    # run inference
    args = [
        "python", "inference.py",
        "--model", target_dir,
        "--dataset", "mmlu",
        "--cuda", str(cuda_resource[0]),
    ]
    stdout = await exp.run(args=args)

    await exp.report({'Accuracy': stdout})


mmlu.run_csv("experiments.csv", ['Accuracy'])
```
Mark the function with `@ml_scheduler.exp_func` and declare it `async` to make it an experiment function. The function should take an `exp` argument as its first parameter. Then use `await exp.get` to acquire resources (non-blocking) and `await exp.run` to run the experiment (also non-blocking). Non-blocking means that you can run multiple experiments concurrently.
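For instance, another experiment function can be added to the same script and reuse the pools defined above. The sketch below is illustrative only: the `gsm8k` dataset name is a hypothetical stand-in, and the `inference.py` flags are assumed to match the MMLU example.

```python
# A minimal sketch of a second experiment function; it reuses the `cuda`
# pool and imports from run.py above. "gsm8k" is a hypothetical dataset name.
@ml_scheduler.exp_func
async def gsm8k(exp: ml_scheduler.Exp, model, checkpoint):
    cuda_resource = await exp.get(cuda.allocate, 1)
    stdout = await exp.run(args=[
        "python", "inference.py",
        "--model", f"/one-fs/model/{model}-{checkpoint}",
        "--dataset", "gsm8k",
        "--cuda", str(cuda_resource[0]),
    ])
    await exp.report({'Accuracy': stdout})


gsm8k.run_csv("experiments.csv", ['Accuracy'])
```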
- Create a CSV file `experiments.csv` with your arguments (`model` and `checkpoint` in this case):

```csv
model,checkpoint
alpacaflan-packing,200
alpacaflan-packing,400
alpacaflan-qlora,200-merged
alpacaflan-qlora,400-merged
```
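If the argument grid grows large, the file can also be generated programmatically. A small sketch using only the standard library, with the same file name and values as above:

```python
import csv

# Rows mirror the hand-written experiments.csv above.
rows = [
    ("alpacaflan-packing", "200"),
    ("alpacaflan-packing", "400"),
    ("alpacaflan-qlora", "200-merged"),
    ("alpacaflan-qlora", "400-merged"),
]

with open("experiments.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "checkpoint"])  # header names match the function arguments
    writer.writerows(rows)
```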
- Run the script:

```bash
python run.py
```
The results (`Accuracy` in this case) and some other information will be saved in `results.csv`.
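A quick way to inspect the output, assuming only that `results.csv` contains the reported `Accuracy` column (the remaining columns depend on what the scheduler records):

```python
import csv

with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        # 'Accuracy' is the metric reported via exp.report above.
        print(row.get("Accuracy"), row)
```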