ml_scheduler


ML Scheduler is a lightweight machine learning experiment scheduler that automates resource management (e.g., GPUs and models) and batch-runs experiments with just a few lines of Python code.

Quick Start

  1. Install ml-scheduler:
pip install ml-scheduler

or install from the GitHub repository:

git clone https://github.com/huyiwen/ml_scheduler
cd ml_scheduler
pip install -e .
  2. Create a Python script (run.py):
import ml_scheduler

cuda = ml_scheduler.pools.CUDAPool([0, 2], 90)
disk = ml_scheduler.pools.DiskPool('/one-fs')


@ml_scheduler.exp_func
async def mmlu(exp: ml_scheduler.Exp, model, checkpoint):

    source_dir = f"/another-fs/model/{model}/checkpoint-{checkpoint}"
    target_dir = f"/one-fs/model/{model}-{checkpoint}"

    # resources will be cleaned up after exiting the function
    disk_resource = await exp.get(
        disk.copy_folder,
        source_dir,
        target_dir,
        cleanup_target=True,
    )
    cuda_resource = await exp.get(cuda.allocate, 1)

    # run inference
    args = [
        "python", "inference.py",
        "--model", target_dir,
        "--dataset", "mmlu",
        "--cuda", str(cuda_resource[0]),
    ]
    stdout = await exp.run(args=args)
    await exp.report({'Accuracy': stdout})


mmlu.run_csv("experiments.csv", ['Accuracy'])

Mark the function with @ml_scheduler.exp_func and declare it async to make it an experiment function. The function must take an exp argument as its first parameter.

Then use await exp.get to acquire resources (non-blocking) and await exp.run to run the experiment (also non-blocking). Non-blocking means that multiple experiments can run concurrently: while one experiment is waiting for a resource or a subprocess, others keep making progress.
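As a minimal sketch of a second experiment function (the function name, the nvidia-smi command, and the Output column are made up for illustration; the ml_scheduler calls are the same ones used above), every await point is where the scheduler can switch to another CSV row:

import ml_scheduler

cuda = ml_scheduler.pools.CUDAPool([0, 2], 90)


@ml_scheduler.exp_func
async def smoke_test(exp: ml_scheduler.Exp, model):
    # Wait for one free GPU; other experiments keep running while this one waits.
    cuda_resource = await exp.get(cuda.allocate, 1)

    # The subprocess is awaited too, so other rows can proceed concurrently.
    stdout = await exp.run(args=["nvidia-smi", "-i", str(cuda_resource[0])])

    await exp.report({"Output": stdout})


smoke_test.run_csv("experiments.csv", ["Output"])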

  3. Create a CSV file experiments.csv with your arguments (model and checkpoint in this case):
model,checkpoint
alpacaflan-packing,200
alpacaflan-packing,400
alpacaflan-qlora,200-merged
alpacaflan-qlora,400-merged
  4. Run the script:
python run.py

The results (Accuracy in this case) and some other information will be saved in results.csv.
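To inspect the output programmatically, here is a quick sketch with pandas (it assumes results.csv keeps the input columns from experiments.csv alongside the reported metric, which is not guaranteed by the snippet above):

import pandas as pd

results = pd.read_csv("results.csv")
# Assumed columns: the inputs (model, checkpoint) plus the reported Accuracy.
print(results.sort_values("Accuracy", ascending=False).head())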

More Examples
