A simple scheduler of dockerized tasks, built with LLM use-cases in mind.
WARNING: This project is still a work in progress and is not released.
You spent a lot of $$$ on a brand new 8xA100 machine so that your Machine Learning team can train LLMs. A month later it turns out that:
- Everyone is logging in with the "ubuntu" user and messing up each other's environments.
- Some people leave their Jupyter notebooks running without releasing GPUs.
- Getting an available GPU usually involves messaging the entire team on Slack and threatening them with a restart.
- The IT department is complaining that your GPU utilization is too low.
Now you want to restore some semblance of order, which means fair and transparent allocation and good resource utilization. You begin to look into available tools and discover that they are either difficult to use, require an entire team to maintain, or both.
This simple task scheduling system is designed to fill that gap. It provides the following features:
- Very simple installation.
- All actions can be done either via UI or a REST API.
- Environment isolation is achieved by requiring every task to be a Docker container.
- Tracking of running, scheduled and completed tasks.
- Each task can request a number of GPUs. It will be queued until the requested resources are available.
- Commonly used tasks can be pre-configured to be scheduled with one click.
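
The queueing behaviour described above (a task waits until its requested GPUs are free) can be illustrated with a small sketch. This is not the project's actual scheduler: the `Task` and `GpuQueue` names are hypothetical, and strict first-in-first-out ordering is an assumption made to keep the example short.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    gpus: int  # number of GPUs the task requests


class GpuQueue:
    """Toy FIFO queue: a task starts only when enough GPUs are free."""

    def __init__(self, total_gpus: int) -> None:
        self.free = total_gpus   # GPUs not held by a running task
        self.pending = deque()   # tasks waiting for resources, oldest first
        self.running = []        # tasks currently holding GPUs

    def submit(self, task: Task) -> None:
        """Queue a task and start it immediately if resources allow."""
        self.pending.append(task)
        self._dispatch()

    def finish(self, task: Task) -> None:
        """Release the GPUs of a running task and start waiting tasks."""
        self.running.remove(task)
        self.free += task.gpus
        self._dispatch()

    def _dispatch(self) -> None:
        # Strict FIFO (an assumption): stop at the first task that does
        # not fit, even if a later, smaller task would.
        while self.pending and self.pending[0].gpus <= self.free:
            task = self.pending.popleft()
            self.free -= task.gpus
            self.running.append(task)
```

Strict FIFO is the simplest policy that is fair in arrival order, but a large request at the head of the queue can leave GPUs idle. A real scheduler might let smaller tasks run ahead to improve utilization.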
Docker and Docker Compose are required. For GPU support, please install the NVIDIA Container Toolkit.
For a quick glance at how the system works, run the following command: `docker compose up`.
The web UI will be available at http://localhost:8004.
Please note that this is for demonstration only - the system is not secure without configuration.
Please create a `.env` file in the root directory of the project. The following variables are required:
SECRET_KEY
FIRST_SUPERUSER_USERNAME # default: admin
FIRST_SUPERUSER_PASSWORD
FIRST_SUPERUSER_EMAIL
A secret key can be generated with `openssl rand -hex 32`.
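
If `openssl` is not at hand, an equivalent key can be produced with the Python standard library. The sketch below is only an alternative way to get 32 random bytes encoded as 64 hexadecimal characters; the function name `generate_secret_key` is made up for illustration.

```python
import secrets


def generate_secret_key(num_bytes: int = 32) -> str:
    """Return num_bytes of cryptographically secure randomness as hex text."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Paste the printed value into .env as SECRET_KEY=<value>
    print(generate_secret_key())
```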
Optional variables:
REQUIRE_LOGIN_FOR_SUBMIT # default: True
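
Before starting the system, it may help to check that the `.env` file defines the variables listed above. The following is a minimal sketch, not part of the project. It assumes plain `KEY=value` lines without quoting or `export`, and the helper names `parse_env` and `missing_variables` are hypothetical. It treats all four listed names as required, although the project may accept a missing `FIRST_SUPERUSER_USERNAME` because it has a default.

```python
REQUIRED = (
    "SECRET_KEY",
    "FIRST_SUPERUSER_USERNAME",
    "FIRST_SUPERUSER_PASSWORD",
    "FIRST_SUPERUSER_EMAIL",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_variables(text: str) -> list[str]:
    """Return the required names that are absent or empty in the text."""
    values = parse_env(text)
    return [name for name in REQUIRED if not values.get(name)]
```

Calling `missing_variables(open(".env").read())` returns an empty list when every listed variable has a value.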
The web UI is available at http://localhost:8004. The following features are available:
- Create a new task.
- View running, scheduled and completed tasks.
- View task details, including real-time logs.
- Manage users: create, change password, delete.
- Pre-configure commonly used tasks.
Planned improvements:
- Add support for GPU utilization monitoring.
- Add support for multiple worker machines.
Contributions are welcome. Please open an issue to discuss your ideas before submitting a PR.