TurboSched: A Modern and Configurable Job Scheduling System
TurboSched aims to be a modern alternative to traditional job schedulers such as SLURM and PBS. It is designed to be highly configurable and extensible, with a primary focus on GPU clusters.
TurboSched is still under heavy development and is far from even a working prototype at the moment. Any contributions are welcome!
- Install the Protobuf compiler protoc and the Go plugins (guide). Generate Go code from the proto files:
    protoc --go_out=. --go_opt=paths=source_relative --go-grpc_out=. --go-grpc_opt=paths=source_relative common/proto/*.proto
- Copy config.toml.example to config.toml and modify the configuration file as needed.
- Run on the nodes in the following order:
    go run turbod/main.go -c # Controller
    go run turbod/main.go -m # Compute Node
- As a client, you can use the following command:
    go run turbo/main.go python # submit an interactive job running python
    go run turbo/main.go stop <job_id> # stop a job immediately
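The protoc invocation from the first step can also be recorded as a go:generate directive, so that it does not have to be retyped. The following is only a sketch: the file name gen.go, its location at the repository root, and the package name are assumptions, not part of the project. The protoc flags are copied unchanged from the command above.

```go
// gen.go: hypothetical file at the repository root.
// The package name below is an assumption made for illustration.
package turbosched

// go generate does not expand shell globs itself, so the command is
// wrapped in "sh -c" to let the shell expand common/proto/*.proto.
//go:generate sh -c "protoc --go_out=. --go_opt=paths=source_relative --go-grpc_out=. --go-grpc_opt=paths=source_relative common/proto/*.proto"
```

With such a file in place, running go generate from the repository root would run the same protoc command. This relies on a POSIX sh being available, so it would not work as written on a plain Windows shell.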
Note: Please do not rely on this roadmap as it is outdated. The issue page is the most up-to-date source of information.
- Basic single-node execution
- Basic single-node scheduling
- Task Cancellation
- GPU resource management
- GPU-aware scheduling
- Basic multi-node scheduling
- Failure-aware scheduling
- Multi-node discovery
- Task accounting