discord-cluster-manager

This is the code for the Discord bot we'll be using to queue jobs to a cluster of GPUs that our generous sponsors have provided. Our goal is to be able to queue kernels that can run end to end in seconds, so that things feel interactive and social.

The key idea is that we're using GitHub Actions as a job scheduling engine: the Discord bot interacts with the cluster primarily by triggering GitHub Actions workflows and monitoring their status. While we're focused on having a nice user experience on discord.gg/gpumode, we're happy to accept PRs that make it easier for other Discord communities to hook up GPUs.

Demo!

Supported schedulers

GitHub Actions
Modal
Slurm (not implemented yet)

How to run and develop the bot locally

To run and develop the bot locally, you need to add it to your own server. Follow the steps here and here to create a bot application and then add it to your server.

Here is a visual walk-through of the steps (after clicking on the New Application button):

The bot needs the Message Content Intent permission.

Click here for visual.
The bot also needs applications.commands and bot scopes.

Click here for visual.
The bot also needs permissions to read and write messages, which is easy to set up if you click on this link. Finally, generate an invite link for the bot and open it in any browser.

Click here for visual.

Note

Bot permissions involving threads/mentions/messages should suffice, but you can naively give it Administrator since it's just a test bot in your own testing Discord server.

Environment Variables

After this, you should be able to create a .env file with the following environment variables:

DISCORD_DEBUG_TOKEN : The token of the bot you want to run locally
DISCORD_DEBUG_CLUSTER_STAGING_ID : The ID of the staging server you want to connect to
GITHUB_TOKEN : A GitHub token with permissions to trigger workflows. The bot triggers workflows on your behalf; for now, only new branches from discord-cluster-manager are tested

Below is where to find these environment variables:

DISCORD_DEBUG_TOKEN or DISCORD_TOKEN: Found in your bot's page within the Discord Developer Portal:

Click here for visual.
DISCORD_DEBUG_CLUSTER_STAGING_ID or DISCORD_CLUSTER_STAGING_ID: Right-click your staging Discord server and select Copy Server ID:

Click here for visual.
GITHUB_TOKEN: Found in Settings -> Developer Settings (or here).
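
Before starting the bot, it can help to check that the variables above are actually set. The following is a minimal sketch, not code from this repository: the helper `missing_env_vars` is hypothetical, and it assumes that production mode needs `GITHUB_TOKEN` as well as the non-debug variable names. It only inspects the process environment and does not read the .env file itself.

```python
import os

# Variable names come from this README; the grouping by mode is an assumption.
REQUIRED_VARS = {
    True: ["DISCORD_DEBUG_TOKEN", "DISCORD_DEBUG_CLUSTER_STAGING_ID", "GITHUB_TOKEN"],
    False: ["DISCORD_TOKEN", "DISCORD_CLUSTER_STAGING_ID", "GITHUB_TOKEN"],
}


def missing_env_vars(debug, env=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS[debug] if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(debug=True)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All debug-mode variables are set.")
```

Passing a plain dictionary as `env` makes the check easy to try without touching your real environment.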

How to run the bot

Install dependencies with pip install -r requirements.txt
Create a .env file with the environment variables listed above
Run python src/discord-cluster-manager/bot.py --debug

Usage instructions

Note

To test functionality of the Modal runner, you also need to be authenticated with Modal. Modal provides free credits to get started.

To test functionality of the GitHub runner, you may need direct access to this repo.

/run modal <gpu_type> to run on Modal with a specific GPU; right now it defaults to T4
/run github <NVIDIA/AMD> to run through GitHub Actions; the choice picks one of two workflow files
/resync to clear all the commands and resync them
/ping to check if the bot is online
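
To illustrate how a command like /run github could map to a GitHub Actions run, here is a rough sketch that builds (but does not send) a `workflow_dispatch` request against the GitHub REST API. It is not the bot's actual implementation: the workflow file names, the default `ref` of `main`, and the function `build_dispatch_request` are all assumptions made for illustration.

```python
import json
import urllib.request

# Hypothetical workflow file names; check the repo's .github directory for the real ones.
WORKFLOW_FILES = {"NVIDIA": "nvidia_workflow.yml", "AMD": "amd_workflow.yml"}


def build_dispatch_request(token, vendor, ref="main",
                           repo="gpu-mode/discord-cluster-manager"):
    """Build a POST request that would trigger the workflow for the chosen vendor."""
    workflow = WORKFLOW_FILES[vendor.upper()]
    url = f"https://api.github.com/repos/{repo}/actions/workflows/{workflow}/dispatches"
    body = json.dumps({"ref": ref}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_dispatch_request("<your GITHUB_TOKEN>", "NVIDIA")
    print(req.get_method(), req.full_url)
    # Sending it would be: urllib.request.urlopen(req)
```

The request is only constructed here so the sketch can be inspected safely; monitoring the run's status afterwards would need further API calls that are not shown.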

How to test the bot

The smoke test script in tests/discord-bot-smoke-test.py should be run to verify basic functionality of the cluster bot. For usage information, run with python tests/discord-bot-smoke-test.py -h. Run it against your own server.

Important

You need to have the environment variables listed above set to run the bot on your own server.

You can run the bot in two modes:

Production mode: python discord-bot.py
Debug/staging mode: python discord-bot.py --debug

When running in debug mode, the bot will use your DISCORD_DEBUG_TOKEN and DISCORD_DEBUG_CLUSTER_STAGING_ID and display as "Cluster Bot (Staging)" to clearly indicate it's not the production instance.
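
The mode switch described above could be sketched roughly as follows. This is an illustration under assumptions, not the bot's real startup code: `select_mode` is a hypothetical function, and the production display name "Cluster Bot" is a guess, since the README only states the staging name.

```python
import argparse


def select_mode(argv=None):
    """Parse --debug and return which variables and display name to use (sketch)."""
    parser = argparse.ArgumentParser(description="Cluster bot launcher (sketch)")
    parser.add_argument("--debug", action="store_true",
                        help="run against the staging server")
    args = parser.parse_args(argv)
    if args.debug:
        return {
            "token_var": "DISCORD_DEBUG_TOKEN",
            "guild_var": "DISCORD_DEBUG_CLUSTER_STAGING_ID",
            "display_name": "Cluster Bot (Staging)",
        }
    return {
        "token_var": "DISCORD_TOKEN",
        "guild_var": "DISCORD_CLUSTER_STAGING_ID",
        "display_name": "Cluster Bot",  # assumed production name
    }


if __name__ == "__main__":
    print(select_mode())
```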

How to add a new GPU to the cluster

If you'd like to donate a GPU to our efforts, we can make you a CI admin on GitHub and have you add an org-level runner at https://github.com/organizations/gpu-mode/settings/actions/runners

Acknowledgements

Thank you to AMD for sponsoring an MI250 node
Thank you to NVIDIA for sponsoring an H100 node
Thank you to Nebius for sponsoring credits and an H100 node
Thank you to Modal for credits and speedy startup times
Luca Antiga did something very similar for the NeurIPS LLM efficiency competition, and it was great!
Midjourney was a similar inspiration in terms of UX
