konstantinjdobler/nlp-research-template

An opinionated template for reproducible NLP research code

Docker Hub Code style: black Linter License: MIT

NLP research template for training language models using PyTorch + Lightning + Weights & Biases + HuggingFace. It's built to be customized but provides comprehensive, sensible default functionality.

If you are not doing NLP or want to use your own training code or template, the setup and environment tooling with Docker, mamba, and conda-lock in this template might still be interesting for you.

Setup

Preliminaries

It's recommended to use mamba to manage dependencies. mamba is a drop-in replacement for conda re-written in C++ to speed things up significantly (you can stick with conda though). To provide reproducible environments, we use conda-lock to generate lockfiles for each platform.

Installing mamba

On Unix-like platforms, run the snippet below. Otherwise, visit the mambaforge repo. Note this does not use the Anaconda installer, which reduces bloat.

curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
Installing conda-lock

The preferred method is to install conda-lock using pipx install conda-lock. For other options, visit the conda-lock repo. For basic usage, have a look at the commands below:

conda-lock install --name gpt5 conda-lock.yml # create environment with name gpt5 based on lockfile
conda-lock # create new lockfile based on environment.yml
conda-lock --update <package-name> # update specific packages in lockfile

Environment

Lockfiles are an easy way to exactly reproduce an environment.

After having installed mamba and conda-lock, you can create a mamba environment named gpt5 from a lockfile with all necessary dependencies installed like this:

conda-lock install --name gpt5 conda-lock.yml

You can then activate your environment with

mamba activate gpt5

To generate new lockfiles after updating the environment.yml file, simply run conda-lock -f environment.yml.

Setup on ppc64le

If you're not using a PowerPC machine, do not worry about this.

Whenever you create an environment for a different processor architecture, some packages (especially pytorch) need to be compiled specifically for that architecture. IBM PowerPC machines, for example, use a processor architecture called ppc64le. Setting up the environment on ppc64le is a bit tricky because the official channels do not provide packages compiled for ppc64le. However, we can use the Open-CE channel instead. A lockfile containing the relevant dependencies is already prepared in ppc64le.conda-lock.yml, and the environment can be installed with:

conda-lock install --name gpt5-ppc64le ppc64le.conda-lock.yml

Dependencies for ppc64le should go into the separate ppc64le.environment.yml file. Use the following command to generate a new lockfile after updating the dependencies:

conda-lock --file ppc64le.environment.yml --lockfile ppc64le.conda-lock.yml

Docker (recommended)

For fully reproducible environments and running on HPC clusters, we provide pre-built docker images at konstantinjdobler/nlp-research-template. We also provide a Dockerfile that allows you to build new docker images with updated dependencies:

# first update `environment.yml` with your dependencies
# then this command will create a new conda-lock.yml file
conda-lock -f environment.yml
# this automatically uses your latest conda-lock.yml to create a reproducible docker image
docker build --tag <username>/<imagename>:<tag> --platform="linux/amd64" .

The specified username should be your personal dockerhub username. This will make distribution and usage of your images easier with docker push/pull <your image>.

We also provide shell commands and a convenience script to run all your training commands inside docker (recommended).

Training

After all of this setup you are finally ready for some training. First of all, you need to create your data directory with a train.txt and dev.txt. Then you can start a training run in your environment with:

python train.py -n <run-name> -d /path/to/data --model roberta-base --offline

To see an overview of all options and their defaults, run python train.py --help or have a look inside args.py. The --offline flag in the command above disables Weights & Biases syncing. If you want to log your results, enable W&B as described in the Weights & Biases section below and omit the --offline flag.
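Before launching a run, it can help to confirm that the data directory contains both files. The following is a minimal sketch and not part of the template: missing_data_files is a hypothetical helper, and it assumes that train.txt and dev.txt are the only files the directory needs.

```python
from pathlib import Path

# Assumption: only these two files are required, as named in the text above.
REQUIRED_FILES = ("train.txt", "dev.txt")


def missing_data_files(data_dir):
    """Return the required file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means both files are present and the directory can be passed to train.py via -d.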

Using GPUs for hardware acceleration

By default, train.py tries to use a single CUDA GPU if available. If you want to train on multiple GPUs, increase the --num_devices flag (this then uses DistributedDataParallel under the hood). IMPORTANT: you should always select the GPUs that are visible to the script via the CUDA_VISIBLE_DEVICES environment variable (e.g. CUDA_VISIBLE_DEVICES=0,2 python train.py ...) or via the docker flags if training inside a container (recommended). To use different hardware accelerators, use the --accelerator flag. You can use advanced parallel training strategies with --distributed_strategy.
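To illustrate how CUDA_VISIBLE_DEVICES and --num_devices relate, here is a small sketch of a sanity check. It is not taken from train.py: visible_gpu_ids and check_num_devices are hypothetical helpers that only inspect the environment variable and do not query the GPUs themselves.

```python
import os


def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of device identifiers.

    Returns None when the variable is unset, i.e. no restriction is in place.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [part.strip() for part in raw.split(",") if part.strip()]


def check_num_devices(num_devices, env=None):
    """Raise ValueError if more devices are requested than are visible."""
    ids = visible_gpu_ids(env)
    if ids is not None and num_devices > len(ids):
        raise ValueError(
            f"requested {num_devices} devices, but only {len(ids)} are visible: {ids}"
        )
    return ids
```

With CUDA_VISIBLE_DEVICES=0,2, the check passes for --num_devices 2 and fails for --num_devices 3.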

Using Docker for training (recommended)

To conveniently run the training code inside a docker container, you can use the run-in-docker.sh script.

# execute the training inside your container
# -g 2 means only GPU 2 is visible to the script
# -g 0,2 would make the GPUs 0 and 2 visible
bash ./scripts/run-in-docker.sh -g 2 python train.py --num_devices 1 -n <run-name> -d /path/to/data/ --model roberta-base --offline

By default (no -g flag), no GPUs are available inside the container. You probably want to adjust the run-in-docker.sh script to add your own mounts for data and other things you want to load / save.

Docker + GPUs: You should always select specific GPUs to be visible inside the container. When using the run-in-docker.sh script, use the -g flag. When using docker natively, use e.g. --gpus='"device=0,7"' (for the GPUs 0 and 7) and adjust the --num_devices flag according to your number of selected GPUs. Yes, the weird format of --gpus='"device=0,7"' is important, otherwise the shell might not pass the flag correctly to nvidia-docker (official Nvidia recommendation).
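The nested quoting is easier to see when the command is built programmatically. The sketch below is only an illustration with hypothetical helper names (gpus_flag, docker_run_argv). It assumes that docker needs to receive the double quotes as part of the value, presumably so that the comma-separated device list is read as a single field.

```python
import shlex


def gpus_flag(device_ids):
    """Build the value for docker's --gpus option for specific devices.

    The double quotes are part of the value that docker receives.
    """
    return '"device=' + ",".join(str(i) for i in device_ids) + '"'


def docker_run_argv(image, command, device_ids):
    """Return an argument list suitable for subprocess.run (no shell involved)."""
    return ["docker", "run", "--gpus", gpus_flag(device_ids), image, *command]


argv = docker_run_argv(
    "konstantinjdobler/nlp-research-template:latest",
    ["python", "train.py", "--num_devices", "2"],
    [0, 7],
)
# shlex.join shows how the same command has to be quoted when typed into a shell.
print(shlex.join(argv))
```

The printed command contains --gpus '"device=0,7"': the outer single quotes are consumed by the shell, and the inner double quotes reach docker. This uses the space-separated form of the flag rather than the --gpus=... form shown above.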

Single-line docker command

You can start a script inside a docker container in a single command:

docker run -it --user $(id -u):$(id -g) --ipc host -v "$(pwd)":/workspace -w /workspace --gpus='"device=7"' konstantinjdobler/nlp-research-template:latest python train.py --num_devices=1 ...

Since we have not mounted any cache directories (only the current working directory with $(pwd)), nothing that is written to disk outside $(pwd) is persistent in this example. You can add those with -v or --mount.

Using Docker with SLURM / pyxis

For security reasons, docker might be disabled on your HPC cluster. You might be able to use the SLURM plugin pyxis instead like this:

srun ... --container-image konstantinjdobler/nlp-research-template:latest python train.py ...

This uses enroot under the hood to import your docker image and run your code inside the container. See the pyxis documentation for more options, such as --container-mounts or --container-writable.

Starting the container can take a long time. To speed this up, import the image once with enroot import docker://konstantinjdobler/nlp-research-template:latest -o prepared-image.sqsh and then point srun at the resulting file:

srun ... --container-image /path/to/prepared-image.sqsh python train.py ...

If you want to run an interactive session with bash, don't forget the --pty flag.

Weights & Biases

Weights & Biases allows you to easily log metrics, training results, checkpoints, and hyperparameters. To enable Weights & Biases, enter your WANDB_ENTITY and WANDB_PROJECT in train.py and omit the --offline flag for training.

Weights & Biases + Docker

When using docker we also have to get our WANDB_API_KEY inside the container. You can find your personal API key at wandb.ai/authorize. Set WANDB_API_KEY on your host machine and use the docker flag --env WANDB_API_KEY when starting your run. Or just use the run-in-docker.sh script, which will try to parse the WANDB_API_KEY from your ~/.netrc file (or get it from the environment).
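As a rough illustration of the lookup described above, the following sketch reads the key from the environment or from a netrc file using Python's standard netrc module. It is not the logic of run-in-docker.sh. The function name find_wandb_api_key, the lookup order (environment first), and the host entry api.wandb.ai are assumptions.

```python
import netrc
import os
from pathlib import Path

# Assumption: `wandb login` stores the key under this machine entry.
WANDB_HOST = "api.wandb.ai"


def find_wandb_api_key(netrc_path=None, env=None):
    """Return the W&B API key from the environment or a netrc file, else None."""
    env = os.environ if env is None else env
    key = env.get("WANDB_API_KEY")
    if key:
        return key
    path = Path(netrc_path) if netrc_path else Path.home() / ".netrc"
    try:
        auth = netrc.netrc(str(path)).authenticators(WANDB_HOST)
    except (OSError, netrc.NetrcParseError):
        # Missing, unreadable, or malformed netrc file.
        return None
    # authenticators() returns (login, account, password); the key is the password.
    return auth[2] if auth else None
```

If a key is found, it can be exported as WANDB_API_KEY on the host and forwarded with --env WANDB_API_KEY as described above.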

Configs

To save the exact configurations of experiments and save yourself some time typing out arguments in the command line, you can use .yml style config files supplied via the --config_path argument. You can also combine multiple configs. The order of precedence is default args < config args (multiple configs are resolved in order) < command line args.

python train.py --config_path ./cfgs/example.yml ./cfgs/llama-from-scratch.yml --num_devices 8 -n my-training-run ...
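The precedence rule can be sketched with plain dictionaries. This is a simplified illustration, not the template's actual argument parsing: resolve_args is a hypothetical function, the keys and values are made up, and it assumes flat key-value configs and that cli_args holds only the flags that were passed explicitly.

```python
def resolve_args(defaults, configs, cli_args):
    """Merge settings: defaults < configs (in the order given) < command line."""
    resolved = dict(defaults)
    for config in configs:
        # A later config overrides an earlier one.
        resolved.update(config)
    # Explicit command line arguments win over everything else.
    resolved.update(cli_args)
    return resolved


# Hypothetical values for illustration only.
defaults = {"model": "roberta-base", "num_devices": 1, "offline": False}
first_cfg = {"num_devices": 2, "offline": True}
second_cfg = {"model": "my-model", "num_devices": 4}
cli = {"num_devices": 8}

print(resolve_args(defaults, [first_cfg, second_cfg], cli))
# {'model': 'my-model', 'num_devices': 8, 'offline': True}
```

Here num_devices is set in all four places and the command line value wins, while offline is only set in the first config and therefore survives unchanged.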

Development

If you want to connect to a remote host machine with GPUs for development, we recommend the VS Code Remote-SSH extension.

Dev Containers (recommended)

Ideally, you should also do your development inside the same docker container to avoid mismatches between training and development. For this, use VS Code Dev Containers. They allow you to develop in VS Code inside a docker container with full support for IntelliSense, type hints and more. The template already contains a .devcontainer directory, where all the settings for it are stored, so you can start right away!

VS Code Dev Container example

After having installed the Remote-SSH and Dev Containers extensions, set up your Dev Container in the following way:

  1. Establish the SSH connection with the host by opening your VS Code command palette and typing Remote-SSH: Connect to Host. Now you can connect to your host machine.
  2. Open the folder that contains this template on the host machine.
  3. VS Code will automatically detect the .devcontainer directory and ask you to reopen the folder in a Dev Container. Alternatively, use the command palette and type Dev Containers.
  4. Press Reopen in Container and wait for VS Code to set everything up. The first time, or whenever you change devcontainer.json, you will need to use Rebuild and Reopen in Container instead.

There is a bit of setup: for a proper dev environment, you will need to configure mounts (cache directories, your datasets, ...) and environment variables, just as for a regular docker run command. Have a look inside .devcontainer/devcontainer.json.

conda-lock is automatically installed for you but you have to add the --micromamba flag inside the Dev Container (e.g. conda-lock --micromamba -f environment.yml). Otherwise, conda-lock uses an anaconda installation, which takes over 8 hours to resolve the packages in the environments.

We automatically mount the ~/.gitconfig and ~/.netrc files for ease of use of Git and W&B. However, these files have to exist on your host machine. They are created when executing git config --global user.email <your-email> and wandb login, respectively.

If you want to use GPUs for development, you also need to specify the GPU you want to use in .devcontainer/devcontainer.json. However, this is a bit cumbersome if you often switch between GPUs. Alternatively, you can edit your code in the Dev Container (without a GPU) but start all actual development runs of your script like you would for training, with run-in-docker.sh, and select the GPU ad hoc. The advantage of Dev Containers is that you are still using the exact same docker container for both.

mamba and conda-lock

Sometimes it's just quicker or unavoidable to create an environment via conda-lock install --name gpt5 conda-lock.yml instead of using Docker. In most cases, this is fine since we are using lockfiles but there might be some tricky edge cases depending on the platform and OS. Just be careful to keep any local environments and your docker containers in sync. Docker containers also allow more advanced support for compiled CUDA kernels such as FlashAttention.

Code style

We use the ruff linter and black formatter. You should install their VS Code extensions and enable "Format on Save" inside VS Code.

Continuous Integration and Deployment

Our project uses GitHub Actions for CI/CD to automate the building and pushing of our Docker images to Docker Hub. This ensures that our Docker images are always up-to-date with the latest dependencies specified in conda-lock.yml.

Prerequisites for CI/CD

To work with this CI/CD setup, you need to:

  • Set the following secrets in your GitHub repository:
    • DOCKER_REGISTRY: The Docker registry URL (if using Docker Hub, this is not needed).
    • DOCKER_REGISTRY_TOKEN: Your Docker Hub access token or password.
  • Replace konstantinjdobler with your own Docker ID and nlp-research-template with your own image name in the workflow file .github/workflows/docker.yml

If you do not want to automatically build and push images, just delete the workflow file.

How to Update Docker Images

To update the Docker image:

  1. Make necessary changes to the Dockerfile or update dependencies in the environment.yml.
  2. Generate a new conda-lock.yml by running conda-lock -f environment.yml.
  3. Commit and push the changes to the main branch.
  4. The GitHub Actions workflow will automatically build and push the new Docker image to Docker Hub.

Docker Tags

The Docker images are tagged with the PyTorch and CUDA versions extracted from conda-lock.yml, as well as a latest tag for the most recent build. Use a specific tag if you need a particular version of PyTorch or CUDA, and the latest tag otherwise.