This repository contains a simple example of tree attention in JAX, along with a simple JAX implementation of ring attention for comparison. For a full implementation of tree attention in PyTorch, see https://github.com/lucidrains/ring-attention-pytorch.
Self-attention is the core mathematical operation of modern transformer architectures and is also a significant computational bottleneck due to its quadratic complexity in the sequence length. In this work, we derive the scalar energy function whose gradient computes the self-attention block, thus elucidating the theoretical underpinnings of self-attention, providing a Bayesian interpretation of the operation, and linking it closely with energy-based models such as Hopfield networks. Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction. Our algorithm for parallelizing attention computation across multiple GPUs enables cross-device decoding to be performed asymptotically faster (up to 8× faster in our experiments) than alternative approaches such as Ring Attention, while also requiring significantly less communication volume and incurring about 2× lower peak memory.
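The key observation is that the softmax reduction over the sequence axis can be expressed with an associative combine over per-chunk partial results, so any bracketing of the reduction, including a log-depth tree, gives the same answer. The sketch below (illustrative only; `chunk_summary` and `combine` are not names from this repository) carries a (max, numerator, denominator) triple per key/value chunk and merges the chunks pairwise, then checks the result against a reference softmax-attention computation.

```python
# Minimal single-device sketch of the associative reduction behind tree attention.
# Each chunk contributes (running max, numerator, denominator); the combine below
# is associative, so chunks can be merged in a log-depth tree.
import jax
import jax.numpy as jnp

def chunk_summary(q, k_chunk, v_chunk):
    """Partial attention results for one query vector over one key/value chunk."""
    s = k_chunk @ q                        # (chunk_len,) attention scores
    m = jnp.max(s)                         # local max for numerical stability
    p = jnp.exp(s - m)                     # unnormalised weights
    return m, p @ v_chunk, jnp.sum(p)      # (max, numerator, denominator)

def combine(a, b):
    """Associative merge of two (max, numerator, denominator) triples."""
    m_a, n_a, d_a = a
    m_b, n_b, d_b = b
    m = jnp.maximum(m_a, m_b)
    return (m,
            jnp.exp(m_a - m) * n_a + jnp.exp(m_b - m) * n_b,
            jnp.exp(m_a - m) * d_a + jnp.exp(m_b - m) * d_b)

q = jax.random.normal(jax.random.PRNGKey(0), (64,))
k = jax.random.normal(jax.random.PRNGKey(1), (1024, 64))
v = jax.random.normal(jax.random.PRNGKey(2), (1024, 64))

# Pairwise (tree-shaped) reduction over 8 chunks; across devices the same
# combine becomes a collective tree reduction.
parts = [chunk_summary(q, kc, vc)
         for kc, vc in zip(jnp.split(k, 8), jnp.split(v, 8))]
while len(parts) > 1:
    parts = [combine(parts[i], parts[i + 1]) for i in range(0, len(parts), 2)]
m, num, den = parts[0]

ref = jax.nn.softmax(k @ q) @ v            # unsharded reference (no 1/sqrt(d) scaling)
print(jnp.allclose(num / den, ref, atol=1e-4))
```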
*Figure: animations comparing the communication patterns of Ring Attention and Tree Attention (legend in the original figure).*
- Python > 3.7
- JAX 0.4.31
```bash
# Create a virtual environment, then install the dependencies
conda install cuda -c nvidia
pip install -U "jax[cuda12]"
pip install flash-attn-jax
```
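After installation, a quick check (not part of the original instructions) that JAX picked up the CUDA backend can save debugging time later:

```python
# Sanity check (assumed workflow, not from the original README):
# confirm JAX was installed with GPU support before running the experiments.
import jax
print(jax.default_backend())  # expected: "gpu"
print(jax.devices())          # should list the available CUDA devices
```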
To run a single-layer tree attention experiment, launch the command below. Sequence length, head dimension, and number of heads can be configured in the file.

```bash
python tree_shard_test.py
```
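For intuition about what a sharded run computes, here is a minimal sketch (not the repository's code; `sharded_decode_attention` and the mesh/spec names are illustrative) of decoding attention for a single query with keys and values sharded along the sequence axis. The global max and the numerator/denominator sums are obtained with `pmax`/`psum` collectives, which the runtime executes as tree reductions across devices.

```python
# Illustrative sketch of sharded decoding: keys/values are split over devices
# along the sequence axis; only a max and two sums are communicated per query.
import jax
import jax.numpy as jnp
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P

mesh = Mesh(jax.devices(), axis_names=("seq",))

def sharded_decode_attention(q, k_shard, v_shard):
    s = k_shard @ q                          # local attention scores (no 1/sqrt(d) scaling)
    m = jax.lax.pmax(jnp.max(s), "seq")      # global max for numerical stability
    p = jnp.exp(s - m)                       # local unnormalised weights
    num = jax.lax.psum(p @ v_shard, "seq")   # global numerator
    den = jax.lax.psum(jnp.sum(p), "seq")    # global softmax normaliser
    return num / den

attn = shard_map(sharded_decode_attention, mesh,
                 in_specs=(P(), P("seq"), P("seq")),
                 out_specs=P())

q = jax.random.normal(jax.random.PRNGKey(0), (64,))
k = jax.random.normal(jax.random.PRNGKey(1), (1024, 64))
v = jax.random.normal(jax.random.PRNGKey(2), (1024, 64))
print(jnp.allclose(attn(q, k, v), jax.nn.softmax(k @ q) @ v, atol=1e-4))
```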