Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to run inference of a (very) large model across mulitple GPUs ? #2007

Open
jorgeantonio21 opened this issue Apr 4, 2024 · 4 comments
Open

Comments

@jorgeantonio21
Copy link
Contributor

It is mentioned on README that candle supports multi GPU inference, using NCCL under the hood. How can this be implemented ? I wonder if there is any available example to look at..

Also, I know PyTorch has things like DDP and FSDP, is candle support for multi GPU inference comparable to these techniques ?

@EricLBuehler
Copy link
Member

Please see the llama multiprocess example. The multi-GPU inference is used to create parellelized linear layers:

fn load(vb: VarBuilder, cache: &Cache, cfg: &Config, comm: Rc<Comm>) -> Result<Self> {
let qkv_proj = TensorParallelColumnLinear::load_multi(
vb.clone(),
&["q_proj", "k_proj", "v_proj"],
comm.clone(),
)?;
let o_proj = TensorParallelRowLinear::load(vb.pp("o_proj"), comm.clone())?;
Ok(Self {
qkv_proj,
o_proj,
num_attention_heads: cfg.num_attention_heads / comm.world_size(),
num_key_value_heads: cfg.num_key_value_heads / comm.world_size(),
head_dim: cfg.hidden_size / cfg.num_attention_heads,
cache: cache.clone(),
})
}

@b0xtch
Copy link
Contributor

b0xtch commented Apr 12, 2024

Please see the llama multiprocess example. The multi-GPU inference is used to create parellelized linear layers:

fn load(vb: VarBuilder, cache: &Cache, cfg: &Config, comm: Rc<Comm>) -> Result<Self> {
let qkv_proj = TensorParallelColumnLinear::load_multi(
vb.clone(),
&["q_proj", "k_proj", "v_proj"],
comm.clone(),
)?;
let o_proj = TensorParallelRowLinear::load(vb.pp("o_proj"), comm.clone())?;
Ok(Self {
qkv_proj,
o_proj,
num_attention_heads: cfg.num_attention_heads / comm.world_size(),
num_key_value_heads: cfg.num_key_value_heads / comm.world_size(),
head_dim: cfg.hidden_size / cfg.num_attention_heads,
cache: cache.clone(),
})
}

That example is for a single node. How about multiple nodes? Can we just run the example with mpirun -n 2 --hostfile ../../hostfile target/release/llama_multiprocess 2 2000

Update:

I guess I must modify the code to support the world rank for MPI. I think sticking to NCCL as a backend might be better, but then is there support in Cudarc for cross-node communication?

Found this library https://github.com/oddity-ai/async-cuda

@b0xtch
Copy link
Contributor

b0xtch commented Jun 28, 2024

I started a draft here for the splitting a model across multiple GPUs on different nodes. There is a mapping feature as I linked above on mistral.rs repo

@deema-A
Copy link

deema-A commented Aug 12, 2024

I am having the same question.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants