Is this a GPU-cluster version of Caffe? #9

Open
lyblsgo opened this issue Nov 23, 2015 · 18 comments

Comments

@lyblsgo

lyblsgo commented Nov 23, 2015

Can this version of Caffe run on a GPU cluster? If so, are there any requirements for the cluster? Thanks.

@yjxiong
Owner

yjxiong commented Nov 23, 2015

Yes.
You only need to install OpenMPI > 1.7.4.
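
A quick way to check an existing installation against this requirement is sketched below. It assumes that `mpirun --version` reports a banner of the form `mpirun (Open MPI) X.Y.Z`; the function names are hypothetical helpers, not part of Caffe or OpenMPI.

```python
import re
import subprocess

# The thread asks for a version strictly greater than 1.7.4.
MIN_EXCLUSIVE = (1, 7, 4)


def parse_openmpi_version(banner):
    """Extract (major, minor, patch) from text like 'mpirun (Open MPI) 1.10.2'.

    The banner format is an assumption about what `mpirun --version` prints.
    """
    match = re.search(r"Open MPI\)?\s+v?(\d+)\.(\d+)(?:\.(\d+))?", banner)
    if match is None:
        raise ValueError("no Open MPI version found in %r" % banner)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_new_enough(banner):
    """True if the reported version is strictly newer than 1.7.4."""
    return parse_openmpi_version(banner) > MIN_EXCLUSIVE


def installed_banner():
    """Run `mpirun --version`; read both streams in case the banner goes to stderr."""
    result = subprocess.run(["mpirun", "--version"],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    return result.stdout + result.stderr
```

On a machine with OpenMPI on the PATH, `is_new_enough(installed_banner())` would give the answer; on other MPI implementations the parser raises `ValueError` rather than guessing.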

@lyblsgo
Author

lyblsgo commented Nov 23, 2015

Thanks

@zimenglan-sysu-512

Can you provide installation instructions for OpenMPI?

@KeyKy

KeyKy commented Aug 13, 2016

Can you give a Python code snippet showing how to set multiple GPU devices? I find that caffe::SetDevice accepts only a single integer.

@yjxiong
Owner

yjxiong commented Aug 13, 2016

Multi-GPU configuration is done through the command line. The Python interface cannot
launch multi-GPU training.


@yjxiong
Owner

yjxiong commented Aug 15, 2016

Hi @zimenglan-sysu-512,

To install OpenMPI, please see https://www.open-mpi.org/faq/?category=building#easy-build

In the configure step, please add the following options

--with-cuda --enable-mpi-thread-multiple

for optimal performance.

@quietsmile

Excellent work!
I can now run mpirun perfectly on a single machine.
Suppose I have two machines, each with 4 GPUs, and I want to train a model on machine 1's GPU 0 and GPU 1 and machine 2's GPU 1 and GPU 2. What should I do?
Thanks.

@sunnyxiaohu

I'd like to ask: in terms of implementation and efficiency, how does this data-parallel approach differ from Caffe master? From http://caffe.berkeleyvision.org/tutorial/interfaces.html: "Parallelism: the -gpu flag to the caffe tool can take a comma separated list of IDs to run on multiple GPUs. A solver and net will be instantiated for each GPU so the batch size is effectively multiplied by the number of GPUs. To reproduce single GPU training, reduce the batch size in the network definition accordingly."

@yjxiong
Owner

yjxiong commented May 9, 2017

@sunnyxiaohu
Both are data parallelism; they are just implemented in different ways.

@pkuCactus

@yjxiong I ran into the error "unrecognized options: --enable-mpi-thread-multiple". How can I solve it?

@yokattame

@pkuCactus
The option was removed starting with the v3.0 series, which means that OpenMPI v3.x always enables MPI_THREAD_MULTIPLE support.
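
Putting the two configure-related comments together, the choice of options could be sketched as follows. This is only an illustration of the rule described in this thread; `openmpi_configure_flags` is a hypothetical helper, and the version cut-off is taken from the comment above rather than verified against every OpenMPI release.

```python
def openmpi_configure_flags(version):
    """Return ./configure options for an OpenMPI (major, minor, patch) version."""
    # CUDA support is requested in the thread regardless of version.
    flags = ["--with-cuda"]
    # Per the thread, the 3.x series no longer accepts this option because
    # MPI_THREAD_MULTIPLE support is always enabled there.
    if version < (3, 0, 0):
        flags.append("--enable-mpi-thread-multiple")
    return flags


def configure_command(version, prefix=None):
    """Assemble the argv for ./configure; `prefix` is an optional install location."""
    cmd = ["./configure"] + openmpi_configure_flags(version)
    if prefix is not None:
        cmd.append("--prefix=" + prefix)
    return cmd
```

For a 1.10.x source tree this yields both options; for a 3.x tree only `--with-cuda` remains, which avoids the "unrecognized options" warning.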

@zzy123abc

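I'd like to ask: if the cluster has Intel MPI, can this still be used? I want to switch to OpenMPI but don't know exactly how, because cmake always auto-detects Intel MPI, and what I entered in ccmake may be wrong.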

@yjxiong
Owner

yjxiong commented Dec 21, 2017

@zzy123abc Intel MPI is not tested. You can manually modify the cache variables in CMake (search for MPI in ccmake) to point to OpenMPI.
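
As an alternative to editing the cache in ccmake, the same variables can be passed on the cmake command line. The sketch below assumes CMake's standard FindMPI hints `MPI_C_COMPILER` and `MPI_CXX_COMPILER`, assumes this fork enables MPI with a `USE_MPI` option, and uses a hypothetical install prefix `/opt/openmpi`; adjust all three to the actual setup.

```python
import posixpath


def cmake_args_for_openmpi(openmpi_prefix, source_dir=".."):
    """Build a cmake argv that points MPI detection at OpenMPI's compiler wrappers."""
    bin_dir = posixpath.join(openmpi_prefix, "bin")
    return [
        "cmake", source_dir,
        # Assumption: the build option that turns on MPI support in this fork.
        "-DUSE_MPI=ON",
        # FindMPI hints: the wrappers determine which MPI library is used.
        "-DMPI_C_COMPILER=" + posixpath.join(bin_dir, "mpicc"),
        "-DMPI_CXX_COMPILER=" + posixpath.join(bin_dir, "mpicxx"),
    ]


# Hypothetical prefix, for illustration only.
print(" ".join(cmake_args_for_openmpi("/opt/openmpi")))
```

Because CMake caches earlier detection results, it is probably safest to run this in a fresh build directory (or after deleting `CMakeCache.txt`) so that the previously detected Intel MPI paths are not reused.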

@zzy123abc

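Thanks. Did you test single-node multi-GPU or multi-node multi-GPU? Someone asked about this above as well: how can GPU 0 and GPU 1 on gpu02 work together with GPU 0 and GPU 1 on gpu03? Writing 0,1,2,3 in the solver settings does not seem to work; would changing it to 0,1,0,1 work?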

@yjxiong
Owner

yjxiong commented Dec 25, 2017

@zzy123abc

Yes, just as you said: [0, 1, 0, 1].
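
The answer suggests that each MPI process picks the entry of the device list at its own position, so the list holds local GPU IDs in process order. A rough sketch of planning such a run is given below. It rests on two assumptions that this thread does not confirm: that mpirun fills the hosts slot by slot in hostfile order, and that the solver field is written as `device_id: [...]`. `plan_multi_node` is a hypothetical helper.

```python
def plan_multi_node(nodes):
    """Plan a multi-node run.

    nodes: ordered list of (hostname, [local GPU ids to use on that host]).
    Returns (hostfile text, solver line, total process count).
    """
    hostfile_lines = []
    device_ids = []
    for host, gpus in nodes:
        # One MPI slot per GPU used on this host (OpenMPI hostfile syntax).
        hostfile_lines.append("%s slots=%d" % (host, len(gpus)))
        # Assumption: process i uses entry i, so local IDs are appended in host order.
        device_ids.extend(gpus)
    hostfile = "\n".join(hostfile_lines) + "\n"
    solver_line = "device_id: [%s]" % ", ".join(str(g) for g in device_ids)
    return hostfile, solver_line, len(device_ids)


# The case asked about above: GPUs 0 and 1 on gpu02 plus GPUs 0 and 1 on gpu03.
hostfile, solver_line, num_procs = plan_multi_node([("gpu02", [0, 1]), ("gpu03", [0, 1])])
print(hostfile + solver_line)
```

The hostfile would then be passed to mpirun with `--hostfile`, with `-np` set to the returned process count. Under the same assumptions, the earlier question (GPUs 0 and 1 on machine 1, GPUs 1 and 2 on machine 2) would give the list 0, 1, 1, 2.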

@melody-rain

Hi, @yjxiong

I ran into a problem: MPI mode is disabled.

I1228 09:34:57.642899 25965 common.cpp:59] Caffe::MPI_all_rank() = 1
I1228 09:34:57.643031 25965 common.cpp:65] You are running caffe compiled with MPI support. Now it's running in non-parallel model

I have one PC with multiple GPUs. As you can see in the log above, the program runs in non-parallel mode. The cause seems to be that Caffe::MPI_all_rank() = 1. Could you give any hints as to why this happens? Thanks.

@melody-rain

melody-rain commented Dec 28, 2017

I fixed the problem.
I should have run the caffe command with mpirun -np.
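
Since the Python interface cannot launch multi-GPU training itself, one workaround is to assemble the mpirun command line in Python and hand it to `subprocess`. The sketch below only builds the argument list; the solver file name is a placeholder, and `mpirun_caffe_command` is a hypothetical helper.

```python
def mpirun_caffe_command(num_procs, solver, caffe_bin="caffe", weights=None):
    """Build the argv for launching `caffe train` under mpirun, one process per GPU."""
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")
    # Running the caffe binary directly starts a single process, which
    # matches the "non-parallel" log message quoted above.
    cmd = ["mpirun", "-np", str(num_procs)]
    cmd += [caffe_bin, "train", "--solver=" + solver]
    if weights is not None:
        cmd.append("--weights=" + weights)
    return cmd


# Placeholder solver file name, for illustration only.
print(" ".join(mpirun_caffe_command(4, "solver.prototxt")))
```

The resulting list can be passed to `subprocess.call` on a machine where mpirun and the MPI-enabled caffe binary are installed.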

@Erdos001

Can this synchronized batch norm be used in a single-machine, multi-GPU setup?
