Cluster backend #108
We're exploring using DIGITS as a frontend for an EC2 cluster. Presumably we'll design it to work with SGE (now OGE?) or whatever else you want to use.
I use SGE on EC2 via StarCluster now. There are tools like http://drmaa-python.readthedocs.org/en/latest/ which abstract away which cluster engine is used. SGE is aka OGE, but also aka "Son of Grid Engine" 😄
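For the SGE/OGE route, submitting a training job ultimately reduces to a `qsub` call (which DRMAA then abstracts). Here is a minimal sketch of assembling that command; the helper name, the `gpu.q` queue, and the `gpu` consumable resource are illustrative assumptions, not anything DIGITS or SGE ships by default.

```python
def build_qsub_command(script, job_name, queue="gpu.q", gpus=1):
    """Assemble an SGE/OGE qsub invocation for a training job.

    `build_qsub_command`, the `gpu.q` queue, and the `gpu` resource
    name are hypothetical; a real cluster's queue and complex names
    depend on how the grid engine was configured.
    """
    return [
        "qsub",
        "-N", job_name,          # job name shown in qstat
        "-q", queue,             # target queue
        "-l", "gpu=%d" % gpus,   # consumable GPU resource request
        "-terse",                # print only the job id on submission
        script,
    ]
```

With DRMAA, the same fields would be set on a job template and submitted through a session instead of shelling out, which is what makes the engine swappable.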
Use DIGITS to manage inter-node training. For example, this could be used to train a single model across multiple GPU instances on EC2.
I have a dataset of 40 million pics and I want to speed up training. If I use multiple EC2 GPU servers, will it work and make maximum use of each EC2 GPU? Any limitations? Can I use 10 EC2 instances?
Caffe is currently multi-GPU but not multi-node. Until DIGITS supports a multi-node framework, having multiple instances will not improve the training time of any one training job. DIGITS is also single-node for now, so you would have to run one DIGITS instance on each node.
I have just realised I can create a GPU cluster and combine multiple nodes into one powerful machine using StarCluster: http://hpc.nomad-labs.com/archives/139 And then run my Caffe on multiple GPUs as if I'm using them all on the same server.
That isn't going to do what you want. Even if you got that dusty code to work, the interconnect between the nodes is nowhere near PCIe speeds, so your scaling performance will be quite poor with the current multi-GPU methodology. To scale across nodes, the algorithms would need to change to an asynchronous method (EASGD, for example), which has different numerical behavior and convergence properties.
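To illustrate the asynchronous idea mentioned above, here is a toy sketch of Elastic Averaging SGD on a one-dimensional quadratic loss. Each worker keeps its own copy of the parameter and is only loosely coupled to a shared center variable, which is what lets the scheme tolerate slow interconnects. The loss, step sizes, and worker count are made up for illustration.

```python
def easgd_demo(num_workers=4, steps=200, eta=0.05, alpha=0.1):
    """EASGD sketch on the toy loss f(x) = (x - 3)^2.

    All constants here are illustrative; real EASGD runs workers on
    separate nodes and exchanges parameters over the network.
    """
    grad = lambda x: 2.0 * (x - 3.0)     # gradient of the toy loss
    workers = [0.0] * num_workers        # local parameter copies
    center = 0.0                         # shared center variable
    for _ in range(steps):
        for i in range(num_workers):
            diff = workers[i] - center
            # Local SGD step plus an elastic pull toward the center.
            workers[i] -= eta * grad(workers[i]) + alpha * diff
            # Center drifts toward the workers' consensus.
            center += alpha * diff
        # On a cluster, each worker/center exchange would be an
        # asynchronous network round-trip, not a loop iteration.
    return center
```

Because workers only communicate through the elastic term, stale or infrequent exchanges degrade convergence gracefully instead of stalling every node on a synchronous all-reduce. That is the trade-off referred to above: different numerics, but far less sensitivity to interconnect speed.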
OK, sounds like you have experience in this field. Tell me: if I will not be able to run multi-GPU, can I just train a batch on each server and afterwards merge into one trained model?
Also, we can use internal AWS IPs, so the data transfer between servers is very, very speedy.
PCIe Gen 3 is 12 GB/s. At best on AWS you get 10GbE interconnects on the general nodes, so 1.2 GB/s on a good day. We are communication-bound on PCIe Gen 3, so scaling is not going to be that great. Again, to go to multiple nodes you need to move away from synchronous stochastic gradient descent and to a different solver, likely an asynchronous technique.
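To put numbers on that gap, here is a back-of-the-envelope sketch using the link speeds quoted above and a hypothetical 60M-parameter model (~240 MB of float32 gradients); the model size is an assumption for illustration, not from the thread.

```python
def transfer_time_s(num_params, link_gb_per_s, bytes_per_param=4):
    """Seconds to ship one full float32 gradient buffer over a link."""
    return num_params * bytes_per_param / (link_gb_per_s * 1e9)

# Assumed model size: 60M parameters (~240 MB of gradients per exchange).
pcie_s = transfer_time_s(60e6, 12.0)   # PCIe Gen 3: ~12 GB/s
eth_s = transfer_time_s(60e6, 1.2)     # 10GbE: ~1.2 GB/s on a good day
```

So every synchronous gradient exchange that takes ~20 ms over PCIe takes ~200 ms over 10GbE, a 10x penalty paid on every iteration, which is why synchronous SGD across nodes scales so poorly.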
Hi @thatguymike, thanks very much for your time and help if you can enlighten me, and apologies for the wall of text. I have a database of ~10 million wine labels that I would like to train an image recognition system on, and moreover retrain on a regular basis as more people take pictures.
Let's step back a bit. 10 million wine labels isn't "that big". There are a few ways to attack the problem. I would start with a strong pretrained network, like GoogLeNet or VGG trained on ImageNet, and then finetune for your categories (e.g. slice the bottom layer and redefine the number of outputs). It should in theory converge quite quickly. Then as you add images you can drop out parts of your older original data, or just continue to grow. I assume you will need to add more categories, which will generally mean re-finetuning things. As for 1: As for 2:
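The "slice the bottom layer and redefine the number of outputs" step can be sketched as text surgery on the network's prototxt: rename the final classifier layer (so Caffe re-initializes it rather than copying pretrained weights when run with `--weights`) and set the new class count. The helper below is a naive string patch, not a robust prototxt parser, and the layer names are illustrative.

```python
def retarget_classifier(prototxt, old_name, new_name, num_classes):
    """Rename the final classifier layer and change its output count.

    Renaming matters because `caffe train --weights=...` copies
    pretrained weights only into layers whose names match; a renamed
    layer gets fresh initialization. Purely textual patching -- a
    sketch under the assumption the prototxt uses these exact strings.
    """
    patched = prototxt.replace('name: "%s"' % old_name,
                               'name: "%s"' % new_name)
    patched = patched.replace("num_output: 1000",
                              "num_output: %d" % num_classes)
    return patched
```

In practice you would also lower the learning rate multipliers on the retained layers so fine-tuning nudges rather than destroys the pretrained features.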
Thanks for the prompt response! I clearly need to read more about this. The issue I have is that we get thousands of new wine label brands a day, so I can't possibly know all my categories ahead of time. I can do 100k photos in 8.5h at my current rate (using g2.8xlarge, GoogLeNet, 256x256, color images), so I'd be spending 35 days building out a massive model with my entire dataset, which will end up being a month stale and not include the new hotness in the wine industry. So either I need to come up with a way to run through my entire corpus 100x faster (diminish accuracy through downsampling, prioritize the most relevant subset of data, or beef up my hardware), or I need a different kind of solver that allows me to incrementally add new categories and add new photos to existing ones. I like your idea about splitting my corpus out into a few different networks, but yeah, it will compromise accuracy, because one network will think it's 50% sure of a match and the other will think it's 80% sure, and the 50% one will end up being correct because the 80% one just doesn't know enough. OK, the search continues. Thanks for your help! Happy Labor Day weekend, and I'll keep an eye out for exciting developments here and maybe pop over to the BVLC folks and see if they have any ideas.
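The 35-day figure quoted above is straight proportional scaling from the measured run, which a few lines confirm:

```python
# Back-of-the-envelope check of the ~35-day estimate quoted above.
photos_per_run = 100000        # images processed in one run
hours_per_run = 8.5            # measured on a g2.8xlarge
corpus_size = 10000000         # ~10 million wine-label photos

total_hours = corpus_size / float(photos_per_run) * hours_per_run
total_days = total_hours / 24.0
```

That works out to 850 hours, or a bit over 35 days, assuming throughput stays constant as the dataset grows.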
@thatguymike See some Caffe experiments at http://arxiv.org/abs/1506.08272
@lukeyeager I wonder if DIGITS has advanced to manage task execution on SGE clusters since this issue was created? I would like to use DIGITS with Intel-flavoured Caffe on the Colfax Cluster.
Support for using SGE for job execution.