Cluster backend #108

Open
cancan101 opened this issue May 14, 2015 · 16 comments
@cancan101

Support for using SGE for job execution.

@lukeyeager (Member)

We're exploring using DIGITS as a frontend for an EC2 cluster. Presumably we'll design it to work with SGE (now OGE?) or whatever else you want to use.

@cancan101 (Author)

I use SGE on EC2 via StarCluster now. There are tools like http://drmaa-python.readthedocs.org/en/latest/ which abstract away which cluster engine is used.

SGE is aka OGE but also aka "Son of Grid Engine" 😄
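
For example, here is a minimal sketch of submitting a job through drmaa-python (the command and arguments are placeholders, not anything DIGITS-specific):

```python
# Sketch: submit a job through drmaa-python, which abstracts over SGE/OGE
# and other grid engines. The script path and args are placeholders.
import drmaa

with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = '/path/to/train.sh'   # placeholder training script
    jt.args = ['--epochs', '30']
    job_id = session.runJob(jt)
    info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print('job %s exited with status %s' % (job_id, info.exitStatus))
    session.deleteJobTemplate(jt)
```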

@pmkulkarni

Use DIGITS to manage inter-node training. For example, this could be used to train a single model across multiple GPU instances on EC2.

@loneae commented Sep 1, 2015

I have a dataset of 40 million pictures and I want to speed up training. If I use multiple EC2 GPU servers, will it work and make maximum use of each EC2 GPU? Are there any limitations? Can I use 10 EC2 instances?

@lukeyeager (Member)

Caffe is currently multi-GPU but not multi-node. Until DIGITS supports a multi-node framework, having multiple instances will not improve the training time of any one training job.

DIGITS is also single-node for now. So you would have to run one DIGITS instance on each node.
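
For reference, a minimal sketch of launching a single-node, multi-GPU Caffe run from Python (the solver path is a placeholder; the --gpu flag belongs to the stock caffe binary, not to DIGITS):

```python
# Sketch: launch a single-node, multi-GPU Caffe training job.
# --gpu takes a comma-separated device list; the solver path is a placeholder.
import subprocess

subprocess.check_call([
    'caffe', 'train',
    '--solver=solver.prototxt',
    '--gpu=0,1,2,3',   # data-parallel across the GPUs of this one node
])
```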

@loneae commented Sep 1, 2015

I have just realised I can create a GPU cluster and combine multiple nodes into one powerful machine using StarCluster: http://hpc.nomad-labs.com/archives/139

@loneae commented Sep 1, 2015

And then run Caffe on multiple GPUs as if they were all in the same server.

@thatguymike

That isn't going to do what you want. Even if you got that dusty code to work, the interconnect between the nodes is nowhere near PCIe speeds, so your scaling performance will be quite poor with the current multi-GPU methodology. To scale across nodes, the algorithms would need to change to an asynchronous method (EASGD, for example), which has different numerical behavior and convergence properties.

@loneae commented Sep 1, 2015

OK, it sounds like you have experience in this field. If I won't be able to run multi-GPU across nodes, can I just train a batch on each server and afterwards merge the results into one trained model?

@loneae commented Sep 1, 2015

Also, we can use internal AWS IPs, so the data transfer between servers is very fast.

@thatguymike

PCIe Gen 3 is 12GB/s. At best on AWS you get 10GbE interconnects on the general nodes, so 1.2GB/s on a good day. We are communication bound on PCIe Gen3, so scaling is not going to be that great. Again, to go to multiple nodes you need to move away from synchronous stochastic gradient descent and to a different solver, likely an asynchronous technique.
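
For a rough sense of the gap (theoretical peak numbers, ignoring protocol overhead):

```python
# Rough bandwidth comparison: PCIe Gen3 x16 vs. a 10GbE node interconnect.
pcie_gen3 = 12.0          # GB/s, effective PCIe Gen3 x16
ten_gbe   = 10.0 / 8.0    # 10 Gb/s link = 1.25 GB/s at best
print('interconnect is ~%.0fx slower' % (pcie_gen3 / ten_gbe))  # ~10x
```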

@c4ssio commented Sep 4, 2015

Hi @thatguymike, thanks very much for your time and help if you can enlighten me, and apologies for the wall of text. I have a database of ~10 million wine labels that I would like to use to train an image recognition system, and moreover to retrain on a regular basis as more people take pictures.

  1. Is it possible to train 1000 categories of 100 images each in DIGITS, then use the last epoch's model in a subsequent classification model with an entirely different set of 100k images to achieve a model with 2k categories?
     -- If yes, can I keep adding images to the same categories as more come in, so the model gets fine-tuned?
  2. (Similar to @loneae's question above) Is it possible to train multiple sets of images on distinct EC2 g2.8xlarge instances, then use a script to combine their results into a composite model including all the categories and results? If I understand correctly, you said above "not with the SGD solver." Is there a solver you would recommend for this?

@thatguymike

Let's step back a bit. 10 million wine labels isn't "that big". There are a few ways to attack the problem. I would start with a strong pretrained network, like GoogLeNet or VGG trained on ImageNet, and then finetune for your categories (e.g. slice off the final layer and redefine the number of outputs). It should in theory converge quite quickly. Then, as you add images, you can drop out parts of your older original data or just continue to grow. I assume you will need to add more categories, which will generally mean re-finetuning things.
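
A minimal pycaffe sketch of that recipe, assuming a solver whose net prototxt renames the final classification layer and sets num_output to your category count (file names are placeholders):

```python
# Sketch of Caffe fine-tuning: copy_from() matches weights by layer name, so
# a renamed final layer stays randomly initialized while the rest of the
# network starts from the pretrained ImageNet weights.
import caffe

caffe.set_mode_gpu()
solver = caffe.get_solver('solver.prototxt')        # placeholder solver file
solver.net.copy_from('bvlc_googlenet.caffemodel')   # pretrained weights
solver.solve()
```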

As for 1:
That isn't going to do what you want. You might train the upper layers in the first round to build a good feature descriptor chain (like what I recommend above), but you just can't add more categories. When you finetune you need enough representatives in all of your categories, so your dataset is going to end up growing over time. You will want O(1k) examples per class to be robust, but it depends on your dataset and accuracy requirements.

As for 2:
Yes and no. Assuming everyone starts with the same seed and a roughly equal representation of the dataset, you can try to train independently and then average the weights. I don't think that will end up being stable. Better might be to train multiple models for different subsets of categories and then run them as an ensemble, e.g. train 5 different networks, each with 1/5th of the categories. Then, when doing inference, run the input through each of the models and choose the one with the highest probability. It might work, but you can also get a category split across models, where one model gets very confused and ranks a label high because it can't discriminate properly.
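
A rough sketch of that ensemble idea (the deploy/model file names and the 'prob' output blob are assumptions):

```python
# Sketch: run one input through 5 category-shard models and take the most
# confident prediction. File names and the 'prob' blob name are assumptions.
import numpy as np
import caffe

nets = [caffe.Net('shard_%d_deploy.prototxt' % i,
                  'shard_%d.caffemodel' % i, caffe.TEST) for i in range(5)]

def classify(image_blob):
    best = (None, -1.0)                        # ((shard, local label), prob)
    for shard, net in enumerate(nets):
        net.blobs['data'].data[...] = image_blob
        probs = net.forward()['prob'][0]       # softmax over this shard's labels
        local = int(np.argmax(probs))
        if probs[local] > best[1]:
            best = ((shard, local), float(probs[local]))
    return best
```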

@c4ssio commented Sep 5, 2015

Thanks for the prompt response! I clearly need to read more about this.

The issue I have is that we get thousands of new wine label brands a day, so I can't possibly know all my categories ahead of time. I can do 100k photos in 8.5h at my current rate (using g2.8xlarge, GoogLeNet, 256x256, color images), so I'd be spending 35 days building out a massive model with my entire dataset which will end up being a month stale and not include the new hotness in the wine industry.

So either I need to come up with a way to run through my entire corpus 100x faster (diminish accuracy through downsampling, prioritize the most relevant subset of data, or beast out my hardware), or I need a different kind of solver that allows me to incrementally add new categories and add new photos to existing ones.

I like your idea about splitting my corpus across a few different networks, but yeah, it will compromise accuracy, because one network will think it's 50% sure of a match and the other will think it's 80% sure, and the 50% one will end up being correct because the 80% one just doesn't know enough.

OK, the search continues. Thanks for your help! Happy Labor Day weekend; I'll keep an eye out for exciting developments here and maybe pop over to the BVLC folks and see if they have any ideas.

@bhack commented Sep 5, 2015

@thatguymike See some caffe experiments at http://arxiv.org/abs/1506.08272

@vfdev-5 commented Mar 22, 2017

@lukeyeager I wonder whether DIGITS has advanced to support task execution on SGE clusters since this issue was created?

I would like to use DIGITS with Intel-flavoured Caffe on the Colfax Cluster.
In this case it is possible to hack the Task class and configure it (choosing LocalTask or GridEngineTask) at DIGITS startup. Each task is then executed with qsub in interactive mode.
See my fork for details. For instance, some widgets are not updated during task execution, but that is OK for my usage. Anyway, I would like to get your feedback.
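
Roughly, the idea is something like the following sketch (the qsub flags are generic SGE-style assumptions, not the actual code in the fork):

```python
# Sketch of a GridEngineTask-style wrapper: instead of spawning the task's
# command locally, submit it through qsub and block until it finishes.
# Flags are SGE-style assumptions (-b y: binary job, -cwd: run in the
# working directory, -sync y: wait for completion), not DIGITS API.
import subprocess

def run_task_on_grid_engine(task_args, workdir):
    cmd = ['qsub', '-b', 'y', '-cwd', '-sync', 'y'] + task_args
    return subprocess.Popen(cmd, cwd=workdir,
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
```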
