This project leverages multiple GPUs to reduce the training time of complex models through data parallelism, using two approaches (minimal configuration sketches follow the list below):
- Multi-worker Training using 2 PCs with GeForce RTX GPUs as Workers, connected via:
  - Local area network (LAN).
  - VPN tunnel using OpenVPN (not included in the demo).
- Parameter Server Training using 5 machines on a LAN:
  - 2 laptops as Parameter Servers, connected via 5 GHz Wi-Fi.
  - 2 PCs with GeForce RTX GPUs as Workers.
  - 1 CPU-only PC as the Coordinator.
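
As a rough sketch (not taken from this repo's code), multi-worker training sets `TF_CONFIG` on each machine and builds the model under `tf.distribute.MultiWorkerMirroredStrategy`; the IP addresses, port, and MobileNetV2 stand-in model below are placeholders:

```python
import json
import os

import tensorflow as tf

# TF_CONFIG must be set before the strategy is created; the IPs/port are
# placeholders for the two GPU PCs, and index 0 marks the chief worker.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["192.168.1.10:12345", "192.168.1.11:12345"]},
    "task": {"type": "worker", "index": 0},  # use index 1 on the second PC
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Model choice is illustrative; the actual architecture is described in Report.pdf.
    model = tf.keras.applications.MobileNetV2(weights=None, classes=30)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
# model.fit(train_ds, validation_data=val_ds, epochs=...)
```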
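
For the parameter server setup, a coordinator-side sketch under assumed addresses, port, and task layout (each Worker and Parameter Server machine would additionally run a `tf.distribute.Server` pointing at the same cluster spec):

```python
import tensorflow as tf

# Cluster layout mirroring the hardware above; addresses and port are placeholders.
cluster_spec = tf.train.ClusterSpec({
    "worker": ["192.168.1.10:2222", "192.168.1.11:2222"],  # 2 GPU PCs
    "ps": ["192.168.1.20:2222", "192.168.1.21:2222"],      # 2 laptops
})
resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
    cluster_spec, task_type="chief", task_id=0, rpc_layer="grpc")

# The non-experimental ParameterServerStrategy symbol requires TF 2.6+.
strategy = tf.distribute.ParameterServerStrategy(resolver)

with strategy.scope():
    pass  # build and compile the Keras model here, as in the multi-worker sketch
```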
We used our self-built 30VNFoods dataset, which contains images of 30 famous Vietnamese dishes that we collected and labeled. The dataset is split into:
- 17,581 images for training.
- 2,515 images for validation.
- 5,040 images for testing.
In addition, we used the small TensorFlow flowers dataset of about 3,700 flower images, organized into 5 folders corresponding to 5 types of flowers (daisy, dandelion, roses, sunflowers, tulips).
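
Either dataset can be loaded per worker with the image size and per-worker batch size from the hyperparameter table below; this is a hedged sketch and the directory paths are placeholders:

```python
import tensorflow as tf

IMG_SIZE = (224, 224)
BATCH_PER_WORKER = 32

# Builds labeled datasets from class-named subfolders (e.g. one folder per dish or flower type).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "30VNFoods/train",            # or the extracted flower_photos folder
    image_size=IMG_SIZE,
    batch_size=BATCH_PER_WORKER,
    label_mode="int",
)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "30VNFoods/validation",
    image_size=IMG_SIZE,
    batch_size=BATCH_PER_WORKER,
    label_mode="int",
)
```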
| Hyperparameter | Value |
|---|---|
| Image size | (224, 224) |
| Batch size per worker | 32 |
| Optimizer | Adam |
| Learning rate | 0.001 |
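
Under data parallelism, the per-worker batch of 32 implies a global batch of 32 × the number of workers; a small sketch of that relationship, assuming one GPU replica per worker:

```python
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

PER_WORKER_BATCH = 32
# With one GPU per worker, replicas == workers, so each training step
# processes 32 * num_workers images in total (64 for the two GPU PCs).
global_batch = PER_WORKER_BATCH * strategy.num_replicas_in_sync
print("Global batch size:", global_batch)
```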
The iperf3 tool was used to measure the network bandwidth between the machines.
| Training method | Dataset | Connection | Avg. time per epoch (s) |
|---|---|---|---|
| Single-worker | flowers | LAN | 14 |
| Multi-worker | flowers | LAN | 18 |
| Multi-worker | flowers | VPN tunnel | 635 |
| Multi-worker | 30VNFoods | LAN | 184 |
| Parameter Server | 30VNFoods | LAN | 115 |
⇒ For more information, see Report.pdf.
- Distributed training with Keras
- A friendly introduction to distributed training (ML Tech Talks)
- Distributed TensorFlow training (Google I/O '18)
- Inside TensorFlow: Parameter server training
- Performance issue for Distributed TF
- When is TensorFlow's ParameterServerStrategy preferable to its MultiWorkerMirroredStrategy?