Add doc for fast data loading #2069

Merged
@@ -0,0 +1,73 @@
Fast Data Loading
=================

OpenVINO™ Training Extensions provides several ways to boost model training speed,
one of which is fast data loading.


===================
Faster Augmentation
===================


******
AugMix
******
AugMix [1]_ is a simple yet powerful augmentation technique
that improves the robustness and uncertainty estimates of image classification models.
OpenVINO™ Training Extensions implements it in `Cython <https://cython.org/>`_ for faster augmentation.
Users do not need to configure anything, as the Cythonized AugMix is used by default.
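
For reference, the core of the technique is a weighted mix of several augmentation chains
with the original image. The snippet below is a minimal NumPy sketch of that mixing step,
not the Cython implementation shipped with OpenVINO™ Training Extensions; the individual
augmentation operations (``operations``) are assumed to be given.

.. code-block:: python

    import numpy as np

    def augmix(image, operations, width=3, depth=3, alpha=1.0, rng=np.random):
        """Illustrative AugMix mixing step (sketch, not the production code)."""
        # Weights for `width` independent augmentation chains.
        ws = rng.dirichlet([alpha] * width)
        # Skip-connection weight between the original and the mixed image.
        m = rng.beta(alpha, alpha)

        mix = np.zeros_like(image, dtype=np.float32)
        for w in ws:
            augmented = image.astype(np.float32)
            for _ in range(depth):
                op = operations[rng.randint(len(operations))]
                augmented = op(augmented)  # each op maps an image to an image
            mix += w * augmented

        # Convex combination of the original image and the mixed augmentations.
        return m * image.astype(np.float32) + (1.0 - m) * mix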



=======
Caching
=======


*****************
In-Memory Caching
*****************
OpenVINO™ Training Extensions provides in-memory caching of decoded images.
If the batch size is large, as is typical for classification tasks, or if the dataset contains
high-resolution images, image decoding can account for a non-negligible overhead
in data pre-processing.
In such cases, one can enable in-memory caching to maximize GPU utilization and reduce model
training time.


.. code-block::

    $ otx train --mem-cache-size=8GB ..
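
To pick a reasonable value for ``--mem-cache-size``, you can roughly estimate the decoded
footprint of your dataset. The numbers below are purely hypothetical:

.. code-block:: python

    # Back-of-the-envelope estimate of the memory needed to cache all decoded images.
    num_images = 50_000                    # hypothetical dataset size
    height, width, channels = 224, 224, 3  # decoded RGB (uint8) resolution

    bytes_per_image = height * width * channels  # ~147 KiB per image
    total_gib = num_images * bytes_per_image / 1024 ** 3
    print(f"~{total_gib:.1f} GiB")  # ~7.0 GiB, so the 8GB cache above would hold the whole dataset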



***************
Storage Caching
***************

OpenVINO™ Training Extensions uses `Datumaro <https://github.com/openvinotoolkit/datumaro>`_
under the hood for dataset management.
Since Datumaro `supports <https://openvinotoolkit.github.io/datumaro/latest/docs/explanation/formats/arrow.html>`_
`Apache Arrow <https://arrow.apache.org/overview/>`_, OpenVINO™ Training Extensions
can speed up data loading by using memory-mapped Arrow files, at the expense of additional storage consumption.


.. code-block::

    $ otx train .. params --algo_backend.storage_cache_scheme JPEG/75
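
Under the hood, this roughly corresponds to exporting the dataset to the Arrow format with
Datumaro and reading it back memory-mapped. The sketch below is illustrative only; the source
format name and the ``image_ext`` encoding option are assumptions, so please check the Datumaro
documentation linked below for the exact export arguments.

.. code-block:: python

    import datumaro as dm

    # Import the original dataset (the format name here is an assumption; use your own).
    dataset = dm.Dataset.import_from("path/to/train/root", format="imagenet")

    # Export to Apache Arrow, re-encoding images with the chosen scheme
    # (JPEG at quality 75, mirroring --algo_backend.storage_cache_scheme JPEG/75).
    dataset.export("path/to/cache", format="arrow", image_ext="JPEG/75")

    # Re-import the Arrow file; it can be memory-mapped for fast loading.
    cached = dm.Dataset.import_from("path/to/cache", format="arrow")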


The cache is saved in ``$HOME/.cache/otx`` by default.
You can change the location by setting the ``OTX_CACHE`` environment variable.


.. code-block::

    $ OTX_CACHE=/path/to/cache otx train .. params --algo_backend.storage_cache_scheme JPEG/75


Please refer to the `Datumaro documentation <https://openvinotoolkit.github.io/datumaro/latest/docs/explanation/formats/arrow.html#export-to-arrow>`_
for the available schemes; we recommend ``JPEG/75`` for fast data loading.

.. [1] Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. "AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty" International Conference on Learning Representations. 2020.
@@ -11,3 +11,4 @@ Additional Features
auto_configuration
xai
noisy_label_detection
fast_data_loading
@@ -1,4 +1,4 @@
Noisy label detection
Noisy Label Detection
=====================

OpenVINO™ Training Extensions provide a feature for detecting noisy labels during model training.
@@ -243,6 +243,18 @@ For example, that is how you can change the learning rate and the batch size for
--learning_parameters.batch_size 16 \
--learning_parameters.learning_rate 0.001

You can also enable storage caching to speed up data loading at the expense of additional storage:

.. code-block::

    (otx) ...$ otx train SSD --train-data-roots <path/to/train/root> \
                             --val-data-roots <path/to/val/root> \
                             params \
                             --algo_backend.storage_cache_scheme JPEG/75

.. note::

    Not all templates support the storage cache. We are working on extending the set of supported templates.


As can be seen from the parameters list, the model can be trained using multiple GPUs. To do so, you simply need to specify a comma-separated list of GPU indices after the ``--gpus`` argument. It will start the distributed data-parallel training with the GPUs you have specified.
