This repository holds the implementation of YOLOX-ViT, Knowledge Distillation (KD), evaluation metrics of the object detector, and the side-scan sonar image dataset for underwater wall detection from our paper:
Aubard, M., Antal, L., Madureira, A., Ábrahám, E. (2024). Knowledge Distillation in YOLOX-ViT for Side-Scan Sonar Object Detection. arXiv preprint arXiv:2403.09313.
If any of this work has been useful in your research, please consider citing us 😃.
The Sonar Wall Detection Dataset (SWDD) is publicly accessible at https://zenodo.org/records/10528135.
The base of the code comes from the YOLOX repository: https://github.com/Megvii-BaseDetection/YOLOX/tree/main.
This code has two primary contributions:

- Knowledge Distillation Enhancement: Integrate Knowledge Distillation between a Teacher model (e.g., YOLOX-L) and a Student model (e.g., YOLOX-Nano) to improve the accuracy of the Student model. This process involves transferring knowledge from the larger, more complex Teacher model to the smaller, more efficient Student model.
- ViT Layer Integration: Implement a Vision Transformer (ViT) layer between the backbone and the neck to enhance the feature extraction process. This integration aims to leverage the strengths of ViT in understanding global dependencies within images, thereby improving the YOLOX feature representation capabilities.
Furthermore, YOLOX-ViT and KD-YOLOX-ViT have been evaluated on the proposed SWDD object detection dataset. The dataset consists of side-scan sonar images of walls, manually annotated following the COCO annotation format. It contains 864 training images, a 6-minute 57-second SSS video, and 6243 extracted video frames with their manually annotated ground truth.
The following table provides the weights used in our paper.
| Model | Img size | Weights |
| --- | --- | --- |
| Nano | 416 | github |
| Nano-ViT | 416 | github |
| L | 640 | github |
| L-ViT | 640 | github |
| Nano-noAug | 416 | github |
| Nano-ViT-noAug | 416 | github |
| Nano-noAug-L | 416 | github |
| Nano-noAug-L-ViT | 416 | github |
| Nano-noAug-ViT-L | 416 | github |
| Nano-noAug-ViT-L-ViT | 416 | github |
Because of current computational limitations, the ViT model could not be pre-trained on a larger dataset such as COCO. Instead, we start the training with the pre-trained weights of the corresponding model without the ViT layer.
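In practice this amounts to a partial weight transfer, since the ViT layer does not exist in the source checkpoint. Below is a minimal PyTorch sketch of such an initialization, assuming the standard YOLOX experiment API; the checkpoint path is only an example.

```python
import torch
from yolox.exp import get_exp

# Build the ViT-augmented model from its experiment file (self.vit = True there).
exp = get_exp("exps/default/yolox_nano.py", None)
model = exp.get_model()

# Checkpoint trained WITHOUT the ViT layer (path is illustrative).
ckpt = torch.load("YOLOX_OUTPUTS/yolox_nano/best_ckpt.pth", map_location="cpu")

# Copy only the parameters that exist in both models; the new ViT layer
# keeps its random initialization.
missing, unexpected = model.load_state_dict(ckpt["model"], strict=False)
print(f"{len(missing)} parameters (the ViT layer) left at random init.")
```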
The following illustration introduces the Knowledge Distillation principle used for the KD-YOLOX-ViT.
The object detection loss function is characterized by:
- Classification loss ($L_{cls}$), which improves classification accuracy,
- Intersection over Union (IoU) loss ($L_{iou}$), which enhances the precision of object localization,
- Objectness loss ($L_{obj}$), which refines the model's ability to identify regions containing objects.

This gives the detection loss function:

$$L_{det} = L_{cls} + L_{iou} + L_{obj}$$

Knowledge Distillation aims to implement an additional loss term, $L_{KD}$, which penalizes the discrepancy between the Teacher's and the Student's predictions on the same inputs. Thus, the total loss is:

$$L_{total} = L_{det} + L_{KD}$$

with $L_{KD}$ the Knowledge Distillation loss computed from the Teacher's outputs.
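The exact form of $L_{KD}$ depends on which Teacher outputs are distilled. As one common, illustrative choice (a sketch, not necessarily the exact formulation used in the paper), the class predictions can be matched with a temperature-scaled KL divergence:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """One common choice for L_KD: temperature-scaled KL divergence between
    the Student's and the Teacher's soft class predictions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```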
YOLOX is an anchor-free object detection model with a decoupled head. It uses online random data augmentation, improving the model's robustness and accuracy.
Knowledge Distillation uses the Teacher's inference output as a soft target for the Student. Thus, to implement Knowledge Distillation in YOLOX, the Teacher needs to run inference on the same randomly augmented data for each training batch. The following image characterizes the workflow.
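A rough sketch of one such training step is given below; the attribute and helper names are assumptions based on the standard YOLOX model layout, not the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def online_kd_step(student, teacher, images, targets, optimizer, alpha=1.0):
    """Illustrative online KD step: the Teacher runs inference on the SAME
    randomly augmented batch the Student is trained on."""
    with torch.no_grad():
        teacher_fpn = teacher.backbone(images)   # frozen Teacher FPN outputs

    student_fpn = student.backbone(images)
    # Standard YOLOX detection loss (classification + IoU + objectness).
    det_loss = student.head(student_fpn, targets, images)[0]

    # Distillation term: match Student FPN features to the Teacher's
    # (a projection layer may be needed if the channel widths differ).
    kd_term = sum(F.mse_loss(s, t) for s, t in zip(student_fpn, teacher_fpn))

    loss = det_loss + alpha * kd_term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```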
Let's choose, as an example, YOLOX-L as the Teacher and YOLOX-nano as the Student.
- The following command runs the YOLOX-L model:

  ```bash
  python3 tools/train.py -f exps/default/yolox_l.py -b 8 --fp16 --logger wandb
  ```

  or the following command using pre-trained weights:

  ```bash
  python3 tools/train.py -f exps/default/yolox_l.py -b 8 -c datasets/COCO/weight/yolox_l.pth --fp16 --logger wandb
  ```
- The weights should be automatically saved under the folder /YOLOX_OUTPUTS/yolox_l/.
- Before launching the YOLOX-nano model, the YOLOX-nano experiment file /exps/default/yolox_nano.py needs to be modified for Knowledge Distillation. The parameters self.KD and self.KD_online need to be set to True. Finally, self.folder_KD_directory is the directory where the augmented images and the Teacher FPN logits are saved (see the excerpt after these steps).
- The following command runs the YOLOX-nano model:

  ```bash
  python3 tools/train.py -f exps/default/yolox_nano.py -b 8 --fp16 --logger wandb
  ```

  or the following command using pre-trained weights:

  ```bash
  python3 tools/train.py -f exps/default/yolox_nano.py -b 8 -c datasets/COCO/weight/yolox_nano.pth --fp16 --logger wandb
  ```
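For reference, the KD-related part of the Student's experiment file could look roughly like this (an illustrative excerpt; the example directory path is an assumption):

```python
# Illustrative excerpt from exps/default/yolox_nano.py
from yolox.exp import Exp as MyExp


class Exp(MyExp):
    def __init__(self):
        super().__init__()
        # ... standard YOLOX-nano settings (depth, width, input size, ...) ...

        # Knowledge Distillation settings
        self.KD = True                         # enable Knowledge Distillation
        self.KD_online = True                  # True = online KD (Teacher inference during Student training)
        self.folder_KD_directory = "KD_data/"  # example path for augmented images and Teacher FPN logits
```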
During Student training, the model saves the augmented images, launches the Teacher inference, saves and loads the FPN logits, computes the Knowledge Distillation loss $L_{KD}$, and updates the Student accordingly.
However, training can take a long time because of the online Teacher inference. For instance, the SWDD dataset requires one week to train for 300 epochs on a single GeForce RTX 3070 Ti GPU.
Because of the time-consuming nature of online Knowledge Distillation, we also propose an offline version, which drastically reduces training time. Offline Knowledge Distillation disables online data augmentation and trains the Student using only the dataset. However, the Teacher can still be trained with online data augmentation, which can increase the knowledge distilled to the Student. The offline Knowledge Distillation workflow is detailed below.
- The first steps, Train Teacher and Save Teacher weights, use the same commands as for Online Knowledge Distillation.
- Launch the Teacher inference using the trained weights by running the following command:

  ```bash
  python3 Teacher_Inference.py
  ```

  The weights path can be modified accordingly in the Teacher_Inference.py file. Furthermore, because YOLOX-nano only uses an image size of 416, the Teacher inference is run at this resolution so that the saved FPN logits match the Student's input size.
- As for Online Knowledge Distillation, the YOLOX-nano file /exps/default/yolox_nano.py needs to be modified before launching. Set self.KD to True; setting self.KD_online to False, however, selects the offline Knowledge Distillation mode, which disables online data augmentation for the Student training.
- Finally, the YOLOX-nano training can be launched with:

  ```bash
  python3 tools/train.py -f exps/default/yolox_nano.py -b 8 --fp16 --logger wandb
  ```

  or the following command using pre-trained weights:

  ```bash
  python3 tools/train.py -f exps/default/yolox_nano.py -b 8 -c datasets/COCO/weight/yolox_nano.pth --fp16 --logger wandb
  ```
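For reference, a minimal sketch of what such an offline Teacher-inference pass could look like is shown below; the checkpoint path, output directory, and the `training_images` iterable are placeholders, and the real Teacher_Inference.py may differ.

```python
import os
import torch
from yolox.exp import get_exp

# Load the trained Teacher (checkpoint path is an example).
exp = get_exp("exps/default/yolox_l.py", None)
teacher = exp.get_model().eval()
ckpt = torch.load("YOLOX_OUTPUTS/yolox_l/best_ckpt.pth", map_location="cpu")
teacher.load_state_dict(ckpt["model"])

out_dir = "KD_data/"  # should match self.folder_KD_directory in the Student's Exp file
os.makedirs(out_dir, exist_ok=True)

with torch.no_grad():
    # `training_images` is a placeholder iterable of (1, 3, 416, 416) tensors,
    # i.e. the training set resized to the Student's 416 input size.
    for idx, img in enumerate(training_images):
        fpn_logits = teacher.backbone(img)  # Teacher FPN outputs for this image
        torch.save(fpn_logits, os.path.join(out_dir, f"{idx}.pt"))
```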
Transformers, introduced by Vaswani et al. (Attention Is All You Need) and initially designed for natural language processing, proved effective in handling sequential data, outperforming the state of the art. Dosovitskiy et al. (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale) introduced the Vision Transformer (ViT), the first computer vision transformer model, achieving state-of-the-art performance on image recognition tasks without convolutional layers. Carion et al. (End-to-End Object Detection with Transformers) presented DETR (DEtection TRansformer), which performs object detection by directly predicting sets, without the need for separate region proposal and refinement stages.
Integrating transformers with CNNs enhances feature extraction in object detection tasks, combining the spatial hierarchy of CNNs with the global context of transformers. Yu et al. (Real-time underwater maritime object detection in side-scan sonar images based on Transformer-YOLOv5) proposed YOLOv5-TR for container and shipwreck detection. Aubard et al. (Real-time automatic wall detection and localization based on side scan sonar images) demonstrated a 5.5% performance improvement using YOLOX. Our ViT layer is set up with 4 Multi-Head Self-Attention (MHSA) layers. The following image shows the ViT layer integration into the YOLOX model: the ViT layer is represented by the red arrow, while the basic YOLOX architecture is represented by the dotted line.
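A simplified, self-contained sketch of such a ViT-style block with 4 stacked MHSA layers, applied to a CNN feature map before it is passed to the neck (the channel and head counts here are illustrative, not the repository's exact implementation):

```python
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    """Simplified ViT-style layer: flattens a CNN feature map into a token
    sequence, applies stacked Multi-Head Self-Attention, and reshapes back."""

    def __init__(self, channels: int, num_layers: int = 4, num_heads: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(channels, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(channels) for _ in range(num_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)           # (B, H*W, C): each position is a token
        for attn, norm in zip(self.layers, self.norms):
            attended, _ = attn(tokens, tokens, tokens)  # self-attention over all positions
            tokens = norm(tokens + attended)            # residual connection + LayerNorm
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```

Flattening the H×W positions into tokens lets every location attend to every other location, which is the global-context benefit the ViT layer is meant to add between the backbone and the neck.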
To activate the ViT layer in a YOLOX model, the parameter self.vit needs to be set to True in the corresponding experiment file, such as /exps/default/yolox_nano.py for the YOLOX-nano model.
Then, the YOLOX-ViT training can be launched using the same training command as the basic YOLOX model.
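For example (illustrative excerpt, inside the Exp.__init__ of the experiment file):

```python
# Illustrative excerpt from exps/default/yolox_nano.py (inside Exp.__init__)
self.vit = True  # insert the ViT layer between the backbone and the neck
```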
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 956200.
This work is part of the Reliable AI for Marine Robotics (REMARO) Project. For more info, please visit: https://remaro.eu/