We developed a new network architecture that leverages both image and text modalities to enhance feature learning on long-tailed datasets. Our experiments show that combining pre-trained image and text models via cross-modal attention compensates for the individual limitations of each model and significantly improves long-tail recognition accuracy. Further experiments examined how text quality affects the model’s performance and identified key factors that influence the effectiveness of multimodal models. We will release the code upon acceptance of the paper.
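To make the fusion idea concrete, the following is a minimal sketch of single-head cross-modal attention in which an image feature acts as the query over a set of text features. It is not the architecture of this paper: the function names `cross_attention` and `fuse`, the residual fusion step, and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_feat, text_feats):
    """Single-head scaled dot-product attention (illustrative only).

    image_feat: query vector of length d, e.g. from an image encoder.
    text_feats: list of key/value vectors of length d, e.g. text tokens.
    Returns (attended text vector, attention weights).
    Learned projections are omitted for brevity (an assumption).
    """
    d = len(image_feat)
    scores = [
        sum(q * k for q, k in zip(image_feat, t)) / math.sqrt(d)
        for t in text_feats
    ]
    weights = softmax(scores)
    attended = [
        sum(w * t[j] for w, t in zip(weights, text_feats))
        for j in range(d)
    ]
    return attended, weights


def fuse(image_feat, text_feats):
    """Hypothetical residual fusion: image feature plus attended text."""
    attended, _ = cross_attention(image_feat, text_feats)
    return [x + a for x, a in zip(image_feat, attended)]
```

In this sketch, text features that align more closely with the image feature receive larger attention weights, so they contribute more to the fused representation.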
The table shows that image labels distill image content concisely and contribute the most value in the multimodal fusion process. Although the descriptive text contains redundancies, it still performed notably well. Including nonsensical text somewhat degraded the multimodal model’s performance.
The table shows that our method achieves strong results compared with different types of methods. On CIFAR-100-LT (IF100), for example, our method reaches an accuracy of 62.32%, well above the 37.56% achieved by the multimodal training approach CLIP2FL. Our method also outperforms generative methods: the feature-based LDMLR (51.92%), the label-based ProCo (52.80%), and the sample-based DiffuLT (50.70%).
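The size of these gaps can be checked with a few lines of arithmetic. The sketch below uses only the CIFAR-100-LT (IF100) accuracies quoted in the text; the dictionary layout and the helper name `margins_over_baselines` are illustrative assumptions.

```python
# Accuracy (%) on CIFAR-100-LT (IF100), as quoted in the text.
accuracy = {
    "Ours": 62.32,
    "CLIP2FL": 37.56,
    "LDMLR": 51.92,
    "ProCo": 52.80,
    "DiffuLT": 50.70,
}


def margins_over_baselines(results, ours="Ours"):
    """Absolute gap, in percentage points, between `ours` and each baseline."""
    return {
        name: round(results[ours] - acc, 2)
        for name, acc in results.items()
        if name != ours
    }


for name, gap in margins_over_baselines(accuracy).items():
    print(f"{name}: +{gap:.2f} points")
```

The resulting margins are 24.76 points over CLIP2FL, 10.40 over LDMLR, 9.52 over ProCo, and 11.62 over DiffuLT.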
"Pure" denotes an image-only model, ResNet-32, trained from scratch. The blue bars represent our method. Since the class labels in this dataset are purely numerical, our textual content is also descriptive text generated by the BLIP-2 model. Figure 2 shows that our method improves classification performance on the tail categories while keeping performance on the head categories stable.
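A head-versus-tail comparison of this kind can be computed by grouping test samples according to how many training samples their class has. The following is a minimal sketch under stated assumptions: the function name `group_accuracy` and the cutoff `head_threshold=100` are hypothetical, and the paper may define head and tail categories differently.

```python
from collections import defaultdict


def group_accuracy(y_true, y_pred, train_counts, head_threshold=100):
    """Accuracy over head and tail classes (illustrative only).

    train_counts: mapping from class id to number of training samples.
    head_threshold: hypothetical cutoff; classes with more training
    samples than this count as 'head', all others as 'tail'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        group = "head" if train_counts[t] > head_threshold else "tail"
        total[group] += 1
        correct[group] += int(t == p)
    # Only groups that actually occur in y_true are reported.
    return {g: correct[g] / total[g] for g in total}
```

Running this for both the image-only baseline and the multimodal model would give the per-group numbers needed to compare tail gains against head stability.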