MSCN: Noisy Correspondence Learning with Meta Similarity Correction (CVPR 2023, PyTorch Code)
- Python 3.8
- torch 1.7.0+cu110
- numpy
- scikit-learn
- pomegranate with TrueBetaDistribution (install from https://github.com/rayleizhu/pomegranate. Note that pomegranate requires Cython=0.29, NumPy, SciPy, NetworkX, and joblib. Then you can run `python setup.py build` and `python setup.py install` to install it.)
- Punkt Sentence Tokenizer:
    import nltk
    nltk.download()
    > d punkt
Despite the success of multimodal learning in cross-modal retrieval, this progress relies on correct correspondence among multimedia data. However, collecting such ideal data is expensive and time-consuming. In practice, the most widely used datasets are harvested from the Internet and inevitably contain mismatched pairs. Training on such noisy correspondence datasets degrades performance because cross-modal retrieval methods can wrongly force mismatched data to be similar. To tackle this problem, we propose a Meta Similarity Correction Network (MSCN) that provides reliable similarity scores. We treat a binary classification task as the meta-process, which encourages the MSCN to learn to discriminate between positive and negative meta-data. To further alleviate the influence of noise, we design an effective data purification strategy that uses meta-data as prior knowledge to remove noisy samples. Extensive experiments demonstrate the strengths of our method under both synthetic and real-world noise on Flickr30K, MS-COCO, and Conceptual Captions.
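As a rough illustration of the "binary classification as meta-process" idea, the sketch below labels matched meta pairs as positives and deliberately mismatched ones as negatives, then scores them with a binary cross-entropy. It is not the paper's implementation: the cyclic-shift negative sampling, the cosine-to-probability mapping, the toy embeddings, and the names `make_meta_pairs` and `binary_cross_entropy` are all assumptions made for illustration.

```python
import numpy as np


def make_meta_pairs(num_pairs, rng):
    """Index matched (label 1) and mismatched (label 0) pairs over clean meta-data."""
    if num_pairs < 2:
        raise ValueError("need at least two pairs to form a mismatch")
    idx = np.arange(num_pairs)
    # A cyclic shift by 1..num_pairs-1 guarantees caption != image for every negative.
    shift = int(rng.integers(1, num_pairs))
    img_idx = np.concatenate([idx, idx])
    cap_idx = np.concatenate([idx, (idx + shift) % num_pairs])
    labels = np.concatenate([np.ones(num_pairs), np.zeros(num_pairs)])
    return img_idx, cap_idx, labels


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean BCE between predicted match probabilities and 0/1 labels."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs)))


rng = np.random.default_rng(0)
# Toy unit-norm embeddings standing in for image and caption features (assumption).
img_emb = rng.normal(size=(8, 16))
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)
cap_emb = img_emb.copy()  # in this toy setup a matched caption shares the image embedding

img_idx, cap_idx, labels = make_meta_pairs(len(img_emb), rng)
cosine = np.sum(img_emb[img_idx] * cap_emb[cap_idx], axis=1)
probs = (cosine + 1.0) / 2.0  # map cosine in [-1, 1] to a pseudo-probability (assumption)
loss = binary_cross_entropy(probs, labels)
print(f"meta BCE on toy pairs: {loss:.3f}")
```

In a real setup the probabilities would come from a learned network rather than a fixed cosine mapping; the point here is only how clean meta pairs yield both positive and negative supervision.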
We follow NCR to obtain image features and vocabularies. Our method needs an extra meta-data set to guide the training. For the Flickr30K dataset, we randomly split the meta-data from the validation set:
    if opt.data_name == 'f30k_precomp':
        meta_len = 2900  # 2% of 145,000
        total_idsx = np.arange(0, len(images_dev))  # image length = caption length
        # sample meta_len indices without replacement
        meta_idxs = np.random.choice(total_idsx, meta_len, False)
        # the same indices select aligned caption/image pairs
        captions_meta, images_meta = list(np.array(captions_dev)[meta_idxs]), images_dev[meta_idxs]
        # save...
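The same split can be tried on toy data to confirm that images and captions stay aligned. The following is a minimal sketch, not the repository's code: the helper name `split_meta`, the seeded generator, and the toy arrays are assumptions added for illustration.

```python
import numpy as np


def split_meta(images, captions, meta_len, seed=0):
    """Draw `meta_len` aligned (image, caption) pairs without replacement."""
    if len(images) != len(captions):
        raise ValueError("images and captions must have the same length")
    rng = np.random.default_rng(seed)  # seeding is an assumption; the original is unseeded
    meta_idxs = rng.choice(len(images), size=meta_len, replace=False)
    images_meta = images[meta_idxs]
    captions_meta = [captions[i] for i in meta_idxs]
    return images_meta, captions_meta, meta_idxs


# Toy stand-ins for images_dev / captions_dev: row i starts with the value 4 * i.
toy_images = np.arange(50 * 4, dtype=np.float32).reshape(50, 4)
toy_captions = [f"caption {i}" for i in range(50)]

images_meta, captions_meta, meta_idxs = split_meta(toy_images, toy_captions, meta_len=10)
print(images_meta.shape, len(captions_meta))

# The meta-set size quoted in the snippet, in integer arithmetic: 2% of 145,000.
assert 145_000 * 2 // 100 == 2900
```

Fixing the seed makes the meta split reproducible across runs, which helps when comparing noise ratios on the same meta-data.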
For MS-COCO, the meta-data combines 6,328 pairs sampled from the training set with all 5,000 pairs of the validation set:
    if opt.data_name == 'coco_precomp':
        im_div = [0, 1, 2, 3, 4]  # caption offsets: each training image has 5 captions
        sup_len = 6328  # 2%*566,435 - 5000
        total_img_idsx = np.arange(0, len(images_train))
        total_cap_idsx = np.arange(0, len(captions_train))
        # sample images without replacement, then one of the 5 captions per image
        sup_img_idxs = np.random.choice(total_img_idsx, sup_len, False)
        sup_0t4_idxs = np.random.choice(im_div, sup_len, True)
        sup_cap_idxs = sup_img_idxs * 5 + sup_0t4_idxs
        # mask out the sampled images and all 5 of their captions from the training set
        mask_img = np.ones(len(total_img_idsx), dtype=bool)
        mask_img[sup_img_idxs] = False
        mask_cap = np.ones(len(total_cap_idsx), dtype=bool)
        del_cap_idxs = []
        for k in sup_img_idxs:
            del_cap_idxs.extend(list(range(k * len(im_div), k * len(im_div) + len(im_div))))
        del_cap_idxs = np.array(del_cap_idxs)
        mask_cap[del_cap_idxs] = False
        # get meta data
        img_meta_sup = images_train[sup_img_idxs]
        cap_meta_sup = list(np.array(captions_train)[sup_cap_idxs])
        images_meta = np.vstack((images_dev, img_meta_sup))
        captions_meta = captions_dev + cap_meta_sup
        # get new train data
        images_train = images_train[mask_img]
        captions_train = list(np.array(captions_train)[mask_cap])
        # save
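The index arithmetic above relies on caption `j` belonging to training image `j // 5`. The sketch below replays it on a toy problem and checks that the sampled captions and the remaining training data stay consistent. The function name `split_coco_train`, the seeded generator, the vectorized construction of the deleted caption indices, and the toy sizes are assumptions for illustration, not the repository's code.

```python
import numpy as np

CAPS_PER_IMAGE = 5  # matches len(im_div) in the snippet above

# Meta-set size from the snippet's comment, in integer arithmetic:
# 2% of 566,435 training pairs, minus the 5,000 validation pairs.
assert 566_435 * 2 // 100 - 5000 == 6328


def split_coco_train(num_images, sup_len, seed=0):
    """Pick `sup_len` images, one caption each, and mask them out of training."""
    rng = np.random.default_rng(seed)  # seeding is an assumption
    sup_img_idxs = rng.choice(num_images, size=sup_len, replace=False)
    # One of the five captions is chosen for each sampled image.
    offsets = rng.integers(0, CAPS_PER_IMAGE, size=sup_len)
    sup_cap_idxs = sup_img_idxs * CAPS_PER_IMAGE + offsets
    # All five captions of every sampled image leave the training set.
    del_cap_idxs = (sup_img_idxs[:, None] * CAPS_PER_IMAGE
                    + np.arange(CAPS_PER_IMAGE)).ravel()
    mask_img = np.ones(num_images, dtype=bool)
    mask_img[sup_img_idxs] = False
    mask_cap = np.ones(num_images * CAPS_PER_IMAGE, dtype=bool)
    mask_cap[del_cap_idxs] = False
    return sup_img_idxs, sup_cap_idxs, mask_img, mask_cap


num_images = 20
sup_img, sup_cap, mask_img, mask_cap = split_coco_train(num_images, sup_len=4)

# owner[j] is the image that caption j describes.
owner = np.arange(num_images * CAPS_PER_IMAGE) // CAPS_PER_IMAGE
assert np.array_equal(owner[sup_cap], sup_img)  # meta captions match meta images
assert np.array_equal(np.unique(owner[mask_cap]), np.flatnonzero(mask_img))
print(int(mask_img.sum()), int(mask_cap.sum()))
```

The broadcasted expression produces the same indices as the `for k in sup_img_idxs` loop, in the same order, so either form can be used to build `mask_cap`.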
For CC152K, the meta-data is split from the validation set of the original Conceptual Captions dataset. You can download the meta-data from https://drive.google.com/drive/folders/1XnGr7S-rXRfDbdeIF0QmTJV8kQFHx71-?usp=sharing.
    # Flickr30K: noise_ratio = {0.2, 0.5, 0.7}
    python main_MSCN.py --gpu 0 --data_name f30k_precomp --noise_ratio 0.2 --data_path data_path --vocab_path vocab_path
    # MS-COCO: noise_ratio = {0.2, 0.5, 0.7}
    python main_MSCN.py --gpu 0 --data_name coco_precomp --noise_ratio 0.2 --data_path data_path --vocab_path vocab_path
    # Conceptual Captions
    python main_MSCN.py --gpu 0 --data_name cc152k_precomp --data_path data_path --vocab_path vocab_path
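To run every listed configuration in one go, the commands can be generated programmatically. This is a small convenience sketch that uses only the flags shown above; the helper `build_command` is hypothetical, and `data_path` and `vocab_path` remain placeholders to replace with your own locations.

```python
import shlex

NOISE_RATIOS = (0.2, 0.5, 0.7)  # the values listed in the comments above


def build_command(data_name, data_path, vocab_path, noise_ratio=None, gpu=0):
    """Assemble the argument list for one training run of main_MSCN.py."""
    cmd = ["python", "main_MSCN.py", "--gpu", str(gpu), "--data_name", data_name]
    if noise_ratio is not None:  # the CC152K command above takes no --noise_ratio
        cmd += ["--noise_ratio", str(noise_ratio)]
    cmd += ["--data_path", data_path, "--vocab_path", vocab_path]
    return cmd


commands = [
    build_command(name, "data_path", "vocab_path", noise_ratio=ratio)
    for name in ("f30k_precomp", "coco_precomp")
    for ratio in NOISE_RATIOS
]
commands.append(build_command("cc152k_precomp", "data_path", "vocab_path"))

for cmd in commands:
    # Print only; pass `cmd` to subprocess.run(cmd, check=True) to actually launch a run.
    print(shlex.join(cmd))
```

Building each command as a list avoids shell-quoting problems if the real paths contain spaces.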
    @InProceedings{Han_2023_CVPR,
        author = {Han, Haochen and Miao, Kaiyao and Zheng, Qinghua and Luo, Minnan},
        title = {Noisy Correspondence Learning With Meta Similarity Correction},
        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
        month = {June},
        year = {2023},
        pages = {7517-7526}
    }
The code is based on NCR (licensed under Apache 2.0) and MW-Net (licensed under MIT).