please do not use BiomedCLIP for ARCH dataset #12

jinxixiang · 2023-11-11T07:33:48Z

Dear Author,

The ARCH dataset is divided into two subsets: the books_set and the pubmed_set.

I have noticed that the pubmed_set appears to overlap with the training data of BiomedCLIP, which is sourced from PubMed Central.

In your paper, you combined these two datasets for cross-modality retrieval. However, I decided to separate them and compare their performance individually.

With BiomedCLIP, the retrieval performance on the pubmed_set was as follows:
{15.7; 79.8; 94.4; 16.7; 78.9; 93.7}

Meanwhile, the retrieval performance on the books_set was:
{7.3; 49.2; 74.2; 8.2; 49.7; 73.2}

In contrast, QUILT-GPT/77 gave different results:

The retrieval performance on the pubmed_set was:
{1.8; 23.6; 46.0; 1.6; 23.4; 45.7}

The retrieval performance on the books_set was:
{1.8; 27.7; 52.8; 1.5; 23.4; 46.4}

From these results, it's clear that QUILT-GPT/77 does not show the large gap between the two subsets that BiomedCLIP does, where the pubmed_set scores are far higher than the books_set scores.
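To make the comparison concrete, here is a small sketch that subtracts the books_set scores from the pubmed_set scores for each model. The numbers are copied from this comment; the dictionary layout and the `subset_gap` helper are only illustrative names, not part of any released code.

```python
# Reported retrieval scores, copied from the figures in this comment.
# Order: text2image R1, R50, R200, then image2text R1, R50, R200.
scores = {
    "BiomedCLIP": {
        "pubmed_set": [15.7, 79.8, 94.4, 16.7, 78.9, 93.7],
        "books_set": [7.3, 49.2, 74.2, 8.2, 49.7, 73.2],
    },
    "QUILT-GPT/77": {
        "pubmed_set": [1.8, 23.6, 46.0, 1.6, 23.4, 45.7],
        "books_set": [1.8, 27.7, 52.8, 1.5, 23.4, 46.4],
    },
}


def subset_gap(model):
    """Per-metric difference (pubmed_set minus books_set), in points."""
    pubmed = scores[model]["pubmed_set"]
    books = scores[model]["books_set"]
    return [round(p - b, 1) for p, b in zip(pubmed, books)]


for model in scores:
    print(model, subset_gap(model))
```

For BiomedCLIP the gap is about 30 points at R50 in both directions, while for QUILT-GPT/77 every metric is within 7 points and the books_set is sometimes ahead.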

jinxixiang · 2023-11-11T07:35:24Z

From left to right, the figures are: text2image R1, R50, R200, then image2text R1, R50, R200.
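For anyone who wants to reproduce the per-subset split, below is a minimal sketch of how Recall@K can be computed from a similarity matrix. It is not the evaluation code from the paper. It assumes a one-to-one pairing where text `i` matches image `i`, and `recall_at_k` and `transpose` are hypothetical helper names. In practice the matrix would come from the model's text and image embeddings.

```python
def recall_at_k(similarity, ks=(1, 50, 200)):
    """Recall@K in percent.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        correct = row[i]
        # Rank = number of candidates scoring strictly higher than the match.
        ranks.append(sum(1 for s in row if s > correct))
    n = len(ranks)
    return {k: 100.0 * sum(r < k for r in ranks) / n for k in ks}


def transpose(matrix):
    """Swap queries and candidates (text2image <-> image2text)."""
    return [list(col) for col in zip(*matrix)]


# Toy example: rows are texts, columns are images.
toy = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.1],
    [0.3, 0.4, 0.2],
]
text2image = recall_at_k(toy, ks=(1, 2, 3))
image2text = recall_at_k(transpose(toy), ks=(1, 2, 3))
print(text2image, image2text)
```

Running this separately on the pubmed_set pairs and the books_set pairs gives one set of six figures per subset, in the order listed above.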

wisdomikezogwo · 2023-12-14T16:05:38Z

This is very valid and points to some form of leakage, which is expected with BiomedCLIP. Thank you for the evaluations; I will make sure to add a note to the README in future updates!
