Setting up datasets

Pretraining datasets

We provide download scripts for each of the pretraining datasets we used (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics, and SynthCI-30M) in their respective folders. For the Mayilvahanan et al. experiment, we have released, with the authors' permission, the exact LAION-400M sample indices that are included in the final dataset here. We thank the authors again for their great work!
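As a rough illustration of how released sample indices could be applied, the sketch below keeps only the samples whose position appears in an index set. The index file format (one integer per line) and the helper names `parse_indices` and `filter_by_indices` are assumptions made for this example; they are not taken from the repository, so adapt them to the actual format of the released indices.

```python
from typing import Iterable, Iterator, Set, Tuple, TypeVar

T = TypeVar("T")


def parse_indices(lines: Iterable[str]) -> Set[int]:
    """Parse one integer index per line, skipping blank lines (assumed format)."""
    return {int(line) for line in lines if line.strip()}


def filter_by_indices(samples: Iterable[T], keep: Set[int]) -> Iterator[Tuple[int, T]]:
    """Yield (position, sample) for every sample whose position is in `keep`."""
    for position, sample in enumerate(samples):
        if position in keep:
            yield position, sample


if __name__ == "__main__":
    # Toy stand-ins for an index file and a dataset iterator.
    keep = parse_indices(["0\n", "2\n", "\n"])
    for position, sample in filter_by_indices(["a", "b", "c"], keep):
        print(position, sample)
```

Streaming over the samples and testing membership in a set avoids loading the full dataset into memory, which matters at LAION-400M scale.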

Downstream datasets

For the zero-shot classification experiments, please follow the data download setup from the SuS-X GitHub repository. For the retrieval experiments, we directly use the splits provided on Hugging Face: flickr1k and coco. For the text-to-image generation experiments, we use the datasets from HEIM; for more details on these evaluations, please see the src/text_to_image_experiments folder.
