diff --git a/docs/announcements/CC_25M_community.md b/docs/announcements/CC_25M_community.md
index fd0aab41f..839cd9d45 100644
--- a/docs/announcements/CC_25M_community.md
+++ b/docs/announcements/CC_25M_community.md
@@ -1,18 +1,24 @@
 # 25 million Creative Commons image dataset released
 
-[Fondant](https://fondant.ai) is an open-source project that aims to simplify and speed up
-large-scale data processing by making containerized components reusable across pipelines &
+[Fondant](https://fondant.ai) is an open-source project that aims to simplify and speed up
+large-scale data processing by making containerized components reusable across pipelines &
 execution environments, shared within the community.
 
-A current challenge for generative AI is compliance with copyright laws. For this reason,
-Fondant has developed a data-processing pipeline to create a 500-million dataset of Creative
+A current challenge for generative AI is compliance with copyright laws. For this reason,
+Fondant has developed a data-processing pipeline to create a 500-million dataset of Creative
 Commons images to train a latent diffusion image generation model that respects copyright. Today,
-as a first step, we are releasing a 25-million sample dataset and invite the open source
+as a first step, we are releasing a 25-million sample dataset and invite the open source
 community to collaborate on further refinement steps.
 
-Fondant offers tools to download, explore and process the data. The current example pipeline
-includes a component for downloading the urls, a simple file type filter, one for downloading
-the images and one for deduplicating the urls. Additional processing components which could be
+Fondant offers tools to download, explore and process the data. The current example pipeline
+includes a component for downloading the urls and one for downloading the images.
+
+Creating custom pipelines for specific purposes requires different building blocks. Fondant
+pipelines can mix reusable components and custom components.
+
+![sample_pipeline](https://github.com/ml6team/fondant/blob/main/docs/art/announcements/sample_pipeline_cc25.png?raw=true)
+
+Additional processing components which could be
 contributed include, in order of priority:
 
 * Image-based deduplication
@@ -25,6 +31,6 @@ contributed include, in order of priority:
 * AI generated image detection
 * Any components that you propose to develop
 
-The Fondant team also invites contributors to the core framework and is looking for feedback on
-the framework’s usability and for suggestions for improvement. Contact us at
+The Fondant team also invites contributors to the core framework and is looking for feedback on
+the framework’s usability and for suggestions for improvement. Contact us at
 [info@fondant.ai](mailto:info@fondant.ai) and/or join our [discord](https://discord.gg/HnTdWhydGp).
\ No newline at end of file
diff --git a/docs/art/announcements/sample_pipeline_cc25.png b/docs/art/announcements/sample_pipeline_cc25.png
new file mode 100644
index 000000000..9cdc6817c
Binary files /dev/null and b/docs/art/announcements/sample_pipeline_cc25.png differ
diff --git a/docs/art/guides/component.png b/docs/art/guides/component.png
index 9ea72bb93..e0ac2b7d1 100644
Binary files a/docs/art/guides/component.png and b/docs/art/guides/component.png differ
diff --git a/docs/overrides/main.html b/docs/overrides/main.html
index c506c3d98..8425ec50b 100644
--- a/docs/overrides/main.html
+++ b/docs/overrides/main.html
@@ -3,7 +3,7 @@
 {% block announce %}
We released a 25 million Creative Commons image dataset! - Read more
 {% endblock %}
\ No newline at end of file