From a6455417c6655357eedd4bfa66f37ed1f7dc7a89 Mon Sep 17 00:00:00 2001
From: Robbe Sneyders
Date: Wed, 27 Sep 2023 15:29:38 +0200
Subject: [PATCH 1/3] Add cc-25m announcement to docs

---
 docs/announcements/CC_25M_community.md | 30 ++++++++++++++++++++++
 docs/announcements/CC_25M_press_release.md | 28 ++++++++++++++++++++
 docs/{ => components}/component_spec.md | 0
 docs/{ => components}/components.md | 0
 docs/{ => components}/custom_component.md | 0
 docs/{ => components}/generic_component.md | 0
 docs/overrides/main.html | 9 +++++++
 docs/stylesheets/extra.css | 5 ++++
 mkdocs.yml | 13 +++++++---
 9 files changed, 81 insertions(+), 4 deletions(-)
 create mode 100644 docs/announcements/CC_25M_community.md
 create mode 100644 docs/announcements/CC_25M_press_release.md
 rename docs/{ => components}/component_spec.md (100%)
 rename docs/{ => components}/components.md (100%)
 rename docs/{ => components}/custom_component.md (100%)
 rename docs/{ => components}/generic_component.md (100%)
 create mode 100644 docs/overrides/main.html
 create mode 100644 docs/stylesheets/extra.css

diff --git a/docs/announcements/CC_25M_community.md b/docs/announcements/CC_25M_community.md
new file mode 100644
index 000000000..fd0aab41f
--- /dev/null
+++ b/docs/announcements/CC_25M_community.md
@@ -0,0 +1,30 @@
+# 25 million Creative Commons image dataset released
+
+[Fondant](https://fondant.ai) is an open-source project that aims to simplify and speed up
+large-scale data processing by making containerized components reusable across pipelines &
+execution environments, shared within the community.
+
+A current challenge for generative AI is compliance with copyright laws. For this reason,
+Fondant has developed a data-processing pipeline to create a 500-million dataset of Creative
+Commons images to train a latent diffusion image generation model that respects copyright. Today,
+as a first step, we are releasing a 25-million sample dataset and invite the open source
+community to collaborate on further refinement steps.
+
+Fondant offers tools to download, explore and process the data. The current example pipeline
+includes a component for downloading the urls, a simple file type filter, one for downloading
+the images and one for deduplicating the urls. Additional processing components which could be
+contributed include, in order of priority:
+
+* Image-based deduplication
+* Visual quality / aesthetic quality estimation
+* Watermark detection
+* Not safe for work (NSFW) content detection
+* Face detection
+* Personal Identifiable Information (PII) detection
+* Text detection
+* AI generated image detection
+* Any components that you propose to develop
+
+The Fondant team also invites contributors to the core framework and is looking for feedback on
+the framework’s usability and for suggestions for improvement. Contact us at
+[info@fondant.ai](mailto:info@fondant.ai) and/or join our [discord](https://discord.gg/HnTdWhydGp).
\ No newline at end of file
diff --git a/docs/announcements/CC_25M_press_release.md b/docs/announcements/CC_25M_press_release.md
new file mode 100644
index 000000000..6468a6a94
--- /dev/null
+++ b/docs/announcements/CC_25M_press_release.md
@@ -0,0 +1,28 @@
+# 25 million Creative Commons image dataset released
+
+> Fondant is an open-source project that aims to enable compliant, large-scale processing in a
+> simple and cost-efficient way. As a first step, we have developed a pipeline to create a Creative Commons image dataset and are releasing a first 25 million sample with a call to action to help develop additional data processing pipelines.
+
+[Fondant](https://fondant.ai) simplifies and speeds up large-scale data processing by making
+self-contained pipeline components reusable across pipelines, infrastructures and shareable
+within the community. By offering a library of ready-to-use, off-the-shelf components and a
+standardized way of building and combining them with custom components, it significantly reduces
+the time required to build and maintain data processing infrastructure for generative AI
+applications in production.
+
+Supported by [Flanders innovation & entrepreneurship](https://vlaio.be) and European AI Service
+Provider [ML6](https://ml6.eu), Fondant developed a pipeline to create a
+[dataset](https://huggingface.co/datasets/fondantai/fondant-cc-25m) of over 500 million Creative
+Commons-licensed images from Common Crawl to train an image-generation model that respects
+copyright. Now we are releasing a first 25 million sample dataset with tools to download,
+explore and process the data. We are inviting developers and data enthusiasts to collaborate on
+large-scale data processing pipelines by building custom components for advanced filtering and
+captioning and to contribute to the core framework. We are also looking for feedback on the
+framework’s usability with suggestions for improvement. Contact us at
+[info@fondant.ai](mailto:info@fondant.ai) and/or join our [discord](https://discord.gg/HnTdWhydGp)
+to help realize this vision.
+
+[Creative Commons](https://creativecommons.org) is a non-profit organization which provides
+licenses that allow other creators to reuse one’s work under certain conditions.
+[Common Crawl](https://commoncrawl.org) is a non-profit organization which publishes monthly
+archives of the public Internet.
diff --git a/docs/component_spec.md b/docs/components/component_spec.md
similarity index 100%
rename from docs/component_spec.md
rename to docs/components/component_spec.md
diff --git a/docs/components.md b/docs/components/components.md
similarity index 100%
rename from docs/components.md
rename to docs/components/components.md
diff --git a/docs/custom_component.md b/docs/components/custom_component.md
similarity index 100%
rename from docs/custom_component.md
rename to docs/components/custom_component.md
diff --git a/docs/generic_component.md b/docs/components/generic_component.md
similarity index 100%
rename from docs/generic_component.md
rename to docs/components/generic_component.md
diff --git a/docs/overrides/main.html b/docs/overrides/main.html
new file mode 100644
index 000000000..4ec5b52bc
--- /dev/null
+++ b/docs/overrides/main.html
@@ -0,0 +1,9 @@
+{% extends "base.html" %}
+
+{% block announce %}
+

+ We released a 25 million Creative Commons image dataset!
+ Read more
+

+{% endblock %}
\ No newline at end of file
diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css
new file mode 100644
index 000000000..861d222cc
--- /dev/null
+++ b/docs/stylesheets/extra.css
@@ -0,0 +1,5 @@
+.md-banner {
+ background-color: OrangeRed; /* This setting prevents the Material header from
+ imposing into
+ the space of the banner! */
+}
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
index 610a0635d..050b06cdf 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -23,9 +23,12 @@ theme:
 toggle:
 icon: material/brightness-4
 name: Switch to light mode
+ custom_dir: docs/overrides
 features:
 - content.code.copy
 - navigation.tracking
+extra_css:
+ - stylesheets/extra.css
 nav:
 - Home: index.md
 - Getting Started: getting_started.md
@@ -35,13 +38,15 @@ nav:
 - Implement custom components: guides/implement_custom_components.md
 - Building a pipeline: pipeline.md
 - Components:
- - Components: components.md
- - Creating custom components: custom_component.md
- - Read / write components: generic_component.md
- - Component spec: component_spec.md
+ - Components: components/components.md
+ - Creating custom components: components/custom_component.md
+ - Read / write components: components/generic_component.md
+ - Component spec: components/component_spec.md
 - Data explorer: data_explorer.md
 - Infrastructure: infrastructure.md
 - Manifest: manifest.md
+ - Announcements:
+ - announcements/CC_25M_community.md

 plugins:
 - mkdocstrings

From 7713d663b773a6fed95af697c73356a929b986f1 Mon Sep 17 00:00:00 2001
From: Robbe Sneyders
Date: Wed, 27 Sep 2023 15:39:28 +0200
Subject: [PATCH 2/3] Remove prefix slash in banner url

---
 docs/overrides/main.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/overrides/main.html b/docs/overrides/main.html
index 4ec5b52bc..c506c3d98 100644
--- a/docs/overrides/main.html
+++ b/docs/overrides/main.html
@@ -3,7 +3,7 @@
 {% block announce %}

 We released a 25 million Creative Commons image dataset!
- Read more

 {% endblock %}
\ No newline at end of file

From 32d43efa9d39304b1b399d08f24351f903f24c93 Mon Sep 17 00:00:00 2001
From: Robbe Sneyders
Date: Wed, 27 Sep 2023 15:53:31 +0200
Subject: [PATCH 3/3] Replace notes on getting started page by admonition notes

---
 docs/getting_started.md | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/docs/getting_started.md b/docs/getting_started.md
index f11c54e75..9f17acd86 100644
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@@ -1,8 +1,14 @@
 # Getting started

-Note: To execute the pipeline locally, you must have docker compose, Python >=3.8 and Git installed on your system.
+!!! note

-Note: For Apple M1/M2 ship users: - Make sure that Docker uses linux/amd64 platform and not arm64. - In Docker Dashboards’ Settings
+ To execute the pipeline locally, you must have docker compose, Python >=3.8 and Git
+ installed on your system.
+
+!!! note
+
+ For Apple M1/M2 ship users: - Make sure that Docker uses linux/amd64 platform and not
+ arm64. - In Docker Dashboards’ Settings