Add CC-25M announcement to docs (#468)

This PR adds the CC-25M announcement to the docs website in two parts:
- The tech community announcement, which can be reached both from an announcement banner at the top of the page and from the navigation bar on the left.
- The press release, which is available directly by its URL but is not linked from the docs site.
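
The diff also moves four component pages under `docs/components/` and updates the `nav` entries in `mkdocs.yml` to match. A minimal sketch of how the new paths could be sanity-checked is shown here. It is not part of this commit: the helper name `missing_nav_targets` is hypothetical, and the `NAV` fragment is copied by hand from the `mkdocs.yml` hunk rather than parsed from the file.

```python
from pathlib import Path


def missing_nav_targets(nav, docs_dir):
    """Return nav page paths (ending in .md) that do not exist under docs_dir."""
    missing = []

    def walk(node):
        if isinstance(node, str):
            # A bare string is a page path relative to docs_dir.
            if node.endswith(".md") and not (Path(docs_dir) / node).is_file():
                missing.append(node)
        elif isinstance(node, dict):
            # A mapping is either "Title: page.md" or "Section: [children]".
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(nav)
    return missing


# Hand-copied fragment of the nav section touched by this commit (assumption:
# kept in sync manually; a real check would load mkdocs.yml instead).
NAV = [
    {"Components": [
        {"Components": "components/components.md"},
        {"Creating custom components": "components/custom_component.md"},
        {"Read / write components": "components/generic_component.md"},
        {"Component spec": "components/component_spec.md"},
    ]},
    {"Announcements": ["announcements/CC_25M_community.md"]},
]
```

Calling `missing_nav_targets(NAV, "docs")` from the repository root should return an empty list if every moved page is where the navigation expects it.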
ml6team · Sep 27, 2023 · 39436e7
1 parent 67c58db
commit 39436e7

Showing 10 changed files with 89 additions and 6 deletions.
diff --git a/docs/announcements/CC_25M_community.md b/docs/announcements/CC_25M_community.md
@@ -0,0 +1,30 @@
+# 25 million Creative Commons image dataset released
+
+[Fondant](https://fondant.ai) is an open-source project that aims to simplify and speed up 
+large-scale data processing by making containerized components reusable across pipelines & 
+execution environments, shared within the community.
+
+A current challenge for generative AI is compliance with copyright laws. For this reason, 
+Fondant has developed a data-processing pipeline to create a 500-million dataset of Creative 
+Commons images to train a latent diffusion image generation model that respects copyright. Today,
+as a first step, we are releasing a 25-million sample dataset and invite the open source 
+community to collaborate on further refinement steps.
+
+Fondant offers tools to download, explore and process the data. The current example pipeline 
+includes a component for downloading the urls, a simple file type filter, one for downloading 
+the images and one for deduplicating the urls. Additional processing components which could be 
+contributed include, in order of priority:
+
+* Image-based deduplication
+* Visual quality / aesthetic quality estimation
+* Watermark detection
+* Not safe for work (NSFW) content detection
+* Face detection
+* Personally Identifiable Information (PII) detection
+* Text detection
+* AI generated image detection
+* Any components that you propose to develop
+
+The Fondant team also invites contributors to the core framework and is looking for feedback on 
+the framework’s usability and for suggestions for improvement. Contact us at 
+[[email protected]](mailto:[email protected]) and/or join our [discord](https://discord.gg/HnTdWhydGp).
diff --git a/docs/announcements/CC_25M_press_release.md b/docs/announcements/CC_25M_press_release.md
@@ -0,0 +1,28 @@
+# 25 million Creative Commons image dataset released
+
+> Fondant is an open-source project that aims to enable compliant, large-scale processing in a
+> simple and cost-efficient way. As a first step, we have developed a pipeline to create a Creative Commons image dataset and are releasing a first 25 million sample with a call to action to help develop additional data processing pipelines.
+
+[Fondant](https://fondant.ai) simplifies and speeds up large-scale data processing by making
+self-contained pipeline components reusable across pipelines, infrastructures and shareable
+within the community. By offering a library of ready-to-use, off-the-shelf components and a
+standardized way of building and combining them with custom components, it significantly reduces
+the time required to build and maintain data processing infrastructure for generative AI
+applications in production.
+
+Supported by [Flanders innovation & entrepreneurship](https://vlaio.be) and European AI Service
+Provider [ML6](https://ml6.eu), Fondant developed a pipeline to create a
+[dataset](https://huggingface.co/datasets/fondantai/fondant-cc-25m) of over 500 million Creative
+Commons-licensed images from Common Crawl to train an image-generation model that respects
+copyright. Now we are releasing a first 25 million sample dataset with tools to download,
+explore and process the data. We are inviting developers and data enthusiasts to collaborate on
+large-scale data processing pipelines by building custom components for advanced filtering and
+captioning and to contribute to the core framework. We are also looking for feedback on the
+framework’s usability with suggestions for improvement. Contact us at
+[[email protected]](mailto:[email protected]) and/or join our [discord](https://discord.gg/HnTdWhydGp)
+to help realize this vision.
+
+[Creative Commons](https://creativecommons.org) is a non-profit organization which provides
+licenses that allow other creators to reuse one’s work under certain conditions.
+[Common Crawl](https://commoncrawl.org) is a non-profit organization which publishes monthly
+archives of the public Internet.
diff --git a/docs/component_spec.md → docs/components/component_spec.md b/docs/component_spec.md → docs/components/component_spec.md
diff --git a/docs/components.md → docs/components/components.md b/docs/components.md → docs/components/components.md
diff --git a/docs/custom_component.md → docs/components/custom_component.md b/docs/custom_component.md → docs/components/custom_component.md
diff --git a/docs/generic_component.md → docs/components/generic_component.md b/docs/generic_component.md → docs/components/generic_component.md
diff --git a/docs/getting_started.md b/docs/getting_started.md
@@ -1,8 +1,14 @@
 # Getting started
 
-Note: To execute the pipeline locally, you must have docker compose, Python >=3.8 and Git installed on your system.
+!!! note
 
-Note: For Apple M1/M2 ship users: - Make sure that Docker uses linux/amd64 platform and not arm64. - In Docker Dashboards’ Settings<Features in development, make sure to uncheck Use containerid for pulling and storing images .
+    To execute the pipeline locally, you must have docker compose, Python >=3.8 and Git 
+    installed on your system.
+
+!!! note
+
+    For Apple M1/M2 chip users: - Make sure that Docker uses the linux/amd64 platform and not 
+    arm64. - In Docker Dashboard’s Settings > Features in development, make sure to uncheck Use containerd for pulling and storing images.
 
 For demonstration purposes, we provide sample pipelines in the Fondant GitHub repository. A great starting point is the pipeline that loads and filters creative commons images. To follow along with the upcoming instructions, you can clone the [repository](https://github.com/ml6team/fondant) and navigate to the `examples/pipelines/filter-cc-25m` folder.
 

diff --git a/docs/overrides/main.html b/docs/overrides/main.html
@@ -0,0 +1,9 @@
+{% extends "base.html" %}
+
+{% block announce %}
+    <p style="text-align: center">
+        We released a 25 million Creative Commons image dataset!
+        <a href="announcements/CC_25M_community/"
+           style="color: white; text-decoration: underline">Read more</a>
+    </p>
+{% endblock %}
diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css
@@ -0,0 +1,5 @@
+.md-banner {
+    background-color: OrangeRed;          /* This setting prevents the Material header from
+    imposing into
+     the space of the banner! */
+}
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -23,9 +23,12 @@ theme:
       toggle:
         icon: material/brightness-4
         name: Switch to light mode
+  custom_dir: docs/overrides
   features:
     - content.code.copy
     - navigation.tracking
+extra_css:
+  - stylesheets/extra.css
 nav:
   - Home: index.md
   - Getting Started: getting_started.md
@@ -35,13 +38,15 @@ nav:
     - Implement custom components: guides/implement_custom_components.md
   - Building a pipeline: pipeline.md
   - Components:
-    - Components: components.md
-    - Creating custom components: custom_component.md
-    - Read / write components: generic_component.md
-    - Component spec: component_spec.md
+    - Components: components/components.md
+    - Creating custom components: components/custom_component.md
+    - Read / write components: components/generic_component.md
+    - Component spec: components/component_spec.md
   - Data explorer: data_explorer.md
   - Infrastructure: infrastructure.md
   - Manifest: manifest.md
+  - Announcements:
+      - announcements/CC_25M_community.md
 
 plugins:
   - mkdocstrings