Add CC-25M announcement to docs (#468)
This PR adds the CC-25M announcement to the docs website in two parts:

- The tech community announcement, which can be reached both from an
announcement banner at the top of the page, and the navigation bar on
the left.
- The press release, which is available by its url directly, but not
linked from the docs site.
RobbeSneyders authored Sep 27, 2023
1 parent 67c58db commit 39436e7
Showing 10 changed files with 89 additions and 6 deletions.
30 changes: 30 additions & 0 deletions docs/announcements/CC_25M_community.md
@@ -0,0 +1,30 @@
# 25 million Creative Commons image dataset released

[Fondant](https://fondant.ai) is an open-source project that aims to simplify and speed up
large-scale data processing by making containerized components reusable across pipelines and
execution environments, and shareable within the community.

A current challenge for generative AI is compliance with copyright laws. For this reason,
Fondant has developed a data-processing pipeline to create a dataset of 500 million Creative
Commons images to train a latent diffusion image generation model that respects copyright. Today,
as a first step, we are releasing a 25-million sample dataset and invite the open source
community to collaborate on further refinement steps.

Fondant offers tools to download, explore and process the data. The current example pipeline
includes components for downloading the URLs, filtering them by file type, downloading the
images and deduplicating the URLs. Additional processing components that could be
contributed include, in order of priority:

* Image-based deduplication
* Visual quality / aesthetic quality estimation
* Watermark detection
* Not safe for work (NSFW) content detection
* Face detection
* Personally Identifiable Information (PII) detection
* Text detection
* AI-generated image detection
* Any components that you propose to develop
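
As an illustration only, the file type filter and URL deduplication steps mentioned above can be sketched in plain Python. This is a minimal sketch and not Fondant's component API: the function names and the set of allowed extensions are assumptions made for the example.

```python
import os
from urllib.parse import urlparse

# Assumed set of image extensions; the real pipeline may accept a different set.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def has_allowed_extension(url, allowed=ALLOWED_EXTENSIONS):
    """Return True if the URL path ends in one of the allowed extensions."""
    path = urlparse(url).path  # drops the query string and fragment
    extension = os.path.splitext(path)[1].lower()
    return extension in allowed


def deduplicate(urls):
    """Remove exact duplicate URLs while keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


def filter_and_deduplicate(urls):
    """Hypothetical helper: keep allowed file types, then drop duplicate URLs."""
    return deduplicate([url for url in urls if has_allowed_extension(url)])
```

Exact-match deduplication on URLs is cheap but does not catch the same image served from different URLs, which is why image-based deduplication appears at the top of the list above.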

The Fondant team also invites contributors to the core framework and is looking for feedback on
the framework’s usability and for suggestions for improvement. Contact us at
[[email protected]](mailto:[email protected]) and/or join our [Discord](https://discord.gg/HnTdWhydGp).
28 changes: 28 additions & 0 deletions docs/announcements/CC_25M_press_release.md
@@ -0,0 +1,28 @@
# 25 million Creative Commons image dataset released

> Fondant is an open-source project that aims to enable compliant, large-scale processing in a
> simple and cost-efficient way. As a first step, we have developed a pipeline to create a
> Creative Commons image dataset and are releasing a first sample of 25 million images, with a
> call to action to help develop additional data processing pipelines.

[Fondant](https://fondant.ai) simplifies and speeds up large-scale data processing by making
self-contained pipeline components reusable across pipelines and infrastructures, and shareable
within the community. By offering a library of ready-to-use, off-the-shelf components and a
standardized way of building and combining them with custom components, it significantly reduces
the time required to build and maintain data processing infrastructure for generative AI
applications in production.

Supported by [Flanders innovation & entrepreneurship](https://vlaio.be) and European AI Service
Provider [ML6](https://ml6.eu), Fondant developed a pipeline to create a
[dataset](https://huggingface.co/datasets/fondantai/fondant-cc-25m) of over 500 million Creative
Commons-licensed images from Common Crawl to train an image-generation model that respects
copyright. Now we are releasing a first 25 million sample dataset with tools to download,
explore and process the data. We are inviting developers and data enthusiasts to collaborate on
large-scale data processing pipelines by building custom components for advanced filtering and
captioning, and to contribute to the core framework. We are also looking for feedback on the
framework’s usability and for suggestions for improvement. Contact us at
[[email protected]](mailto:[email protected]) and/or join our [Discord](https://discord.gg/HnTdWhydGp)
to help realize this vision.

[Creative Commons](https://creativecommons.org) is a non-profit organization which provides
licenses that allow other creators to reuse one’s work under certain conditions.
[Common Crawl](https://commoncrawl.org) is a non-profit organization which publishes monthly
archives of the public Internet.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
10 changes: 8 additions & 2 deletions docs/getting_started.md
@@ -1,8 +1,14 @@
# Getting started

Note: To execute the pipeline locally, you must have docker compose, Python >=3.8 and Git installed on your system.
!!! note

Note: For Apple M1/M2 chip users: - Make sure that Docker uses the linux/amd64 platform and not arm64. - In Docker Dashboard's Settings > Features in development, make sure to uncheck "Use containerd for pulling and storing images".
To execute the pipeline locally, you must have docker compose, Python >=3.8 and Git
installed on your system.

!!! note

For Apple M1/M2 chip users: - Make sure that Docker uses the linux/amd64 platform and not
arm64. - In Docker Dashboard's Settings > Features in development, make sure to uncheck "Use containerd for pulling and storing images".

For demonstration purposes, we provide sample pipelines in the Fondant GitHub repository. A great starting point is the pipeline that loads and filters creative commons images. To follow along with the upcoming instructions, you can clone the [repository](https://github.com/ml6team/fondant) and navigate to the `examples/pipelines/filter-cc-25m` folder.

9 changes: 9 additions & 0 deletions docs/overrides/main.html
@@ -0,0 +1,9 @@
{% extends "base.html" %}

{% block announce %}
<p style="text-align: center">
We released a 25 million Creative Commons image dataset!
<a href="announcements/CC_25M_community/"
style="color: white; text-decoration: underline">Read more</a>
</p>
{% endblock %}
5 changes: 5 additions & 0 deletions docs/stylesheets/extra.css
@@ -0,0 +1,5 @@
.md-banner {
background-color: OrangeRed; /* This setting prevents the Material header from
imposing into
the space of the banner! */
}
13 changes: 9 additions & 4 deletions mkdocs.yml
@@ -23,9 +23,12 @@ theme:
toggle:
icon: material/brightness-4
name: Switch to light mode
custom_dir: docs/overrides
features:
- content.code.copy
- navigation.tracking
extra_css:
- stylesheets/extra.css
nav:
- Home: index.md
- Getting Started: getting_started.md
@@ -35,13 +38,15 @@ nav:
- Implement custom components: guides/implement_custom_components.md
- Building a pipeline: pipeline.md
- Components:
- Components: components.md
- Creating custom components: custom_component.md
- Read / write components: generic_component.md
- Component spec: component_spec.md
- Components: components/components.md
- Creating custom components: components/custom_component.md
- Read / write components: components/generic_component.md
- Component spec: components/component_spec.md
- Data explorer: data_explorer.md
- Infrastructure: infrastructure.md
- Manifest: manifest.md
- Announcements:
- announcements/CC_25M_community.md

plugins:
- mkdocstrings
