Add CC-25M announcement to docs (#468)
This PR adds the CC-25M announcement to the docs website in two parts:
- The tech community announcement, which can be reached both from an announcement banner at the top of the page and from the navigation bar on the left.
- The press release, which is available directly by its URL but is not linked from the docs site.
1 parent 67c58db, commit 39436e7
Showing 10 changed files with 89 additions and 6 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
# 25 million Creative Commons image dataset released | ||
|
||
[Fondant](https://fondant.ai) is an open-source project that aims to simplify and speed up | ||
large-scale data processing by making containerized components reusable across pipelines & | ||
execution environments, shared within the community. | ||
|
||
A current challenge for generative AI is compliance with copyright laws. For this reason, | ||
Fondant has developed a data-processing pipeline to create a dataset of 500 million Creative | ||
Commons images to train a latent diffusion image generation model that respects copyright. Today, | ||
as a first step, we are releasing a 25-million-image sample dataset and invite the open source | ||
community to collaborate on further refinement steps. | ||
|
||
Fondant offers tools to download, explore and process the data. The current example pipeline | ||
includes components for downloading the URLs, filtering them by file type, downloading the | ||
images and deduplicating the URLs. Additional processing components which could be | ||
contributed include, in order of priority: | ||
|
||
* Image-based deduplication | ||
* Visual quality / aesthetic quality estimation | ||
* Watermark detection | ||
* Not safe for work (NSFW) content detection | ||
* Face detection | ||
* Personally Identifiable Information (PII) detection | ||
* Text detection | ||
* AI generated image detection | ||
* Any components that you propose to develop | ||
|
||
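As an illustration of what the file type filter and URL deduplication steps could look like, here is a minimal sketch. It is not taken from the Fondant codebase: the function names and the extension allow-list are hypothetical, and the real components may apply different rules.

```python
from urllib.parse import urlparse

# Hypothetical allow-list of image extensions (an assumption for this sketch).
ALLOWED_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp")


def has_allowed_extension(url: str) -> bool:
    """Return True if the URL path ends with an allowed image extension."""
    # Only inspect the path, so query strings such as "?size=large" are ignored.
    path = urlparse(url).path.lower()
    return path.endswith(ALLOWED_EXTENSIONS)


def filter_and_deduplicate(urls):
    """Keep URLs with an allowed extension and drop exact duplicates.

    The first occurrence of each URL is kept and input order is preserved.
    """
    seen = set()
    kept = []
    for url in urls:
        if not has_allowed_extension(url):
            continue
        if url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept


if __name__ == "__main__":
    sample = [
        "https://example.com/a.jpg",
        "https://example.com/a.jpg",
        "https://example.com/b.pdf",
        "https://example.com/c.PNG?x=1",
    ]
    for url in filter_and_deduplicate(sample):
        print(url)
```

This only catches exact URL duplicates. The same image served from two different URLs would pass through, which is why image-based deduplication is listed as a separate component.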
The Fondant team also invites contributors to the core framework and is looking for feedback on | ||
the framework’s usability and for suggestions for improvement. Contact us at | ||
[[email protected]](mailto:[email protected]) and/or join our [discord](https://discord.gg/HnTdWhydGp). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# 25 million Creative Commons image dataset released | ||
|
||
> Fondant is an open-source project that aims to enable compliant, large-scale processing in a | ||
> simple and cost-efficient way. As a first step, we have developed a pipeline to create a Creative Commons image dataset and are releasing a first 25 million sample with a call to action to help develop additional data processing pipelines. | ||
[Fondant](https://fondant.ai) simplifies and speeds up large-scale data processing by making | ||
self-contained pipeline components reusable across pipelines and infrastructures and shareable | ||
within the community. By offering a library of ready-to-use, off-the-shelf components and a | ||
standardized way of building and combining them with custom components, it significantly reduces | ||
the time required to build and maintain data processing infrastructure for generative AI | ||
applications in production. | ||
|
||
Supported by [Flanders innovation & entrepreneurship](https://vlaio.be) and European AI Service | ||
Provider [ML6](https://ml6.eu), Fondant developed a pipeline to create a | ||
[dataset](https://huggingface.co/datasets/fondantai/fondant-cc-25m) of over 500 million Creative | ||
Commons-licensed images from Common Crawl to train an image-generation model that respects | ||
copyright. Now we are releasing a first 25 million sample dataset with tools to download, | ||
explore and process the data. We are inviting developers and data enthusiasts to collaborate on | ||
large-scale data processing pipelines by building custom components for advanced filtering and | ||
captioning and to contribute to the core framework. We are also looking for feedback on the | ||
framework’s usability and for suggestions for improvement. Contact us at | ||
[[email protected]](mailto:[email protected]) and/or join our [discord](https://discord.gg/HnTdWhydGp) | ||
to help realize this vision. | ||
|
||
[Creative Commons](https://creativecommons.org) is a non-profit organization which provides | ||
licenses that allow other creators to reuse one’s work under certain conditions. | ||
[Common Crawl](https://commoncrawl.org) is a non-profit organization which publishes monthly | ||
archives of the public Internet. |
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
{% extends "base.html" %} | ||
|
||
{% block announce %} | ||
<p style="text-align: center"> | ||
We released a 25 million Creative Commons image dataset! | ||
<a href="announcements/CC_25M_community/" | ||
style="color: white; text-decoration: underline">Read more</a> | ||
</p> | ||
{% endblock %} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
.md-banner { | ||
background-color: OrangeRed; /* This setting prevents the Material header from | ||
imposing into | ||
the space of the banner! */ | ||
} |