Add CC-25M announcement to docs #468

Merged
merged 3 commits into from
Sep 27, 2023
30 changes: 30 additions & 0 deletions docs/announcements/CC_25M_community.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# 25 million Creative Commons image dataset released

[Fondant](https://fondant.ai) is an open-source project that aims to simplify and speed up
large-scale data processing by making containerized components reusable across pipelines &
execution environments, shared within the community.

A current challenge for generative AI is compliance with copyright laws. For this reason,
Fondant has developed a data-processing pipeline to create a dataset of 500 million Creative
Commons images to train a latent diffusion image generation model that respects copyright. Today,
as a first step, we are releasing a sample dataset of 25 million images and invite the open-source
community to collaborate on further refinement steps.

Fondant offers tools to download, explore and process the data. The current example pipeline
includes components for downloading the URLs, filtering by file type, downloading the images,
and deduplicating the URLs. Additional processing components that could be
contributed include, in order of priority:

* Image-based deduplication
* Visual quality / aesthetic quality estimation
* Watermark detection
* Not safe for work (NSFW) content detection
* Face detection
* Personally Identifiable Information (PII) detection
* Text detection
* AI-generated image detection
* Any components that you propose to develop
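
As a rough illustration of the two URL-level steps in the example pipeline (the file type filter and URL deduplication), here is a minimal plain-Python sketch. It does not use the Fondant component API. The allow-list `ALLOWED_EXTENSIONS` and the function names are hypothetical, and filtering on the URL path's extension is an assumption about how such a filter could work.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list: the file types accepted by the real filter are an assumption.
ALLOWED_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp")


def has_allowed_extension(url: str) -> bool:
    """Return True if the URL path ends with an allowed image extension."""
    # Only the path is inspected, so query strings such as "?size=large" are ignored.
    path = urlsplit(url).path.lower()
    return path.endswith(ALLOWED_EXTENSIONS)


def deduplicate(urls):
    """Drop exact duplicate URLs while keeping first-seen order."""
    return list(dict.fromkeys(urls))


def filter_urls(urls):
    """Apply the file type filter, then deduplicate the remaining URLs."""
    return deduplicate(url for url in urls if has_allowed_extension(url))
```

Calling `filter_urls` on a list of crawled URLs returns the image URLs in first-seen order with exact duplicates removed. Image-based deduplication, the first item in the list above, would need to compare image content rather than URL strings.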

The Fondant team also invites contributors to the core framework and is looking for feedback on
the framework’s usability and for suggestions for improvement. Contact us at
[[email protected]](mailto:[email protected]) and/or join our [discord](https://discord.gg/HnTdWhydGp).
28 changes: 28 additions & 0 deletions docs/announcements/CC_25M_press_release.md
@@ -0,0 +1,28 @@
# 25 million Creative Commons image dataset released

> Fondant is an open-source project that aims to enable compliant, large-scale data processing in a
> simple and cost-efficient way. As a first step, we have developed a pipeline to create a Creative
> Commons image dataset and are releasing a first sample of 25 million images, with a call to action
> to help develop additional data processing pipelines.

[Fondant](https://fondant.ai) simplifies and speeds up large-scale data processing by making
self-contained pipeline components reusable across pipelines and infrastructures, and shareable
within the community. By offering a library of ready-to-use, off-the-shelf components and a
standardized way of building and combining them with custom components, it significantly reduces
the time required to build and maintain data processing infrastructure for generative AI
applications in production.

Supported by [Flanders innovation & entrepreneurship](https://vlaio.be) and European AI Service
Provider [ML6](https://ml6.eu), Fondant developed a pipeline to create a
[dataset](https://huggingface.co/datasets/fondantai/fondant-cc-25m) of over 500 million Creative
Commons-licensed images from Common Crawl to train an image-generation model that respects
copyright. Now we are releasing a first 25 million sample dataset with tools to download,
explore and process the data. We are inviting developers and data enthusiasts to collaborate on
large-scale data processing pipelines by building custom components for advanced filtering and
captioning, and to contribute to the core framework. We are also looking for feedback on the
framework’s usability and suggestions for improvement. Contact us at
[[email protected]](mailto:[email protected]) and/or join our [discord](https://discord.gg/HnTdWhydGp)
to help realize this vision.

[Creative Commons](https://creativecommons.org) is a non-profit organization which provides
licenses that allow other creators to reuse one’s work under certain conditions.
[Common Crawl](https://commoncrawl.org) is a non-profit organization which publishes monthly
archives of the public Internet.
Four files renamed without changes.
10 changes: 8 additions & 2 deletions docs/getting_started.md
@@ -1,8 +1,14 @@
# Getting started

Note: To execute the pipeline locally, you must have docker compose, Python >=3.8 and Git installed on your system.
!!! note

Note: For Apple M1/M2 chip users: - Make sure that Docker uses the linux/amd64 platform and not arm64. - In the Docker Dashboard’s Settings > Features in development, make sure to uncheck `Use containerd for pulling and storing images`.
To execute the pipeline locally, you must have docker compose, Python >=3.8 and Git
installed on your system.

!!! note

For Apple M1/M2 chip users:

- Make sure that Docker uses the linux/amd64 platform and not arm64.
- In the Docker Dashboard’s Settings > Features in development, make sure to uncheck
  `Use containerd for pulling and storing images`.
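
The prerequisites listed in the notes above could be checked with a short script before running a pipeline. This is a minimal sketch and not part of Fondant: the function name `check_prerequisites` and the result keys are hypothetical, and it only checks that the tools are on the `PATH` and that `docker compose version` exits successfully.

```python
import shutil
import subprocess
import sys


def check_prerequisites():
    """Return a mapping of prerequisite name to whether it appears to be available."""
    results = {
        "python>=3.8": sys.version_info >= (3, 8),
        "git": shutil.which("git") is not None,
        "docker": shutil.which("docker") is not None,
    }

    # `docker compose version` exits with code 0 when the Compose plugin is installed.
    compose_ok = False
    if results["docker"]:
        try:
            completed = subprocess.run(
                ["docker", "compose", "version"],
                capture_output=True,
                timeout=30,
            )
            compose_ok = completed.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            compose_ok = False
    results["docker compose"] = compose_ok
    return results


if __name__ == "__main__":
    for name, available in check_prerequisites().items():
        print(f"{name}: {'ok' if available else 'missing'}")
```

The script reports each prerequisite as `ok` or `missing`. It does not verify the Docker platform setting mentioned for Apple M1/M2 machines.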

For demonstration purposes, we provide sample pipelines in the Fondant GitHub repository. A great starting point is the pipeline that loads and filters Creative Commons images. To follow along with the upcoming instructions, you can clone the [repository](https://github.com/ml6team/fondant) and navigate to the `examples/pipelines/filter-cc-25m` folder.

9 changes: 9 additions & 0 deletions docs/overrides/main.html
@@ -0,0 +1,9 @@
{% extends "base.html" %}

{% block announce %}
<p style="text-align: center">
We released a 25 million Creative Commons image dataset!
<a href="announcements/CC_25M_community/"
style="color: white; text-decoration: underline">Read more</a>
</p>
{% endblock %}
5 changes: 5 additions & 0 deletions docs/stylesheets/extra.css
@@ -0,0 +1,5 @@
.md-banner {
  /* This setting prevents the Material header from intruding into
     the space of the banner! */
  background-color: OrangeRed;
}
13 changes: 9 additions & 4 deletions mkdocs.yml
@@ -23,9 +23,12 @@ theme:
toggle:
icon: material/brightness-4
name: Switch to light mode
custom_dir: docs/overrides
features:
- content.code.copy
- navigation.tracking
extra_css:
- stylesheets/extra.css
nav:
- Home: index.md
- Getting Started: getting_started.md
@@ -35,13 +38,15 @@ nav:
- Implement custom components: guides/implement_custom_components.md
- Building a pipeline: pipeline.md
- Components:
- Components: components.md
- Creating custom components: custom_component.md
- Read / write components: generic_component.md
- Component spec: component_spec.md
- Components: components/components.md
- Creating custom components: components/custom_component.md
- Read / write components: components/generic_component.md
- Component spec: components/component_spec.md
- Data explorer: data_explorer.md
- Infrastructure: infrastructure.md
- Manifest: manifest.md
- Announcements:
- announcements/CC_25M_community.md

plugins:
- mkdocstrings