Simplified version of a Common Crawl fetcher. This is yet another attempt to make it easy to extract files from Common Crawl.
Warning!!! THIS IS STILL ALPHA! There will be bugs and things will change without warning.
Make it easy to extract or refetch a smallish sample (a couple of million files, say) of complete files from Common Crawl data. My primary interest is in binary files, so there's an emphasis on being able to sample and extract files by mime type. If you're building large language models, or if you want to process ALL of Common Crawl, this project is not for you.
This section explains enough of how Common Crawl is structured to make use of this fetcher.
Common Crawl crawls a large portion of the internet each month (roughly 3 billion URLs).
Common Crawl packages the raw files inside WARC files, each of which is gzipped and then
appended to other gzipped WARC files to create compound WARC files stored in AWS's S3 buckets.
For a single crawl, there are ~90,000 compound WARC files, each ~1 GB in size.
Uncompressed, a single month's crawl can weigh in at ~400TiB.
Note: Common Crawl truncates files at 1MB. Researchers who want complete files must refetch truncated files from the original URLs.
The index files and the compound WARC files are accessible via S3 (s3://commoncrawl/) and HTTPS (https://data.commoncrawl.org/).
Users may extract an individual gzipped WARC file from a compound WARC file with an HTTP/S3 range query.
To ease access to files, Common Crawl publishes index files, 300 per crawl, totalling ~300GB compressed (~1TB uncompressed). Each line of these index files stores information about a fetch of a single URL:
- the URL in a format optimized for sorting by host and domain
- a timestamp
- a JSON object
This is an example of a line:
app,web,hurmaninvesterarimli)/89132/31224.html 20230208165905 {"url": "https://hurmaninvesterarimli.web.app/89132/31224.html", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "AQTOHAB777KB7F6E4GRMPNY2IGLGUA2K", "length": "6615", "offset": "325152484", "filename": "crawl-data/CC-MAIN-2023-06/segments/1674764500837.65/warc/CC-MAIN-20230208155417-20230208185417-00224.warc.gz", "charset": "UTF-8", "languages": "slk,eng,spa"}
The JSON object above, pretty-printed:
{
  "charset": "UTF-8",
  "digest": "AQTOHAB777KB7F6E4GRMPNY2IGLGUA2K",
  "filename": "crawl-data/CC-MAIN-2023-06/segments/1674764500837.65/warc/CC-MAIN-20230208155417-20230208185417-00224.warc.gz",
  "languages": "slk,eng,spa",
  "length": "6615",
  "mime": "text/html",
  "mime-detected": "text/html",
  "offset": "325152484",
  "status": "200",
  "url": "https://hurmaninvesterarimli.web.app/89132/31224.html"
}
- charset -- what the http header claimed the charset was (???)
- digest -- the BASE32-encoded SHA1 of the file
- filename -- the name of the compound WARC file, as stored in S3, which contains this individual file's WARC
- languages -- languages as identified by (???)
- length -- length of the compressed WARC file
- mime -- what the http header claimed was the content-type
- mime-detected -- mime type as detected by Apache Tika
- offset -- offset in the compound WARC where this individual WARC file starts
- status -- the http status returned during the fetch
- redirect -- (not included in the example) if a 302, the redirect URL is stored
- truncated -- (not included in the example) if the file was truncated, this is populated with a cause. The most common cause is length.
- url -- the target URL for this file
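To make the filename, offset, and length mechanics concrete, here is an illustrative Java sketch (it is not part of this fetcher) that pulls the single gzipped WARC record for the example index entry above with an HTTP range request and decompresses it. The filename, offset, and length values are copied from that entry; the class name and everything else is hypothetical.

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.zip.GZIPInputStream;

public class SingleRecordSketch {
    public static void main(String[] args) throws Exception {
        // values taken from the example index entry above
        String filename = "crawl-data/CC-MAIN-2023-06/segments/1674764500837.65/warc/" +
                "CC-MAIN-20230208155417-20230208185417-00224.warc.gz";
        long offset = 325152484L;
        long length = 6615L;

        // the Range header is inclusive: last byte = offset + length - 1
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://data.commoncrawl.org/" + filename))
                .header("Range", "bytes=" + offset + "-" + (offset + length - 1))
                .build();

        HttpResponse<byte[]> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofByteArray());

        // the body is one complete gzip member: the individual WARC record
        try (InputStream warc = new GZIPInputStream(
                new ByteArrayInputStream(response.body()))) {
            System.out.println("status=" + response.statusCode()
                    + " uncompressed bytes=" + warc.readAllBytes().length);
        }
    }
}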
Note: AWS does not allow anonymous access to Common Crawl's S3 bucket -- see access-the-data. If you'd prefer to use S3 instead of HTTPS, see ConfiguringTheFetch.
Warning!!! For each month's crawl, this process downloads ~300GB of index data via HTTPS or S3. If you are running locally and you'd prefer to download the index data only once, see AdvancedScenarios.
At a minimum, the user defines a few parameters in a JSON file (minimal-config, for example).
In the following, we have given a minimal example for extracting all files identified as mp4 files from two crawls: CC-MAIN-2023-06 and CC-MAIN-2022-49. See the CommonCrawl blog for the names of crawls.
The indices element contains paths to the index lists for target crawls as described in step 1. The recordSelector tells the fetcher which records to process. The following requires that the URL have a status of 200 and that the mime_detected value is video/mp4 or video/quicktime.
We have included the optional dryRun parameter in the example. When dryRun is true, the fetcher only reports the files that it would have extracted from Common Crawl had dryRun been set to false.
There are many configuration options -- see ConfiguringTheFetch for more details.
{
  "dryRun": true,
  "indices": {
    "paths": [
      "crawl-data/CC-MAIN-2023-06/cc-index.paths.gz",
      "crawl-data/CC-MAIN-2022-49/cc-index.paths.gz"
    ]
  },
  "recordSelector": {
    "must": {
      "status": [
        {
          "match": "200"
        }
      ]
    },
    "should": {
      "mime_detected": [
        {
          "match": "video/mp4"
        },
        {
          "match": "video/quicktime"
        }
      ]
    }
  }
}
Users must have Java (>= 17) installed. To check your version: java -version.
Get the latest released jar from github.
The command line:
java -jar commoncrawl-fetcher-lite-X.Y.Z.jar minimal-config.json
This fetcher will extract non-truncated mp4 files from CommonCrawl's compound WARC files and put the mp4 files in the docs/ directory. The fetcher will write the URLs for the mp4 files that CommonCrawl truncated to a file named logs/urls-truncated.csv for later re-fetching.
There are many options for this. The simplest might be wget:
wget -i logs/urls-truncated.csv
See also options for curl, such as this article.
The Nutch project may be excessive, but it is extremely scalable and robust (it powers Common Crawl!), and it records the WARC information for each fetch.
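If you'd rather stay in Java, the following rough sketch (not part of this project) refetches each URL. It assumes logs/urls-truncated.csv contains one URL per line; the output directory and naming scheme are made up for illustration.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class RefetchSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        List<String> urls = Files.readAllLines(Path.of("logs/urls-truncated.csv"));
        Path outDir = Files.createDirectories(Path.of("refetched"));
        int i = 0;
        for (String url : urls) {
            if (url.isBlank()) {
                continue;
            }
            try {
                HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
                HttpResponse<byte[]> response =
                        client.send(request, HttpResponse.BodyHandlers.ofByteArray());
                // naive naming scheme -- a real refetcher would mirror the fetcher's
                // file+directory naming strategy
                Files.write(outDir.resolve("refetched-" + (i++) + ".bin"), response.body());
            } catch (Exception e) {
                System.err.println("failed to refetch " + url + ": " + e.getMessage());
            }
        }
    }
}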
For other scenarios, see AdvancedScenarios.
For those who want to build the latest version, you'll need a Java JDK (>= 17) and recent versions of git and maven installed.
git clone https://github.com/tballison/commoncrawl-fetcher-lite
cd commoncrawl-fetcher-lite
mvn install
The jar file will be built in target/.
This is intended to be light-weight. The current design does not store
provenance information for the URLs nor for the refetches.
The goal is simply to select files for extraction and to
extract them or record the URLs for truncated files.
Warning!!! AWS throttles download rates, especially for HTTP requests. This code is designed with back-off logic for the default HTTP fetcher so that it will pause if it gets a throttle warning from AWS. While this code is multi-threaded, it is not useful to run more than about 3 threads when fetching from HTTP. When running this code on AWS instances pulling from S3, I've run 50 threads without receiving throttling complaints.
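The back-off idea is roughly the following (an illustrative sketch, not the project's actual code): when AWS answers with an HTTP 503, sleep and retry with an exponentially growing pause.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackoffSketch {
    // illustrative only -- retry a GET a few times, backing off on 503 (throttling)
    static byte[] fetchWithBackoff(HttpClient client, String url) throws Exception {
        long sleepMs = 1000;
        for (int attempt = 0; attempt < 5; attempt++) {
            HttpResponse<byte[]> response = client.send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofByteArray());
            if (response.statusCode() == 503) {
                Thread.sleep(sleepMs);
                sleepMs *= 2;  // exponential back-off
                continue;
            }
            return response.body();
        }
        throw new RuntimeException("giving up on " + url + " after repeated throttling");
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = fetchWithBackoff(HttpClient.newHttpClient(),
                "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/cc-index.paths.gz");
        System.out.println("fetched " + bytes.length + " bytes");
    }
}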
Over time, Common Crawl has added new fields. If you're pulling from old crawls, the fields that you need may not exist in the older index files. For example, mime-detected was added in the CC-MAIN-2017-22 crawl. If a value is null for a requested field, the fetcher logs a warning about the null value. If the field is in a must_not clause, that clause will not be considered; if the field is in a must or should clause, that record will not be selected.
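As a sketch of that null-field behavior (illustrative only -- none of the names below come from this project's API):

// illustrative sketch of the null-field semantics described above
public class NullFieldSketch {

    enum ClauseType { MUST, MUST_NOT, SHOULD }

    // decide whether a record can still be selected when the field a clause
    // refers to is null in the index entry
    static boolean clauseAllowsRecord(ClauseType clauseType, String fieldName, String fieldValue) {
        if (fieldValue == null) {
            System.err.println("WARN: null value for field '" + fieldName + "'");
            // must_not clauses are skipped; must/should clauses cannot match,
            // so the record is not selected
            return clauseType == ClauseType.MUST_NOT;
        }
        return true; // non-null values are evaluated against the clause as usual
    }

    public static void main(String[] args) {
        // e.g., a pre-2017 index entry with no mime-detected field
        System.out.println(clauseAllowsRecord(ClauseType.MUST, "mime-detected", null));      // false
        System.out.println(clauseAllowsRecord(ClauseType.MUST_NOT, "mime-detected", null));  // true
    }
}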
Please open bug reports and issues to help prioritize future development.
Some features that could be added are included below.
- Add more features to the record selector -- handle numeric values (e.g. int) and allow for gt, gte and range options
- Add a refetcher that stores the refetched files with the same policy that the fetcher stores files (same file+directory naming strategy) and potentially writes a table of URLs and fetched digests.
- Allow a different extracted file naming scheme -- CC's sha1 or any encoding+digest combination?
- Allow extraction of truncated files.
- Allow counting of mimes or other features -- not just those selected.
- Store fetch status and refetch status in an actual database -- against the design goals of this project. LOL...
- Use parquet files (with or without duckdb and httpfs) instead of the current json-based index files (see this post)
- ...
The release is currently triggered by pushing a tag starting with 'v'.
git tag -a "v1.0.0-alpha1" -m "v1.0.0-alpha1-release" && git push origin v1.0.0-alpha1
Then navigate to Github and actually make the release.
- CommonCrawlDocumentDownload (my initial inspiration)
- SimpleCommonCrawlExtractor (my first attempt)
- commoncrawl-fetcher module in file-observatory (second attempt)
If you're AWS-savvy, you can more simply select the records for extraction via Athena; see this Common Crawl post on how to run queries against the indices.