S3-FileSystem is an implementation of the Hadoop file system contract backed by AWS S3.
For details on configuration, see our usage guide.
S3-FileSystem was created to enable more efficient usage of AWS S3. This means:
- provide strong read-after-write consistency (in the meantime, AWS has also rolled out native S3 strong consistency).
- provide file rename as an atomic O(1) operation. Natively, files cannot be renamed in S3 (other file system implementations on top of S3 implement file rename as a copy + delete).
- avoid the S3 partition hotspot problem regardless of client-defined file paths.
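To make the difference between the two rename strategies concrete, here is a minimal in-memory sketch in Python. It is a toy model under stated assumptions, not the S3-FileSystem implementation: the class names `CopyDeleteStore` and `MetadataStore`, the `bytes_copied` counter, and the use of random UUID keys are all hypothetical and chosen only for illustration. The first class treats the file path as the object key, so a rename has to copy the data and delete the original. The second keeps a separate path-to-key index, so a rename only touches the index, and the physical keys do not depend on client-chosen paths.

```python
import uuid


class CopyDeleteStore:
    """Toy object store where the object key IS the file path (assumption)."""

    def __init__(self):
        self.objects = {}      # object key (the path) -> data
        self.bytes_copied = 0  # data moved by rename(), for comparison

    def write(self, path, data):
        self.objects[path] = data

    def read(self, path):
        return self.objects[path]

    def rename(self, src, dst):
        # No native rename: copy the object under the new key, then delete
        # the old one. Cost grows with object size, and a reader can observe
        # the intermediate state where both keys exist.
        data = self.objects[src]
        self.objects[dst] = bytes(data)
        self.bytes_copied += len(data)
        del self.objects[src]


class MetadataStore:
    """Toy store with a separate path index (hypothetical design)."""

    def __init__(self):
        self.objects = {}      # physical key -> data
        self.index = {}        # logical path -> physical key
        self.bytes_copied = 0  # stays 0: rename never touches the data

    def write(self, path, data):
        old_key = self.index.get(path)
        if old_key is not None:
            del self.objects[old_key]  # drop the overwritten object
        key = uuid.uuid4().hex         # key is independent of the path
        self.objects[key] = data
        self.index[path] = key

    def read(self, path):
        return self.objects[self.index[path]]

    def rename(self, src, dst):
        # Only the index entry moves; the object itself is untouched, so the
        # cost does not depend on file size. In a real system this would need
        # to be a single transactional update on the metadata store.
        self.index[dst] = self.index.pop(src)


if __name__ == "__main__":
    payload = b"x" * 1_000_000
    for store in (CopyDeleteStore(), MetadataStore()):
        store.write("/logs/2024/part-0000", payload)
        store.rename("/logs/2024/part-0000", "/logs/2024/part-0000.done")
        print(type(store).__name__, "bytes copied on rename:", store.bytes_copied)
```

In this model the copy + delete rename moves every byte of the file, while the index-based rename moves none. Because the physical keys are random, files under the same logical directory do not share a key prefix.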
S3-FileSystem does not aim to be a drop-in replacement for HDFS, nor to fully implement the FileSystem specification.
There are differences between HDFS and S3-FileSystem, most notably:
- S3-FileSystem does not support atomic rename of directories.
- S3-FileSystem does not support POSIX-like permissions.
For a full list of differences between S3-FileSystem and the Hadoop API specification, see our contract definition and our API compatibility analysis.
For the full Hadoop API specification, please see these docs. For the implicit assumptions (including atomicity and concurrency) of the API, please see these docs.
A few projects that tackle the same issues:
- S3 Guard tackles S3 consistency issues:
  - Since S3 rolled out native strong consistency, the open source community has decided to deprecate S3 Guard.
- S3A committers tackle both consistency and S3's rename problems:
  - The S3A committers do not attempt to solve these issues at the FileSystem level, but at the OutputCommitter level. Thus, they are primarily targeted at improving Spark/MR job performance and correctness when running on S3.
Contributions are welcome! Read the Contributing Guide for more information.
This project is licensed under the Apache V2 License. See LICENSE for more information.