Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6655: [Python] Filesystem bindings for S3 #5423

Closed
wants to merge 39 commits into from
Closed

Conversation

kszucs
Copy link
Member

@kszucs kszucs commented Sep 18, 2019

  • Add support for S3FileSystem in the python bindings.
  • Fixed issue with reading all the content of an S3 object
  • Introduce minio_server fixture for parametrized testing of all filesystem implementations
  • Fixed s3fs parquet test and updated it to use minio_server fixture

@codecov-io
Copy link

codecov-io commented Sep 18, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@511c089). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #5423   +/-   ##
=========================================
  Coverage          ?   80.84%           
=========================================
  Files             ?       90           
  Lines             ?     8854           
  Branches          ?        0           
=========================================
  Hits              ?     7158           
  Misses            ?     1342           
  Partials          ?      354

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 511c089...db89859. Read the comment docs.

python/pyarrow/_s3.pyx Outdated Show resolved Hide resolved
@kszucs kszucs marked this pull request as ready for review September 22, 2019 10:51
@kszucs kszucs changed the title [Python] Filesystem bindings for S3 ARROW-6655: [Python] Filesystem bindings for S3 Sep 22, 2019
@kszucs
Copy link
Member Author

kszucs commented Sep 23, 2019

@pitrou PTAL, I still need to fix the AppVeyor builds.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather see the testing simplified. Also I'm not sure we want to depend on the minio client for testing. Actual checks are already done on the C++ side.

ci/conda_env_python.yml Outdated Show resolved Hide resolved
cpp/cmake_modules/ThirdpartyToolchain.cmake Show resolved Hide resolved
python/pyarrow/_fs.pyx Outdated Show resolved Hide resolved
python/pyarrow/_fs.pxd Show resolved Hide resolved
python/pyarrow/_s3.pyx Outdated Show resolved Hide resolved
python/pyarrow/fs.py Outdated Show resolved Hide resolved
python/pyarrow/includes/libarrow_fs.pxd Show resolved Hide resolved
python/pyarrow/tests/conftest.py Show resolved Hide resolved
python/pyarrow/tests/test_fs.py Outdated Show resolved Hide resolved
python/pyarrow/tests/test_fs.py Outdated Show resolved Hide resolved
with pytest.raises(ValueError):
aaa_stat.size
assert mtime_almost_equal(aaa_stat.mtime, fs.mtime(_aaa))
# type is inconsistent base_name has a trailing slas for 'aaa' and 'aaa/'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou if the directory's path ends with a slash then the file type is File, otherwise it is Directory, is this the expected outcome?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which situation exactly? For guidance, you can take a look at the C++ tests, they should tell you what is expected.

@@ -340,6 +340,12 @@ class ObjectInputFile : public io::RandomAccessFile {
RETURN_NOT_OK(CheckClosed());
RETURN_NOT_OK(CheckPosition(position, "read"));

nbytes = std::min(nbytes, content_length_ - position);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou this is the fix for stream.read() issue

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! And thanks for the tests too.

@pitrou
Copy link
Member

pitrou commented Sep 25, 2019

@kszucs Sorry, what is the S3Options question already?

Would be nice if you could create those JIRAs and point them to me :-)

@kszucs
Copy link
Member Author

kszucs commented Sep 25, 2019

@pitrou Created the JIRAs. That question I'm referring to is the keywords vs options object one. It is hidden by default in the GitHub thread, here is the link #5423 (comment)

@pitrou
Copy link
Member

pitrou commented Sep 26, 2019

@kszucs I agree that using kwargs directly on S3FileSystem may look reasonable with the current settings. But S3 may/will need to grow more settings in the future. For example see the settings supported by boto3.

Once there are many settings it is nicer to have property access on an options object, like we already have on the CSV and JSON options objects.

@kszucs
Copy link
Member Author

kszucs commented Sep 26, 2019

@pitrou now that has convinced me :)

@kszucs kszucs requested a review from pitrou September 27, 2019 13:04
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's getting good! Just a couple questions below.

@@ -164,6 +168,10 @@ export PYARROW_BUILD_TYPE=$ARROW_BUILD_TYPE
export PYARROW_WITH_PARQUET=1
export PYARROW_WITH_PLASMA=1
export PYARROW_WITH_ORC=1
export PYARROW_WITH_S3=1
if [ "$ARROW_TRAVIS_S3" == "1" ]; then
export PYARROW_WITH_S3=1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I suppose there's a leftover here or above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARROW_TRAVIS_S3 is still optional, so we need it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but look at the diff :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean, have I removed ARROW_TRAVIS_S3 or PYARROW_WITH_S3?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean this:

export PYARROW_WITH_S3=1
if [ "$ARROW_TRAVIS_S3" == "1" ]; then
  export PYARROW_WITH_S3=1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:), fixing it

@@ -340,6 +340,12 @@ class ObjectInputFile : public io::RandomAccessFile {
RETURN_NOT_OK(CheckClosed());
RETURN_NOT_OK(CheckPosition(position, "read"));

nbytes = std::min(nbytes, content_length_ - position);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! And thanks for the tests too.


Parameters
----------
access_key: str, default None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should update the docstring now :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed.

python/pyarrow/tests/test_parquet.py Show resolved Hide resolved
@kszucs
Copy link
Member Author

kszucs commented Sep 30, 2019

@pitrou updated

@pitrou pitrou closed this in 155415c Oct 1, 2019
wesm pushed a commit that referenced this pull request Oct 3, 2019
depends on #5423

Closes #5484 from kszucs/s3-conda and squashes the following commits:

7a3c67c <Krisztián Szűcs> test s3fs import
ff6eb22 <Krisztián Szűcs> Include s3 support in the conda packages

Authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
kszucs added a commit that referenced this pull request Oct 5, 2019
- Add support for S3FileSystem in the python bindings.
- Fixed issue with reading all the content of an S3 object
- Introduce `minio_server` fixture for parametrized testing of all filesystem implementations
- Fixed s3fs parquet test and updated it to use minio_server fixture

Closes #5423 from kszucs/s3 and squashes the following commits:

384c960 <Krisztián Szűcs> Resolve review comments
73e6625 <Krisztián Szűcs> S3Options
98bd91a <Krisztián Szűcs> remove commented tests
db89859 <Krisztián Szűcs> rename to s3fs
4478458 <Krisztián Szűcs> fix read() issue
c1df10b <Krisztián Szűcs> initialization in first use
192ab65 <Krisztián Szűcs> flake8
f70f9fb <Krisztián Szűcs> remove minio-client dependency
d399643 <Krisztián Szűcs> simplify test suite
fee57a9 <Krisztián Szűcs> remove accidentally committed files
751cfd4 <Krisztián Szűcs> resolve a couple of review comments; enum workaround
45436f7 <Krisztián Szűcs> cython flake8
38dcb88 <Krisztián Szűcs> rat
c541b3e <Krisztián Szűcs> comment left
098048a <Krisztián Szűcs> more compat
00340ed <Krisztián Szűcs> fixture error handling
2be25ce <Krisztián Szűcs> auto initialize s3 on import
88e0c9f <Krisztián Szűcs> py2 compat
8585a60 <Krisztián Szűcs> py2 compat
d372287 <Krisztián Szűcs> install minio in the conda-toolchain build
041cad4 <Krisztián Szűcs> executable flag
fb0f281 <Krisztián Szűcs> travis minio install script
8cbe0ee <Krisztián Szűcs> travis osx
72e56a6 <Krisztián Szűcs> enable S3 in travis python builds
7800c75 <Krisztián Szűcs> appveyor flag
7daf566 <Krisztián Szűcs> fix syntax error in travis script
68eb591 <Krisztián Szűcs> enable PYARROW_WITH_S3 on appveyor
2cb19d1 <Krisztián Szűcs> conditional import of test dependencies
efa05d2 <Krisztián Szűcs> use minio for dask.s3fs test too
f25ae5a <Krisztián Szűcs> travis
9042c7e <Krisztián Szűcs> use S3FS_DIR
9ce7180 <Krisztián Szűcs> cmake format; fix orc cimport
45a2a17 <Krisztián Szűcs> docstrings
c0b9162 <Krisztián Szűcs> test requirements; flake8
44aedfd <Krisztián Szűcs> stat test
a343950 <Krisztián Szűcs> testing suite
1551b52 <Krisztián Szűcs> wip
dd41d21 <Krisztián Szűcs> imports
5200af1 <Krisztián Szűcs> s3 filesystem bindings

Authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
kszucs added a commit that referenced this pull request Oct 5, 2019
depends on #5423

Closes #5484 from kszucs/s3-conda and squashes the following commits:

7a3c67c <Krisztián Szűcs> test s3fs import
ff6eb22 <Krisztián Szűcs> Include s3 support in the conda packages

Authored-by: Krisztián Szűcs <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants