Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-39968: [Python][FS][Azure] Minimal Python bindings for AzureFileSystem #40021

Conversation

Tom-Newton
Copy link
Contributor

@Tom-Newton Tom-Newton commented Feb 9, 2024

Rationale for this change

We want to use the new AzureFileSystem in pyarrow.

What changes are included in this PR?

  • Add minimal python bindings for AzureFileSystem. This includes just enough to run the python tests against azurite plus default credential auth to enable real use of this once this PR merges.
  • Adding additional configuration options and remaining authentication options can be done as a follow up.
  • I tried to copy the existing pybinds for GCS and S3
  • Explicitly set ARROW_AZURE=OFF rather than relying on defaults. The defaults are different for builds vs tests so this was causing tests to be enabled while Azure was disabled during the build.

Are these changes tested?

Enabled the the python filesystem tests for the new filesystem. I had to skip azure in a couple of the tests though because they are not yet working on the C++ side. I created Github issues to resolve these #40025 and #40026 and added TODO comments where relevant, that reference these Github issues.

Are there any user-facing changes?

pyarrow users can now use the native AzureFileSystem to get much better reliability and performance compared to adlfs based options.

@Tom-Newton Tom-Newton force-pushed the tomnewton/minimal_python_bindings/GH-39968 branch from 8d6b96d to 99e1354 Compare February 10, 2024 12:26
@Tom-Newton Tom-Newton changed the title GH-39968: WIP Minimal python bindings for AzureFileSystem GH-39968: Minimal python bindings for AzureFileSystem Feb 10, 2024
@Tom-Newton Tom-Newton marked this pull request as ready for review February 10, 2024 19:15
@kou kou changed the title GH-39968: Minimal python bindings for AzureFileSystem GH-39968: [Python][FS][Azure] Minimal python bindings for AzureFileSystem Feb 10, 2024
@kou kou changed the title GH-39968: [Python][FS][Azure] Minimal python bindings for AzureFileSystem GH-39968: [Python][FS][Azure] Minimal Python bindings for AzureFileSystem Feb 10, 2024
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to update ci/scripts/python_*.sh/.github/workflows/python.yml too for PYARROW_WITH_AZURE in this PR. Or we can do it in a separated PR to keep this PR minimal.

cpp/src/arrow/util/config.h.cmake Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting review Awaiting review awaiting changes Awaiting changes labels Feb 10, 2024
@Tom-Newton
Copy link
Contributor Author

We may want to update ci/scripts/python_*.sh/.github/workflows/python.yml too for PYARROW_WITH_AZURE in this PR. Or we can do it in a separated PR to keep this PR minimal.

I updated .github/workflows/python.yml and ci/scripts/python_sdist_build.sh. I think these are the only ones I missed in #39971. Probably I missed them because GCS was disabled.

@Tom-Newton
Copy link
Contributor Author

The MATLAB builds seem to be having issues. I don't think these can be related to my changes

@kou
Copy link
Member

kou commented Feb 11, 2024

Yes. MATLAB related failures are unrelated. Could you open an issue for it to ignore the failures in this PR?

@kou
Copy link
Member

kou commented Feb 11, 2024

@github-actions crossbow submit -g cpp -g wheel

This comment was marked as outdated.

@Tom-Newton
Copy link
Contributor Author

Yes. MATLAB related failures are unrelated. Could you open an issue for it to ignore the failures in this PR?

Created an issue: #40034

python/pyarrow/_azurefs.pyx Outdated Show resolved Hide resolved
python/pyarrow/_azurefs.pyx Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Feb 11, 2024
@Tom-Newton
Copy link
Contributor Author

Tom-Newton commented Feb 12, 2024

2 CI failures:
appvayor: Build execution time has reached the maximum allowed time for your plan (90 minutes).
C++ / AMD64 macOS 12 C++ (pull_request): 97/97 Test #73: arrow-s3fs-test ..............................***Timeout 300.06 sec

I think both are unrelated to this PR

@Tom-Newton Tom-Newton force-pushed the tomnewton/minimal_python_bindings/GH-39968 branch from 793db20 to 20e7a31 Compare March 13, 2024 10:33
@Tom-Newton
Copy link
Contributor Author

I've just rebased after #40455

@jorisvandenbossche when you get a chance please could you re-review. Sorry to be impatient, but I really want to start using this and I'm hoping it will be merged in time for the 16.0.0 release.

@jorisvandenbossche
Copy link
Member

No need to apologize for the ping! ;)
I actually had re-reviewed the last code changes last week and everything looks good to me, forgot to comment that.

The conclusion from the last discussion just above about extra test builds to enable this (#40021 (comment)) is that this can wait for later? Do we want to create a follow-up issue to add this to some additional non-conda builds?

@jorisvandenbossche jorisvandenbossche merged commit 9f6dc1f into apache:main Mar 13, 2024
53 checks passed
@jorisvandenbossche jorisvandenbossche removed the awaiting change review Awaiting change review label Mar 13, 2024
@github-actions github-actions bot added the awaiting merge Awaiting merge label Mar 13, 2024
@Tom-Newton
Copy link
Contributor Author

Thanks for reviewing and merging. I created an issue for the minio and azurite thing #40509

Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 9f6dc1f.

There were 7 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

Comment on lines +59 to +64
blob_storage_scheme : str, default None
Either `http` or `https`. Defaults to `https`. Useful for connecting to a local
emulator, like Azurite.
dfs_storage_scheme : str, default None
Either `http` or `https`. Defaults to `https`. Useful for connecting to a local
emulator, like Azurite.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou should this also change to enable_tls like you did in the URI parsing? cc @Tom-Newton

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think so. We may want to use AzureOptions::FromUri() instead of re-implementing the same logic.

@Tom-Newton Could you follow-up this?

Copy link
Contributor Author

@Tom-Newton Tom-Newton Mar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created the issue for completing the python bindings and referenced this conversation #40572. There is a good chance that I will work on it but I can't say when.

@Tom-Newton
Copy link
Contributor Author

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 9f6dc1f.

There were 7 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

I'm not really sure what to make of this. These benchmarks do seem potentially relevant but all I've done is add a feature not modify anything so I don't see how this PR could have caused a regression.

@jorisvandenbossche
Copy link
Member

I'm not really sure what to make of this. These benchmarks do seem potentially relevant but all I've done is add a feature not modify anything so I don't see how this PR could have caused a regression.

You can ignore all of them. I checked earlier today, and all are spurious spikes in the timings.
If this PR would actually have caused a regression, you would see that the later commits would also all be slower, which is not the case here:

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Python][FS][Azure] Minimal python bindings for AzureFilesystem
5 participants