Feature Request: Consider removing PyArrow as a required DBAPI dependency #2413

Open
henryharbeck opened this issue Jan 6, 2025 · 10 comments
Labels
Type: enhancement New feature or request

Comments

@henryharbeck

What feature or improvement would you like to see?

Thanks to the Arrow PyCapsule Interface, one can read data directly into supported libraries (e.g., DuckDB, Polars) without requiring PyArrow. This is great!

Writing, on the other hand (e.g., via adbc_ingest), does require PyArrow. This is a bit of a shame given that the data supplied to adbc_ingest also supports the PyCapsule Interface.

Furthermore, removing PyArrow as a required DBAPI dependency would allow reading data with a higher-level API.

These changes would be particularly beneficial to a library like Polars, removing the need for PyArrow completely for database I/O.

NB. I am certainly no expert regarding this, so please correct me if I have said anything incorrect or there are fundamental limitations that make this request unreasonable.
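To make the write path concrete, here is a minimal sketch of an ingest helper that accepts any object exporting the Arrow PyCapsule protocol. The helper names (`exports_arrow_capsule`, `ingest_capsule_data`) are hypothetical. `conn` is assumed to behave like an `adbc_driver_manager.dbapi.Connection`, and `adbc_ingest(table_name, data, mode=...)` is the existing cursor method. Whether this path works without PyArrow installed is exactly what this issue asks for, so treat it as an illustration of the desired behaviour rather than the current one.

```python
from typing import Any


def exports_arrow_capsule(data: Any) -> bool:
    """Return True if `data` exposes an Arrow PyCapsule export method."""
    return any(
        callable(getattr(data, name, None))
        for name in ("__arrow_c_stream__", "__arrow_c_array__")
    )


def ingest_capsule_data(conn: Any, table_name: str, data: Any, mode: str = "create") -> int:
    """Bulk-load `data` into `table_name` via the cursor's adbc_ingest method.

    Assumption: `conn.cursor()` returns an object with `adbc_ingest` and
    `close`, as the ADBC dbapi cursor does.
    """
    if not exports_arrow_capsule(data):
        raise TypeError(
            "data must implement __arrow_c_stream__ or __arrow_c_array__"
        )
    cur = conn.cursor()
    try:
        # Pass the object through untouched; no PyArrow conversion here.
        return cur.adbc_ingest(table_name, data, mode=mode)
    finally:
        cur.close()
```

With Polars, for example, the `data` argument could be a DataFrame, since it exports `__arrow_c_stream__`.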

@henryharbeck henryharbeck added the Type: enhancement New feature or request label Jan 6, 2025
@lidavidm
Member

lidavidm commented Jan 6, 2025

I suppose it's doable enough to change it so that as long as you don't try and read result sets (which needs something to process the Arrow data), we can accept PyCapsule and not require PyArrow.

You can always use the low-level interface as well, though it doesn't implement DBAPI. (The proposal here would technically not implement DBAPI either, but I suppose it'd be closer!)

@paleolimbot
Member

Technically nanoarrow for Python can do row tuples and possibly a few other things (but I would personally consider it something that should be opted in to and treated experimentally).

https://arrow.apache.org/nanoarrow/0.6.0/reference/python/array-stream.html#nanoarrow.array_stream.ArrayStream.iter_tuples

@lidavidm
Member

lidavidm commented Jan 6, 2025

I was thinking about that, but I figure if we can get away with no dependencies at all that might be useful too.

@paleolimbot
Member

Probably best would be to eliminate the dependency and give an example that uses nanoarrow.ArrayStream(raw_capsule).iter_tuples() in the documentation (instead of a dependency).
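A documentation example along those lines might look like the sketch below. The function name `iter_rows` is hypothetical. It assumes the result object exposes `__arrow_c_stream__` and that nanoarrow is installed separately. `ArrayStream.iter_tuples()` is the nanoarrow method linked earlier in this thread, which is described there as something to treat experimentally.

```python
from typing import Any, Iterator, Tuple


def iter_rows(stream: Any) -> Iterator[Tuple[Any, ...]]:
    """Iterate over an Arrow stream as Python row tuples using nanoarrow.

    `stream` is assumed to be any object implementing the Arrow PyCapsule
    stream protocol (for example, a result handle from a database driver).
    """
    if not hasattr(stream, "__arrow_c_stream__"):
        raise TypeError("stream must implement __arrow_c_stream__")

    # Imported lazily so that nanoarrow stays an optional dependency.
    import nanoarrow as na

    return na.ArrayStream(stream).iter_tuples()
```

Keeping the import inside the function means nothing is required until a caller actually asks for Python rows.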

@henryharbeck
Author

Thanks both for the prompt responses.

For a little more context behind the request: if it is completed, I want to propose to Polars that they also remove PyArrow as a required dependency for using the ADBC engine in database I/O.


> I suppose it's doable enough to change it so that as long as you don't try and read result sets (which needs something to process the Arrow data)

Upon reading the adbc_driver_manager.dbapi.Cursor Python docs again, I realise there is no current API to fetch the results as a PyCapsule. Would you consider this as an additional feature request? If so, happy to raise an additional issue for this. Otherwise I'll propose Polars use the lower-level API for database reads (which I don't imagine will be a deal-breaker by any stretch).

> The proposal here would technically not implement DBAPI either, but I suppose it'd be closer!

Sorry for any confusion, I was conflating DBAPI 2.0 (PEP 249) and the adbc_driver_manager.dbapi and adbc_driver_<database>.dbapi modules/namespaces under the term "DBAPI"

> I figure if we can get away with no dependencies at all that might be useful too

That would be ideal from my (and I dare say other libraries') POV, where some features/APIs require dependencies (e.g., fetch_arrow_table, fetch_df), but none are explicitly required by the library itself.

@lidavidm
Member

lidavidm commented Jan 7, 2025

> I realise there is no current API to fetch the results as a PyCapsule

I think that's reasonable to add at the same time

> > The proposal here would technically not implement DBAPI either, but I suppose it'd be closer!
>
> Sorry for any confusion, I was conflating DBAPI 2.0 (PEP 249) and the adbc_driver_manager.dbapi and adbc_driver_<database>.dbapi modules/namespaces under the term "DBAPI"

I just mean that, because certain methods (like fetchone) wouldn't work, we technically wouldn't be in full compliance, but otherwise we would appear to look and function like a real DBAPI driver (unless you try and fetch Python objects from result sets)

@henryharbeck
Author

> > I realise there is no current API to fetch the results as a PyCapsule
>
> I think that's reasonable to add at the same time

That would be awesome!

> I just mean that, because certain methods (like fetchone) wouldn't work, we technically wouldn't be in full compliance...

Ah got it, thanks for clarifying

@WillAyd
Contributor

WillAyd commented Jan 8, 2025

I know we've gone back and forth on it, but maybe it's worth having nanoarrow as a dependency to stay compliant with the DBAPI? I also think it would be a good way to promote more usage of that library.

@lidavidm
Member

lidavidm commented Jan 8, 2025

I think we could do something where if you have neither nanoarrow nor pyarrow installed, it will function with limited support, otherwise it can use whichever one is available (not sure what would happen if you have both)
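One way to picture that fallback is a small backend selector, sketched here with the standard library only. The function name `pick_arrow_backend` and the preference order (pyarrow first when both are installed) are assumptions made for illustration; the thread leaves the "both installed" case undecided.

```python
import importlib.util
from typing import Callable, Optional, Sequence


def pick_arrow_backend(
    preferred: Sequence[str] = ("pyarrow", "nanoarrow"),
    is_installed: Optional[Callable[[str], bool]] = None,
) -> Optional[str]:
    """Return the first installed backend name, or None for limited support.

    `is_installed` can be injected for testing; by default it checks
    whether the module can be found without importing it.
    """
    if is_installed is None:
        def is_installed(name: str) -> bool:
            return importlib.util.find_spec(name) is not None

    for name in preferred:
        if is_installed(name):
            return name
    # Neither library is available: only capsule pass-through would work.
    return None
```

A `None` result would correspond to the limited mode, where data can be handed to other libraries but not converted to Python objects.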

@paleolimbot
Member

Perhaps concretely:

  • pip install "adbc-driver-manager[dbapi]" installs pyarrow and uses it if available (no change from current)
  • pip install adbc-driver-manager installs nothing else (no change from current) and errors for any calls to a dbapi function that require inspecting a non-ADBC object like an array, schema, or array stream if pyarrow is not available
  • Some combination of me and/or Will work to add a fallback to pyarrow using nanoarrow that can be opted in to (or used if pyarrow is not available)
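The second bullet above (error only when a call actually needs PyArrow) could be sketched as a lazy import guard. The helper name `require_pyarrow` and its `importer` parameter are hypothetical and not part of adbc-driver-manager; the install hint reuses the `[dbapi]` extra mentioned in the first bullet.

```python
import importlib
from types import ModuleType
from typing import Callable


def require_pyarrow(
    feature: str,
    importer: Callable[[str], ModuleType] = importlib.import_module,
) -> ModuleType:
    """Import pyarrow on demand, or raise an error naming the feature.

    `importer` exists only so the behaviour can be exercised in tests
    without depending on whether pyarrow is installed.
    """
    try:
        return importer("pyarrow")
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires pyarrow; install it with "
            'pip install "adbc-driver-manager[dbapi]"'
        ) from exc
```

A method such as fetchone would call `require_pyarrow("Cursor.fetchone")` at its start, so importing the module itself never fails for users who only pass capsules through.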
