Feature Request: Consider removing PyArrow as a required DBAPI dependency #2413

henryharbeck · 2025-01-06T11:48:37Z

What feature or improvement would you like to see?

Thanks to the Arrow PyCapsule Interface, one can read data directly into supported libraries (e.g., DuckDB, Polars) without requiring PyArrow. This is great!

Writing, on the other hand (e.g., via adbc_ingest), does require PyArrow. This is a bit of a shame given that the data supplied to adbc_ingest also supports the PyCapsule Interface.

Furthermore, removing PyArrow as a required DBAPI dependency would allow reading data with a higher-level API.

These changes would be particularly beneficial to a library like Polars, removing the need for PyArrow completely for database I/O.

NB. I am certainly no expert regarding this, so please correct me if I have said anything incorrect or there are fundamental limitations that make this request unreasonable.


lidavidm · 2025-01-06T12:41:41Z

I suppose it's doable enough to change it so that as long as you don't try and read result sets (which needs something to process the Arrow data), we can accept PyCapsule and not require PyArrow.

You can always use the low-level interface as well, though it doesn't implement DBAPI. (The proposal here would technically not implement DBAPI either, but I suppose it'd be closer!)
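To illustrate the point about accepting PyCapsule input without PyArrow, here is a minimal sketch of how an ingest path could recognise Arrow producers by duck typing alone. The dunder method names come from the Arrow PyCapsule Interface; the helper `classify_arrow_input` and the `FakeStream` class are hypothetical and are not part of adbc_driver_manager.

```python
def classify_arrow_input(obj):
    """Return which Arrow PyCapsule export protocol ``obj`` supports, if any."""
    # A stream producer (e.g. a table or record batch reader) exposes this method.
    if hasattr(obj, "__arrow_c_stream__"):
        return "stream"
    # A single array or record batch exposes this one instead.
    if hasattr(obj, "__arrow_c_array__"):
        return "array"
    return None


class FakeStream:
    """Hypothetical stand-in for a producer such as a Polars DataFrame."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


print(classify_arrow_input(FakeStream()))  # stream
print(classify_arrow_input([1, 2, 3]))     # None
```

Nothing here imports PyArrow: the check only looks for the protocol methods, and the capsule itself would be handed to the driver untouched.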

paleolimbot · 2025-01-06T22:11:16Z

Technically nanoarrow for Python can do row tuples and possibly a few other things (but I would personally consider it something that should be opted in to and treated experimentally).

https://arrow.apache.org/nanoarrow/0.6.0/reference/python/array-stream.html#nanoarrow.array_stream.ArrayStream.iter_tuples

lidavidm · 2025-01-06T23:01:43Z

I was thinking about that, but I figure if we can get away with no dependencies at all that might be useful too.

paleolimbot · 2025-01-07T02:49:48Z

Probably best would be to eliminate the dependency and give an example that uses nanoarrow.ArrayStream(raw_capsule).iter_tuples() in the documentation (instead of a dependency).

henryharbeck · 2025-01-07T13:18:28Z

Thanks both for the prompt responses.

For a little more context behind the request: if it is completed, I want to propose to Polars that they also remove PyArrow as a required dependency for using the ADBC engine in database I/O.

> I suppose it's doable enough to change it so that as long as you don't try and read result sets (which needs something to process the Arrow data)

Upon reading the adbc_driver_manager.dbapi.Cursor Python docs again, I realise there is no current API to fetch the results as a PyCapsule. Would you consider this as an additional feature request? If so, happy to raise an additional issue for this. Otherwise I'll propose Polars use the lower-level API for database reads (which I don't imagine will be a deal-breaker by any stretch).

> The proposal here would technically not implement DBAPI either, but I suppose it'd be closer!

Sorry for any confusion, I was conflating DBAPI 2.0 (PEP 249) and the adbc_driver_manager.dbapi and adbc_driver_<database>.dbapi modules/namespaces under the term "DBAPI".

> I figure if we can get away with no dependencies at all that might be useful too

That would be ideal from my (and I dare say other libraries') POV, where some features/APIs require dependencies (e.g., fetch_arrow_table, fetch_df), but none are explicitly required by the library itself.

lidavidm · 2025-01-07T23:31:50Z

> I realise there is no current API to fetch the results as a PyCapsule

I think that's reasonable to add at the same time.

> > The proposal here would technically not implement DBAPI either, but I suppose it'd be closer!
>
> Sorry for any confusion, I was conflating DBAPI 2.0 (PEP 249) and the adbc_driver_manager.dbapi and adbc_driver_<database>.dbapi modules/namespaces under the term "DBAPI"

I just mean that, because certain methods (like fetchone) wouldn't work, we technically wouldn't be in full compliance, but otherwise we would look and function like a real DBAPI driver (unless you try and fetch Python objects from result sets).

henryharbeck · 2025-01-07T23:43:31Z

> > I realise there is no current API to fetch the results as a PyCapsule
>
> I think that's reasonable to add at the same time

That would be awesome!

> I just mean that, because certain methods (like fetchone) wouldn't work, we technically wouldn't be in full compliance...

Ah got it, thanks for clarifying

WillAyd · 2025-01-08T01:56:49Z

I know we've gone back and forth on it, but maybe it's worth having nanoarrow as a dependency to stay compliant with the DBAPI? I also think it would be a good way to promote more usage of that library.

lidavidm · 2025-01-08T02:47:05Z

I think we could do something where, if you have neither nanoarrow nor pyarrow installed, it functions with limited support; otherwise it uses whichever one is available (not sure what would happen if you have both).
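One possible shape for that selection logic is sketched below, using only the standard library. The function name `pick_arrow_backend` is hypothetical, and preferring pyarrow when both packages are installed is an assumption made for the sketch; the thread leaves that case open.

```python
import importlib.util


def pick_arrow_backend(is_installed=None):
    """Return "pyarrow", "nanoarrow", or None (limited support)."""
    if is_installed is None:
        # Default probe: look the package up without actually importing it.
        def is_installed(name):
            return importlib.util.find_spec(name) is not None

    # Assumption: pyarrow wins when both are present.
    for name in ("pyarrow", "nanoarrow"):
        if is_installed(name):
            return name
    return None


# The probe can be swapped out, which makes each case easy to exercise:
print(pick_arrow_backend(lambda name: name == "nanoarrow"))  # nanoarrow
print(pick_arrow_backend(lambda name: False))                # None
```

A `None` result would correspond to the "limited support" mode, where PyCapsule objects pass through but nothing can inspect result sets.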

paleolimbot · 2025-01-08T16:34:53Z

Perhaps concretely:

pip install "adbc-driver-manager[dbapi]" installs pyarrow and uses it if available (no change from current)
pip install adbc-driver-manager installs nothing else (no change from current) and errors for any call to a dbapi function that requires inspecting a non-ADBC object like an array, schema, or array stream if pyarrow is not available
Some combination of me and/or Will work to add a fallback to pyarrow using nanoarrow that can be opted in to (or used if pyarrow is not available)
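The second point, erroring only when a call actually needs to inspect Arrow data, could look roughly like the sketch below. `require_module` and `SketchCursor` are hypothetical names invented for illustration; this is not the adbc_driver_manager implementation, and the row conversion itself is deliberately left out.

```python
import importlib


def require_module(name, feature):
    """Import ``name`` lazily, or raise an error naming the feature that needs it."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional dependency {name!r}"
        ) from exc


class SketchCursor:
    """Hypothetical stand-in for a DBAPI cursor; not the real ADBC class."""

    def __init__(self, arrow_module="pyarrow"):
        # Name of the library used to inspect Arrow data (assumed default).
        self._arrow_module = arrow_module

    def fetchone(self):
        # Turning Arrow data into Python rows needs a library that can read it,
        # so the import is attempted here rather than at package import time.
        require_module(self._arrow_module, "fetchone()")
        raise NotImplementedError("row conversion omitted from this sketch")
```

With this layout, importing the package and ingesting PyCapsule objects never touch the optional dependency; only row-oriented calls such as `fetchone()` fail, and they fail with a message that says what to install.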

henryharbeck added the Type: enhancement New feature or request label Jan 6, 2025

henryharbeck mentioned this issue Jan 21, 2025

support adbc APIs in DB-API package googleapis/python-bigquery#2015

Closed
