SNOW-966003: Fix Arrow return value for zero-row queries #1801

thomasaarholt · 2023-11-09T19:07:45Z

Please answer these questions before submitting your pull requests. Thanks!

What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

Fixes SNOW-966003: Unexpected arrow table behaviour when query returns zero rows #1800
Fill out the following pre-review checklist:
- I am adding a new automated test(s) to verify correctness of my new code
- I am adding new logging messages
- I am adding a new telemetry message
- I am modifying authorization mechanisms
- I am adding new credentials
- I am modifying OCSP code
- I am adding a new dependency
Please describe how your code solves the related issue.

This PR copies the approach done in Return empty Arrow table instead of None #1275 by @JakobGM, but based on the latest main (I'm sorry @JakobGM, but it was easier to just copy rather than rebase your solution, and I hope you don't mind that), with one more type hint added due to some later commits. That approach itself is based on the current implementation of .fetch_pandas_all().

This is a minor code change, copying methodology already implemented by snowflake on the pandas side, but it makes it a lot easier to work with arrow data from the snowflake-connector-python.

Please let me know if anything more is required from me.

sfc-gh-aling · 2023-11-09T19:52:55Z

can we add some test here?

thomasaarholt · 2023-11-09T20:09:12Z

Thanks for the quick response. I've added a quick test to check that the dtype of a integer column in fact is what we expect (the highest bit-depth).

Feel free to add or change to this PR. My goal is only to see this behaviour fixed.

thomasaarholt · 2023-11-25T20:00:28Z

I'm a little worried about seeing this PR go stale like #1275. Is there anything further that I can do to finish this?

sfc-gh-aling · 2023-12-08T17:30:04Z

src/snowflake/connector/cursor.py

@@ -1117,7 +1117,7 @@ def fetch_arrow_batches(self) -> Iterator[Table]:
        )
        return self._result_set._fetch_arrow_batches()

-    def fetch_arrow_all(self) -> Table | None:
+    def fetch_arrow_all(self) -> Table:


@sfc-gh-mkeller , do you think this is a API breaking change that previously we return None now we return an empty Table.

I'm thinking of introducing a parameter to protect the change.
(name TBD)
def fetch_arrow_all(self, return_table_even_no_data=False)

Yes, we could do that and then use typing.overload to express the different return types based on that boolean flag.

I'd love if we had a plan on how to introduce such changes with a timeline to deprecate them over some time. Maybe some warning wit a date'd comment?

Some thoughts

Here's a suggested fix using your ideas:

from typing import overload, Literal from warnings import warn from pyarrow import Table # Temporarily omitting `self` in the methods in order to make it easy to test @overload def fetch_arrow_all(return_table_even_no_data: Literal[False] = False) -> None: ... @overload def fetch_arrow_all(return_table_even_no_data: Literal[True] = True) -> Table: ... def fetch_arrow_all(return_table_even_no_data: bool = False) -> Table | None: if not return_table_even_no_data: warn( "The default behaviour of fetch_arrow_all when the query returns zero rows " "will change from returning None to returning an empty pyarrow table with " "schema using the highest bit length in version X.X.X.", DeprecationWarning, stacklevel=2, ) # regular code here foo = fetch_arrow_all(return_table_even_no_data=True) # Table, or "Unknown" if no type stubs for pyarrow are present bar = fetch_arrow_all(return_table_even_no_data=False) # None

I'm happy to change any of this.

You could consider this a bug-fix though. It is a discrepancy between the pandas and pyarrow apis. Depends on how strict you want to be about potentially breaking changes.

If you follow semver, then I'd expect that you would only introduce breaking changes with major version change. As a user of the connector, it would be nice to know when one expects breaking changes, even if you don't follow semver.

hi @thomasaarholt, thanks for the proposal, I personally like it. re your third point, yeah.. sometime I feel torn when facing a breaking change bug fix, but the fact is that there might be programs that depend on this "bug" behavior, so even with a small change like this, something can go wrong in users env.

so I feel it would be safer to introduce a parameter for the new behavior.

I'm thinking of a better naming force_return_table instead of the return_table_even_no_data (or you have a better idea for the name)

can you help make the code change based upon our discussion? really appreciate your contribution and the discussion!

Done! I decided to omit the line containing the X.X.X version entirely.

thomasaarholt · 2023-12-15T13:57:46Z

Changes are implemented. This is ready for review again.

src/snowflake/connector/result_set.py

src/snowflake/connector/cursor.py

test/integ/pandas/test_arrow_pandas.py

sfc-gh-aling · 2023-12-15T19:35:05Z

src/snowflake/connector/cursor.py

+    def fetch_arrow_all(self, force_return_table: Literal[True] = True) -> Table:
+        ...
+
+    def fetch_arrow_all(self, force_return_table: bool = False) -> Table | None:


we shall also update the changelog

sfc-gh-aling · 2023-12-15T19:35:57Z

thanks! @thomasaarholt I did another round of review and left some comments.

thomasaarholt · 2023-12-18T13:55:11Z

Thank you! I can't believe I didn't actually implement it...! My two-year-old is making me sleep-deprived. I'm very grateful for your code suggestions.

I've incorporated them and updated the docstring with a bit more info and also overloaded the _fetch_arrow_all method. I'm unable to make the tests run locally, but I think it looks good.

thomasaarholt · 2023-12-18T13:55:44Z

Btw, I would encourage you to change the pyarrow import here so that you don't set Table = None, and instead use if TYPE_CHECKING, since Table is only used as a type, not directly called anywhere. Pyarrow typing is pretty non-existant, but this would remove a lot of red error messages in the function definitions.

I considered doing so in this PR, but decided to avoid scope creep 😅

sfc-gh-aling · 2023-12-18T18:51:31Z

hey @thomasaarholt , I like your proposal on the Table , are you willing to make another PR for it?

for this PR, I tried to do some small changes on your PR, but looks like I don't have the permission to push to your branch..

Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
To github.com:thomasaarholt/snowflake-connector-python.git
 ! [remote rejected]   thomasaarholt/thomasaarholt/fix-arrow-empty-rows -> thomasaarholt/thomasaarholt/fix-arrow-empty-rows (permission denied)
error: failed to push some refs to 'github.com:thomasaarholt/snowflake-connector-python.git'

I added another two commits which tried to fix lint, update doc: #1832.

would you be good if I merge the other PR to avoid you going back and forth tweaking the minor changes?

thomasaarholt · 2023-12-18T18:58:21Z

That’s fine! Go right ahead! I’ll be happy to make a PR on the Table tomorrow (evening in Europe now).

I’ll close this when you merge the other one!

sfc-gh-aling · 2023-12-18T22:22:47Z

sounds good @thomasaarholt , I will make sure the PR is merged this week.
appreciate your contribution!

thomasaarholt · 2024-01-03T11:03:04Z

This PR was merged via #1832! Closing.

Return empty table with correct schema, no longer None

39dfed3

sfc-gh-aling requested a review from sfc-gh-mkeller November 9, 2023 19:52

add test

1bd22c7

sfc-gh-aling reviewed Dec 8, 2023

View reviewed changes

sfc-gh-aling self-assigned this Dec 14, 2023

include overload and warning

2821511