Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-966003: Fix Arrow return value for zero-row queries #1801

Conversation

thomasaarholt
Copy link
Contributor

@thomasaarholt thomasaarholt commented Nov 9, 2023

Please answer these questions before submitting your pull requests. Thanks!

  1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-966003: Unexpected arrow table behaviour when query returns zero rows #1800

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am modifying authorization mechanisms
    • I am adding new credentials
    • I am modifying OCSP code
    • I am adding a new dependency
  3. Please describe how your code solves the related issue.

    This PR copies the approach done in Return empty Arrow table instead of None #1275 by @JakobGM, but based on the latest main (I'm sorry @JakobGM, but it was easier to just copy rather than rebase your solution, and I hope you don't mind that), with one more type hint added due to some later commits. That approach itself is based on the current implementation of .fetch_pandas_all().

    This is a minor code change, copying methodology already implemented by snowflake on the pandas side, but it makes it a lot easier to work with arrow data from the snowflake-connector-python.

    Please let me know if anything more is required from me.

@sfc-gh-aling
Copy link
Collaborator

can we add some test here?

@thomasaarholt
Copy link
Contributor Author

thomasaarholt commented Nov 9, 2023

Thanks for the quick response. I've added a quick test to check that the dtype of a integer column in fact is what we expect (the highest bit-depth).

Feel free to add or change to this PR. My goal is only to see this behaviour fixed.

@thomasaarholt
Copy link
Contributor Author

I'm a little worried about seeing this PR go stale like #1275. Is there anything further that I can do to finish this?

@@ -1117,7 +1117,7 @@ def fetch_arrow_batches(self) -> Iterator[Table]:
)
return self._result_set._fetch_arrow_batches()

def fetch_arrow_all(self) -> Table | None:
def fetch_arrow_all(self) -> Table:
Copy link
Collaborator

@sfc-gh-aling sfc-gh-aling Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-mkeller , do you think this is a API breaking change that previously we return None now we return an empty Table.

I'm thinking of introducing a parameter to protect the change.
(name TBD)
def fetch_arrow_all(self, return_table_even_no_data=False)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we could do that and then use typing.overload to express the different return types based on that boolean flag.

I'd love if we had a plan on how to introduce such changes with a timeline to deprecate them over some time. Maybe some warning wit a date'd comment?

Copy link
Contributor Author

@thomasaarholt thomasaarholt Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts

  1. Here's a suggested fix using your ideas:
from typing import overload, Literal
from warnings import warn
from pyarrow import Table

# Temporarily omitting `self` in the methods in order to make it easy to test

@overload
def fetch_arrow_all(return_table_even_no_data: Literal[False] = False) -> None:
    ...


@overload
def fetch_arrow_all(return_table_even_no_data: Literal[True] = True) -> Table:
    ...


def fetch_arrow_all(return_table_even_no_data: bool = False) -> Table | None:
    if not return_table_even_no_data:
        warn(
            "The default behaviour of fetch_arrow_all when the query returns zero rows "
            "will change from returning None to returning an empty pyarrow table with "
            "schema using the highest bit length in version X.X.X.",
            DeprecationWarning,
            stacklevel=2,
        )
    # regular code here


foo = fetch_arrow_all(return_table_even_no_data=True) # Table, or "Unknown" if no type stubs for pyarrow are present
bar = fetch_arrow_all(return_table_even_no_data=False) # None

I'm happy to change any of this.

  1. You could consider this a bug-fix though. It is a discrepancy between the pandas and pyarrow apis. Depends on how strict you want to be about potentially breaking changes.

  2. If you follow semver, then I'd expect that you would only introduce breaking changes with major version change. As a user of the connector, it would be nice to know when one expects breaking changes, even if you don't follow semver.

Copy link
Collaborator

@sfc-gh-aling sfc-gh-aling Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @thomasaarholt, thanks for the proposal, I personally like it. re your third point, yeah.. sometime I feel torn when facing a breaking change bug fix, but the fact is that there might be programs that depend on this "bug" behavior, so even with a small change like this, something can go wrong in users env.

so I feel it would be safer to introduce a parameter for the new behavior.

I'm thinking of a better naming force_return_table instead of the return_table_even_no_data (or you have a better idea for the name)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you help make the code change based upon our discussion? really appreciate your contribution and the discussion!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! I decided to omit the line containing the X.X.X version entirely.

@sfc-gh-aling sfc-gh-aling self-assigned this Dec 14, 2023
@thomasaarholt
Copy link
Contributor Author

Changes are implemented. This is ready for review again.

def fetch_arrow_all(self, force_return_table: Literal[True] = True) -> Table:
...

def fetch_arrow_all(self, force_return_table: bool = False) -> Table | None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shall also update the changelog

@sfc-gh-aling
Copy link
Collaborator

thanks! @thomasaarholt I did another round of review and left some comments.

@thomasaarholt
Copy link
Contributor Author

Thank you! I can't believe I didn't actually implement it...! My two-year-old is making me sleep-deprived. I'm very grateful for your code suggestions.

I've incorporated them and updated the docstring with a bit more info and also overloaded the _fetch_arrow_all method. I'm unable to make the tests run locally, but I think it looks good.

@thomasaarholt
Copy link
Contributor Author

Btw, I would encourage you to change the pyarrow import here so that you don't set Table = None, and instead use if TYPE_CHECKING, since Table is only used as a type, not directly called anywhere. Pyarrow typing is pretty non-existant, but this would remove a lot of red error messages in the function definitions.

I considered doing so in this PR, but decided to avoid scope creep 😅

@sfc-gh-aling
Copy link
Collaborator

hey @thomasaarholt , I like your proposal on the Table , are you willing to make another PR for it?

for this PR, I tried to do some small changes on your PR, but looks like I don't have the permission to push to your branch..

Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
To github.com:thomasaarholt/snowflake-connector-python.git
 ! [remote rejected]   thomasaarholt/thomasaarholt/fix-arrow-empty-rows -> thomasaarholt/thomasaarholt/fix-arrow-empty-rows (permission denied)
error: failed to push some refs to 'github.com:thomasaarholt/snowflake-connector-python.git'

I added another two commits which tried to fix lint, update doc: #1832.

would you be good if I merge the other PR to avoid you going back and forth tweaking the minor changes?

@thomasaarholt
Copy link
Contributor Author

That’s fine! Go right ahead! I’ll be happy to make a PR on the Table tomorrow (evening in Europe now).

I’ll close this when you merge the other one!

@sfc-gh-aling
Copy link
Collaborator

sounds good @thomasaarholt , I will make sure the PR is merged this week.
appreciate your contribution!

@thomasaarholt
Copy link
Contributor Author

This PR was merged via #1832! Closing.

@github-actions github-actions bot locked and limited conversation to collaborators Jan 3, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SNOW-966003: Unexpected arrow table behaviour when query returns zero rows
3 participants