Remove PyArrow dependency for Polars support #3445
Comments
Not a maintainer here, but #3426, #3384 and this all landing in the next release sounds great to me - if all goes well. Found some additional Lines 406 to 408 in 76a9ce1
Also, if this ends up being a PR and only changes a few files - it may be worth checking https://github.com/vega/altair/pull/3431/files since there are a lot of minor changes that may have simplified/standardized what is on |
Thanks for opening this issue! I'm surely in favor of improving native polars support without relying on a pyarrow dependency. If I understand right, the suggested change here is to introduce a direct link to polars for a polars DataFrame if that object is compatible with the dataframe interchange protocol? Eg, some changes here: Lines 406 to 408 in 76a9ce1
And here: Lines 406 to 416 in 76a9ce1
The introduction of the dataframe interchange protocol is not only for polars, but others are using it too, e.g. ibis (#3110), or custom objects as in vega/vegafusion#386. I'm not really in favor of introducing an optional polars dependency, but in line with #3377, we are already working towards a direction where we can use direct arrow conversion methods without using the pyarrow from_dataframe interface. If we can improve this direction even more, so that pyarrow is no longer a hard dependency within the part that serializes objects compatible with the dataframe interchange protocol, that would be great. Simultaneously, I also noticed that vegafusion has recently introduced an alternative approach for serializing objects that are compatible with polars, see e.g. this part: Anyways! PRs in this direction are welcome! |
Thanks for your response! 🙏 I think there are some methods which aren't covered by the interchange protocol unfortunately, such as converting a Date / Datetime column to string. So, with regards to
would you be OK with having a couple of specialised Polars branches, where you can do something like
? I think there's only 2 places where that would be necessary. Other parts where you use the dataframe interchange protocol are fine, it's just the part where you convert to a pyarrow table which could be skipped for the Polars case
Thanks! I'll put something together to drive this forwards, just wanted to check that you'd be OK with a couple (just a couple!) of specialised Polars paths |
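To make the idea of a specialised Polars branch concrete, here is a minimal sketch. It is not code from Altair or from this thread: `is_polars_dataframe` and `to_records` are hypothetical helpers, and detecting Polars through `sys.modules` is an assumption chosen so that Polars never has to be imported or installed. Only the Polars branch is filled in; the generic interchange path is left as a stub.

```python
import sys
from typing import Any


def is_polars_dataframe(obj: Any) -> bool:
    """Return True if obj is a polars.DataFrame, without importing Polars."""
    # If the caller built a Polars DataFrame, "polars" is already in sys.modules.
    pl = sys.modules.get("polars")
    return pl is not None and isinstance(obj, pl.DataFrame)


def to_records(data: Any) -> list:
    """Hypothetical helper: turn tabular data into a list of row dicts."""
    if is_polars_dataframe(data):
        # Specialised Polars branch: no PyArrow table is created.
        return data.rows(named=True)
    if hasattr(data, "__dataframe__"):
        # Generic interchange-protocol path, which is where a PyArrow
        # conversion would still be needed. Left out of this sketch.
        raise NotImplementedError("generic interchange path not sketched here")
    # Plain Python input such as a list of dicts.
    return list(data)
```

The Polars check comes first on purpose: a Polars DataFrame also exposes `__dataframe__`, so the order of the branches decides whether PyArrow is needed.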
First, we try to solve things pragmatically if the dataframe interchange protocol is incomplete. There is not always a royal road. But for my own understanding, isn't this what we are doing already in #3377, where we use a direct |
Yeah, and |
Ah.. therefore raising this issue :) So the Lines 309 to 310 in 62ab14d
should be renamed to sanitize_pandas_dataframe
And next to Lines 434 to 437 in 62ab14d
Will come a new I agree that pyarrow is a rather large dependency and we hope to stay lightweight, especially for wasm environments. |
Yup, thanks for appreciating the value in this! Alright, PR incoming this week |
Hi @MarcoGorelli, just catching up on this thread and on your and @dangotbanned's comments in #3384. I'm all in favor of streamlining polars support in Altair, but I'm less enthused about having polars specific logic scattered around the code base. On the surface, it seems like narwhals is exactly the kind of abstraction layer we would like in Altair going forward. Could we jump straight to supporting polars without pyarrow using narwhals? Then ideally over time we could move the other DataFrame types that narwhals supports behind this code path as well. cc @mattijn, I would personally prefer this approach to making pandas optional, but interested to hear your thoughts. |
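For illustration, a Narwhals-based code path might look roughly like the sketch below. The helper name `narwhals_to_json_records` and the format strings are assumptions made for this example, not Altair's implementation. Narwhals is imported inside the function only so the sketch stays importable without it; running it needs `narwhals` plus a supported eager dataframe library.

```python
from typing import Any


def narwhals_to_json_records(native_df: Any) -> list[dict[str, Any]]:
    """Hypothetical helper: row dicts with date/datetime columns as strings."""
    # Lazy import, only to keep this sketch importable without Narwhals.
    import narwhals as nw

    # Wrap whichever eager dataframe (Polars, pandas, ...) was passed in.
    df = nw.from_native(native_df, eager_only=True)

    exprs = []
    for name, dtype in df.schema.items():
        if dtype == nw.Date:
            exprs.append(nw.col(name).dt.to_string("%Y-%m-%d"))
        elif dtype == nw.Datetime:
            exprs.append(nw.col(name).dt.to_string("%Y-%m-%dT%H:%M:%S"))

    # Only rewrite columns when there is something to convert.
    if exprs:
        df = df.with_columns(*exprs)
    return df.rows(named=True)
```

The point of the abstraction is that the same function body serves every backend Narwhals supports, so no Polars-specific branches are needed at the call site.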
Thanks for your response!
Well, I'm not going to make you ask me twice 😄 Thanks, I'll give this a go and open a PR for your consideration! I don't think it'll be too big of an effort, if you then decide you don't want it, I promise no offense will be taken 😇 |
Just browsed through the narwhals docs, sounds very interesting! I'm also onboard to allow Polars users to use Altair without the need to have PyArrow or Pandas alongside it. @MarcoGorelli Thanks for offering to open a PR! Just for you to know, #3431 will introduce a lot of changes to the code base. It's almost ready to be merged so you might want to wait a few more days. |
@MarcoGorelli Altair supports Pandas >= 0.25. I think it would be ok to bump this to 1.1.5 in your PR to be in line with Narwhals. Btw, really enjoyed your talk at PyData Berlin in April :) |
Just fyi, #3431 is merged. |
We like it when Altair is lightweight and dataframe agnostic. We have experimented with adopting the dataframe interchange protocol through relying on Even though the interchange protocol brought us a lot in relation to becoming dataframe agnostic, it did not bring everything (ref #3377), and unfortunately pyarrow is not a library that makes Altair lightweight. If there are better options to become dataframe agnostic and lightweight by not having hard dependencies on heavy dataframe packages, then it seems like something to consider seriously. If we find out that column type inference and data serialization for objects that are like dataframes can be delegated safely to narwhals, then it is at least something we should consider as a potential candidate. I say 'like dataframes' because I don't think narwhals will support arrays, will it? (something we eventually hope to cover one day). Surely we aim not to introduce regressions during this consideration. |
Here we go 🚀 #3452 I totally agree that it's imperative that this comes with zero regressions. I'm not aware of any:
That's right, arrays are a different beast - I'd suggest taking a look at https://data-apis.org/array-api/latest/ for that. Array libraries were sufficiently aligned to begin with that a standard like that was possible and successful, whereas for dataframes the related effort was discontinued (and that's how Narwhals was born - I'm not calling it a "standard", but I am trying to use my position as pandas + Polars dev to enable libraries to support the latter at no cost to the former) I think a consideration that will come up will be "can we trust Narwhals? Will it stay maintained? Will it make breaking changes". If so, these would be totally valid :)
|
Thanks for addressing this proactively. If other major libraries use it and if it has such a compatibility policy, that does definitely provide some comfort. In addition, Altair does not have that much data transformation logic built in. So even if for whatever reason we would need to remove the dependency on Narwhals again, I think that would be doable in a weekend :) To me the project looks well structured, documented, and the best shot we have. The alternative would be to implement different code paths ourselves. |
What is your suggestion?
Currently, PyArrow is required by Altair for Polars support. I think it shouldn't be too hard to remove it, given that Polars implements the dataframe interchange protocol natively (without depending on PyArrow)
If #3384 can make it in, then Altair would actually support plotting Polars dataframes natively without any extra heavy dependencies. That'd be...pretty amazing? I'd suggest using Altair for polars.DataFrame.plot if that was the case.
I think what would need doing is:
- not requiring pyarrow to be installed for the dfi = data.__dataframe__ part
- instead of sanitize_arrow_table, for Polars, just select date/datetime columns and call .dt.to_string()
- instead of to_pylist from PyArrow, just use DataFrame.rows(named=True) for Polars
- infer_vegalite_type_for_dfi_column. I haven't tried this yet, but it looks straightforward-ish
Would you be open to considering this? Happy to work on a PR if so
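The Polars-specific steps listed above could be sketched as follows. This is only an illustration under stated assumptions: `polars_to_json_records` is a hypothetical helper name, the format strings are chosen for the example, and it is not the code from the eventual PR. Polars is imported inside the function so that it stays an optional dependency.

```python
from typing import Any


def polars_to_json_records(df: Any) -> list[dict[str, Any]]:
    """Hypothetical helper: convert a polars.DataFrame to JSON-friendly rows."""
    # Imported lazily so that Polars is not required at module import time.
    import polars as pl

    exprs = []
    for name, dtype in df.schema.items():
        if dtype == pl.Date:
            # Date columns become ISO date strings.
            exprs.append(pl.col(name).dt.to_string("%Y-%m-%d"))
        elif dtype == pl.Datetime:
            # Datetime columns become ISO-like timestamps (assumed format).
            exprs.append(pl.col(name).dt.to_string("%Y-%m-%dT%H:%M:%S"))

    # Replace only the temporal columns; everything else is left untouched.
    if exprs:
        df = df.with_columns(exprs)

    # One dict per row, with no PyArrow table in between.
    return df.rows(named=True)
```

Called on a frame with a Date column `day` and an integer column `value`, this yields rows such as `{"day": "2024-01-01", "value": 1}`, which `json.dumps` can serialise directly.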
Have you considered any alternative solutions?
Just keep the status-quo :) But, I think Altair is the only plotting library that gets close to native Polars support without extra large dependencies, and it doesn't look like a large stretch to go all the way there, so I'm hoping we can do it 💪
Demo from having tried this locally: